Introduction:
As a dba there should always be some room to improve your backup and recovery scenarios. This blog is about a judgment day scenario – losing a 3TB database in full and your bosses are asking what will you do and when will the database be back on its feet. In such scenarios I do tend to debate when do you really lose the full database with a need to restore it .. proper responses might be maybe during an upgrade to a newer Oracle Release or when you lose the server where the database is running (however given the fact that a lot of sites use SAN that would even offer other options than a restore of such a big database not that likely). In my experience it would be much more common to come across scenarios to restore single datafiles , tablespaces . But OK for the argument’s sake and to proof we have a scenario when we lose the full database ( means all data files and control files) let us go into more detail with that.
Details:
On my most important databases this has been the setup in the past when implementing the database(s) in ASM to give us a save feeling.
- Every database has three disk groups called SID_DATA , SID_FRA and SID_REDO .
- Of course database files are in the SID_DATA , Control files and REDOLOG group members are in SID_DATA and SID_REDO and the archives are put to the SID_FRA.
- Fortunately we are using EMC technology , both Snap and BCV. We use them to set up clone – reporting environments, pre-production – and do our backups with it.
As with regard to the backups an interesting debate has risen this week which triggered this blog. In the setup of the Snap and BCV it has always been considered that there have to be two of them for each database. One holding the SID_DATA disk group ( means the database files) and one Snap or BCV holding SID_FRA and SID_REDO. When a backup is triggered, the following actions are performed on the production side:
- In the database it is brought to begin backup ,
- In the database a copy of the controlfile is pushed to SID_FRA disk group ,
- In the EMC tools the mirror is broken,
- In the database an end backup is issued.
- On the backup server the database is mounted with the copied version of the controlfile and a level backup is running with rman (both level0 and level 1 ( indeed also using block change tracking ).
- After a backup the snap / bcv remains mounted on the backup server for when of a true disaster to be used to restore the full database from disk..
- On the production server the archives are saved in intervals to tape directly .
Soo far so good right ? .. We have it all covered (or not ?) . ….
If you take a closer look at a judgment day disaster – recovery ask where customer will say we need a restore of the FULL database ( of 3TB) would we be happy when seeing this ?
If we check the redos and control files we will see this:
1* select a.GROUP#,MEMBER,BYTES/1024/1024 from v$logfile a, v$log b where a.group#= b.group# order by group#;
GROUP# MEMBER BYTES/1024/1024
———- ——————————————————————————– —————
1 +MYDBP1_REDO/mydbp1/onlinelog/group_1.258.772722547 2048
1 +MYDBP1_DATA/mydbp1/onlinelog/group_1.262.772722575 2048
2 +MYDBP1_REDO/mydbp1/onlinelog/group_2.259.772723069 2048
2 +MYDBP1_DATA/mydbp1/onlinelog/group_2.263.772723099 2048
3 +MYDBP1_REDO/mydbp1/onlinelog/group_3.260.772723879 2048
3 +MYDBP1_DATA/mydbp1/onlinelog/group_3.266.772723909 2048
4 +MYDBP1_REDO/mydbp1/onlinelog/group_4.261.772723247 2048
4 +MYDBP1_DATA/mydbp1/onlinelog/group_4.267.772723277 2048
5 +MYDBP1_REDO/mydbp1/onlinelog/group_5.262.772723627 2048
5 +MYDBP1_DATA/mydbp1/onlinelog/group_5.534.772723657 2048
6 +MYDBP1_REDO/mydbp1/onlinelog/group_6.263.772722935 2048
6 +MYDBP1_DATA/mydbp1/onlinelog/group_6.535.772722963 2048
12 rows selected.
## Control files are in :
SQL> show parameter control_files
NAME TYPE VALUE
———————————— ———– ——————————
control_files string +MYDBP1_REDO/mydbp1/controlfile/current.257.771667625, +MYDBP1_DATA/mydbp1/controlfile/current.261.771667625
So if all hell breaks loose and you are asked to restore the backup from the Snap or BCV what will happen …..
First scenario was : well it is all taken care of since we have two bcv – snaps remember, so the dba would contact the storage colleague that he should restore the SID_DATA snap( he should sync it back to production) and after that we would do a recovery using the archives and the online redo logs who are still present on the production box . And we feel very good with this cause we thought of the _FRA and _REDO should not be synced back to production .. right ? ( actually the proper answer is wrong).
As we were discussing this scenario.. And after looking at the locations in the v$logfile we did notice a week spot in that idea.. If we restore the SID_DATA from BCV that would also mean that the first member of every online redolog would be wiped out since we have a redo member in the SID_DATA diskgroup ( Oracle does not like online redologs NOT being in sync).. And we would also lose a current control file since that is too in the same diskgroup ( Neither not synced control files btw are like by Oracle). Of course such a scenario will only be needed between dusk till dawn after 4 nights of being on call .. and all shouting we need this issue fixed as soon as possible so there would be little time to investigate and set up a scenario.
Looking at this scenario at hand, when being asked to restore asap from BCV, losing one of the control files well that could easily be fixed by starting the database with one control file ( since we did not touch the _FRA and _REDO diskgroups (we did not restore the second BCV !! lucky us) that would offer us still a current control file and after doing our activities to restore and recover , we could use rman to copy the controlfile again to a second – third place and amend the spfile) .
The online redo member however that would hurt bad if it really would occur that one online redo log member would be overwritten ( since we Restore from the SID_DATA SNAP – BCV) to be fastest possible way back on our feet from disk once the full database would be destroyed. In a lab test I could consider perhaps a scenario where mounting the database and dropping the first member from each group ….. but not on a production highly important database like in this scenario..
So that means that we will need to make sure that when a SID_DATA snap- bcv is used (resynced to production) to restore from disk it should only hold Datafiles and ok a control file (and not also a online redo log member as happens at the moment) .
In this case I will go and set up new online redo log groups and I will set them up in the SID_FRA and SID_REDO diskgroup. Then I will do log switches and in the end will drop the groups that had members both in _DATA and _REDO. After that .. I would be much more relaxed for my judgment day restore request to recover the FULL 3 TB database from the snap – bcv at maximum speed. It is not that much work to carry out and I am sure I feel a lot more confident that even a scenario where a 3TB database needs to be restored from a BCV can be done quit quickly and with less risks.
As Always happy reading and till we meet again,
Mathijs