Is your backup strategy safe ( a story of BCV – Snap restores with EMC)

Introduction:

As a dba there should always be some room to improve your backup and recovery scenarios.  This blog is about a  judgment  day scenario – losing a 3TB database in full and your bosses are asking  what will you do and when will the database be back on its feet. In such scenarios I do tend to debate when do you really lose the full database with a need to restore it ..  proper responses might be maybe during an upgrade  to a newer Oracle Release or when you lose the server where the  database is running (however  given the fact that  a lot of sites use SAN  that would even  offer other options than a restore of such a big database not that likely).  In my experience it would be much more common to come across scenarios to restore single datafiles , tablespaces . But OK  for the argument’s sake and to proof we have a scenario when we lose the full database ( means all data files and control files) let us go into more detail with that.

Details:

On my most important databases this has been the setup in the past when implementing the database(s) in ASM to give us a save feeling.

  • Every database has three disk groups   called SID_DATA , SID_FRA and SID_REDO .
  • Of course database files are in the SID_DATA , Control files  and  REDOLOG group members are  in   SID_DATA  and SID_REDO and the archives are put to the SID_FRA.
  • Fortunately we are using  EMC technology , both Snap and BCV. We use them to  set up  clone – reporting environments, pre-production – and  do our backups with it.

As with regard to the backups an interesting debate has risen this week which triggered this blog. In the setup  of the Snap and BCV it has always been considered that  there have to be two of them for each database. One holding the SID_DATA disk group ( means the database files)  and one Snap or BCV holding SID_FRA  and SID_REDO.  When a backup is triggered,  the following actions are performed on the production side:

  • In the database it is brought to begin backup ,
  • In the database a copy of the controlfile is pushed to SID_FRA  disk group ,
  • In the EMC tools the  mirror is broken,
  • In the database an end backup is issued.
  • On the backup server the database is mounted with the copied version of the controlfile and a level backup is running with rman (both level0 and level 1 ( indeed also using block change tracking ).
  • After a backup  the snap / bcv remains mounted on the backup server for when of a true disaster to be used to restore  the full database from disk..
  • On the production server  the archives are saved in intervals to tape directly .

 

Soo  far so good right ? .. We have it all covered (or not ?)  . ….

If you take a closer look at a judgment day disaster – recovery ask  where customer will say we need a restore of the FULL database  ( of 3TB) would we be happy  when seeing this ?

If we check the redos  and control files we will see this:

1* select a.GROUP#,MEMBER,BYTES/1024/1024 from v$logfile a, v$log b where a.group#= b.group#  order by group#;

 

GROUP# MEMBER                                                                                                                                   BYTES/1024/1024

———- ——————————————————————————– —————

1 +MYDBP1_REDO/mydbp1/onlinelog/group_1.258.772722547                                                                       2048

1 +MYDBP1_DATA/mydbp1/onlinelog/group_1.262.772722575                                                                       2048

2 +MYDBP1_REDO/mydbp1/onlinelog/group_2.259.772723069                                                                       2048

2 +MYDBP1_DATA/mydbp1/onlinelog/group_2.263.772723099                                                                       2048

3 +MYDBP1_REDO/mydbp1/onlinelog/group_3.260.772723879                                                                       2048

3 +MYDBP1_DATA/mydbp1/onlinelog/group_3.266.772723909                                                                       2048

4 +MYDBP1_REDO/mydbp1/onlinelog/group_4.261.772723247                                                                       2048

4 +MYDBP1_DATA/mydbp1/onlinelog/group_4.267.772723277                                                                       2048

5 +MYDBP1_REDO/mydbp1/onlinelog/group_5.262.772723627                                                                       2048

5 +MYDBP1_DATA/mydbp1/onlinelog/group_5.534.772723657                                                                       2048

6 +MYDBP1_REDO/mydbp1/onlinelog/group_6.263.772722935                                                                       2048

6 +MYDBP1_DATA/mydbp1/onlinelog/group_6.535.772722963                                                                       2048

 

12 rows selected.

 

## Control files are in :

SQL> show parameter control_files

 

NAME                                                      TYPE VALUE

———————————— ———– ——————————

control_files                                          string               +MYDBP1_REDO/mydbp1/controlfile/current.257.771667625, +MYDBP1_DATA/mydbp1/controlfile/current.261.771667625

 

So if all hell breaks loose and you are asked to restore the backup from the  Snap or BCV  what will happen …..

First scenario was : well it is all  taken care of since we have two  bcv – snaps  remember, so   the dba would contact the storage colleague that he should  restore the  SID_DATA  snap( he should sync it back to production) and after that we would do a recovery  using the archives and the online redo logs  who are still present on the production box . And we feel very good with this cause we thought of the  _FRA and _REDO should not be synced back to production .. right ? ( actually the proper answer is wrong).

As we were discussing this scenario.. And after looking at the locations in the v$logfile we did notice  a week spot in that idea.. If we restore the  SID_DATA from BCV that would also mean that the first member of every online redolog would be wiped out since we have a redo member in  the SID_DATA diskgroup ( Oracle does not like online redologs NOT being in sync)..  And  we would also lose a current control file since that is too  in the same diskgroup ( Neither not synced control files btw are like by Oracle).  Of course such a scenario will only be needed  between dusk till dawn  after 4 nights of being on call .. and all shouting we need this issue fixed as soon as possible  so there would be little time to investigate  and set up a scenario.

Looking at this scenario  at hand,  when being asked to restore asap from BCV, losing one of the control files well that could easily be fixed by starting the database with  one control file ( since we did not touch the _FRA and _REDO diskgroups (we did not restore the second BCV  !! lucky us)   that would offer us still a current control file and after  doing our activities to restore and recover , we could use rman to copy the controlfile again to a second – third place and amend the spfile) .

The online redo member however  that would hurt bad if it really would occur that one online redo log member would be overwritten ( since we Restore from the  SID_DATA SNAP – BCV) to be fastest possible way back on our feet from disk once  the full database would be destroyed.  In a lab test  I could consider perhaps  a scenario where mounting the database and dropping the first member from each group ….. but not on a production highly important database like in this scenario..

So  that means that we will need to make sure that when a SID_DATA  snap- bcv  is  used (resynced to production) to restore from disk  it should only hold  Datafiles  and ok a control file (and not  also a online redo log member as happens at the moment) .

In this case I will go and set up new online redo log groups and  I will set them up in the SID_FRA and SID_REDO diskgroup. Then I will  do log switches  and in the end will drop the groups that had members both in _DATA and _REDO.   After that .. I would be much more relaxed for my judgment day restore request to recover the FULL 3 TB database from the snap – bcv at maximum speed. It is not that much work to carry out and I am sure I feel a lot more confident that even a scenario where a 3TB database needs to be restored  from a BCV can be done quit quickly and with less  risks.

 

As Always  happy reading and till we meet again,

 

Mathijs

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s