Missing Disk in Fra Diskgroup in ASM , and work your way back

Introduction

A single instance/Db  11gr2 under control of 11gr2 grid infrastructure with ASM as storage solution. In ASM the disks are defined as external redundancy due to  business rules ( no shooting the messenger please). I was Informed that a database  had crashed . Grid Infra structure ( nor manual actions ) were able to start the database again.  So I was asked  please go take look and fix asap . The good thing was that I was aware of maintenance in the ASM Disk group on that Tuesday September 4th. So when alarms started to go off  I had first clues (even Sherlock Holmes needs a  first clue right ?). As always the alert log was first place to check for me.  This document follows a number of steps during the analyses and the end status which was a proper database open (without reset logs). Bottom-line is give not up too fast. BUT also in this specific scenario it had been better if we had kicked the disk and started from scratch instead of trying to repair. But that being said I know that sounds wise when looking back ….

What happened?

In the alert file right before crash this information was found:

Tue Sep 04 13:49:41 2012

SUCCESS: diskgroup MYDB_FRA01 was dismounted

SUCCESS: diskgroup MYDB_FRA01 was dismounted

Tue Sep 04 13:49:41 2012

ARC0: Error 19504 Creating archive log file to ‘+MYDB_FRA01’

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_arc0_19126.trc:

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-15078: ASM diskgroup was forcibly dismounted

Tue Sep 04 13:49:41 2012

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:

ORA-00313: open failed for members of log group 1 of thread 1

ORA-00312: online log 1 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163’

ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

Master archival failure: 204

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_arc0_19126.trc:

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-15078: ASM diskgroup was forcibly dismounted

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:

ORA-00313: open failed for members of log group 2 of thread 1

ORA-00312: online log 2 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167’

ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:

ORA-00313: open failed for members of log group 3 of thread 1

ORA-00312: online log 3 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’

ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

Checker run found 3 new persistent data failures

Tue Sep 04 13:49:42 2012

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_18717.trc:

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-15078: ASM diskgroup was forcibly dismounted

ORA-15078: ASM diskgroup was forcibly dismounted

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_18717.trc:

ORA-00221: error on write to control file

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-15078: ASM diskgroup was forcibly dismounted

ORA-15078: ASM diskgroup was forcibly dismounted

Tue Sep 04 13:49:42 2012

System state dump requested by (instance=1, osid=18717 (CKPT)), summary=[abnormal instance termination].

System State dumped to trace file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_diag_18691.trc

CKPT (ospid: 18717): terminating the instance due to error 221

Dumping diagnostic data in directory=[cdmp_20120904134942], requested by (instance=1, osid=18717 (CKPT)), summary=[abnormal instance termination].

Instance terminated by CKPT, pid = 18717

As you can see when an archive was supposed to be written (and information update in the control file was needed) (marked in bold) the database was unable to write that archive. Furthermore the database was unable to talk to each of its second member s in the three existing redo groups. So it crashed. Oracle cluster ware tried to restart the database which failed (because in the missing Fra diskgroup also a copy of the controlfile was located so the database could not even be mounted.

SUCCESS: diskgroup DATA2 was mounted

ORA-00210: cannot open the specified control file

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/control03.ctl

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted

ORA-205 signalled during: ALTER DATABASE MOUNT /* db agent *//* {0:4:106} */…

Tue Sep 04 13:49:56 2012

Checker run found 1 new persistent data failures

Tue Sep 04 13:49:58 2012

Shutting down instance (abort)

 Work

This chapter will show the various approaches made.

 srvctl

First step needed when working with the environment was disabling the efforts of the clusterware to start the database as soon as we worked with it:

srvctl disable database  -d  MYDB

Repairing the asm Disk

When we had error below  lot of energy has been put in to  trying to fix the disk that had been gone.  Once  it was a returned Following steps  were tried:

ORA-15196: invalid ASM block header

SQL> alter diskgroup MYDB_fra01 check all repair;                                  

 alter diskgroup MYDB_fra01 check all repair                                        

*                                                                                    

ERROR at line 1:                                                                     

ORA-15032: not all alterations performed                                             

ORA-15130: diskgroup “MYDB_FRA01” is being dismounted                              

ORA-15066: offlining disk “FRA2_0001” in group “MYDB_FRA01” may result in a        

data loss                                                                            

ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648]

[2] [0 != 1]                                                                         

ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648]         

[2] [0 != 1]                                                                         

Plan B

So it became time for a plan b

We Mounted the diskgroup  by hand in asm instance, that worked.

Then we checked the status of the redo log groups:

 

set lines 1000

Select * from v$log;

 

    GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE              MEMBERS ARC STATUS                     FIRST_CHANGE# FIRST_TIME        NEXT_CHANGE# NEXT_TIME

———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-

                 1                1             5272  209715200              512                     2 NO  INACTIVE                                   36547887 04.09.2012 13:19:39     36557431 04.09.2012 13:49:39

                 3                1             5274  209715200              512                     2 NO  CURRENT                                   36577434 04.09.2012 14:46:26   2,8147E+14

                 2                1             5273  209715200              512                     2 NO  INACTIVE                                   36557431 04.09.2012 13:49:39     36577434 04.09.2012 14:46:26

 

select GROUP#,STATUS,MEMBER from v$logfile;

 

    GROUP# STATUS  MEMBER

———- ——- ———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

                 1               +DATA2/MYDB/onlinelog/group_1.258.784139161

                 1               +MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163

                 2               +DATA2/MYDB/onlinelog/group_2.259.784139165

                 2               +MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167

                 3               +DATA2/MYDB/onlinelog/group_3.260.784139169

                 3               +MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171

 

It showed that Group 3 was current. So the master plan was to drop the second redo members in the Fra Disk group (marked in bold)

In ASM I had the Diskgroup MYDB_FRA01 mounted again (alter diskgroup  MYDB_FRA01 mount;) ( every  time when it dismounted).

Then I dropped the  logfile member:

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163‘;

This succeeded but in the alert I so that the controls file in that FRA diskgroup was also not happy with this action:

Wed Sep 05 14:28:37 2012

Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_20685.trc:

ORA-00206: error in writing (block 3, # blocks 1) of control file

ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’

ORA-15078: ASM diskgroup was forcibly dismounted

So I investigated where the controlfiles where located. I created controlfile from spfile and checked.

*.control_files=’+DATA2/MYDB/control01.ctl’,’+DATA2/MYDB/control02.ctl’,’+MYDB_FRA01/MYDB/control03.ctl’

So I wanted to get the controlfile first in a right state by performing following steps:

Shutdown

Startup nomount;

 

rman target=/              

restore controlfile to ‘+MYDB_FRA01/MYDB/control03.ctl

from ‘+DATA2/MYDB/control01.ctl‘;

Since that worked  I mounted the database again  in order to proceed with drop of second members in this disgroup MYDB_FRA01

##

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163‘;

##  first  one was already gone  so Oracle complained.

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167‘;

After deleting instance fell over as diskgroup was dismounted had remount before moving on.

Then I tried to drop the second redo member of the current redolog  group (3)

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171‘;

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’

*

ERROR at line 1:

ORA-01609: log 3 is the current log for thread 1 – cannot drop members

ORA-00312: online log 3 thread 1:

‘+DATA2/MYDB/onlinelog/group_3.260.784139169’

ORA-00312: online log 3 thread 1:

‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’

That was a major setback  cause  I could not proceed. So I checked one more time:

SQL> set lines 1000;

SQL> /

 

    GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE              MEMBERS ARC STATUS                     FIRST_CHANGE# FIRST_TIME        NEXT_CHANGE# NEXT_TIME

———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-

                 1                1             5272  209715200              512                     1 NO  INACTIVE                                   36547887 04.09.2012 13:19:39     36557431 04.09.2012 13:49:39

                 3                1             5274  209715200              512                     2 NO  CURRENT                                   36577434 04.09.2012 14:46:26   2,8147E+14

                 2                1             5273  209715200              512                     1 NO  INACTIVE                                   36557431 04.09.2012 13:49:39     36577434 04.09.2012 14:46:26

My thought was will bounce DB and mount again so other group becomes current. But that was not correct (only after an open of DB that switch is performed).

Another attempt was to copy the first member of that  redo  3 group to the “retrieved”  MYDB_fra01 disk group:

ASMCMD [+] > cp +DATA2/MYDB/onlinelog/group_3.260.784139169 +MYDB_FRA01/MYDB/onlinelog/group_3

This however showed:

ASMCMD-8016: copy source->’+DATA2/MYDB/ONLINELOG/group_3.260.784139169′ and target->’+MYDB_FRA01/MYDB/onlinelog/group_3′ failed

ORA-19505: failed to identify file “+MYDB_FRA01/MYDB/onlinelog/group_3”

ORA-17502: ksfdcre:4 Failed to create file +MYDB_FRA01/MYDB/onlinelog/group_3

ORA-15196: invalid ASM block header [kfc.c:19572] [check_kfbh] [1] [533] [271535456 != 3231181597]

ORA-15130: diskgroup “MYDB_FRA01” is being dismounted

ORA-15066: offlining disk “FRA2_0001” in group “MYDB_FRA01” may result in a data loss

ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1]

ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1]

ORA-06512: at “SYS.X$DBMS_DISKGROUP”, line 413

ORA-06512: at line 3 (DBD ERROR: OCIStmtExecute)

At that time honestly I gave up ……………………….. 😦

Hope always dies last:

I spoke to the Linux colleague  about next steps to perform. In the end we settled for the following steps to be performed:

  • Dropped diskgroup MYDB_fra01 including contents.
  • Linux wise the disk has been cleared with dd 00000. So we would be sure this is like a new disk.
  • Diskgroup  MYDB_fra01 was recreated with zeroed disk.
  • Copied across control file from data disk group to fra diskgroup

Shutdown

Startup nomount;

rman target=/

restore controlfile to ‘+MYDB_FRA01/MYDB/control03.ctl’

from ‘+DATA2/MYDB/control01.ctl’;

  • Alter database mount (worked).
  1. Alter database open worked too honestly a bit to my surprise. But hey we all need some good news sometime (not expected this since db still new about one member of redo grp 3 in MYDB_fra01. ) .
  2. Since the database opened up it also performed a  logswitch. So ISaw that  the problem  redo group 3 ( which was current in former attempts) was now inactive so dropped the member.

ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’

  • Added 3 new redo members for all three groups in fra disk group

alter database add logfile member ‘+MYDB_FRA01’ to group 1

alter database add logfile member ‘+MYDB_FRA01’ to group 2

alter database add logfile member ‘+MYDB_FRA01’ to group 3

  •  I checked

SQL> /

Database altered.

SQL> select * from v$log;

GROUP#    THREAD#  SEQUENCE#      BYTES  BLOCKSIZE    MEMBERS ARC STATUS

———- ———- ———- ———- ———- ———- — —————-

1          1       5278  209715200        512          2 NO  CURRENT

2          1       5276  209715200        512          2 YES INACTIVE

36599977 05.09.2012 18:00:21     36599981 05.09.2012 18:00:27

3          1       5277  209715200        512          2 YES INACTIVE

SQL> select * from v$logfile;

GROUP# STATUS  TYPE    MEMBER

  • Performed alter system switch logfile various time and checked .

SQL> select * from v$logfile;

GROUP# STATUS  TYPE    MEMBER

————————————————————————————–

1         ONLINE  +DATA2/MYDB/onlinelog/group_1.258.784139161

1         ONLINE  +MYDB_FRA01/MYDB/onlinelog/group_1.263.793217497

2         ONLINE  +DATA2/MYDB/onlinelog/group_2.259.784139165

2         ONLINE  +MYDB_FRA01/MYDB/onlinelog/group_2.264.793217515

3         ONLINE  +DATA2/MYDB/onlinelog/group_3.260.784139169

3         ONLINE  +MYDB_FRA01/MYDB/onlinelog/group_3.265.793217533

  • Looked good. I bounced database and after that I brought database down again cause I wanted to start it via the clusterware.
  • Enabbled the Resource again in the  Clusterware:

Srvctl enable database –d MYDB

  • Started the database via srvctl:

Srvctl start database –d MYDB

Happy end ….

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s