Introduction
A single-instance 11gR2 database under control of 11gR2 Grid Infrastructure, with ASM as the storage solution. In ASM the disks are defined with external redundancy due to business rules (no shooting the messenger, please). I was informed that a database had crashed, and neither Grid Infrastructure nor manual actions were able to start it again. So I was asked to please go take a look and fix it asap. The good thing was that I was aware of maintenance on the ASM diskgroup that Tuesday, September 4th. So when alarms started to go off I had a first clue (even Sherlock Holmes needs a first clue, right?). As always, the alert log was the first place for me to check. This document follows the steps of the analysis through to the end status, which was a properly opened database (without resetlogs). Bottom line: do not give up too fast. BUT, in this specific scenario it would also have been better to kick the disk and start from scratch instead of trying to repair it. That being said, I know that sounds wise only in hindsight ….
What happened?
In the alert file right before crash this information was found:
Tue Sep 04 13:49:41 2012
SUCCESS: diskgroup MYDB_FRA01 was dismounted
SUCCESS: diskgroup MYDB_FRA01 was dismounted
Tue Sep 04 13:49:41 2012
ARC0: Error 19504 Creating archive log file to ‘+MYDB_FRA01’
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_arc0_19126.trc:
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Sep 04 13:49:41 2012
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163’
ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
Master archival failure: 204
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_arc0_19126.trc:
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167’
ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_m000_1555.trc:
ORA-00313: open failed for members of log group 3 of thread 1
ORA-00312: online log 3 thread 1: ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’
ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
Checker run found 3 new persistent data failures
Tue Sep 04 13:49:42 2012
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_18717.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_18717.trc:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Tue Sep 04 13:49:42 2012
System state dump requested by (instance=1, osid=18717 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_diag_18691.trc
CKPT (ospid: 18717): terminating the instance due to error 221
Dumping diagnostic data in directory=[cdmp_20120904134942], requested by (instance=1, osid=18717 (CKPT)), summary=[abnormal instance termination].
Instance terminated by CKPT, pid = 18717
As you can see, when an archive log was supposed to be written (which also required an update of the control file), the database was unable to write it. Furthermore, the database could not reach the second member of any of the three redo log groups. So it crashed. Oracle Clusterware tried to restart the database, which failed because the missing FRA diskgroup also held a copy of the control file, so the database could not even be mounted:
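Losing the diskgroup was fatal here precisely because it held one control file copy and one member of every redo group. As a hedged sketch (diskgroup name taken from this case, run as SYSDBA against the database instance), you can check what still lives in a diskgroup before any maintenance on it:

```sql
-- Hedged sketch: list the database files that reference a diskgroup
-- before taking it offline (diskgroup name from this case).
select name   from v$controlfile where name   like '+MYDB_FRA01/%';
select member from v$logfile     where member like '+MYDB_FRA01/%';
select name   from v$datafile    where name   like '+MYDB_FRA01/%';
```

Had these queries been run before the maintenance window, the control file copy and the second redo members in MYDB_FRA01 would have stood out immediately.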
SUCCESS: diskgroup DATA2 was mounted
ORA-00210: cannot open the specified control file
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-17503: ksfdopn:2 Failed to open file +MYDB_FRA01/MYDB/control03.ctl
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
ORA-15001: diskgroup “MYDB_FRA01” does not exist or is not mounted
ORA-205 signalled during: ALTER DATABASE MOUNT /* db agent *//* {0:4:106} */…
Tue Sep 04 13:49:56 2012
Checker run found 1 new persistent data failures
Tue Sep 04 13:49:58 2012
Shutting down instance (abort)
Work
This chapter shows the various approaches that were taken.
srvctl
The first step when working on the environment was to stop the clusterware from trying to restart the database while we were working on it:
srvctl disable database -d MYDB
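Before touching the database it is worth confirming the resource really is disabled; a minimal sketch, assuming the standard 11.2 srvctl syntax:

```shell
# Hedged sketch: confirm the clusterware will leave the database alone.
srvctl config database -d MYDB   # output should state the database is disabled
srvctl status database -d MYDB
```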
Repairing the ASM disk
When we hit the error below, a lot of energy was put into trying to fix the disk that had gone missing. Once it had returned, the following steps were tried:
ORA-15196: invalid ASM block header
SQL> alter diskgroup MYDB_fra01 check all repair;
alter diskgroup MYDB_fra01 check all repair
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15130: diskgroup “MYDB_FRA01” is being dismounted
ORA-15066: offlining disk “FRA2_0001” in group “MYDB_FRA01” may result in a
data loss
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648]
[2] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648]
[2] [0 != 1]
Plan B
So it became time for a plan B.
We mounted the diskgroup by hand in the ASM instance; that worked.
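For completeness, a minimal sketch of the manual mount, connected as SYSASM to the ASM instance:

```sql
-- Hedged sketch: manual mount of the dismounted diskgroup in the ASM instance.
alter diskgroup MYDB_FRA01 mount;
-- verify:
select name, state from v$asm_diskgroup;
```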
Then we checked the status of the redo log groups:
set lines 1000
Select * from v$log;
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME
———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-
1 1 5272 209715200 512 2 NO INACTIVE 36547887 04.09.2012 13:19:39 36557431 04.09.2012 13:49:39
3 1 5274 209715200 512 2 NO CURRENT 36577434 04.09.2012 14:46:26 2,8147E+14
2 1 5273 209715200 512 2 NO INACTIVE 36557431 04.09.2012 13:49:39 36577434 04.09.2012 14:46:26
select GROUP#,STATUS,MEMBER from v$logfile;
GROUP# STATUS MEMBER
———- ——- ———————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
1 +DATA2/MYDB/onlinelog/group_1.258.784139161
1 +MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163
2 +DATA2/MYDB/onlinelog/group_2.259.784139165
2 +MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167
3 +DATA2/MYDB/onlinelog/group_3.260.784139169
3 +MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171
It showed that group 3 was CURRENT. So the master plan was to drop the second redo member of each group, i.e. the members located in the FRA diskgroup.
In ASM I mounted the diskgroup MYDB_FRA01 again (alter diskgroup MYDB_FRA01 mount;), and had to repeat this every time it dismounted.
Then I dropped the logfile member:
ALTER DATABASE DROP LOGFILE MEMBER '+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163';
This succeeded, but in the alert log I saw that the control file in that FRA diskgroup was also not happy with this action:
Wed Sep 05 14:28:37 2012
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB/trace/MYDB_ckpt_20685.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: ‘+MYDB_FRA01/MYDB/control03.ctl’
ORA-15078: ASM diskgroup was forcibly dismounted
So I investigated where the control files were located: I created a pfile from the spfile and checked.
*.control_files=’+DATA2/MYDB/control01.ctl’,’+DATA2/MYDB/control02.ctl’,’+MYDB_FRA01/MYDB/control03.ctl’
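The same information can also be pulled from the instance itself; a minimal sketch:

```sql
-- Hedged sketch: two ways to check control file locations.
show parameter control_files       -- SQL*Plus, works from NOMOUNT onwards
select name from v$controlfile;    -- needs at least MOUNT
```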
So I first wanted to get the control file into a proper state by performing the following steps:
Shutdown
Startup nomount;
rman target=/
restore controlfile to '+MYDB_FRA01/MYDB/control03.ctl'
from '+DATA2/MYDB/control01.ctl';
Since that worked, I mounted the database again in order to proceed with dropping the second members in the diskgroup MYDB_FRA01.
##
ALTER DATABASE DROP LOGFILE MEMBER '+MYDB_FRA01/MYDB/onlinelog/group_1.257.784139163';
## The first one was already gone, so Oracle complained.
ALTER DATABASE DROP LOGFILE MEMBER '+MYDB_FRA01/MYDB/onlinelog/group_2.258.784139167';
After this drop the instance fell over again because the diskgroup was dismounted once more; I had to remount it before moving on.
Then I tried to drop the second redo member of the CURRENT redo log group (3):
ALTER DATABASE DROP LOGFILE MEMBER '+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171';
ALTER DATABASE DROP LOGFILE MEMBER ‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’
*
ERROR at line 1:
ORA-01609: log 3 is the current log for thread 1 – cannot drop members
ORA-00312: online log 3 thread 1:
‘+DATA2/MYDB/onlinelog/group_3.260.784139169’
ORA-00312: online log 3 thread 1:
‘+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171’
That was a major setback because I could not proceed. So I checked one more time:
SQL> set lines 1000;
SQL> /
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME
———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-
1 1 5272 209715200 512 1 NO INACTIVE 36547887 04.09.2012 13:19:39 36557431 04.09.2012 13:49:39
3 1 5274 209715200 512 2 NO CURRENT 36577434 04.09.2012 14:46:26 2,8147E+14
2 1 5273 209715200 512 1 NO INACTIVE 36557431 04.09.2012 13:49:39 36577434 04.09.2012 14:46:26
My thought was to bounce the database and mount it again so that another group would become CURRENT. But that was not correct (that switch is only performed after the database is opened).
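In other words, in MOUNT state the CURRENT group stays CURRENT; a log switch only happens once the database is open. A minimal sketch of what would have been needed (and which was not possible at this point, since the database would not open):

```sql
-- Hedged sketch: forcing group 3 out of CURRENT requires an open database.
alter database open;
alter system switch logfile;          -- another group becomes CURRENT
select group#, status from v$log;     -- group 3 moves towards ACTIVE/INACTIVE
```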
Another attempt was to copy the first member of redo group 3 to the "retrieved" MYDB_FRA01 diskgroup:
ASMCMD [+] > cp +DATA2/MYDB/onlinelog/group_3.260.784139169 +MYDB_FRA01/MYDB/onlinelog/group_3
This however showed:
ASMCMD-8016: copy source->’+DATA2/MYDB/ONLINELOG/group_3.260.784139169′ and target->’+MYDB_FRA01/MYDB/onlinelog/group_3′ failed
ORA-19505: failed to identify file “+MYDB_FRA01/MYDB/onlinelog/group_3”
ORA-17502: ksfdcre:4 Failed to create file +MYDB_FRA01/MYDB/onlinelog/group_3
ORA-15196: invalid ASM block header [kfc.c:19572] [check_kfbh] [1] [533] [271535456 != 3231181597]
ORA-15130: diskgroup “MYDB_FRA01” is being dismounted
ORA-15066: offlining disk “FRA2_0001” in group “MYDB_FRA01” may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1]
ORA-06512: at “SYS.X$DBMS_DISKGROUP”, line 413
ORA-06512: at line 3 (DBD ERROR: OCIStmtExecute)
At that time, honestly, I gave up ……………………….. 😦
Hope always dies last:
I spoke to a Linux colleague about the next steps to perform. In the end we settled on the following steps:
- Dropped diskgroup MYDB_FRA01 including contents.
- At the Linux level the disk was wiped with dd (writing zeros across the whole device), so we could be sure it was like a new disk.
- Diskgroup MYDB_FRA01 was recreated with the zeroed disk.
- Copied the control file across from the DATA diskgroup to the FRA diskgroup:
Shutdown
Startup nomount;
rman target=/
restore controlfile to ‘+MYDB_FRA01/MYDB/control03.ctl’
from ‘+DATA2/MYDB/control01.ctl’;
- Alter database mount (worked).
- Alter database open worked too, honestly a bit to my surprise, since the database still knew about one member of redo group 3 in MYDB_FRA01. But hey, we all need some good news sometimes.
- Since the database opened, it also performed a log switch. So I saw that the problem redo group 3 (which was CURRENT during the earlier attempts) was now INACTIVE, and I dropped the member:
ALTER DATABASE DROP LOGFILE MEMBER '+MYDB_FRA01/MYDB/onlinelog/group_3.259.784139171';
- Added a new second member in the FRA diskgroup for each of the three groups:
alter database add logfile member '+MYDB_FRA01' to group 1;
alter database add logfile member '+MYDB_FRA01' to group 2;
alter database add logfile member '+MYDB_FRA01' to group 3;
- I checked:
SQL> /
Database altered.
SQL> select * from v$log;
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARC STATUS
———- ———- ———- ———- ———- ———- — —————-
1 1 5278 209715200 512 2 NO CURRENT
2 1 5276 209715200 512 2 YES INACTIVE
36599977 05.09.2012 18:00:21 36599981 05.09.2012 18:00:27
3 1 5277 209715200 512 2 YES INACTIVE
- Performed alter system switch logfile several times and checked:
SQL> select * from v$logfile;
GROUP# STATUS TYPE MEMBER
————————————————————————————–
1 ONLINE +DATA2/MYDB/onlinelog/group_1.258.784139161
1 ONLINE +MYDB_FRA01/MYDB/onlinelog/group_1.263.793217497
2 ONLINE +DATA2/MYDB/onlinelog/group_2.259.784139165
2 ONLINE +MYDB_FRA01/MYDB/onlinelog/group_2.264.793217515
3 ONLINE +DATA2/MYDB/onlinelog/group_3.260.784139169
3 ONLINE +MYDB_FRA01/MYDB/onlinelog/group_3.265.793217533
- Looked good. I bounced the database, then brought it down again because I wanted to start it via the clusterware.
- Enabled the resource again in the clusterware:
srvctl enable database -d MYDB
- Started the database via srvctl:
srvctl start database -d MYDB
Happy end ….