Introduction:
I was recently involved, once again, in a situation where a RAC cluster did not start after a server reboot during a maintenance window. As always, that was a true challenge. In such cases the alert log of the node and the ohasd log will be your best friends (well, together with Metalink and Google, of course).
Details:
After an OS patching action on one of the nodes of one of my 11.2 RACs (Grid Infrastructure), I was asked: can you please take a look, because the clusterware is not starting? A first investigation showed that this statement was not entirely true. The clusterware itself had been started, but the ohasd log file showed that it was not able to start the ASM resource due to ORA-01031: insufficient privileges.
This is what it showed:
## /opt/crs/product/11.2.0.2_a/crs/log/Mysrvr1r/ohasd
[+ASM1]# view ohasd.log
2013-08-06 11:05:01.643: [ AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.644: [ AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.644: [ CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.644: [ CRSPE][1991387456] {0:0:2} Got agent-specific msg: ORA-01031: insufficient privileges
2013-08-06 11:05:01.646: [ AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.646: [ AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.646: [ CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [ AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.829: [ AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.829: [ CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [ CRSPE][1991387456] {0:0:2} RI [ora.asm 1 1] new internal state: [STABLE] old value: [CLEANING]
2013-08-06 11:05:01.829: [ CRSPE][1991387456] {0:0:2} CRS-2681: Clean of 'ora.asm' on 'Mysrvr1r' succeeded
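In a busy ohasd.log the one line that matters is easy to miss between all the AGFW and CRSPE chatter. Below is a minimal sketch, in Python, of how the ORA- errors could be pulled out of such a log. The function name find_ora_errors and the regular expressions are my own assumptions, based only on the line layout seen in the excerpt above, so they may need adjusting for other clusterware versions.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   <timestamp>: [ COMPONENT][thread id] {context} message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"\[\s*(\w+)\]\[\d+\]\s*(?:\{[^}]*\}\s*)?(.*)$"
)
ORA_RE = re.compile(r"\bORA-\d+\b")


def find_ora_errors(lines):
    """Return (timestamp, component, ora_code, message) for each ORA- hit."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, component, message = match.groups()
        for code in ORA_RE.findall(message):
            hits.append((timestamp, component, code, message))
    return hits
```

It could be used along the lines of find_ora_errors(open("ohasd.log")) and then printing each hit, which for the excerpt above would leave just the CRSPE line with ORA-01031.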
That did not look too good. To get a first idea of what was going on, I tried to connect to the ASM instance on that box via sqlplus (sqlplus / as sysasm). That also returned ORA-01031: insufficient privileges.
I had to giggle, because when I searched the web for that message I ended up on my own blog. Which proves once again that you can help yourself by helping others and sharing in the Oracle community. Basically I focused on three Metalink notes that might apply:
Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]
UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]
UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]
The third note (67984.1) was my bingo! It showed that my group ID (dba) had been changed from 101 to some other value by an LDAP lookup. I asked the Linux colleague to disable these lookups, and after that the ASM instance started, and all the instances as well. As a workaround, they added the oracle user to nss_initgroups_ignoreusers in /etc/ldap.conf to prevent this from happening.
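For a quick check of the same symptom without the C program from the note, something like the Python sketch below could be used. To be clear, this is not the diagnostic from note 67984.1, just a rough illustration under my own assumptions: that the local /etc/group file holds the intended gid for dba, and that a different value coming back from the name service (LDAP in my case) is the thing to flag. The function names local_gid, gid_mismatch and check_system are made up for this example.

```python
import grp


def local_gid(group_name, group_file_text):
    """Return the gid listed for group_name in /etc/group-style text, or None."""
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")  # name:password:gid:members
        if len(fields) >= 3 and fields[0] == group_name and fields[2].isdigit():
            return int(fields[2])
    return None


def gid_mismatch(group_name, group_file_text, resolved_gid):
    """True when the local file and the name service disagree on the gid."""
    expected = local_gid(group_name, group_file_text)
    return expected is not None and expected != resolved_gid


def check_system(group_name="dba", group_file="/etc/group"):
    """Compare the local file with what the name service switch resolves."""
    with open(group_file) as handle:
        text = handle.read()
    # grp.getgrnam goes through NSS, so it sees LDAP answers as well.
    resolved = grp.getgrnam(group_name).gr_gid
    return gid_mismatch(group_name, text, resolved)
```

Calling check_system("dba") on the affected node would return True when the gid resolved through the name service differs from the one in /etc/group (101 in my case), which is the situation described above.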
Happy reading,
Mathijs