ASM Instance Not Starting after Cluster Node Reboot

Introduction.

I was once again involved in a situation where the RAC cluster did not start after a server reboot during a maintenance window, and as always it was a true challenge. In such cases the alert log of the node and the ohasd log will be your best friends (well, together with Metalink and Google, of course).

Details:

After an OS patching action on one of the nodes of one of my 11.2 RACs (Grid Infrastructure), I was asked to take a look because the clusterware was not starting. A first investigation showed that this statement was not entirely true. The clusterware itself had started, but the ohasd log file showed that it was not able to start the ASM resource due to ORA-01031: insufficient privileges.

This is what it showed:

## /opt/crs/product/11.2.0.2_a/crs/log/Mysrvr1r/ohasd [+ASM1]# view ohasd.log
2013-08-06 11:05:01.643: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.644: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Got agent-specific msg: ORA-01031: insufficient privileges
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.646: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} RI [ora.asm 1 1] new internal state: [STABLE] old value: [CLEANING]
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} CRS-2681: Clean of 'ora.asm' on 'Mysrvr1r' succeeded

That did not look all too good. To test a first guess about what was going on, I tried to connect to the ASM instance on that box via sqlplus (sqlplus / as sysasm). That also returned ORA-01031: insufficient privileges.
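The ORA-01031 line is easy to miss among the AGFW and CRSPE chatter in ohasd.log, so a small filter can help surface it. This is only a minimal sketch: the function name agent_ora_errors is made up for this example, and it assumes the "Got agent-specific msg:" wording seen in the log excerpt.

```shell
# Print the distinct ORA- errors that agents reported in an ohasd.log-style file.
# $1 = path to the log file (hypothetical helper, not an Oracle tool)
agent_ora_errors() {
    grep 'Got agent-specific msg: ORA-' "$1" \
      | sed 's/.*Got agent-specific msg: //' \
      | sort -u
}
```

It could be called as agent_ora_errors ohasd.log from the ohasd log directory.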

I had to giggle, because when looking for that message on the web I ended up at my own blog, which proves once again that you can help yourself by helping others and sharing in the Oracle community. Basically I focused on three Metalink notes that might apply:

Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]

UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]

UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]

The third note (67984.1) was my bingo! It proved that the group ID (gid) of my dba group had changed from 101 to some other value because of an LDAP lookup. I asked the Linux colleague to disable these lookups, and after that the ASM instance started, and all the database instances as well. As a workaround, they added the oracle user to nss_initgroups_ignoreusers in /etc/ldap.conf to prevent this from happening again.
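For reference, a quick way to check whether a user is on that ignore list is sketched below. It assumes an nss_ldap-style /etc/ldap.conf with the users given as a comma-separated list on an nss_initgroups_ignoreusers line. The function name user_in_ignore_list is hypothetical.

```shell
# Return 0 if USER is listed on an uncommented nss_initgroups_ignoreusers line.
# $1 = path to the ldap.conf file, $2 = user name (hypothetical helper)
user_in_ignore_list() {
    conf="$1"
    user="$2"
    grep -E '^[[:space:]]*nss_initgroups_ignoreusers[[:space:]]' "$conf" 2>/dev/null \
      | sed -E 's/^[[:space:]]*nss_initgroups_ignoreusers[[:space:]]+//' \
      | tr ', ' '\n\n' \
      | grep -qx "$user"
}
```

For example: user_in_ignore_list /etc/ldap.conf oracle && echo "oracle is skipped for LDAP group lookups".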

Happy reading,

Mathijs

The return of ASM communication Issues (ORA-01031: insufficient privileges) (WARNING: ASM communication): the aftermath

Introduction

On January 5th I wrote a post about the issues we faced with an ASM instance that, at specific points in time, would not let me log in with sqlplus / as sysasm. During those periods the alert logs of the databases on that box also showed warnings such as “.. ASM communication error”. With information from the web (Metalink), a solution and a workaround were offered and implemented. For example, on that specific box the oinstall group was missing in the first place (the primary OS group is dba, as in oracle:dba), so I had the Linux colleague add oinstall on that box. As a workaround I created a tnsnames entry and connected via sys@asm as sysasm, which also worked well. So at that point we all thought: case closed.

Well… not entirely, because the issue showed up again recently, and even though the workaround (the connect string method) was working, I was not a happy Database Administrator. I opened a TAR with Oracle, but I was going in circles with it this time.

Work Info

Last Friday the issue showed up again on a box in one of the clusters. An internal mail about it was sent within our team, and a very interesting clue came back from one of the colleagues who had had a similar experience in a different project. He came up with the following information on MOS:

Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]
UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]
UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]

The last note in particular, 67984.1, was very useful because it showed that at the time of the issue the gid (group ID) was no longer valid due to an LDAP call.

With the output of that note and the analysis that followed, it turned out that the nscd daemon (Name Service Cache Daemon, http://www.linux.ncsu.edu/realm_linux/usersguide-EL4/ch04s06.php) might be part of the issue, as queries like the following on the OS showed:

# getent group dba
101
# getent group 5000
dba
# getent group dba
5000

When the Linux administrator configured the correct (exception) information in /etc/ldap.conf the problem vanished and the Phantom hunt ended.
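A gid that differs between the local group file and the name service can be checked for with a few lines of shell. The sketch below compares the gid in a local group file with the gid in an entry as returned by getent. It assumes standard name:password:gid:members entries, and the function names gid_of_entry and gid_matches are made up for illustration.

```shell
# Print the numeric gid (third colon-separated field) of a group entry
# such as "dba:x:101:oracle".
gid_of_entry() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Return 0 only if the group exists in the local file and its gid equals
# the gid in the entry handed in as the third argument.
# $1 = group name, $2 = local group file, $3 = entry returned by getent
gid_matches() {
    local_entry=$(grep "^$1:" "$2" | head -n 1)
    [ -n "$local_entry" ] &&
        [ "$(gid_of_entry "$local_entry")" = "$(gid_of_entry "$3")" ]
}
```

For example: gid_matches dba /etc/group "$(getent group dba)" || echo "gid mismatch for dba". Running this from cron and logging the result might help catch the moment the gid flips.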

Happy end

Bottom line of this:

  • Never believe in phantoms; things like those described here happen for a reason.
  • Always be willing to communicate within the team and beyond, because communication might bring a so-called Aha-Erlebnis (an aha moment).
  • Standardize, standardize, standardize when you are using LDAP and local configurations, because otherwise you really let the ghost out of the machine.
  • A special thank you to the colleagues who started the internal mail and to the one who shared his experiences with the team.

Happy reading,

Mathijs

ASM communication Issues (ORA-01031: insufficient privileges) (WARNING: ASM communication error: op 0 state 0x0 (15055))

Introduction:

A couple of months ago I had an issue in an 11.2.0.2 environment with Grid Infrastructure and ASM on Linux: without any apparent reason, at specific points in time, I would see messages about communication errors between the databases and the ASM instance, and I was not able to connect to the ASM instance with sqlplus / as sysasm. I opened a TAR back then and even got myself a fresh bug number, 14767353, but no answers, so in practice this remained unsolved. On Friday the 4th I had the same issue again, so it was time to take up arms again and investigate.

Below you will find the steps, what I saw, and how I solved it with the help of my friends (Google and Metalink).

Environment: 4-node GI 11.2.0.2.0 on Red Hat Linux with ASM. Installations performed as oracle:dba.
