Asm Instance not starting after Cluster Node Reboot

Introduction.

I have been involved again in a situation where  the Rac cluster did not start after a reboot of the server during a maintenance window. And as always a true challenge that was. In such cases it is true that the alert log of the node and the ohasd logging  will be your best friends ( well together with Metalink and Google of course).

Details:

After a Os patching action on one of the nodes on one of my 11.2 Racs (Grid Infrastructure)  i was contacted can you please take a look cause the clusterware is not starting. After first investigation it showed that statement was not entirely true .  The cluster ware itself had been started but the  log file for the ohasd. showed following details , that it was not able to start the asm Resource due to ORA-01031: insufficient privileges.

this is what it showed:

## /opt/crs/product/11.2.0.2_a/crs/log/Mysrvr1r/ohasd [+ASM1]# view ohasd.log
2013-08-06 11:05:01.643: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.644: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Got agent-specific msg: ORA-01031: insufficient privileges
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.646: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} RI [ora.asm 1 1] new internal state: [STABLE] old value: [CLEANING]
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} CRS-2681: Clean of 'ora.asm' on 'Mysrvr1r' succeeded

That did not look all to good. I had a first guess about what was going on by trying to connect to the asm instance on that box via sqlplus ( sqlplus / as sysasm). When that showed  also the ORA-01031: insufficient privileges.

I had to giggle cause when  looking for that  message on the web  i ended up with my blog. Which proves once again that you can help yourself by helping others by sharing in the Oracle community.   Basically i  focused on  three metalink notes that might apply:

Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]

UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]

UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]

The third note (67984.1) was my bingo !  So it was proved that my groupid ( dba) altered from 101 to some other value by a ldap lookup.. I have asked the Linux colleague  to disable these lookups and after that the asm instance started and all the instances as well.  As a workaround , in the /etc/ldap.conf they have added the oracle user to the nss_initgroups_ignoreusers to prevent this from happening.

Happy reading,

Mathijs

Plan to setup OID (Oracle implementation of the Ldap) in a Rac environment

Introduction:

For some time it was planned to implement OID and frankly from the start I knew that would be something different compared to the things I have done so far.  Well but then again it does represent a new frontier does it not? So I felt challenged with the implementation. As an extra aspect during this setup we implemented the OID with Rac databases as a backend.  But as always the quote of the day for this was Success just loves good preperations. So we started with a setup of  a plan. The plan we have implemented captures the following 7 steps to success:

  • Check and plan the layout of your Ldap environment and your ldap.ora. For us that meant creating an environment to hold all production environments, one to hold the test environments well and after all  we added a third content as a concatination of both other environemnts since we  wanted to make sure that in the ldap.ora with only one setting in place we could contact all environments for clients who  will be on citrix servers ( and as far as I know in ldap.ora only one DEFAULT_ADMIN_CONTEXT = “dc=central,dc=env ,dc=EU” which you can use without using full qualified names):

CN=ORCLADMIN

And

dc= prod dc=EU

And

dc=test dc=EU

And

“dc=central,dc=env,dc=EU”

### think about the Ports for the Ldap you will need, one port for NON SSL, one for SSL traffic. Ask yourself if you will come across Firewalls during the setup  & operation.

  • Install a database (in this setup we use 11.2 as version). Since we wanted to deliver a High available solution in this setup we have implanted 3 Rac databases with 2 Oracle Instances each.
  • Install Weblogic 10.3.2 —> It should not be needed in the first place I read in some articles but I spoke to colleagues and they recommended that I should download the Weblogic part as well.  Only download I see is Weblogic server 12c.  So I download installers with Oracle Weblogic Server and Oracle coherence: http://www.oracle.com/technetwork/middleware/weblogic/downloads/index.html Since I am installing on 64bit Linux I was considering download generic file but given the fact that these boxes do not have Java setup properly decided to go with (wls1211_linux32.bin).
  • Setup and implement rules with regard to bulk load and bulk delete.
  • Implement Replication between the Environments.

Happy reading,

Mathijs