Asm Instance not starting after Cluster Node Reboot

Introduction.

I have been involved again in a situation where  the Rac cluster did not start after a reboot of the server during a maintenance window. And as always a true challenge that was. In such cases it is true that the alert log of the node and the ohasd logging  will be your best friends ( well together with Metalink and Google of course).

Details:

After a Os patching action on one of the nodes on one of my 11.2 Racs (Grid Infrastructure)  i was contacted can you please take a look cause the clusterware is not starting. After first investigation it showed that statement was not entirely true .  The cluster ware itself had been started but the  log file for the ohasd. showed following details , that it was not able to start the asm Resource due to ORA-01031: insufficient privileges.

this is what it showed:

## /opt/crs/product/11.2.0.2_a/crs/log/Mysrvr1r/ohasd [+ASM1]# view ohasd.log
2013-08-06 11:05:01.643: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.644: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.644: [   CRSPE][1991387456] {0:0:2} Got agent-specific msg: ORA-01031: insufficient privileges
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.646: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.646: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Received the reply to the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:411 from the agent /opt/crs/product/11.2.0.2_a/crs/bin/oraagent_oracle
2013-08-06 11:05:01.829: [    AGFW][1980881216] {0:0:2} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_CLEAN[ora.asm 1 1] ID 4100:410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} Received reply to action [Clean] message ID: 410
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} RI [ora.asm 1 1] new internal state: [STABLE] old value: [CLEANING]
2013-08-06 11:05:01.829: [   CRSPE][1991387456] {0:0:2} CRS-2681: Clean of 'ora.asm' on 'Mysrvr1r' succeeded

That did not look all to good. I had a first guess about what was going on by trying to connect to the asm instance on that box via sqlplus ( sqlplus / as sysasm). When that showed  also the ORA-01031: insufficient privileges.

I had to giggle cause when  looking for that  message on the web  i ended up with my blog. Which proves once again that you can help yourself by helping others by sharing in the Oracle community.   Basically i  focused on  three metalink notes that might apply:

Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]

UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]

UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]

The third note (67984.1) was my bingo !  So it was proved that my groupid ( dba) altered from 101 to some other value by a ldap lookup.. I have asked the Linux colleague  to disable these lookups and after that the asm instance started and all the instances as well.  As a workaround , in the /etc/ldap.conf they have added the oracle user to the nss_initgroups_ignoreusers to prevent this from happening.

Happy reading,

Mathijs

Plan to setup OID (Oracle implementation of the Ldap) in a Rac environment

Introduction:

For some time it was planned to implement OID and frankly from the start I knew that would be something different compared to the things I have done so far.  Well but then again it does represent a new frontier does it not? So I felt challenged with the implementation. As an extra aspect during this setup we implemented the OID with Rac databases as a backend.  But as always the quote of the day for this was Success just loves good preperations. So we started with a setup of  a plan. The plan we have implemented captures the following 7 steps to success:

  • Check and plan the layout of your Ldap environment and your ldap.ora. For us that meant creating an environment to hold all production environments, one to hold the test environments well and after all  we added a third content as a concatination of both other environemnts since we  wanted to make sure that in the ldap.ora with only one setting in place we could contact all environments for clients who  will be on citrix servers ( and as far as I know in ldap.ora only one DEFAULT_ADMIN_CONTEXT = “dc=central,dc=env ,dc=EU” which you can use without using full qualified names):

CN=ORCLADMIN

And

dc= prod dc=EU

And

dc=test dc=EU

And

“dc=central,dc=env,dc=EU”

### think about the Ports for the Ldap you will need, one port for NON SSL, one for SSL traffic. Ask yourself if you will come across Firewalls during the setup  & operation.

  • Install a database (in this setup we use 11.2 as version). Since we wanted to deliver a High available solution in this setup we have implanted 3 Rac databases with 2 Oracle Instances each.
  • Install Weblogic 10.3.2 —> It should not be needed in the first place I read in some articles but I spoke to colleagues and they recommended that I should download the Weblogic part as well.  Only download I see is Weblogic server 12c.  So I download installers with Oracle Weblogic Server and Oracle coherence: http://www.oracle.com/technetwork/middleware/weblogic/downloads/index.html Since I am installing on 64bit Linux I was considering download generic file but given the fact that these boxes do not have Java setup properly decided to go with (wls1211_linux32.bin).
  • Setup and implement rules with regard to bulk load and bulk delete.
  • Implement Replication between the Environments.

Happy reading,

Mathijs

Thoughts on Adding a Single Instance To Oracle Clusterware (Grid Infra).

General: Configuring Active/Passive Clustering for Oracle Database ( Single Instance in Crs)

The Oracle database can easily be configured to use the Cluster ware framework for high availability.

Using Grid Infrastructure to protect a database resource is a very cost-effective way of setting up an Active/passive cluster. As an added advantage, using only one vendor’s software stack to implement the cluster can make troubleshooting easier. Staff already familiar with RAC will easily be able to set up and run this configuration because it uses a same software stack: all commands and log files are in familiar locations, and troubleshooting does not rely on the input from other teams.

To set up an active/passive cluster with 11g Release 2, you need to initially install Grid Infrastructure on all nodes of the cluster. Grid Infrastructure will give the user with a cluster logical volume manager: ASM. If for some reason another file system is required, you can choose the supported cluster file systems, including ACFS. Using a cluster file system that is mounted concurrently to both cluster nodes offers the advantage of not having to remount the database files and binaries in case of a node failure. Some configurations we saw suffered from extended fail over periods caused by required file system checks before the file system could be remounted.

On top of the Grid Infrastructure build, you do a local installation of the RDBMS. It is important that you do not choose a cluster installation when prompted so; otherwise, you risk violating your license agreement with Oracle.

Continue reading

Steps to Recreate my Central Inventory in a Real Applications Cluster environment on Linux

Introduction:

In the last quarter of 2012 i have set up  a 4 Node Real Application Cluster on Linux which was great to do as always. In day-to-day business I had lost track of this environment  which in itself is not that bad of course. However this week I needed the environment as a preparation for patching so it would be mandatory to check with opatch last patch installed in the grid infra structure and in the rdbms homes that have been set up on those boxes. Well it was to be expected that other dbas would work on those boxes, maybe / likely might have installed new patches so it was really like getting back together with old friends to see what was under the hood.

Well it was a bit of disappointment that the inventory destination  (/opt/oracle/orInventory) was simply empty . Opatch did not like it either cause it  exited with a returncode 73. But then again on the bright side this meant i could take note of the necessary steps  performed  and turn it into a blog.

Continue reading

Rman Catalog Crosscheck and Delete obsolete an Approach

Introduction

When You are or will be working with Rman  as your backup solution you will have to ask yourself several questions . One of the elementary ones will be that you have to consider whether or not you will be utilizing an Rman Catalog or that  you will have faith in the existing  multiplexed controlfiles.  On the web I see debates going on about the use of a catalog and indeed given the idea that you will not need a catalog , since you are not be using stored scripts, you might get timeouts simply cause working with catalog shows long time no communication ( and you have those hmm lets say aggressive  firewalls in place that will snap at that so-it-seems idle sessions) etc. Well this post is about what you should do if you decide to have a catalog  for rman after all or if a catalog is in place due to company hmm  history – habits – or standards.

What  you should consider to have in place ( of course always  depends on your settings and configuration) Should be  a maintenance job for your Rman Catalog for the specific Database ( target)  you have added there.  With some explain below I would like to show you what was done for that maintenance part  by showing scripts we used for that.

Details:

#!/bin/ksh
#set -vx
PROG=$(basename $0)

if [ -z “$1” ]; then
echo “ERROR – wrong start commando”
echo “EXAMPLE – $PROG <SID>”
exit 1
fi

## Script will take the Instance that has been registered as an input parameter
export ORACLE_SID=$1

## Finding the ORATAB since it will help to set Oracle Home
if [ “${OS}” = “Linux” ];then
export ORATAB=/var/opt/oracle/oratab
else
export ORATAB=/etc/oratab
fi
ORATABLINE=`grep $ORACLE_SID $ORATAB`
export ORACLE_HOME=`echo $ORATABLINE | cut -f2 -d:`

## exporting several Parameters among which logging path and logfilename with a timestamp in it
export LOGPATH=/opt/oracle/scripts/log
export CURRENT_TIMESTAMP=`date +%Y%m%d_%H%M%S`   # Format: YYYYMMDD_HHMISS   e.g.: 20110907_150455
export LOGFILE=${ORACLE_SID}_crosscheck_${CURRENT_TIMESTAMP}.log
export NLS_DATE_FORMAT=”DD-MON-YYYY HH24:MI:SS”
export ORA_NLS32=${ORACLE_HOME}/ocommon/nls/admin/data
export ORA_NLS33=${ORACLE_HOME}/ocommon/nls/admin/data
export NLS_LANG=American_America.WE8ISO8859P15

## Connect to rman and your catalog  and start logging (msglog)  to the specified logdir and logfile.
## then allocate a channel. Note if you make backups to TAPE as we do you HAVE to allocate a channel for maintenance type SBT_TAPE and you HAVE to mention in our case specific Networker Server and Networker Client.
## then you perform the crosscheck backup which will check in the Networker database if the Backups is still available. If it is not the backup will be tagged expired in the Rman catalog.
## Once the  crosscheck is finished we wanted to see a result so did a list backup.
## If you are sure that what you see is correct  delete noprompt expired will erase the no longer existing backups in rman.. MAYBE best to run script first without this. Always best to be safe then sorry.
## After delete another Listing is performed as a record of the post  delete information in the catalog.

${ORACLE_HOME}/bin/rman TARGET  / RCVCAT rman_${ORACLE_SID}/${ORACLE_SID}@RMAN MSGLOG ${LOGPATH}\/${LOGFILE} << EOF
allocate channel for maintenance device type ‘SBT_TAPE’;
send ‘NSR_ENV=(NSR_SERVER=adebckus-nl,NSR_CLIENT=adesnouh)’;
crosscheck backup;
list backup of database summary;
delete noprompt expired backup;
list backup of database summary;
release channel;
exit
EOF

Happy Reading and best  of Luck,

Mathijs

When Starting a Rac database brings you: ORA-29760: instance_number parameter not specified.

Introduction

Recently was supporting a colleague who was unable to start a Real Application Database after he had  created the environment by manual scripts. As i already mentioned in another post nothing wrong with exploring both methods ( either using DBCA (with care but hey any tool will come with a manual and  some shortcomings) or manual. Well if you decide to work manually  of course it will take maybe even  more discipline in setting up init.ora files , registering the database and its instances into the Grid Infra structure layer etc.  But generally speaking there is a challenge in either approach you choose. Well on to our case.  The Database would not start via the clusterware (with srvctl ) but manually  it was possible to start either of the two instances in this Rac.

Investigation:

Well first glimpse to see what might be going on:

SQL> select * from gv$instance;

INST_ID INSTANCE_NUMBER INSTANCE_NAME HOST_NAME VERSION STARTUP_TIME STATUS PAR THREAD# ARCHIVE LOG_SWITCH_WAIT
———- ————— —————- —————————————————————- —————– ——————- ———— — ———- ——- —————
LOGINS SHU DATABASE_STATUS INSTANCE_ROLE ACTIVE_ST BLO
———- — —————– —————— ——— —
1 1 MYDB1 server1hr 11.2.0.3.0 11.04.2013 16:14:31 OPEN YES 1 STARTED
ALLOWED NO ACTIVE PRIMARY_INSTANCE NORMAL NO

2 2 MYDB2 server2hr 11.2.0.3.0 11.04.2013 16:25:54 OPEN YES 2 STARTED
ALLOWED NO ACTIVE PRIMARY_INSTANCE NORMAL NO

That query  showed both instances running. 

Then the colleague tried:

srvctl start database -d  … 

PRCR-1079 : Failed to start resource ora.mydb.db

CRS-5017: The resource action “ora.mydb.db start” encountered the following error:
ORA-29760: instance_number parameter not specified
. For details refer to “(:CLSN00107:)” in “/opt/crs/product/11203_ee_64/crs/log/server1hr/agent/crsd/oraagent_oracle/oraagent_oracle.log”.

CRS-2674: Start of ‘ora.mydb.db’ on ‘server1hr’ failed
CRS-2632: There are no more servers to try to place resource ‘ora.mydb.db’ on that would satisfy its placement policy
CRS-5017: The resource action “ora.mydb.db start” encountered the following error:
ORA-29760: instance_number parameter not specified
. For details refer to “(:CLSN00107:)” in “/opt/crs/product/11203_ee_64/crs/log/server2hr/agent/crsd/oraagent_oracle/oraagent_oracle.log”.

CRS-2674: Start of ‘ora.mydb.db’ on ‘server2hr’ failed

Well then it was time to Google  for ORA-29760: instance_number parameter not specified

A Fellow blogger  mentioned the following Mos Note to solve this issue. And it once again proved that if you really settle for doing manual setup of you Rac environment you have to be well prepared and consistent in the naming of your parameters in you configuration file ( init.ora as a base for an spfile) and the way you enter services in the cluster layer. It once again proved that  names are registered in the way you enter them.  In plain English it is Case sensitive so  the old saying : what you type is what you get is very applicable.

MOS note ‘ORA-29760: instance_number parameter not specified’ When Starting the Database with Srvctl [ID 749515.1]

Happy reading and as always best of luck.

Mathijs.

Is it really the good – the – bad – and the Gui Dba building Racs?

Introduction

From my experience for some Database Administrators / sites it  still seem to be more tough /  hardcore to use the command line instead of using tools that  could do the same for you only with a lot less effort. And maybe that is also regarded to be a risk cause it is so easy. Well my thoughts on this is that a healthy mix on the use Gui – tools and command lines should  be an asset  to every administrator. And indeed before even starting debate when using the Gui of course you should always know what you are doing .. but then again that is always the case is it not ?

Creating an Rac manually – using DBCA.

Personally i think that the DBCA tool is  your best way forward if you are creating a new Rac environment. Of course  this tool also has also its limits ( such as on a 4 Node environment it will  by default set your cluster_database_instances default to 4 even when you build only a two node Rac)  but in general it is best way to work cause not only will create the Database , and the instances on the various Nodes  but it also adds the information to the 11G Grid infrastructure ( the Cluster ware ). And  for documentation purposes you can / should also always generate the create scripts in a default ( for me always /opt/oracle//admin/create) directory.

Yesterday i had to support a colleague who decided to create a two node RAC by hand . Indeed  even  when i suggested to him that using the DBCA would be helpful and fast in his actions it was his call to do it all by hand. But then he got stuck in the process with various actions.

If you decide to do  a create manually ( as they say maybe you should try to not use the Gui just cause it is written in this blog).  I think following steps are important:

  1. Work accurately ( setting up init oras , one for single Instance – Db) add the Instance specific information (in my case undo tablespace and local_listener ) were needed. Create spfile from that pfile.
  2. Start – create the single instance-db in full with the spfile.
  3. Once the single instance is alive start alter your spfile:
    1. alter system set cluster_database=true scope = spfile:
    2. alter system set instance_number= 1 scope = spfile sid=”instance1″ ;
    3. alter system set instance_number= 2 scope = spfile sid=”instance2″ ;
    4. alter system set thread = 1 scope = spfile sid=”instance1″ ;
    5. alter system set thread = 2 scope = spfile sid=”instance2″ ;
  4. Add at least the 3 redo groups to the second thread( alter database add logfile thread 2 group x:)
  5. set up the remote_listener with the scan listener
  6. alter system enable thread 2 ;
  7. Add a Taf service.
  8. As a check stop and start and stop both instances manually to see if the database is shared as it should be between the two instances.
  9. Add the information to the cluster:

srvctl  add database -d DB -o /opt/oracle/product/oracleVersion

srvctl  add instance -d DB -i DB1 -n Node1 srvctl  add instance -d DB -i DB2 -n Node2

srvctl modify instance -d DB -i DB1 -s +ASM1 srvctl modify instance -d DB -i DB2 -s +ASM2

srvctl  add service -d DB -s DB_TAF.domain -r DB1,DB2 srvctl start database -d DB srvctl start  service -d DB -s DB_TAF.domain

Ok, are  you still with me , do i rest my case here that it is a lot of work to do this the manual way ? And that these steps can be very  lets say error prone if you do not work 100%  accurate..

Yesterdays event where indeed not successful at first . Amongst other issues He got :

PRCR-1079 : Failed to start resource ora.db.db
CRS-5017: The resource action “ora.db.db start” met the following error:
ORA-29760: instance_number parameter not specified

Once again Google was our friend it showed post of a fellow blogger with regard to a Mos note:

MOS note ‘ORA-29760: instance_number parameter not specified’ When Starting the Database with Srvctl [ID 749515.1].

After that . Things worked .

Bottom-line of this story ? Plain and simple use the right and easiest tool for the Job you need to do. If that is a Gui  there is no dishonor in that . If you decide to go manual create , make sure you work accurately  hmm and have Google available for when things come up during the process.

Happy Reading ,

Mathijs

ASM 11.2.0.2 Is Not Releasing File Descriptors After Drop or Dismount Diskgroup. [ID 1306574.1]

Introduction.

One of my Linux environments is set up with 11.2.0.2.0. both Grid Infra structure and Rdbms together with ASM  using  luns that are part of EMC  BCVS copy  of production for reporting purposes.   This environment can be classified as a  Oracle – Restart environment. As part of a process that is supposed to run by  automated  Job during the night the Database should  go through various stages ( shutdown  abort,  dismount the dedicated ASM diskgroups  per database in ASM).  After that the BCVS are handed over to the source side again for syncing  after which the database will be rebuild using the fresh  BCV split.

Well as I said , supposed  to run automated…  cause at this point and time , during the day this is working  properly but during the night we get  per disk in the disk group a message similar to:

asm-diskgroupname-data-001: map in use

From Reading  and searching  metalink  it was suggested to implement the latest PSU Patch January 2013  due to:

ASM 11.2.0.2 Is Not Releasing File Descriptors After Drop or Dismount Diskgroup. [ID 1306574.1]

So i started my Journey on  patching  this environment.

Continue reading

Selects Against ALL_SYNONYMS Very Slow in 11.2.0.1. and 11.2.0.2. [ID 1242844.1]

Tools such as Toad appeared 2 b very slow.

From mos I quote:

…The bug is fixed in patchset 11.2.0.3 and higher versions.
Release 11.2.0.1 and the patchset 11.2.0.2 are affected by this.

end quote.

Since patching was not an option the workaround has been followed:

Mos note is offering create or replace view ALL_SYNONYMS.. and  to set spfile value (_FIX_CONTROL).

Workaround has been implemented in Test and Preprod environments, no issues found  with it and it showed improvement @  user side.

Happy reading

Mathijs

The return of ASM communication Issues ((ORA-01031: insufficient privileges) (WARNING: ASM communication(the aftermath)

Introduction

January  5th I wrote a post on  the issues we faced with ASM instance which would not let me log in as sqlplus /  as sysasm at specific point and time during which time alert log of the databases  on  that box would also be sending warnings to the alert log “.. ASM communication error”. With information on the web (Metalink)  a solution and a workaround had been offered and implemented.  For example on that specific box the oinstall  gid was lacking in the first place (primary os group is dba (oracle:dba) so I had th Linux colleague added the oinstall onthat box. And  as a workaround  I created a tnsnames entry  and connected via: sys@asm as sysasm that was also working well.  So at that point and time we all thought , case closed.

Well…… Not entirely cause the issue showed again recently and even though the workaround (using the connect string method was working)  I was not a happy Database Administrator with it.  I opened a Tar with Oracle  but  I was going in circles with it this time.

Work Info

Last Friday the Issue showed again on a box in one of the clusters. An internal mail was sent within our team about this and a very interesting clue came back from one of the Colleagues who had similar experience in different project. He came up with following information on MOS:

Troubleshooting ORA-1031: Insufficient Privileges While Connecting As SYSDBA [ID 730067.1]
UNIX: Checklist for Resolving Connect AS SYSDBA Issues [ID 69642.1]
UNIX: Diagnostic C program for ORA-1031 from CONNECT INTERNAL / AS SYSDBA [ID 67984.1]

Actually especially  last Note 67984.1 was very useful cause it showed  that during time of issue the gid  ( group Id ) was no longer valid due to an Ldap call.

With the Output of that note and the analyses after that it turned out that the NCSD daemon (http://www.linux.ncsu.edu/realm_linux/usersguide-EL4/ch04s06.php) might be part of the issue when something like that was queried on the OS:

# getent group dba
 101
 # getent group 5000
 dba
#getent group dba
 5000

When the Linux administrator configured the correct (exception) information in /etc/ldap.conf the problem vanished and the Phantom hunt ended.

Happy end

Bottom line of this:

  • Never believe in phantoms, thinks like described happen for a reason.
  • Always be willing to communicate with in the team and beyond cause communication might bring a so-called aha – Erlebnis (déjà vu).
  • Standardize, standardize,  standardize when you are using Ldap and local configurations cause you really let the ghost out of the machine otherwise.
  • A special thank  you to the colleagues who started the internal mail and to the one who shared his experiences with the team.

Happy  reading,

Mathijs