The Case of a Missing or Corrupt Spfile in your Grid Infra (11.2.0.3.0)

Introduction

On a Grid Infrastructure (11.2.0.3.0) cluster (SRVR1R, SRVR2R, SRVR3R) we received more and more tickets reporting that the ASM instances were suffering from a fragmented shared pool. The investigation showed that there was no SPFILE present (let alone one with automatic memory management configured), so this environment clearly deserved better. The only thing to be found was an out-of-date INIT.ORA. From its history, this environment had been set up as an 11gR1 RAC cluster and had later been upgraded to 11gR2 Grid Infra.

In this blog you will see, step by step, how to move to an SPFILE in a cluster environment. The same procedure can (and should) also be followed when an SPFILE is lost in a RAC cluster.

Note: what was interesting to find out was the way Oracle uses a specific configuration file (the GPnP profile) as a properties file when starting the Clusterware. It is VERY important to understand that the activities below have to be performed on ONE node in the cluster!

After the work had finished I noticed that Oracle had created a new property file. But there was more to be impressed about: after starting the Grid Infra on the other two nodes I saw that it took care of those files on the other machines too. So apparently the order of precedence when starting the ASM instances is (a quick way to inspect that profile is shown right after this list):

  1. The property file
  2. SPFILE if present
  3. INIT.ORA  if present
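For reference, the SPFILE location that ASM uses at startup is recorded in that property file (the GPnP profile). A quick way to peek at it on a running node is the sketch below; I am assuming the profile stores the location in an SPFile attribute, and the grep is just one way to pull it out of the XML.

## as the grid owner, with the grid home first in the PATH
gpnptool get 2>/dev/null | grep -o 'SPFile="[^"]*"'
asmcmd spget        ## reports the spfile location registered for the ASM instance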

I have performed my activities on the first NODE in the cluster:

oracle@SRVR1R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR1R/profiles/peer [+ASM1]# ls -al
total 20
drwxr-x--- 2 oracle dba 4096 Sep 25 20:03 .
drwxr-x--- 3 oracle dba 4096 Mar 16  2012 ..
-rw-r--r-- 1 oracle dba 1946 Sep 25 19:41 profile.old
-rw-r--r-- 1 oracle dba 2004 Mar 16  2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2007 Sep 25 19:41 profile.xml

When I had finished and the first node was happy, I started the Clusterware on the other two nodes. And to my positive surprise I saw this:

oracle@SRVR2R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR2R/profiles/peer []# ls -ltr
total 16
-rw-r--r-- 1 oracle dba 2004 Mar 16  2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2004 Mar 16  2012 profile.old
-rw-r--r-- 1 oracle dba 2004 Sep 25 20:03 pending.old
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:03 profile.xml

oracle@SRVR3R:/opt/oracle [+ASM3]# cd /opt/crs/product/112_ee_64/crs/gpnp/SRVR3R/profiles/peer
oracle@SRVR3R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR3R/profiles/peer [+ASM3]# ls -ltr
total 20
-rw-r--r-- 1 oracle dba 2004 Mar 16  2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2004 Mar 16  2012 profile.old
-rw-r--r-- 1 oracle dba 2004 Sep 25 20:08 pending.old
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:08 profile.xml
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:08 pending.xml

Investigate the environments.

Since I had the memory issues on the SRVR1R node (+ASM1), I first checked the settings on the two other nodes. The interesting part is that a lot of parameters, such as cluster_database = true, were already at their defaults; I am sure that is because of the property file I mentioned in the previous chapter. I used the asm_diskgroups values from those nodes as preparation for my own new INIT.ORA file, in which I would also increase the memory settings.
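Before touching anything, a simple way to capture what a healthy instance is running with is the sketch below (the /tmp file name is just an example); the 11g FROM MEMORY clause dumps the currently active parameters into a plain pfile you can use as a reference for the new INIT.ORA.

## on SRVR2R or SRVR3R, connected to the local ASM instance as sysasm
SQL> show parameter asm_diskgroups
SQL> show parameter memory
SQL> create pfile='/tmp/init_asm_reference.ora' from memory;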

Plan of approach

I learned that my only chance to get things right was to stop the FULL Clusterware stack on ALL nodes in the cluster. This cannot be done on a per-node basis (trust me, I investigated that option). After shutting everything down, you then work with the Clusterware on one node in a kind of restricted mode.

Once that situation is in place (everything down, working from one node without CRS active), I would present a prepared INIT.ORA and create an SPFILE in a disk group in ASM.

The steps below follow that scenario, describe it, and show the aftercare that turned out to be needed.

Stopping the grid infra:

First I stopped the databases using the corresponding srvctl commands. NOTE: this is a mixed environment of 11.1 and 11.2 databases and listeners, all under control of the 11.2.0.3.0 Grid Infrastructure, so you NEED to set the appropriate environment for each of them before going to work.
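As a sketch only (the database names below are placeholders), stopping the databases boils down to running srvctl from the home that owns each database:

## 11.2 database, using the srvctl of its own 11.2 home (MYDB112 is a hypothetical name)
srvctl stop database -d MYDB112
## 11.1 database, using the srvctl of its 11.1 home (MYDB111 is a hypothetical name)
srvctl stop database -d MYDB111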

As ROOT:

This has to be done on ALL nodes!

export PATH=/opt/crs/product/112_ee_64/crs/bin:$PATH

which crsctl

crsctl stop crs -f

Stopping the Grid Infra on SRVR1R

The first attempt showed the errors below. That was to be expected, since SRVR1R (+ASM1) was the node having the memory issues in the first place.

SRVR1R:root:/opt/crs/product/112_ee_64/crs/bin # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.crsd' on 'SRVR1R'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.oc4j' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.cvu' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN3.lsnr' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.DATA03.dg' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.DATA05.dg' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.DATA1.dg' on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.FRA1.dg' on 'SRVR1R'
.
CRS-2677: Stop of 'ora.scan3.vip' on 'SRVR1R' succeeded
CRS-2672: Attempting to start 'ora.scan3.vip' on 'SRVR2R'
CRS-5017: The resource action "ora.UNICAT_FRA01.dg stop" encountered the following error:
ORA-04031: unable to allocate 4000 bytes of shared memory ("shared pool","ALTER DISKGROUP UNICAT_FRA01...","sga heap(1,0)","kglsim heap")
. For details refer to "(:CLSN00108:)" in "/opt/crs/product/112_ee_64/crs/log/SRVR1R/agent/crsd/oraagent_oracle/oraagent_oracle.log".
CRS-5017: The resource action "ora.DATA1.dg stop" encountered the following error:
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","ALTER DISKGROUP DATA1 DISMOU...","sga heap(1,0)","kglsim object batch")
. For details refer to "(:CLSN00108:)" in "/opt/crs/product/112_ee_64/crs/log/SRVR1R/agent/crsd/oraagent_oracle/oraagent_oracle.log".
.
CRS-2794: Shutdown of Cluster Ready Services-managed resources on 'SRVR1R' has failed
CRS-2675: Stop of 'ora.crsd' on 'SRVR1R' failed
CRS-2799: Failed to shut down resource 'ora.crsd' on 'SRVR1R'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'SRVR1R' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

Now that did not look good.

Stopping the Grid Infra on SRVR2R

This one went smoothly, the way it is supposed to.

SRVR2R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'SRVR2R'
CRS-2673: Attempting to stop 'ora.crsd' on 'SRVR2R'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'SRVR2R'
.
.
.
CRS-2673: Attempting to stop 'ora.ons' on 'SRVR2R'
CRS-2677: Stop of 'ora.ons' on 'SRVR2R' succeeded
CRS-2673: Attempting to stop 'ora.net1.network' on 'SRVR2R'
CRS-2677: Stop of 'ora.net1.network' on 'SRVR2R' succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'SRVR2R' has completed
CRS-2677: Stop of 'ora.crsd' on 'SRVR2R' succeeded
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'SRVR2R' has completed
CRS-4133: Oracle High Availability Services has been stopped.

I checked whether there were any processes still hanging around:

ps -ef|grep d.bin

That was not the case: nothing was showing, so that was great.

Stopping the Grid Infra on SRVR3R

SRVR3R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'SRVR3R'
CRS-2673: Attempting to stop 'ora.crsd' on 'SRVR3R'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'SRVR3R'
.
.
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'SRVR3R' has completed
CRS-4133: Oracle High Availability Services has been stopped.

ps -ef|grep d.bin

Again I checked this node with the command above. Nothing was showing, so that was great too. Some would say two out of three is not bad, but of course we wanted the full score.

Issues on SRVR1R

As we saw earlier, the shutdown of the grid on this node was not happy. An additional check:

SRVR1R:root:/opt/crs/product/112_ee_64/crs/bin # ps -ef|grep d.bin

It showed the following processes still out there. Note that I had checked the other two nodes as well, and there they were not present.

root      7868     1  0 May17 ?        04:14:54 /opt/crs/product/112_ee_64/crs/bin/ohasd.bin reboot
oracle    8466     1  0 May17 ?        00:04:07 /opt/crs/product/112_ee_64/crs/bin/mdnsd.bin
oracle    8494     1  0 May17 ?        01:12:02 /opt/crs/product/112_ee_64/crs/bin/gpnpd.bin
oracle    8549     1  0 May17 ?        05:22:23 /opt/crs/product/112_ee_64/crs/bin/gipcd.bin
oracle    9507     1  5 May17 ?        7-10:59:30 /opt/crs/product/112_ee_64/crs/bin/ocssd.bin
root     10113     1  0 May17 ?        01:12:38 /opt/crs/product/112_ee_64/crs/bin/octssd.bin reboot
root     10146     1  7 May17 ?        10-01:10:41 /opt/crs/product/112_ee_64/crs/bin/crsd.bin reboot
oracle   10159     1  0 May17 ?        00:43:47 /opt/crs/product/112_ee_64/crs/bin/evmd.bin
root     13975     1  0 Sep01 ?        04:47:23 /opt/crs/product/112_ee_64/crs/bin/osysmond.bin
root     28633 13319  0 18:45 pts/3    00:00:00 grep d.bin

The documentation showed I would have to kill them with kill -15.

## So I issued:

kill -15 7868 8466 8494 8549 9507 10113 10146 10159 13975

Oh well, that did not work out very well. The system was hanging, and after some time I noticed that the node had panicked and rebooted itself. That was not great at all. I patiently waited for the node to come back in full, and asked a Linux colleague to take a look; he confirmed that the system was struggling but would come back online. So I waited it out.

Second attempt to stop the Grid Infra on SRVR1R

OK, to let the cat out of the bag: this time it looked MUCH better!

SRVR1R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'SRVR1R'
CRS-2673: Attempting to stop 'ora.crsd' on 'SRVR1R'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'SRVR1R'
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'SRVR1R' has completed
CRS-4133: Oracle High Availability Services has been stopped.

Ah, so this time there was not only a happy node but also a happy DBA. Of course I did a final check:

ps -ef|grep d.bin

And all looked great now.

Starting the Grid Infra on one node in exclusive (restricted) mode:

As the root user I had to start the Clusterware (Grid Infra) in exclusive mode, without starting CRS:

SRVR1R:root:/root # crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'SRVR1R'
CRS-2676: Start of 'ora.mdnsd' on 'SRVR1R' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'SRVR1R'
CRS-2676: Start of 'ora.gpnpd' on 'SRVR1R' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'SRVR1R'
CRS-2672: Attempting to start 'ora.gipcd' on 'SRVR1R'
CRS-2676: Start of 'ora.cssdmonitor' on 'SRVR1R' succeeded
CRS-2676: Start of 'ora.gipcd' on 'SRVR1R' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'SRVR1R'
CRS-2672: Attempting to start 'ora.diskmon' on 'SRVR1R'
CRS-2676: Start of 'ora.diskmon' on 'SRVR1R' succeeded
CRS-2676: Start of 'ora.cssd' on 'SRVR1R' succeeded

Starting the ASM instance manually.

I had prepared an INIT.ORA file with new, higher settings, and had included the automatic memory management that Oracle recommends. With that specific INIT.ORA I started the ASM instance on the first node manually.
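For the record, below is a minimal sketch of what such an INIT.ORA could contain. The disk group list and the exact values are site-specific assumptions (the memory, large pool and processes values match the ones checked later in this post); instance_type=asm is what marks this as an ASM instance.

## /opt/oracle/initNew.ora -- sketch only, values are examples
*.instance_type='asm'
*.asm_diskgroups='DATA1','FRA1'        ## take the list from the surviving nodes
*.asm_power_limit=1
*.memory_target=2G
*.memory_max_target=4G
*.large_pool_size=256M
*.processes=999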

SQL> startup pfile='/opt/oracle/initNew.ora';
ASM instance started

Total System Global Area 4275781632 bytes
Fixed Size                  2235208 bytes
Variable Size            4239991992 bytes
ASM Cache                  33554432 bytes
ASM diskgroups mounted

## I checked that all the disk groups I wanted mounted for all ASM instances were indeed there, and was happy to see them.

SQL> select * from v$asm_diskgroup;

SQL> select name,state from v$asm_diskgroup order by 1;
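A small variation on those queries that only lists anything that failed to mount (just a convenience sketch):

SQL> select name, state from v$asm_diskgroup where state <> 'MOUNTED';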

Note: there was no specific disk group available for the voting disk, nor for my soon-to-be-born SPFILE, so that was a bummer. I therefore chose the +DATA1 disk group and created the SPFILE there:

create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';

And then I saw the error message below:

SQL> create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';
create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file +DATA1
ORA-15221: ASM operation requires compatible.asm of 11.2.0.0.0 or higher

 

So I had to check another thing:

col COMPATIBILITY form a10
col DATABASE_COMPATIBILITY form a10
col NAME form a20
select group_number, name, compatibility, database_compatibility from v$asm_diskgroup order by 1;

GROUP_NUMBER NAME                 COMPATIBIL DATABASE_C
------------ -------------------- ---------- ----------
.
.
           6 DATA1                10.1.0.0.0 10.1.0.0.0
.
.
17 rows selected.

 

So I had to change compatible.asm before I could proceed with the disk group I had selected (DATA1):

alter diskgroup data1 set attribute 'compatible.asm'='11.2';

Diskgroup altered.

Then I created the SPFILE from pfile:

SQL> create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';

File created.

SQL> shutdown immediate;

As a non-believer, of course I wanted to make sure about the new automatic memory management settings, because maintenance windows like these should be very rare on a RAC cluster. So I restarted the instance manually one more time, because I wanted to be happy with the final checks.

The parameters shown below include the ones I had increased or newly set.

ASM instance started

Total System Global Area 4275781632 bytes
Fixed Size                  2235208 bytes
Variable Size            4239991992 bytes
ASM Cache                  33554432 bytes
ASM diskgroups mounted

SQL> show parameter SPFILE

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      +DATA1/srvr1r/asmparameterfile/registry.253.794950891

SQL> show parameter asm

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      ALL disk groups to be mounted
asm_diskstring                       string
asm_power_limit                      integer     1
asm_preferred_read_failure_groups    string

SQL> show parameter memory

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
memory_max_target                    big integer 4G
memory_target                        big integer 2G

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
sga_max_size                         big integer 4G
sga_target                           big integer 0

SQL> show parameter large

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
large_pool_size                      big integer 256M
use_large_pages                      string      TRUE

SQL> show parameter proce

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
processes                            integer     999
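Outside SQL*Plus, asmcmd offers a quick cross-check that this is indeed the spfile registered for ASM (a sketch, run as the grid owner with the +ASM1 environment set); it should report the same +DATA1 registry path as show parameter SPFILE above.

asmcmd spget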

This was everything I had planned for, so it was time to shut down the instance.

SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown

SQL> exit

Time for the normal restart of the first node in the Grid Infra

Since the node had been started in exclusive mode, I first had to stop the Clusterware on it again.

As the user ROOT:

crsctl stop crs

I was very curious, and crossed my fingers about the outcome.

As part of the shop's standards, each node has a mount point per database, so I established the link in /opt/oracle:

ln -s +ASM +ASM1

And then I started the Grid Infrastructure:

crsctl start crs

Once the first node was up and ASM was on its feet, I did the same for the two other nodes.
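To verify the restart on each node, a few standard checks (a sketch; run with the 11.2 grid home in the PATH):

## as root or the grid owner
crsctl check crs              ## the Clusterware daemons should all report online
crsctl status resource -t     ## overall resource status
srvctl status asm             ## ASM instances across the cluster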

Last things

I had to restart the 11.2 databases manually, and noticed that the 11.1 databases had restarted themselves; my stop procedure was causing that. If this scenario is performed on a production box, this part will need improvement.
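As a sketch (the database name is a placeholder again), restarting an 11.2 database under Grid Infra control is once more a srvctl call from its own home:

srvctl start database -d MYDB112
srvctl status database -d MYDB112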

The ASM listeners did not come up as resources, because of:

  • a missing log directory (I added that one);
  • starting the listener from the CRS environment failed as well.

ora.SRVR1R.ASM1.asm                              ONLINE     ONLINE on SRVR1R
ora.SRVR1R.listener_+ASM1.lsnr                   ONLINE     OFFLINE
ora.SRVR2R.ASM2.asm                              ONLINE     ONLINE on SRVR2R
ora.SRVR2R.listener_+ASM2.lsnr                   ONLINE     ONLINE on SRVR2R
ora.SRVR3R.ASM3.asm                              ONLINE     ONLINE on SRVR3R
ora.SRVR3R.listener_+ASM3.lsnr                   ONLINE     OFFLINE

Manually I could start the listener after adding the log file directory (also part of the standards). So I had to understand what was going on. I think the 11gR2 Grid Infra does not accept the '+' in the name, which was perfectly valid in 11.1. This is an action point for the future.

The Grid Infra did start the listener for ASM2, but when I try to check that resource:

oracle@SRVR1R:/opt/oracle [+ASM1]# crsctl status resource ora.SRVR2R.listener_+ASM2.lsnr -p
CRS-4037: The name 'ora.SRVR2R.listener_+ASM2.lsnr' contains the following invalid characters: '+'
CRS-4000: Command Status failed, or completed with errors.

I have started the listeners manually, and they work. If I use srvctl to start listener_+ASM1.lsnr I get errors similar to the ones above.

In the end I had to start them with the ORACLE_HOME pointed to 11gR1, and that worked.
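For illustration, starting such a listener manually with the 11.1 home could look like the sketch below; the 11.1 home path is an assumption, and the listener name is taken from the resource output above.

## hypothetical 11.1 home path
export ORACLE_HOME=/opt/oracle/product/111_ee_64/asm
export PATH=$ORACLE_HOME/bin:$PATH
lsnrctl start LISTENER_+ASM1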

 Happy END
