Introduction
On a Grid Infrastructure (11.2.0.3.0) cluster (SRVR1R, SRVR2R, SRVR3R) we received more and more tickets reporting that the ASM instances were suffering from fragmented shared pool memory. The investigation showed that there was no SPFILE present with automatic memory management in it, so this environment clearly deserved better. The only thing found was an out-of-date INIT.ORA. Historically this environment had been set up as an 11gR1 RAC cluster, and some time ago it had been upgraded to 11gR2 Grid Infra.
In this blog you will see, step by step, how to move to an SPFILE in a cluster environment. The same procedure can / should also be followed in case of a lost SPFILE in a RAC cluster.
Note: what was interesting to find out was the way Oracle uses a specific configuration file, the GPnP profile, as a properties file when starting the Clusterware. It is VERY important to understand that the activities below have to be performed on ONE node in the cluster!
After the work had finished I noticed that Oracle had created a new profile file. But there was more to be wowed about: after starting the Grid Infra on the other two nodes I saw that it took care of those files on the other machines as well. So apparently the order of precedence when starting the ASM instances is (a way to inspect that profile is shown after this list):
- The property file
- SPFILE if present
- INIT.ORA if present
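If you want to see which profile the Clusterware is actually working with, the GPnP tool in the Grid home can dump it. A quick check, using the same Grid home path as on this cluster (gpnptool is a standard 11.2 Grid Infrastructure utility):
export PATH=/opt/crs/product/112_ee_64/crs/bin:$PATH
gpnptool get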
I performed my activities on the first node in the cluster:
oracle@SRVR1R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR1R/profiles/peer [+ASM1]# ls -al
total 20
drwxr-x--- 2 oracle dba 4096 Sep 25 20:03 .
drwxr-x--- 3 oracle dba 4096 Mar 16 2012 ..
-rw-r--r-- 1 oracle dba 1946 Sep 25 19:41 profile.old
-rw-r--r-- 1 oracle dba 2004 Mar 16 2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2007 Sep 25 19:41 profile.xml
When I had finished and the first node was happy, I started the Clusterware on the other two nodes. And to my positive surprise I saw this:
oracle@SRVR2R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR2R/profiles/peer []# ls -ltr
total 16
-rw-r--r-- 1 oracle dba 2004 Mar 16 2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2004 Mar 16 2012 profile.old
-rw-r--r-- 1 oracle dba 2004 Sep 25 20:03 pending.old
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:03 profile.xml
oracle@SRVR3R:/opt/oracle [+ASM3]# cd /opt/crs/product/112_ee_64/crs/gpnp/SRVR3R/profiles/peer
oracle@SRVR3R:/opt/crs/product/112_ee_64/crs/gpnp/SRVR3R/profiles/peer [+ASM3]# ls -ltr
total 20
-rw-r--r-- 1 oracle dba 2004 Mar 16 2012 profile_orig.xml
-rw-r--r-- 1 oracle dba 2004 Mar 16 2012 profile.old
-rw-r--r-- 1 oracle dba 2004 Sep 25 20:08 pending.old
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:08 profile.xml
-rw-r--r-- 1 oracle dba 2007 Sep 25 20:08 pending.xml
Investigating the environments
Since the memory issues were on the SRVR1R node (+ASM1), I first checked the settings on the other two nodes. The interesting part is that a lot of parameters, such as cluster_database = true, are already at their proper values by default; I am sure that is because of the profile file mentioned in the previous chapter. I used the asm_diskgroups values as preparation for my own new INIT.ORA file, in which I would also increase the memory settings. The sketch below shows how such settings can be captured.
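Capturing the current settings on the surviving nodes can be done like this (a minimal sketch; the dump file name is just an example):
sqlplus / as sysasm
SQL> show parameter asm_diskgroups
SQL> show parameter memory
SQL> create pfile='/tmp/asm_params_current.ora' from memory;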
Plan of approach
I learned that my only chance to get things right was to stop the FULL Clusterware stack on ALL nodes in the cluster. This cannot be done on a per-node basis (trust me, I have investigated that option). After shutting everything down you then have to work with the Clusterware on one node in a kind of restricted mode.
Once that situation is in place (everything down, working from one node without CRS active), I would present a prepared INIT.ORA and create an SPFILE in a disk group in ASM.
The steps below follow that scenario, describe it and show the aftercare I needed along the way.
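In command form the plan boils down to this sequence (a condensed sketch of the steps worked out below):
# On ALL nodes, as root:
crsctl stop crs -f
# On ONE node only, as root:
crsctl start crs -excl -nocrs
# On that same node, against the ASM instance (sqlplus / as sysasm):
#   startup pfile='/opt/oracle/initNew.ora';
#   create spfile='+DATA1' from pfile='/opt/oracle/initNew.ora';
#   shutdown immediate;
# Still on that node, as root: leave exclusive mode and restart normally:
crsctl stop crs
crsctl start crs
# Finally start the Clusterware on the remaining nodes, as root:
crsctl start crs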
Stopping the grid infra:
First I stopped the databases using the corresponding srvctl commands. NOTE: this is a mixed environment of 11.1 and 11.2 databases and listeners, all under control of the 11.2.0.3.0 Grid Infrastructure, so you NEED to set the appropriate environment before going to work.
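A minimal sketch of such a stop for one (hypothetical) 11.2 database; the database name and ORACLE_HOME path are examples, repeat per database with the home that matches its version:
export ORACLE_HOME=/opt/oracle/product/112_ee_64/db    # example path, adjust to your environment
export PATH=$ORACLE_HOME/bin:$PATH
srvctl stop database -d MYDB112
srvctl status database -d MYDB112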
As ROOT:
To be done on ALL nodes!
export PATH=/opt/crs/product/112_ee_64/crs/bin:$PATH
which crsctl
crsctl stop crs -f
Stopping the Grid Infra on SRVR1R
The first attempt showed the errors below. That was to be expected, since SRVR1R (+ASM1) was having the memory issues in the first place.
SRVR1R:root:/opt/crs/product/112_ee_64/crs/bin # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.crsd’ on ‘SRVR1R’
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.oc4j’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.cvu’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.LISTENER_SCAN3.lsnr’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.DATA03.dg’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.DATA05.dg’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.DATA1.dg’ on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.FRA1.dg’ on ‘SRVR1R’
.
CRS-2677: Stop of ‘ora.scan3.vip’ on ‘SRVR1R’ succeeded
CRS-2672: Attempting to start ‘ora.scan3.vip’ on ‘SRVR2R’
CRS-5017: The resource action “ora.UNICAT_FRA01.dg stop” encountered the following error:
ORA-04031: unable to allocate 4000 bytes of shared memory (“shared pool”,”ALTER DISKGROUP UNICAT_FRA01…”,”sga heap(1,0)”,”kglsim heap”)
. For details refer to “(:CLSN00108:)” in “/opt/crs/product/112_ee_64/crs/log/SRVR1R/agent/crsd/oraagent_oracle/oraagent_oracle.log”.
CRS-5017: The resource action “ora.DATA1.dg stop” encountered the following error:
ORA-04031: unable to allocate 3896 bytes of shared memory (“shared pool”,”ALTER DISKGROUP DATA1 DISMOU…”,”sga heap(1,0)”,”kglsim object batch”)
. For details refer to “(:CLSN00108:)” in “/opt/crs/product/112_ee_64/crs/log/SRVR1R/agent/crsd/oraagent_oracle/oraagent_oracle.log”.
.
CRS-2794: Shutdown of Cluster Ready Services-managed resources on ‘SRVR1R’ has failed
CRS-2675: Stop of ‘ora.crsd’ on ‘SRVR1R’ failed
CRS-2799: Failed to shut down resource ‘ora.crsd’ on ‘SRVR1R’
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on ‘SRVR1R’ has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
Now that did not look good.
Stopping the Grid Infra on SRVR2R
This one went smoothly, the way it is supposed to go.
SRVR2R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on ‘SRVR2R’
CRS-2673: Attempting to stop ‘ora.crsd’ on ‘SRVR2R’
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on ‘SRVR2R’
.
.
.
CRS-2673: Attempting to stop ‘ora.ons’ on ‘SRVR2R’
CRS-2677: Stop of ‘ora.ons’ on ‘SRVR2R’ succeeded
CRS-2673: Attempting to stop ‘ora.net1.network’ on ‘SRVR2R’
CRS-2677: Stop of ‘ora.net1.network’ on ‘SRVR2R’ succeeded
CRS-2792: Shutdown of Cluster Ready Services-managed resources on ‘SRVR2R’ has completed
CRS-2677: Stop of ‘ora.crsd’ on ‘SRVR2R’ succeeded
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on ‘SRVR2R’ has completed
CRS-4133: Oracle High Availability Services has been stopped.
I checked to see if any processes were still hanging around:
ps -ef|grep d.bin
And that was not the case; nothing was showing, so that was great.
Stopping the Grid Infra on SRVR3R
SRVR3R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on ‘SRVR3R’
CRS-2673: Attempting to stop ‘ora.crsd’ on ‘SRVR3R’
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on ‘SRVR3R’
.
.
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on ‘SRVR3R’ has completed
CRS-4133: Oracle High Availability Services has been stopped.
ps -ef|grep d.bin
Again I checked this node with the command above. Nothing was showing, so that was great too. Some would say two out of three is not bad, but of course we wanted the full score.
Issues on SRVR1R
As we saw earlier, the Grid Infra shutdown on this node was not happy. An additional check:
SRVR1R:root:/opt/crs/product/112_ee_64/crs/bin # ps -ef|grep d.bin
It showed the following processes still out there. Note that I had also checked the other two nodes, and there they were not present.
root 7868 1 0 May17 ? 04:14:54 /opt/crs/product/112_ee_64/crs/bin/ohasd.bin reboot
oracle 8466 1 0 May17 ? 00:04:07 /opt/crs/product/112_ee_64/crs/bin/mdnsd.bin
oracle 8494 1 0 May17 ? 01:12:02 /opt/crs/product/112_ee_64/crs/bin/gpnpd.bin
oracle 8549 1 0 May17 ? 05:22:23 /opt/crs/product/112_ee_64/crs/bin/gipcd.bin
oracle 9507 1 5 May17 ? 7-10:59:30 /opt/crs/product/112_ee_64/crs/bin/ocssd.bin
root 10113 1 0 May17 ? 01:12:38 /opt/crs/product/112_ee_64/crs/bin/octssd.bin reboot
root 10146 1 7 May17 ? 10-01:10:41 /opt/crs/product/112_ee_64/crs/bin/crsd.bin reboot
oracle 10159 1 0 May17 ? 00:43:47 /opt/crs/product/112_ee_64/crs/bin/evmd.bin
root 13975 1 0 Sep01 ? 04:47:23 /opt/crs/product/112_ee_64/crs/bin/osysmond.bin
root 28633 13319 0 18:45 pts/3 00:00:00 grep d.bin
The documentation showed I would have to kill them with kill -15 (SIGTERM).
## So I issued:
kill -15 7868 8466 8494 8549 9507 10113 10146 10159 13975
Oh well, that did not work out very well. The system was hanging, and after some time I noticed that the node had panicked and rebooted itself. That was not great at all. I showed patience and waited for the node to come back up completely. I asked a Linux colleague to take a look, and in the end he confirmed that the system was struggling but would come back online. So I waited it out.
Second attempt to stop the Grid Infra on SRVR1R
OK, to let the cat out of the bag: this time it looked MUCH better!
SRVR1R:root:/root # crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on ‘SRVR1R’
CRS-2673: Attempting to stop ‘ora.crsd’ on ‘SRVR1R’
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on ‘SRVR1R’
.
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on ‘SRVR1R’ has completed
CRS-4133: Oracle High Availability Services has been stopped.
Ah, so this time there was not only a happy node but also a happy DBA. Of course I did a final check:
ps -ef|grep d.bin
And all looked great now.
Starting the Grid Infra on one node in exclusive (restricted) mode:
As the root user I had to start the Clusterware (Grid Infra) in exclusive mode without starting CRS.
SRVR1R:root:/root # crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start ‘ora.mdnsd’ on ‘SRVR1R’
CRS-2676: Start of ‘ora.mdnsd’ on ‘SRVR1R’ succeeded
CRS-2672: Attempting to start ‘ora.gpnpd’ on ‘SRVR1R’
CRS-2676: Start of ‘ora.gpnpd’ on ‘SRVR1R’ succeeded
CRS-2672: Attempting to start ‘ora.cssdmonitor’ on ‘SRVR1R’
CRS-2672: Attempting to start ‘ora.gipcd’ on ‘SRVR1R’
CRS-2676: Start of ‘ora.cssdmonitor’ on ‘SRVR1R’ succeeded
CRS-2676: Start of ‘ora.gipcd’ on ‘SRVR1R’ succeeded
CRS-2672: Attempting to start ‘ora.cssd’ on ‘SRVR1R’
CRS-2672: Attempting to start ‘ora.diskmon’ on ‘SRVR1R’
CRS-2676: Start of ‘ora.diskmon’ on ‘SRVR1R’ succeeded
CRS-2676: Start of ‘ora.cssd’ on ‘SRVR1R’ succeeded
Starting the ASM instance manually
I had prepared an INIT.ORA file with new, higher settings and had included the automatic memory management recommended by Oracle. With that specific INIT.ORA I started the ASM instance on the first node manually.
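For illustration, a minimal sketch of what such an init.ora can contain, based on the parameter values shown in the checks further down; the diskgroup list is just an example and must match your own environment:
*.instance_type='asm'
*.asm_diskgroups='DATA1','FRA1','DATA03','DATA05'    # example list, use your own diskgroups
*.memory_max_target=4G
*.memory_target=2G
*.large_pool_size=256M
*.processes=999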
SQL> startup pfile='/opt/oracle/initNew.ora';
ASM instance started
Total System Global Area 4275781632 bytes
Fixed Size 2235208 bytes
Variable Size 4239991992 bytes
ASM Cache 33554432 bytes
ASM diskgroups mounted
## I checked that all the diskgroups I wanted mounted for all ASM instances were indeed there, and I was happy to see them.
SQL> select * from v$asm_diskgroup;
SQL> select name,state from v$asm_diskgroup order by 1;
Note: there was no dedicated disk group available for the voting disk and for my soon-to-be-born SPFILE, so that is a bummer.
So I chose the +DATA1 disk group and created the SPFILE:
create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';
And then I saw the error message below:
SQL> create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';
create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file +DATA1
ORA-15221: ASM operation requires compatible.asm of 11.2.0.0.0 or higher
So I had to check another thing:
col COMPATIBILITY form a10
col DATABASE_COMPATIBILITY form a10
col NAME form a20
select group_number, name,compatibility, database_compatibility from v$asm_diskgroup order by 1;
GROUP_NUMBER NAME COMPATIBIL DATABASE_C
.
.
6 DATA1 10.1.0.0.0 10.1.0.0.0
.
.
17 rows selected.
So I had to change compatible.asm to proceed with the disk group I had selected (DATA1).
alter diskgroup data1 set attribute 'compatible.asm'='11.2';
Diskgroup altered.
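To confirm the attribute change took effect before retrying, the same view can be queried again:
SQL> select name, compatibility, database_compatibility from v$asm_diskgroup where name = 'DATA1';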
Then I created the SPFILE from pfile:
SQL> create SPFILE='+DATA1' from pfile='/opt/oracle/initNew.ora';
File created.
SQL> shutdown immediate;
As a non-believer I of course wanted to make sure about the new automatic memory management settings, because maintenance windows like these should be very rare on a RAC cluster. So I restarted the instance manually one more time, because I wanted to be happy with the final checks.
The parameters in the output below are the ones I had increased or set.
ASM instance started
Total System Global Area 4275781632 bytes
Fixed Size 2235208 bytes
Variable Size 4239991992 bytes
ASM Cache 33554432 bytes
ASM diskgroups mounted
SQL> show parameter SPFILE
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
SPFILE string +DATA1/srvr1r/asmparameterfile/registry.253.794950891
SQL> show parameter asm
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups string <all disk groups to be mounted>
asm_diskstring string
asm_power_limit integer 1
asm_preferred_read_failure_groups string
SQL> show parameter memory
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
memory_max_target big integer 4G
memory_target big integer 2G
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
sga_max_size big integer 4G
sga_target big integer 0
SQL> show parameter large
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
large_pool_size big integer 256M
use_large_pages string TRUE
SQL> show parameter proce
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
processes integer 999
This was everything I had planned for, so it was time to shut down the instance.
SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown
SQL> exit
Time for the normal restart of the first node in the Grid Infra
Since the node had been started in exclusive mode, I first had to stop the Clusterware again.
As the user root:
crsctl stop crs
I was very curious and crossed my fingers about the outcome.
As part of the shop's standards, where each node has a mount point per database, I established the link in /opt/oracle:
ln -s +ASM +ASM1
And started the Grid Infrastructure:
crsctl start crs
Once the first node was up and ASM was on its feet, I did the same for the other two nodes.
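Once CRS was back on a node, a quick overview shows what is up and what is not (standard 11.2 crsctl syntax):
crsctl status resource -t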
Last things
I had to restart the 11.2 databases manually. I noticed that the 11.1 databases had restarted themselves; my stop procedure was causing that. If this scenario is performed on a production box, this part will need improvement.
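A minimal sketch of such a manual restart for one (hypothetical) 11.2 database, run from the matching 11.2 home just like the stop earlier:
srvctl start database -d MYDB112
srvctl status database -d MYDB112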
The ASM listeners did not come up as a resource:
- A missing log directory, which I added
- Starting the listener from the CRS environment failed
ora.SRVR1R.ASM1.asm ONLINE ONLINE on SRVR1R
ora.SRVR1R.listener_+ASM1.lsnr ONLINE OFFLINE
ora.SRVR2R.ASM2.asm ONLINE ONLINE on SRVR2R
ora.SRVR2R.listener_+ASM2.lsnr ONLINE ONLINE on SRVR2R
ora.SRVR3R.ASM3.asm ONLINE ONLINE on SRVR3R
ora.SRVR3R.listener_+ASM3.lsnr ONLINE OFFLINE
Manually I could start the listener after adding the log file directory (also part of the standards). So I had to understand what was going on. I think the 11gR2 Grid Infra did not accept the + in the name, which was perfectly valid in 11.1. This is an action point for the future.
The Grid Infra did start the listener for ASM2, but when I try to check the resource:
oracle@SRVR1R:/opt/oracle [+ASM1]# crsctl status resource ora.SRVR2R.listener_+ASM2.lsnr -p
CRS-4037: The name ‘ora.SRVR2R.listener_+ASM2.lsnr’ contains the following invalid characters: ‘+’
CRS-4000: Command Status failed, or completed with errors.
I have started the listeners manually, and they work. If I used srvctl start listener with listener_+ASM1.lsnr I got errors similar to the ones above.
In the end I had to start them with the Oracle home pointed to 11gR1, and that worked.
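Roughly what that manual start looked like; the 11gR1 home path and the listener name are assumptions based on the resource names above:
export ORACLE_HOME=/opt/oracle/product/111_ee_64/asm   # assumed 11gR1 home, adjust to your environment
export PATH=$ORACLE_HOME/bin:$PATH
lsnrctl start LISTENER_+ASM1                           # listener name assumed from the resource name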
Happy END