Altering the Heartbeat in an Oracle RAC environment.

Introduction

When asked what is part of the job description of a DBA, I like to take a moment, smile and reply: to serve and to protect the cluster, the databases and of course the data. In the matter at hand that means this assignment is all about increasing high availability and improving the functionality of the cluster interconnect between the nodes, and all that for one price: doubling your cluster interconnect IPs on the various layers.

And to be honest this assignment is all the more interesting because, after all, how often is there a need or a challenge to alter IP addresses once a cluster has been set up? The only valid reasons I could think of would be an action like lifting and shifting the servers to other rooms, or, as in this case, simply wanting to improve availability.

Some days before the action was scheduled, I looked into the matter of the required dedicated IPs together with colleagues from the team. In /etc/hosts, which is the source for this, we looked for IPs that had hb in the alias, which is the naming convention according to our OS-level standards. This information was used to set up the IPs on the OS level. Once that task was completed, a first and very important test had to be: can you ping these new and dedicated IPs from every node that is part of the cluster? And since this is a happy-flow scenario, that is of course what happened. During the maintenance window itself the detailed steps below were performed to make these addresses known and usable on the cluster layer in the Grid Infrastructure.

Detailed Plan

In this scenario we are using a 4 node RAC cluster on Red Hat Linux with 11.2.0.4 Grid Infrastructure. Looking at the OS level, this is what we found already present in the hosts file. Even better, these IPs were all available and not in use.

grep -i hb /etc/hosts
10.124.10.75 mysrvrahr-hb1.dc.nl mysrvrahr-hb1
10.124.11.75 mysrvrahr-hb2.dc.nl mysrvrahr-hb2
10.124.10.76 mysrvrbhr-hb1.dc.nl mysrvrbhr-hb1
10.124.11.76 mysrvrbhr-hb2.dc.nl mysrvrbhr-hb2
10.124.10.77 mysrvrchr-hb1.dc.nl mysrvrchr-hb1
10.124.11.77 mysrvrchr-hb2.dc.nl mysrvrchr-hb2
10.124.10.78 mysrvrdhr-hb1.dc.nl mysrvrdhr-hb1
10.124.11.78 mysrvrdhr-hb2.dc.nl mysrvrdhr-hb2

The steps below were followed, based on a great MOS note, in order to complete the tasks needed to make the Grid Infrastructure (cluster) aware of the new IPs. The scenario runs through a number of steps in order to be well prepared, but also, of course, to be on the safe side before and during the changes on the cluster layer. It is a matter of paying respect and being brave but cautious.

Preparation steps:
As of 11.2 Grid Infrastructure, the private network configuration is not only stored in the OCR but also in the gpnp profile. The documentation is very clear on this: if the private network is not available or its definition is incorrect, the CRSD process will not start and any subsequent changes to the OCR will be impossible.

  • Therefore care needs to be taken when making modifications to the configuration of the private network.
  • It is important to perform the changes in the correct order.

Note that manual modification of the gpnp profile is not supported, so it is best to stick to the proper actions and not go into hacking mode!
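
Before taking that backup it can be reassuring to have a read-only look at what is currently stored in the profile. A small sketch, assuming the same grid home that is used throughout this post; gpnptool just dumps the profile to the screen and changes nothing:

## as the grid user (the oracle user in my case), dump the current gpnp profile (read-only)
/app/oracle/product/11.2.0/grid/bin/gpnptool get
## or simply look at the network entries in the xml file of the first node
grep -i network /app/oracle/product/11.2.0/grid/gpnp/mysrvrahr/profiles/peer/profile.xml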

So let’s take a backup of profile.xml on all cluster nodes before proceeding:

As the grid user (in my case the oracle user), move to the correct directory (cd $GRID_HOME/gpnp/<hostname>/profiles/peer/).

cd /app/oracle/product/11.2.0/grid/gpnp/mysrvrahr/profiles/peer 
cd /app/oracle/product/11.2.0/grid/gpnp/mysrvrbhr/profiles/peer 
cd /app/oracle/product/11.2.0/grid/gpnp/mysrvrchr/profiles/peer 
cd /app/oracle/product/11.2.0/grid/gpnp/mysrvrdhr/profiles/peer 

During startup of the cluster Oracle relies on this very important xml file
for specific data like the spfile, the diskgroups and of course the IPs.

cp -p profile.xml profile.xml.bk
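
Since the backup has to exist on every node, a small loop can save some typing. This is only a sketch and assumes passwordless ssh is set up for the oracle user between the nodes; otherwise simply run the cp on each node by hand:

## sketch: back up profile.xml on all four nodes in one go (assumes passwordless ssh)
for srvr in mysrvrahr mysrvrbhr mysrvrchr mysrvrdhr; do
  ssh ${srvr} "cd /app/oracle/product/11.2.0/grid/gpnp/${srvr}/profiles/peer && cp -p profile.xml profile.xml.bk"
done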

Ensure Oracle Clusterware is running on ALL nodes in the cluster and save the current status of the resources (better safe than sorry, and it makes sure you know about the health of the cluster and its resources as they have been defined). So do check the cluster and save the current status of the resources in a file as a pre-change image.

/app/oracle/product/11.2.0/grid/bin/crsctl check cluster -all 
/app/oracle/product/11.2.0/grid/bin/crsctl status resource -t>/tmp/beforeNewIps.lst

As the grid user (in my case the oracle user): get the existing information.

Showing which interfaces are defined in the cluster:
## below you will see that the current (single) cluster interconnect is set up on 64.18.112.208

/app/oracle/product/11.2.0/grid/bin/oifcfg getif
bond1 64.18.112.208  global  cluster_interconnect
bond0  195.233.190.64  global  public

The iflist command shows the network information known at the OS level, listing all (or specific) IPs. The interfaces and their subnet addresses can be identified, for the eth interfaces specifically, with:

/app/oracle/product/11.2.0/grid/bin/oifcfg iflist|grep -i eth|sort 
64.18.32.0 eth0
10.124.10.0 eth2
10.124.11.0 eth6

or

## check  interfaces / subnets in general:
 /app/oracle/product/11.2.0/grid/bin/oifcfg iflist|sort

Since we now have a good picture of the status of the cluster, and since we know which IPs are in use (oifcfg getif) and which IPs are present on the system (oifcfg iflist), everything is set to add the new cluster_interconnect information. As you can see, both the eth2 and the eth6 addresses are defined, and with the -global parameter the information is shared across the complete cluster on all nodes.

/app/oracle/product/11.2.0/grid/bin/oifcfg setif -global eth2/10.124.10.0:cluster_interconnect 
/app/oracle/product/11.2.0/grid/bin/oifcfg setif -global eth6/10.124.11.0:cluster_interconnect

Of course there cannot be a change without verifying it, so I checked on all nodes with the command below.

/app/oracle/product/11.2.0/grid/bin/oifcfg getif  
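
To save logging in to every node by hand, that verification can be wrapped in a loop as well (again a sketch, assuming passwordless ssh for the oracle user):

## sketch: run the verification on all nodes from one session
for srvr in mysrvrahr mysrvrbhr mysrvrchr mysrvrdhr; do
  echo "### ${srvr}"
  ssh ${srvr} /app/oracle/product/11.2.0/grid/bin/oifcfg getif
done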

Since we are using 11gR2 Grid Infrastructure, the steps below are to be followed now: shut down Oracle Clusterware on all nodes and disable Oracle Clusterware, as root.

Action is to be performed as the root user:
sudo su -
/app/oracle/product/11.2.0/grid/bin/crsctl stop crs
/app/oracle/product/11.2.0/grid/bin/crsctl disable crs

In this specific scenario my Linux brothers in arms had already made the required network configuration change at OS level, and that great job showed up in the oifcfg iflist output. They made sure that the new interfaces were available on all nodes after their change.

(Check by pinging the interfaces on all nodes with the script kindly provided by a Linux team member.)
for x in 10 11;do for xx in 75 76 77 78;do ping -c2 10.124.${x}.${xx}|egrep 'icmp_seq|transmitted';done;echo;done 
for x in a b c d; do for xx in 1 2;do ping -c2 mysrvr${x}hr-hb$xx|egrep 'icmp_seq|transmitted';done;echo;done 

Well, all went well and has been checked, so it is time to restart Oracle Clusterware and, once that is completed, enable Oracle Clusterware again.

On all nodes in the cluster:

## as root user: 
sudo su -  
/app/oracle/product/11.2.0/grid/bin/crsctl start crs

Seeing is believing in this matter, so after some time check:

/app/oracle/product/11.2.0/grid/bin/crsctl check cluster -all 

In the step below we check the status of the resources in the cluster again and write that information to a file. This "post" file is then used to compare the status of the cluster resources before and after.

/app/oracle/product/11.2.0/grid/bin/crsctl status resource -t>/tmp/afterNewIps.lst 
sdiff /tmp/afterNewIps.lst /tmp/beforeNewIps.lst

This compare showed me that a 10g RAC database resource and its services needed my attention, so via the cluster commands I checked and observed their status after starting them with the srvctl command as the oracle user. Once completed I ran another check as described, and happily all resources in the post-status file were in the same status (online online) as in the pre-status file.
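
For the record, those checks and starts were plain srvctl work. A sketch of the kind of commands used, where MYDB10 and MYSRV are made-up placeholders for the real database and service names:

## as the oracle user; MYDB10 and MYSRV are placeholders, not the real names
srvctl status database -d MYDB10
srvctl start database -d MYDB10
srvctl status service -d MYDB10
srvctl start service -d MYDB10 -s MYSRV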

as root user: 
sudo su -  
/app/oracle/product/11.2.0/grid/bin/crsctl enable crs

Time to wrap up this scenario. As part of housekeeping remove the old interface:

/app/oracle/product/11.2.0/grid/bin/oifcfg delif -global bond1/64.18.112.208:cluster_interconnect

Verified the environment one more time.

/app/oracle/product/11.2.0/grid/bin/oifcfg getif

Clusterware had already proved itself of course, but I checked the databases and listeners as a last sanity check, and that completed the tasks for this time.
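
For completeness, that last sanity check was nothing more than something along these lines (MYDB10 again being a placeholder):

## as the oracle user, quick check of databases and listeners
srvctl status database -d MYDB10
srvctl status listener
ps -ef | grep tnslsnr | grep -v grep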

Time to inform Apps team that they can perform their sanity checks and start the applications again.

Happy reading and till next Time.

Mathijs

Upgrade to 12c GridInfra: lessons learned

Introduction:

Not sure if it was the word of a wise DBA or just from a fortune cookie (it might even have been from a Pink Panther movie), but it said: always expect the unexpected, and as an add-on, success just loves preparation.

This week one of my tasks was to upgrade a 4 node Oracle RAC cluster from 11.2.0.4 to 12c (12.1.0.2.0) Grid Infrastructure. And even though I came well prepared (see also this detailed other blog: https://mathijsbruggink.com/2017/05/01/upgrading-11g-gridinfra-to-12c-in-linux/), several small surprises occurred, which will be used as lessons learned in upcoming upgrades of the grid infrastructure. I would also like to offer some timelines with regard to how long the upgrade process really took.

Lessons learned:

  • During the preparations I needed to order extra disks for ASM storage for the Grid Infrastructure Management Repository (GIMR). When I started the runInstaller, as a first check on whether all was well prepared, I noticed that the installer software is most likely looking for a diskgroup called +OCR or +VOTING. This could be a trap if you had not extended one of them (but instead a +GRID diskgroup). So when preparing, look for either OCR or VOTING (best both, if present) to add extra disks to, and have some disks to spare (see the quick check after this list).
  • During the start of the maintenance window the Linux colleague mentioned that he would have to stop the Hyperion services. This activity took some 45 minutes of the change window. I will have to find out whether this was a justified reason to stop those services, and I will need to add an extra step to the pre-checks to find out about other services and daemons running on the cluster that might be impacted when doing an upgrade.
  • Purpose of rootupgrade.sh: it runs after the installation part via the runInstaller completes and the upgrade part commences. rootupgrade.sh performs the actual ASM upgrade and configures the OLR (local registry), amongst other things.
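
A quick way to see up front whether the OCR / VOTING diskgroups have enough free space for the GIMR is to ask ASM itself. A sketch, run as the grid owner with the ASM environment set; look at the Free_MB / Usable_file_MB columns for the OCR and VOTING diskgroups:

## list the diskgroups with their total and free space
asmcmd lsdg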

Timelines:

Every change on a test or production environment has to come with a plan with an estimated time for how long the change will take. First and most important, of course, choose the strategy: will a rolling window be used (thus minimizing impact, since at least one node will always be up), for example a kind of batch approach where the first batch holds the first node, the second batch holds nodes 2 and 3 in my 4 node example, and a last batch holds the last node.

Start of Change Window: 20:00 CET (6:00 UTC).
According to the Linux expert, the Hyperion services needed to be stopped before we could continue.

Start of installation: 20:45 CET.
Started the runInstaller on the first node. The software was deployed to the first node and from there to all nodes in the cluster (4 node RAC).

Upgrade part of the existing 11.2.0.4 Grid Infrastructure:
21:30 – 21:58: on the first node (MYSRVR09hr) rootupgrade.sh was started. (I used the manual upgrade, still a bit unwilling to leave it all to the automated option; this means setting up a root session on the first node and running: ./app/grid/product/1102/grid/rootupgrade.sh.)

In the runInstaller it was offered to automate and run rootupgrade.sh in parallel on nodes 2 and 3. To me it felt better, though, to open a terminal session as root on each server and run the script in parallel myself.

22:06 – 22:13 on MYSRVR10hr : ./app/grid/product/1102/grid/rootupgrade.sh
22:06 – 22:24 on MYSRVR11hr : ./app/grid/product/1102/grid/rootupgrade.sh

On the last node MYSRVR12hr:
22:28 – 22:48 ./app/grid/product/1102/grid/rootupgrade.sh

After that the installation continued with the Grid Infrastructure Management Repository (GIMR) database, and once it completed I ran a number of sanity checks in the cluster:

22:50 – 23:55
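
The sanity checks themselves were nothing fancy; something along these lines, run as the grid user with the new 12c grid home in the environment:

## confirm the cluster is healthy and running the new version on all nodes
crsctl check cluster -all
crsctl query crs activeversion
crsctl query crs softwareversion
crsctl status resource -t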

At 23:59:59 I reported: mission completed.

Happy reading and till next time,

Mathijs