Upgrader's Nightmare: Grid Infrastructure (11.2.0.3.0) Install Locked Out All Connections

Introduction

This week I had the challenge of finally installing the Grid Infrastructure software 11.2.0.3.0 on a 4-node production Real Application Clusters (RAC) environment on Linux. I had followed all the guidelines and came well prepared, because more than a month ago I had done the same on the preproduction cluster.

Two weeks ago I had installed the required patch as preparation for the Grid Infrastructure, as I documented in an earlier post (but which I will recap for practical purposes here):

  1. 11.1.0.7.7 Patch Set Update for CRS (PSU April 2011 for CRS): patch 11724953
     Appropriate OPatch version (11.1.0.8.2 or higher): patch 6880880
     Oracle Database (includes Oracle Database and Oracle RAC): p10404530_112030_platform_1of7.zip, p10404530_112030_platform_2of7.zip. These are n/a for this migration because the databases will not be upgraded; this is a CRS upgrade only.
  2. Oracle Grid Infrastructure (includes Oracle ASM, Oracle Clusterware, and Oracle Restart): p10404530_112030_platform_3of7.zip

 Summary of approach

On the preproduction environment the following scenario was in place as a so-called rolling upgrade. That means that every SINGLE node was patched first with the required patch for the 11.1.0.7.0 CRS environment (11724953), so at all times two nodes were up and running. After I finished patching the first node with success, the second one was patched, etc. So after this first step all three nodes were prepared for the Grid Infrastructure upgrade (10404530). That patch itself is a rolling one as well, which means that while patching one node all the other ones remain available (so two, or in general n-1, instances are up and running).

So the following steps were performed on a node-by-node basis:

  1. Shut down the old clusterware.
  2. Install 11724953 on the existing clusterware.
  3. Install the 11.2.0.3.0 Grid Infrastructure (item 2 in the patch list above).
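The three steps above can be sketched as a small shell script. This is an illustration only: the crsctl/opatch invocations and the staging path are assumptions (the exact commands depend on your homes and the patch README), and DRY_RUN makes it print what would be run instead of running it:

```shell
# Sketch of the rolling, node-by-node patch sequence described above.
# CRS_HOME and PATCH_DIR are hypothetical placeholders, not my real paths.
CRS_HOME=/opt/crs/product/111_ee_64/crs
PATCH_DIR=/opt/oracle/stage/11724953
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "WOULD RUN: $*"      # print only; set DRY_RUN=0 to execute
  else
    "$@"
  fi
}

# On each node in turn (the other nodes stay up, keeping the cluster available):
run "$CRS_HOME/bin/crsctl" stop crs                 # 1. shut down the old clusterware
run "$CRS_HOME/OPatch/opatch" apply "$PATCH_DIR"    # 2. apply 11724953 to the existing clusterware
run "$CRS_HOME/bin/crsctl" start crs                # 3. restart, then move on to the next node
```

Running this on one node at a time is what keeps the upgrade rolling: at no point is more than one instance down.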
First challenge: srvctl no longer working from the RDBMS home

The first thing that had to be fixed in the preproduction environment before going to production was the fact that the srvctl tool was not working from the old (11.1.0.7.0) RDBMS home. In the past I had always used the srvctl from the clusterware home to do my maintenance on the databases and listeners, so I had never looked at how and why srvctl was not working in the RDBMS part of the software. Well, I was in for a surprise, because after the upgrade to Grid Infrastructure 11.2.0.3.0 I was no longer able to control the 11.1 databases from the grid infrastructure home:

oracle@machine:/opt/oracle [+ASM2]# srvctl status database -d ADB

PRCD-1027 : Failed to retrieve database ADB

PRCD-1027 : Failed to retrieve database ADB

PRKP-1088 : Failed to retrieve configuration of cluster database ADB

PRKR-1078 : Database ADB of version 11.0.0.0.0 cannot be administered using current version of srvctl. Instead run srvctl from /opt/oracle/product/111_ee_64/db

… but ouch, that one was not working either …

In the end I found the following workaround for this:

Update the srvctl script IN THE 11.1.0.7 RDBMS HOME so that OHOME, JREDIR and JLIBDIR point to the 11.1.0.7 DB home:

CHOME=/opt/oracle/product/11.1.0/crs
OHOME=/opt/oracle/product/11.1.0/racdb
JREDIR=/opt/oracle/product/11.1.0/racdb/jdk/jre
JLIBDIR=/opt/oracle/product/11.1.0/racdb/jlib

cd  $ORACLE_HOME/bin

cp srvctl srvctl.20120515

vi srvctl

CHOME=/opt/crs/product/112_ee_64/crs

OHOME=/opt/oracle/product/111_ee_64/db

if [ "X$CHOME" != "X$OHOME" ]

then

case $ORACLE_HOME in

"") echo "****ORACLE_HOME environment variable not set!"

echo "    ORACLE_HOME should be set to the main"

echo "    directory that contains Oracle products."

echo "    Set and export ORACLE_HOME, then re-run."

exit 1;;

esac

else

ORACLE_HOME=/opt/oracle/product/111_ee_64/db

export ORACLE_HOME

fi

# External Directory Variables set by the Installer

JREDIR=/opt/oracle/product/111_ee_64/db/jdk/jre

JLIBDIR=/opt/oracle/product/111_ee_64/db/jlib

Then I saved the damn thing.

chmod u+x srvctl

and I tested it. And miracle of miracles, it worked. I tested it with an existing database from the 11.1 RDBMS home and it worked as expected.

Proof:

oracle@machine:/opt/oracle/product/111_ee_64/db/bin [ADB]# srvctl status database -d ADB

Instance ADB1 is running on node machine1

Instance ADB2 is running on node machine2

Instance ADB3 is running on node machine3

I also tested stopping and starting one instance, and that went smoothly as well.

I added this to the required steps for migration night and to the documentation.
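For migration night the manual vi session above can also be scripted. This is a sketch, not the script I actually used: it assumes the srvctl file still has OHOME=, JREDIR= and JLIBDIR= assignments at the start of a line, and the helper name patch_srvctl is hypothetical:

```shell
# patch_srvctl FILE DBHOME: rewrite the OHOME/JREDIR/JLIBDIR lines in FILE
# so they point at DBHOME, keeping a .bak backup first.
# Hypothetical helper; sed -i as used here is the GNU (Linux) form.
patch_srvctl() {
  file=$1; dbhome=$2
  cp "$file" "$file.bak"                       # dated/backup copy, as in the post
  sed -i \
    -e "s|^OHOME=.*|OHOME=$dbhome|" \
    -e "s|^JREDIR=.*|JREDIR=$dbhome/jdk/jre|" \
    -e "s|^JLIBDIR=.*|JLIBDIR=$dbhome/jlib|" \
    "$file"
  chmod u+x "$file"                            # keep it executable
}

# Usage, with the paths from the example above:
# patch_srvctl /opt/oracle/product/111_ee_64/db/bin/srvctl /opt/oracle/product/111_ee_64/db
```

Note that this only covers the variable assignments; the CHOME/ORACLE_HOME if-block shown earlier stays as it is.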

Installation of the Grid infrastructure

Installation using the runInstaller itself went pretty smoothly (with two hiccups). All the steps had been tested beforehand by running the runcluvfy tool:

runcluvfy.sh stage -pre crsinst -upgrade -n machine1,machine2,machine3,machine4 -src_crshome /opt/crs/product/111_ee_64/crs -dest_crshome /opt/crs/product/112_ee_64/crs -dest_version 11.2.0.3.0 -fixup -fixupdir /opt/oracle/stage/grid > mymachine

No issues showed up with regard to connectivity etc. The only thing that should have been noticed was a warning:

  • The first system was lacking the package cvuqdisk-1.0.9-1. This is a prerequisite check to test whether the package cvuqdisk-1.0.9-1 is available on the system. Cluvfy had warned me about that, but I misinterpreted it, so during runInstaller it popped up again and I had to contact the Unix on-call to get it installed.

After installing via the GUI, rootupgrade.sh ran and somehow messed up my /opt/oracle, because after that I could no longer log in passwordless. And in a RAC that is a show stopper. So:

  • Again I had issues with SSH after rootupgrade.sh had finished: OpenSSH was broken, asking me for passwords again and again and again. From the preprod migration I had learned to set, as root, the permissions of /opt/oracle back to 755. I did that, ran a test, and it worked again, so I could continue.

After all this had been taken care of, it was time to do sanity checks. The cluster was alive and well, and connectivity checks to the databases on those boxes had been performed as well. So I called the customer to say that I was happy with the result and that they could proceed with their sanity checks.

And that was when lightning struck me ……

Red alert: the production system was not accepting connections from outside the cluster

As I wrote, I had tested connectivity to the databases while being on those boxes, and it all worked well. If there is one lesson learned in this, it is to also test connectivity from outside the cluster next time.

Because the funky part was that connections on the cluster worked back and forth, but connections from outside the cluster could not be established. If you tried a tnsping ADB (and this ADB entry used the VIP address) you would simply be looking and waiting, and in the end the connection would time out. This first made me suspect that something was wrong with the firewall, so I tested with:

telnet <machine>  <port of listener>

and that also timed out, so it seemed to make sense that a firewall or network issue was at play … Well, after investigation it turned out that the VIP addresses were OK and visible, and that there was no firewall issue … so what could it be?
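That lesson can be turned into a small check to run from a client outside the cluster. A minimal sketch using bash's /dev/tcp with a 3-second timeout; the check_listener name and the host/port in the usage line are placeholders, not values from my environment:

```shell
# check_listener HOST PORT: report whether a TCP connection to the listener
# on the VIP can be established within 3 seconds. Run this from a client
# OUTSIDE the cluster; with bug 13440962 this hangs while on-cluster works.
check_listener() {
  host=$1; port=$2
  # /dev/tcp is a bash feature, so delegate to an inner bash;
  # timeout(1) turns the silent hang described above into a quick failure.
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable (timed out or refused)"
  fi
}

# Example (placeholder VIP host and listener port):
# check_listener adb-vip 1521
```

This is essentially the telnet test above, but scriptable, so it can be added to the post-upgrade sanity checks and run from an application server.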

Bug 13440962 Different subnet failed to connect to vip after restart vip

It turned out that after restarting the cluster resources we had this phenomenon: NONE of the connections from outside the cluster was able to use the listener (running on the VIP address), but inside the cluster everything just worked great. Apparently the issue is present when a VIP is relocated, but also when a node is restarted. Even the grid SCAN, when relocated, is not accessible from clients. It seems the MAC address presented by CRS is not the correct one and the router loops while routing the call. The issue is present from all clients (but not from the DB server nodes).

In the end both a workaround and a solution were provided. The workaround was tested and implemented during the maintenance window, and at least it brought the production systems back to life (accepting connections from application servers etc.). And it was quick:

I had my Unix administrator run this:

On node 1: /sbin/arping -U -c 3 -I bond0 195.233.666.72

On node 2: /sbin/arping -U -c 3 -I bond0 195.233.666.75

On node 3: /sbin/arping -U -c 3 -I bond0 195.233.666.77

On node 4: /sbin/arping -U -c 3 -I bond0 195.233.666.80
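The four arping calls above can be written as one loop over the VIPs. A sketch only: the interface and the (anonymized) addresses are the examples from above, each command has to be run on the node currently hosting that VIP, and arping needs root, so DRY_RUN just prints the commands:

```shell
# Sketch of the gratuitous-ARP workaround above as a loop over the VIPs.
# IFACE and the addresses are the anonymized examples from the post.
IFACE=bond0
VIPS="195.233.666.72 195.233.666.75 195.233.666.77 195.233.666.80"
DRY_RUN=1

for vip in $VIPS; do
  # -U sends an unsolicited (gratuitous) ARP so routers/switches
  # relearn the correct MAC address for the VIP; -c 3 sends it 3 times.
  cmd="/sbin/arping -U -c 3 -I $IFACE $vip"
  if [ "$DRY_RUN" = 1 ]; then
    echo "WOULD RUN (on the node hosting $vip): $cmd"
  else
    $cmd
  fi
done
```

The gratuitous ARP refreshes the ARP caches upstream, which is why connections from outside the cluster come back immediately while on-cluster traffic was never affected.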

And … I requested an emergency change window to install the bundle patch for base bug 13440962.

On the second night I installed the patch on all 4 nodes, and that went flawlessly.

Happy end..

Mathijs
