This week I finally faced the challenge of installing the Grid Infrastructure software 11.2.0.2.0 on a four-node production Real Application Clusters (RAC) environment on Linux. I had followed all the guidelines and came well prepared, because more than a month ago I had done the same on the preproduction cluster.
Two weeks ago I had installed the required patches in preparation for the Grid Infrastructure upgrade, as documented in an earlier post (recapped here for convenience):
| Patch | Patch number |
| --- | --- |
| 11.1.0.7.7 CRS Patch Set Update (PSU April 2011 for CRS) | 11724953 |
| Appropriate OPatch version (11.1.0.8.2 or higher) | 6880880 |
| Oracle Database (includes Oracle Database and Oracle RAC) | n/a for this migration: the databases will not be upgraded, this is a CRS upgrade only |
| Oracle Grid Infrastructure (includes Oracle ASM, Oracle Clusterware, and Oracle Restart) | 10404530 |
Summary of approach
On the preproduction environment the following scenario was executed as a so-called rolling upgrade. That means that every SINGLE node was patched in turn with the required patch for the 11.1.0.7 CRS environment (11724953), so at all times two nodes were up and running. After I had finished patching the first node successfully, the second one was patched, and so on. After this first step all three nodes were prepared for the Grid Infrastructure upgrade (10404530). That patch set is a rolling one as well, which means that while one node is being patched all the other ones remain available (two or n instances stay up and running).
So the following steps were performed on a per-node basis:
- Shut down the old Clusterware.
- Install patch 11724953 on the existing Clusterware home.
- Then install the 11.2.0.2.0 Grid Infrastructure (step 2).
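Per node, the first two steps boil down to something like the sketch below. It is a dry run that only prints the commands instead of executing them, since they must be run as root one node at a time; the staging path and the exact opatch invocation are assumptions on my part, the authoritative syntax is in the patch README.

```shell
#!/bin/sh
# Dry-run sketch of the per-node rolling patch sequence (prints, does not execute).
CRS_HOME=/opt/crs/product/111_ee_64/crs   # existing 11.1.0.7 Clusterware home
PATCH_DIR=/opt/oracle/stage/11724953      # unzipped CRS PSU (assumed staging path)

run() { echo "would run: $*"; }

# 1. Stop the old Clusterware stack on this node only; the other nodes stay up.
run "$CRS_HOME/bin/crsctl stop crs"
# 2. Apply the CRS PSU to the existing Clusterware home (see the patch README
#    for the prepatch/postpatch scripts that surround this step).
run "$CRS_HOME/OPatch/opatch napply -local -oh $CRS_HOME -id 11724953"
# 3. Restart the stack, then move on to the next node.
run "$CRS_HOME/bin/crsctl start crs"
```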
The first thing that had to be fixed in the preproduction environment before going to production was that the srvctl tool was not working from the old (11.1.0.7.0) RDBMS home. In the past I had always used the srvctl from the Clusterware home for my maintenance on the databases and listeners, so I had never looked into how and why srvctl was not working in the RDBMS part of the software. Well, I was in for a surprise, because after the upgrade to Grid Infrastructure 11.2.0.2.0 I was no longer able to control the 11.1 databases from the Grid Infrastructure home:
oracle@machine:/opt/oracle [+ASM2]# srvctl status database -d ADB
PRCD-1027 : Failed to retrieve database ADB
PRCD-1027 : Failed to retrieve database ADB
PRKP-1088 : Failed to retrieve configuration of cluster database ADB
PRKR-1078 : Database ADB of version 11.1.0.7.0 cannot be administered using current version of srvctl. Instead run srvctl from /opt/oracle/product/111_ee_64/db
But ouch, the srvctl from that home was not working either …
In the end I found the following workaround:
Update the srvctl script IN THE 11.1.0.7 RDBMS HOME so that OHOME and JREDIR/JLIBDIR point to the 11.1.0.7 database home:
cp srvctl srvctl.20120515

The relevant fragments in the srvctl script look like this (straight quotes restored):

if [ "X$CHOME" != "X$OHOME" ]

case $ORACLE_HOME in
"") echo "****ORACLE_HOME environment variable not set!"
    echo "    ORACLE_HOME should be set to the main"
    echo "    directory that contains Oracle products."
    echo "    Set and export ORACLE_HOME, then re-run."

# External Directory Variables set by the Installer
Then I saved the file and restored the execute bit:
chmod u+x srvctl
Then I tested it, and miracle of miracles, it worked. I tried it with an existing database from the 11.1 RDBMS home and it behaved as expected:
oracle@machine:/opt/oracle/product/111_ee_64/db/bin [ADB]# srvctl status database -d ADB
Instance ADB1 is running on node machine1
Instance ADB2 is running on node machine2
Instance ADB3 is running on node machine3
I also tested stopping and starting one instance, and that went smoothly as well.
I added this workaround to the required steps for migration night and to the documentation.
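The manual edit described above can also be scripted. A minimal sketch, assuming the stock srvctl wrapper assigns these variables with simple `VAR=<path>` lines (the variable names are from the real script, the paths are this environment's); it is demonstrated here on a mock file so the real srvctl is untouched:

```shell
#!/bin/sh
# Point OHOME in the 11.1 RDBMS home's srvctl at the 11.1 DB home.
# JREDIR/JLIBDIR can be rewritten with the same sed pattern.
DB_HOME=/opt/oracle/product/111_ee_64/db

# Demo on a mock srvctl; replace with the real $DB_HOME/bin/srvctl to run for real.
SRVCTL=$(mktemp)
echo 'OHOME=/opt/crs/product/112_ee_64/crs' > "$SRVCTL"

cp "$SRVCTL" "$SRVCTL.20120515"                    # backup, as in the post
sed -i "s|^OHOME=.*|OHOME=${DB_HOME}|" "$SRVCTL"   # rewrite the assignment (GNU sed)
chmod u+x "$SRVCTL"

cat "$SRVCTL"
```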
Installation of the Grid infrastructure
The installation using runInstaller itself went pretty smoothly (with two hiccups). All the steps had been tested beforehand by running the runcluvfy tool:
runcluvfy.sh stage -pre crsinst -upgrade -n machine1,machine2,machine3,machine4 -src_crshome /opt/crs/product/111_ee_64/crs -dest_crshome /opt/crs/product/112_ee_64/crs -dest_version 11.2.0.2.0 -fixup -fixupdir /opt/oracle/stage/grid > mymachine
No issues showed up with regard to connectivity etc. The only thing that should have been noticed was a warning:
- The first system was lacking the package cvuqdisk-1.0.9-1. This is a prerequisite check for whether the package cvuqdisk-1.0.9-1 is available on the system. Cluvfy had warned me about it, but I misinterpreted the warning, so it popped up again during runInstaller and I had to contact the Unix on-call to get the package installed.
After installing via the GUI, rootupgrade.sh ran, and somehow it messed up my /opt/oracle, because after that I could no longer log in passwordless, and in a RAC that is a show stopper. So:
- Again I had issues with ssh after rootupgrade.sh had finished: OpenSSH was broken, asking me for passwords again and again. From the preproduction migration I had learned to check, as root, the permissions of /opt/oracle and set them back to 755. I did that, ran a test, and then it worked again and I could continue.
After all this had been taken care of, it was time for sanity checks. The cluster was alive and well, and connectivity checks to the databases on those boxes had been performed as well. So I called the customer, said I was happy with the result, and told them they could proceed with their sanity checks.
And that was when lightning struck me ……
Red alert: the production system was not accepting connections from outside the cluster.
As I wrote, I had tested connectivity to the databases while logged in on those boxes, and it all worked well. If there is one lesson learned here, it is to also test connectivity from outside the cluster next time.
Because the funky part was that connections on the cluster worked back and forth, but connections from outside the cluster could not be established. If you tried a tnsping ADB (this alias used the VIP address) you would simply be left waiting until the connection timed out. This first made me suspect that something was wrong with the firewall, so I tested with:
telnet <machine> <port of listener>
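The same reachability check can be scripted, which is handy for testing from several clients outside the cluster at once. A minimal sketch (hostname and port below are placeholders) using bash's /dev/tcp pseudo-device plus GNU coreutils `timeout`, so a silently dropped connection fails fast instead of hanging like telnet:

```shell
#!/bin/sh
# check_port HOST PORT: succeed if a TCP connection opens within 3 seconds.
check_port() {
    host=$1
    port=$2
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "OPEN  ${host}:${port}"
        return 0
    else
        echo "CLOSED/FILTERED  ${host}:${port}"
        return 1
    fi
}

# Example: probe the listener port on a cluster node (placeholder values).
check_port machine1 1521 || true
```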
and that also timed out, so it seemed plausible that a firewall or network issue was the cause … Well, after investigation it turned out that the VIP addresses were fine and visible, and that there was no firewall issue. So what could it be?
Bug 13440962 Different subnet failed to connect to vip after restart vip
It turned out that after restarting the cluster resources we had this phenomenon: NONE of the connections from outside the cluster could use the listener (running on the VIP address), while inside the cluster everything worked fine. Apparently the issue occurs when a VIP is relocated, but also when a node is restarted. Even the SCAN address, when relocated, is not accessible from clients. The MAC address presented by CRS is not the correct one, so the router loops while routing the call. The issue shows up from all clients, but not from the DB server nodes.
In the end both a workaround and a solution were provided. The workaround was tested and implemented during the maintenance window, and at least it brought the production systems back to life (accepting connections from the application servers etc.). And it was quick:
I had my Unix administrator run this:
On node 1: /sbin/arping -U -c 3 -I bond0 195.233.666.72
On node 2: /sbin/arping -U -c 3 -I bond0 195.233.666.75
On node 3: /sbin/arping -U -c 3 -I bond0 195.233.666.77
On node 4: /sbin/arping -U -c 3 -I bond0 195.233.666.80
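What `arping -U` does is send unsolicited (gratuitous) ARP replies, forcing the switches and routers to refresh their ARP caches with the MAC address that currently owns each VIP. A sketch that only prints the per-node commands rather than executing them (the interface name bond0 and the sanitized VIP addresses are this environment's; arping needs root, and each command belongs on the node that currently hosts that VIP):

```shell
#!/bin/sh
# Print the gratuitous-ARP refresh command for each VIP (dry run).
# Run each printed command as root on the node that hosts that VIP.
IFACE=bond0

for vip in 195.233.666.72 195.233.666.75 195.233.666.77 195.233.666.80; do
    echo "/sbin/arping -U -c 3 -I ${IFACE} ${vip}"
done
```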
And … I had requested an emergency change window to install the bundle patch for base bug 13440962.
During the second night I installed that patch on all four nodes, and that went flawlessly.