Introduction:
A couple of months ago I had the issue that in an 11.2.0.2 environment with Grid Infra structure and ASM on Linux , without any specific reason at a specific point and time would see messages about communication error between the databases and the ASM instance and I was not able to connect to the ASM instance with a sqlplus / as sysasm. I have opened a tar back then and even got myself a fresh bug 14767353 number but no answers. So But practically this remained unsolved. Friday 4th I had same issue again, so it was time to get back to arms and investigate this.
Below you will find the steps , what i saw and how I solved it with the help of my friends(Google and Metalink).
Environment: 4 Node GI 11.2.0.2.0 on Red Hat Linux with ASM. Installations performed as oracle:dba
Case:
Basically I saw the following:
- in the alerts of the databases connecting to the asm instance on the specific box i saw so that did not look all too good:
Fri Jan 04 15:58:06 2013 WARNING: ASM communication error: op 0 state 0x0 (15055) ERROR: direct connection failure with ASM WARNING: ASM communication error: op 0 state 0x0 (15055) ERROR: direct connection failure with ASM WARNING: ASM communication error: op 0 state 0x0 (15055) ERROR: direct connection failure with ASM WARNING: ASM communication error: op 0 state 0x0 (15055) ERROR: direct connection failure with ASM WARNING: ASM communication error: op 0 state 0x0 (15055) ERROR: direct connection failure with ASM
- when i tried to connect sqlplus / as sysasm to one of my ASM instances on the first box without password I would get:
oracle@mynode01hr:/opt/oracle []# sqlplus / as sysasm SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 4 15:18:56 2013 Copyright (c) 1982, 2010, Oracle. All rights reserved. ERROR: ORA-01031: insufficient privileges
Approach / Findings:
First change to make things look more standard was that from the second ASM instance I have issued following changes:
In its setup these ASM instances only had local listener set up for the VIP on the specific of the box but no remote listener. So that was one thing I thought needed setting up properly anyhow:
alter system set remote_listener='mynode01cl-scan.vfnl.dc-ratingen.de:1521' scope = both sid = '*';
After that tested connectivity (s+ is an alias for sqlplus ‘/ as sysdba’) :
oracle@mynode01hr:/opt/oracle []# s+
SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 4 15:18:56 2013 Copyright (c) 1982, 2010, Oracle. All rights reserved. ERROR: ORA-01031: insufficient privileges
By the way I could not connect to any of the instances on the box locally (without connect string). Connecting using the connect string however that did work:
oracle@mynode01hr:/opt/oracle []# sqlplus system@mydb11
SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 4 15:19:39 2013
Copyright (c) 1982, 2010, Oracle. All rights reserved.
Enter password:
Connected to:
So after round one in this quest , after setting up remote listener I still had that same issue. Though it was better practice in place it only solved the issues partly.
On metalink (which sometimes feels more like an Oracle of Delphi) I saw two workarounds , one would be change primary os-group to oinstall (not preferred because with us still old fashioned but working with dba as primary group so changing that in a running production box did not sound like a cunning plan). Second workaround gave me a good clue , connect via connect string which i tested with the normal databases and the ASM instance.
oracle@mynode01hr:/opt/oracle []# sqlplus sys@myasm1 / as sysdba SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 4 15:21:01 2013 Copyright (c) 1982, 2010, Oracle. All rights reserved. Enter password: Connected to: oracle@mynode01hr:/opt/oracle []# sqlplus sys@mydb11 as sysdba SQL*Plus: Release 11.2.0.2.0 Production on Fri Jan 4 15:21:25 2013 Copyright (c) 1982, 2010, Oracle. All rights reserved. Enter password: Connected to:
So after reading and test that was a good thing half the puzzle was solved since I was able to communicate with both databases and the ASM instance as soon as I Added a connect string to my sqlplus session. I performed an extra test to see if the communication between the Databases Instance and the ASM instance was ok by resizing a datafile.
SQL> alter database datafile '+MYDB1_DATA/mydb1/datafile/users.259.771667551' resize 150M; Database altered.
So that did prove that the database and the ASM instance did work together properly. Which makes the issues at hand a little less dramatic.
As I wrote on Metalink I had seen a suggestion to change to primary os- group of oracle (my environments are all oracle:dba) to oracle:oinstall). Even though I did not like it in the first or second place it gave me a thought I should check about os-groups on the Linux level. During that check the good thing was that then I noticed a delta in /etc/group on all 3 other boxes.
oracle@mynode02hr:/opt/oracle []# cat /etc/group|grep oracle
oinstall:x:47374:oracle <—- not present on mynode01hr , 04hr , 05hr
oracle@mynode02hr:/opt/oracle [MYDB11]# id -a oracle
uid=101(oracle) gid=101(dba) groups=47374(oinstall),101(dba)
This showed that oracle at least had the group oinstall available !!!! on the second node (02hr) where everything was working properly. The oinstall group was lacking on all other 3 nodes including on the first node with the issues . Now that made me wonder. What if Oracle still want to have the group oinstall present even if was not the primary os group.. I contacted Linux guru and he was kind enough to help me , he added the group ( locally ). Typically to be tested with:
id -a oracle
I tested it on the first Box 01hr which was having issues and it worked! I can connect without ORA-01031: insufficient privileges directly again to ASM without password and without connect string. And also no more messages in the Alerts of the instances that are connected to this Instance:
Fri Jan 04 15:58:06 2013 WARNING: ASM communication error: op 0 state 0x0 (15055) Final check if all was the same: for i in 1 2 4 5 ; do ssh -q oracle@mynode0${i}hr "id -a oracle"; done uid=101(oracle) gid=101(dba) groups=47374(oinstall),101(dba) uid=101(oracle) gid=101(dba) groups=47374(oinstall),101(dba) uid=101(oracle) gid=101(dba) groups=47374(oinstall),101(dba) uid=101(oracle) gid=101(dba) groups=47374(oinstall),101(dba)
Conclusion:
The Linux boxes were not set-up in a similar way and fortunately the correct set-up (with both the os-groups oinstall and dba being present) was in place on one box. By comparing that to the other boxes I found the missing piece in the puzzle after which my connect errors and the messages in the alerts of the databases vanished. It looks very much that the oinstall group simply has to be out there with the user oracle at least being a member of that group.
As always happy reading.
Mathijs.