When your Restart / Rac Node screams : [ohasd(1749)]CRS-0004:logging terminated

Introduction:

During  one of those days on a Monday where  a hot line ( ticket ) service was provided team received a ticket that all instances  of an Oracle Restart ( 11.2.0.3)  where down. I assisted in doing  troubleshooting  which is always fun because that gives a bit of a Sherlock Holmes idea investigating issues at hand .  After logging on to the box  I saw  that the ASM Instance was up  but a  “crsctl check  has”  showed that no communication with has daemon could be established.  So the puzzle became a bit more interesting at that moment .

Details:

  • Environment is an Oracle restart environment on Linux.
  • 11.2.0.3 Grid Infra with 11.2.0.3 Databases running on it
  • Grid infra has been installed in: /opt/crs/product/11203/crs/
  • Environment went down during the weekend ( fortunately it was a Test environment  with a standard office time SLA)

It was time to  take a peek in the alert-log of the cluster on that node.  Since my installation was done in: /opt/crs/product/11203/crs/  that meant I had to look for the log-file in the subdirectory:  /opt/crs/product/11203/crs/log/mysrvr01/   ( where in  the cluster mysrvr01 is the name of the node  i was working on ) to see the alert-file. In there I found this showing what was going on:

2014-10-18 10:47:37.060:
[ohasd(1749)]CRS-0004:logging terminated for the process. log file: “/opt/crs/product/11203/crs/log/mysrvr01/ohasd/ohasd.log”
2014-10-18 10:47:37.128:
[ohasd(1749)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /opt/crs/product/11203/crs/auth/ohasd/mysrvr01/A4107671
2014-10-18 10:47:37.146:
[ohasd(1749)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /opt/crs/product/11203/crs/auth/ohasd/mysrvr01/A5139099

With this as a clue dear Watson it was a lot easier to proceed.  Started hunting large files in the file system of the  /opt/crs/product/11203/crs/  and found a flood of XML files  that had not  been erased for ages there . Which made me realize that a clean up job in cron would become handy for this as well.

After  removing the xmls and checking the mount point again  I logged in as Root  and first issued a  crsctl start has.

After some time i noticed in the alert log of the node :
014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command ‘check’ for resource ‘ora.DATA1.dg’. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command ‘check’ for resource ‘ora.DUMMY_DATA.dg’. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command ‘check’ for resource ‘ora.FRA1.dg’. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:27:52.189:
[evmd(4105)]CRS-1401:EVMD started on node mysrvr01.
2014-10-20 08:27:53.685:
[cssd(4193)]CRS-1713:CSSD daemon is started in local-only mode
2014-10-20 08:27:55.581:
[ohasd(32421)]CRS-2767:Resource state recovery not attempted for ‘ora.diskmon’ as its target state is OFFLINE
2014-10-20 08:28:02.233:
[cssd(4193)]CRS-1601:CSSD Reconfiguration complete. Active nodes are mysrvr01 .

When I checked again 🙂 ah good news not only ASM instance running but also all Instances present on the box .  Another puzzle solved.

Happy reading and till next Time ,

Mathijs