To Tfa or not to Tfa

Introduction

Writing a few notes on the tool of Tfa on Oracle Real Application Clusters to share my current experiences on this topic. When you create a Tar @ Oracle for an issue on one of your clusters it might very well be that Support is asking to upload Tfa reports. In such case question number one would be do I have that tool installed in my software stack of the cluster-ware. Well first comfort is that as of Oracle 11.2.0.4 this tool is present as part of the standard cluster stack. But .. ( sometimes thinking why does there always have to be a but ). In my specific case Grid Infra structure is 11.2.0.3 so first I had to look if the tool was installed as an add-on to the cluster,  and second of course had to look if it was / is up and running, if it was able to survive a node reboot etc.

Below you will find some notes on the how and why of  this tool.

On the why you should use it part:

Oracle Trace File Analyzer White Paper

Okay that does make a lot of sense since the tool is helping us not only gather various kind of information in the log files on all the cluster nodes, but it is also able to filter that information to hold only data from lets say 4 hours before the incident till 1 hour after the incident.

On the how to part:

As mentioned since the cluster is using Grid Infra structure 11.2.0.3. it was needed to check if tool was installed and was running. Details below are more with regard to the checking if tools is there and is running. Should you need more on the installation process: Oracle Grid Infrastructure Trace File Analyzer Installation

Tfa summary of commands which all have to run as the mighty ROOT:

## checking actions:
cd /opt/crs/product/tfa/bin/
tfactl print actions
## When all is running and you need to collect data as ROOT:
./tfactl diagcollect -all -node all
## which hosts are being monitored:
./tfactl print hosts
Information with regard to  the configuration
ls -ltr /etc/init.d/init.tfa

-rwxr-xr-x 1 root root 11724 Aug 9 2016 /etc/init.d/init.tfa
## Starting the Tfa Daemon(s):
MYSRVR01hr:root:/root # /etc/init.d/init.tfa start
or 
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl start
## enable – disable autostart at node reboot: 
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl enable
## show status: 
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl print status
## show config:
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl print config
##show directories:
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl print directories
## collect data for example on all nodes:
# ./tfactl diagcollect -all -node all
or
./tfactl diagcollect -from “MMM/dd/yyyy hh:mm:ss” -to “MMM/dd/yyyy hh:mm:ss”
./tfactl diagcollect -from "Jun/30/2017 09:15:00" -to "Jun/30/2017 12:00:00"
### Examples starting tfa 
MYSRVR01hr:root:/root # /etc/init.d/init.tfa start
Starting TFA..
Waiting up to 100 seconds for TFA to be started..
. . . . . 
. . . . . 
Successfully started TFA Process..
. . . . . 
TFA Started and listening for commands
## Example enabling the start of the daemons at server / node reboot:
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl enable
TFA autostart has been enabled ..
## Example status
MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl print status

.------------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status |
+----------+---------------+-------+------+------------+----------------------+------------------+
| MYSRVR01hr | RUNNING | 3608 | 5000 | 12.1.2.5.0 | 12125020150615061618 | COMPLETE |
| MYSRVR02hr | RUNNING | 20057 | 5000 | 12.1.2.5.0 | 12125020150615061618 | COMPLETE |
'----------+---------------+-------+------+------------+----------------------+------------------'
## Example showing current configuration:

MYSRVR01hr:root:/opt/crs/product/tfa/bin # ./tfactl print config
.----------------------------------------------------------.
| MYSRVR02hr                                               |
+---------------------------------------------+------------+
| Configuration Parameter | Value |
+---------------------------------------------+------------+
| TFA version | 12.1.2.5.0 |
| Automatic diagnostic collection | OFF |
| Trimming of files during diagcollection | ON |
| Repository current size (MB) | 466 |
| Repository maximum size (MB) | 1271 |
| Inventory Trace level | 1 |
| Collection Trace level | 1 |
| Scan Trace level | 1 |
| Other Trace level | 1 |
| Max Size of TFA Log (MB) | 50 |
| Max Number of TFA Logs | 10 |
| Max Size of Core File (MB) | 20 |
| Max Collection Size of Core Files (MB) | 200 |
| Automatic Purging | ON |
| Minimum Age of Collections to Purge (Hours) | 12 |
'---------------------------------------------+------------'

.----------------------------------------------------------.
| MYSRVR01hr |
+---------------------------------------------+------------+
| Configuration Parameter | Value |
+---------------------------------------------+------------+
| TFA version | 12.1.2.5.0 |
| Automatic diagnostic collection | OFF |
| Trimming of files during diagcollection | ON |
| Repository current size (MB) | 473 |
| Repository maximum size (MB) | 1170 |
| Inventory Trace level | 1 |
| Collection Trace level | 1 |
| Scan Trace level | 1 |
| Other Trace level | 1 |
| Max Size of TFA Log (MB) | 50 |
| Max Number of TFA Logs | 10 |
| Max Size of Core File (MB) | 20 |
| Max Collection Size of Core Files (MB) | 200 |
| Automatic Purging | ON |
| Minimum Age of Collections to Purge (Hours) | 12 |
'---------------------------------------------+------------'

Happy reading,

Mathijs

 

TNS-12514 in an Oracle Rac Environment

Introduction:

On a frequent bases team members do hotline service where they act as a first line of contact when there are issues with the Oracle Environment on the various boxes. In the last week of 2013 Monday 30st of December  it was my turn to do that duty when one of the interesting  issues showed that  even caused a Priority One incident on the server srvr_33r which is part of a  4 node Rac cluster with the  database MYDB. So once again time to gear up and go investigate what was going on. Users received the following Error Message:  TNS-12514: TNS:listener does not know of service requested in connect descriptor

Summary:

In the end it showed that the Oracle Instance on the 33r Box was suffering from a communication issue with its ASM instance since this ASM instance had issues with its shared_pool ( in the alert file of the Oracle Instance at the start of the incident i found : ORA-04031: unable to allocate 760 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^928″,”kglss”).

Due to that  shared_pool issue the Instance was unable to communicate with its ASM Instance when it needed the information about  its Local listener in the spfile. That made that information unavailable to the Oracle instance so the Cluster agent performed this :  ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.20.30.72)(PORT=1521))))’ SCOPE=MEMORY SID=’MYDB1′;

So It altered my LOCAL_LISTENER in memory to the default listener on the cluster, listening on 1521 with NO services defined for this specific database !!

Even worse was that the environment was hit by an Oracle Bug ( which was pointed out by a team member to me : Bug 11772838 – oraagent modifies LOCAL_LISTENER with an incorrect value, when the parameter was set in spfile (Doc ID 11772838.8)

I have implemented the workaround for the listener and will have to schedule a maintenance window on the cluster to fix the Memory issues in the ASM instance(s).

Details:

Environment setting:  4 node 11.2.0.3 Grid Infrastructure  with a mix of  Databases (and versions) present. Every database has a dedicated listener on a specific port. And services for that database register with that dedicated listener. Every database has its spfile in an ASM Diskgroup.

When i started performing checks i first looked at the listener on the first Node ( Instance1 is on srvr_33r)

# lsnrctl status listener_MYDB
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 30-DEC-2013 10:02:51
 Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srvr_33r-vip)(PORT=20008)(IP=FIRST)))
 STATUS of the LISTENER
 ------------------------
 Alias LISTENER_MYDB
 Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
 Start Date 23-OCT-2013 01:21:27
 Uptime 68 days 9 hr. 41 min. 24 sec
 Trace Level off
 Security ON: Local OS Authentication
 SNMP OFF
 Listener Parameter File /opt/oracle/product/11203_ee_64/db/network/admin/listener.ora
 Listener Log File /opt/oracle/MYDB/admin/network/log/listener_MYDB.log
 Listening Endpoints Summary...
 (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.20.30.72)(PORT=20008)))
 (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.20.30.71)(PORT=20008)))
 Services Summary...
 Service "MYDB" has 1 instance(s).
 Instance "MYDB", status UNKNOWN, has 1 handler(s) for this service...
 The command completed successfully

 

And i compared that to the information in the listener on the other node: 

(Instance 2 is on srvr_34r)
oracle@srvr_34r:/opt/oracle [MYDB2]# lsnrctl status listener_MYDB
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 30-DEC-2013 10:00:27
 Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=srvr_34r-vip)(PORT=20008)(IP=FIRST)))
 STATUS of the LISTENER
 ------------------------
 Alias LISTENER_MYDB
 Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
 Start Date 23-OCT-2013 03:10:15
 Uptime 68 days 7 hr. 50 min. 12 sec
 Trace Level off
 Security ON: Local OS Authentication
 SNMP OFF
 Listener Parameter File /opt/oracle/product/11203_ee_64/db/network/admin/listener.ora
 Listener Log File /opt/oracle/MYDB/admin/network/log/listener_MYDB.log
 Listening Endpoints Summary...
 (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.20.30.75)(PORT=20008)))
 (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.20.30.73)(PORT=20008)))
 Services Summary...
 Service "MYDB" has 1 instance(s).
 Instance "MYDB", status UNKNOWN, has 1 handler(s) for this service...
 Service "MYDB.prod.nl" has 1 instance(s).
 Instance "MYDB2", status READY, has 1 handler(s) for this service...
 Service "MYDB_TAF.prod.nl" has 1 instance(s).
 Instance "MYDB2", status READY, has 1 handler(s) for this service...
 The command completed successfully

 

Conclusion: So in short the listener on srvr_33r for this specific database was unaware of its services even though  the services where / are in place.

HOWEVER  when I checked for the local listener on the srvr_33r server   in an sqlplus session  this is what I saw:

SQL> show parameter listener
NAME TYPE VALUE
 ------------------------------------ ----------- ------------------------------
 listener_networks string
 local_listener string (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.20.30.72)(PORT=1521))))
 remote_listener string srvr_3rcl-scan:33100

Which proves that the Spfile was at this morning time during incident ( December 30 as of 09:00)  no longer aware of its LOCAL listener !!! In the end we fixed the issue by adding the local listener again and after that issues with connectivity where resolved.

Note when I check the listener Logs I do see that listener was unaware of services at the start of  the incident .

/opt/oracle/MYDB/admin/network/log/listener_MYDB.log
30-DEC-2013 00:14:40 * (CONNECT_DATA=(SERVICE_NAME=MYDB.prod.nl)(SERVER=DEDICATED)(CID=(PROGRAM=c:\oraclexe\app\oracle\product\11.2.0\server\bin\ORACLE.EXE)(HOST=X1234567)(USER=SYSTEM))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.60.205.57)(PORT=58445)) * establish * MYDB.prod.nl * 12514
TNS-12514: TNS:listener does not currently know of service requested in connect descriptor
30-DEC-2013 00:14:40 * (CONNECT_DATA=(SERVICE_NAME=MYDB.prod.nl)(SERVER=DEDICATED)(CID=(PROGRAM=c:\oraclexe\app\oracle\product\11.2.0\server\bin\ORACLE.EXE)(HOST=X1234567)(USER=SYSTEM))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.60.205.57)(PORT=58446)) * establish * MYDB.prod.nl * 12514
TNS-12514: TNS:listener does not currently know of service requested in connect descriptor

In the alert logs of the Instance I found this information and I think that this has caused our PRiO 1 incident due to the instance could not talk to the asm instance:

view /opt/oracle/MYDB/admin/Arch/alert_MYDB1.log.20131230_000158
See Note 411.1 at My Oracle Support for error and packaging details.
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
Dumping diagnostic data in directory=[cdmp_20131229112802], requested by (instance=1, osid=29870 (M000)), summary=[incident=48997].
Sun Dec 29 11:28:07 2013
Errors in file /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB1/trace/MYDB1_m000_29870.trc  (incident=48998):
ORA-04031: unable to allocate 760 bytes of shared memory ("shared pool","unknown object","KKSSP^928","kglss")
Incident details in: /opt/oracle/MYDB/diag/rdbms/MYDB/MYDB1/incident/incdir_48998/MYDB1_m000_29870_i48998.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
WARNING: ASM communication error: op 0 state 0x0 (15055)
ERROR: direct connection failure with ASM
Sun Dec 29 11:28:08 2013
Dumping diagnostic data in directory=[cdmp_20131229112808], requested by (instance=1, osid=29870 (M000)), summary=[incident=48998].
Sun Dec 29 11:28:33 2013
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.20.30.72)(PORT=1521))))' SCOPE=MEMORY SID='MYDB1';

Addendum:

After Altering the local listener to the required and correct listener i closed the ticket. Much to my surprise after couple of hours to be kicked by the same issue on the same database again. A  Colleague pointed me to Mos to  Bug 11772838 – oraagent modifies LOCAL_LISTENER with an incorrect value, when the parameter was set in spfile (Doc ID 11772838.8) .  So that proved i would need to make another step to have a proper workaround in place before i could start plan a maintenance window that would help to fix the cause of all this ( which was the Oracle Instance unable to communicate to its ASM instance when it needed  information about  the local listener in its spfile).

As Always Happy reading,

Mathijs

Plan to setup OID (Oracle implementation of the Ldap) in a Rac environment

Introduction:

For some time it was planned to implement OID and frankly from the start I knew that would be something different compared to the things I have done so far.  Well but then again it does represent a new frontier does it not? So I felt challenged with the implementation. As an extra aspect during this setup we implemented the OID with Rac databases as a backend.  But as always the quote of the day for this was Success just loves good preperations. So we started with a setup of  a plan. The plan we have implemented captures the following 7 steps to success:

  • Check and plan the layout of your Ldap environment and your ldap.ora. For us that meant creating an environment to hold all production environments, one to hold the test environments well and after all  we added a third content as a concatination of both other environemnts since we  wanted to make sure that in the ldap.ora with only one setting in place we could contact all environments for clients who  will be on citrix servers ( and as far as I know in ldap.ora only one DEFAULT_ADMIN_CONTEXT = “dc=central,dc=env ,dc=EU” which you can use without using full qualified names):

CN=ORCLADMIN

And

dc= prod dc=EU

And

dc=test dc=EU

And

“dc=central,dc=env,dc=EU”

### think about the Ports for the Ldap you will need, one port for NON SSL, one for SSL traffic. Ask yourself if you will come across Firewalls during the setup  & operation.

  • Install a database (in this setup we use 11.2 as version). Since we wanted to deliver a High available solution in this setup we have implanted 3 Rac databases with 2 Oracle Instances each.
  • Install Weblogic 10.3.2 —> It should not be needed in the first place I read in some articles but I spoke to colleagues and they recommended that I should download the Weblogic part as well.  Only download I see is Weblogic server 12c.  So I download installers with Oracle Weblogic Server and Oracle coherence: http://www.oracle.com/technetwork/middleware/weblogic/downloads/index.html Since I am installing on 64bit Linux I was considering download generic file but given the fact that these boxes do not have Java setup properly decided to go with (wls1211_linux32.bin).
  • Setup and implement rules with regard to bulk load and bulk delete.
  • Implement Replication between the Environments.

Happy reading,

Mathijs

When listener does not know of service requested in connect (ORA-12514).

Introduction:

After setting up a taf service in one of  my rac environments i checked if the (local) listener that was configured for the database was showing the services as to be expected. Much to my surprise that  was not the case. So puzzled as i was ( after all this was not the first rac I build) I went on to a hunt to find what was going on. In this blog is my  share my steps / thoughts how i came to the happy end that of course was out there waiting for me.

Continue reading

Is it really the good – the – bad – and the Gui Dba building Racs?

Introduction

From my experience for some Database Administrators / sites it  still seem to be more tough /  hardcore to use the command line instead of using tools that  could do the same for you only with a lot less effort. And maybe that is also regarded to be a risk cause it is so easy. Well my thoughts on this is that a healthy mix on the use Gui – tools and command lines should  be an asset  to every administrator. And indeed before even starting debate when using the Gui of course you should always know what you are doing .. but then again that is always the case is it not ?

Creating an Rac manually – using DBCA.

Personally i think that the DBCA tool is  your best way forward if you are creating a new Rac environment. Of course  this tool also has also its limits ( such as on a 4 Node environment it will  by default set your cluster_database_instances default to 4 even when you build only a two node Rac)  but in general it is best way to work cause not only will create the Database , and the instances on the various Nodes  but it also adds the information to the 11G Grid infrastructure ( the Cluster ware ). And  for documentation purposes you can / should also always generate the create scripts in a default ( for me always /opt/oracle//admin/create) directory.

Yesterday i had to support a colleague who decided to create a two node RAC by hand . Indeed  even  when i suggested to him that using the DBCA would be helpful and fast in his actions it was his call to do it all by hand. But then he got stuck in the process with various actions.

If you decide to do  a create manually ( as they say maybe you should try to not use the Gui just cause it is written in this blog).  I think following steps are important:

  1. Work accurately ( setting up init oras , one for single Instance – Db) add the Instance specific information (in my case undo tablespace and local_listener ) were needed. Create spfile from that pfile.
  2. Start – create the single instance-db in full with the spfile.
  3. Once the single instance is alive start alter your spfile:
    1. alter system set cluster_database=true scope = spfile:
    2. alter system set instance_number= 1 scope = spfile sid=”instance1″ ;
    3. alter system set instance_number= 2 scope = spfile sid=”instance2″ ;
    4. alter system set thread = 1 scope = spfile sid=”instance1″ ;
    5. alter system set thread = 2 scope = spfile sid=”instance2″ ;
  4. Add at least the 3 redo groups to the second thread( alter database add logfile thread 2 group x:)
  5. set up the remote_listener with the scan listener
  6. alter system enable thread 2 ;
  7. Add a Taf service.
  8. As a check stop and start and stop both instances manually to see if the database is shared as it should be between the two instances.
  9. Add the information to the cluster:

srvctl  add database -d DB -o /opt/oracle/product/oracleVersion

srvctl  add instance -d DB -i DB1 -n Node1 srvctl  add instance -d DB -i DB2 -n Node2

srvctl modify instance -d DB -i DB1 -s +ASM1 srvctl modify instance -d DB -i DB2 -s +ASM2

srvctl  add service -d DB -s DB_TAF.domain -r DB1,DB2 srvctl start database -d DB srvctl start  service -d DB -s DB_TAF.domain

Ok, are  you still with me , do i rest my case here that it is a lot of work to do this the manual way ? And that these steps can be very  lets say error prone if you do not work 100%  accurate..

Yesterdays event where indeed not successful at first . Amongst other issues He got :

PRCR-1079 : Failed to start resource ora.db.db
CRS-5017: The resource action “ora.db.db start” met the following error:
ORA-29760: instance_number parameter not specified

Once again Google was our friend it showed post of a fellow blogger with regard to a Mos note:

MOS note ‘ORA-29760: instance_number parameter not specified’ When Starting the Database with Srvctl [ID 749515.1].

After that . Things worked .

Bottom-line of this story ? Plain and simple use the right and easiest tool for the Job you need to do. If that is a Gui  there is no dishonor in that . If you decide to go manual create , make sure you work accurately  hmm and have Google available for when things come up during the process.

Happy Reading ,

Mathijs