The good, the bad and the corrupted (ORA-01578, ORA-26040, ORA-01110)

Introduction:

 

At the moment I am part of a team that will upgrade a billing environment from 10.2.0.5 to 11.2.0.4 on Solaris. After a restore and point-in-time recovery on a preproduction environment the team upgraded all databases using the manual upgrade method (still not that confident that the upgrades would run as smoothly with the DBUA). As always, after the upgrade there is a period of hyper care in which the alert files should be monitored even more closely for funky issues. This blog gives some more details on the funky stuff we came across and how we analyzed it. The one thing that did make me frown during this exercise, however, was the fact that the various tools did not necessarily all show the same results.

Conclusions:

When upgrading important environments a DBA should really have eyes and ears everywhere before performing the upgrade, because once upgraded it will be a tough call to prove that something was already broken in the old environment or that it was caused by the upgrade. So I would really recommend scrutinizing your environment like someone studies a creature under a microscope before upgrading: question broken jobs, invalid objects and, as I learned the hard way during this upgrade, objects created with the NOLOGGING option (check DBA_TABLES, DBA_INDEXES, DBA_LOBS, DBA_TAB_PARTITIONS, DBA_LOB_PARTITIONS, DBA_TAB_SUBPARTITIONS, etc.; LOGGING='NO' indicates NOLOGGING). A sketch of such a check follows below. In my case a number of objects had been created with the NOLOGGING option, and indeed only after upgrading did I have to fight my way back to find out what was going on and which objects were involved (in my case it was a number of staging tables for a specific user, where a drop and recreate plus exp/imp dealt with it). I did recommend enabling force logging for the complete database (but was told the customer wants to be able to create objects with the NOLOGGING option for performance reasons).
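As a minimal sketch of such a check, something along these lines can be run before the upgrade to list segments created with the NOLOGGING option and to see whether force logging is enabled. It is only an example and an assumption on my side that tables, indexes and LOBs are the interesting object types here; add the partition views as needed:

select owner, table_name, 'TABLE' object_type from dba_tables where logging = 'NO'
union all
select owner, index_name, 'INDEX' from dba_indexes where logging = 'NO'
union all
select owner, table_name || '.' || column_name, 'LOB' from dba_lobs where logging = 'NO';

select force_logging from v$database;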

As I mentioned in the blog, in the end I was plain lucky that this was caused by staging tables, which made the scenario easier since we could simply drop and recreate them. Once again I learned something (and found proof again that success just loves preparation). Of course this is a slightly different scenario than the one we will run in production, since there we will run the upgrade directly, but we still need to bear in mind the objects created with the NOLOGGING option in production in case one of the databases in scope would need a restore should the upgrade break the database. Well, in this case all databases were upgraded with success, so all DBAs are happy and confident for the production migration which is scheduled.

Details:

SERVER: mysrvr

Database: OPTMYDB

Timeline:

  • All databases of the billing environment have been restored to the new OPT server called mysrvr.
  • We have upgraded all databases from 10.2.0.5 to 11.2.0.4 on Solaris.
  • After opening the database OPTMYDB and running the statistics jobs, it started throwing error messages in the alert file about block corruptions, as seen in Example 1 below. Yesterday, together with a colleague from the Adba team, the gather statistics issue was resolved and three objects were identified as having corrupt blocks (see Example 2). These three objects have been dropped and recreated and there were no more corruption messages in the alert file.
  • After that I was asked by my manager to run a dbv check for the full database, covering all datafiles (a sketch of one way to generate the dbv commands follows this list).
  • I have also used RMAN in parallel to find possible corruption; this is still running, but the first output is in Example 3:
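Running dbv by hand for hundreds of datafiles is tedious, so one way (a sketch only, and my own habit rather than something from a note) is to let the database generate the dbv calls and then execute the generated lines from the shell; the logfile location is just an example:

set pages 0 lines 200
select 'dbv file=' || d.file_name ||
       ' blocksize=' || t.block_size ||
       ' logfile=/tmp/dbv_' || d.file_id || '.log'
from dba_data_files d, dba_tablespaces t
where d.tablespace_name = t.tablespace_name
order by d.file_id;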

Example 1

## In the alert file this was the information we came across, setting off the first alarm bells:

Mon Aug 04 22:00:43 2014

DBMS_STATS: GATHER_STATS_JOB encountered errors.  Check the trace file.

Errors in file /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/trace/OPTMYDB_j002_3640.trc:

ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout

ORA-29400: data cartridge error

error opening file /home/tmcon/oracle/UNMATCHED_SUBS_3640.log

Errors in file /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/trace/OPTMYDB_j002_3640.trc  (incident=8665):

ORA-01578: ORACLE data block corrupted (file # 620, block # 391690)

ORA-01110: data file 620: ‘/db/MYDB/data/tbsdata_e/tbsdip01_4m_14.dbf’

ORA-26040: Data block was loaded using the NOLOGGING option

Incident details in: /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/incident/incdir_8665/OPTMYDB_j002_3640_i8665.trc

 

Example2

I have grepped the ORA-01578 errors from the alert file and added them to a table to get an overview of the datafiles and blocks having the issues. And since the alert file is a growing object, this is what I got over the time interval since starting up the database as an 11.2.0.4 environment.

grep ORA-01578 alert_OPTMYDB.log AFN BL
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946

 

## On MOS I retrieved a note that offered a way to get more information on the object located around the corrupted block:

select file_id AFN, relative_fno, tablespace_name
from dba_data_files
where relative_fno=&RFN;
Enter value for rfn: 620
old   3: where relative_fno=&RFN
new   3: where relative_fno=620
       AFN RELATIVE_FNO TABLESPACE_NAME
———- ———— ——————————
      3689          620 UNDO_TBS_01
       620          620 TBSDIP01_4M
##  Used this statement to find the object that was on that spot in the tablespace:
select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
## substitution values used below: AFN/BL = 620/391690 and 620/392202
SQL> select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
  2    3    4  Enter value for afn: 620
old   3: where file_id = &AFN
new   3: where file_id = 620
Enter value for bl: 391690
old   4: and &BL between block_id AND block_id + blocks - 1
new   4: and 391690 between block_id AND block_id + blocks – 1
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_BPS_05062014000_NOBP
                               TABLE              TBSDIP01_4M
         0        620     391689    4194304        512          620
SQL> select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
  2    3    4  Enter value for afn: 620
old   3: where file_id = &AFN
new   3: where file_id = 620
Enter value for bl: 392202
old   4: and &BL between block_id AND block_id + blocks - 1
new   4: and 392202 between block_id AND block_id + blocks – 1
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_BPS_05062014000_NORUEB
                               TABLE              TBSDIP01_4M
         0        620     392201    4194304        512          620
SQL> select file_id AFN, relative_fno, tablespace_name
from dba_data_files
where relative_fno=&RFN;
  2    3  Enter value for rfn: 661
old   3: where relative_fno=&RFN
new   3: where relative_fno=661
       AFN RELATIVE_FNO TABLESPACE_NAME
———- ———— ——————————
       661          661 TBSDIP01_4M
select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
## substitution values used: AFN = 661, BL = 423946
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_SII_05062014000R_BAN
                               TABLE              TBSDIP01_4M
         0        661     423945    4194304        512          661

 

Example3

On MOS I also retrieved a note that offered RMAN as a tool to find logical corruption. Since I am curious by nature, of course I had to use that goodie too!

 

RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup validate check logical database;
}

 

##  Running this in RMAN will put rows, if any, in v$database_block_corruption. When I asked for more details on this view I was told that it should return to 0 rows when rerunning. I restarted the RMAN process after some time (so maybe, since it did not complete, it did not wrap up properly, but at the second run I noticed rows were still present there).

SQL> select * from V$DATABASE_BLOCK_CORRUPTION;

 

FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO

———- ———- ———- —————— ———

235      34825         92         1.3760E+13 NOLOGGING

235     323083        510         1.3760E+13 NOLOGGING

548     315914        833         1.3760E+13 NOLOGGING

648     148489        512         1.3760E+13 NOLOGGING

 

## Well, the good thing was that this showed in one view what most likely was the cause of the issue: objects being created with the NOLOGGING option, thus invalidating a 100% recovery after a restore (and recovery of a number of archives) and the upgrade to 11g after that. A sketch that maps these rows to the affected segments in one query follows below.
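Instead of looking up every file and block pair by hand as in Example 2, the corruption view can also be joined straight to dba_extents. This is just a sketch I find convenient, not something prescribed by the MOS notes, and on databases with many extents the dba_extents lookup can be slow:

select c.file#, c.block#, c.blocks,
       e.owner, e.segment_name, e.segment_type, e.tablespace_name
from v$database_block_corruption c, dba_extents e
where e.file_id = c.file#
and c.block# between e.block_id and e.block_id + e.blocks - 1
order by c.file#, c.block#;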

From Oracle support I got these as good notes, as it turned out that the corruption was caused by NOLOGGING operations:

The Gains and Pains of Nologging Operations in a Data Guard Environment (Doc ID 290161.1)

ORA-1578 / ORA-26040 Corrupt blocks by NOLOGGING – Error explanation and solution (Doc ID 794505.1)

A quote from this Note really made me both frown and giggle:

…… Begin Quote

Note that the data inside the affected blocks is not salvageable. Methods like “Media Recovery” or “RMAN blockrecover” will not fix the problem unless the data file was backed up after the NOLOGGING operation was registered in the Redo Log.

…… END  Quote

I also learned that the way Oracle marks a block as invalidated due to NOLOGGING is by updating most of the bytes in that block with 0xff, but only when that "invalidate" redo is applied during a recovery. If the associated redo/archived log files are used to RECOVER the data files (which was the case in this preproduction environment), Oracle invalidates such blocks, and the errors ORA-26040 and ORA-1578 are reported by SQL statements on the next reads of those blocks. So in this case the recovery was done, but the blocks that had been touched by NOLOGGING operations were marked soft corrupt during the recovery, as the redo records carrying the "this block was nologging" flag were applied to them.

http://docs.oracle.com/cd/E11882_01/backup.112/e10642/osadvsce.htm#BRADV90047

The recommendation is: always perform a backup after NOLOGGING operations, do not touch those blocks again, and use that backup for recovery. A small check that can flag this risk is sketched below.
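A quick way to see which datafiles have been touched by unrecoverable (NOLOGGING) operations is RMAN's report unrecoverable command; from SQL the underlying information can be looked at as sketched below, after which the unrecoverable_time can be compared with the time of the last backup of each file:

select file#, name, unrecoverable_change#, unrecoverable_time
from v$datafile
where unrecoverable_change# > 0
order by unrecoverable_time;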

 

As always  happy reading and till we meet again,

 

Mathijs

 

Dropping redo groups in Another Thread

Introduction:

Quite some time ago I had to turn a RAC environment into a single instance/database. However, I noticed there were still redo groups of the second instance present which I wanted to drop. This is just a quick blog to show the steps I have followed:

 

## Checking the redo logs and the thread information:

SQL> select * from v$log;

GROUP#   THREAD# SEQUENCE#     BYTES BLOCKSIZE      MEMBERS ARC STATUS          FIRST_CHANGE# FIRST_TIME      NEXT_CHANGE# NEXT_TIME

———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-

1         1      1931 2147483648       512         2 YES INACTIVE         1.3789E+13 14.08.2014 18:46:19   1.3789E+13 18.08.2014 07:08:18

2         1      1933 2147483648       512         2 YES INACTIVE         1.3789E+13 18.08.2014 07:08:20   1.3789E+13 18.08.2014 07:08:22

3         1      1934 2147483648       512         2 NO CURRENT          1.3789E+13 18.08.2014 07:08:22   2.8147E+14

4         2        128 2147483648       512         2 YES INACTIVE            146319194 19.04.2012 23:21:10   146397537 20.04.2012 09:19:04

5         2        127 2147483648       512         2 YES INACTIVE            146311735 19.04.2012 22:35:29   146319194 19.04.2012 23:21:10

6         2        126 2147483648       512         2 YES INACTIVE            146169044 18.04.2012 23:20:54   146311735 19.04.2012 22:35:29

 

## Now let's try to drop a group that is part of the 2nd thread

SQL> alter database drop logfile group 4;

alter database drop logfile group 4

*

ERROR at line 1:

ORA-01623: log 4 is current log for instance MYDBP12 (thread 2) – cannot drop

ORA-00312: online log 4 thread 2: ‘+MYDBP1_REDO/MYDBP1/onlinelog/group_4.260.772812657’

ORA-00312: online log 4 thread 2: ‘+MYDBP1_DATA/MYDBP1/onlinelog/group_4.267.772812685’

 

 ## Checking Threads

SQL> select THREAD#, STATUS, ENABLED from v$thread;

THREAD# STATUS ENABLED

———- —— ——–

1 OPEN   PUBLIC

2 CLOSED PUBLIC

 

## And as we saw in the first query, these are the groups that belong to the thread we want to drop:

SQL> select group# from v$log where THREAD#=2;

GROUP#

———-

4

5

6

 

## First we need to make sure that the thread will not interfere anymore, so we disable it:

SQL>

SQL> alter database disable thread 2;

Database altered.

 

## Now we can drop the logfile group:

SQL> alter database drop logfile group 4;

Database altered.

SQL> alter database drop logfile group 5;

Database altered.

SQL> alter database drop logfile group 6;

Database altered.

##  Believing is one thing, checking is better, so let's see if the thread is gone:

SQL> select THREAD#, STATUS, ENABLED from v$thread;

THREAD# STATUS ENABLED

———- —— ——–

1 OPEN   PUBLIC

## Let's check the redo groups:

SQL> select * from v$log;

GROUP#   THREAD# SEQUENCE#     BYTES BLOCKSIZE      MEMBERS ARC STATUS          FIRST_CHANGE# FIRST_TIME      NEXT_CHANGE# NEXT_TIME

———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-

1         1      1931 2147483648       512         2 YES INACTIVE         1.3789E+13 14.08.2014 18:46:19   1.3789E+13 18.08.2014 07:08:18

2         1      1933 2147483648       512         2 YES INACTIVE         1.3789E+13 18.08.2014 07:08:20   1.3789E+13 18.08.2014 07:08:22

3         1      1934 2147483648       512         2 NO CURRENT          1.3789E+13 18.08.2014 07:08:22   2.8147E+14

 

Happy DBA again!

 

Enjoy reading and till we meet again,

 

Mathijs

 

Is your backup strategy safe ( a story of BCV – Snap restores with EMC)

Introduction:

As a DBA there should always be some room to improve your backup and recovery scenarios. This blog is about a judgment day scenario: losing a 3 TB database in full, with your bosses asking what you will do and when the database will be back on its feet. In such scenarios I do tend to debate when you would really lose the full database with a need to restore it; proper answers might be during an upgrade to a newer Oracle release, or when you lose the server where the database is running (however, given the fact that a lot of sites use SAN, which offers other options than a restore of such a big database, even that is not that likely). In my experience it is much more common to come across scenarios where you restore single datafiles or tablespaces. But OK, for the argument's sake, and to prove we have a scenario, let us assume we lose the full database (meaning all data files and control files) and go into more detail on that.

Details:

On my most important databases this has been the setup in the past when implementing the database(s) in ASM, to give us a safe feeling.

  • Every database has three disk groups, called SID_DATA, SID_FRA and SID_REDO.
  • Database files are in SID_DATA, control files and redo log group members are in SID_DATA and SID_REDO, and the archives are put into SID_FRA.
  • Fortunately we are using EMC technology, both Snap and BCV. We use them to set up clone/reporting environments and pre-production, and to do our backups with.

With regard to the backups, an interesting debate arose this week which triggered this blog. In the setup of the Snap and BCV it has always been considered that there have to be two of them for each database: one holding the SID_DATA disk group (meaning the database files) and one Snap or BCV holding SID_FRA and SID_REDO. When a backup is triggered, the following actions are performed on the production side (a minimal SQL sketch of the database-side steps follows this list):

  • The database is put into begin backup mode,
  • In the database a copy of the controlfile is pushed to the SID_FRA disk group,
  • In the EMC tools the mirror is broken,
  • In the database an end backup is issued,
  • On the backup server the database is mounted with the copied version of the controlfile and a level backup (both level 0 and level 1, indeed also using block change tracking) is run with RMAN,
  • After a backup the Snap/BCV remains mounted on the backup server so that, in case of a true disaster, it can be used to restore the full database from disk,
  • On the production server the archives are saved to tape directly, in intervals.
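For clarity, this is roughly what the database-side part of that procedure looks like. It is a sketch only: the controlfile name and disk group are examples, and the EMC split itself of course happens outside the database:

alter database begin backup;
alter database backup controlfile to '+SID_FRA/backup/control_for_snap.ctl' reuse;
-- at this point the storage tooling splits the SID_DATA mirror (Snap/BCV)
alter database end backup;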

 

So far so good, right? We have it all covered (or not?).

If you take a closer look at a judgment day disaster recovery request, where the customer says we need a restore of the FULL database (of 3 TB), would we be happy when seeing this?

If we check the redo logs and control files we will see this:

1* select a.GROUP#,MEMBER,BYTES/1024/1024 from v$logfile a, v$log b where a.group#= b.group#  order by group#;

 

GROUP# MEMBER                                                                                                                                   BYTES/1024/1024

———- ——————————————————————————– —————

1 +MYDBP1_REDO/mydbp1/onlinelog/group_1.258.772722547                                                                       2048

1 +MYDBP1_DATA/mydbp1/onlinelog/group_1.262.772722575                                                                       2048

2 +MYDBP1_REDO/mydbp1/onlinelog/group_2.259.772723069                                                                       2048

2 +MYDBP1_DATA/mydbp1/onlinelog/group_2.263.772723099                                                                       2048

3 +MYDBP1_REDO/mydbp1/onlinelog/group_3.260.772723879                                                                       2048

3 +MYDBP1_DATA/mydbp1/onlinelog/group_3.266.772723909                                                                       2048

4 +MYDBP1_REDO/mydbp1/onlinelog/group_4.261.772723247                                                                       2048

4 +MYDBP1_DATA/mydbp1/onlinelog/group_4.267.772723277                                                                       2048

5 +MYDBP1_REDO/mydbp1/onlinelog/group_5.262.772723627                                                                       2048

5 +MYDBP1_DATA/mydbp1/onlinelog/group_5.534.772723657                                                                       2048

6 +MYDBP1_REDO/mydbp1/onlinelog/group_6.263.772722935                                                                       2048

6 +MYDBP1_DATA/mydbp1/onlinelog/group_6.535.772722963                                                                       2048

 

12 rows selected.

 

## Control files are in:

SQL> show parameter control_files

 

NAME                                                      TYPE VALUE

———————————— ———– ——————————

control_files                                          string               +MYDBP1_REDO/mydbp1/controlfile/current.257.771667625, +MYDBP1_DATA/mydbp1/controlfile/current.261.771667625

 

So if all hell breaks loose and you are asked to restore the backup from the Snap or BCV, what will happen?

The first scenario was: well, it is all taken care of since we have two BCVs/Snaps, remember? The DBA would ask the storage colleague to restore the SID_DATA Snap (sync it back to production), and after that we would do a recovery using the archives and the online redo logs which are still present on the production box. And we feel very good with this, because we did think of it: the _FRA and _REDO disk groups should not be synced back to production, right? (Actually the proper answer is: wrong.)

As we were discussing this scenario, and after looking at the locations in v$logfile, we noticed a weak spot in that idea. If we restore SID_DATA from the BCV, that would also mean that the first member of every online redo log would be wiped out, since we have a redo member in the SID_DATA disk group (and Oracle does not like online redo logs that are not in sync). We would also lose a current control file, since one of those is in the same disk group too (and out-of-sync control files are not liked by Oracle either). Of course such a scenario will only be needed somewhere between dusk and dawn after four nights of being on call, with everybody shouting that the issue needs to be fixed as soon as possible, so there would be little time to investigate and set up a scenario.

Looking at the scenario at hand, when being asked to restore asap from the BCV, losing one of the control files could easily be fixed by starting the database with one control file (since we did not touch the _FRA and _REDO disk groups, as we did not restore the second BCV, lucky us, that still offers us a current control file; after doing our restore and recover activities we could use RMAN to copy the control file again to a second and third place and amend the spfile). A rough sketch of that last step follows below.
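Re-multiplexing the control file afterwards could look roughly like this. It is only a sketch with example names: the surviving copy in _REDO is assumed, and the actual copy to the other disk group would be made while the instance is down (for instance with RMAN restore controlfile or an asmcmd cp from the surviving copy):

-- while still running on the single surviving control file
alter system set control_files =
  '+MYDBP1_REDO/mydbp1/controlfile/current.257.771667625',
  '+MYDBP1_DATA/mydbp1/controlfile/control02.ctl'
  scope = spfile;
shutdown immediate
-- recreate the second copy from the surviving one while the instance is down, then:
startup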

The online redo log member, however, would hurt badly if one member per group really were overwritten (since we restore from the SID_DATA Snap/BCV) in order to get back on our feet from disk as fast as possible once the full database is destroyed. In a lab test I could perhaps consider a scenario of mounting the database and dropping the first member from each group, but not on a highly important production database like the one in this scenario.

So that means we will need to make sure that when a SID_DATA Snap/BCV is used (resynced to production) to restore from disk, it only holds datafiles and, OK, a control file (and not also an online redo log member, as happens at the moment).

In this case I will set up new online redo log groups, and I will put their members in the SID_FRA and SID_REDO disk groups. Then I will do log switches and in the end drop the groups that had members in both _DATA and _REDO (a sketch of those steps follows below). After that I would be much more relaxed about my judgment day restore request to recover the FULL 3 TB database from the Snap/BCV at maximum speed. It is not that much work to carry out, and I am sure I will feel a lot more confident that even a scenario where a 3 TB database needs to be restored from a BCV can be done quite quickly and with fewer risks.
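A rough sketch of those steps; the group numbers, the 2048M size and the MYDBP1_FRA disk group name are assumptions based on the v$log output above, and each old group can only be dropped once it is INACTIVE and archived:

-- add as many new groups as needed to replace the old ones (three shown as an example)
alter database add logfile group 7 ('+MYDBP1_REDO', '+MYDBP1_FRA') size 2048M;
alter database add logfile group 8 ('+MYDBP1_REDO', '+MYDBP1_FRA') size 2048M;
alter database add logfile group 9 ('+MYDBP1_REDO', '+MYDBP1_FRA') size 2048M;
-- cycle the logs so the old groups become INACTIVE, then drop them one by one
alter system switch logfile;
alter system checkpoint;
alter database drop logfile group 1;
alter database drop logfile group 2;
alter database drop logfile group 3;
-- repeat the switch, checkpoint and drop for the remaining old groups (4, 5 and 6)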

 

As always, happy reading and till we meet again,

 

Mathijs

 

The return of the relink: Grid Infrastructure and RDBMS relink

Introduction:

This week I have been part of the debate again: do we or don't we relink when major activities like an upgrade of the Linux kernel are performed? I have been asked to do the relink after the RAC cluster was upgraded on Linux. So, as always, I thought it would be wise to make notes during the day as a plan to be performed during the night. In this blog you will find the steps I have performed on a two-node RAC cluster with 11.2.0.4 Grid Infrastructure and two Oracle software trees holding the 11.2.0.4 and 11.1 RDBMS.

With regard to relinking, the discussion in the team had been: 1) we might break things by relinking, and 2) we don't have the resources to do that for every server. My recommendation is to follow Oracle in this and deal with the relink of the Grid Infra right after the OS has been patched, because if something breaks during the upgrade and your relinking right after it, at least you know where it came from and can deal with it from there. Whereas if you do not relink your software right after such a major change to the OS, you might still be hit in the dark in the upcoming weeks and would then need to figure out what might have caused it.

You can even debate whether it is needed to stop resources like listeners and databases gracefully before shutting down the cluster, or to just perform a checkpoint in your databases and shut down the CRS. I have used both approaches and never had issues so far. But I can imagine that heavily used, busy systems might prefer the graceful shutdown before shutting down the GI.

 

Below you will find my steps. As always happy reading and till we meet again,

Mathijs.

 

Detailed Plan:

 

mysrvrar / mysrvrbr: steps 1 to 8 will be performed on both nodes in my cluster, in sequential order and with some delay, to make sure no cluster panic will occur.
1. crsctl status resource -t > /tmp/BeforeWork.lst
   Check your cluster in order to be able to compare it to what it looks like after the relinking; it is even a good idea to put the output into a file. Often I end up on clusters I am not that familiar with on a daily basis, so I tend to make this overview before I start working on the cluster.
2. cSpfile.ksh
   This is a home-made script in which several activities are performed. It creates a spfile, does a checkpoint and does a switch logfile right before shutting down the cluster node (a sketch of what such a script might run follows after this plan).
3. emctl stop agent
4. srvctl stop home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrar
   This will stop all resources that were started from the 11.2.0.4 home and keep a record of them in the file /tmp/statusRDBMS. This will be convenient when starting again.
5.
6. srvctl stop instance -d MYDBCM -i MYDBCM1
   srvctl stop instance -d MYDBCMAC -i MYDBCMAC1
   This is a shared cluster, so some customers require the 11.2.0.4 software and some the 11.1 software. The 11.1 databases have to be stopped individually.
7. srvctl stop listener -n mysrvrar -l listener_MYDBCM1
   srvctl stop listener -n mysrvrar -l listener_MYDBCMAC1
   It is common to have a listener per database, so I will stop the 11.1 listeners in a proper way as well.
8. As root: dealing with the cluster means you have to log on or perform sudo su - as the oracle user to become root in order to perform the tasks needed to stop the clusterware on the cluster node.
9. cd /opt/crs/product/11204/crs/bin
10. ./crsctl disable crs
   During this maintenance Linux will be patching and rebooting various times, so I was asked to make sure the Grid Infrastructure does not start at each reboot till we are ready.
11. ./crsctl stop crs
   The last step as preparation for the Linux guys to patch the machines: shutting down the Grid Infrastructure. Time to take a 2 hr sleep.
Time to relink the software on the two nodes. Starting the relink on the first node, performing steps 9 and following. I will complete all steps needed on the first node and see to it that the Grid Infrastructure is started before moving on to the second node.
12. CHECK IF CRS IS DOWN, otherwise REPEAT step 4.
   After returning to the cluster, still check whether CRS is down, because it is better to be safe than sorry.
13. As root: in order to relink the Grid Infra you have to become the root user again.
14. cd /opt/crs/product/11204/crs/bin (as root)
15. cd /opt/crs/product/11204/crs/crs/install
16. perl rootcrs.pl -unlock
   Earlier this night the GI was shut down for the Linux patching. When you perform this perl rootcrs.pl -unlock it will try to shut down the GI, so in my case I got a message that the system was not able to stop the CRS.
17. As the Grid Infrastructure for a cluster owner: this was a bit tricky, because the owner of the Grid Infra in my case is oracle, so don't try this as root. Better to open a second window as oracle for the steps below.
18. export ORACLE_HOME=/opt/crs/product/11204/crs
   cd /opt/crs/product/11204/crs/bin
   (both as the oracle user)
19. relink
   Relink will also write a relink log which you can tail.
20. [Step 1] Log into the UNIX system as the Oracle software owner.
   Once the GI software has been relinked it is time to relink the Oracle homes (in my case an 11.1 and an 11.2 software tree). In my case I logged on as the oracle user.
21. [Step 2] Verify that your $ORACLE_HOME is set correctly.
22. For all Oracle versions and platforms, perform this basic environment check first:
   export ORACLE_HOME=/opt/oracle/product/11204_ee_64/db    (Oracle 11.2.0.4)
   export ORACLE_HOME=/opt/oracle/product/111_ee_64/db      (Oracle 11.1)
   cd $ORACLE_HOME
   pwd
   Check the environment.
23. [Step 3] Verify and/or configure the UNIX environment for proper relinking:
   Set LD_LIBRARY_PATH to include $ORACLE_HOME/lib. LD_LIBRARY_PATH needs to be in place, so when relinking both Oracle versions make sure you set the environment correctly for each:
   export LD_LIBRARY_PATH=/opt/oracle/product/11204_ee_64/db/lib
   echo $LD_LIBRARY_PATH
   export LD_LIBRARY_PATH=/opt/oracle/product/111_ee_64/db/lib
   echo $LD_LIBRARY_PATH
24. [Step 4] For all Oracle versions and UNIX platforms:
   Verify that you performed Step 2 correctly (check, check and check again):
   env | grep -i LD_
   Make sure that you see the correct absolute path for $ORACLE_HOME in the variable definitions.
25. [Step 5] For all Oracle versions and UNIX platforms:
   Verify umask is set correctly:
   umask
   This must return 022. If it does not, set umask to 022:
   umask 022
   umask
26. [Step 6] Run the OS commands to relink Oracle.
   Important notes:
   * Before relinking Oracle, shut down both the database and the listener.
   * The following commands will output a lot of text to your session window. To capture this output for upload to support, redirect the output to a file.
   * If relinking a client installation, it is expected that some aspects of the following commands will fail if the components were not originally installed.
27. For all UNIX platforms (Oracle 8.1.X, 9.X.X, 10.X.X or 11.X.X):
   $ORACLE_HOME/bin/relink all    (Oracle 11.1)
   $ORACLE_HOME/bin/relink all    (Oracle 11.2)
   writing relink log to: /opt/oracle/product/11204_ee_64/db/install/relink.log
28. How to tell if relinking was successful: if relinking was successful, the make command will eventually return to the OS prompt without an error. There will NOT be a 'Relinking Successful' type message. I performed a tail on the logfiles in a second window while relink was running and did not see any issues. And as the note says, wait for the prompt to return (with no comments or messages) and you are good to go.
29. As root again.
   Since I am relinking both the GI and the RDBMS, I have moved this step (starting the GI again) till after the RDBMS relinking has finished, because of course during the relink of the RDBMS the environment (databases, listeners) has to be down!
30. cd /opt/crs/product/11204/crs/crs/install/
31. perl rootcrs.pl -patch
   This perl rootcrs.pl -patch will also start the cluster on this node again. NOTE: we had issues with this hanging on the first node. It appeared that the second node was up and running after all (my Linux colleague had issued a crsctl disable crs from an old, not active clusterware software tree which was still present on the box). So in this specific scenario I stopped CRS on the second node again, and then the script continued on the first node.
32. crsctl enable crs
   If you have used the disable crs, enable it again so that after a node reboot the GI will start.
33. As oracle:
   emctl start agent
   The agent was already running, so no manual action was needed.
34. srvctl start home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrar
   This will start all resources that were started from the 11.2.0.4 home. The resources had been saved previously in the /tmp/statusRDBMS file.
35. srvctl start instance -d MYDBCM -i MYDBCM1
   srvctl start instance -d MYDBCMAC -i MYDBCMAC1
   Starting the 11.1 resources.
36. srvctl start listener -n mysrvrar -l listener_MYDBCM1
   srvctl start listener -n mysrvrar -l listener_MYDBCMAC1
   Starting the 11.1 resources.
37. As the oracle user on the second node, once it is relinked:
38. srvctl start instance -d MYDBCM -i MYDBCM2
   srvctl start instance -d MYDBCMAC -i MYDBCMAC2
   Starting the 11.1 resources.
39. srvctl start listener -n mysrvrbr -l listener_REQMOD2
   srvctl start listener -n mysrvrbr -l listener_MYDBCM2
   Starting the 11.1 resources.
   srvctl start home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrbr
   crsctl status resource -t
   Check your cluster again and compare the result with the status from before. Hopefully all resources will appear ONLINE ONLINE, or at least show the situation as it was before. There might be an extra activity if you are using services that have been relocated during the action; in such a case you will have to relocate them again to their original location.
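For completeness, a minimal sketch of the database-side part of a script like cSpfile.ksh (step 2). The actual script is not shown in this post, so this is purely an assumption on my side: it keeps a pfile copy of the spfile as a fallback, does a checkpoint and forces a log switch in each instance before the node goes down (the path and file name are just examples):

create pfile='/tmp/initMYDBCM1_before_relink.ora' from spfile;
alter system checkpoint;
alter system switch logfile;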