Exadata: Rebuild RAC clusterware without deleting data Version 2

This post is differrent from my previous post Rebuild RAC clusterware without deleting data . Because two days ago I was upgrading grid infrastructure from 12.1 to 12.2 that was successfull on first node , but failed on second node. I will not describe why this happend, but the whole process was something complicated instead of being simple. We have installed several patches before the installation(gridSetup has this option to indicate patches before installation)… Seems the 12.2 software has many bugs even during upgrade process.(But I agree  with other DBA-s that 12.2 database is very stable itself).

So what happend now is that during first node upgrade OCR files was changed. I tried deconfigure from 12.2 home and it was also failed. So now I am with my clusterware that has corrupted OCR and voting disks(it belongs 12.2 version). In my presious post I was starting clusterware in exclusive mode with nocrs and restoring OCR from backup, but now because of voting disks are different version  it does not starting in even exclusive mode.

So I have followed the steps that recreate diskgroup , where OCR and voting disks are saved. Because it is Exadata Cell Storage disks , it was more complicated than with ordinary disks, where you can cleanup header using “dd”. Instead of dd you use cellcli.

So let’s start:

Connect to each cell server(I have three of them) and drop grid disks that belong to DBFS(it contains OCR and Voting disks). Be careful dropping griddisk causes data to be erased. So DBFS must contain only OCR and Vdisks not !DATA!

#Find the name, celldisk and size of the grid disk:

CellCLI> list griddisk where name like ‘DBFS_.*’ attributes name, cellDisk, size
DBFS_CD_02_lbcel01_dr_adm CD_02_lbcel01_dr_adm 33.796875G
DBFS_CD_03_lbcel01_dr_adm CD_03_lbcel01_dr_adm 33.796875G
DBFS_CD_04_lbcel01_dr_adm CD_04_lbcel01_dr_adm 33.796875G
DBFS_CD_05_lbcel01_dr_adm CD_05_lbcel01_dr_adm 33.796875G
DBFS_CD_06_lbcel01_dr_adm CD_06_lbcel01_dr_adm 33.796875G
DBFS_CD_07_lbcel01_dr_adm CD_07_lbcel01_dr_adm 33.796875G
DBFS_CD_08_lbcel01_dr_adm CD_08_lbcel01_dr_adm 33.796875G
DBFS_CD_09_lbcel01_dr_adm CD_09_lbcel01_dr_adm 33.796875G
DBFS_CD_10_lbcel01_dr_adm CD_10_lbcel01_dr_adm 33.796875G
DBFS_CD_11_lbcel01_dr_adm CD_11_lbcel01_dr_adm 33.796875G



CellCLI> drop griddisk DBFS_CD_02_lbcel01_dr_adm
drop griddisk DBFS_CD_03_lbcel01_dr_adm
drop griddisk DBFS_CD_04_lbcel01_dr_adm
drop griddisk DBFS_CD_05_lbcel01_dr_adm
drop griddisk DBFS_CD_06_lbcel01_dr_adm
drop griddisk DBFS_CD_07_lbcel01_dr_adm
drop griddisk DBFS_CD_08_lbcel01_dr_adm
drop griddisk DBFS_CD_09_lbcel01_dr_adm
drop griddisk DBFS_CD_10_lbcel01_dr_adm
drop griddisk DBFS_CD_11_lbcel01_dr_adm


cellcli> create griddisk DBFS_CD_02_lbcel01_dr_adm celldisk=CD_02_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_03_lbcel01_dr_adm celldisk=CD_03_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_04_lbcel01_dr_adm celldisk=CD_04_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_05_lbcel01_dr_adm celldisk=CD_05_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_06_lbcel01_dr_adm celldisk=CD_06_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_07_lbcel01_dr_adm celldisk=CD_07_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_08_lbcel01_dr_adm celldisk=CD_08_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_09_lbcel01_dr_adm celldisk=CD_09_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_10_lbcel01_dr_adm celldisk=CD_10_lbcel01_dr_adm, size=33.796875G
create griddisk DBFS_CD_11_lbcel01_dr_adm celldisk=CD_11_lbcel01_dr_adm, size=33.796875G

Do the same steps on other cells.

2.  Deconfigure root.sh on each node

# Run deconfig

/u01/app/ -deconfig -force

#rename gpnp profile

mv /u01/app/ /tmp/profile_backup.xml

3. Run root.sh on first node


It will fail because will not find the disk group DBFS for mounting and of course OCR inside.  But now asm is started in nomount mode and we are able to recreate diskgroup

4. Create DBFS diskgroup

sqlplus / as sysasm

SQL> create diskgroup DBFS
failgroup LBCEL01_DR_ADM disk ‘o/*/DBFS_CD_02_lbcel01_dr_adm’,’o/*/DBFS_CD_03_lbcel01_dr_adm’,’o/*/DBFS_CD_04_lbcel01_dr_adm’,’o/*/DBFS_CD_05_lbcel01_dr_adm’,’o/*/DBFS_CD_06_lbcel01_dr_adm’,’o/*/DBFS_CD_07_lbcel01_dr_adm’,’o/*/DBFS_CD_08_lbcel01_dr_adm’,’o/*/DBFS_CD_09_lbcel01_dr_adm’,’o/*/DBFS_CD_10_lbcel01_dr_adm’,’o/*/DBFS_CD_11_lbcel01_dr_adm’
failgroup LBCEL02_DR_ADM disk ‘o/*/DBFS_CD_02_lbcel02_dr_adm’,’o/*/DBFS_CD_03_lbcel02_dr_adm’,’o/*/DBFS_CD_04_lbcel02_dr_adm’,’o/*/DBFS_CD_05_lbcel02_dr_adm’,’o/*/DBFS_CD_06_lbcel02_dr_adm’,’o/*/DBFS_CD_07_lbcel02_dr_adm’,’o/*/DBFS_CD_08_lbcel02_dr_adm’,’o/*/DBFS_CD_09_lbcel02_dr_adm’,’o/*/DBFS_CD_10_lbcel02_dr_adm’,’o/*/DBFS_CD_11_lbcel02_dr_adm’
failgroup lbcel03_dr_adm disk ‘o/*/DBFS_CD_02_lbcel03_dr_adm’,’o/*/DBFS_CD_03_lbcel03_dr_adm’,’o/*/DBFS_CD_04_lbcel03_dr_adm’,’o/*/DBFS_CD_05_lbcel03_dr_adm’,’o/*/DBFS_CD_06_lbcel03_dr_adm’,’o/*/DBFS_CD_07_lbcel03_dr_adm’,’o/*/DBFS_CD_08_lbcel03_dr_adm’,’o/*/DBFS_CD_09_lbcel03_dr_adm’,’o/*/DBFS_CD_10_lbcel03_dr_adm’,’o/*/DBFS_CD_11_lbcel03_dr_adm’

5. Do the following steps:

* Deconfigure root.sh again from first node
* remove gpnp profile
* run root.sh again on first node

At this time root.sh should be successful.

6. Restore OCR

/u01/app/<clustername> directory contans OCR backups by default

crsctl stop crs -f
crsctl start crs -excl -nocrs
ocrconfig -restore /u01/app/
crsctl stop crs -f
crsctl start crs

7. Run root.sh on the second node



Create RAC database using DBCA silent mode

Real World Scenario: 

Previously, we had a vacancy on Senior DBA position. Some of our candidates had >15 years of experience in database administration.

So for testing their knowlege we created lab. There were already installed grid and database softwares, shared disks were present and diskgroups were already created.

The first task was to create RAC database in silent mode using DBCA.  They had an option to use the internet during the exam. But unfortunatelly they have not managed to do that.

So I decided to write the simple version of the script:

dbca -silent \
-createDatabase \
-templateName General_Purpose.dbc \
-gdbName orcl  \
-sid orcl  \
-SysPassword MyPassword123 \
-SystemPassword MyPassword123 \
-emConfiguration NONE \
-redoLogFileSize 2048  \
-recoveryAreaDestination FRA \
-storageType ASM \
-asmSysPassword MyPassword123 \
-diskGroupName DATA \
-characterSet AL32UTF8 \
-nationalCharacterSet AL32UTF8 \
-automaticMemoryManagement true \
-totalMemory 2536  \
-databaseType MULTIPURPOSE \
-nodelist rac1,rac2

Copying database files
1% complete
3% complete
9% complete
15% complete
21% complete
30% complete
Creating and starting Oracle instance
32% complete
36% complete
40% complete
44% complete
45% complete
48% complete
50% complete
Creating cluster database views
52% complete
70% complete
Completing Database Creation
73% complete
76% complete
85% complete
94% complete
100% complete
Look at the log file “/u01/app/oracle/cfgtoollogs/dbca/orcl/orcl.log” for further details.

Convert Oracle SE to EE

Upgrading Oracle database from Standard Edition to Enterprise Edition is very simple.

For example, we have running Oracle SE in ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1

Let’s start converting it…

  1. Install new home in /u01/app/oracle/product/12.1.0/dbhome_2 just indicate EE during installation(not SE).
    Response file entries(db_install.rsp):

  2. Shutdown database and listener from old home.
    . oraenv
    lsnrctl stop
    sqlplus / as sysdba
    shutdown immediate;
  3. Change ORACLE_HOME in oratab and .bash_profile
    cat /etc/oratab
    cat ~/.bash_profile
    export ORACLE_BASE=/u01/app/oracle
    export ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_2
    export PATH=$PATH:$ORACLE_HOME/bin
    export ORACLE_SID=EYC
    export TNS_ADMIN=$ORACLE_HOME/network/admin
    export EDITOR=vi
  4. Copy listener.ora, tnsnames.ora, sqlnet.ora, spfileORCL.oraorapwORCL to new home
    cp /u01/app/oracle/product/12.1.0/dbhome_1/network/*.ora /u01/app/oracle/product/12.1.0/dbhome_2/network/
    cp /u01/app/oracle/product/12.1.0/dbhome_1/dbs/spfileORCL.ora /u01/app/oracle/product/12.1.0/dbhome_2/dbs/
    cp /u01/app/oracle/product/12.1.0/dbhome_1/dbs/orapwORCL /u01/app/oracle/product/12.1.0/dbhome_2/dbs/

    Please verify that you don’t have old ORACLE_HOME indicated anywhere in these files.

  5. Renew environment variables
    . oraenv
    ORACLE_SID = [ORCL] ? 
    which sqlplus
    sqlplus / as sysdba
  6. Check that you have EE
    SQL> select banner from v$version;
    Oracle Database 12c Enterprise Edition Release - 64bit Production
    PL/SQL Release - Production
    CORE      Production
    TNS for Linux: Version - Production
    NLSRTL Version - Production

Disabling/enabling general job execution before/after maintenance

During the maintenance you may need to disable jobs.

To disable jobs created by dbms_jobs set job_queue_processes to zero.

–Save old value

show parameter job_queue_processes


alter system set job_queue_processes=0;

If you are using dbms_scheduler, this parameter does not work for you.

You will have to run the following:


After finishing maintenance, enable them:

alter system set job_queue_processes=1000;


How to find remote session executing over a database link

Select /*+ ORDERED */
substr(s.ksusemnm,1,10)||'-'|| substr(s.ksusepid,1,10) "ORIGIN",
substr(g.K2GTITID_ORA,1,35) "GTXID",
substr(s.indx,1,4)||'.'|| substr(s.ksuseser,1,5) "LSESSION" ,
2,'SNIPED',3,'SNIPED', 'KILLED'),1,1) "S",
substr(event,1,10) "WAITING"
from x$k2gte g, x$ktcxb t, x$ksuse s, v$session_wait w
where g.K2GTDXCB =t.ktcxbxba
and g.K2GTDSES=t.ktcxbses
and s.addr=g.K2GTDSES
and w.sid=s.indx;

GTXID is the same on both databases.

################################### Sample output ###################################


3   LBREPDB01-51715  LBREP.aa2c0b4f.94.11.4694801  5447.62951   I   SQL*Net me


2   LB\MARIAMI-41196:4058  LBREP.aa2c0b4f.94.11.4694801 87.36231  I  SQL*Net me

More Details:

SID – 87
SERIAL – 36231

Configure resource manager to kill sessions automatically after maximum idle time is passed


Our applications are opening too many connections and moreover are not closing them at all 🙂 .  Because of this to many sessions stay idle and after IDLE_TIME is passed they become SNIPED.
As you know SNIPED session still holds session counter and it is completely cleaned out just after SNIPED session tries to execute something(it of course errors out). But if SNIPED session never tries to execute anything then the session stays forever in database.  And after a while database throws ORA-00018 maximum number of sessions exceeded.

My old solution: 

Created script file /u01/app/oracle/dba_scripts/kill_sniped.sh, with content:


#Written by MK

cd /u01/app/oracle/dba_scripts
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=/u01/app/oracle/product/
export ORACLE_SID=orcl1
export ORACLE_USER=oracle
$ORACLE_HOME/bin/sqlplus -s / as sysdba <<EOF


snum NUMBER;
FOR i IN (SELECT ‘alter system kill session ”’||a.SID||’,’||a.serial#||’,@’||inst_id||”’ immediate’ killSniped FROM gv\$session a
WHERE (a.status=’SNIPED’ or a.status=’KILLED’)
and a.username is not null
execute immediate i.killSniped;
exception when others then null;

You will easily guess what does it do. It finds sessions with status SNIPED and KILLED and executes alter system kill session script for them.

Created crontab entry:

$ crontab -l
*/10 * * * * /u01/app/oracle/dba_scripts/kill_sniped.sh > /u01/app/oracle/dba_scripts/logs/kill_sniped.log 2>&1

Script was working fine about one year, without any problem 🙂 but yesterday my script was not able to handle all of these sessions and it was killing slower than SNIPED sessions were appearing in our database so database raised ORA-00018 error.

New and better solution:

Created consumer group , set plan directive with MAX_IDLE_TIME 900sec for this group and moved problematic app user in this group.

After MAX_IDLE_TIME is passed user session is automatically killed by resource manager and it is the quickest.




, COMMENT => ”
, MAX_IDLE_TIME => 900);







Note: RSAPP user had IDLE_TIME 15min in its profile, that is why I have set MAX_IDLE_TIME to 900sec(15min). Be careful for this decision , you should set this value appropriate to profile IDLE_TIME value. Or first discuss it with developers, they may not want you to kill their app session after 15min.. but after 20min.

To check how many sessions were killed by resource manager check:


Hope post was useful. 🙂

Rebuild RAC clusterware without deleting data

As I have mentioned in my previous posts, I was applying interim patch on database which had post installation script (# <GI_HOME>/crs/install/rootcrs.pl -postpatch) .
The post script failed with permission denied error on ohasd file and left clusterware in a messy situation.

I have opened SR on metalink and one of their support after a huge amount of time of talking and troubleshooting together says:

“We do not know what happened or what steps you have taken to reach this situation. You should open an SR with us before you deconfigure the node.
Please, do bare metal restore as it is recommended by previous engineer.
Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment ( Doc ID 1084360.1 )”

This Bare Metal Restore is like wiping everything and after that I have had to configure RAC, DATAGUARD and everything from scratch. <<–Don’t like such solutions, this is like “if your windows works slowly then reinstall it”.. for windows this might be really true 🙂 nothing than reinstall helps 😀 but on Linux/Oracle you must troubleshoot first.
So I created another SR with another error(Errors at this time were lot) and for the second time I was lucky.
I was working 24/7 with support, the engineers were shifting. Three different engineers worked at different times on this SR.
I want to mention one “Venkata Pradeep Kumar” Oracle support engineer , he is so clever he helped me a lot and we rescued the system !:)

I want to share the steps with you , it should interesting.


After applying patch post script on first node (which failed), clusterware on first node was not starting. At this time second node was fine.
I have deconfigured clusterware (write this step in solution section) on first node and it started but with some problems about oc4j service.

2016/09/27 06:56:15 CLSRSC-1003: Failed to start resource OC4J
2016/09/27 06:56:16 CLSRSC-287: FirstNode configuration failed

I have deconfigured clusterware on second node also and tried to run root.sh, but it said that root.sh could not be run because it was not successful on first node. 😦

So, until root.sh script is not completely successful on first node you should not deconfigure it on second. But if you did it do not panic if you have OCR backup.


# Deconfigure crs on problematic node , note you may help the different solution , by just configuring one node. In my situation all nodes became problematic.
# Also please be careful, below steps assumes that you have separate group for OCR. Datafiles must be on different group. Or diskgroup will be wiped.

# From root on both nodes node1 , node2

/u01/app/ -deconfig -force

# run root.sh on node1 , it may not be completely successful


# We need to find a good OCR backup , for me it is week.ocr which was taken automatically in 2016/09/15 09:12:28.
# Patch was applied at 10:00AM in 2016/09/25. So we need week.ocr it is before patching.

[root@lbdm01-dr-adm grid]# ocrconfig -showbackup

lbdm02-dr-adm 2016/09/27 02:35:23 /u01/app/ 3351897854
lbdm02-dr-adm 2016/09/26 15:44:53 /u01/app/ 3351897854
lbdm02-dr-adm 2016/09/26 11:44:52 /u01/app/ 3351897854
lbdm02-dr-adm 2016/09/27 02:35:23 /u01/app/ 3351897854
lbdm01-dr-adm 2016/09/15 09:12:28 /u01/app/ 854493477
lbdm02-dr-adm 2016/09/25 15:29:18 /u01/app/ 3351897854
lbdm02-dr-adm 2016/09/25 10:34:56 /u01/app/ 2725022894
lbdm01-dr-adm 2015/07/29 19:46:28 /u01/app/ 854493477
lbdm01-dr-adm 2015/07/29 19:46:27 /u01/app/ 854493477

# Ensure that no process left
# node 1

crsctl stop crs -f
ps -ef|grep “/u01/app”

# if here is anything kill them!

#Start clusterware in exclusive mode with no ocr on node 1

crsctl start crs -excl -nocrs

#Restore OCR on node 1

ocrconfig -restore /u01/app/

# Stop crs on node 1

crsctl stop crs -f
crsctl start crs

# Check the status

crsctl status res -t

# It should be OK

# Do the same steps on node 2 from root, but it may fail


# Failed

ORA-15160: rolling migration internal fatal error in module SKGXP,valNorm:not-native
. For details refer to “(:CLSN00107:)” in “/u01/app/oracle/diag/crs/lbdm02-dr-adm/crs/trace/ohasd_oraagent_oracle.trc”.
CRS-2883: Resource ‘ora.asm’ failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.
2016/09/28 09:11:00 CLSRSC-117: Failed to start Oracle Clusterware stack

# deconfig on both nodes
# node1 , node2

 /u01/app/ -deconfig -force

#and run agin root.sh
# node 1


# It was completelly successful.

# On second there is still problem

# Read the following document ORA-15160: rolling migration internal fatal error in module SKGXP,valNorm:not-native (NOTE 1682591.1)

# Here problem was on protocols that was used by asm and rdbms.
# rdbms is using rds protocol and asm is using udp, see Oracle Clusterware and RAC Support for RDS Over Infiniband (NOTE 751343.1)
# problem was in libraries and we should relink them with right protocols
# As the ORACLE_HOME/GI_HOME owner, stop all resources (database, listener, ASM etc) that’s running from the home. When stopping database, use NORMAL or IMMEDIATE option.

# From problemtic node , where asm or database is not starting.

crsctl stop crs
ps -ef|grep d.bin
ps -ef|grep “/u01/app”

# Kill if any process left

# If relinking Grid Infrastructure (GI) home, as root, unlock GI home: <GI_HOME>/crs/install/rootcrs.pl -unlock

/u01/app/ -unlock

# As the ORACLE_HOME/GI_HOME owner, go to ORACLE_HOME/GI_HOME and cd to rdbms/lib
# As the ORACLE_HOME/GI_HOME owner, issue “make -f ins_rdbms.mk <protocol write here> ioracle”
#For rdbms

[root@lbdm02-dr-adm lib]# su – oracle
[oracle@lbdm02-dr-adm ~]$ cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_rds ioracle

#For asm

. oraenv
[oracle@lbdm02-dr-adm ~]$ cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_g ioracle

# From root

/u01/app/ -patch

# The last step should have configure clusterware also. And everything should be fine. And you can sleep now. 🙂