Friday 9 December 2011

Redundant Interconnect in Oracle RAC


Oracle RAC 11gR2 and Redundant Interconnects

Oracle 11gR2 RAC allows the use of redundant interconnects. This means it is possible to create multiple links for the interconnect and remove a single point of failure, such as a switch failure, which would otherwise cause the cluster to fall back to a single node.

Before we configure this feature it is useful to look at what happens when the interconnect fails. For these tests we have a single two-node RAC cluster, operating over a single interconnect, using network interface eth1 on both nodes.

If we connect to the RAC database via the SCAN address:

sqlplus system/manager1@rac-scan.laptop.com:1521/TESTDB.laptop.com

we can query the database without any problems.
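It is also worth confirming, before the test, that both instances are open. A quick check against gv$instance will do; the output below is illustrative only, and the instance names will depend on your configuration:

sqlplus> select inst_id, instance_name, status from gv$instance;

   INST_ID INSTANCE_NAME    STATUS
---------- ---------------- ------------
         1 TESTDB1          OPEN
         2 TESTDB2          OPEN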

Now take the interconnect away on one of the nodes, with the command:

root#> ifconfig eth1 down

If we now issue a query to the database:

sqlplus> select * from dba_data_files;

We see that this statement still works; however, further digging reveals that we are now operating on a single node:

sqlplus> select * from gv$instance;

This will now show only one instance.
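Running the same check as before now returns a single row, along the lines of (illustrative output only):

sqlplus> select inst_id, instance_name, status from gv$instance;

   INST_ID INSTANCE_NAME    STATUS
---------- ---------------- ------------
         1 TESTDB1          OPEN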

If we look at the grid logs, located in:

/home/grid/app/11.2.0/grid/log/<node name>/alert<node name>.log

we see that one node reports errors saying that a node is being shut down, while the other node reports errors saying that the ASM disk group is inaccessible.
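To follow these messages as they are written, the log can be tailed on each node (same path as above, substituting your node name):

root#> tail -f /home/grid/app/11.2.0/grid/log/<node name>/alert<node name>.log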

Bring the network interface back up with:

root#> ifconfig eth1 up

Querying gv$instance again still shows only one instance, so the evicted node does not automatically rejoin the cluster after an interconnect failure.

To restart it, as root, restart the ohasd service on the node where you took the network down:

root#> service ohasd stop
root#> service ohasd start

After a few minutes you will find that the second instance has rejoined the cluster and a second row appears in gv$instance.
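The state of the cluster resources can also be watched from the Grid home while this happens, for example with crsctl, which lists each resource and its current state on both nodes:

root#> /home/grid/app/11.2.0/grid/bin/crsctl status resource -t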

As can be seen, it is wise to guard against this, so we will introduce a second interconnect.

First, let's list our networks:

oracle_node1$> /home/grid/app/11.2.0/grid/bin/oifcfg getif
eth0 192.168.100.0 global public
eth1 192.168.200.0 global cluster_interconnect

Here we can see that the 192.168.200.0 network is being used as a cluster interconnect.

Now create a new network interface on both nodes and give it a separate address range. For the purposes of this document I will use:

node1: eth3 192.168.230.1
node2: eth3 192.168.230.2
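How the interface is brought up depends on your platform; as a quick, non-persistent example on Linux, assuming a /24 netmask (a permanent configuration would normally go in the distribution's network scripts instead):

root_node1#> ifconfig eth3 192.168.230.1 netmask 255.255.255.0 up
root_node2#> ifconfig eth3 192.168.230.2 netmask 255.255.255.0 up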

Now add the additional interconnect:

oracle_node1$> /home/grid/app/11.2.0/grid/bin/oifcfg setif -global eth3/192.168.230.0:cluster_interconnect

Check it has been added:

oracle_node1$> /home/grid/app/11.2.0/grid/bin/oifcfg getif
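The output should now list both interconnect networks, along the lines of:

eth0 192.168.100.0 global public
eth1 192.168.200.0 global cluster_interconnect
eth3 192.168.230.0 global cluster_interconnect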

Now restart the clusterware on each node:

root_node1#> service ohasd stop
root_node1#> service ohasd start

Wait for it to start fully and for the database to be shown as open in gv$instance before doing the second node:

root_node2#> service ohasd stop
root_node2#> service ohasd start

Now we can test by taking down the interface again. This time the interconnect stays up (traffic fails over to the other link), and when querying gv$instance we still see both instances.
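For example, repeating the earlier failure test on node 1:

root_node1#> ifconfig eth1 down

sqlplus> select * from gv$instance;

Both instances remain listed, and the interface can be brought back up afterwards with ifconfig eth1 up.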
