linuxsysadmin: quick point in time Oracle database recovery on Netapp

As we have now started to move our production databases onto the Netapp, we thought we should do some testing to prove it can yield advantages in our infrastructure.

To summarise our setup first, we have three elements to our database setup on netapp:

1. The database server - this holds Oracle home, redo logs and controlfile
2. Netapp cluster node a - this holds the dbfs for the database, along with the controlfile
3. Netapp cluster node b - this holds the redo logs, flash recovery area (including archive log) and controlfile

We therefore think this offers us enough resilience in that we can lose any one of the above three elements in our setup without any data loss.

The Netapp volumes (items 2 & 3 above) are snapshotted every 2 hours during the working day, to reduce the restoration time should anything happen. This is done via a cron job on the Oracle host which co-ordinates this, with putting the database into hot backup mode to ensure consistency. It also backs up the controlfile to trace, so that we can recreate it if necessary.

In this test we wanted to test restoration of a database to a time prior to a corruption occuring.

The tests were carried out on a production database, so we therefore wanted a way to simulate this. So we performed the following.

1. Ran the database backup script - this produced Netapp snapshots of both the dbf and logs volumes - time A

2. Created a test user in the database and created a test table A under this schema - time B

3. Waited 5 minutes

4. Created a test table C under our test user schema - time C

5. Ran our database backup script again - time D

In our test we then assumed the corruption occurred just before creating the second table, ie just before time C, we therefore wanted to restore the database to a time between B and C, whereby our test user should only have 1 object (table A).

We obviously did not want to do this restoration against a production database, so we would identify the files we would need and would copy these to another database server and restore it there, effectively making a copy of the database at a time between B and C.

First, we need to identify the dbf snapshot, from before the corruption, ie that created at time A above. Login to the Netapp and issue the command:

snap list DBF_VOLUME

Where DBF_VOLUME is replaced with the volume name of your volume containing the dbfs.

This will produce a list of snapshots against the volume. From this list it should be easy to find the one corresponding to our time A.

We then need to go into this snapshot directory and copy its contents to our separate database host.

Next we need to identify the snapshot containing the logs from a time after the restore point, ie time D.

Repeat the snap list command against the logs volume and identify the one corresponding to time D. Copy the contents of this snapshot to our separate database host.

We now have everything to restore our database.

First, as this is on a secondary host, the location of the files nay be different so we can edit the trace version of the controlfile script, to reflect our new locations.

Once this is done, we can run this against the downed database in our secondary host.

Next, we need to issue the recover database command to roll the database forward from time A to a time between B and C. To do this login to the database as the sys user as sysdba:

sqlplus / as sysdba

Then recover the database:

sqlplus> recover from '.....' database until time '#time#';

Where the ..... is the location of the flash recovery area on the secondary Oracle host and #time# is a time between time B and C in the format: yyyy-mm-dd:hh24:mi:ss

You may find that the recovery still asks for the archivelog files in turn, if it does, type in the actual location - there should not be many as the period of time we are covering is small (just over 5 minutes)

Once the recovery is finished it should return the message:

Media recovery complete

We now need to open the database, however as we have done a point in time recovery, we need to add the resetlogs parameter:

sqlplus> alter database open resetlogs;

We can now log in as the test user and confirm that it only contains table A.

We have proved the concept works and by regularly (every 2 hours) snapshotting the database this method of restoration should remain quick and allow us to respond well to the demands of the business.

Published with Blogger-droid v1.6.7

linuxsysadmin

Sunday, 13 March 2011

quick point in time Oracle database recovery on Netapp

No comments:

Post a Comment