xsdb concurrency and recovery

This document describes the methodology used by xsdb for providing concurrency control and recovery.

Overview

The purpose of concurrency control is to guarantee that transactions do not interfere with eachother -- in particular to guarantee that the transaction history (which may allow several transactions to execute at once) is equivalent to some hypothetical serial execution history for the transactions.

The purpose of recovery is to make sure that all operations for committed transactions are reflected correctly in the database and no partially committed transactions are reflected in the database when the system restarts after a planned or unplanned termination. Concurrency control and recovery for the xsdb system are closely related.

The xsdb implements concurrency control and recovery at the StampedHierarchy and ShadowHierarchy levels. The general approaches adopted are timestamping combined with logging with deferred updates. Timestamp controlled write locks are also used to avoid cascading rollbacks.

TimeStamping

Each transaction T is assigned a unique fixed timestamp TS(T) which represents the relative order that the transaction should appear to have executed when viewed as a hypothetical serial execution. A transaction T1 is said to be younger than a transaction T2 if TS(T1) > TS(T2).

Each data item X is assigned two timestamp variables WTS(X) (the write timestamp of X) and RTS(X) (the read time stamp of X). The WTS(X) should always contain the timestamp of the youngest transaction to successfully write X. The RTS(X) should always contain the timestamp of the youngest transaction to successfully read X.

The timestamp protocol adopted by xsdb makes sure that T cannot read X whenever X has been written by a transaction younger than T (TS(T) < WTS(X)) and that T cannot write X whenever X has been read or written by a younger transaction (TS(T) < WTS(X) or TS(T) < RTS(X)). [Thomas's write rule is not used because it wasn't clear that it will work correctly for the hierarchy structure, and since almost all writes to X will be preceded by a read from X for each transaction.]

Write locks are also used to guarantee that no transaction can read an uncommitted write. A transaction that attempts to read, write, or lock an item which is already locked is aborted if it violates the timestamp protocol or it is suspended if it does not violate the protocol. A suspended transaction must retest for locking and for timestamp consistency when it resumes (because timestamps may have changed or the item may be "relocked" in the interval of suspension).

Write locks have 3 states: unlocked, locked for creation or update, and locked for deletion. When an object is created or modified its parent must either already exists and not be locked for deletion or its parent must be locked for creation or update. When a collection is locked for deletion all its children must be locked for deletion also.

Once a transaction has obtained write locks on all objects modified by that transaction the transaction is guaranteed to be able to complete without any timestamp violations.

Logging

Since the system uses timestamping there is no need for a global log to record all system updates as they occur.

Each transaction is associated with 3 resources managed by the ShadowHierarchy: the transaction log resource, the transaction commit resource and the transaction checkpoint resource.

The transaction log resource stores all writes executed for the transaction, and the log must be written before the writes have been effected and before the transaction has committed.

The transaction commit resource when created indicates that the transaction has committed and that all writes will eventually complete for that transaction.

The transaction checkpoint resource when created indicates that all writes for the transaction have completed and that the log file need not be used for recovery and may be deleted.

On logging and redoing transactions

Logged operations are run in stages:

Create collections (in order of increasing path length).
Update or create resources.
Delete resources.
Delete collections (in order of decreasing path length).

Locks are also taken out in the same order.

If a redo attempts to create a collection which exists it is not an error.

If a redo attempts to delete a collection or resource which doesn't exist it is not an error.

Creates are run in order of increasing path length (to guarantee parents are created before children).

Deletes are run in order of decreasing path length (to guarantee deletion of children before deletion of parent).

Every data item will be modified (created, deleted, changed) at most once by each transaction (after other modifications have been deferred and accumulated). Consequently the data item may be unlocked at the time of the modification. The ShadowHierarchy is responsible for accumulating updates before commit.

Transaction Life cycle

A transaction accesses the underlying database via a Lattice for that transaction which in turn communicates with a ShadowHierarchy for that transaction. The ShadowHierarchies communicate with the single StampedHierarchy for the database instance

Active Transaction: Before a transaction attempts to commit or abort it is said to be active. While the transaction is active all updates are deferred in the ShadowHierarchy for the transaction. Any read attempt that results in a timestamp violation causes the transaction to abort (with no action needed since permanent updates were deferred).

Partial Commit: When a transaction attempts a commit it enters the partially committed state. The transaction first attempts to obtain all write locks it requires (which if successful guarantees all writes can complete). If it cannot obtain the locks because of a timestamp violation the transaction fails. Otherwise the transaction writes its log resource and commit resource and is considered commited.

Commit: A committed transaction must perform all deferred writes and release all write locks it possesses. Once the commit is complete the transaction creates its checkpoint resource. At this point the logging information for the resource may be (optionally) deleted and the checkpoint resource may be deleted after that.

Abort: At any time before a commit a transaction may enter the aborted state. An aborted transaction must release any write locks it possesses. At abort the log file for the transaction may be (optionally) delete also. Because updates only happen after the transaction commit no updates must be undone when a transaction aborts.

Recovery Procedure

When the system recovers from an uncontrolled termination it looks for transaction commit resources that have no corresponding checkpoint resource -- these are the transactions which must be redone to result in recovery.

The transactions that must be redone are addressed in the order of their timestamps. The log for the transaction is recreated from the log resource and the commit procedure is executed for the transaction. Since recovery is done in timestamp order, serially, there are no possible timestamp conflicts during the execution of the recovery procedure.

The recovery operation is performed by the StampedHierarchy object.

return to index