Monday, September 20, 2010

A budget HA disk stack

Highly available disk stacks are nothing new. At the time of writing, Dell will happily sell you a no-single-point-of-failure MD3000 (SAS) or an MD3000i (iSCSI) array with a pair of 146GB 15K RPM SAS drives for about $4,500. Not bad, eh? Still, if you’re on a first name basis with Linux and have a couple of machines to spare, you can set up a shared-nothing disk cluster for next to nothing.

Just how might that be? The good folks at LINBIT have kindly offered their Distributed Replicated Block Device (DRBD) under the GPL license. DRBD is an online disk clustering suite that, in their own words, can be seen as a "network-based raid1".

About DRBD

DRBD works by injecting a thin layer in between the file system (and the buffer cache) and the disk driver. The DRBD kernel module intercepts all requests from the file system and splits them down two paths – one to the actual disk and another to a mirrored disk on a peer node. Should the former fail, the file system can be mounted on the opposing node and the data will be available for use.
DRBD works on two nodes at a time: one is given the primary role, the other the secondary role. Reads and writes can only occur on the primary node. The secondary node must not mount the file system, not even in read-only mode. This last point requires some clarification. While it's true that the secondary node sees all updates made on the primary node, it can't expose these updates to the file system, because DRBD is completely file system agnostic. That is, DRBD has no explicit knowledge of the file system and, as such, has no way of communicating the changes upstream to the file system driver.

The two-at-a-time rule does not actually limit DRBD to operating on only two nodes. DRBD supports further "stacking", where a higher-level DRBD device, appearing as an ordinary block device to the operating system, forks writes to a pair of lower-level block devices which are themselves DRBD devices (and so on).
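To make the stacking idea concrete, here is a rough sketch of what a stacked resource might look like in /etc/drbd.conf. It is not part of this article's setup: the upper resource name (r0-U), the third host 'backup-node', the device and the addresses are all illustrative assumptions, so verify the exact syntax against the DRBD documentation for your version.
resource r0-U {
  protocol A;                      # asynchronous replication to the remote third node
  stacked-on-top-of r0 {           # the lower-level resource r0 acts as the backing device
    device  /dev/drbd10;
    address 192.168.100.10:7790;
  }
  on backup-node {                 # hypothetical off-site node
    device    /dev/drbd10;
    disk      /dev/sdb1;
    address   192.168.200.30:7790;
    meta-disk internal;
  }
}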

Replication takes place using one of three protocols:

Protocol A queues the data written on the primary node for transmission to the secondary node, but doesn't wait for the secondary node to confirm receipt of the data before acknowledging to its own host that the data has been safely committed. Those familiar with NFS will recognise this as "asynchronous replication". Being asynchronous, it is the fastest of the replication protocols but suffers from one major drawback: the failure of the primary device does not guarantee that all of the data is available on the secondary device. However, the data on the secondary device is always consistent, that is, it accurately represents the data stored on the primary device at the time of the last synchronisation.

Protocol B awaits the response from the secondary host prior to acknowledging the successful commit of the data to its own host. However, the secondary host is not required to immediately persist the replicated changes to stable storage – it may do so some time after confirming the receipt of the changes from the primary host. This ensures that, in the event of a failure, the secondary node is not only consistent, but completely up-to-date with respect to the primary node’s data. In the authors’ own words, this protocol can be seen as “semi-synchronous” replication. This protocol is somewhat slower than protocol A, as it exercises the network on each and every write operation.

Protocol C not only awaits the response from the secondary host, but also mandates that the secondary host secures the updates to stable storage prior to responding to the primary. Because of the added disk I/O overhead, Protocol C is considerably slower than protocol B. Drawing back to our NFS example, this protocol equates to fully synchronous replication.
The protocols above represent varying levels of assurance with respect to the integrity of the data replication process, trading speed for safety. Protocol A is the fastest of all, but is the least safe. Protocol C offers the most resilience to failure, but incurs the most latency. LINBIT claim that most customers should be using protocol C. This is debatable: protocol B is just as safe while incurring far less overhead. Protocol B only comes unstuck if both nodes black out or power cycle at exactly the same time, a scenario that should be guarded against with a UPS and/or redundant power lines. If redundant power is not available, protocol C is, indeed, the most appropriate.
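For reference, the protocol is selected in /etc/drbd.conf. The configuration used later in this article sets it once in the common section, so every resource inherits it; to the best of our knowledge it can also be overridden per resource, as the hypothetical resource r1 below illustrates:
common {
  protocol C;      # default for every resource defined in this file
}
resource r1 {
  protocol A;      # illustrative per-resource override
  ...
}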


Setting up DRBD

Obtaining DRBD

DRBD has been incorporated into the Linux kernel since 2.6.33. If you’ve been blessed with an older kernel but are a paying customer of LINBIT, you might be provided with a pre-built package to match your distribution. But since this is an “on a budget” thing, you’ll just have to download a tarball distribution from the DRBD website (or get a recent kernel). The following instructions apply to DRBD version 8.3.8.1.
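Before reaching for the tarball, it may be worth checking whether your kernel already ships a DRBD module; if modinfo reports a module of version 8.3 or later, you can skip the build entirely:
$ uname -r          # DRBD is in the mainline kernel from 2.6.33 onwards
$ modinfo drbd      # prints the module version if one is already available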

Building DRBD

You’re probably familiar with the famous Linux trio: configure-make-install. This one’s no different, although you do have to specify an extra switch or two to get the build going.
$ ./configure --with-km --sysconfdir /etc
$ make
# make install

NB: In every relevant place, the DRBD documentation states that configuration files will be searched for in the sequence /etc/drbd-83.conf, /etc/drbd-08.conf, then /etc/drbd.conf. However, the header file (user/config.h) generated by running the configure script points to the /usr/local/etc directory instead, contradicting all documentation, including the man pages. The --sysconfdir switch overrides this behaviour. Furthermore, according to the source code of version 8.3.8.1 (user/drbdadm_main.c), there is an extra configuration file, drbd-82.conf, that is searched after drbd-83.conf but has been omitted from the documentation. Our recommendation to the folks at LINBIT would be to either switch the default configuration directory to /etc, or update the documentation to indicate otherwise.

Verifying the build

After building, load the module to verify that it was built correctly:
# modprobe drbd 
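As a quick sanity check (optional, but cheap), confirm that the module is actually resident:
# lsmod | grep drbd
# cat /proc/drbd
At this stage /proc/drbd will only report the module version; resource status appears once DRBD has been configured and started.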

If modprobe fails to load the module, it may be because DRBD placed the module in the wrong directory, one that doesn't match your kernel release (this wouldn't be the first time DRBD got confused). You can search for the module like so:
# find /lib/modules -name drbd.ko 

If you find the module, copy it to your /lib/modules/`uname -r`/kernel/drivers/block directory. Having done that, register the module:
# depmod -a 

Alternatively, enter the drbd subdirectory of the DRBD source tree and run the following (this time forcing the kernel revision):
$ make clean
$ make KDIR=/lib/modules/`uname -r`/build
# make install

Then try running modprobe drbd again.

Configuring DRBD

The layout of the DRBD disk cluster must be described in a single configuration file located at /etc/drbd.conf. In our example, replication will take place over two virtual machines, interconnected by a single private link. The machines are named ‘spark’ and ‘flare’. Both hosts will replicate the /dev/sda3 block device. The corresponding configuration file is shown below:
global {
  usage-count yes;
}
common {
  protocol C;
}
resource r0 {
  device /dev/drbd1;
  disk /dev/sda3;
  meta-disk internal;
  on spark {
    address 192.168.100.10:7789;
  }
  on flare {
    address 192.168.100.20:7789;
  }
}
It’s obvious from the configuration that protocol C is being employed. The resource section lists the details of a single resource named r0. (DRBD may have multiple resources configured and operational.) The two on sections represent the configuration specific to the nodes ‘spark’ and ‘flare’. The device, disk and meta-disk entries are common to both nodes; if any of these items were to differ between the two nodes, you would be expected to move them down into the on sections. The address entries will invariably differ between the two nodes. I feel compelled to say that the two addresses must be cross-routable, and appropriate arrangements must be made to allow DRBD traffic to traverse any firewalls on the ports nominated in the address entries.
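The same configuration file must be present on both nodes, so copy it across (scp will do). On each node you can then check that the file parses by asking drbdadm to dump the configuration as it understands it:
# drbdadm dump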

Configuring the metadata

DRBD requires a dedicated storage area on each node for keeping metadata – information about the current state of synchronisation between the DRBD nodes.

Metadata can be external, in which case you must dedicate an area on the disk outside of the partition you wish to replicate. External metadata can offer the greatest performance since you can employ a second disk on each node to parallelise I/O operations.

Metadata can also be internal, that is, inlined with the partition being replicated. This mode offers worse I/O performance than external metadata. It is somewhat simpler, however, and has the advantage of keeping the metadata coupled to the real data, in case you have to physically relocate the disk. Internal metadata is placed at the end of the partition or Logical Volume (LV) holding the target file system. To prevent the metadata from overwriting the end of the file system, the latter must first be shrunk to make room for the metadata.

In our example we’ll be using internal metadata. In either case, metadata takes up some space on the device; the space varies depending on the size of the replicated file system. Before determining the size of the metadata, we must accurately gauge the size of the file system to be replicated. When we talk about sizes, we refer to the raw size of the file system, i.e. the amount of space it takes up on the disk – not the amount of usable space the file system presents to the applications. The best way to determine the size of the file system is to look at the size of the underlying partition or LV, since file systems tend to occupy the entire partition/LV. We’ll use the parted utility in our example of replicating /dev/sda3 – a 4GB Ext3 partition.
# parted /dev/sda3 unit s print
Model: Unknown (unknown)
Disk /dev/sda3: 8193150s
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End       Size      File system  Flags
 1      0s     8193149s  8193150s  ext3


Determine the size of the metadata (in sectors). For internal metadata it is given by:
ceiling(Size / 2^18) x 8 + 72 = ceiling(8193150 / 262144) x 8 + 72 = 32 x 8 + 72 = 328
(where Size is the size of the backing device in sectors, and the ceiling function rounds its input up to the nearest integer)

NB: The observant among you will notice that the actual requirement of the internal metadata block size will be smaller than the stated figure, because by shrinking the file system we’re decreasing the demand for metadata. Still, the difference in size will be negligible, and it’s simplest to compute the metadata block size from the pre-shrunk size.
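The arithmetic above is easy to script. A minimal sketch, assuming a 512-byte sector size and using blockdev to obtain the device size in sectors (DEV is a placeholder for your backing device):
# DEV=/dev/sda3
# SIZE=$(blockdev --getsz $DEV)                  # device size in 512-byte sectors
# MD=$(( (SIZE + 262143) / 262144 * 8 + 72 ))    # ceiling(SIZE/2^18) x 8 + 72
# echo "metadata: ${MD} sectors, resize the FS to $(( SIZE - MD ))s"
For our 8193150-sector partition this prints 328 metadata sectors and a target file system size of 8192822 sectors, matching the figures below.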

Check file system for errors (Ext2/Ext3 file systems):
# e2fsck -f /dev/sda3 

Calculate the new size of the FS, allowing for DRBD metadata:
given by: Size - metadata size = 8193150 - 328 = 8192822 sectors

Resize the file system:
# resize2fs /dev/sda3 8192822s 

Finally, create the metadata block:
# drbdadm create-md r0 

Loading DRBD on startup

In most cases it’s desirable to load the DRBD kernel module and activate DRBD replication on start-up. DRBD is distributed with an init script for just this purpose. (Replace {DRBD_DIR} with the directory DRBD was unpacked to.)
# cp {DRBD_DIR}/scripts/drbd /etc/rc.d/init.d
# chkconfig --add drbd
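chkconfig is specific to Red Hat style distributions. On Debian or Ubuntu the equivalent would be something along the lines of the following (adjust the init script path to suit your distribution):
# cp {DRBD_DIR}/scripts/drbd /etc/init.d
# update-rc.d drbd defaults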

Activating DRBD

Start the daemon:
# service drbd start
Observe the status of the disks:
$ cat /proc/drbd
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@flare, 2010-08-04 20:45:00
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4096408

The Inconsistent/Inconsistent disk state is expected at this point. This simply means that the disks have never been synchronised.

Initial synchronisation

The next step is the initial synchronisation, which involves the complete overwrite of the data on one peer’s disk, sourced from the disk of the other peer. You must select which of the peers contains the correct data, and issue the following command on that peer:
# drbdadm -- --overwrite-data-of-peer primary r0 

Now, on either of the peer nodes, do:
$ watch "cat /proc/drbd" 

You will see a progress bar, similar to the one below:
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@flare, 2010-08-04 20:45:00
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:24064 dw:24064 dr:0 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4072344
    [>....................] sync'ed: 0.7% (4072344/4096408)K
    finish: 2:49:40 speed: 324 (320) K/sec

Depending on the size of your file system and the speed of the network, this operation may take some time to complete. Using a pair of virtual machines and a virtual internal network, a 4GB Ext3 file system took about 3.5 hours to synchronise. That said, you should be able to start using the primary disk as soon as it’s up, without waiting for the synchronisation process to complete. However, refrain from performing any mission-critical operations on the primary file system until the initial synchronisation completes (even if using protocol C).

Mounting the file system

Next, we can mount the disk on the primary node. But first, we must ensure that one node is selected as the primary node. On the primary node, issue the following:
# drbdadm primary r0 

Observe the output of cat /proc/drbd, having made a node primary:
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@spark, 2010-08-06 08:01:01
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:32768 nr:0 dw:0 dr:32984 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

The output of cat /proc/drbd on the secondary node should be very similar, only the Primary/Secondary roles will appear reversed.

The order of our HA disk stack (lowest level first) is as follows:
Physical disk partition, LVM (if applicable), DRBD, file system
When mounting the disk, we refer to the special DRBD block device, rather than the actual device (e.g. /dev/sda3). Like real partitions, DRBD devices are suffixed with a 1-based index. For convenience, it’s worth appending the following entry to the end of the /etc/fstab file:
/dev/drbd1 /mnt/drbd1 ext3 noauto 0 0 

The noauto option in the ‘mount options’ column tells the operating system to refrain from mounting the device at startup. Otherwise, one of the nodes would invariably fail trying to mount the file system, as only one node can have the file system mounted at any given time.
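The fstab entry also assumes that the /mnt/drbd1 mount point exists. If it doesn't, create it on both nodes (both, because either node may be the one mounting the file system after a failover):
# mkdir -p /mnt/drbd1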

Now mount the block device:
# mount /dev/drbd1
NB: Because of the entry in /etc/fstab we don’t have to specify a mountpoint to the mount command.

So there you have it: a highly available, no-single-point-of-failure disk stack for the price of a pair of Linux boxes. And all in the time it took you to drink 17 cups of coffee.

Further reading

Gridlock

DRBD fully integrates with Gridlock – the world’s best high availability cluster. Whether you’re after a high performance, highly available shared-nothing architecture, or off-site replication and disaster recovery, Gridlock is up to the challenge.

The problem with using Linux-based (or any other OS-specific) clustering software is that you’ll always be tied to the operating system.

Gridlock, on the other hand, works at the application level and isn’t coupled to the operating system. I think this is the way forward, particularly seeing that many organisations run a mixed bag of Windows and Linux servers, and being able to cluster Windows and Linux machines together can be a real advantage. It also makes installation and configuration easier, since you don’t need separate instructions for a dozen different operating systems and hardware configurations.

The other neat thing about Gridlock is that it doesn’t use quorum and doesn’t rely on NIC bonding/teaming to achieve multipath configurations – instead it combines redundant networks at the application level, which means it works on any network card and doesn’t require specialised switchgear.

Split brain

When running in an active-standby configuration, only one DRBD node can be made primary at any given time. Two (or more) disks coexisting in the primary state can result in the branching of the data sets. Stated otherwise, one node could have changes not visible to its peer, and vice versa. This condition is known as a split brain. When the drbd daemon is started, it will check for a split brain condition, and abort synchronisation while appending an error message to /var/log/messages.
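A quick way to confirm a suspected split brain is to search the system log for DRBD's message; the exact wording may vary between versions, so treat the pattern below as a guide rather than gospel:
# grep -i "split-brain" /var/log/messages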

The first step in recovering from a split brain condition is to identify the changes made to both nodes following the split brain event. If both nodes have important information that needs to be merged, it’s best to back up one of the nodes (call it node A, or the trailing node) and re-sync data from the other node (node B, or the leading node). When the re-sync is complete, both nodes will contain the data set of node B, with the latter being the primary node. Following that, demote node B to secondary status, and promote node A to primary status. Hand-merge the changes from the backup data set on node A – these changes will propagate to node B.

On the trailing node, backup the data and issue the following commands:
# drbdadm secondary r0
# drbdadm -- --discard-my-data connect r0

On the leading node, do:
# drbdadm primary r0
# drbdadm connect r0

Observe /proc/drbd – it should now show the nodes synchronising.
Having synchronised the nodes, reverse the roles and manually merge the changes on the new primary node.

Startup barrier

By default, when a DRBD node starts up, it waits for its peer node to start. This prevents a scenario where the cluster is booted with only one node, and mission-critical data is written without being replicated onto the peer’s disk. The default timeout is ‘unlimited’, that is, a node will wait indefinitely for its peer to come up before proceeding with its own boot sequence. Despite this, DRBD will present you with an option to skip the wait. To control the timeouts, add a startup section inside the resource section, as shown below:
startup {
   wfc-timeout 10;
   degr-wfc-timeout 10;
   outdated-wfc-timeout 10;
}
In this example, we have explicitly specified the timeout to be 10 seconds. So the node will allow some time for its peer to come up, but the absence of the peer won’t prevent the node from booting.

Synchronisation options

DRBD’s synchronisation defaults are tuned for slow machines on slow network connections. This is unfortunate, as the out-of-the-box configuration requires quite a bit of tinkering to perform well on even the most basic modern hardware. The default synchronisation rate is capped at around 250 KB/s, which is roughly 2% of a 100 Mbps LAN. While the presence of a throttling feature is good, its default setting is too conservative. Furthermore, DRBD by default transmits every block that it thinks may be out of sync. Compare this with the rolling checksums and compression used by tools such as rsync. While compression is not yet an option, it is possible to tell DRBD to compare a digest of each block with the primary’s copy, and only transfer the block if the digests differ. Bear in mind, though, that the use of a checksum trades CPU cycles for bandwidth. A more generous throttle cap and the use of MD5 checksums for a faster resync can be specified by adding a syncer section to the common section, as shown below:
common {
  ...
  syncer {
    rate 5M;
    csums-alg md5;
  }
  ...
}
In the example above, the sync rate has been capped at 5 MB/s, which is in the region of 40-50% of what a 100Base-T Ethernet fabric can deliver once TCP/IP framing overheads are taken into account. This configuration uses the MD5 algorithm to compute digests over the replicated blocks, which must be supported by your kernel (most will). The two settings are completely independent: one can specify a new throttle without setting a checksum algorithm, and vice versa.
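If DRBD is already running when you make these changes, you should not need a full restart; asking drbdadm to adjust the running configuration ought to pick up the new syncer settings:
# drbdadm adjust r0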

For more information, have a browse through http://mazecard.com.au.