Storage on Cluster DRBD and GFS2
From PiemonteWireless.
How to create good shared storage for a cluster (using DRBD and GFS2)
I tried different ways to create shared storage for my cluster.
My first and most important rule (after performance) was synchronization in both directions, so I can't use a simple rsync command.
I tested GlusterFS, but performance was very low; then I tried Unison, but it was pretty unstable and I had a lot of problems just compiling it.
So I decided to use block-level replicated storage with a cluster file system on top, and I chose DRBD with GFS2, because I had read a lot of good reports about them.
What follows is my experience with them, using my preferred Linux distro: Ubuntu.
DRBD part
Install prerequisites (needed packages)
apt-get install ethstats dpatch patchutils cogito git-core sp docbook-utils docbook build-essential flex dpkg-dev fakeroot module-assistant
Compile and install DRBD
This is the Ubuntu (Debian-like) way to build both the kernel module and the packages:
mkdir -p /opt/source/drbd
cd /opt/source/drbd
git clone git://git.drbd.org/drbd-8.3
cd drbd-8.3/
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i drbd8-module-source_8.3.0-0_all.deb drbd8-utils_8.3.0-0_amd64.deb
module-assistant auto-install drbd8
Configure DRBD
This example shows a simple two-machine configuration file. Some definitions:
Server 1
name: machineA
ip: 192.168.50.10
disk: /dev/sda1
Server 2
name: machineB
ip: 192.168.50.12
disk: /dev/sda3
You need to edit the file /etc/drbd.conf with the following content:
# drbd.conf edited by Simone (2009)
global {
    usage-count no;
}
common {
    protocol C;
    syncer { rate 10M; }
}
resource r0 {
    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    }
    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }
    disk {
        on-io-error detach;
    }
    net {
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }
    syncer {
        rate 10M;
    }
    on machineA {
        device /dev/drbd0;
        disk /dev/sda1;
        address 192.168.50.10:7788;
        flexible-meta-disk internal;
    }
    on machineB {
        device /dev/drbd0;
        disk /dev/sda3;
        address 192.168.50.12:7788;
        meta-disk internal;
    }
}
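DRBD matches each "on <name>" section against the node's own host name (the output of uname -n), so a misspelled name in drbd.conf means the node cannot find its section. The following is a minimal sketch for checking this before starting DRBD; drbd_conf_hosts and drbd_conf_has_host are hypothetical helper names, not part of the DRBD tools.

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD): print the host names used in
# "on <host> {" sections of a drbd.conf-style file, one per line.
drbd_conf_hosts() {
    awk '$1 == "on" && $3 == "{" { print $2 }' "$1"
}

# Hypothetical helper: succeed only if the given node name (normally the
# output of `uname -n`) has its own "on" section in the file.
drbd_conf_has_host() {
    drbd_conf_hosts "$1" | grep -qxF -- "$2"
}

# Example use on a real node (commented out so the sketch has no side effects):
#   drbd_conf_has_host /etc/drbd.conf "$(uname -n)" || echo "no section for this host"
```

Run on each machine, the commented example prints a warning when the local host name has no section of its own.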
Create a partition of (roughly) the same size on each of the two servers.
On machineA:
mkfs.ext3 /dev/sda1
On machineB:
mkfs.ext3 /dev/sda3
The partitions have exactly the same size
Run this command on both machines:
drbdadm create-md r0
The partitions differ slightly in size
If the partitions differ slightly in size, run the following command FIRST on the machine with the smaller disk:
drbdadm create-md r0
If it complains about the disk size, like this:
The server's response is:
you are the 2nd user to install this version
md_offset 146778664960
al_offset 146778632192
bm_offset 146774151168
Found ext3 filesystem which uses 143338544 kB
current configuration leaves usable 143334132 kB
Device size would be truncated, which
would corrupt data and result in
'access beyond end of device' errors.
You need to either
* use external meta data (recommended)
* shrink that filesystem first
* zero out the device (destroy the filesystem)
Operation refused.
Then (only if the previous command complained) run this command on both machines, using your own backing partition in place of /dev/cciss/c0d1p1:
e2fsck -f /dev/cciss/c0d1p1 && resize2fs /dev/cciss/c0d1p1 143334132K
where 143334132 is the usable size in kB reported in the output of the previous command.
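To avoid copying the number by hand, the usable size can be pulled out of the create-md output with a small filter. This is only a sketch: drbd_usable_kb is a hypothetical helper name, and it assumes the message has the same wording as the output quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read `drbdadm create-md` output on stdin and print
# the number from the "current configuration leaves usable N kB" line.
# Prints nothing if no such line is present.
drbd_usable_kb() {
    sed -n 's/.*leaves usable[^0-9]*\([0-9][0-9]*\) kB.*/\1/p'
}

# Example use, assuming the output was saved to a hypothetical file create-md.log:
#   size=$(drbd_usable_kb < create-md.log)
#   e2fsck -f /dev/cciss/c0d1p1 && resize2fs /dev/cciss/c0d1p1 "${size}K"
```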
Start DRBD
Run this command on both machines:
/etc/init.d/drbd start
Then run ONLY ON machineA:
drbdsetup /dev/drbd0 primary -o
WAIT FOR THE SYNCHRONIZATION TO FINISH! You can check the status with:
cat /proc/drbd
My output at this stage was:
version: 8.3.0 (api:88/proto:86-89)
GIT-hash: fb12a0c50f88409dab4779169698b82909e21eb0 build by root@crono, 2009-01-17 03:41:20
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r---
ns:3911264 nr:0 dw:0 dr:3911264 al:0 bm:238 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:139422868
[>....................] sync'ed: 2.8% (136155/139974)M
finish: 2:32:52 speed: 15,040 (4,016) K/sec
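Instead of re-running cat by hand, the wait can be scripted by polling the status text until both disks report UpToDate. A minimal sketch, assuming the 8.3-style /proc/drbd format shown above; drbd_is_uptodate and drbd_wait_sync are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if /proc/drbd-style text on stdin shows the
# given minor number ($1) with both disk states UpToDate.
drbd_is_uptodate() {
    grep -Eq "^[[:space:]]*$1: .*ds:UpToDate/UpToDate"
}

# Hypothetical helper: poll until minor $1 is fully synchronized.
# $2 is the status file to read (default /proc/drbd),
# $3 is the poll interval in seconds (default 10).
drbd_wait_sync() {
    while ! drbd_is_uptodate "$1" < "${2:-/proc/drbd}"; do
        sleep "${3:-10}"
    done
}

# Example use on machineA (commented out so the sketch has no side effects):
#   drbd_wait_sync 0 && echo "drbd0 is in sync"
```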
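Then, once the synchronization has finished, promote the second node as well. Run on machineB (dual-primary mode needs allow-two-primaries in the net section, as in the drbd.conf listed under FILES):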
drbdadm primary r0
At the end both nodes should be in the following state:
/etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.14 (api:86/proto:86)
GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil@fat-tyre, 2008-11-12 16:40:33
m:res cs st ds p mounted fstype
0:r0 Connected Primary/Primary UpToDate/UpToDate C
Install GFS2
mkdir build
cd build
apt-get source gfs2-tools libopenais-dev libvolume-id-dev
apt-get install quilt libselinux1-dev linux-libc-dev libvirt-dev libxml2-dev libncurses5-dev libnss3-dev libnspr4-dev libslang2-dev psmisc libnet-snmp-perl libnet-telnet-perl python-pexpect sg3-utils
cd openais-0.81
cp Makefile.inc Makefile.inc.old
sed s/CFLAGS\ +=\ -O3\ -Wall/CFLAGS\ +=\ -O0\ -Wall/ Makefile.inc.old > Makefile.inc
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i libopenais-dev_0.81-0ubuntu5_amd64.deb libopenais2_0.81-0ubuntu5_amd64.deb
cd udev-113
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i libvolume-id-dev_113-0ubuntu17.2_amd64.deb libvolume-id0_113-0ubuntu17.2_amd64.deb
cd redhat-cluster-suite-2.20070823.1
dpkg-buildpackage -rfakeroot -b -uc
cd ..
dpkg -i gfs2-tools_2.20070823.1-0ubuntu1_amd64.deb libcman2_2.20070823.1-0ubuntu1_amd64.deb libdlm2_2.20070823.1-0ubuntu1_amd64.deb cman_2.20070823.1-0ubuntu1_amd64.deb openais_0.81-0ubuntu5_amd64.deb
mkdir /etc/cluster
cp cluster.conf /etc/cluster/
vim /etc/cluster/cluster.conf
Change cluster.conf (listed under FILES at the end of this page) according to your hostnames and your disks.
Make sure your hostnames are in /etc/hosts.
GFS2 needs a running cman/openais cluster, so start it:
/etc/init.d/cman start
You can check that the nodes are up:
cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     32   2009-01-13 17:30:42  hostname1
   2   M     28   2009-01-13 17:30:42  hostname2
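A script can make the same check by counting the rows whose Sts column is M. This sketch assumes the column layout shown above and that M marks a current cluster member; cman_member_count is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count nodes whose Sts column is "M" in
# `cman_tool nodes` output read from stdin. Only rows that start with a
# numeric node id are considered, so the header line is ignored.
cman_member_count() {
    awk '$1 ~ /^[0-9]+$/ && $2 == "M" { n++ } END { print n + 0 }'
}

# Example use (commented out so the sketch has no side effects):
#   [ "$(cman_tool nodes | cman_member_count)" -eq 2 ] || echo "not all nodes joined"
```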
Create and Start GFS2 on top of DRBD
Create the GFS2 filesystem on the DRBD device using the dlm lock manager. MAKE THE FILESYSTEM ON ONE NODE ONLY.
mkfs.gfs2 -t cluster:gfs1 -p lock_dlm -j 2 /dev/drbd0
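In the command above, the -t argument has the form clustername:fsname, where clustername must match the name attribute of the cluster element in /etc/cluster/cluster.conf, and -j sets the number of journals (one per node that will mount the filesystem). The sketch below reads the cluster name from the file and only prints the resulting command; cluster_name and gfs2_mkfs_cmd are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the name="..." attribute of the <cluster>
# element in a cluster.conf-style file ($1).
cluster_name() {
    sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: build the mkfs.gfs2 command line for that cluster.
# The command is only printed, never executed.
# $1 = cluster.conf path, $2 = filesystem name, $3 = journals, $4 = device
gfs2_mkfs_cmd() {
    printf 'mkfs.gfs2 -t %s:%s -p lock_dlm -j %s %s\n' \
        "$(cluster_name "$1")" "$2" "$3" "$4"
}

# Example use (commented out so the sketch has no side effects):
#   gfs2_mkfs_cmd /etc/cluster/cluster.conf gfs1 2 /dev/drbd0
```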
Finally mount the device on both servers:
mount -t gfs2 /dev/drbd0 /mnt
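To confirm on each server that the device really is mounted as GFS2, the mount table can be checked. A minimal sketch, assuming the usual /proc/mounts column order (device, mount point, filesystem type); is_gfs2_mounted is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if device $1 appears with filesystem type
# gfs2 in a /proc/mounts-style file ($2, default /proc/mounts).
is_gfs2_mounted() {
    awk -v dev="$1" '$1 == dev && $3 == "gfs2" { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example use (commented out so the sketch has no side effects):
#   is_gfs2_mounted /dev/drbd0 || echo "/dev/drbd0 is not mounted as gfs2"
```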
You can use the file mountgfs2.sh (at the end of this page) to mount GFS2 at boot:
cp mountgfs2.sh /etc/init.d/
update-rc.d mountgfs2.sh start 70 2 3 4 5 . stop 07 0 1 6 .
Failures
If hostname2 fails for one reason or another, the following lines will appear in /var/log/syslog on hostname1:
Jan 13 22:04:10 hostname1 kernel: dlm: closing connection to node 2
Jan 13 22:04:10 hostname1 fenced[2543]: hostname2 not a cluster member after 0 sec post_fail_delay
Jan 13 22:04:10 hostname1 fenced[2543]: fencing node "hostname2"
Jan 13 22:04:10 hostname1 fenced[2543]: fence "hostname2" failed
During this time, access to the GFS2 filesystem is frozen.
Manual fencing (executed on the working machine) is needed to regain access to the shared partition:
fence_ack_manual -n hostname2
Once this is done, repair the failed node and reconnect it to the healthy one only when you are sure it is OK; this is necessary to avoid corruption!
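To spot this situation without reading the whole log, the failed-fence messages can be filtered out of syslog. This sketch assumes the message format shown in the log excerpt above; failed_fence_nodes is a hypothetical helper name, and it only reports node names, it does not acknowledge anything.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes for which fenced logged
# 'fence "<node>" failed' in syslog-style text read from stdin, one per
# line and without duplicates.
failed_fence_nodes() {
    sed -n 's/.*fenced\[[0-9]*\]: fence "\([^"]*\)" failed.*/\1/p' | sort -u
}

# Example use (commented out so the sketch has no side effects):
#   failed_fence_nodes < /var/log/syslog
# Check that each listed node is really down before running
# fence_ack_manual -n <node> for it.
```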
FILES
- drbd.conf
========================== <cut here> ============================
# DRBD8 HA /etc/drbd.conf configuration file
resource r0 {
    protocol C;  # protocol between devices
    startup {
        wfc-timeout 120;       # wait 2min for other peers
        degr-wfc-timeout 120;  # wait 2min if peer was already
                               # down before this node was rebooted
        become-primary-on both;
    }
    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";  # algo to enable peer authentication
        shared-secret "123456";
        # handle split-brain situations
        after-sb-0pri discard-least-changes;  # if no primary auto sync from the
                                              # node that touched more blocks during
                                              # the split brain situation.
        after-sb-1pri discard-secondary;      # if one primary
        after-sb-2pri disconnect;             # if two primaries
        # solve the cases when the outcome
        # of the resync decision is incompatible
        # with the current role assignment in
        # the cluster
        rr-conflict disconnect;  # no automatic resynchronization
                                 # simply disconnect
    }
    disk {
        on-io-error detach;  # detach the device from its
                             # backing storage if the driver of
                             # the lower_device reports an error
                             # to DRBD
    }
    syncer {
        rate 100M;
    }
    on hostname1 {
        device /dev/drbd0;
        disk /dev/sdb5;
        address 192.168.9.xx:7789;
        meta-disk internal;
    }
    on hostname2 {
        device /dev/drbd0;
        disk /dev/sdb5;
        address 192.168.9.xx:7789;
        meta-disk internal;
    }
}
========================== </cut here> ============================
- cluster.conf
========================== <cut here> ============================
<?xml version="1.0"?>
<cluster name="cluster" config_version="1">
  <!-- post_join_delay: number of seconds the daemon will wait before
                        fencing any victims after a node joins the domain
       post_fail_delay: number of seconds the daemon will wait before
                        fencing any victims after a domain member fails
       clean_start    : prevent any startup fencing the daemon might do.
                        It indicates that the daemon should assume all nodes
                        are in a clean state to start. -->
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="hostname1" votes="1" nodeid="1">
      <fence>
        <!-- Handle fencing manually -->
        <method name="human">
          <device name="human" nodename="hostname1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hostname2" votes="1" nodeid="2">
      <fence>
        <!-- Handle fencing manually -->
        <method name="human">
          <device name="human" nodename="hostname2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <!-- cman two nodes specification -->
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <!-- Define manual fencing -->
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>
</cluster>
========================== </cut here> ============================
- mountgfs2.sh
========================== <cut here> ============================
#! /bin/sh
# /etc/init.d/mountgfs2.sh
#
# Needs to be mounted after drbd start and
# unmounted before drbd stop
# update-rc.d mountgfs2.sh start 70 2 3 4 5 . stop 07 0 1 6 .
#
# Mount gfs2 partition on /synchronized
case "$1" in
    start)
        echo "Mounting gfs2 partition"
        mount -t gfs2 /dev/drbd0 /synchronized
        ;;
    stop)
        echo "Unmounting gfs2 partition"
        umount /dev/drbd0
        ;;
    *)
        echo "Usage: /etc/init.d/mountgfs2.sh {start|stop}"
        exit 1
        ;;
esac
exit 0
========================== </cut here> ============================