Ceph is a massively scalable, open source, distributed storage system. It comprises an object store, a block store, and a POSIX-compliant distributed file system. The platform can scale to the exabyte level and beyond, runs on commodity hardware, is self-healing and self-managing, and has no single point of failure. Ceph is in the Linux kernel and is integrated with the OpenStack™ cloud operating system. Thanks to its open source nature, this portable storage platform may be installed and used in public or private clouds.
I.1. RADOS?
The naming can easily be confusing: Ceph? RADOS?
RADOS: the Reliable Autonomic Distributed Object Store is an object store. RADOS takes care of distributing the objects across the whole storage cluster and replicating them for fault tolerance. It is built from three major components (a minimal configuration sketch follows the list):
Object Storage Daemon (OSD): the storage daemon, the RADOS service, the location of your data. You must have this daemon running on each storage server of your cluster. Each OSD has an associated hard drive; for performance purposes it is usually better to aggregate your hard drives with RAID arrays, LVM or btrfs pooling, so that a single daemon runs per server. By default, three pools are created: data, metadata and rbd.
Meta-Data Server (MDS): this is where the metadata is stored. MDSs build a POSIX file system on top of objects for Ceph clients. However, if you are not using the Ceph File System, you do not need a metadata server.
Monitor (MON): this lightweight daemon handles all the communication with external applications and clients. It also provides consensus for distributed decision making in a Ceph/RADOS cluster. For instance, when you mount a Ceph share on a client you point to the address of a MON server. It checks the state and the consistency of the data. In an ideal setup you will run at least 3 ceph-mon daemons on separate servers. Quorum decisions are reached by majority vote, which is why an odd number of monitors is expressly needed.
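To make the three roles concrete, here is a minimal ceph.conf sketch in the mkcephfs style used later in this article. Hostnames, IP addresses and paths are placeholders from my lab and should be adapted; treat it as an illustration, not a reference configuration.

[global]
        auth supported = cephx
        keyring = /etc/ceph/keyring.admin

[mon]
        mon data = /srv/ceph/mon$id

[mon.0]
        host = server-01
        mon addr = 172.17.1.4:6789

[mds]
        keyring = /etc/ceph/keyring.$name

[mds.0]
        host = server-01

[osd]
        osd data = /srv/ceph/osd$id
        osd journal = /srv/ceph/osd$id/journal
        keyring = /etc/ceph/keyring.$name

[osd.0]
        host = server-01

Repeat the mon.X, mds.X and osd.X sections for each of your servers.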
Ceph developers recommend using btrfs as the filesystem for the storage. Using XFS is also possible and might be a better alternative for production environments. Neither Ceph nor btrfs is ready for production, so it could be really risky to put them together. This is why XFS is an excellent alternative to btrfs. The ext4 filesystem is also compatible but doesn't take advantage of all the power of Ceph.
We recommend configuring Ceph to use the XFS file system in the near term, and btrfs in the long term once it is stable enough for production.
Ceph data can be consumed in several ways:
RBD: as a block device. The Linux kernel RBD (RADOS block device) driver allows striping a Linux block device over multiple distributed object store data objects. It is compatible with the KVM RBD image.
CephFS: as a file, through a POSIX-compliant filesystem.
More generally, Ceph exposes its distributed object store (RADOS) directly, and it can be accessed via multiple interfaces (a small rados CLI example follows the list):
RADOS Gateway: a Swift- and Amazon S3-compatible RESTful interface.
librados and the related C/C++ bindings.
rbd and QEMU-RBD: Linux kernel and QEMU block devices that stripe data across multiple objects.
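As a quick illustration of talking to RADOS directly, arbitrary objects can be pushed and pulled with the rados CLI; the pool and object names below are just examples:

$ echo "hello rados" > hello.txt
$ rados -p data put hello-object hello.txt
$ rados -p data ls | grep hello-object
hello-object
$ rados -p data get hello-object /tmp/hello-copy.txt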
I.3. IS CEPH PRODUCTION-QUALITY?
The definition of “production quality” varies depending on who you ask. Because it can mean a lot of different things depending on how you want to use Ceph, we prefer not to think of it as a binary term.
At this point we support the RADOS object store, radosgw, and RBD because we think they are sufficiently stable that we can handle the support workload. There are several organizations running those parts of the system in production. Others wouldn’t dream of doing so at this stage.
The CephFS POSIX-compliant filesystem is functionally-complete and has been evaluated by a large community of users, but has not yet been subjected to extensive, methodical testing.
Since there is no stable version yet, I decided to track upstream Ceph. I therefore used the Ceph repository and worked with the latest version available, 0.47.2:
Generate the keyring authentication, deploy the configuration and configure the nodes. I highly recommend setting up SSH key-based authentication beforehand, because mkcephfs will attempt to connect via SSH to each server (hostname) you provided in the Ceph configuration file. It can be a pain in the arse to enter the SSH password for every command run by mkcephfs!
Directory creation is not managed by the script, so you have to create the directories manually on each server:
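For example, assuming the /srv/ceph layout from the configuration sketch above (hostnames are illustrative; adapt the paths to whatever your ceph.conf declares):

server-01:~$ sudo mkdir -p /srv/ceph/osd0 /srv/ceph/mon0 /srv/ceph/mds0
server-02:~$ sudo mkdir -p /srv/ceph/osd1 /srv/ceph/mon1 /srv/ceph/mds1
server-03:~$ sudo mkdir -p /srv/ceph/osd2 /srv/ceph/mon2 /srv/ceph/mds2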
Don't forget to mount your OSD directories according to your disk layout, otherwise Ceph will by default use the root filesystem. It's up to you whether to use ext4 or XFS. For those of you who want to set up an ext4 cluster, I strongly recommend using appropriate mount options for your hard drives, for example:
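As an illustration only (the device name and exact option set are assumptions, adjust for your disks), an ext4 OSD mount wants extended attributes enabled, since the OSDs store metadata in xattrs:

# /etc/fstab
/dev/sdb1    /srv/ceph/osd0    ext4    noatime,user_xattr    0    2

$ sudo mount /srv/ceph/osd0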
$ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin
Ceph doesn't need root permission to execute its commands; it simply needs access to the keyring. Each Ceph command you execute on the command line assumes that you are the client.admin default user. The client.admin key was generated during the mkcephfs process. The interesting thing to know about cephx is that it's based on a Kerberos-like ticket trust mechanism. If you want to go further with cephx authentication, check the Ceph documentation about it. Just make sure that your keyring is readable by everyone:
$ sudo chmod +r /etc/ceph/keyring.admin
And launch all the daemons:
$ sudo service ceph start
This will start all the Ceph daemons, namely OSD, MON and MDS (-a flag), but you can target a particular daemon type by adding an extra parameter such as osd, mon or mds. Now check the status of your cluster by running the following command:
$ ceph -k /etc/ceph/keyring.admin -c /etc/ceph/ceph.conf health
HEALTH_OK
As you can see I'm using the -k option: Ceph supports cephx secure authentication between the nodes within the cluster, and each connection and communication is initiated with this authentication mechanism. Depending on your setup it can be overkill to use this system…
All the daemons are running (extract from server-04):
Of course some pools may store more critical data; for instance my pool called nova stores the RBD volume of each virtual machine, so I increased its replication level like this:
$ ceph osd pool set nova size 3
set pool 3 size to 3
II.7. Connect your client
Clients can access the RADOS cluster directly via librados with the rados command. Using librbd is possible as well through the RBD tool (the rbd command), which provides an image/volume abstraction over the object store. To make the monitors highly available when mounting CephFS, simply put all of them in the mount options:
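For example, with the three monitors used later in this article:

$ sudo mount -t ceph 172.17.1.4:6789,172.17.1.5:6789,172.17.1.7:6789:/ /mnt \
    -o name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==

The kernel client will then fail over to any of the listed monitors if the one it is connected to goes away.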
I tried to simulate a MON failure while CephFS was mounted. I stopped one of my MON servers, and precisely the one used for mounting CephFS. Oh yes.. I forgot to tell you I used only one monitor to mount Ceph… And the result was really unexpected: after I stopped the monitor, CephFS didn't crash and stayed alive :). There is some magic performed under the hood by Ceph. I don't really know how, but Ceph and the monitors are clever enough to detect the MON failure, re-initiate a connection to another monitor and thus keep the mounted filesystem alive.
Check this:
client:~$ mount | grep ceph
client:~$ sudo mount -t ceph 172.17.1.7:6789:/ /mnt -vv -o name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==
client:~$ mount | grep ceph
172.17.1.7:6789:/ on /mnt type ceph (rw,name=admin,secret=AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==)
client:~$ ls /mnt/
client:~$ touch /mnt/mon-ok
client:~$ ls /mnt/
mon-ok
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0 172.17.1.2:60462        172.17.1.7:6789         ESTABLISHED -
server6:~$ sudo service ceph stop mon
=== mon.2 ===
Stopping Ceph mon.2 on server6...kill 532...done
client:~$ touch /mnt/mon-3-down
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0 172.17.1.2:60462        172.17.1.5:6789         ESTABLISHED -
server6:~$ sudo service ceph start mon
=== mon.2 ===
Starting Ceph mon.2 on server6...
starting mon.2 rank 2 at 172.17.1.7:6789/0 mon_data /srv/ceph/mon2 fsid caf6e927-e87e-4295-ab01-3799d6e24be1
server4:~$ sudo service ceph stop mon
=== mon.1 ===
Stopping Ceph mon.1 on server4...kill 4049...done
client:~$ touch /mnt/mon-2-down
client:~$ sudo netstat -plantu | grep EST | grep 6789
tcp        0      0 172.17.1.2:60462        172.17.1.4:6789         ESTABLISHED -
client:~$ touch /mnt/mon-2-down
client:~$ ls /mnt/
mon-ok  mon-3-down  mon-2-down
Impressive!
III. OpenStack integration
III.1. RBD and nova-volume
Before starting, here is my setup. I voluntarily installed nova-volume on a node of my Ceph cluster:
By default, the RBD pool named rbd will be used by OpenStack if nothing is specified. I preferred to use nova as the pool, so I created it:
$ rados lspools
data
metadata
rbd
$ rados mkpool nova
$ rados lspools
data
metadata
rbd
nova
$ rbd --pool nova ls
volume-0000000c
$ rbd --pool nova info volume-0000000c
rbd image 'volume-0000000c':
        size 1024 MB in 256 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.0
        parent: (pool -1)
Restart your nova-volume:
$ sudo service nova-volume restart
Try to create a volume, you shouldn’t have any problem :)
$ nova volume-create --display_name=rbd-vol 1
Check this via:
$ nova volume-list
+----+-----------+--------------+------+-------------+-------------+
| ID | Status    | Display Name | Size | Volume Type | Attached to |
+----+-----------+--------------+------+-------------+-------------+
| 51 | available | rbd-vol      | 1    | None        |             |
+----+-----------+--------------+------+-------------+-------------+
Check in RBD:
$ rbd --pool nova ls
volume-00000033
$ rbd --pool nova info volume-00000033
rbd image 'volume-00000033':
        size 1024 MB in 256 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.3
        parent: (pool -1)
Everything looks great, but wait.. can I attach it to an instance?
Since we are using cephx authentication, nova and libvirt require a couple more steps.
For security and clarity purposes you may want to create a new user and give it fine-grained permissions on your Ceph cluster. Let's say you want to use a user called nova; each connection to your MON servers will then be initiated as client.nova instead of client.admin. This behaviour is defined by the rados_create function, which creates a handle for communicating with your RADOS cluster. Ceph environment variables are read when this is called, so if $CEPH_ARGS specifies everything you need to connect, no further configuration is necessary. The trick is to add the following lines at the beginning of the /usr/lib/python2.7/dist-packages/nova/volume/driver.py file:
# use client.nova instead of client.admin
import os
os.environ["CEPH_ARGS"] = "--id nova"
Adding the variable via the init script of nova-volume should also work, it’s up to you. The nova user needs this environment variable.
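The creation of the client.nova user itself goes roughly like this with the cephx tooling of this release; the capability strings and keyring path are my assumptions, so double-check them against the Ceph documentation:

$ ceph-authtool --create-keyring /etc/ceph/keyring.nova --gen-key -n client.nova
$ ceph-authtool -n client.nova --cap mon 'allow r' --cap osd 'allow rwx pool=nova' /etc/ceph/keyring.nova
$ ceph auth add client.nova -i /etc/ceph/keyring.nova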
Here I assume that you use client.admin; if you use client.nova, change every value called admin to nova. Now we can start to configure the secret in libvirt: create a file secret.xml and add content like the following:
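The XML below is my reconstruction of a cephx libvirt secret; the base64 value is the client.admin key shown earlier, and <uuid> stands for whatever UUID virsh prints:

$ cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.admin secret</name>
  </usage>
</secret>
EOF
$ sudo virsh secret-define --file secret.xml
Secret <uuid> created
$ sudo virsh secret-set-value --secret <uuid> --base64 AQDVGc5P0LXzJhAA5C019tbdrgypFNXUpG2cqQ==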
Log in to the compute node where the instance is running and check the ID of the running instance. If you don't know where the instance is running, launch the following commands:
$ nova list
+--------------------------------------+-------------------+--------+---------------------+
| ID                                   | Name              | Status | Networks            |
+--------------------------------------+-------------------+--------+---------------------+
| e1457eea-ef67-4df3-8ba4-245d104d2b11 | instance-over-rbd | ACTIVE | vlan1=192.168.22.36 |
+--------------------------------------+-------------------+--------+---------------------+
As you can see, my instance is running on server-02. Pick up the libvirt ID of your instance, here instance-000000d6, in virsh. Attach the volume manually with virsh:
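A sketch of such an attach, reusing the nova pool, the volume created above and the libvirt secret from the previous step (replace REPLACE_WITH_SECRET_UUID with your secret UUID):

$ cat > rbd-disk.xml <<EOF
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='nova/volume-0000000c'>
    <host name='172.17.1.4' port='6789'/>
    <host name='172.17.1.5' port='6789'/>
    <host name='172.17.1.7' port='6789'/>
  </source>
  <auth username='admin'>
    <secret type='ceph' uuid='REPLACE_WITH_SECRET_UUID'/>
  </auth>
  <target dev='vdb' bus='virtio'/>
</disk>
EOF
$ sudo virsh attach-device instance-000000d6 rbd-disk.xml

Inside the guest, fdisk -l then shows the new virtio disk alongside the root disk: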
Disk /dev/vda: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/vda1   *       16065     4192964     2088450   83  Linux

Disk /dev/vdb: 1073 MB, 1073741824 bytes
16 heads, 63 sectors/track, 2080 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/vdb doesn't contain a valid partition table
Now you are ready to use it:
ubuntu@instance-over-rbd:~$ sudo mkfs.ext4 /dev/vdb
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
ubuntu@instance-over-rbd:~$ sudo mount /dev/vdb /mnt
ubuntu@instance-over-rbd:~$ ls /mnt/
lost+found  test
Last but not least, edit your nova.conf on each nova-compute server with the authentication values. It works without those options, since we added them manually to libvirt, but I think it is good to tell nova about them. You will then be able to attach a volume to an instance from the nova CLI with the nova volume-attach command, and from the dashboard as well :D
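The flag names below are my assumption for this nova release (double-check against your nova.conf reference); the idea is simply to hand nova the same user and libvirt secret we configured above (use nova instead of admin if you created the client.nova user):

# /etc/nova/nova.conf (on each nova-compute node)
rbd_user=admin
rbd_secret_uuid=REPLACE_WITH_SECRET_UUID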
$ nova volume-create --display_name=nova-rbd-vol 1
$ nova volume-list
+----+-----------+--------------+------+-------------+--------------------------------------+
| ID | Status    | Display Name | Size | Volume Type | Attached to                          |
+----+-----------+--------------+------+-------------+--------------------------------------+
| 51 | available | rbd-vol      | 1    | None        |                                      |
| 57 | available | nova-rbd-vol | 1    | None        |                                      |
+----+-----------+--------------+------+-------------+--------------------------------------+
$ nova volume-attach e1457eea-ef67-4df3-8ba4-245d104d2b11 57 /dev/vdd
$ nova volume-list
+----+-----------+--------------+------+-------------+--------------------------------------+
| ID | Status    | Display Name | Size | Volume Type | Attached to                          |
+----+-----------+--------------+------+-------------+--------------------------------------+
| 51 | available | rbd-vol      | 1    | None        |                                      |
| 57 | in-use    | nova-rbd-vol | 1    | None        | e1457eea-ef67-4df3-8ba4-245d104d2b11 |
+----+-----------+--------------+------+-------------+--------------------------------------+
The first disk is marked as available simply because it has been attached manually with virsh and not with nova. Have a look inside your virtual machine :)
/!\ Important note: the secret.xml needs to be added on each nova-compute node, more precisely to libvirt. Keep the first secret (UUID) and put it into your secret.xml; that file then becomes your new secret.xml reference file. While attaching volumes I ran into the following libvirt errors:
error : qemuMonitorJSONCheckError:318 : internal error unable to execute QEMU command 'device_add': Device 'virtio-blk-pci' could not be initialized
error : qemuMonitorJSONCheckError:318 : internal error unable to execute QEMU command 'device_add': Duplicate ID 'virtio-disk2' for device
The first one occurred when I tried to attach a volume with /dev/vdb as the device name, and the second occurred with /dev/vdc. It was solved by using a different device name than /dev/vdc; I think libvirt remembers 'somewhere' and 'somehow' that a device was previously attached (the manually attached one). I didn't really investigate since it can be simply solved.
EDIT: 11/07/2012
Some people reported a common issue to me. They were unable to attach an RBD device with nova, although it works fine with libvirt directly. If you have difficulties making it work, you will probably need to update the libvirt AppArmor profile. If you check your /var/log/libvirt/qemu/your_instance_id.log, you should see:
unable to find any monitors in conf. please specify monitors via -m monaddr or -c ceph.conf
And if you dive into the debug mode:
debug : virJSONValueFromString:914 : string={"return": "error connecting\r\ncould not open disk image rbd:nova/volume-00000050: No such file or directory\r\n", "id": "libvirt-12"}
And of course it's logged by AppArmor, and the denial is pretty explicit:
Now edit the libvirt AppArmor profile; you need to adjust access controls for all VMs, new or existing:
$ sudo echo "/etc/ceph/** r," | sudo tee -a /etc/apparmor.d/abstractions/libvirt-qemu
$ sudo service libvirt-bin restart
$ sudo service apparmor reload
That's all; after this, libvirt/qemu will be able to read your ceph.conf and your keyring (if you use cephx) ;-).
III.2. RBD and Glance
III.2.1. RBD as Glance storage backend
I followed the official instructions from the OpenStack documentation. I recommend using the upstream packages from Ceph since the Ubuntu repo doesn't provide a valid version. This issue was recently reported by Florian Haas on the OpenStack and Ceph mailing lists, and the bug is already tracked here. It has been uploaded to precise-proposed for SRU review and is waiting for approval, which shouldn't take too long. Be sure to add the Ceph repo (deb http://ceph.com/debian/ precise main) on your Glance server (as I did earlier).
$ sudo apt-get install python-ceph
Modify your glance-api.conf like so:
# Set the rbd storage
default_store = rbd
# ============ RBD Store Options =============================
# Ceph configuration file path
# If using cephx authentication, this file should
# include a reference to the right keyring
# in a client.<USER> section
rbd_store_ceph_conf = /etc/ceph/ceph.conf
# RADOS user to authenticate as (only applicable if using cephx)
rbd_store_user = glance
# RADOS pool in which images are stored
rbd_store_pool = images
# Images will be chunked into objects of this size (in megabytes).
# For best performance, this should be a power of two
rbd_store_chunk_size = 8
According to the glance configuration, I created a new pool and a new user for RADOS:
$ rados mkpool images
successfully created pool images
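The matching client.glance user can be created exactly like the client.nova sketch earlier (swap the name and restrict the caps to the images pool). What Glance itself needs, per the comment in glance-api.conf above, is a keyring reference in ceph.conf; the keyring path here is an assumption:

[client.glance]
        keyring = /etc/ceph/keyring.glance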
$ sudo service glance-api restart && sudo service glance-registry restart
Before uploading, check your images pool:
$ rados --pool=images ls
rbd_directory
rbd_info
Try to upload a new image.
$ wget http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64-disk1.img
$ glance add name="precise-ceph" is_public=True disk_format=qcow2 container_format=ovf architecture=x86_64 < precise-server-cloudimg-amd64-disk1.img
Uploading image 'precise-ceph'
======================================================================================================[100%] 26.2M/s, ETA 0h 0m 0s
Added new image with ID: 70685ad4-b970-49b7-8bde-83e58b255d95
Check in glance:
$ glance index
ID                                   Name                           Disk Format          Container Format     Size
------------------------------------ ------------------------------ -------------------- -------------------- --------------
60beab84-81a7-46d1-bb4a-19947937dfe3 precise-ceph                   qcow2                ovf                       227213312
When nova snapshots an instance it runs three commands. The first one initiates and creates the snapshot, named 56642cf3d09b49a7aa400b6bc07494b9, from the instance's disk image located at /var/lib/nova/instances/instance-00000097/disk.
The second command converts the image from qcow2 to qcow2 and stores the backup in Glance, and thus in RBD. So the image is stored in qcow2 format, which is not really what we want! I want to store an RBD-format image.
The third command deletes the local snapshot file, which is no longer needed since the image has been stored in the Glance backend.
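Pieced together from the description above, the sequence nova runs looks roughly like this (the temporary output path is hypothetical):

# 1. create a named snapshot inside the instance's qcow2 disk
$ qemu-img snapshot -c 56642cf3d09b49a7aa400b6bc07494b9 /var/lib/nova/instances/instance-00000097/disk
# 2. extract that snapshot into a standalone qcow2 file, which is then uploaded to Glance
$ qemu-img convert -f qcow2 -O qcow2 -s 56642cf3d09b49a7aa400b6bc07494b9 \
    /var/lib/nova/instances/instance-00000097/disk /tmp/snapshot.qcow2
# 3. remove the local file once the upload is done
$ rm /tmp/snapshot.qcow2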
In other words, when you perform a snapshot of an instance from the dashboard or via the nova image-create command, nova makes a local copy of the changes in a qcow2 file, and that file is then stored in Glance.
If you want to take an RBD snapshot through OpenStack, you need to take a volume snapshot. This functionality is not exposed in the dashboard yet.
Snapshot an RBD volume:
snapshot snapshot-00000004: creating
snapshot snapshot-00000004: creating from (pid=18829) create_snapshot
rbd --pool nova snap create --snap snapshot-00000004 volume-00000042
snapshot snapshot-00000004: created successfully from (pid=18829) create_snapshot
Verify:
$ rbd --pool=nova snap ls volume-00000042
2 snapshot-00000004 1073741824
Full RBD management?
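The upload itself was presumably done with qemu-img writing straight into RBD, something like the command below; the destination matches the RBD image inspected just after, and the raw output format explains what qemu-img info reports for it:

$ qemu-img convert -O raw precise-server-cloudimg-amd64.img rbd:nova/ceph-img-cli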
$ qemu-img info precise-server-cloudimg-amd64.img
image: precise-server-cloudimg-amd64.img
file format: qcow2
virtual size: 2.0G (2147483648 bytes)
disk size: 217M
cluster_size: 65536
$ qemu-img info rbd:nova/ceph-img-cli
image: rbd:nova/ceph-img-cli
file format: raw
virtual size: 2.0G (2147483648 bytes)
disk size: unavailable
There is a surprising value here: why does the image appear as raw format? And why is the disk size unavailable? For those of you who want to go further with QEMU-RBD snapshots, see the Ceph documentation.
III.3. Does the dream come true?
Boot from a RBD image?
I uploaded a new image to the Glance RBD backend and tried to boot from it, and it works. Glance is able to retrieve images from the configured RBD backend. You will usually see this log message:
INFO nova.virt.libvirt.connection [-] [instance: ce230d11-ddf8-4298-a7d9-40ae8690ff11] Instance spawned successfully.
III.4. Boot from a volume
Booting from a volume will require specifying a dummy image id, as shown in these scripts:
if [ ! -f $DIR/debian.img ]; then
    echo "Downloading debian image..."
    wget http://ceph.newdream.net/qa/debian.img -O $DIR/debian.img
fi
touch $DIR/dummy_img
glance add name="dummy_raw_img" is_public=True disk_format=raw container_format=ovf architecture=x86_64 < $DIR/dummy_img
echo "Waiting for image to become available..."
while true; do
    if ( timeout 5 nova image-list | egrep -q 'dummy_raw_img.*ACTIVE' )
    then
        break
    fi
    sleep 2
done
echo "Creating volume..."
nova volume-create --display_name=dummy 1
echo "Waiting for volume to be available..."
while true; do
    if ( nova volume-list | egrep -q 'dummy.*available' )
    then
        break
    fi
    sleep 2
done
echo "Replacing blank image with real one..."
# last created volume id, assuming pool nova
DUMMY_VOLUME_ID=$(rbd --pool=nova ls | sed -n '$p')
rbd -p nova rm $DUMMY_VOLUME_ID
rbd -p nova import $DIR/debian.img $DUMMY_VOLUME_ID
echo "Requesting an instance..."
$DIR/boot-from-volume
echo "Waiting for instance to start..."
while true; do
    if ( nova list | egrep -q "boot-from-rbd.*ACTIVE" )
    then
        break
    fi
    sleep 2
done

The boot-from-volume helper it calls talks to the Nova API directly:
import argparse
import httplib2
import json
import os
def main():
    http = httplib2.Http()
    parser = argparse.ArgumentParser(description='Boot an OpenStack instance from RBD')
    parser.add_argument(
        '--endpoint',
        help='the Nova API endpoint (http://IP:port/vX.Y/)',
        default="http://172.17.1.6:8774/v2/",
    )
    parser.add_argument(
        '--image-id',
        help="The image ID Nova will pretend to boot from (ie, 1 -- not ami-0000001)",
        default=4,
    )
    parser.add_argument(
        '--volume-id',
        help='The RBD volume ID (ie, 1 -- not volume-0000001)',
        default=1,
    )
    parser.add_argument(
        '-v', '--verbose',
        action='store_true',
        default=False,
        help='be more verbose',
    )
    args = parser.parse_args()
    # 'headers' (auth token) and 'req' (boot request body) are built
    # earlier in the script; that part is omitted here.
    resp, body = http.request(
        '{endpoint}/volumes/os-volumes_boot'.format(endpoint=args.endpoint),
        method='POST',
        headers=headers,
        body=json.dumps(req),
    )
    if resp.status == 200:
        print "Instance scheduled successfully."
        if args.verbose:
            print json.dumps(json.loads(body), indent=4, sort_keys=True)
    else:
        print "Failed to create an instance: response status", resp.status
        print json.dumps(json.loads(body), indent=4, sort_keys=True)
if __name__ == '__main__':
    main()
Both scripts are a little outdated, so I rewrote some parts; it's not that demanding. I barely spent any time on it, so there's still work to be done. For example, I don't use the euca API, so I simply rewrote those parts against the nova API.
Josh Durgin from Inktank said the following:
What’s missing is that OpenStack doesn’t yet have the ability to initialize a volume from an image. You have to put an image on one yourself before you can boot from it currently. This should be fixed in the next version of OpenStack. Booting off of RBD is nice because you can do live migration, although I haven’t tested that with OpenStack, just with libvirt. For Folsom, we hope to have copy-on-write cloning of images as well, so you can store images in RBD with glance, and provision instances booting off cloned RBD volumes in very little time.
I quickly tried this manipulation, but without success:
$ nova volume-create --display_name=dummy 1
$ nova volume-list
+----+-----------+--------------+------+-------------+--------------------------------------+
| ID | Status    | Display Name | Size | Volume Type | Attached to                          |
+----+-----------+--------------+------+-------------+--------------------------------------+
| 69 | available | dummy        | 2    | None        |                                      |
+----+-----------+--------------+------+-------------+--------------------------------------+
$ rbd -p nova ls
volume-00000045
$ rbd import debian.img volume-00000045
Importing image: 13% complete...2012-06-08 13:45:34.562112 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 27% complete...2012-06-08 13:45:39.563358 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 41% complete...2012-06-08 13:45:44.563607 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 55% complete...2012-06-08 13:45:49.564244 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 69% complete...2012-06-08 13:45:54.565737 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 83% complete...2012-06-08 13:45:59.565893 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 97% complete...2012-06-08 13:46:04.567426 7fbb19835700 0 client.4355.objecter pinging osd that serves lingering tid 1 (osd.1)
Importing image: 100% complete...done.
$ nova boot --flavor m1.tiny --image precise-ceph --block_device_mapping vda=69:::0 --security_groups=default boot-from-rbd
III.5. Live migration with CephFS!
I was brave enough to also experiment with live migration on top of the Ceph filesystem. Some of the prerequisites are obvious, but just to be sure, live migration comes with mandatory requirements:
2 compute nodes
a distributed file system, here CephFS
For the live-migration configuration I followed the official OpenStack documentation. The following actions need to be performed on each compute node:
Update the libvirt configurations. Modify /etc/libvirt/libvirtd.conf:
listen_tls = 0
listen_tcp = 1
auth_tcp = "none"
Modify /etc/init/libvirt-bin.conf and add the -l option:
libvirtd_opts=" -d -l"
Restart libvirt ($ sudo service libvirt-bin restart) and ensure that it has successfully restarted.
Make sure that you see the -l flag in the ps output. You should be able to retrieve information (passwordless) from one hypervisor to another; to test it, simply run:
server-02:/$ sudo virsh --connect qemu+tcp://server-01/system list
 Id Name                 State
----------------------------------
  1 instance-000000af    running
  3 instance-000000b5    running
Sometimes you can get this message from the nova-scheduler logs:
Casted 'live_migration' to compute 'server-01' from (pid=10963) cast_to_compute_host /usr/lib/python2.7/dist-packages/nova/scheduler/driver.py:80
If something goes wrong, you should find hints in the logs, so check:
nova-compute logs
nova-scheduler logs
libvirt logs
The libvirt logs could show those errors:
error : virExecWithHook:328 : Cannot find 'pm-is-supported' in path: No such file or directory
error : virNetClientProgramDispatchError:174 : Unable to read from monitor: Connection reset by peer
The first issue (pm) was solved by installing this package:
$ sudo apt-get install pm-utils -y
The second one is a little trickier; the only clue I found was to disable the VNC console, according to this thread.
Finally check the log and see:
instance: 962c222f-2280-43e9-83be-c27a31f77946] Migrating instance to server-02 finished successfully.
Sometimes this message doesn't appear even though the live migration was performed successfully; the best check is to wait and watch on the remote server:
$ watch sudo virsh list
Every 2.0s: sudo virsh list

 Id Name                 State
----------------------------------

Every 2.0s: sudo virsh list

 Id Name                 State
----------------------------------
  6 instance-000000dc    shut off

Every 2.0s: sudo virsh list

 Id Name                 State
----------------------------------
  6 instance-000000dc    paused

Every 2.0s: sudo virsh list

 Id Name                 State
----------------------------------
  6 instance-000000dc    running
During the live migration, you should see those states in virsh:
shut off
paused
running
That’s all!
The downtime for an m2.tiny instance was approximately 3 seconds.
III.6. Virtual instance disk errors - Solved!
When I use Ceph to store the /var/lib/nova/instances directory of each nova-compute server I have these I/O errors inside the virtual machines…
Buffer I/O error on device vda1, logical block 593914
Buffer I/O error on device vda1, logical block 593915
Buffer I/O error on device vda1, logical block 593916
EXT4-fs warning (device vda1): ext4_end_bio:251: I/O error writing to inode 31112 (offset 7852032 size 524288 starting block 595925)
JBD2: Detected IO errors while flushing file data on vda1-8
Logs from the kernel during the boot sequence of the instance:
This issue appears every time I launch a new instance. Sometimes waiting for the ext4 auto-recovery mechanism temporarily solves the problem, but the filesystem still remains unstable. This error is probably due to the ext4 filesystem. It happens really often and I don't have any clue at the moment; maybe a filesystem option or switching from ext4 to XFS will do the trick. So far I have tried several mount options inside the VM, like nobarrier or noatime, but nothing changed. This is what I got when I tried to perform a basic operation like installing a package:
Reading package lists... Error!
E: Unable to synchronize mmap - msync (5: Input/output error)
E: The package lists or status file could not be parsed or opened.
This can be solved by the following commands but it’s neither useful nor relevant since this error will occur again and again…
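The commands in question are essentially an offline filesystem check of the backing volume, along the lines of (run while the volume is not in use; the device is the one reported in the fsck output below):

$ sudo e2fsck -f -y /dev/nova-volumes/lvol0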
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/nova-volumes/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nova-volumes/lvol0: 3435/6553600 files (6.7% non-contiguous), 2783459/26214400 blocks
Nothing relevant, everything is properly working.
This issue is unsolved; it's simply related to the fact that CephFS is not stable enough and can't handle this amount of I/O. A possible workaround is discussed here and here. I don't even think that using XFS instead of ext4 would change the outcome. It seems that this issue also occurs with RBD volumes, see the Ceph tracker.
According to this reported bug (and the mailing list discussion), this issue affects RBD volumes inside virtual machines; the workaround is to activate RBD caching, an option that should be added to the XML file while attaching a device.
I haven't checked this workaround yet, but the issue seems to be solved by enabling the cache.
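As I read it, the knob lives on the driver element of the disk definition used when attaching the device; whether this alone is enough to enable the librbd cache depends on your qemu version, so treat it as a pointer rather than a recipe:

<!-- same disk definition as in the attach example above, with a cache mode added -->
<driver name='qemu' type='raw' cache='writeback'/>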
UPDATE: 13/06/2012 - I/O ISSUES SOLVED
It seems that Ceph has a lot of difficulty with direct I/O support, see below:
$ mount | grep ceph
172.17.1.4:6789,172.17.1.5:6789,172.17.1.7:6789:/ on /mnt type ceph (name=admin,key=client.admin)
$ dd if=/dev/zero of=/mnt/directio bs=8M count=1 oflag=direct
1+0 records in
1+0 records out
8388608 bytes (8.4 MB) copied, 0.36262 s, 23.1 MB/s
$ dd if=/dev/zero of=/mnt/directio bs=9M count=1 oflag=direct
dd: writing `/mnt/directio': Bad address
1+0 records in
0+0 records out
0 bytes (0 B) copied, 1.20184 s, 0.0 kB/s
Setting the cache to none means using direct I/O… Note from the libvirt documentation:
The optional cache attribute controls the cache mechanism, possible values are “default”, “none”, “writethrough”, “writeback”, “directsync” (like “writethrough”, but it bypasses the host page cache) and “unsafe” (host may cache all disk io, and sync requests from guest are ignored). Since 0.6.0, “directsync” since 0.9.5, “unsafe” since 0.9.7
Cache parameters explained:
none: uses O_DIRECT I/O that bypasses the filesystem cache on the host
writethrough: uses O_SYNC I/O that is guaranteed to be committed to disk on return to userspace. Only read requests are cached; writes go immediately to disk.
writeback: uses normal buffered I/O that is written back later by the operating system. It caches write requests in RAM, which brings high performance but also increases the probability of data loss.
Actually there is already a function to test whether direct I/O is supported:
@staticmethod
def _supports_direct_io(dirpath):
    testfile = os.path.join(dirpath, ".directio.test")
    hasDirectIO = True
    try:
        f = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
        os.close(f)
        LOG.debug(_("Path '%(path)s' supports direct I/O") %
                  {'path': dirpath})
    except OSError, e:
        if e.errno == errno.EINVAL:
            LOG.debug(_("Path '%(path)s' does not support direct I/O: "
                        "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
            hasDirectIO = False
        else:
            LOG.error(_("Error on '%(path)s' while checking direct I/O: "
                        "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
            raise e
    except Exception, e:
        LOG.error(_("Error on '%(path)s' while checking direct I/O: "
                    "'%(ex)s'") % {'path': dirpath, 'ex': str(e)})
        raise e
    finally:
        try:
            os.unlink(testfile)
        except:
            pass

    return hasDirectIO
Somehow the lack of support is not detected here, mainly because the issue is related to the block size.
Whether direct I/O is supported determines the cache mode chosen in /usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py, at line 1036:
@property
def disk_cachemode(self):
    if self._disk_cachemode is None:
        # We prefer 'none' for consistent performance, host crash
        # safety & migration correctness by avoiding host page cache.
        # Some filesystems (eg GlusterFS via FUSE) don't support
        # O_DIRECT though. For those we fallback to 'writethrough'
        # which gives host crash safety, and is safe for migration
        # provided the filesystem is cache coherent (cluster filesystems
        # typically are, but things like NFS are not).
        self._disk_cachemode = "none"
        if not self._supports_direct_io(FLAGS.instances_path):
            self._disk_cachemode = "writethrough"
    return self._disk_cachemode
The first trick was to modify this line:
self._disk_cachemode = "none"
With
self._disk_cachemode = "writethrough"
With this change, every instance will have the libvirt cache option set to writethrough, even if the filesystem supports direct I/O.
Fix a corrupted VM by forcing fsck to repair the filesystem at boot (FSCKFIX lives in /etc/default/rcS inside the guest on Ubuntu):
FSCKFIX=yes
Reboot the VM :)
Note: writeback is also supported with Ceph; it offers better performance than writethrough, but writethrough remains the safest option for your data. It depends on your needs :)
IV. Benchmarks
These benchmarks have been performed on an ext4 filesystem and on 15K RPM hard drives.
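The latency figures below presumably come from the built-in RADOS benchmark; a typical invocation (pool name and duration are arbitrary here) looks like this:

$ rados -p data bench 60 write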
Average Latency:  1.20432
Max latency:      2.68683
Min latency:      0.120974
IV.1.2. OSD Benchmarks
From a console run:
$ for i in 0 1 2; do ceph osd tell $i bench; done
ok
ok
ok
Monitor the output from another terminal:
$ ceph -w
osd.0 172.17.1.4:6802/22135 495 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 4.575725 sec at 223 MB/sec
osd.1 172.17.1.5:6801/8713 877 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 22.559266 sec at 46480 KB/sec
osd.2 172.17.1.7:6802/737 1274 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 20.011638 sec at 52398 KB/sec
As you can see, I have pretty bad performance on two OSDs. Both of them will drag down the performance of my whole cluster (this statement will be verified below).
server-03:~$ time dd if=/dev/zero of=test bs=2000M count=1; time scp test root@server-04:/dev/null;
2097152000 bytes (2.1 GB) copied, 4.46267 s, 470 MB/s

root@server-04's password:
test                          100% 2000MB  52.6MB/s   00:47

real    0m49.298s
user    0m43.915s
sys     0m5.172s
It's not really surprising since Ceph showed an average of 53 MB/s. I clearly have a network bottleneck because all my servers are connected with gigabit links. I also tested a copy from the root partition to the Ceph-mounted directory to see how long it takes to write data into Ceph:
$ time dd if=/dev/zero of=pouet bs=2000M count=1; time sudo cp pouet /var/lib/nova/instances/;
1+0 records in
1+0 records out
2097152000 bytes (2.1 GB) copied, 4.27012 s, 491 MB/s
The information reported by the -w option is asynchronous and not really precise; for instance, we can't conclude from it that storing 2 GB in the Ceph DFS took 37 seconds.
/*
 * Copyright (C) 2010 Canonical
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 *
 */
    /* Find out how many extents there are */
    if (ioctl(fd, FS_IOC_FIEMAP, fiemap) < 0) {
        fprintf(stderr, "fiemap ioctl() failed\n");
        return NULL;
    }

    /* Read in the extents */
    extents_size = sizeof(struct fiemap_extent) * (fiemap->fm_mapped_extents);

    /* Resize fiemap to allow us to read in the extents */
    if ((fiemap = (struct fiemap *)realloc(fiemap, sizeof(struct fiemap) + extents_size)) == NULL) {
        fprintf(stderr, "Out of memory allocating fiemap\n");
        return NULL;
    }
I hope I will be able to go further and use Ceph in production. Ceph seems fairly stable at the moment for RBD and RADOS; CephFS doesn't seem capable of handling heavy I/O traffic. Also keep in mind that a company called Inktank offers commercial support for Ceph, and I don't think that's a coincidence: Ceph has a bright future. The recovery procedure is excellent, and of course there are a lot of components I would have loved to play with, like fine CRUSH map tuning. This article may be updated at any time as I take my research further :).
This article wouldn't have been possible without the tremendous help of Josh Durgin from Inktank; many, many thanks to him :)