Ceph: how to test if your SSD is suitable as a journal device?

A simple benchmark job to determine if your SSD is suitable to act as a journal device for your OSDs.


I. Testing

To give you a little bit of background: when the OSD writes to its journal, it uses O_DSYNC and O_DIRECT. Writing with O_DIRECT bypasses the kernel page cache, while O_DSYNC ensures that a write does not return until the data has been completely written. So yes, the OSD basically forces every write to be flushed before it starts the next IO.

First disable the write cache on the disk:

$ sudo hdparm -W 0 /dev/hda

Then disable the controller cache. This example assumes an HP controller in slot 2 and logical drive number 1:

$ sudo hpacucli ctrl slot=2 modify dwc=disable
$ sudo hpacucli controller slot=2 logicaldrive 1 modify arrayaccelerator=disable
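Before benchmarking, it can be worth confirming that the drive write cache is really off. Running hdparm -W with no value reports the current setting instead of changing it. The sketch below parses that report; write_cache_state is a hypothetical helper name, and it assumes the report contains a line of the form " write-caching =  0 (off)", which may differ between hdparm versions.

```shell
# Hypothetical helper: reads `hdparm -W <device>` output on stdin and prints
# "on", "off" or "unknown".
# Assumption: the report contains a line like " write-caching =  0 (off)".
write_cache_state() {
    awk '/write-caching/ {
             if ($3 == "0") state = "off"
             else if ($3 == "1") state = "on"
         }
         END { print (state == "" ? "unknown" : state) }'
}

# Usage (needs root and a real device, so it is left commented out):
#   sudo hdparm -W /dev/hda | write_cache_state
```

If the helper prints anything other than "off", the benchmark numbers may be inflated by the drive cache.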

Now you can start benchmarking your SSD correctly using two different methods. The FIO way:

$ sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

Now it is important to understand the options we passed:

  • --filename: device we want to test
  • --direct: we open the device with O_DIRECT, which means that we bypass the kernel page cache
  • --sync: we open the device with O_DSYNC, so a write is not acknowledged until we are sure that the IO has been completely written
  • --rw: IO pattern, here we use write for sequential writes, since journal writes are always sequential
  • --bs: block size, here we are submitting 4K IOs; this is probably the worst-case scenario, so you can always change this value if you know your workload
  • --numjobs: number of threads that will be running, think of this as the number of ceph-osd daemons writing to the journal
  • --iodepth: queue depth, here we are submitting IOs one by one
  • --runtime: job duration in seconds
  • --time_based: run for the specified runtime duration even if the files are completely read or written
  • --group_reporting: If set, display per-group reports instead of per-job when numjobs is specified.
  • --name: name of the run
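The options above can be wrapped in a small helper so that only the device and the job count change between runs. This is a minimal sketch, not part of fio: journal_test_cmd is a hypothetical name, and it only prints the command so that nothing is written to a device by accident.

```shell
# Hypothetical helper: prints the fio journal test command for a device ($1)
# and a number of jobs ($2, default 1). It does not run fio itself.
journal_test_cmd() (
    dev=$1
    njobs=${2:-1}
    printf 'fio --filename=%s --direct=1 --sync=1 --rw=write --bs=4k --numjobs=%s --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test\n' "$dev" "$njobs"
)

# Print the command for a single job on /dev/sda.
journal_test_cmd /dev/sda 1
```

Once the device name has been double-checked, the printed line can be run with sudo as shown earlier.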

II. Ramp up

Increase --numjobs with each new run. Here is a small example on an SSD:

  • --numjobs=1 reports bw=23418KB/s or iops=5854
  • --numjobs=2 reports bw=43697KB/s or iops=10924
  • --numjobs=3 reports bw=63592KB/s or iops=15898
  • --numjobs=4 reports bw=68500KB/s or iops=17124. My SSD is maxing out here
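One way to spot the point where the SSD stops scaling is to express each run's bandwidth as a percentage of the previous run. The sketch below does this with integer shell arithmetic (rounded down); scaling_gain is a hypothetical helper name, and its arguments are the bw values in KB/s from the runs above.

```shell
# Hypothetical helper: for each bandwidth after the first, prints it as a
# percentage of the previous one (integer division, rounded down).
scaling_gain() (
    prev=""
    for bw in "$@"; do
        if [ -n "$prev" ]; then
            echo "$(( bw * 100 / prev ))"
        fi
        prev=$bw
    done
)

# Bandwidths reported for --numjobs=1..4 in the example above.
scaling_gain 23418 43697 63592 68500
```

For these numbers the helper prints 186, 145 and 107, one per line. A value close to 100 means the extra job added almost nothing, which matches the drive maxing out at --numjobs=4.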

III. Interpret the result

Coming soon…


Bonus

If for whatever reason fio is not available, here is the dd way:

$ sudo dd if=/dev/urandom of=randfile bs=1M count=1024 && sync
$ sudo dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync

What matters most here is to find out how the SSD performs when writing with O_DSYNC. Users have reported SSDs that misbehave with O_DSYNC, so you should always test your SSD before going into production.
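Both the fio and the dd commands above write straight to the raw device, which destroys whatever data is on it. A small guard can help avoid pointing them at a disk that is in use. This is a minimal sketch under stated assumptions: device_in_use is a hypothetical helper, it only looks at mount sources in /proc/mounts (Linux), and it will not notice swap devices or disks used by LVM or software RAID.

```shell
# Hypothetical helper: exits 0 if a mount source in the mounts file starts
# with the given device path (the device itself or one of its partitions).
# $1 = device, $2 = mounts file (defaults to /proc/mounts).
# Prefix matching is deliberately conservative: /dev/sda also matches /dev/sdaa.
device_in_use() (
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$dev" 'index($1, dev) == 1 { found = 1 }
                       END { exit !found }' "$mounts"
)

# Usage (left commented out):
#   if device_in_use /dev/sda; then
#       echo "refusing to benchmark: /dev/sda is mounted" >&2
#   fi
```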



Data aggregation tables

The results reported in the comments are gathered in two tables, one for enterprise drives and one for consumer drives:


Enterprise SSD MODEL                       | Firmware   | 1 JOB     | 5 JOBS    | 10 JOBS
Netlist EV3 16GB                           | ???        | 345 MB/s  | 1439 MB/s | 1766 MB/s
Intel P3700 400GB                          | SSDPEDMD40 | 406 MB/s  | 926 MB/s  | 920 MB/s
Intel P3700 1.6TB                          | SSDPEDMD01 | 360 MB/s  | 985 MB/s  | 1095 MB/s
Intel P3600 800GB                          | 5cd2e4     | 328 MB/s  | 800 MB/s  | 801 MB/s
SanDisk Fusion ioMemory SX300-1300, 1.3TB  | ???        | 174 MB/s  | 793 MB/s  | 1101.9 MB/s
Samsung PM863 1.92TB                       | GXT3003Q   | 163 MB/s  | 344 MB/s  | 345 MB/s
Dell Express Flash NVMe XS1715 SSD 400GB   | ???        | 110 MB/s  | 495 MB/s  | 628 MB/s
Samsung PM863                              | GXT3003Q   | 127 MB/s  | 324 MB/s  | 336 MB/s
Intel DC S3610 1.6TB                       | ???        | 96 MB/s   | 208 MB/s  | 241 MB/s
FusionIO IOdrive2 410GB                    | ???        | 85.1 MB/s | ???       | ???
Samsung SM863 240GB                        | GXM1003Q   | 64.7 MB/s | 125 MB/s  | 125 MB/s
400GB SanDisk Lightning II 12Gb SAS SSD    | ???        | 48.9 MB/s | 194 MB/s  | 255 MB/s
HGST Ultrastar SSD1600MM 800 GB            | ???        | 43.9 MB/s | 96 MB/s   | 177 MB/s
Intel DC S3500                             | ???        | 39.1 MB/s | ???       | ???
Intel DC S3700 100GB                       | ???        | 35.2 MB/s | ???       | ???
SanDisk Cloudspeed II Eco, 960GB           | ???        | 34.9 MB/s | 176 MB/s  | 185 MB/s
Micron M500DC 480 GB                       | ???        | 33.6 MB/s | ???       | ???
Intel DC S3700 400GB                       | 5DV10270   | 26 MB/s   | 44.7 MB/s | 68 MB/s
Intel DC S3700 200GB                       | ???        | 22.5 MB/s | ???       | ???
Intel DC S3710 200GB                       | G2010140   | 23.6 MB/s | ???       | ???
Micron p400e 400GB                         | ???        | 3.0 MB/s  | ???       | ???


Consumer SSD MODEL             | Firmware | 1 JOB    | 5 JOBS    | 10 JOBS
Intel 750 NVMe 400GB           | ???      | 261 MB/s | 884 MB/s  | ???
Samsung SSD 950 PRO 512GB NVMe | ???      | 245 MB/s | 329 MB/s  | 388 MB/s
Kingston v300 120GB            | 603ABBF0 | 98 MB/s  | 181 MB/s  | 200 MB/s
LITEON ECE-200 200GB           | ???      | 15.2 MB/s | ???      | ???
Adata SP900 120GB              | ???      | 11.3 MB/s | ???      | ???
Kingston v300 60GB             | 505ABBF0 | 9.2 MB/s | 22 MB/s   | 39 MB/s
Intel 520 60GB                 | 400i     | 9 MB/s   | 22.3 MB/s | 40 MB/s
Intel 520 180GB (FW - 400i) connected to (Dell C2100 Onboard SATA ICH10 - 3Gbps) | ??? | 8.7 MB/s | 22 MB/s | 40 MB/s
SanDisk Ultra II 120G          | X31200RL | 7.6 MB/s | 28.7 MB/s | 40 MB/s
SanDisk Ultra Plus 256GB       | X2306RL  | 6 MB/s   | 19 MB/s   | 33 MB/s
Intel 510                      | ???      | 4.2 MB/s | ???       | ???
Crucial MX200                  | ???      | 3.7 MB/s | ???       | ???
Plextor M6e 120GB              | ???      | 2.7 MB/s | ???       | ???
PLEXTOR PX-128M5               | ???      | 2.6 MB/s | ???       | ???
Samsung XP941 256GB            | ???      | 2.5 MB/s | 5 MB/s    | ???
Adata SP920                    | ???      | 2.2 MB/s | ???       | ???
Samsung 850 evo 250GB          | ???      | 1.9 MB/s | ???       | ???
Samsung 840 evo 250GB          | ???      | 1.9 MB/s | ???       | ???
Samsung 850 Pro 256GB          | ???      | 1.5 MB/s | 4 MB/s    | 6.7 MB/s
Samsung 850 Pro 128GB          | ???      | 1.2 MB/s | ???       | ???
Toshiba OCZ VT180 960GB        | ???      | 1.0 MB/s | 1.7 MB/s  | 3.3 MB/s
Adata SP900                    | ???      | 0.8 MB/s | ???       | ???
Crucial m550                   | ???      | 0.8 MB/s | ???       | ???
INTEL 535 SSDSC2BW240H6 240GB  | ???      | 401 kB/s | ???       | ???