How device size affects disk performance in Linux

09 02 2011
While running some tests in a client's environment, we noticed that reading from a partition of a multipath device was considerably slower than reading from its parent node:

[root@none]# dd if=mpath4 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.92711 seconds, 120 MB/s

[root@none]# dd if=mpath4p1 of=/dev/null bs=1M count=1024 skip=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.5965 seconds, 61.0 MB/s


We asked the client support of a well-known GNU+Linux vendor, and they indicated that this behavior was "expected", since these partitions are created by stacking a dm-linear device on top of the original multipath node. I wasn't satisfied with this answer, since AFAIK dm-linear only performs a simple translation of the original request by a specified offset (the beginning of the partition), so I decided to investigate a bit further on my own.

The first thing I noticed was that changing the size of the dm-linear device affected the performance of the tests:

[root@none]# echo "0 1870000 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=600
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.906487 seconds, 116 MB/s

[root@none]# dmsetup remove test
[root@none]# echo "0 1870001 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=700
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.47716 seconds, 71.0 MB/s


This was something, but I still needed to find out how a simple change in the device size could impact performance this way. Playing around with kgdb (what a nice tool!), I reached this piece of code in Linux (drivers/md/dm.c):

static int __split_bio(struct mapped_device *md, struct bio *bio)
{
	struct clone_info ci;
	int error = 0;

	ci.map = dm_get_table(md);
	if (unlikely(!ci.map))
		return -EIO;

	ci.md = md;
	ci.bio = bio;
	ci.io = alloc_io(md);
	ci.io->error = 0;
	atomic_set(&ci.io->io_count, 1);
	ci.io->bio = bio;
	ci.io->md = md;
	ci.sector = bio->bi_sector;
	ci.sector_count = bio_sectors(bio);
	ci.idx = bio->bi_idx;

	start_io_acct(ci.io);
	while (ci.sector_count && !error)
		error = __clone_and_map(&ci);

	dec_pending(ci.io, error);
	dm_table_put(ci.map);

	return 0;
}


During the debugging session, I noticed that ci.sector_count takes the value 1 for the device with the worst performance, while other devices with different sizes and better read speeds could take values in a range from 2 to 8 (the latter being the case with the best performance). So, indeed, the size of a device affects how it is accessed, and this implies a noticeable difference in performance. But it still wasn't clear to me where the root of this behavior was, so I decided to dig a bit deeper. That took me to this function (fs/block_dev.c):

void bd_set_size(struct block_device *bdev, loff_t size)
{
	unsigned bsize = bdev_logical_block_size(bdev);

	bdev->bd_inode->i_size = size;
	while (bsize < PAGE_CACHE_SIZE) {
		if (size & bsize)
			break;
		bsize <<= 1;
	}
	bdev->bd_block_size = bsize;
	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
}


This function searches for the greatest power of 2 that divides the device size, in the range from 512 (the sector size) to 4096 (the value of PAGE_CACHE_SIZE on x86), and sets it as the device's internal block size. Further direct requests to the device will be internally divided into chunks of this size, so devices whose sizes are multiples of 4096 will perform better than those which are only multiples of 2048, 1024, or 512 (the worst case, which every device satisfies, since that is the size of each sector). This is especially important in scenarios where devices are accessed directly by the application, such as in Oracle's ASM configurations.

TL;DR: Linux chooses the internal block size that will be used to fulfill page requests by searching for the greatest power of 2 that divides the device size, in a range from 512 to 4096 (on x86), so creating your partitions with a size that is a multiple of 4096 will help you obtain better disk I/O performance.



Comments

10 02 2011
#1 Luis
So finally it wasn't a bug but the expected performance...

I suppose it could be improved by always using the maximum block size for everything but the last block, which could have a different, smaller size. Algorithmically it could easily be programmed (just check whether you're accessing the last block), but would it have other consequences?

Great investigation work!
15 02 2011
#1.1 Sergio Lopez
Well, it's hard to say whether we can call this expected behavior. Probably kernel hackers knew about it, but I didn't find any documentation referring to this issue. Also, most GNU+Linux vendors seem to ignore this fact.

I really don't know how easy it would be to create an algorithm to deal with the special situation where the internal block size plus the offset exceeds the end of the device. It doesn't seem hard, but there could be side effects that are hard to measure.

Sizing the partitions properly sounds way easier :-)
16 02 2011
#1.2 Josema
Wooohooooooooooo!

Nice bug hunting, even though it couldn't classify as such by the Royal Bug Hunters Society. But it shows how important open source is to a clever individual, and the difference it makes: 100% performance in disk I/O.

This is why the word "tuning" was coined, and I award you the Disk Tuner Badge, level 1.

Cheers.
