How device size affects disk performance in Linux
09 02 2011
While running some tests in a client's environment, we noticed that reading from a partition of a multipath device was considerably slower than reading from its parent node:
[root@none]# dd if=mpath4 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.92711 seconds, 120 MB/s
[root@none]# dd if=mpath4p1 of=/dev/null bs=1M count=1024 skip=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.5965 seconds, 61.0 MB/s
We asked the customer support of a well-known GNU+Linux vendor, and they told us this behavior was "expected", since partitions of this kind are created by stacking a dm-linear device over the original multipath node. I wasn't satisfied with this answer, since AFAIK dm-linear only performs a simple translation of the original request by a specified offset (the beginning of the partition), so I decided to investigate a bit further on my own.
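To make the "simple translation" concrete, here is a minimal sketch of what a linear mapping amounts to. The struct and function names are hypothetical and simplified for illustration; this is not the kernel's own data structure, only the arithmetic I understand dm-linear to perform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a dm-linear target (not the kernel struct). */
struct linear_target {
    uint64_t begin;  /* first sector of the mapped device covered by this target */
    uint64_t len;    /* length of the target, in 512-byte sectors */
    uint64_t start;  /* offset into the underlying device, in sectors */
};

/* Translate a sector on the mapped device into a sector on the underlying device. */
static uint64_t linear_map_sector(const struct linear_target *t, uint64_t sector)
{
    /* The request must fall inside the range covered by the target. */
    assert(sector >= t->begin && sector - t->begin < t->len);

    /* The whole mapping is a single addition of the offset. */
    return t->start + (sector - t->begin);
}
```

For a target that begins at sector 0 with an offset of 63 sectors, sector 100 of the partition maps to sector 163 of the parent device. Nothing in this arithmetic depends on the size of the request, which is why the vendor's explanation did not seem sufficient.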
The first thing I noticed was that changing the size of the dm-linear device affected the performance of the tests:
[root@none]# echo "0 1870000 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=600
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.906487 seconds, 116 MB/s
[root@none]# dmsetup remove test
[root@none]# echo "0 1870001 linear 8:96 63" | dmsetup create test
[root@none]# dd if=/dev/mapper/test of=/dev/null bs=1M count=100 skip=700
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 1.47716 seconds, 71.0 MB/s
This was a start, but I still needed to find out why a simple change in the device size could affect performance this way. Playing around with kgdb (what a nice tool!), I arrived at this piece of code from Linux (drivers/md/dm.c):
static int __split_bio(struct mapped_device *md, struct bio *bio)
{
    struct clone_info ci;
    int error = 0;

    ci.map = dm_get_table(md);
    if (unlikely(!ci.map))
        return -EIO;

    ci.md = md;
    ci.bio = bio;
    ci.io = alloc_io(md);
    ci.io->error = 0;
    atomic_set(&ci.io->io_count, 1);
    ci.io->bio = bio;
    ci.io->md = md;
    ci.sector = bio->bi_sector;
    ci.sector_count = bio_sectors(bio);    /* number of 512-byte sectors in this bio */
    ci.idx = bio->bi_idx;

    start_io_acct(ci.io);
    while (ci.sector_count && !error)
        error = __clone_and_map(&ci);

    dec_pending(ci.io, error);
    dm_table_put(ci.map);

    return 0;
}
In the debugging session, I noticed that ci.sector_count takes the value '1' for the device with the worst performance, while devices with different sizes and better read speeds take values from '2' to '8' (the latter being the case with the best performance). So the size of a device does affect how it is accessed, and this makes a noticeable difference in performance. Still, the root of this behavior wasn't clear to me, so I decided to dig a bit deeper. That took me to this function (fs/block_dev.c):
void bd_set_size(struct block_device *bdev, loff_t size)
{
    unsigned bsize = bdev_logical_block_size(bdev);

    bdev->bd_inode->i_size = size;
    while (bsize < PAGE_CACHE_SIZE) {
        if (size & bsize)
            break;
        bsize <<= 1;
    }
    bdev->bd_block_size = bsize;
    bdev->bd_inode->i_blkbits = blksize_bits(bsize);
}
This function searches for the greatest power of 2 that divides the device size, in the range from 512 (the sector size) to 4096 (the value of PAGE_CACHE_SIZE on x86), and sets it as the internal block size. Subsequent direct requests to the device are internally split into chunks of this size, so devices whose size is a multiple of 4096 will perform better than those whose size is only a multiple of 2048, 1024 or 512 (the worst case, which every device satisfies, since that is the size of a sector). This is especially important in scenarios where the application accesses devices directly, such as Oracle's ASM configurations.
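As a worked example, the loop from bd_set_size() can be reproduced in user space and applied to the two dm-linear test devices of 1870000 and 1870001 sectors. This is only a sketch: pick_block_size, chunks_for_request and SKETCH_PAGE_SIZE are names I made up for illustration, and the chunk count is an idealized model of how a request is split, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: value of PAGE_CACHE_SIZE on x86 */

/* User-space copy of the loop in bd_set_size(): keep doubling the block size
 * while the matching bit of the device size is clear, up to the page size. */
static unsigned pick_block_size(uint64_t size_bytes, unsigned logical_block_size)
{
    unsigned bsize = logical_block_size;

    /* The logical block size must be a non-zero power of two. */
    assert(bsize != 0 && (bsize & (bsize - 1)) == 0);

    while (bsize < SKETCH_PAGE_SIZE) {
        if (size_bytes & bsize)
            break;
        bsize <<= 1;
    }
    return bsize;
}

/* Idealized number of block-sized chunks needed to cover a request. */
static uint64_t chunks_for_request(uint64_t req_bytes, unsigned bsize)
{
    return (req_bytes + bsize - 1) / bsize;
}
```

With 512-byte sectors, pick_block_size(1870000ULL * 512, 512) gives 4096, while pick_block_size(1870001ULL * 512, 512) gives 512. Under this model a 1 MiB read is covered by 256 chunks of 8 sectors in the first case and 2048 chunks of 1 sector in the second, which is consistent with the ci.sector_count values of '8' and '1' seen in the debugger.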
TL;DR: Linux chooses the internal block size used to fulfill page requests by searching for the greatest power of 2 that divides the device size, in a range from 512 to 4096 (on x86), so creating your partitions with a size that is a multiple of 4096 will help you get better disk I/O performance.
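When sizing a partition or a dm-linear table by hand, one simple way to follow this advice is to round the length in sectors down to a multiple of 8 (8 x 512 = 4096 bytes). The helper below is a hypothetical sketch of that arithmetic, assuming 512-byte sectors; the names are mine and not part of any tool.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 512u  /* assumption: 512-byte sectors */

/* Round a length in sectors down to a whole number of block_bytes-sized blocks. */
static uint64_t round_down_sectors(uint64_t sectors, unsigned block_bytes)
{
    /* block_bytes must be a whole number of sectors (4096 on x86). */
    assert(block_bytes >= SECTOR_BYTES && block_bytes % SECTOR_BYTES == 0);

    uint64_t per_block = block_bytes / SECTOR_BYTES;  /* 8 sectors for 4096 bytes */
    return sectors - (sectors % per_block);
}
```

For instance, round_down_sectors(1870001, 4096) returns 1870000, the size of the faster of the two test devices.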