Monday, April 18, 2011

xen guest I/O issue: task blocked for more than 120 seconds

(Update to this post here, with more info for G7 blades.)

A lot of people running Xen on CentOS/RHEL (and apparently also on Debian/Ubuntu) with HP or Fujitsu hardware RAID controllers are hitting this issue. Under sustained, heavy I/O on the guest/domU, the guest load average rises (beyond 20 even on a single-CPU VM) and messages like the following appear in syslog:

INFO: task syslogd:1500 blocked for more than 120 seconds. 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
syslogd D 0000000000000110 0 1500 1 1503 1491 (NOTLB) 
 ffff8800b0739d88 0000000000000286 ffff8800b8922970 ffff8800b8922970 
 0000000000000009 ffff8800bb2dd0c0 ffff8800baa55080 0000000000002b40 
 ffff8800bb2dd2a8 0000000000000000 
Call Trace: 
 [] :jbd:log_wait_commit+0xa3/0xf5 
 [] autoremove_wake_function+0x0/0x2e 
 [] :jbd:journal_stop+0x1cf/0x1ff 
 [] __writeback_single_inode+0x1d9/0x318 
 [] do_readv_writev+0x26e/0x291 
 [] sync_inode+0x24/0x33 
 [] :ext3:ext3_sync_file+0xcc/0xf8 
 [] do_fsync+0x52/0xa4 
 [] __do_fsync+0x23/0x36 
 [] tracesys+0xab/0xb6 
Two excellent bugs to start out with are:
http://bugs.centos.org/view.php?id=4515
https://bugzilla.redhat.com/show_bug.cgi?id=605444

This problem apparently cropped up with the -194 kernels and persists into the -238 kernels; it continues even with CentOS 5.6 and 2.6.18-238.9.1.el5xen. It also affects every version of Xen I've tried, from 3.0.3 through 3.4.3 to 4.1.

There is an HP firmware update that you should definitely apply, but it doesn't appear to fix the problem completely. (Supposedly, the Fujitsu firmware update does fix the issue.) In particular, some users saw an aggravated version of the problem when a firmware bug kept the controller's onboard battery from charging. Other users were able to lessen the issue by disabling the irqbalance daemon, though of course you may not want to do that if your VM uses multiple processors.
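If you want to try the irqbalance route on CentOS/RHEL, the stock service commands are the ones below (Debian/Ubuntu uses different service tooling):
service irqbalance stop
chkconfig irqbalance off
If it doesn't help, chkconfig irqbalance on and service irqbalance start put things back the way they were.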

One extra source of ambiguity in these forum threads is that it's not always clear what other people are doing to trigger the issue. Post firmware upgrade, this issue can only reliably be triggered with sustained, heavy I/O. I used dd if=/dev/zero of=./test1024M bs=1024k count=1024 conv=fsync to force the problem during testing (the full test sequence is sketched after the results below).
- On a dom0, this test would complete in seconds at an average speed of roughly 72 MB/sec.
- On an affected domU, this test would produce the load and dmesg messages shown above, render the entire system unresponsive, and average around 300 bytes/sec.
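For what it's worth, the full test sequence looks roughly like this; the uptime and dmesg lines are just one convenient way to watch for the symptoms while (or after) the copy runs:
# write 1GB of zeroes, fsync at the end, then check the load and look for hung-task warnings
dd if=/dev/zero of=./test1024M bs=1024k count=1024 conv=fsync
uptime
dmesg | tail -n 40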

After the firmware update, the two things that made a material difference in our case (HP ProLiant BL460 G5 blades running CentOS 5.5 and 5.6 with CentOS guest VMs) were:
 - Switching the I/O scheduler to noop on both the dom0 and the domU
 - Capping the max raid speed on both the dom0 and the domU.

You can temporarily switch the I/O scheduler with something like:
echo noop > /sys/block/[your-block-device-name]/queue/scheduler
To make the change stick across reboots, add "elevator=noop" to the kernel line in /etc/grub.conf.
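You can check which scheduler is active by reading the same file; the one in brackets is in use. (xvda below is just an example device name, yours may be sda or similar.)
cat /sys/block/xvda/queue/scheduler
On a stock 2.6.18 kernel that prints something like "noop anticipatory deadline [cfq]" before the change. One wrinkle: in a dom0's grub.conf the Linux kernel usually sits on the "module" line under the xen.gz entry, so that's where elevator=noop belongs there; in a domU it goes on the plain kernel line.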

To set the max raid speed, you can use:
echo "50000" > /proc/sys/dev/raid/speed_limit_max
The value is in KB/sec; the default for raid/speed_limit_max is 200000 (roughly 200 MB/sec), and this caps it at 50000. I had expected this to matter only on the dom0, since the domU doesn't technically have a raid, but setting it in both places eliminated some intermittent issues.
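The echo doesn't survive a reboot. To make it permanent, the equivalent entry in /etc/sysctl.conf is:
dev.raid.speed_limit_max = 50000
followed by sysctl -p to apply it without rebooting.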

After changing the scheduler, the domU 1GB copy from /dev/zero would average over 40 MB/sec, but it would still sometimes freak out, crawl at 2 MB/sec, and fill dmesg with complaints, with the load occasionally climbing near 10. After also capping the raid max speed, the domU 1GB copy consistently landed between 47 and 49 MB/sec, the load never climbed above 1.5, and the VM stayed responsive. It's a band-aid, but a very effective one, and it should tide us over until there's a proper kernel or firmware fix.

*** *** ***

Update 20 June 2011:

Looking over the latest comments (see 28 and onward) on https://bugzilla.redhat.com/show_bug.cgi?id=615543, it seems that people running certain HP Smart Array models may solve this issue completely by upgrading their driver AND firmware, though people on other models are not so lucky and still need to use the noop scheduler.
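If your controller uses the cciss driver (which these Smart Arrays should under CentOS 5), you can check the driver and firmware revisions the dom0 sees with something like:
modinfo cciss | grep -i version
cat /proc/driver/cciss/cciss0
and compare them against the versions called out in the HP advisory.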

I also want to note that if you're in the middle of a slowdown from this issue and can get any response at all, you'll probably see that the amount of "Dirty" memory in /proc/meminfo is very high. Running "sync" can sometimes push the system out of the freeze a little faster than it would recover on its own.
awk '/Dirty/ {printf( "%d\n", $2 )}' /proc/meminfo
will give you just the number by itself (the value is in kB).
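If you want to watch that number over time, a throwaway loop like this works (the five-second interval is arbitrary):
while true; do awk '/Dirty/ {printf( "%d\n", $2 )}' /proc/meminfo; sleep 5; done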