Monday, April 18, 2011

xen guest I/O issue: task blocked for more than 120 seconds

(Update to this post here, with more info for G7 blades.)

A lot of people on CentOS/RHEL who run Xen (and apparently also people on Debian/Ubuntu) who have HP or Fujitsu hardware RAIDs are experiencing this issue. Under sustained, heavy I/O on the guest/domU, the guest load average will rise (beyond 20 even on a single-CPU VM) and messages like the following will appear in syslog:

INFO: task syslogd:1500 blocked for more than 120 seconds. 
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
syslogd D 0000000000000110 0 1500 1 1503 1491 (NOTLB) 
 ffff8800b0739d88 0000000000000286 ffff8800b8922970 ffff8800b8922970 
 0000000000000009 ffff8800bb2dd0c0 ffff8800baa55080 0000000000002b40 
 ffff8800bb2dd2a8 0000000000000000 
Call Trace: 
 [] :jbd:log_wait_commit+0xa3/0xf5 
 [] autoremove_wake_function+0x0/0x2e 
 [] :jbd:journal_stop+0x1cf/0x1ff 
 [] __writeback_single_inode+0x1d9/0x318 
 [] do_readv_writev+0x26e/0x291 
 [] sync_inode+0x24/0x33 
 [] :ext3:ext3_sync_file+0xcc/0xf8 
 [] do_fsync+0x52/0xa4 
 [] __do_fsync+0x23/0x36 
 [] tracesys+0xab/0xb6 
Two excellent bugs to start out with are:
http://bugs.centos.org/view.php?id=4515
https://bugzilla.redhat.com/show_bug.cgi?id=605444

This problem has apparently cropped up with all -194 kernels and persists into the -238 kernels. I find it continues even with CentOS 5.6 and 2.6.18-238.9.1.el5xen. I've found that it affects all versions of Xen from 3.0.3 to 3.4.3 to 4.1.

There is a firmware update for HP that you should definitely apply, but it doesn't appear to fix the problem completely. (Supposedly, the Fujitsu firmware update does fix the issue.) Particularly, some users got an aggravated version of the problem when a firmware issue caused the onboard battery not to charge. Other users were able to help the issue by disabling irqbalanced, though of course you may not want to do this if your vm uses multiple processors.

One extra source of ambiguity with these forum posts is that it's not clear what other people are doing to cause this issue. Post firmware upgrade, this issue can only reliably be triggered with sustained, heavy I/O. I used dd if=/dev/zero of=./test1024M bs=1024k count=1024 conv=fsync to ensure problems during testing.
- On a dom0, this test would complete in seconds with an average speed of 72-ish mb/second.
- On an affected domU, this test would cause the above-referenced load and dmesg messages, cause the entire system to be unresponsive, and have an average speed around 300 bytes/second.

The two things that made a material difference in our case, with HP Proliant BL460 G5s running CentOS 5.5 and 5.6 with CentOS guest vms, were, after the firmware update:
 - Switching the I/O scheduler on both the dom0 and the domU to noop
 - Capping the max raid speed on both the dom0 and the domU.

You can temporarily switch the I/O scheduler with something like:
echo noop > /sys/block/[your-block-device-name]/queue/schedule
... to change the default on reboot, add "elevator=noop" to the kernel line in /etc/grub.conf .

To set the max raid speed, you  can use
echo "50000" > /proc/sys/dev/raid/speed_limit_max
The default for raid/speed_limit_max is 200K; this sets it to 50K. I had expected that this would only matter on the dom0, since the domU doesn't technically have a raid, but putting it in both places eliminated some intermittent issues.

After changing the scheduler, I found that the domU 1GB copy from /dev/zero would average over 40MB/sec, but would still sometimes freak out, crawl at 2MB/sec, and fill up dmesg with complaints. The load would sometimes climb near 10. After setting the raid max speed, I found that the domU 1GB copy would always be between 47 and 49 MB/sec, the load would never climb above 1.5, and the VM stayed responsive. So it's a band-aid solution, but very effective, and should tide us over until there is a kernel or firmware solution.

*** *** ***

Update 20 June 2011:

Looking over the latest comments (see 28+) on https://bugzilla.redhat.com/show_bug.cgi?id=615543 , it seems that people running certain HP Smart Arrays may completely solve this issue by upgrading their driver AND firmware, though people on other models are not quite so lucky and still need to use the noop scheduler.

I also want to note that if you're in the middle of a slowdown due to this issue and can get any response at all, you'll probably notice that the amount of "Dirty" memory in /proc/meminfo is very high, and running "sync" can sometimes force your system out of the freeze a little quicker than it would find its way out on its own.
awk '/Dirty/ {printf( "%d\n", $2 )}' /proc/meminfo
will give you just the integer by itself, if you want a script to monitor this value.

7 comments:

  1. Thank you, this issue has bugged me for several months,
    and your suggestions are so far the only ones that seem to do the trick!

    ReplyDelete
  2. So glad I could help, Mandragor, though I'm still a little lost on this issue, frankly. If you ever figure it out all the way, come back and leave a note! :)

    ReplyDelete
  3. I came across this problem recently, and your post seems to be the most informative. The workarounds you've suggested didn't seem to help me, but downgrading to a -194 kernel in the guest seemed to do the trick (CentOS 5.6, Dell 1950).

    ReplyDelete
  4. Hi John, thanks for your reply, and thanks for noting that downgrading to a -194 kernel helped you. It didn't work in my case, but you're on Dell hardware, while I'm on HP, so perhaps the issue maps slightly differently on Dells. If you happen to come back and post exactly which kernel version you're using, that might help someone else.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Is this problem with current kernels still existing?

    ReplyDelete
    Replies
    1. This information was only current as of late 2012 - I am no longer using this hardware. I'm sorry I can't provide more current information.

      Delete