Diagnosing Disk I/O issues in a VPS
Every so often, my Linode goes into a state of apparent frantic I/O. Page loads slow down a bit, and I get regular email alerts indicating a potential problem:
Subject: Linode Alert - disk io rate
Your Linode, linode90147, has exceeded the notification threshold (800) for disk io rate by averaging 2146.05 for the last 2 hours.
The dashboard for this Linode is located at: ...
This is the first time it has happened since I switched entirely to nginx. My first step was to install the sysstat package, which provides iostat and sar, to see what is going on.
apt-get install sysstat
The initial output of iostat looks like this:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.28    0.19    1.21    0.01   97.56

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda             12.53       168.70        73.94  250258986  109688664
xvdb             23.49       127.81        86.97  189603528  129015512
Run without an interval, iostat reports averages since boot rather than current activity, which is why these rates look nowhere near as high as what Linode is reporting. To get continuous reporting, give it an interval in seconds:
iostat -d 2
This showed the read/write rates running anywhere from 0 to 5500 blocks/second, which at 512 bytes/block is about 2.8 MB/s. Some points to note: xvda and xvdb are Xen virtual disks. Watching the usage for a while, both disks are busy at about the same time, but most of the writes go to xvdb, which may indicate swap activity if xvdb is the swap disk.
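The block-to-megabyte conversion above is easy to get wrong by a factor of two, so here is a small sketch of the arithmetic. The helper name blocks_to_mb is made up for illustration, and it assumes iostat is reporting 512-byte blocks, as it does in the output above.

```shell
# Convert an iostat block rate (512-byte blocks per second) to MB/s.
# blocks_to_mb is a hypothetical helper name, not part of sysstat.
blocks_to_mb() {
    awk -v blocks="$1" 'BEGIN { printf "%.2f\n", blocks * 512 / 1000000 }'
}

# Peak rate quoted above
blocks_to_mb 5500
```

To check the swap theory, `cat /proc/swaps` lists the active swap devices, and the si/so columns of `vmstat 2` show how much is being swapped in and out.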
To find out which process(es) are doing the disk use, I ran the following:
pidstat -d 2 300
This takes 300 I/O samples at two second intervals (i.e. for ten minutes). It prints out each sample and an average summary. Running this, I got the following output:
Average:      PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
Average:      996      0.01      2.31      0.00  kjournald
Average:     1930      2.43      0.07      0.00  rsyslogd
Average:     1958      0.10      0.00      0.00  atd
Average:     1959    109.13     14.20      0.00  cron
Average:     1971      0.26      0.00      0.00  memcached
Average:     1978     47.54      0.48      0.00  mysqld
Average:     2045     26.32      0.11      0.03  munin-node
Average:     2131     10.21      0.01      0.00  sendmail-mta
Average:     2234      2.62      0.01      0.00  ntpd
Average:     2397     10.71      0.00      0.00  fail2ban-server
Average:    13689      0.29      0.00      0.00  pidstat
Average:    14427      0.23      0.00      0.00  cron
Average:    14428      0.18      0.00      0.00  sh
Average:    14431      0.01      0.00      0.00  munin-cron
Average:    14432      9.95      0.01      0.00  munin-update
Average:    14433      6.95      5.61      0.00  munin-update
Average:    14434     10.14      0.04      0.01  munin-node
Average:    14813      0.04      0.00      0.00  vmstat
Average:    14814      0.15      0.00      0.00  vmstat
Average:    15685     12.96      0.00      0.00  php-cgi
Average:    15686     13.79      0.00      0.00  php-cgi
Average:    15687     20.47      0.59      0.15  php-cgi
Average:    15688     15.01      0.00      0.00  php-cgi
Average:    15689     33.72      0.00      0.00  php-cgi
Average:    15690     15.53      0.00      0.00  php-cgi
Average:    15691      9.60      0.00      0.00  php-cgi
Average:    15692     19.13      0.00      0.00  php-cgi
Average:    15693     12.64      0.00      0.00  php-cgi
Average:    15694     15.76      0.00      0.00  php-cgi
Average:    15695     16.30      0.01      0.00  php-cgi
Average:    15696     18.60      0.00      0.00  php-cgi
Average:    15697     12.65      0.00      0.00  php-cgi
Average:    16334     30.81      0.72      0.15  php-cgi
Average:    16338     14.99      0.00      0.00  php-cgi
Average:    21209      1.21      0.00      0.00  sshd
Average:    31579      1.03      0.86      0.79  nginx
Average:    31580      0.67      0.03      0.00  nginx
Average:    31581      1.11      0.07      0.00  nginx
Average:    31582      1.82      0.17      0.00  nginx
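With this many worker processes, it is easier to compare runs if the averages are totalled per command. The following is a rough sketch rather than anything pidstat provides: sum_by_command is a made-up helper, and it assumes the column layout shown above (PID, kB_rd/s, kB_wr/s, kB_ccwr/s, Command). Newer sysstat releases print extra columns, so the field numbers would need adjusting there.

```shell
# Total the pidstat -d "Average:" rows per command name.
# Assumed layout: Average: PID kB_rd/s kB_wr/s kB_ccwr/s Command
# sum_by_command is a hypothetical helper, not part of sysstat.
sum_by_command() {
    awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ {
             rd[$NF] += $3      # kB read per second
             wr[$NF] += $4      # kB written per second
         }
         END {
             for (c in rd) printf "%-16s %10.2f %10.2f\n", c, rd[c], wr[c]
         }' | sort -k2,2nr      # busiest readers first
}
```

It reads pidstat output on stdin, for example `pidstat -d 2 300 | sum_by_command`, and prints one line per command with the summed read and write rates.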
That leaves three avenues to research: PHP, MySQL, and cron. I know I added some cron jobs, so I tested that first. To see every user's cron jobs (run as root):
for user in $(cut -f1 -d: /etc/passwd); do crontab -u "$user" -l 2>/dev/null; done
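Per-user crontabs are only part of the picture. On Debian-style systems, packages usually install their jobs under /etc/cron.d and the /etc/cron.hourly, daily, weekly and monthly directories, which `crontab -l` does not show. A minimal sketch for listing those follows; list_system_cron is a made-up helper, and its optional directory argument exists only so it can be pointed somewhere other than /etc.

```shell
# List system-wide cron entries that `crontab -l` does not show.
# list_system_cron is a hypothetical helper; the optional argument
# (default /etc) is only there so it can be run against a test directory.
list_system_cron() {
    root="${1:-/etc}"
    if [ -f "$root/crontab" ]; then
        echo "== $root/crontab"
        # Show only the active lines (skip comments and blanks)
        grep -v '^[[:space:]]*#' "$root/crontab" | grep -v '^[[:space:]]*$' || true
    fi
    for dir in cron.d cron.hourly cron.daily cron.weekly cron.monthly; do
        if [ -d "$root/$dir" ]; then
            echo "== $root/$dir"
            ls "$root/$dir"
        fi
    done
}
```

Running `list_system_cron` with no argument prints the active lines of /etc/crontab followed by the script names in each cron directory.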
From the crontab output, I removed two obsolete hourly tasks I had created. For good measure, I also decreased the frequency of the man-db job from daily to monthly, and removed the apache2 cleanup job (no longer used) and popularity-contest. Everything remaining appears to be important to system maintenance. The following is a second performance log, taken after these changes. Very little has changed.
Average:        1      1.22      0.02      0.01  init
Average:      996      0.03      3.51      0.00  kjournald
Average:     1930      2.74      0.26      0.00  rsyslogd
Average:     1959    131.20     14.24      0.00  cron
Average:     1971      0.01      0.00      0.00  memcached
Average:     1978     78.06      0.63      0.11  mysqld
Average:     2045     27.10      0.09      0.03  munin-node
Average:     2131     23.92      0.03      0.00  sendmail-mta
Average:     2234      2.77      0.00      0.00  ntpd
Average:     2397     13.69      0.00      0.00  fail2ban-server
Average:    15685     24.56      0.01      0.00  php-cgi
Average:    15686     17.49      0.00      0.00  php-cgi
Average:    15687     40.37      0.00      0.00  php-cgi
Average:    15688     21.09      0.01      0.00  php-cgi
Average:    15689     29.78      0.00      0.00  php-cgi
Average:    15690     10.94      0.00      0.00  php-cgi
Average:    15691     23.94      0.00      0.00  php-cgi
Average:    15692     16.95      0.01      0.00  php-cgi
Average:    15693      8.02      0.01      0.00  php-cgi
Average:    15694      7.47      0.01      0.00  php-cgi
Average:    15695     23.33      0.00      0.00  php-cgi
Average:    15696      8.47      0.01      0.00  php-cgi
Average:    15697     13.05      0.01      0.00  php-cgi
Average:    16334     13.13      0.01      0.00  php-cgi
Average:    16338     11.27      0.00      0.00  php-cgi
Average:    21209      0.60      0.00      0.00  sshd
Average:    22998      0.41      0.00      0.00  pidstat
Average:    31579      0.51      0.03      0.00  nginx
Average:    31580      1.37      3.08      1.43  nginx
Average:    31581      1.32      1.55      1.44  nginx
Average:    31582      2.55      1.77      0.00  nginx
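The PHP numbers are essentially unchanged: the php-cgi workers read about 270 kB/s in total, against about 262 kB/s in the first run, so the cron cleanup made no significant difference there. Next up: PHP. I had APC working when I was running Apache, but perhaps it’s not working now, with nginx as the primary server.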
I rebuilt APC from scratch, in case there was a newer version. The linchpin was discovering multiple php.ini files on the VPS: the extension has to be enabled in the one that php-cgi actually reads. The instructions for building APC are as follows:
wget http://pecl.php.net/get/APC-3.1.9.tgz
tar -xzf APC-3.1.9.tgz
cd APC-3.1.9
phpize
# --with-php-config expects the php-config script, not a php.ini file
./configure --enable-apc --enable-apc-mmap --with-apxs --with-php-config="$(command -v php-config)"
make
make test
make install
vi /etc/php5/cgi/php.ini
Add this line at the end:
extension=apc.so
Then restart php-cgi. For example, if you set up FastCGI for nginx as an init.d service, do something like this:
/etc/init.d/php-fastcgi restart
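Since there can be several php.ini files, it is worth confirming which of them actually enable APC. The following is a rough sketch: check_apc_ini is a made-up helper, and the /etc/php5 default is assumed from the paths used above, so it may differ on other systems.

```shell
# Report which php.ini files under a directory enable the APC extension.
# check_apc_ini is a hypothetical helper; /etc/php5 is assumed from the
# paths used in this post and may differ on other systems.
check_apc_ini() {
    find "${1:-/etc/php5}" -name php.ini 2>/dev/null | while read -r ini; do
        # Match an uncommented "extension=apc.so" line (quotes optional)
        if grep -Eq '^[[:space:]]*extension[[:space:]]*=[[:space:]]*"?apc\.so"?' "$ini"; then
            echo "enabled  $ini"
        else
            echo "missing  $ini"
        fi
    done
}
```

Running `check_apc_ini` prints one line per php.ini found. As a second check, `php-cgi -m | grep -i apc` should list the module if the CGI binary has loaded it.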
I re-ran the performance test. PHP activity is pretty much gone. Traffic looks lower at the moment as well, but apc.php shows about 80% cache hits. For memory's sake, it would be nice to share WordPress installations, but this has some significant challenges (e.g. handling upgrades). For now, disk use has slowed, so I will leave MySQL tuning for another day.
Average:        8    0.00    0.01    0.00    0.01     -  kworker/1:0
Average:      271    0.00    0.00    0.00    0.00     -  kswapd0
Average:      996    0.00    0.00    0.00    0.00     -  kjournald
Average:     1730    0.00    0.01    0.00    0.01     -  kworker/3:1
Average:     1864    0.00    0.00    0.00    0.00     -  kworker/2:1
Average:     1930    0.00    0.00    0.00    0.00     -  rsyslogd
Average:     1959    0.00    0.00    0.00    0.00     -  cron
Average:     1971    0.00    0.00    0.00    0.00     -  memcached
Average:     1978    0.19    0.09    0.00    0.28     -  mysqld
Average:     2045    0.00    0.00    0.00    0.01     -  munin-node
Average:     2131    0.00    0.00    0.00    0.00     -  sendmail-mta
Average:     2234    0.00    0.00    0.00    0.01     -  ntpd
Average:     2280    0.00    0.00    0.00    0.00     -  flush-202:0
Average:     2397    0.02    0.00    0.00    0.02     -  fail2ban-server
Average:     8895    0.25    0.02    0.00    0.27     -  php-cgi
Average:     8896    0.23    0.03    0.00    0.26     -  php-cgi
Average:     8897    0.20    0.02    0.00    0.23     -  php-cgi
Average:     8898    3.89    0.01    0.00    3.90     -  php-cgi
Average:    10837    0.16    0.41    0.00    0.57     -  pidstat
Average:    23460    0.00    0.01    0.00    0.02     -  sshd
Average:    23599    0.00    0.01    0.00    0.01     -  kworker/0:1
Average:    25025    0.01    0.02    0.00    0.03     -  nginx
Average:    25026    0.00    0.00    0.00    0.01     -  nginx
Average:    25027    0.00    0.00    0.00    0.01     -  nginx
Average:    25028    0.00    0.00    0.00    0.01     -  nginx
Very good to see average lists in the format you provide. These could be reused with proper design.
Thanks. I’m working on a more generic test case based on this idea so I can do performance testing of different architectural configurations.
Did your Linode slow down when you reached the 2000+ disk IO rate? I also had a Linode before and was experiencing the same thing. I couldn’t find the culprit, so I just tried moving to a dedicated server. My site is back to normal now, so maybe I really was just maxing out their IO.
I bookmarked your blog! Might be useful in case I encounter the same problem again.