Diagnosing Disk I/O issues in a VPS

Every so often, my Linode goes into a state of apparent frantic I/O. Page loads slow down a bit, and I get regular email alerts indicating a potential problem:

Subject: Linode Alert - disk io rate

Your Linode, linode90147, has exceeded the notification threshold (800) for disk io rate by averaging 2146.05 for the last 2 hours. The dashboard for this Linode is located at: ...

This is the first time this has happened since I switched entirely to nginx. My first step was to install the sysstat package, which provides iostat, sar, and pidstat, to see what was going on.

apt-get install sysstat

The initial output of iostat looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.28    0.19    1.21    0.01   97.56

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda             12.53       168.70        73.94  250258986  109688664
xvdb             23.49       127.81        86.97  189603528  129015512

Run without an interval, iostat reports averages since boot rather than the current read/write rate, so the numbers don’t look nearly as high as what Linode is reporting. For continuous reporting at two-second intervals, run the following:

iostat -d 2

This showed the read/write rates running anywhere from 0 to 5500 blocks/second; at 512 bytes per block, the peak is about 2.8 MB/s. Some points to note: xvda and xvdb are Xen virtual disks. Watching the usage for a while, both disks are busy at about the same time, but most of the writes go to xvdb, which may indicate that a lot of data is moving between memory and swap space.
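
The block-to-megabyte conversion is easy to get wrong by a factor, so here is a minimal sketch of the arithmetic. The function name blocks_to_mb is made up for illustration, and it assumes the 512-byte blocks mentioned above.

```shell
# Convert an iostat blocks-per-second figure to MB/s.
# Assumption: 512-byte blocks. blocks_to_mb is an illustrative name,
# not part of sysstat.
blocks_to_mb() {
  awk -v blocks="$1" 'BEGIN { printf "%.2f\n", blocks * 512 / 1000000 }'
}
```

For the peak seen here, `blocks_to_mb 5500` prints 2.82.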

To find out which process(es) are doing the disk use, I ran the following:

pidstat -d 2 300

This takes 300 I/O samples at two second intervals (i.e. for ten minutes). It prints out each sample and an average summary. Running this, I got the following output:

Average:          PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
Average:          996      0.01      2.31      0.00  kjournald
Average:         1930      2.43      0.07      0.00  rsyslogd
Average:         1958      0.10      0.00      0.00  atd
Average:         1959    109.13     14.20      0.00  cron
Average:         1971      0.26      0.00      0.00  memcached
Average:         1978     47.54      0.48      0.00  mysqld
Average:         2045     26.32      0.11      0.03  munin-node
Average:         2131     10.21      0.01      0.00  sendmail-mta
Average:         2234      2.62      0.01      0.00  ntpd
Average:         2397     10.71      0.00      0.00  fail2ban-server
Average:        13689      0.29      0.00      0.00  pidstat
Average:        14427      0.23      0.00      0.00  cron
Average:        14428      0.18      0.00      0.00  sh
Average:        14431      0.01      0.00      0.00  munin-cron
Average:        14432      9.95      0.01      0.00  munin-update
Average:        14433      6.95      5.61      0.00  munin-update
Average:        14434     10.14      0.04      0.01  munin-node
Average:        14813      0.04      0.00      0.00  vmstat
Average:        14814      0.15      0.00      0.00  vmstat
Average:        15685     12.96      0.00      0.00  php-cgi
Average:        15686     13.79      0.00      0.00  php-cgi
Average:        15687     20.47      0.59      0.15  php-cgi
Average:        15688     15.01      0.00      0.00  php-cgi
Average:        15689     33.72      0.00      0.00  php-cgi
Average:        15690     15.53      0.00      0.00  php-cgi
Average:        15691      9.60      0.00      0.00  php-cgi
Average:        15692     19.13      0.00      0.00  php-cgi
Average:        15693     12.64      0.00      0.00  php-cgi
Average:        15694     15.76      0.00      0.00  php-cgi
Average:        15695     16.30      0.01      0.00  php-cgi
Average:        15696     18.60      0.00      0.00  php-cgi
Average:        15697     12.65      0.00      0.00  php-cgi
Average:        16334     30.81      0.72      0.15  php-cgi
Average:        16338     14.99      0.00      0.00  php-cgi
Average:        21209      1.21      0.00      0.00  sshd
Average:        31579      1.03      0.86      0.79  nginx
Average:        31580      0.67      0.03      0.00  nginx
Average:        31581      1.11      0.07      0.00  nginx
Average:        31582      1.82      0.17      0.00  nginx
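
With each php-cgi worker listed on its own row, it is hard to see at a glance how PHP as a whole compares to cron or mysqld. A rough sketch that totals the averages per command might look like this. summarize_pidstat is a hypothetical helper, and it assumes the five-column "Average:" layout shown above; newer sysstat releases add columns such as UID, which would shift the field numbers.

```shell
# Total the pidstat -d "Average:" rows per command name, largest reader first.
# Assumed layout: Average: PID kB_rd/s kB_wr/s kB_ccwr/s Command
# summarize_pidstat is an illustrative name, not a sysstat tool.
summarize_pidstat() {
  awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ {
         rd[$NF] += $3    # read rate, kB/s
         wr[$NF] += $4    # write rate, kB/s
       }
       END {
         for (cmd in rd) printf "%s %.2f %.2f\n", cmd, rd[cmd], wr[cmd]
       }' | sort -k2,2nr
}
```

It reads pidstat output on standard input, for example `pidstat -d 2 300 | summarize_pidstat`, and skips the header row because its second field is not a numeric PID.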

There are three areas to investigate: PHP, MySQL, and cron. I know I added some cron jobs, so I checked that first. To list the crontab of every user:

for user in $(cut -f1 -d: /etc/passwd); do crontab -u "$user" -l; done
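
That loop only covers per-user crontabs. Package-installed jobs typically live in the system-wide cron directories under /etc, so a small sketch for listing those may help as well. list_system_cron is a hypothetical name, and the directory names assume a Debian-style cron layout.

```shell
# List scripts in the system-wide cron directories.
# Assumption: Debian-style layout (/etc/cron.d, /etc/cron.daily, ...).
# list_system_cron is an illustrative name; the optional argument
# overrides the /etc root.
list_system_cron() {
  cron_root="${1:-/etc}"
  for cron_dir in cron.d cron.hourly cron.daily cron.weekly cron.monthly; do
    [ -d "$cron_root/$cron_dir" ] || continue
    for cron_file in "$cron_root/$cron_dir"/*; do
      # An empty directory leaves the glob unexpanded; -f filters that out.
      if [ -f "$cron_file" ]; then
        printf '%s/%s\n' "$cron_dir" "${cron_file##*/}"
      fi
    done
  done
}
```

Calling `list_system_cron` with no argument prints one line per script, prefixed by the directory that sets its schedule.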

From this output, I removed two obsolete hourly tasks I had created. For good measure, I also decreased the frequency of the man-db job from daily to monthly, and removed the apache2 cleanup job (no longer used) and popularity-contest. Everything remaining appears to be important to system maintenance. The following is a second performance log, taken after these changes. Very little has changed.

Average:            1      1.22      0.02      0.01  init
Average:          996      0.03      3.51      0.00  kjournald
Average:         1930      2.74      0.26      0.00  rsyslogd
Average:         1959    131.20     14.24      0.00  cron
Average:         1971      0.01      0.00      0.00  memcached
Average:         1978     78.06      0.63      0.11  mysqld
Average:         2045     27.10      0.09      0.03  munin-node
Average:         2131     23.92      0.03      0.00  sendmail-mta
Average:         2234      2.77      0.00      0.00  ntpd
Average:         2397     13.69      0.00      0.00  fail2ban-server
Average:        15685     24.56      0.01      0.00  php-cgi
Average:        15686     17.49      0.00      0.00  php-cgi
Average:        15687     40.37      0.00      0.00  php-cgi
Average:        15688     21.09      0.01      0.00  php-cgi
Average:        15689     29.78      0.00      0.00  php-cgi
Average:        15690     10.94      0.00      0.00  php-cgi
Average:        15691     23.94      0.00      0.00  php-cgi
Average:        15692     16.95      0.01      0.00  php-cgi
Average:        15693      8.02      0.01      0.00  php-cgi
Average:        15694      7.47      0.01      0.00  php-cgi
Average:        15695     23.33      0.00      0.00  php-cgi
Average:        15696      8.47      0.01      0.00  php-cgi
Average:        15697     13.05      0.01      0.00  php-cgi
Average:        16334     13.13      0.01      0.00  php-cgi
Average:        16338     11.27      0.00      0.00  php-cgi
Average:        21209      0.60      0.00      0.00  sshd
Average:        22998      0.41      0.00      0.00  pidstat
Average:        31579      0.51      0.03      0.00  nginx
Average:        31580      1.37      3.08      1.43  nginx
Average:        31581      1.32      1.55      1.44  nginx
Average:        31582      2.55      1.77      0.00  nginx

Individual numbers moved a little, but not enough to be significant (the php-cgi read rates total roughly 262 kB/s before and 270 kB/s after). Next up: PHP. I had APC working when I was running Apache, but perhaps it’s not working now, with nginx as the primary server.

I rebuilt APC from scratch, in case there was a newer version. The lynchpin of this was discovering multiple php.ini files on the VPS. The instructions for building APC are as follows:

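wget http://pecl.php.net/get/APC-3.1.9.tgz
tar -xzf APC-3.1.9.tgz
cd APC-3.1.9
phpize
# --with-php-config expects the php-config binary, not php.ini.
# /usr/bin/php-config is an assumed location; check with: which php-config
./configure --enable-apc --enable-apc-mmap --with-apxs --with-php-config=/usr/bin/php-config
make
make test
make install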

vi /etc/php5/cgi/php.ini

Add this line at the end:

extension=apc.so

Then restart php-cgi. For example, if you run PHP FastCGI behind nginx as an init.d service, do something like this:

 /etc/init.d/php-fastcgi restart
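
Since the fix hinged on editing the right php.ini, it can help to check every copy at once. The following is a sketch only: apc_ini_status is a hypothetical helper, /etc/php5 is assumed as the default search root, and it does not look at extra .ini files that a distribution may load from a conf.d directory.

```shell
# Report, for each php.ini under a directory, whether it enables apc.so.
# Assumptions: /etc/php5 as the default root; only files named php.ini
# are checked. apc_ini_status is an illustrative name.
apc_ini_status() {
  ini_root="${1:-/etc/php5}"
  find "$ini_root" -name php.ini | sort | while read -r ini; do
    # Match an uncommented "extension=apc.so" line (quotes optional).
    if grep -Eq '^[[:space:]]*extension[[:space:]]*=[[:space:]]*"?apc\.so' "$ini"; then
      echo "$ini: apc enabled"
    else
      echo "$ini: apc missing"
    fi
  done
}
```

A line that is commented out with a leading semicolon counts as missing, which is the case that is easy to overlook when there are several php.ini files.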

I re-ran the performance test. PHP activity is pretty much gone. It looks like traffic is lower at the moment as well, but apc.php shows about 80% cache hits. For memory’s sake, it would be nice to share one WordPress installation across sites, but this has some significant challenges (e.g. handling upgrades). For now, disk use has slowed, so I will leave MySQL tuning for another day.

Average:            8    0.00    0.01    0.00    0.01     -  kworker/1:0
Average:          271    0.00    0.00    0.00    0.00     -  kswapd0
Average:          996    0.00    0.00    0.00    0.00     -  kjournald
Average:         1730    0.00    0.01    0.00    0.01     -  kworker/3:1
Average:         1864    0.00    0.00    0.00    0.00     -  kworker/2:1
Average:         1930    0.00    0.00    0.00    0.00     -  rsyslogd
Average:         1959    0.00    0.00    0.00    0.00     -  cron
Average:         1971    0.00    0.00    0.00    0.00     -  memcached
Average:         1978    0.19    0.09    0.00    0.28     -  mysqld
Average:         2045    0.00    0.00    0.00    0.01     -  munin-node
Average:         2131    0.00    0.00    0.00    0.00     -  sendmail-mta
Average:         2234    0.00    0.00    0.00    0.01     -  ntpd
Average:         2280    0.00    0.00    0.00    0.00     -  flush-202:0
Average:         2397    0.02    0.00    0.00    0.02     -  fail2ban-server
Average:         8895    0.25    0.02    0.00    0.27     -  php-cgi
Average:         8896    0.23    0.03    0.00    0.26     -  php-cgi
Average:         8897    0.20    0.02    0.00    0.23     -  php-cgi
Average:         8898    3.89    0.01    0.00    3.90     -  php-cgi
Average:        10837    0.16    0.41    0.00    0.57     -  pidstat
Average:        23460    0.00    0.01    0.00    0.02     -  sshd
Average:        23599    0.00    0.01    0.00    0.01     -  kworker/0:1
Average:        25025    0.01    0.02    0.00    0.03     -  nginx
Average:        25026    0.00    0.00    0.00    0.01     -  nginx
Average:        25027    0.00    0.00    0.00    0.01     -  nginx
Average:        25028    0.00    0.00    0.00    0.01     -  nginx

4 Replies to “Diagnosing Disk I/O issues in a VPS”

Tomas says:

January 27, 2012 at 1:47 pm

Very good to see average lists in the format you provide. These could be reused with proper design.

1. admin says:
  
  January 27, 2012 at 2:21 pm
  
  Thanks. I’m working on a more generic test case based on this idea so I can do performance testing of different architectural configurations.
  
best says:

March 26, 2013 at 5:37 pm

Did your linode slow down when you reach your 2000+ Disk IO rate? I also had a linode before and was experiencing the same thing. I can’t find the culprit so I just tried moving to a dedicated server, my site is back to normal now maybe I am really just maxing out their IO.

I bookmarked your blog! Might be useful in case I encounter the same problem again.
