{"id":969,"date":"2013-05-02T03:27:01","date_gmt":"2013-05-02T03:27:01","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=969"},"modified":"2013-05-02T03:27:01","modified_gmt":"2013-05-02T03:27:01","slug":"case-study-10x-file-copy-performance-with-robocopy","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/case-study-10x-file-copy-performance-with-robocopy\/","title":{"rendered":"Case Study: 10x File Copy Performance with Robocopy"},"content":{"rendered":"<p>Source data:<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">~500,000 folders (court cases)<\/span><\/li>\n<li>~2.5-3 million documents<\/li>\n<li>Source drive is replicated 2x with RAID<\/li>\n<li>Copying to a NAS over gigabit Ethernet<\/li>\n<li>Initial un-tuned copy was on pace to take ~2 weeks (after switching to Robocopy &#8211; before, it was painful just to do an ls)<\/li>\n<li>Final copy took ~24 hours<\/li>\n<\/ul>\n<p>Monitoring:<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">Initially I saw 20-40 Kbps of traffic in DD-WRT, clearly too low. After some changes this is still generally low, but with spikes up to 650 Kbps.<\/span><\/li>\n<li>CPU use &#8211; 4\/8 cores in use, even with &gt;8 threads assigned to Robocopy<\/li>\n<li>In Computer Management -&gt; Performance monitoring, the disk being copied from is reading as fast as it can (pegged at 100 the whole time)<\/li>\n<li>The &#8220;Split IO\/Sec&#8221; counter is very high much of the time. Research indicates this could be improved with defrag (though that might take me months to complete).<\/li>\n<\/ul>\n<p>Filesystem Lessons:<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">NTFS can hold large numbers of files in a single folder, but takes forever to enumerate them<\/span><\/li>\n<li>When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!) 
which pauses any copy\/ls operations<\/li>\n<li>The copy does not appear to be CPU bound &#8211; even setting Robocopy to use many threads, only 4\/8 cores are in use, at 5-15% each.<\/li>\n<li>ext4 (the destination filesystem) supports up to 64,000 items per folder; any more and you get an error.<\/li>\n<li>I split all 500k items into 256*256 groups (for instance, one might open \\36\\0f to see a half dozen items). The groups are assigned using\u00a0<a href=\"http:\/\/garysieling.com\/blog\/moving-files-and-folders-into-hashed-subfolders\">md5 on the folder names<\/a> &#8211; basically this uses the filesystem as a tree map.<\/li>\n<li>One nice consequence of this is that you can estimate how far along the process is by looking at how many top-level folders have been copied (85\/256 -&gt; 33%, etc.)<\/li>\n<\/ul>\n<p>Robocopy Options:<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">Robocopy lets you redirect console logging to a file with \/LOG:output.txt<\/span><\/li>\n<li>Robocopy lets you set the number of threads it uses (\/MT). 
By default this is 8; it seemed to run faster with &gt;8, but only the first few threads made any difference.<\/li>\n<\/ul>\n<p>To investigate:<\/p>\n<ul>\n<li><span style=\"line-height: 13px;\">Ways of using virtual filesystems &#8211; it&#8217;d be nice to continue using wget to download, but to split large folders into batches for scraping.\u00a0<\/span><\/li>\n<li>One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems &#8211; not sure about the performance overhead.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Source data: ~500,000 folders (court cases) ~2.5-3 million documents Source drive is replicated 2x with RAID Copying to a NAS over gigabit Ethernet Initial un-tuned copy was on pace to take ~2 weeks (after switching to Robocopy &#8211; before, it was painful just to do an ls) Final copy took ~24 hours Monitoring: Initially I saw 20-40 &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/case-study-10x-file-copy-performance-with-robocopy\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Case Study: 10x File Copy Performance with 
Robocopy&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[6,7],"tags":[204,377,467,535,554],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/969"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=969"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/969\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=969"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=969"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}