Source data:
- ~500,000 folders (court cases)
- ~2.5-3 million documents
- The source drive is replicated x2 with RAID
- Copying to a NAS over gigabit Ethernet
- The initial, un-tuned copy was on track to take ~2 weeks (and that was after switching to Robocopy – before that, it was painful just to do an ls)
- The final copy took ~24 hours
 
Monitoring:
- Initially I saw 20-40 Kbps of traffic in DD-WRT, which was clearly too low. After some changes it is still generally low, but with spikes up to 650 Kbps.
- CPU use – 4 of 8 cores in use, even with more than 8 threads assigned to Robocopy
- In Computer Management -> Performance monitoring, the disk being copied from is reading as fast as it can (pegged at 100 the whole time)
- The counter called "Split IO/sec" is very high much of the time. Research suggests this could be improved with a defrag (though that might take me months to complete). A quick way to watch the counter from the command line is shown below.
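
For reference, Performance Monitor counters can also be sampled from a plain command prompt with typeperf (the _Total instance below is a placeholder; a specific disk instance can be named instead):

    typeperf "\PhysicalDisk(_Total)\Split IO/Sec" "\PhysicalDisk(_Total)\% Disk Time" -sc 30

A consistently high Split IO/Sec alongside ~100% disk time is the signature of a fragmented, seek-bound source disk, which matches the defrag suggestion above.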
 
Filesystem Lessons:
- NTFS can hold a folder with a huge number of files, but enumerating it takes forever
- When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!), which pauses any copy/ls operations
- The copy does not appear to be CPU bound – even with Robocopy set to use many threads, only 4 of 8 cores are in use, at 5-15% each
- ext4 (the destination filesystem) allows at most ~64,000 subdirectories per directory – any more and you get an error
- I split all 500k folders into 256*256 buckets at random (for instance, opening \36\0f might show a half-dozen items). The buckets are chosen by taking the md5 of each folder name – basically this uses the filesystem as a tree map (see the sketch after this list).
- One nice consequence is that you can estimate how far along the copy is by counting how many top-level buckets have finished (85/256 -> 33%, etc.)
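
Roughly what the bucketing looked like: a minimal sketch in Python, assuming the hash is taken over each case folder's name. The names bucket_for, shard_tree, src_root and dst_root are made up for illustration.

    import hashlib
    import os
    import shutil

    def bucket_for(name):
        # First four hex characters of the md5 digest -> one of 256*256 buckets (e.g. 36\0f)
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return os.path.join(digest[0:2], digest[2:4])

    def shard_tree(src_root, dst_root):
        # Move every case folder directly under src_root into dst_root\xx\yy\<name>
        for name in os.listdir(src_root):
            dst_dir = os.path.join(dst_root, bucket_for(name))
            os.makedirs(dst_dir, exist_ok=True)
            shutil.move(os.path.join(src_root, name), os.path.join(dst_dir, name))

Because md5 spreads names roughly uniformly, each bucket ends up with about 500,000 / 65,536 ≈ 8 folders, comfortably under the ext4 limit and small enough to enumerate instantly.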
 
Robocopy Options:
- Robocopy lets you send its logging to a file instead of the console with /LOG:output.txt
- Robocopy lets you set the number of threads it uses (the default is 8). It seemed to run faster with more than 8, but only the first few extra threads made any difference (an example invocation is shown below).
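
For instance, an invocation along these lines (the source and destination paths are placeholders) copies the whole tree with 16 threads and writes the log to a file via /LOG; /MT sets the thread count:

    robocopy D:\cases \\nas\cases /E /MT:16 /LOG:output.txt

Adding /NFL and /NDL suppresses the per-file and per-directory listing, which keeps the log from ballooning across millions of entries.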
 
To investigate:
- Ways of using virtual filesystems – it'd be nice to keep using wget to download, but split the large folders into batches for scraping.
- One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems – not sure about the performance overhead.