Source data:
- ~500,000 folders (court cases)
- ~2.5-3 million documents
- The source drive is replicated x2 with RAID
- Copying to a NAS over gigabit Ethernet
- The initial, un-tuned copy was on track to take ~2 weeks (and that was after switching to Robocopy – before that, it was painful just to do an ls)
- The final copy took ~24 hours
 
Monitoring:
- Initially I saw 20-40 Kbps of traffic in DD-WRT, which was clearly too low. After some changes it is still generally low, but with spikes up to 650 Kbps.
- CPU use – 4 of 8 cores in use, even with more than 8 threads assigned to Robocopy
- In Computer Management -> Performance monitoring, the disk being copied from is reading as fast as it can (pegged at 100 the whole time)
- The counter called "Split IO/sec" is very high much of the time. Research suggests this could be improved with a defrag (though that might take me months to complete). A quick way to watch the counter from the command line is shown below.
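
For reference, Performance Monitor counters can also be sampled from a plain command prompt with typeperf (the _Total instance below is a placeholder; a specific disk instance can be named instead):

    typeperf "\PhysicalDisk(_Total)\Split IO/Sec" "\PhysicalDisk(_Total)\% Disk Time" -sc 30

A consistently high Split IO/Sec alongside ~100% disk time is the signature of a fragmented, seek-bound source disk, which matches the defrag suggestion above.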
 
Filesystem Lessons:
- NTFS can hold a folder with a huge number of files, but enumerating it takes forever
- When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!), which pauses any copy/ls operations
- The copy does not appear to be CPU bound – even with Robocopy set to use many threads, only 4 of 8 cores are in use, at 5-15% each
- ext4 (the destination filesystem) allows at most ~64,000 subdirectories per directory – any more and you get an error
- I split all 500k folders into 256*256 buckets at random (for instance, opening \36\0f might show a half-dozen items). The buckets are chosen by taking the md5 of each folder name – basically this uses the filesystem as a tree map (see the sketch after this list).
- One nice consequence is that you can estimate how far along the copy is by counting how many top-level buckets have finished (85/256 -> 33%, etc.)
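
Roughly what the bucketing looked like: a minimal sketch in Python, assuming the hash is taken over each case folder's name. The names bucket_for, shard_tree, src_root and dst_root are made up for illustration.

    import hashlib
    import os
    import shutil

    def bucket_for(name):
        # First four hex characters of the md5 digest -> one of 256*256 buckets (e.g. 36\0f)
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return os.path.join(digest[0:2], digest[2:4])

    def shard_tree(src_root, dst_root):
        # Move every case folder directly under src_root into dst_root\xx\yy\<name>
        for name in os.listdir(src_root):
            dst_dir = os.path.join(dst_root, bucket_for(name))
            os.makedirs(dst_dir, exist_ok=True)
            shutil.move(os.path.join(src_root, name), os.path.join(dst_dir, name))

Because md5 spreads names roughly uniformly, each bucket ends up with about 500,000 / 65,536 ≈ 8 folders, comfortably under the ext4 limit and small enough to enumerate instantly.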
 
Robocopy Options:
- Robocopy lets you send its logging to a file instead of the console with /LOG:output.txt
- Robocopy lets you set the number of threads it uses (the default is 8). It seemed to run faster with more than 8, but only the first few extra threads made any difference (an example invocation is shown below).
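
For instance, an invocation along these lines (the source and destination paths are placeholders) copies the whole tree with 16 threads and writes the log to a file via /LOG; /MT sets the thread count:

    robocopy D:\cases \\nas\cases /E /MT:16 /LOG:output.txt

Adding /NFL and /NDL suppresses the per-file and per-directory listing, which keeps the log from ballooning across millions of entries.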
 
To investigate:
- Ways of using virtual filesystems – it'd be nice to keep using wget to download, but split the large folders into batches for scraping.
- One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems – not sure about the performance overhead.