Having investigated seemingly expensive SaaS scraping software, I wanted to tease out what the challenges are, and open the door to some interesting projects. I have some background in data warehousing, and a little exposure to natural language processing, but in order to do any of those things I needed a source of data.
The dataset I built is 58,000 Flippa auctions, which have fairly well-structured pages with fielded data. I augmented the data by doing a crude form of entity extraction to see what business models or partners are most commonly mentioned in website auctions.
I did the downloading with wget, which worked great for this. One of my concerns with the SaaS solution I demoed, is that if you made a mistake in parsing one field, you might have to pay to re-download some subset of the data.
By this point, the reader may have noticed that my host environment is Windows, chosen primarily to get the best value from Steam. The virtualization environment is created on VirtualBox using Vagrant and Chef, which make creating virtual machines fairly easy. Unfortunately, it is also easy to destroy them. I kept the data on the main machine, backed up in git, to prevent wasting days of downloading. This turned out to be annoying because it required dealing with two operating systems (Ubuntu and Windows), which have different configuration settings for networking.
As the data volume increased, I found many new defects with this approach. Most were environmental issues, such as timeouts and settings for the maximum number of TCP connections (presumably these are low by default in Windows to slow the spread of bots).
Garbage collection also presented an issue, since the Chrome processes consume resources at an essentially fixed rate (their memory disappears when the process ends). The garbage collection in Node.js causes a sawtooth memory pattern. During this process many Chrome tabs open. The orchestration script must watch for this in order to slow down and allow Node.js to catch up. This script should also pause if the CPU overheats; unfortunately I have not been able to read CPU temperature. Although this capability is supposedly supported by Windows APIs, it is not supported either by Intel’s drivers or my chip.
A while back I read about Netflix’s Chaos Monkey and tried to apply its principle of assuming failure to my system. Ideally a parsing script should not stop in the middle of a several day run, so it is necessary to handle errors gracefully. Although the scripts have fail-retry logic, it unfortunately differs in each. Node.js restarts if it crashes because it is running tandem with Forever. The orchestration script doesn’t seem to crash, but supports resumption at any point, and watches the host machine to see if it should slow down. The third script, the Chrome extension, watches for failures from RPC calls, and does exponential backoff to retry.
Using the browser as a front-end gives you a free debugger and script interface, as well as a tool to generate xpath expressions.
The current script runs five to ten thousand entries before requiring attention. I intend to experiment with PhantomJS in order to improve performance, enable sharding, and support in-memory connections.
See source on Github here
Thanks to Ariele and Melissa for editing