{"id":131,"date":"2012-05-17T01:27:59","date_gmt":"2012-05-17T01:27:59","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=131"},"modified":"2012-05-17T01:27:59","modified_gmt":"2012-05-17T01:27:59","slug":"building-a-website-scraper-using-chrome-and-node-js","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/building-a-website-scraper-using-chrome-and-node-js\/","title":{"rendered":"Building a Website Scraper using Chrome and Node.js"},"content":{"rendered":"<p>A couple of months back, I did a proof of concept to build a scraper entirely in JavaScript, using <a href=\"http:\/\/nightly.webkit.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">webkit<\/a> (Chrome) as a parser and front-end.<\/p>\n<p>Having investigated <a href=\"http:\/\/www.mozenda.com\/pricing\" target=\"_blank\" rel=\"noopener noreferrer\">seemingly expensive<\/a> SaaS scraping software, I wanted to tease out what the challenges are, and open the door to some interesting projects. I have some background in data warehousing, and a little exposure to natural language processing, but in order to do any of those things I needed a source of data.<\/p>\n<p>The <a href=\"https:\/\/github.com\/garysieling\/chrome-scraper\/blob\/master\/sample-data.csv\" target=\"_blank\" rel=\"noopener noreferrer\">dataset I built<\/a> is 58,000 <a href=\"https:\/\/flippa.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Flippa auctions<\/a>, which have fairly well-structured pages with fielded data. I augmented the data by doing a crude form of entity extraction to see what business models or partners are most commonly mentioned in website auctions.<\/p>\n<h2>Architecture<\/h2>\n<p>I did the downloading with wget, which worked great for this. One of my concerns with the SaaS solution I demoed, is that if you made a mistake in parsing one field, you might have to pay to re-download some subset of the data.<\/p>\n<p>One of my goals was to use a single programming language. In my solution, each downloaded file is <a href=\"https:\/\/github.com\/garysieling\/chrome-scraper\/blob\/master\/run.js\" target=\"_blank\" rel=\"noopener noreferrer\">opened in a Chrome tab<\/a>, <a href=\"https:\/\/github.com\/garysieling\/chrome-scraper\/blob\/master\/project.js\" target=\"_blank\" rel=\"noopener noreferrer\">parsed<\/a>, and <a href=\"https:\/\/github.com\/garysieling\/chrome-scraper\/blob\/master\/parser.user.js\" target=\"_blank\" rel=\"noopener noreferrer\">then closed<\/a>. I used Chrome because it is fast, but this should be easily portable to Firefox, as the activity within Chrome is a <a href=\"http:\/\/www.chromium.org\/developers\/design-documents\/user-scripts\" target=\"_blank\" rel=\"noopener noreferrer\">Greasemonkey script<\/a>. Opening the Chrome tabs is done through <a href=\"http:\/\/blog.idleworx.com\/2010\/01\/windows-scripting-host-and-javascript.html\" target=\"_blank\" rel=\"noopener noreferrer\">Windows Scripting Host<\/a> (WSH). The chrome extension connects to a <a href=\"http:\/\/nodejs.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Node.js<\/a> server to retrieve the actual parsing code and save data back to a <a href=\"http:\/\/www.postgresql.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Postgres database<\/a>. Having JavaScript on both client and server was fantastic for handling the back and forth communication. 
Despite sharing a single programming language, the three scripts (WSH, Node.js, and Greasemonkey) have very different APIs and programming models, so it's not as simple as I would like. Being accustomed to Apache, I was a little disappointed that I had to track down a script just to keep Node.js running.

Incidentally, WSH uses Internet Explorer (IE) to run its JavaScript; this worked well, unlike the typical web programming experience with IE. My first version of the script was a Cygwin bash script, which involved more resource utilization (i.e., threads) than Cygwin could handle. Once I switched to WSH I had no further problems of that sort, which is not surprising considering its long-standing use in corporate environments.

## Challenges

By this point, the reader may have noticed that my host environment is Windows, chosen primarily to get the best value from [Steam](http://store.steampowered.com/). The virtualization environment is created on [VirtualBox](http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html#vbox) using [Vagrant](http://vagrantup.com/) and [Chef](http://www.opscode.com/chef/), which make creating virtual machines fairly easy. Unfortunately, they are also easy to destroy, so I kept the data on the host machine, backed up in git, to avoid losing days of downloading. This turned out to be annoying because it meant dealing with two operating systems (Ubuntu and Windows), which have different networking configuration settings.

As the data volume increased, I found many new defects in this approach. Most were environmental issues, such as timeouts and the maximum number of TCP connections (presumably set low by default in Windows to slow the spread of bots).

Garbage collection also presented an issue: the Chrome processes consume resources at an essentially fixed rate (their memory disappears when each process ends), while garbage collection in Node.js produces a sawtooth memory pattern. While Node.js catches up, many Chrome tabs pile up, so the orchestration script must watch for this and slow down. The script should also pause if the CPU overheats; unfortunately I have not been able to read the CPU temperature. Although this capability is supposedly supported by Windows APIs, it is not supported by either Intel's drivers or my chip.

## Successes

A while back I read about [Netflix's Chaos Monkey](http://techblog.netflix.com/2011/07/netflix-simian-army.html) and tried to apply its principle of assuming failure to my system. Ideally a parsing run should not stop in the middle of several days of work, so errors have to be handled gracefully. Each of the scripts has fail-retry logic, though unfortunately it differs in each. Node.js restarts if it crashes because it runs under Forever. The orchestration script doesn't seem to crash, but it supports resumption at any point and watches the host machine to see if it should slow down.
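As a sketch of what that throttling might look like, the loop below is a simplified WSH (JScript) version that counts running Chrome processes through WMI and waits before opening more tabs. The process-count threshold, sleep interval, and file list are made up for the example; the actual orchestration logic is in run.js.

```javascript
// Simplified WSH (JScript) throttling loop: only open another Chrome tab
// when the number of running Chrome processes drops below a limit.
var shell = new ActiveXObject("WScript.Shell");
var wmi = GetObject("winmgmts:\\\\.\\root\\cimv2");
var files = ["C:\\data\\auction1.html", "C:\\data\\auction2.html"]; // hypothetical

function chromeCount() {
  return wmi.ExecQuery(
    "SELECT * FROM Win32_Process WHERE Name = 'chrome.exe'").Count;
}

for (var i = 0; i < files.length; i++) {
  // Back off while the machine is still working through earlier tabs,
  // giving Node.js time to catch up.
  while (chromeCount() > 20) {
    WScript.Sleep(5000);
  }
  shell.Run("chrome \"file:///" + files[i] + "\"");
}
```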
The third script, the Chrome extension, watches for failures in its RPC calls and retries with exponential backoff.

Using the browser as a front-end gives you a free debugger and scripting interface, as well as a tool for generating XPath expressions.

## Possibilities

The current script runs five to ten thousand entries before requiring attention. I intend to experiment with [PhantomJS](http://phantomjs.org/) to improve performance, enable sharding, and support in-memory connections.

[See the source on GitHub here.](https://github.com/garysieling/chrome-scraper)

Thanks to [Ariele](http://www.arielesieling.com/?page_id=66) and [Melissa](http://melissasieling.com) for editing.