{"id":316,"date":"2012-07-20T12:26:19","date_gmt":"2012-07-20T12:26:19","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=316"},"modified":"2012-07-20T12:26:19","modified_gmt":"2012-07-20T12:26:19","slug":"scraping-adsense-ads-with-phantomjs","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/scraping-adsense-ads-with-phantomjs\/","title":{"rendered":"Scraping Adsense Ads with PhantomJS"},"content":{"rendered":"<p><a href=\"http:\/\/www.phantomjs.org\" target=\"_new\" rel=\"noopener noreferrer\">PhantomJS<\/a> is a headless WebKit, which lets you run Javascript in a browser from the command line. It adds additional API calls which facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve Adsense destination URLs and text with PhantomJS. <\/p>\n<p>Extracting advertisement blocks requires only fairly simple CSS selectors. Google can&#8217;t change the markup too often, since each publisher must paste a code snippet into their site. Some ad networks render advertisements inside an iframe, so a script running in the page may hit the browser&#8217;s cross-origin security restrictions. 
Extracting ad data from a page of Home Depot&#8217;s website gives us the following results:<\/p>\n<pre>\nDrywall Materials Sale, http:\/\/www.compare99.com\/compare.html%3Fq%3Ddrywall-products%26ort%3DDrywall-Materials-Sale%26adid%3DiaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W\nSheetrock, http:\/\/shopping.yahoo.com\/search%3B_ylc%3DX3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-%3Fp%3Dsheetrock%26sem%3DGoogle\nSheetrock Material Sale, http:\/\/www.buycheapr.com\/us\/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial\nInstallation Framing Door, http:\/\/www.moifriefacility.com\nArchitectural GFRG, http:\/\/www.sbgrace.com\nWallBuilders Library, http:\/\/www.logos.com\/products\/details\/2982%3Fgoogleads\n<\/pre>\n<p>I&#8217;ve written a short demo, which retrieves ad text and a screenshot for testing. It is invoked as follows (source is below, and <a href=\"https:\/\/github.com\/garysieling\/adsense-scraper\">on Github<\/a>):<\/p>\n<pre>phantomjs adsense.js \"http:\/\/www.homedepot.com\/Building-Materials-Drywall\/FibaTape\/h_d1\/N-5yc1vZar3dZ38m\/h_d2\/Navigation?catalogId=10053&amp;Nu=P_PARENT_ID&amp;langId=-1&amp;storeId=10051\"<\/pre>\n<p>The code is almost too easy: tell PhantomJS to load a page, run Javascript in the page context, and parse the Adsense URL format. Tracking scope takes some care, though, since some code runs in the PhantomJS context and some in the page context. PhantomJS does not exit when the script reaches its end, because many browser actions are asynchronous. 
This means every branch, including error paths, must end with an explicit phantom.exit() call, and the script has to track when its asynchronous work is done.<\/p>\n<pre lang=\"Javascript\">var page = require('webpage').create(),\n    t, address, output;\n\nif (phantom.args.length === 0) {\n    console.log('Usage: phantomjs adsense.js URL [screenshot-file]');\n    phantom.exit();\n} else {\n    t = Date.now();\n    address = phantom.args[0];\n    output = phantom.args[1]; \/\/ optional path for the screenshot\n    page.viewportSize = { width: 600, height: 600 };\n\n    \/\/ Forward console.log calls made inside the page to the PhantomJS console\n    page.onConsoleMessage = function (msg) {\n        console.log('Console log: ' + msg);\n    };\n\n    page.open(address, function (status) {\n        if (status !== 'success') {\n            console.log('Failed to load the address');\n            phantom.exit();\n        } else {\n\n            \/\/ Everything inside evaluate() runs in the page context\n            page.evaluate(function () {\n                \/\/ Split the query string of a URL into a map of decoded parameters\n                var parse = function (query) {\n                    var vars = (query.split(\"?\")[1] || \"\").split(\"&amp;\");\n                    var res = {};\n                    for (var i = 0; i &lt; vars.length; i++) {\n                        var pair = vars[i].split(\"=\");\n                        if (pair.length &lt; 2) { continue; } \/\/ skip empty or valueless entries\n                        res[pair[0]] = decodeURIComponent(pair[1]);\n                    }\n                    return res;\n                };\n\n                console.log(document.title);\n                var ads = document.querySelectorAll('#googleAdSenseLeft ul li a');\n                for (var i = 0; i &lt; ads.length; i++) {\n                    var adQuery = ads[i].href;\n                    var adContents = parse(adQuery);\n                    adContents.url = adQuery;\n                    adContents.text = ads[i].innerText;\n                    console.log(JSON.stringify(adContents));\n                }\n            });\n\n            t = Date.now() - t;\n            console.log('Loading time ' + t + ' msec');\n\n            \/\/ Give the page a moment to finish painting, then render and exit\n            window.setTimeout(function () {\n                if (output) {\n                    page.render(output);\n                }\n                phantom.exit();\n            }, 200);\n\n        }\n    });\n}\n<\/pre>\n<p>The actual script output is JSON, and a little tedious to 
read:<\/p>\n<pre>Console log: Building Materials - Drywall - FibaTape\u00a0at The Home Depot\nConsole log: {\"sa\":\"l\",\"ai\":\"CtXg4mpkIUPygG6Ol0AGRloCYCcupmcoEi9O58FOzp_mMrgEQByDHxt8eKAxQ_aetrgRgyb6miYyk1A-gAe2HlNYDyAEBqgQbT9CqfdRE5uzDpvzqUgRUNqtZ3ouY_UBn7VGD\",\"num\":\"7\",\"sig\":\"AOD64_0jc0_3b0Au9uLLeud6cAI77O6zrQ\",\"adurl\":\"http:\/\/www.compare99.com\/compare.html?q=drywall-products&amp;ort=Drywall-Materials-Sale&amp;adid=iaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W\",\"url\":\"http:\/\/www.google.com\/aclk?sa=l&amp;ai=CtXg4mpkIUPygG6Ol0AGRloCYCcupmcoEi9O58FOzp_mMrgEQByDHxt8eKAxQ_aetrgRgyb6miYyk1A-gAe2HlNYDyAEBqgQbT9CqfdRE5uzDpvzqUgRUNqtZ3ouY_UBn7VGD&amp;num=7&amp;sig=AOD64_0jc0_3b0Au9uLLeud6cAI77O6zrQ&amp;adurl=http:\/\/www.compare99.com\/compare.html%3Fq%3Ddrywall-products%26ort%3DDrywall-Materials-Sale%26adid%3DiaCkp56m1aqplM3OkH6Tp8bUzJKepofRzm52pdrZxJ2eYK7D15aknMLO1lelcMjD2KRYlsnD1W6W\",\"text\":\"Drywall Materials Sale\"}\nConsole log: {\"sa\":\"L\",\"ai\":\"CHaiampkIUPygG6Ol0AGRloCYCZu8jlqzr4eAA9G9rwcQCCDHxt8eKAxQ7dCHowNgyb6miYyk1A_IAQGqBB5P0Loa2ETp7MOm_KJSgnGFYNjxasFE9TqY0t8TLpE\",\"num\":\"8\",\"ggladgrp\":\"2492582816717521168\",\"gglcreat\":\"9712000621987456871\",\"sig\":\"AOD64_1WwDM7Zp2jGv1pdrozELP2CSkZUA\",\"adurl\":\"http:\/\/shopping.yahoo.com\/search;_ylc=X3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0NTgwMDE-?p=sheetrock&amp;sem=Google\",\"url\":\"http:\/\/www.google.com\/aclk?sa=L&amp;ai=CHaiampkIUPygG6Ol0AGRloCYCZu8jlqzr4eAA9G9rwcQCCDHxt8eKAxQ7dCHowNgyb6miYyk1A_IAQGqBB5P0Loa2ETp7MOm_KJSgnGFYNjxasFE9TqY0t8TLpE&amp;num=8&amp;ggladgrp=2492582816717521168&amp;gglcreat=9712000621987456871&amp;sig=AOD64_1WwDM7Zp2jGv1pdrozELP2CSkZUA&amp;adurl=http:\/\/shopping.yahoo.com\/search%3B_ylc%3DX3oDMTJ1dGkyY2Y5BF9TAzk2MDc5MjYwBGsDc2hlZXRyb2NrBHNlbV9hY3QDMjYyOTkxMDA5MARzZW1fYWRnAzE5NjgwNTY2MwRzZW1fY21wAzM3NDI5MTMEc2VtX2t3aWQDMTU0
NTgwMDE-%3Fp%3Dsheetrock%26sem%3DGoogle\",\"text\":\"Sheetrock\"}\nConsole log: {\"sa\":\"L\",\"ai\":\"CELX4mpkIUPygG6Ol0AGRloCYCc-MjpECz_OgsCKf8OKPCRAJIMfG3x4oDFDFt4T4-f____8BYMm-pomMpNQPyAEBqgQeT9CaZ9JE6OzDpvyiUrJ702HY8WrBRPU6mNLfEy6R\",\"num\":\"9\",\"sig\":\"AOD64_39oakLNPF7SIjdARg9y73otRYZhQ\",\"adurl\":\"http:\/\/www.buycheapr.com\/us\/result.jsp?ga=us19&amp;q=sheetrock+material\",\"url\":\"http:\/\/www.google.com\/aclk?sa=L&amp;ai=CELX4mpkIUPygG6Ol0AGRloCYCc-MjpECz_OgsCKf8OKPCRAJIMfG3x4oDFDFt4T4-f____8BYMm-pomMpNQPyAEBqgQeT9CaZ9JE6OzDpvyiUrJ702HY8WrBRPU6mNLfEy6R&amp;num=9&amp;sig=AOD64_39oakLNPF7SIjdARg9y73otRYZhQ&amp;adurl=http:\/\/www.buycheapr.com\/us\/result.jsp%3Fga%3Dus19%26q%3Dsheetrock%2Bmaterial\",\"text\":\"Sheetrock Material Sale\"}\nConsole log: {\"sa\":\"L\",\"ai\":\"C-S13mpkIUPygG6Ol0AGRloCYCaS2oM0D5P7ugla6r8cGEAogx8bfHigMULiOjo_9_____wFgyb6miYyk1A_IAQGqBBhP0Np5zUTr7MOm_LNT0dUZSgBnPsWIwzM\",\"num\":\"10\",\"sig\":\"AOD64_0rVOUBr5lndIy_sed-v9kmQBeqjw\",\"adurl\":\"http:\/\/www.moifriefacility.com\",\"url\":\"http:\/\/www.google.com\/aclk?sa=L&amp;ai=C-S13mpkIUPygG6Ol0AGRloCYCaS2oM0D5P7ugla6r8cGEAogx8bfHigMULiOjo_9_____wFgyb6miYyk1A_IAQGqBBhP0Np5zUTr7MOm_LNT0dUZSgBnPsWIwzM&amp;num=10&amp;sig=AOD64_0rVOUBr5lndIy_sed-v9kmQBeqjw&amp;adurl=http:\/\/www.moifriefacility.com\",\"text\":\"Installation Framing Door\"}\nConsole log: {\"sa\":\"L\",\"ai\":\"CnycAmpkIUPygG6Ol0AGRloCYCem7q4oEqYSS7FKunu8KEAsgx8bfHigMUJbYl_L8_____wFgyb6miYyk1A_IAQGqBB5P0IonikTq7MOm_KJShBu3Z9jxasFE9TqY0t8TLpE\",\"num\":\"11\",\"sig\":\"AOD64_1MZGFM0lJ7DsgtuZZ-rv2CP6vcxA\",\"adurl\":\"http:\/\/www.sbgrace.com\",\"url\":\"http:\/\/www.google.com\/aclk?sa=L&amp;ai=CnycAmpkIUPygG6Ol0AGRloCYCem7q4oEqYSS7FKunu8KEAsgx8bfHigMUJbYl_L8_____wFgyb6miYyk1A_IAQGqBB5P0IonikTq7MOm_KJShBu3Z9jxasFE9TqY0t8TLpE&amp;num=11&amp;sig=AOD64_1MZGFM0lJ7DsgtuZZ-rv2CP6vcxA&amp;adurl=http:\/\/www.sbgrace.com\",\"text\":\"Architectural GFRG\"}\nConsole log: 
{\"sa\":\"l\",\"ai\":\"C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT\",\"num\":\"12\",\"sig\":\"AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ\",\"adurl\":\"http:\/\/www.logos.com\/products\/details\/2982?googleads\",\"url\":\"http:\/\/www.google.com\/aclk?sa=l&amp;ai=C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT&amp;num=12&amp;sig=AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ&amp;adurl=http:\/\/www.logos.com\/products\/details\/2982%3Fgoogleads\",\"text\":\"WallBuilders Library\"}\nConsole log: {\"sa\":\"l\",\"ai\":\"C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT\",\"num\":\"12\",\"sig\":\"AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ\",\"adurl\":\"http:\/\/www.logos.com\/products\/details\/2982?googleads\",\"url\":\"http:\/\/www.google.com\/aclk?sa=l&amp;ai=C86EtmpkIUPygG6Ol0AGRloCYCcComAjImM7lA5iY2DAQDCDHxt8eKAxQzZGFtAFgyb6miYyk1A-gAZCCsf8DyAEBqgQbT9D6P8RE7ezDpvzqUgRUNqtZ3ouY_UBn7VGT&amp;num=12&amp;sig=AOD64_0n--_8h8e-W75X5eNYIOhLyJ7ezQ&amp;adurl=http:\/\/www.logos.com\/products\/details\/2982%3Fgoogleads\",\"text\":\"www.logos.com\/\"}<\/pre>\n<p>Looking at the output, some design decisions made by Google's engineers become apparent. Google must track every click in order to charge advertisers and pay publishers, so each ad link points at a Google redirect URL (google.com\/aclk) rather than directly at the advertiser. The redirect's latency must be low, or else the viewer will give up waiting for the site to load.<\/p>\n<p>Each link carries the destination in its adurl parameter, so presumably no database read is required to redirect to the advertised site. The URL also contains what look like hashed or signed values (the ai and sig parameters), which presumably prevent a malicious user from modifying the URL. I suspect that these URLs also expire, with the date included in a hashed value. Clicks are likely written to a sharded database (e.g. BigTable; see also Redis, Cassandra, etc.) and reconciled later. 
<\/p>\n<p><i>Many thanks to Ariele for <a href=\"http:\/\/www.arielesieling.com\/?page_id=66\">editing<\/a><\/i><\/p>\n","protected":false},"excerpt":{"rendered":"<p>PhantomJS is a headless WebKit, which lets you run Javascript in a browser from the command line. It adds additional API calls which facilitate automated testing, screenshots, and scraping. I thought it would be interesting to write a script to retrieve Adsense destination URLs and text with PhantomJS. Extracting advertisement blocks requires fairly simple CSS &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/scraping-adsense-ads-with-phantomjs\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Scraping Adsense Ads with PhantomJS&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5,6,7],"tags":[39,86,100,110,302,389,392,421,426,495,554,591],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/316"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=316"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/316\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=316"}],"wp:term":[{"taxonomy":"categ
ory","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=316"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}