{"id":415,"date":"2012-08-14T12:46:42","date_gmt":"2012-08-14T12:46:42","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=415"},"modified":"2020-03-30T02:44:47","modified_gmt":"2020-03-30T02:44:47","slug":"scraping-a-list-of-adsense-sites-within-a-niche","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/scraping-a-list-of-adsense-sites-within-a-niche\/","title":{"rendered":"Scraping a List of Adsense Sites Within a Niche"},"content":{"rendered":"<p>One of the challenges in web crawling and scraping is determining which URLs to scrape. It\u2019s easy for a site to have many URLs that aren\u2019t visited by humans, like a <a href=\"http:\/\/www.stickstock.com\">stock photo site<\/a> that uses an API to supplement its data. Sites with sessionid parameters or dynamic content may generate many duplicate or similar pages.<\/p>\n<p>In a previous post I described a <a href=\"http:\/\/garysieling.com\/blog\/scraping-adsense-ads-with-phantomjs\">PhantomJS AdSense scraper<\/a>, which demonstrates an instance where the tool is very helpful. One might scrape ads to find out who is running campaigns, what is selling, how products are pitched, and, if you are a publisher, who you might sell advertising to. There are products to do this, like <a href=\"http:\/\/mixrank.com\/\">MixRank<\/a>.<\/p>\n<p>There are a couple of ways you can do this on your own. There is a not-for-profit called Common Crawl, which has a 70 TB index on AWS that you can query with Hadoop MapReduce jobs. It has the entire text of many pages, which would allow searching the original source of the page. I started down this road &#8211; this would work as a generalized solution if I were building a product, but I found an easier way.<\/p>\n<p>There are a surprising number of search engine APIs &#8211; e.g. Yahoo, DuckDuckGo, Blekko, and Yandex. Blekko is very SEO-focused and exposes a lot of useful fields, such as whether a site is an AdSense publisher. 
Much of this understandably requires either an API key or login, but you can easily add parameters to turn the output into JSON and increase the paging size, like so:<\/p>\n<pre>http:\/\/blekko.com\/ws\/?q=guitar+tabs+\/adsense=+\/ps=100&amp;json=1&amp;\n<\/pre>\n<p>This gives you nicely formatted entries, like so:<\/p>\n<pre>  {\n         \"c\" : 1,\n         \"display_url\" : \"ultimate-guitar.com\",\n         \"n_group\" : 1,\n         \"rss\" : \"http:\/\/www.ultimate-guitar.com\/modules\/rss\/all_updates.xml.php\",\n         \"rss_title\" : \"Ultimate-Guitar.Com Updates\",\n         \"short_host\" : \"ultimate-guitar.com\",\n         \"short_host_url\" : \"http:\/\/www.ultimate-guitar.com\/\",\n         \"snippet\" : \"Search archives or submit <strong>tab<\/strong>.  Your #1 source for <strong><strong>guitar<\/strong> <strong>tabs<\/strong><\/strong>, bass <strong>tabs<\/strong>, chords and <strong><strong>guitar<\/strong> pro <strong>tabs<\/strong><\/strong>.  <strong>Guitar<\/strong> and bass <strong>tabs<\/strong> archive with daily updates.  In order to use the widgets you need to.  
You can add up to three widgets to the home page's widget panel.\",\n         \"toplevel\" : \"1\",\n         \"url\" : \"http:\/\/www.ultimate-guitar.com\/\",\n         \"url_title\" : \"ULTIMATE <strong><strong>GUITAR<\/strong> <strong>TABS<\/strong><\/strong> ARCHIVE - 300,000+ <strong><strong>Guitar<\/strong> <strong>Tabs<\/strong><\/strong>, Bass <strong>Tabs<\/strong>, Chords and <strong><strong>Guitar<\/strong> Pro <strong>Tabs<\/strong><\/strong>\"\n      },\n      {\n         \"c\" : 2,\n         \"display_url\" : \"chordie.com\",\n         \"main_slashtag_boosted\" : \"\/blekko\/tabs\",\n         \"n_group\" : 2,\n         \"rss\" : \"http:\/\/www.chordie.com\/rss\/mostpopular.rss\",\n         \"rss_title\" : \"Most popular guitar songs\",\n         \"short_host\" : \"chordie.com\",\n         \"short_host_url\" : \"http:\/\/www.chordie.com\/\",\n         \"snippet\" : \"<strong><strong>Guitar<\/strong> chords<\/strong> and <strong>guitar<\/strong> tablature made easy.  Chordie is a search engine for finding <strong><strong>guitar<\/strong> chords<\/strong> and <strong><strong>guitar<\/strong> <strong>tabs<\/strong><\/strong>.  Search the Internet for <strong><strong>guitar<\/strong> chords<\/strong> and <strong>tabs<\/strong>\/tablatures.  <strong><strong>Guitar<\/strong> chords<\/strong> and <strong><strong>guitar<\/strong> <strong>tabs<\/strong><\/strong>.  
This morning a lot of people were getting a message about being banned for life.\",\n         \"toplevel\" : \"1\",\n         \"url\" : \"http:\/\/www.chordie.com\/\",\n         \"url_title\" : \"<strong><strong>Guitar<\/strong> <strong>Tabs<\/strong><\/strong>, <strong><strong>Guitar<\/strong> Chords<\/strong> and Lyrics - Chordie\"\n      },\n      {\n         \"c\" : 3,\n         \"display_url\" : \"guitartabs.net\",\n         \"n_group\" : 3,\n         \"short_host\" : \"guitartabs.net\",\n         \"short_host_url\" : \"http:\/\/www.guitartabs.net\/\",\n         \"snippet\" : \"ActiveBass.com Premier site with theory + bass <strong>tab<\/strong> search.  GuitarWar.com Ultimate <strong>guitar<\/strong> <strong>tab<\/strong> competition.  <strong>Tab<\/strong> Robot Unique <strong><strong>guitar<\/strong> <strong>tabs<\/strong><\/strong> engine.  GuitarTricks <strong>Guitar<\/strong> <strong>tab<\/strong>,chords,and video lessons.  Olga Search- search the OLGA <strong>tab<\/strong> archive by putting in the artist or song name in the search field at the top of the page.\",\n         \"toplevel\" : \"1\",\n         \"url\" : \"http:\/\/www.guitartabs.net\/\",\n         \"url_title\" : \"<strong><strong>Guitar<\/strong> <strong>Tabs<\/strong><\/strong> Dot Net - Your #1 source for <strong><strong>guitar<\/strong> <strong>tabs<\/strong><\/strong>\"\n      },\n<\/pre>\n<p>This saves hours over using Elastic MapReduce, much like purchasing a product would likely save me hours over doing it this way \ud83d\ude09<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the challenges in web crawling and scraping is determining which URLs to scrape. It\u2019s easy for a site to have many URLs that aren\u2019t visited by humans, like a stock photo site that uses an API to supplement its data. 
Sites with sessionid parameters or dynamic content may make many duplicate or similar &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/scraping-a-list-of-adsense-sites-within-a-niche\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Scraping a List of Adsense Sites Within a Niche&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5,6],"tags":[39,71,89,127,147,187,267,476,495],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/415"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=415"}],"version-history":[{"count":1,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/415\/revisions"}],"predecessor-version":[{"id":6480,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/415\/revisions\/6480"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=415"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=415"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=415"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}