{"id":495,"count":29,"description":"Web scraping, also called harvesting or data extraction, is the process of gathering data from websites. Various tools exist to download, clean, and process data, from built-in UNIX tools to custom software and programming language APIs. Processing is often required to extract either structured or unstructured data. Sites that generate pages but do not provide an API are often easier to parse, as the generated pages are more consistent. Unstructured data is common in content extracted from PDFs, and some cases even require manual intervention, where you are forced to use a system like Amazon Mechanical Turk to have humans batch-process the data.\n\nBelow you will find some of the most popular pieces I've written, often proof-of-concept tools to test new technologies and explore database systems. Much data is readily available given some patience: legal documents such as court cases, financial filings, weather monitoring data, advertisement text, search results, auction data, and so on.\n","link":"https:\/\/www.garysieling.com\/blog\/examples\/scraping\/","name":"scraping","slug":"scraping","taxonomy":"post_tag","meta":[],"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags\/495"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/taxonomies\/post_tag"}],"wp:post_type":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts?tags=495"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}