{"id":4883,"date":"2016-08-22T01:06:29","date_gmt":"2016-08-22T01:06:29","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=4883"},"modified":"2016-08-22T01:06:29","modified_gmt":"2016-08-22T01:06:29","slug":"extracting-text-wikipedia-article","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/extracting-text-wikipedia-article\/","title":{"rendered":"Extracting the text from a Wikipedia article"},"content":{"rendered":"<p>Wikipedia articles contain extensive text &#8211; in some cases so much that many language-processing APIs won&#8217;t accept it. Alchemy API (now seemingly marketed as &#8220;IBM Watson&#8221;) has an endpoint to parse text from a website, but it only accepts pages up to 600KB (50KB of output text). Consequently, it quickly becomes easier to just get the text yourself.<\/p>\n<p>To do this, I recommend Apache Tika, which includes one of the best libraries available for extracting text, and has every imaginable interface &#8211; Java, command line, REST, and a GUI(!).<\/p>\n<p>You only need the Java jar for this:<\/p>\n<pre lang=\"bash\">\ncurl http:\/\/apache.spinellicreations.com\/tika\/tika-app-1.13.jar > tika-app.jar\n<\/pre>\n<p>Tika has a complex set of options for detecting content types<sup><a href=\"#footnote_0_4883\" id=\"identifier_0_4883\" class=\"footnote-link footnote-identifier-link\" title=\"https:\/\/tika.apache.org\/1.1\/detection.html#Content_Detection\">1<\/a><\/sup>, but it also responds to file extensions; in my testing, extraction was more reliable when I saved the page with an explicit .html extension:<\/p>\n<pre lang=\"bash\">\ncurl https:\/\/en.wikipedia.org\/wiki\/Barack_Obama > Barack_Obama.html\n<\/pre>\n<p>Invoking Tika is simple:<\/p>\n<pre lang=\"bash\">\njava -jar tika-app.jar -t data\/$1.html > out\/$1.txt\n<\/pre>\n<p>The problem is that Wikipedia wraps the article text in a ton of extra content. 
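<\/p>\n<p>For reference, the download and extraction steps above can be combined into a small wrapper script &#8211; the script name and the data\/ and out\/ directory layout here are my assumptions, chosen to match the invocation above:<\/p>\n<pre lang=\"bash\">\n#!\/bin\/bash\n# Sketch: fetch a Wikipedia article and extract its text with Tika.\n# Usage: .\/extract.sh Barack_Obama (script name is an assumption)\nset -e\nmkdir -p data out\n# Save with an .html extension so Tika detects the content type\ncurl -s \"https:\/\/en.wikipedia.org\/wiki\/$1\" > \"data\/$1.html\"\njava -jar tika-app.jar -t \"data\/$1.html\" > \"out\/$1.txt\"\n<\/pre>\n<p>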
You could handle this extra content in a few ways &#8211; pre-process the HTML to select just the content you want, customize Tika to parse out specific elements (probably a good option if you only want captions or headings), or hack at the output.<\/p>\n<p>In my case I chose the last option. The following script removes the table of contents, most captions, and the bogus header \/ footer information that shows up at the end of the file. Tune it to your liking (I removed the references as well).<\/p>\n<pre lang=\"python\">\nimport fileinput\nimport re\n\n# Tika renders the article body between these two markers\nstart = re.compile(r\"Jump to:.*navigation,.*search\")\n\nend = re.compile(r\"^Notes and references$\")\n\nstarted = False\nended = False\nblank = False\n\n# Cross-reference lines and numbered table-of-contents entries\nignore = re.compile(\n  r\"^(Main article: .*|\"\n  r\"Main articles: .*|\"\n  r\"See also: .*|\"\n  r\"\\s*[0-9]+\\.[0-9]+ .*|\"\n  r\"\\s*[0-9]+\\.[0-9]+\\.[0-9]+ .*)\\s*$\")\n\n# Inline footnote markers like [1]\nfootnote = re.compile(r\"\\[[0-9]+\\]\")\n\nfor line in fileinput.input():\n\n  if end.match(line.strip()):\n    ended = True\n\n  if started and not ended:\n    # Collapse runs of blank lines into a single blank line\n    if not blank or line.strip() != \"\":\n      if not ignore.match(line):\n        # Keep sentences, long lines, and paragraph breaks; drop short\n        # fragments such as captions and menu entries\n        if \".\" in line or len(line) > 150 or len(line.strip()) == 0:\n          print(footnote.sub(\"\", line.strip()))\n          blank = line.strip() == \"\"\n\n  if not started and start.search(line) is not None:\n    started = True\n<\/pre>\n<ol class=\"footnotes\"><li id=\"footnote_0_4883\" class=\"footnote\">https:\/\/tika.apache.org\/1.1\/detection.html#Content_Detection<span class=\"footnote-back-link-wrapper\"> [<a href=\"#identifier_0_4883\" class=\"footnote-link footnote-back-link\">&#8617;<\/a>]<\/span><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>How to extract text content from Wikipedia articles 
programmatically<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[12],"tags":[300,447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4883"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=4883"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4883\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=4883"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=4883"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=4883"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}