{"id":2763,"date":"2015-12-16T04:38:16","date_gmt":"2015-12-16T04:38:16","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2763"},"modified":"2020-03-31T00:46:31","modified_gmt":"2020-03-31T00:46:31","slug":"generate-summaries-for-your-wordpress-blog-posts-using-python","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/generate-summaries-for-your-wordpress-blog-posts-using-python\/","title":{"rendered":"Generate summaries for your WordPress blog posts using Python"},"content":{"rendered":"<p>Using the <a href=\"https:\/\/wordpress.org\/plugins\/json-rest-api\/\">WordPress REST API plugin<\/a>, you can easily get a JSON payload containing data from your blog.<\/p>\n<p>If your blog uses SSL, you will likely need Python 3, which includes many bug fixes.<\/p>\n<p>First, load the page text:<\/p>\n<pre lang=\"python\">\nimport urllib3\n\n# Request page 18 of the post listing, 10 posts per page\nurl = 'https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts?per_page=10&page=18'\n\nhttp = urllib3.PoolManager(10)\nresponse = http.request('GET', url)\n<\/pre>\n<p>Then parse it as JSON:<\/p>\n<pre lang=\"python\">\nimport json\n\n# The response body is bytes, so decode it before parsing\njsonData = response.data.decode('utf-8')\nposts = json.loads(jsonData)\n<\/pre>\n<p>From there, the blog post text is readily available:<\/p>\n<pre lang=\"python\">\n# Take the second post on the page (index 1)\npost = posts[1][\"content\"][\"rendered\"]\ntitle = posts[1][\"title\"][\"rendered\"]\n<\/pre>\n<p>Then you can strip out all the HTML tags (see <a href=\"http:\/\/stackoverflow.com\/questions\/753052\/strip-html-from-strings-in-python\">this Stack Overflow post<\/a> for the source of this solution):<\/p>\n<pre lang=\"python\">\nfrom html.parser import HTMLParser\n\nclass MLStripper(HTMLParser):\n    def __init__(self):\n        # Initialize the base parser; convert_charrefs defaults to True\n        super().__init__()\n        self.fed = []\n    def handle_data(self, d):\n        # Collect only the text between tags\n        self.fed.append(d)\n    def get_data(self):\n        return ''.join(self.fed)\n\ndef strip_tags(html):\n    s = MLStripper()\n    s.feed(html)\n    return s.get_data()\n\nclean = strip_tags(post)\n<\/pre>\n<p>Unfortunately, the summarization library, PyTeaser, does not support Python 3. There is a patch for this, and you can install the patched version directly from GitHub, like so:<\/p>\n<pre>\npip install https:\/\/github.com\/voneiden\/PyTeaser\/archive\/py3.zip\n<\/pre>\n<p>Once that is installed, you can generate a summary of the given post (<a href=\"https:\/\/www.garysieling.com\/blog\/lessons-learned-from-0-to-40000-blog-readers\">Lessons Learned from 0 to 40,000 Readers<\/a>):<\/p>\n<pre lang=\"python\">\nfrom pyteaser import Summarize\n\n# Join the sentences returned by Summarize into one string\n\" \".join(Summarize(title, clean))\n<\/pre>\n<p>This results in the following text, which gives a decent summary of the article:<\/p>\n<pre>\n'Since then, a bit over 40,000 people have read articles I\u2019ve written,\n not a huge number in the grand scheme of things, but enough to draw a\n few lessons. Posts I\u2019ve made received more votes, even though they are\nself posts, because they are at least relevant. In practice, I\u2019ve \nwritten on wider subjects \u2013 anything within \u201cfull stack web \ndevelopment\u201d is fair game, trying to focus on new, or popular tech \u2013 \nScala, DevOps (Vagrant\/Chef\/Virtualization), Hadoop, R, and scraping. \nIt\u2019s the only thing I\u2019ve written that seems to have received \nsignificant attention on Google+ (19 events on G+, 24 on Twitter).\nI\u2019ve written several articles which have been posted to Twitter\n by 20+ people.'\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Using the WordPress REST API plugin, you can easily get a JSON payload containing data from your blog. If your blog uses SSL, you will likely need Python 3, which includes many bug fixes. 
First, load the page text: url = &#8216;https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts?per_page=10&#038;page=18&#8217; import urllib3 http = urllib3.PoolManager(10) response = http.request(&#8216;GET&#8217;, url) Then parse &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/generate-summaries-for-your-wordpress-blog-posts-using-python\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Generate summaries for your WordPress blog posts using Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4,5,6],"tags":[447],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2763"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=2763"}],"version-history":[{"count":1,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2763\/revisions"}],"predecessor-version":[{"id":6493,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2763\/revisions\/6493"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=2763"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=2763"},{"taxonomy":"post_tag","embeddable":true,"
href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=2763"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}