Data Mining - Gary Sieling

Using the WordPress REST API plugin, you can easily get a JSON payload containing data from your blog.

If your blog uses SSL, you will likely need Python 3, as it includes many bug fixes.

First, load the page text:

url = 'https://www.garysieling.com/blog/wp-json/wp/v2/posts?per_page=10&page=18'

import urllib3
http = urllib3.PoolManager(10)
response = http.request('GET', url)
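The `per_page` and `page` query parameters in the URL above select which slice of posts comes back. A minimal sketch of building such URLs programmatically, assuming the same endpoint; the helper name `posts_url` is hypothetical and not part of any library:

```python
from urllib.parse import urlencode

# Endpoint taken from the URL above, without the query string.
BASE = 'https://www.garysieling.com/blog/wp-json/wp/v2/posts'

def posts_url(page, per_page=10, base=BASE):
    """Build the posts URL for one page of results (hypothetical helper)."""
    query = urlencode({'per_page': per_page, 'page': page})
    return base + '?' + query

url = posts_url(18)
```

The resulting `url` can be passed to `http.request('GET', url)` exactly as before.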

Then parse it as JSON:

import json

jsonData = response.data.decode('utf-8')
posts = json.loads(jsonData)

From there, the blog post text is readily available:

post = posts[1]["content"]["rendered"]
title = posts[1]["title"]["rendered"]
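The same two fields can be pulled from every post in the payload rather than just one. The sketch below assumes each entry has the `title` and `content` objects used above; `extract_posts` is a hypothetical helper, and the `html.unescape` step is included on the assumption that rendered titles may carry HTML entities:

```python
import html

def extract_posts(posts):
    """Return (title, content_html) pairs for each post in the payload."""
    pairs = []
    for p in posts:
        # Rendered titles may contain entities such as &amp;; decode them.
        title = html.unescape(p["title"]["rendered"])
        pairs.append((title, p["content"]["rendered"]))
    return pairs

# Tiny made-up payload in the same shape as the API response.
sample = [{"title": {"rendered": "Tips &amp; Tricks"},
           "content": {"rendered": "<p>Hello</p>"}}]
print(extract_posts(sample))
```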

Then you can strip out all the HTML tags (see this Stack Overflow post for the source of this solution):

from html.parser import HTMLParser

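class MLStripper(HTMLParser):
    """Collects the text content of an HTML document, discarding tags."""
    def __init__(self):
        # Initialize the base parser; convert_charrefs decodes entities like &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered in the parser
    return s.get_data()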

clean = strip_tags(post)
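A quick way to sanity-check this kind of stripper is to run it on a small fragment. The sketch below restates the approach in self-contained form; `TextOnly` and `to_text` are illustrative names, and the whitespace collapsing is an extra step that is not part of the original solution:

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def to_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r'\s+', ' ', ''.join(parser.parts)).strip()

print(to_text('<p>Hello <b>world</b> &amp; friends</p>\n<p>Bye</p>'))
```

Collapsing whitespace this way keeps paragraph breaks from turning into stray newlines in the text handed to the summarizer.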

Unfortunately, the summarization library (PyTeaser) does not support Python 3. There is a patch for this, and you can install it directly from GitHub, like so:

pip install https://github.com/voneiden/PyTeaser/archive/py3.zip

Once you do this, you can get a summary for the given post (Lessons Learned from 0 to 40,000 Readers).

from pyteaser import Summarize

" ".join(Summarize(title, clean))

This results in the following text, which gives a decent summary of the article:

'Since then, a bit over 40,000 people have read articles I’ve written,
not a huge number in the grand scheme of things, but enough to draw a
few lessons. Posts I’ve made received more votes, even though they are
self posts, because they are at least relevant. In practice, I’ve
written on wider subjects – anything within “full stack web
development” is fair game, trying to focus on new, or popular tech –
Scala, DevOps (Vagrant/Chef/Virtualization), Hadoop, R, and scraping.
It’s the only thing I’ve written that seems to have received
significant attention on Google+ (19 events on G+, 24 on Twitter).
I’ve written several articles which have been posted to Twitter
by 20+ people.'
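
To summarize every post in the payload, the strip and summarize steps can be wrapped in a loop. This is a sketch only: `summarize_all` is a hypothetical helper, the summarizer and tag stripper are passed in as arguments so the example runs without PyTeaser installed, and it assumes the summarizer returns a list of sentences, as `Summarize` does above:

```python
def summarize_all(posts, summarize, strip):
    """Map each post title to a one-string summary."""
    summaries = {}
    for p in posts:
        title = p["title"]["rendered"]
        text = strip(p["content"]["rendered"])
        # summarize(title, text) is assumed to return a list of sentences.
        summaries[title] = " ".join(summarize(title, text))
    return summaries

# Stand-ins for demonstration; in real use, pass pyteaser's Summarize
# and the strip_tags function defined earlier.
def fake_summarize(title, text):
    return text.split(". ")[:1]

def fake_strip(html_text):
    return html_text.replace("<p>", "").replace("</p>", "")

demo = [{"title": {"rendered": "Post"},
         "content": {"rendered": "<p>First point. Second point.</p>"}}]
print(summarize_all(demo, fake_summarize, fake_strip))
```

In real use the call would look like `summarize_all(posts, Summarize, strip_tags)`.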


Generate summaries for your WordPress blog posts using Python