Script to get values from wikiart pages

The following script pulls the labelled data values out of saved wikiart pages (wikiart is an excellent index of paintings):

import glob

from bs4 import BeautifulSoup

indir = 'D:\\projects\\art\\'
for filename in glob.glob(indir + "*.html"):
    print(filename)

    # Read as bytes so BeautifulSoup can detect the page encoding itself
    with open(filename, 'rb') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # wikiart labels its data values with itemprop attributes
    spans = soup.select("span[itemprop]")
    arefs = soup.select("a[itemprop]")

    # Collect (label, text) pairs from both kinds of tag and show them
    print([(v["itemprop"], v.text) for v in spans + arefs])
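
To make it clearer what the two itemprop selectors pick up, here is a minimal sketch of the same idea using only Python 3's standard library html.parser instead of BeautifulSoup. This is a swapped-in technique, not the script above: the ItempropParser class and the sample HTML string are made up for illustration, and real wikiart markup will be messier.

```python
from html.parser import HTMLParser


class ItempropParser(HTMLParser):
    """Hypothetical helper: collect (itemprop, text) pairs from <span> and <a> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (itemprop, text) tuples
        self._stack = []   # open span/a tags as (tag, itemprop or None, text parts)

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "a"):
            self._stack.append((tag, dict(attrs).get("itemprop"), []))

    def handle_data(self, data):
        # Text counts towards every labelled tag that is currently open
        for _, itemprop, parts in self._stack:
            if itemprop is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("span", "a") and self._stack and self._stack[-1][0] == tag:
            _, itemprop, parts = self._stack.pop()
            if itemprop is not None:
                self.pairs.append((itemprop, "".join(parts)))


# Made-up sample markup, only loosely modelled on an artist page
sample = (
    '<span itemprop="birthDate">c.1421</span>'
    '<a itemprop="nation" href="#">Italian</a>'
    '<span class="plain">ignored</span>'
)

parser = ItempropParser()
parser.feed(sample)
parser.close()
pairs = parser.pairs
print(pairs)  # [('birthDate', 'c.1421'), ('nation', 'Italian')]
```

Unlike the BeautifulSoup script, which lists all spans before all links, this sketch returns the pairs in document order.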

Since most of the data values are labelled with an itemprop attribute, you get a list of (label, value) pairs for each page, like so:

[(u'birthDate', u'31 March 1902'), (u'dearthDate', u'24 August 1976'), (u'nation', u'French'), (u'nation', u'Russian'), (u'art movement', u'Art Informel'), (u'art movement', u'Tachisme'), (u'painting school', u'\xc9cole de Paris'), (u'genre', u'abstract')]
D:\projects\art\wikiart\www.wikiart.org\en\andre-pierre-arnal.html
[(u'birthDate', u'16 December 1939'), (u'nation', u'French'), (u'art movement', u'Art Informel'), (u'art movement', u'Contemporary'), (u'painting school', u'Supports/Surfaces'), (u'genre', u'abstract')]
D:\projects\art\wikiart\www.wikiart.org\en\andrea-del-castagno.html
[(u'birthDate', u'c.1421'), (u'dearthDate', u'19 August 1457'), (u'nation', u'Italian'), (u'art movement', u'Early Renaissance'), (u'painting school', u'Florentine School')]
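
Because a label such as nation or art movement can appear more than once for the same artist, it may be handy to fold the pairs into a dictionary of lists. The following is only a sketch: group_pairs is a hypothetical helper, not part of the script, and the pairs are copied from the first sample result above (written here as plain Python 3 strings).

```python
from collections import defaultdict


def group_pairs(pairs):
    """Turn [(label, value), ...] into {label: [value, ...]}, keeping first-seen order."""
    grouped = defaultdict(list)
    for label, value in pairs:
        grouped[label].append(value)
    return dict(grouped)


# Pairs taken from the sample output (a subset, for brevity)
pairs = [
    ("birthDate", "31 March 1902"),
    ("nation", "French"),
    ("nation", "Russian"),
    ("art movement", "Art Informel"),
    ("art movement", "Tachisme"),
    ("genre", "abstract"),
]

artist = group_pairs(pairs)
print(artist["nation"])  # ['French', 'Russian']
```

Keeping every value in a list, even for labels that normally occur once, means later code never has to check whether a field is a string or a list.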
