After an interesting meeting today, we’ve each chosen a website to extract data from, to be fed into construct as RDF. The idea of standardising on Python for all of the newly created sensors was brought up, which is good as I’ve already started working on my Python scraper.
Hidden Data
I didn’t mention this in the meeting, but some very useful data, like currency conversion rates, is generally not shown directly on public-facing pages. Getting at it requires submitting a form and then scraping the resulting HTML page. Things I learned during my fourth year project may help here, since one of the sites I tested on was this currency conversion page.
Access to a feed of realtime data costs $540 a year. With my project, for the cost of an HTTP GET request, you could have up-to-the-minute data on any currency available in their system. This was made possible by a very useful Perl module called HTML::Form, which allowed me to simulate form submissions and thus retrieve the HTTP response page. Something similar is bound to exist for Python.
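As a rough idea of what the Python side of this might look like, here is a minimal sketch (Python 3, standard library only) that encodes a form's fields into the query string of a GET request, which is essentially what a browser does when you submit a GET form. The action URL and the field names (from, to, amount) are made up for illustration; a real scraper would read them from the form element on the target page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_form_url(action, fields):
    """Return the URL a browser would request when submitting a GET form."""
    # Append to an existing query string if the action already has one.
    separator = "&" if "?" in action else "?"
    return action + separator + urlencode(fields)


def submit_form(action, fields):
    """Submit the form with a GET request and return the response body as text."""
    with urlopen(build_form_url(action, fields)) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical form: the URL and field names are placeholders, not a real service.
example_url = build_form_url(
    "http://example.com/convert",
    {"from": "EUR", "to": "USD", "amount": "1"},
)
print(example_url)
```

The page returned by submit_form() can then be handed to whatever parsing code the scraper uses. Forms that use POST would need the encoded fields sent as the request body instead.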
Working with Trees
There are two main approaches to screen-scraping: using heavy, regular-expression-laden matching of certain patterns of text in a string, or constructing a treelike representation of a page in memory and then traversing this tree looking for certain elements. My favoured method is the latter, since it is generally more robust to small cosmetic changes to the underlying HTML page. Scraper rewrites are still required when a page is reorganised, but this happens less frequently than a site having a few colours changed around.
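To make the difference concrete, here is a small sketch (Python 3) comparing a deliberately naive regular expression with a structural approach on the same table before and after a purely cosmetic change. It uses the standard library's event-driven HTMLParser as a stand-in rather than building a full tree, and the sample HTML and times are invented for the example.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> element it encounters."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored, so cosmetic ones cannot break the match.
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cells_by_regex(html):
    """Brittle approach: match the exact text of a bare <td> tag."""
    return re.findall(r"<td>(.*?)</td>", html)


def cells_by_parser(html):
    """Structural approach: let a parser find the <td> elements."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return [cell.strip() for cell in collector.cells]


# Invented sample data: the same table before and after a colour change.
original = "<table><tr><td>07:30</td><td>07:45</td></tr></table>"
restyled = ('<table><tr><td bgcolor="#eeeeee">07:30</td>'
            '<td bgcolor="#eeeeee">07:45</td></tr></table>')

print(cells_by_regex(original), cells_by_parser(original))
print(cells_by_regex(restyled), cells_by_parser(restyled))
```

A more careful pattern would cope with the added attribute, of course, but each such fix makes the expression harder to read, which is the trade-off described above.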
Beautiful Soup is a very useful package for Python, which will robustly convert even an invalid HTML page into a tree, and then gives you the methods required to traverse the tree. This way, scrapers can be bashed out pretty quickly. Here’s some code to set it up; after this’ll come the page-specific code that extracts the relevant table rows or whatever is required.
import urllib, sys, re, BeautifulSoup

def get_page(url):
    """Fetches an arbitrary page from the web and returns its contents."""
    try:
        location = urllib.urlopen(url)
    except IOError, (errno, strerror):
        sys.exit("I/O error(%s): %s" % (errno, strerror))
    content = location.read()
    # Clear out all troublesome whitespace
    content = content.replace("\n", "")
    content = content.replace("\r", "")
    content = content.replace("\t", "")
    content = content.replace("> ", ">")
    # Collapse runs of two spaces into one
    content = content.replace("  ", " ")
    location.close()
    return content

def generate_tree(page):
    """Converts a string of HTML into a document tree."""
    return BeautifulSoup.BeautifulSoup(page)
Once you have this set up, fetching a certain element on a page becomes as easy as writing:
print generate_tree(get_page('http://www.imdb.com/')).first('table')
Polling Period
We discussed how often the sensors/scrapers should fetch their target webpage to re-parse it. Polling a page too often is likely to get your IP address blocked. Personally I don’t think this is as big a problem as was made out. Most RSS readers are designed to poll a feed once every 30 minutes to an hour. This is a reasonable period. Bar a few examples (stock quotes specifically), very few sites that we’re monitoring will be updating more frequently than that. In fact, the period could likely be increased. It would be relatively simple to set up a cron job to run each of the sensors in order every 30 minutes.
This approach could then be extended. RSS readers are (or should be) designed to honour various HTTP headers so that they don’t re-fetch the same feed over and over when it isn’t changing. HTML pages can be served with those same headers, so the sensors could first make a conditional HEAD request (sending If-Modified-Since). If we get a 304 response, or if the Last-Modified header shows the page hasn’t changed since our last update cycle, we defer the update until the next cycle.
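A sketch of how that check might look in Python 3 follows. The function names are my own, and the policy of treating a missing or unparseable Last-Modified header as "changed" is an assumption; note that urllib reports a 304 as an HTTPError, so it has to be caught.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def head_status(url, last_fetch):
    """Send a conditional HEAD request; return (status, Last-Modified value)."""
    request = Request(url, method="HEAD")
    stamp = format_datetime(last_fetch.astimezone(timezone.utc), usegmt=True)
    request.add_header("If-Modified-Since", stamp)
    try:
        with urlopen(request) as response:
            return response.status, response.headers.get("Last-Modified")
    except HTTPError as error:
        if error.code == 304:  # not modified since last_fetch
            return 304, None
        raise


def needs_refetch(status, last_modified, last_fetch):
    """Decide whether a page should be re-parsed in this cycle.

    last_fetch must be a timezone-aware datetime of the previous fetch.
    """
    if status == 304:
        return False
    if last_modified is None:
        return True  # assumption: no header means we cannot rule out a change
    try:
        modified_at = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # assumption: an unparseable date is treated as a change
    if modified_at.tzinfo is None:
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    return modified_at > last_fetch


# Offline usage example with invented header values.
last_cycle = datetime(2005, 12, 14, 12, 0, tzinfo=timezone.utc)
print(needs_refetch(200, "Wed, 14 Dec 2005 10:00:00 GMT", last_cycle))
print(needs_refetch(200, "Wed, 14 Dec 2005 13:00:00 GMT", last_cycle))
```

A sensor would call head_status() first and only fetch and parse the full page when needs_refetch() says so.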
Ideally, the polling would be adaptive, so we have a single script that takes as input the derived update frequency of each page, and writes a new cron file with modified periodicity for each site. Thus, pages like the Dublin Bus timetables, which I’m working on, will be re-parsed very infrequently, since the site is rarely updated. Conversely, sites that serve constantly-updated information, like stock quotes and currency conversion rates, will be fetched much more often (but never more often than a fixed limit, say once every 10 minutes).
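The cron-writing part of such a script could be sketched roughly as below (Python 3). The sensor commands, paths and periods are hypothetical, the once-a-day cap is my own assumption, and the derivation of each page's update frequency is left out entirely.

```python
MIN_PERIOD = 10        # minutes; the limit suggested above
MAX_PERIOD = 24 * 60   # minutes; assumed cap of one run per day


def cron_schedule(period_minutes):
    """Translate a polling period in minutes into the five cron time fields.

    The result is approximate: cron step values only give even spacing
    when the step divides 60 (minutes) or 24 (hours).
    """
    period = max(MIN_PERIOD, min(MAX_PERIOD, int(period_minutes)))
    if period < 60:
        return "*/%d * * * *" % period
    hours = period // 60
    if hours >= 24:
        return "0 0 * * *"
    return "0 */%d * * *" % hours


def build_crontab(sensors):
    """Build crontab text from a mapping of command -> period in minutes."""
    lines = [
        "%s %s" % (cron_schedule(period), command)
        for command, period in sorted(sensors.items())
    ]
    return "\n".join(lines) + "\n"


# Hypothetical sensors: the script paths and periods are placeholders.
sensors = {
    "python /opt/sensors/dublin_bus.py": 7 * 24 * 60,  # rarely updated
    "python /opt/sensors/currency.py": 2,              # clamped to 10 minutes
}
print(build_crontab(sensors), end="")
```

The output could be written to a file and installed with crontab, so that each run of the script replaces the previous schedule.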
Nice info, very good starting point. I personally would like to figure out if IMDB will block too many queries from the same IP. Did you ever contact the sites you scraped or just hope for the best? I don’t think a site like ugc.ie will do any fancy IP logging.
§ Comment by Aaron on December 14th, 2005 at 10:55 pm
No scraping for me: FTP all the way (maybe?)
http://www.imdb.com/help/show_leaf?usedatasoftware
Anyone else interested in this should see if similar policies apply before blindly scraping content.
§ Comment by Aaron on December 14th, 2005 at 11:01 pm
Those requirements seem ridiculously draconian, given that editors at the IMDb aren’t even responsible for the majority of the content on the site. Akin to a book publisher stipulating that people may read their books only while patting their heads and saying a Hail Mary.
They do say they’ll grant express written consent for some people to use robots, if asked nicely. Google had to ask for permission to crawl the site, apparently.
§ Comment by Ross on December 15th, 2005 at 1:44 pm
Nice article, Ross. I looked into the Ruby version of Beautiful Soup (Rubiful Soup) earlier today, but found it to suffer from large processing overheads (not far short of 60 seconds) when running it on a page with a month’s worth of flight information (presumably because it builds a complete object model of the page). What kind of noticeable overheads (if any) did you experience when you were using the Python version? What size of pages were you parsing?
Back to regular expressions for the moment I think!
§ Comment by Graeme on December 15th, 2005 at 5:33 pm
I’ve spent the night putting together my parser, or at least a rough cut of it, for the Dublin Bus site. It’s in pretty good shape now. Running on a page like the 46A schedule, which has a frightening number of nested tables, it extracts the times from each table and collates them into a Python list in just over one second (including the page fetch). The source page is pretty atrociously coded too, so that’s not the problem. I’m not familiar with Ruby, so I don’t know what could be slowing your one down so much.
§ Comment by Ross on December 16th, 2005 at 12:47 am
Right, Græme recoded his scraper in Python and it ran well, so it looks like the currently crappy port of Beautiful Soup into Ruby is to blame.
§ Comment by Ross on December 16th, 2005 at 11:23 pm