Scraping the Web for Arts and Humanities

Contents

Introduction
1  Introducing web scraping
2  Introducing HTML
3  Introducing Python
4  Extracting some text
5  Downloading files
6  Extracting links
7  Extracting tables
8  Final notes

Figures and listings

4.1  Pitchfork.com review
4.2  Pitchfork.com code
4.3  Scraper to print full text
4.4  Scraper output
4.5  Looping over paragraphs
4.6  Looping over divisions
4.7  File output
5.1  Saving a file
5.2  Saving multiple files
5.3  Leveson inquiry website
5.4  Links on multiple pages
6.1  AHRC code
6.2  AHRC link scraper
6.3  AHRC output
6.4  OpenOffice dialog
7.1  ATP rankings
7.2  ATP code
7.3  ATP code
7.4  Improved ATP code
7.5  ATP output
1
Introducing web scraping
Usage scenarios
Web scraping will help you in any situation where you find yourself copying and pasting information from your web browser. Here are some times when I've used web scraping:

to download demo MP3s from pitchfork.com;
to download legal decisions from courts around the world;
to download results for Italian general elections, with breakdowns for each of the 8,000+ Italian municipalities;
to download information from the Economic and Social Research Council about the amount of money going to each UK university.

Some of these were more difficult than others. One of these, downloading MP3s from Pitchfork, we'll replicate in this booklet.
Alternatives
I scrape the web because I want to save time when I have to collect a lot of information. But there's no sense writing a program to scrape the web when you could save time some other way. Here are some alternatives to screen scraping:

ScraperWiki  ScraperWiki (https://scraperwiki.com/) is a website set up in 2009 by a bunch of clever people previously involved in the very useful http://theyworkforyou.com/. ScraperWiki hosts programs to scrape the web, and it also hosts the nice, tidy data these scrapers produce. Right now, there are around 10,000 scrapers hosted on the website.

You might be lucky, and find that someone has already written a program to scrape the website you're interested in. If you're not lucky, you can pay someone else to write that scraper, as long as you're happy for the data and the scraper to be in the public domain.

Outwit  Outwit (http://www.outwit.com/) is freemium software that acts either as an add-on for Firefox, or as a stand-alone product. It allows fairly intelligent automated extraction of tables, lists and links, and makes it easier to write certain kinds of scraper. The free version has fairly serious limitations on data extraction (maximum 100 rows in a table), and the full version retails for 50.

SocSciBot  If you are just interested in patterns of links between web sites and use Windows, you might try SocSciBot (http://socscibot.wlv.ac.uk/) from the University of Wolverhampton. It's free to use, and has good integration with other tools to analyze networks.
Respect robots.txt  Many sites publish a file called robots.txt, which tells automated programs which parts of the site they should stay away from. Here, for example, is an excerpt from the BBC's robots.txt (at http://www.bbc.co.uk/robots.txt):

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-perl
Disallow: /cgi-perlx
Disallow: /cgi-store
Disallow: /iplayer/cy/
Disallow: /iplayer/gd/
Disallow: /iplayer/bigscreen/
Disallow: /iplayer/cbeebies/episodes/
Disallow: /iplayer/cbbc/episodes/
Disallow: /iplayer/_proxy_
Disallow: /iplayer/pagecomponents/
Disallow: /iplayer/usercomponents/
Disallow: /iplayer/playlist/
Disallow: /furniture
Disallow: /navigation
Disallow: /weather/broadband/
Disallow: /education/bitesize
That means that the BBC doesn't want you scraping anything from iPlayer, or from their Bitesize GCSE revision micro-site. So don't.
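If you want to check an address against robots.txt from inside a program, Python's standard library can do it for you. Here is a minimal sketch, using the BBC example above:

import robotparser   # in Python 3 this module lives at urllib.robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.bbc.co.uk/robots.txt")
rp.read()

# can_fetch() returns False for any address the site has disallowed
print rp.can_fetch("*", "http://www.bbc.co.uk/iplayer/bigscreen/")
print rp.can_fetch("*", "http://www.bbc.co.uk/news/")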
Respect the hosting site's bandwidth  It costs money to host a web site, and repeated scraping from a web site, if it is very intensive, can result in the site going down. It's good manners to write your program in a way that doesn't hammer the web site you're scraping. We'll discuss this later.
Respect the law  Just because content is online doesn't mean it's yours to use as you see fit. Many sites which use paywalls will require you to sign up to Terms and Agreements. This means, for example, that you can't write a web scraper to download all the articles from your favourite journal for all time. The legal requirements will differ from site to site. If in doubt, consult a lawyer.
2
Introducing HTML
(The full HTML5 specification, if you ever need to consult it, is at http://dev.w3.org/html5/spec/single-page.html.)
Basics
The basics of HTML are simple. HTML is composed of elements
called tags. Tags are enclosed in left and right-pointing angle brackets. So, <html> is a tag.
Some tags are paired, and have opening and closing tags. So,
<html> is an opening tag which appears at the beginning of each
HTML document, and </html> is the closing tag which appears at
the end of each HTML document. Other tags are unpaired. So, the
<img> tag, used to insert an image, has no corresponding closing
tag.
Listing 2.1 shows a basic and complete HTML page.
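A minimal page along those lines, with line numbers added for the discussion that follows (the heading and paragraph text are just placeholders), looks like this:

1   <!DOCTYPE html>
2   <html>
3   <body>
4
5   <h1>My First Heading</h1>
6
7   <p>My first paragraph.</p>
8
9   </body>
10  </html>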
You can see this HTML code and the web page it produces at the W3Schools TryIt editor (http://www.w3schools.com/html/tryit.asp?filename=tryhtml_intro). Complete Exercise 1 before you go on.
Now you've experimented with some basic HTML, we can go over the listing in 2.1 in more detail:
Line 1 has a special tag to tell the browser that this is HTML, and
not plain text.
Line 2 begins the web page properly.
Line 3 starts the body of the web page. Web pages have a
<body> and a <head>. The <head> contains information like the
title of the page.
Line 5 starts a new heading, in the biggest heading size, and closes it.
Line 7 starts a new paragraph, and closes it. HTML needs to be
told when a new paragraph starts. Otherwise, it runs all the text
together.
Lines 9 and 10 close off the tags we began with.
<a href = " http :// www . uea . ac . uk / " > University of East Anglia </
a>
and here's a code snippet which would insert the UEA logo:

<img src="http://www.uea.ac.uk/polopoly_fs/1.166636!ueastandardrgb.png">
The attribute for the link tag is href (short for hyper-reference), and it takes a particular value: in this case, the address of the web page we're linking to. The attribute for the image tag is src (short for source), and it too takes a particular value: the address of a PNG image file. Try copying these in to the TryIt editor and see what happens. You can see that the tag for links has an opening and closing tag, and that the text of the link goes between these two. The image tag doesn't need a closing tag.
Tables
Because of our interest in scraping, it's helpful to take a close look at the HTML tags used to create tables. The essentials can be summarized very quickly: each table starts with a <table> tag; each table row starts with a <tr> tag; each table cell starts with a <td> tag. Listing 2.2 shows the code for a basic HTML table.
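A small table in that style, with two rows of two cells each (the cell contents here are just placeholders), looks like this:

<table>
  <tr>
    <td>Row 1, cell 1</td>
    <td>Row 1, cell 2</td>
  </tr>
  <tr>
    <td>Row 2, cell 1</td>
    <td>Row 2, cell 2</td>
  </tr>
</table>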
Two tags you will meet constantly are <div> and <span>. Neither does anything very visible on its own, not even starting a new line, but they are used very often in modern HTML pages to add formatting information, or to add interactive elements. Most web pages are now a mix of HTML, which we know about, and two other technologies: CSS (short for Cascading Style Sheets) and Javascript, a programming language. We don't need to know about them, but they'll crop up in most web pages we look at. You'll often see divs and spans used with the attributes id or class. These provide hooks for the CSS or Javascript to latch on to.
Tag          Stands for            Used in
A            Anchor                Links
B            Bold                  Formatting text
BLOCKQUOTE   Block-quotes          Formatting text
BODY         Body                  HTML structure
BR           Line BReak            Formatting text
DIV          DIVision              HTML structure
EM           EMphasis              Formatting text
HEAD         Head                  HTML structure
H1...H6      Heading               Formatting text
I            Italics               Formatting text
IMG          Image                 HTML structure
LI           List Item             Lists
OL           Ordered List          Lists
P            Paragraph             Formatting text
PRE          PRE-formatted text    Formatting text
SPAN         Span                  HTML structure
TABLE        Table                 Tables
TD           Table Data            Tables
TH           Table Header          Tables
TR           Table Row             Tables
UL           Unordered list        Lists
It doesn't much matter where you save your file, though you might want to create a folder to hold all the work you'll be doing in the next chapter. It also doesn't matter what you call it, but test.html would be a good suggestion.
Once you've saved your file, it's time to open it in your browser. You should be able to open a local file in your browser, though some browsers will, as a default, only show you files ending in .htm or .html.
If your browser offers to save the file you've just tried to open, you've done something wrong. If you get a whole load of gibberish, you've done something wrong. Try googling for the name of your text editor and "save plain text".
3
Introducing Python
Why choose Python? Python is not the only programming language out there. It's not even the programming language I use most often. But Python has a number of advantages for us:

It's tidy. Code written in Python is very easy to read, even for people who have no understanding of computer programming. This compares favourably with other languages.

It's popular. Python is in active development and there is a large installed user base. That means that there are lots of people learning Python, that there are lots of people helping others learn Python, and that it's very easy to Google your way to an answer.

It's used for web scraping. The site ScraperWiki (which I mentioned in the introduction) hosts web scrapers written in three languages: Python, Ruby, and PHP. That means that you can look at Python scrapers that other people have written, and use them as templates or recipes for your own scrapers.
Before we begin
Installing Python
Installing BeautifulSoup
First steps
I'll assume you have Python installed. I'm also going to assume that you're running Python interactively: that is, you're staring at a screen which looks something like Figure 3.1, even though the colors might be different.
Listing 3.1: Python terminal

1+1
22.0/7.0
pow(2, 16)
pi = 22.0 / 7.0
pi
a = 20
b = 10
a = b
print a
print b
Numbers aren't the only type of value that variables can hold. They can also hold strings.
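For the slicing examples below, assume we have already put a first name into a variable; the particular name used here is just a placeholder:

myfirstname = "christopher"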
myfirstname[0:5]
myfirstname[1:5]
myfirstname[:5]
myfirstname[5:]
Looper
One important way in which programming saves effort is through looping. We can ask our program to do something to each item in a list, or to do something for a range of values. For example, if we're interested in calculating the number of possible combinations of n students (2^n), for up to eight students, we might write the following:
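A sketch of such a loop, using range() to run n from 1 to 8:

for n in range(1, 9):
    print pow(2, n)    # the number of possible combinations of n students

Loops work over strings too: the next snippet prints each letter of myfirstname on its own line.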
for i in myfirstname:
    print i
Regular expressions
Regular expressions are a powerful language for matching text patterns. This section gives a basic introduction to regular expressions themselves, sufficient for our Python exercises, and shows how regular expressions work in Python. The Python re module provides regular expression support. In Python, a regular expression search is typically written as:
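something of this shape, using the classic word:cat example that the next paragraph picks apart (the example string is purely an illustration):

import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# the if-statement tests whether the search succeeded
if match:
    print 'found', match.group()    # prints: found word:cat
else:
    print 'did not find'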
The code match = re.search(pat, str) stores the search result in a variable named match. Then the if-statement tests the match: if it is true, the search succeeded and match.group() is the matching text (e.g. word:cat). Otherwise, if the match is false (None to be more specific), then the search did not succeed, and there is no matching text. The r at the start of the pattern string designates a Python "raw" string, which passes through backslashes without change, which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the r just as a habit.
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single characters:

a, X, 9: ordinary characters just match themselves exactly. The meta-characters which do not match themselves, because they have special meanings, are: . ^ $ * + ? { [ ] \ | ( ) (details below)

. (a period) matches any single character except newline \n

\w (lowercase w) matches a word character: a letter or digit or underscore [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. \W (uppercase W) matches any non-word character.

\b matches the boundary between a word character and a non-word character

\s (lowercase s) matches a single whitespace character: space, newline, return, tab, or form feed [ \n\r\t\f]. \S (uppercase S) matches any non-whitespace character.

\t, \n, \r match tab, newline, and return

\d matches a decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

^ = start, $ = end: these match the start or the end of the string
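To see a few of these patterns in action (the strings here are made up for illustration):

import re

print re.search(r'\d\d\d', 'p123x').group()     # '123', three digits
print re.search(r'\w\w\w', '@@abcd!!').group()  # 'abc', three word characters
print re.search(r'^b\w+', 'foobar') is None     # True, because the string does not start with 'b'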
Conclusion
This chapter has given you a whistle-stop introduction to some features of Python. You haven't really been able to use any of these features in any programs; you've just learned that they exist. That's okay. The next chapters will show you what full programs look like, and hopefully you'll come to recognize some of the structures and features you've just learned about in those programs.
4
Extracting some text
Web pages include a lot of irrelevant formatting. Very often, we're not interested in the images contained in a page, the mouse-overs that give us definitions of terms, or the buttons that allow us to post an article to Facebook or retweet it. Some browsers now allow you to read pages without all of this irrelevant information. In this first applied chapter, we're going to write some Python scrapers to extract the text from a page, and print it either to the screen or to a separate file. This is going to be our first introduction to the BeautifulSoup package, which is going to make things very easy indeed.
The example
The example Im going to use for this chapter is a recent review
from the music website, Pitchfork.com. In particular, its a review of
the latest (at the time of writing) Mogwai album, A Wrenched and
Virile Lore.2 You can find it at http://pitchfork.com/reviews/
albums/17374-a-wrenched-virile-lore/. When you open it in
your browser, it should look something like Figure 4.1.
You can take a look at the source of the web page by right clicking
and selecting View Source (or the equivalent in your browser). The
source is not that readable: it's 138 lines long, but many of those lines are very long indeed. We're interested in the start of the review text itself, beginning "Mogwai made their name". Try searching for "made their name" in the source. You should find it on line 55 (a horrifically long line). See Listing 4.2.
How are we supposed to make any sense of that? Well, let's look at where the review begins. We've seen that "Mogwai" is between opening and closing link tags (<a> and </a>). Those take us to a round-up of all Pitchfork articles on Mogwai, which we don't want. If we go back before that, we see that the first line is wrapped in opening and closing paragraph tags. That's helpful: we'll definitely be interested in text contained in paragraph tags (as opposed to free-floating text). But the tag that's really helpful is the one before the opening paragraph tag:
Listing 4.2: Pitchfork.com code (excerpt of line 55)
html/mogwai.html

</div> <div class="info"> <h1> <a href="/features/update/">Update</a> </h1>
<h2> <a href="/features/update/8965-mount-kimbie/">Mount Kimbie</a> </h2>
<div class="abstract"> <p> Following their excellent 2010 debut LP, <i>Crooks &amp; Lovers,</i>
subtle UK bass duo Mount Kimbie were picked up by the estimable folks at Warp Records. [...]
</p> </div> </div> </li> </script> </div> </div>
<div id="content"> [...] <div id="main"> <ul class="review-meta"> <li data-pk="18553">
<div class="artwork"> <img src="http://cdn.pitchfork.com/albums/18553/homepage_large.b79481ae.jpg" /> </div>
<div class="info"> <h1> <a href="/artists/2801-mogwai/">Mogwai</a> </h1>
<h2>A Wrenched and Virile Lore</h2> <h3>Sub Pop; 2012</h3>
<h4>By <address>Stuart Berman</address>; <span class="pub-date">November 27, 2012</span> </h4>
<span class="score score-7-0">7.0</span> [...] </div> </li> </ul>
<div class="object-detail"> <div class="editorial">
<p> <a href="http://pitchfork.com/artists/2801-mogwai/" target="_blank">Mogwai</a> made their
name on a simple formula: be very quiet, and then, without warning, be sadistically loud. But
very early on, the group showed they weren't especially precious about their practice. [...] </p>
<p> Fourteen years later, the band's second remix album,
<a href="http://pitchfork.com/news/48102-mogwai-share-remix-album-details/" target="_blank">
<i>A Wrenched Virile Lore</i></a>, arrives as a more cohesive work [...] </p>
</div> </div> </div> [...]
Listing 4.3: Scraper to print full text
python code/mogwai1.py

import urllib2
from bs4 import BeautifulSoup

start = "http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/"
page = urllib2.urlopen(start).read()
soup = BeautifulSoup(page)
print(soup.get_text())
Listing 4.4: Scraper output
python code/mogwai1.out

htmlvar NREUMQ=NREUMQ||[];NREUMQ.push(["mark","firstbyte",new Date().getTime()]) Mogwai: A Wrenched Virile Lore | Album Reviews | Pitchfork
[if IE 7]> <link rel="stylesheet" type="text/css" href="http://cdn.pitchfork.com/desktop/css/ie7.css" /> <![endif][if IE]> <script src="http://cdn4.pitchfork.com/desktop/js/excanvas.js"></script> <![endif]
var p4k = window.p4k || {}; p4k.init = []; p4k.init_once = []; p4k.ads = {}; var __js = []; var __jsq = []; __jsq.push(function(){ $(function(){ p4k.core.init() }) })
__js.push("https://www.google.com/jsapi?key\u003DABQIAAAAd4VqGt0ds\u002DTq6JhwtckYyxQ7a1MeXZzsUvkGOs95E1kgVOL_HRTWzR1RoBGaK0NcJfQcDtUuCXrHcQ"); __DFP_ID__ = "1036323"; STATIC_URL = "http://cdn.pitchfork.com/";
var _gaq = _gaq || [];
google var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-535622-1']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })();
[...]
Hmm... not so good. We've still got a lot of crap at the top. We're going to have to work on that. For the moment, though, check you can do Exercise 5 before continuing.

Exercise 5  Your first scraper

Try running the code given in 4.3. Try it first as a saved program. Now try it interactively. Did it work both times?

Try removing the http:// from the beginning of the web address. Does the program still work?

You could have learnt about the get_text() function from the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. Go there now and look at some of the other ways of accessing parts of the soup. Using Python interactively, try title and find_all. Did they work?
Listing 4.5: Looping over paragraphs
python code/mogwai2.py

import urllib2
from bs4 import BeautifulSoup

start = "http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/"
page = urllib2.urlopen(start).read()
soup = BeautifulSoup(page)

# The closing lines are reconstructed: as the caption and the discussion below
# describe, we loop over every paragraph tag and print its text.
for paragraph in soup.find_all("p"):
    print(paragraph.get_text())
This time the output doesn't have a whole load of crap at the start, and it starts with the text of the review.

Unfortunately, it's still not perfect. You'll see at the bottom of the output that there are a number of sentence fragments ending in ellipses (...). If you go back to the page in your web browser, as shown in Figure 4.1, you'll see that these sentence fragments are actually parts of boxes linking to other Mogwai reviews. We don't want to include those in our output. We need to find some way of becoming more precise.

That's where the div we saw earlier comes into play. Here's the listing; explanation follows.
Listing 4.6: Looping over divisions
python code/mogwai3.py

import urllib2
from bs4 import BeautifulSoup

start = "http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/"
page = urllib2.urlopen(start).read()
soup = BeautifulSoup(page)

# The closing lines are reconstructed: as the Recap below explains, we pivot on
# the div whose class is "editorial", and print each paragraph inside it.
for div in soup.find_all("div", {"class": "editorial"}):
    for paragraph in div.find_all("p"):
        print(paragraph.get_text())
Recap
So how did we arrive at this wonderful result? We proceeded in four
steps.
First, we identified the portion of the web page that we wanted, and found the corresponding location in the HTML source code. This is usually a trivial step, but it can become more complicated if you want to find multiple, non-contiguous parts of the page.

Second, we identified a particular HTML tag which could help us refine our output. In this case, it was a particular div which had a class called editorial. There was no guarantee that we would find something like this. We were lucky: well-built web sites usually include classes like this to help them format the page.

Third, we used BeautifulSoup to help us loop over the tags we identified in the second step. We used information both on the particular div and on the paragraphs containing the text within that div.

Fourth, we used BeautifulSoup's get_text on each paragraph, and printed each in turn.

This is a common structure for extracting text. Whilst the tags you use to identify the relevant portion of the document might differ, this basic structure can and ought to guide your thinking.
Listing 4.7: File output
python code/mogwai4.py

import urllib2
from bs4 import BeautifulSoup
import codecs

start = "http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/"
page = urllib2.urlopen(start).read()
soup = BeautifulSoup(page)

# The middle lines are reconstructed: the discussion below describes opening an
# output file with codecs (to cope with accented characters) and writing each
# paragraph to it instead of printing. The output filename is a stand-in.
outfile = codecs.open("mogwai.txt", "w", "utf-8")

for div in soup.find_all("div", {"class": "editorial"}):
    for paragraph in div.find_all("p"):
        outfile.write(paragraph.get_text() + "\n")

outfile.close()
There are two changes you need to take note of. First, we import a new package, called codecs. That's to take care of things like accented characters, which are represented in different ways on different web pages. Second, instead of calling the print function, we write each paragraph to an output file, and close that file once we have finished with it.
A taster
We've worked through printing out a review from a particular website. Now it's time to try a different example.
The BBC lists many of the interviewees who have appeared on the Radio 4 programme Desert Island Discs. One of the most recent castaways was Edmund de Waal. You can see his selections at http://www.bbc.co.uk/radio4/features/desert-island-discs/castaway/2ada59ff#b01p314n.
We'll return to this example later, but try Exercise 6 to see how much you can extract from this page.
Exercise 6 Castaway scraper
Try looking at the source of the De Waal page.
1. What tags surround the artist of each track?
2. Is this on its own enough to extract the artist? If not, what div
must you combine it with?
3. Write a program to extract the eight artists chosen by Edmund de
Waal.
4. Look again at the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/. In particular, look at the section headed .next_sibling and .previous_sibling. Can you use .next_sibling to print out the track?
5
Downloading files
In the previous section, Python helped us clear a lot of the junk from the text of web pages. Sometimes, however, the information we want isn't plain text, but a file: perhaps an image file (.jpg, .png, .gif), or a sound file (.mp3, .ogg), or a document file (.pdf, .doc). Python can help us by making the automated downloading of these files easier, particularly when they're spread over multiple pages.
In this chapter, we're going to learn the basics of how to identify links and save them. We're going to use some of the regular expression skills we learned back in Chapter 3. And we're going to make some steps in identifying sequences of pages we want to scrape.
The example
As our example, we're going to use a selection of files hosted on UbuWeb (www.ubu.com). UbuWeb describes itself as "a completely independent resource dedicated to all strains of the avant-garde, ethnopoetics, and outsider arts". It hosts a number of out-of-circulation media, including the complete run of albums released by Brian Eno's short-lived experimental record label, Obscure Records.
Listing 5.1: Saving a file
python code/toop.py
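A sketch of the listing's steps follows; the MP3 address here is a stand-in for the real UbuWeb address, and line numbers are included because the text below refers to lines 8 and 10:

1   from urllib import urlretrieve
2
3   # the address of the file we want; a stand-in for the real UbuWeb address
4   url = "http://www.ubu.com/media/sound/obscure/example-track.mp3"
5
6   # the filename is the last part of the address, after the final slash
7   filename = url.split("/")[-1]
8   directory = "downloads/"   # create this folder first, and keep the trailing slash
9
10  urlretrieve(url, directory + filename)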
We have to give the saved file a name, and we can be a little smart about it and create a filename from the address itself; Python's not going to invent one for us. So we'll use the last part of the address, the part after the last forward slash.

In order to get that, we take the url and call the split function on it. We give the split function the character we want to split on, the forward slash. That split function would normally return us a whole list of parts. But we only want the last part, so we use the minus notation to count backwards (as we saw before).

In line 8, we create another variable to tell Python which directory we want to save the file in. Remember to create this directory, or Python will fail. Remember also to include the trailing slash.

Finally, line 10 does the hard work of downloading stuff for us. We call the urlretrieve function, and pass it two arguments: the address, and a path to where we want to save the file, which is the directory plus the filename.

One thing you'll notice when you try to run this program: it will seem as if it's not doing anything for a long time. It takes time to download things, especially from sites which aren't used to heavy traffic. That's why it's important to be polite when scraping.
Listing 5.2: Saving multiple files
python code/toop2.py

import re
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
from bs4 import BeautifulSoup

# The body of this listing is reconstructed from the surrounding description:
# find every link on the page that ends in .mp3, turn it into a full address,
# and save each file. The start address is a stand-in for the UbuWeb page used
# in the original.
start = "http://www.ubu.com/sound/obscure.html"
directory = "downloads/"    # create this folder first

soup = BeautifulSoup(urlopen(start))
for link in soup.find_all("a", href=re.compile(r"\.mp3$")):
    url = urlparse.urljoin(start, link["href"])
    filename = url.split("/")[-1]
    urlretrieve(url, directory + filename)
The Leveson Inquiry website is at http://www.levesoninquiry.org.uk/. Each witness who gave evidence has a page of their own, with an address of the form http://www.levesoninquiry.org.uk/evidence/?witness=rupert-murdoch or http://www.levesoninquiry.org.uk/evidence/?witness=tony-blair.
Each witness page links to a number of PDF files, and we want all of the PDF transcripts. So we'll amend the same regular expression code we used to download MP3s above. Listing 5.4 shows the necessary steps.
python code/leveson.py
1   import re
2   import urlparse
3   import time
4   from urllib2 import urlopen
5   from urllib import urlretrieve
6   from bs4 import BeautifulSoup
7
8   start = "http://www.levesoninquiry.org.uk/hearings/"   # a stand-in for the page listing witnesses
9   baseurl = "http://www.levesoninquiry.org.uk"
10
11  # lines 11 onwards are reconstructed from the step-by-step description below
12  soup = BeautifulSoup(urlopen(start))
13  for witness in soup.find_all("a", href=re.compile("witness=")):
14      fulladdress = urlparse.urljoin(baseurl, witness["href"])
15      time.sleep(1)    # pause to give the servers a rest
16      soup = BeautifulSoup(urlopen(fulladdress))
17      # we want just the PDF transcripts, so match on the link text as well
18      for link in soup.find_all("a", href=re.compile(r"\.pdf$"), text=re.compile("Transcript")):
19          print baseurl + link["href"]
We start with the usual calling of certain packages, before defining two variables. The first variable, start, is the page we'll go to first. The second variable, baseurl, is something we're going to use to make sense of the links we get. More on that later.

We start the meat of the program on line 13, where we iterate over links which contain witness=. We can access the address for each of those links through witness["href"]. However, these links are not sufficient on their own. They're what's known as relative links: they miss out all the http://www.leveson... guff at the start. The only way of knowing whether links are relative or not is to look carefully at the code.

Because of this, we combine the base url with the value of the href attribute. That gives us the full address. (Check it if you don't believe me.)
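You can check that combination interactively; urlparse.urljoin understands relative links (simple string concatenation would also work here, since the href begins with a slash):

import urlparse

baseurl = "http://www.levesoninquiry.org.uk"
print urlparse.urljoin(baseurl, "/evidence/?witness=tony-blair")
# http://www.levesoninquiry.org.uk/evidence/?witness=tony-blair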
We then pause a little to give the servers a rest, with time.sleep from the time package. We then open a new page, with the full address we just created. (We store it in the same soup, which might get confusing.)

Now we're on the witness page, we need to find more links. Just searching for stuff that ends in .pdf isn't enough; we need just the PDF transcripts. So we also add a regular expression to search on the text of the link.

To save bandwidth (and time!) we close by printing off the base URL together with the relative link from the href attribute of the <a> tag. If that leaves you unsatisfied, try Exercise 7.
Exercise 7 Leveson scraper
1. Amend the source found in Listing 5.4 to download all text transcripts. (Text files are much smaller than PDF files; the whole set
will take much less time to download).
2. Turn your wireless connection off and try running the program
again. What happens?
6
Extracting links
The idea of the link is the fundamental building block not only of the web but of many applications built on top of the web. Links, whether they're links between normal web pages, between followers on Twitter, or between friends on Facebook, are often based on latent structures that not even those doing the linking are aware of. We're going to write a Python scraper to extract links from a particular web site, the AHRC website. We're going to write our results to a plain text spreadsheet file, and we're going to try and get that into a spreadsheet program so we can analyze it later.
AHRC news
The AHRC has a news page at http://www.ahrc.ac.uk/News-and-Events/News/Pages/News-Listing.aspx. It has some image-based links at the top, followed by a list of the 10 latest news items. We're going to go through each of these, and extract the external links for each item.
Let's take a look at the code at the time of writing. An excerpt is featured in Listing 6.1.
Listing 6.1: AHRC code
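Schematically, each news item on that page sits inside markup of roughly this shape (an illustration of the structure, not the page's exact source):

<div class="item">
  ... <a href="/News-and-Events/News/Pages/Some-news-item.aspx">Some news item</a> ...
</div>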
We're going to pivot off the div with class of item, and identify the links in those divs. Once we get those links, we'll go to those news items. Those items (and you'll have to trust me on this) have divs with class of pageContent. We'll use that in the same way.
Listing 6.2: AHRC link scraper
python code/ahrc.py

1   import re
2   import urlparse
3   import codecs
4   from urllib2 import urlopen
5   from urllib import urlretrieve
6   from bs4 import BeautifulSoup
7
8   start = "http://www.ahrc.ac.uk/News-and-Events/News/Pages/News-Listing.aspx"
9   outfile = codecs.open("ahrc_links.csv", "w", "utf-8")
10
11  # lines 11 onwards are reconstructed from the description in the text below
12  soup = BeautifulSoup(urlopen(start))
13  for item in soup.find_all("div", {"class": "item"}):
14      for link in item.find_all("a"):
15          if link.has_attr("href"):
16              newsurl = urlparse.urljoin(start, link["href"])
17              newssoup = BeautifulSoup(urlopen(newsurl))
18              for content in newssoup.find_all("div", {"class": "pageContent"}):
19                  for a in content.find_all("a"):
20                      if not a.has_attr("href"):
21                          continue
22                      linkurl = a["href"]
23                      if linkurl[0:4] == "http":
24                          linkurl = re.sub("/.*", "", linkurl[7:])
25                          outfile.write(newsurl + "\t" + linkurl + "\n")
26
27  outfile.close()
Listing 6.2 shows the eventual link scraper. You should notice two things. First, we're starting to use if-tests, like we discussed back in Chapter 3. Second, we've got quite a lot of loops: we loop over news items, we have a (redundant) loop over content divs, and we loop over all links. The combination of these two things means there's quite a lot of indentation.
Let me explain three lines in particular. Line 15 tests whether or not the <a> tag that we pick up in the loop beginning on line 14 has an href attribute. It's good to test for things like that: there are some <a> tags which don't have href attributes. (Instead they have a name attribute, and act as anchor points; you use them whenever you go to a link with a hash symbol (#) after the .html.) If you get to line 16 with just such a tag, Python will choke.

Line 23 takes a particular slice out of our link text. It goes from the beginning to the fourth character. We could have made that clearer by writing linkurl[0:4]; remember, lists in Python start from zero, not one. We're relying on external links beginning with http.

Line 24 uses a regular expression. Specifically, it says: take any kind of character that follows a forward slash, replace it with nothing, and do that to the variable linkurl from the seventh character onwards. That's going to mean that we get only the website address, not any folders below that. (So digitrans.crowdvine.com/pages/watch-live becomes digitrans.crowdvine.com.)
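In isolation, those two steps look like this (the address is the example just mentioned):

import re

linkurl = "http://digitrans.crowdvine.com/pages/watch-live"

print linkurl[0:4]                    # 'http', so this is an external link
print re.sub("/.*", "", linkurl[7:])  # 'digitrans.crowdvine.com'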
Finally, line 25 gives us our output. We want to produce a spreadsheet table with two columns. The first column is going to be the AHRC page that we scraped; the second column is going to give us the address of the external site that page links to.
Listing 6.3: AHRC output
ahrc_links.csv

http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    research.sas.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    humanities.exeter.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    www.exeter.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    www.sas.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Join-in-the-Moot-today.aspx    digitrans.crowdvine.com
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Join-in-the-Moot-today.aspx    digitrans.crowdvine.com
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Join-in-the-Moot.aspx    digitrans.crowdvine.com
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Join-in-the-Moot.aspx    digitrans.crowdvine.com
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    research.sas.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    humanities.exeter.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    www.exeter.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Care-for-the-Future-and-Science-and-Culture-leaderships-fellows-announced.aspx    www.sas.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Archaeologists-reveal-rare-Anglo-Saxon-feasting-hall.aspx    www.lymingearchaeology.org
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Archaeologists-reveal-rare-Anglo-Saxon-feasting-hall.aspx    blogs.reading.ac.uk
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Investment-to-promote-innovation-in-additive-manufacturing.aspx    www.innovateuk.org
http://www.ahrc.ac.uk/News-and-Events/News/Pages/Hajj-Journey-to-the-Heart-of-Islam.aspx    www.britishmuseum.org
http://t.co/3Q4ls16I    charades.hypotheses.org
I told Python to write this to the file ahrc_links.csv. Files that end in .csv are normally comma-separated values files; that is, they use a comma where I used a tab. I still told Python to write to a file ending in .csv, because my computer recognises .csv as an extension for comma-separated values files, and tries to open the thing in a spreadsheet. There is an extension for tab-separated values files, .tsv, but my computer doesn't recognise that. So I cheat, and use .csv.
I can do this because OpenOffice (the spreadsheet I use) intelligently recognises that I'm trying to open a plain-text file, and asks me to check the separator I use. The dialog box I get is shown in Figure 6.4. You can see that there are a whole host of separator options for me to play with.

Figure 6.4: OpenOffice dialog
This is going to form the route for getting our information into a form in which we can analyze things. We're going to get a whole load of information, smash it together with tabs separating it, and open it in a spreadsheet.
That strategy pays off most obviously when looking at tables, and that's what we're going to look at next.
7
Extracting tables
A lot of information on the web, and particularly the kind of information that we're interested in extracting, is contained in tables. Tables are also a natural kind of thing to pipe through to a spreadsheet. But sometimes the table formatting used for the web gets in the way. We're going to extract tables with a degree of control that wouldn't be possible with manual copy-and-pasting.
Our example
I believe (mostly without foundation) that most people get most of their exposure to tables from sports. Whether it's league or ranking tables, almost all of us understand how to read these kinds of tables. I'm going to use one particular ranking table, the ATP tennis tour ranking. You can see the current men's singles ranking at http://www.atpworldtour.com/Rankings/Singles.aspx. The website is shown in Figure 7.1.
Figure 7.1: ATP rankings
Try copying and pasting the table from the ATP tour page into your spreadsheet. You should find that the final result is not very useful. Instead of giving you separate columns for ranking, player name, and nationality, they're all collapsed into one column, making it difficult to search or sort.
You can see why that is by looking at the source of the web page.
Listing 7.2: ATP code
html/atp.html

  <!-- earlier in this row (not reproduced here) comes <td class="first">, which
       bundles the rank (in a <span>), the player's name (in an <a>), and the
       nationality into a single cell -->
  <td><a href="/Tennis/Players/Zo/J/Jurgen-Zopp.aspx?t=rb">579</a></td>
  <td>-2</td>
  <td class="last"><a href="/Tennis/Players/Zo/J/Jurgen-Zopp.aspx?t=pa&m=s">22</a></td>
</tr>
You should be able to see that there are only four cells in this
table row, whereas we want to extract six pieces of information (rank,
name, nationality, points, week change, and tournaments played).
What we're going to do is produce an initial version of the scraper which extracts the table as it stands, and then improve things by separating out the first column.
Listing 7.3: ATP code
python code/atp.py
import re
import urlparse
import codecs
from urllib2 import urlopen
from urllib import urlretrieve
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

# The body is reconstructed from the description in the text: write the text of
# every cell, separated by tabs, with a new line at the end of every row. The
# output filename is a stand-in.
start = "http://www.atpworldtour.com/Rankings/Singles.aspx"
outfile = codecs.open("atp.csv", "w", "utf-8")

# the SoupStrainer limits parsing to tables; the original may have targeted the
# rankings table more precisely
soup = BeautifulSoup(urlopen(start), parse_only=SoupStrainer("table"))

for row in soup.find_all("tr"):
    for cell in row.find_all("td"):
        outfile.write(cell.get_text() + "\t")
    outfile.write("\n")

outfile.close()
Within each row, I write out the text of each cell followed by my cell separator, which is a tab. Finally, after the end of the inner for loop, I add a new line so that my spreadsheet isn't just one long row.
How are we going to improve on that? We're going to use some if and else statements in our code. Essentially, we're going to process the cell contents one way if it has a class of first, but process it in quite a different way if it doesn't. Listing 7.4 shows the listing.
The major differences with respect to the previous listing are as follows. There's a little bit of a trick in line 14. We're going to parse table cells with class first on the assumption that they contain spans with the rank, and links, and so on. Because of that, we're going to ignore the first row of the table: it has a table cell with class first which doesn't contain a span with the rank, and if we ask Python to get spans from a table cell which doesn't contain them, it's going to choke. So we take a slice of the results returned by BeautifulSoup, omitting the first element.
Listing 7.4: Improved ATP code
python code/atp2.py

1   import re
2   import urlparse
3   import codecs
4   from urllib2 import urlopen
5   from urllib import urlretrieve
6   from bs4 import BeautifulSoup
7   from bs4 import SoupStrainer
8
9   start = "http://www.atpworldtour.com/Rankings/Singles.aspx"
10  outfile = codecs.open("atp.csv", "w", "utf-8")
11
12  soup = BeautifulSoup(urlopen(start), parse_only=SoupStrainer("table"))
13  # line 14: slice off the first element, the header row, as the text explains
14  for row in soup.find_all("tr")[1:]:
15      for cell in row.find_all("td"):
16          if cell.has_attr("class") and "first" in cell["class"]:
17              # reconstructed: the "first" cell bundles rank, name, and
18              # nationality together, so pull them apart into separate columns
19              rank = cell.find("span").get_text()
20              name = cell.find("a").get_text()
21              nationality = cell.get_text().replace(rank, "").replace(name, "").strip()
22              outfile.write(rank + "\t" + name + "\t" + nationality + "\t")
23          else:
24              outfile.write(cell.get_text() + "\t")
25      outfile.write("\n")
26
27  outfile.close()
The ATP page also offers drop-down menus for different weeks, and for ranking strata beyond the top 100 (101-200, 201-300, and so on). We could adapt the code we've just written to scrape this information as well. In this instance, the key would come about not through replicating the actions we would go through in the browser (selecting each item, hitting Go, copying the results), but through examining what happens to the address when we try an example change of week, or change of ranking stratum. For example: if we select 101-200, you should see that the URL in your browser's address bar changes from http://www.atpworldtour.com/Rankings/Singles.aspx to http://www.atpworldtour.com/Rankings/Singles.aspx?d=26.11.2012&r=101&c=#. In fact, if we play around a bit, we can get to arbitrary start points by just adding something after r; we don't even need to include the d= and c= parameters. Try http://www.atpworldtour.com/Rankings/Singles.aspx?r=314 as an example.
We might therefore wrap most of the code from Listing 7.4 in a for
loop of the form:
weekends = []
soup = BeautifulSoup(urlopen(start), parse_only=SoupStrainer("select", {"id": "singlesDates"}))
for option in soup.find_all("option"):
    weekends.append(option.get_text())
We could then create two nested for loops over the strata and the weekends, paste these on to the address we started with, and extract our table information.
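A sketch of that wrapping (the strata list is illustrative, and whether the option text can be passed straight through as the d parameter is something to check against the site):

strata = ["1", "101", "201", "301"]      # starting rank of each stratum

for weekend in weekends:                 # e.g. "26.11.2012"
    for stratum in strata:
        address = ("http://www.atpworldtour.com/Rankings/Singles.aspx"
                   "?d=" + weekend + "&r=" + stratum)
        soup = BeautifulSoup(urlopen(address))
        # ... then pull the table apart exactly as in Listing 7.4 ...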
8
Final notes
Congratulations: you've got this far. You've learned how to understand the language that web pages are written in, and to take your first steps in a programming language called Python. You've written some simple programs that extract information from web pages, and turn them into spreadsheets. I would estimate that those achievements place you in the top 0.5% of the population when it comes to digital literacy.
Whilst you've learned a great deal, the knowledge you have is quite shallow and brittle. From this booklet you will have learned a number of recipes that you can customize to fit your own needs. But pretty soon, you'll need to do some research on your own. You'll need independent knowledge to troubleshoot problems you have customizing these recipes, and for writing entirely new recipes.
If scraping the web is likely to be useful to you, you need to do the following.
First, you need to get a book on how to program in Python. Most university libraries should have a couple of books, either in print or (more unwieldy) online through O'Reilly. An introductory text will give you a much fuller understanding. In particular, it will show you techniques that we haven't needed, but could have employed to make our code more robust or tidier.
Second, you need to get a grip on functions and regular expressions. We haven't really talked about functions here. We've used the functions that are built in to Python and to the several packages we've used, but we haven't rolled our own. Being able to write your own functions, or at least to understand functions other people write, is very important. Very few of the scrapers available on ScraperWiki, for example, are written without using functions. Being able to wield regular expressions is also tremendously helpful. A lot of problems in life can be solved with regular expressions.
Third, you need to look at what other people are doing. Checking out some of the scrapers on ScraperWiki is invaluable. Look for some scrapers that use BeautifulSoup. Hack them. Break them. Then look for other scrapers that are written in plain Python, or using other libraries to parse the web. Hack them. Break them. Lather. Rinse. Repeat.
Difficulties
I won't pretend that scraping the web is always easy. I have almost never written a program that worked the way I wanted the first time I ran it. And there are some web pages that are difficult or impossible to scrape. Here's a list of some things you won't be able to do.
Facebook  You can't scrape Facebook. It's against the terms and conditions, and Facebook is pretty good at detecting automated access from scrapers. When I first started on Facebook (late 2004/early 2005), I wrote a scraper to collect information on students' political views. I got banned for two weeks, and had to promise I wouldn't do the same again. Don't try. You can write to Facebook, and ask them to give you the data they hold on you.
Twitter  You can get Twitter data, but not over the web. You'll need to use a different package, and you'll need to get a key from Twitter. It's non-trivial.