Python For Web Scraping - Week 3: 1 Installing A Module
1 Installing a Module
1.1 Installing BeautifulSoup, or other modules
http://www.crummy.com/software/BeautifulSoup/#Download
Download the version that is right for you. It is very important that you not use version 3.1.x with older versions of Python - it will not work at all, as I learned the hard way! You'll download a file that ends with .tar.gz - this is the zipped-up source file for the module. In the terminal, cd into the directory you downloaded the tar file to. Then type:
$ tar -xzf [filename]
Replace [filename] with the downloaded tarball's name. tar is the Unix command to unpack the file. As we talked about previously, the letters after the dash are Unix options. The x tells tar to extract from the tarball, the z tells it that the archive is compressed with gzip (the .gz ending), and the f tells it that the next argument is the name of the archive file to read. Now you should have a new directory within the download directory, usually with a name that looks like the name of the tar file, but without the .tar.gz at the end. cd into this directory and type ls to have a look around. You should see that there is a file called setup.py - Python modules distributed as source typically come with this file, which tells the computer how to install the module properly. Now you execute:
$ python setup.py install
Andrew Hall, Department of Government, Harvard University
To test if it installed properly, launch Python in the terminal and type import BeautifulSoup. If it imports without error, you've succeeded! This method of installing should work for almost any Python module, not just for BeautifulSoup.
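The import test above can also be scripted. The following is a minimal sketch, not part of the original notes: the helper name module_is_installed is hypothetical, and the code relies only on the standard importlib module, so it should behave the same under Python 2.7 and Python 3.

```python
import importlib


def module_is_installed(name):
    """Return True if the module called `name` can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Raised when Python cannot find (or cannot load) the module.
        return False
    return True


# 'json' ships with Python, so this check should succeed.
print(module_is_installed("json"))
# 'BeautifulSoup' is the module name used in these notes; the result
# depends on whether the install step above worked on your machine.
print(module_is_installed("BeautifulSoup"))
```

If the second line prints False, the install did not put the module somewhere Python looks, which is the same signal an ImportError gives you in the interactive session.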
2 Basic Scraping
2.1 Opening a webpage
It's easy to learn the basics of urllib2 in the interactive Python environment.
>>> import urllib2
>>> url = 'http://news.yahoo.com'
>>> page = urllib2.urlopen(url)
>>> type(page)
<type 'instance'>
>>> page = urllib2.urlopen(url).read()
>>> type(page)
<type 'str'>
>>> print page
## Ton of raw html dumped here
The basic idea is that we pass a URL to the urlopen() function, and then get the raw content of the webpage by calling the read() method on the result, which returns a string.
>>> page = urllib2.urlopen(url).read()
>>> len(page)
202930
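For readers on Python 3, where urllib2 was folded into urllib.request, the same open-then-read pattern looks roughly like the sketch below. It is an illustration under stated assumptions rather than part of the original session: to avoid depending on the network it opens a small temporary file through a file:// URL (the file name example.html and its contents are made up), but an http:// URL is passed to urlopen() in the same way.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Write a tiny stand-in "webpage" to disk so the example needs no network.
tmp_dir = tempfile.mkdtemp()
html_path = Path(tmp_dir) / "example.html"
html_path.write_text(
    "<html><body><p>Hello, scraper!</p></body></html>", encoding="utf-8"
)

# as_uri() turns the absolute path into a file:// URL.
url = html_path.as_uri()

response = urlopen(url)      # a file-like response object
raw = response.read()        # bytes in Python 3 (a str in Python 2)
response.close()
page = raw.decode("utf-8")   # decode the bytes to text before parsing

print(type(raw).__name__, len(page))
```

The main practical difference from the Python 2 session is the extra decode step: read() hands back bytes, and the right encoding depends on the page being fetched (UTF-8 is assumed here).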
2.2
## mySoup.py
## A simple example of how to call BeautifulSoup
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://sports.yahoo.com/mlb'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
print soup.prettify()
BeautifulSoup takes in the raw HTML and organizes the html tags. Now we can cruise through the html to find whatever we're looking for without having to implement a bunch of special functions to deal with the weird stringiness of html ourselves. Among the incredibly useful BeautifulSoup functions are:
1. findAll(tag): this is a major work-horse. Suppose I want to find all of the links on a given webpage. findAll('a') gives me a list of all the <a> tags on the page (a is the link tag in html).
2. tag.string: gives you the text content associated with a tag. So, for example, if in the html I have <p>This is a paragraph</p>, and I have my_tag = soup.find('p'), then my_tag.string will return This is a paragraph.
3. tag.parent: gives you the html tag immediately surrounding the current tag. So if you have <p>Blah<a href="somelink">blah</a></p> and you have a variable set equal to the a tag, tag.parent will give you the p tag.
Check out the BeautifulSoup documentation for a ton more functions and better explanations.
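To make the three operations above concrete without requiring BeautifulSoup to be installed, here is a rough analogue built on Python 3's standard xml.etree.ElementTree. This is a swapped-in technique, not BeautifulSoup's API: ElementTree is an XML parser, so it only accepts well-formed markup like the small made-up snippet below and would reject most real-world HTML. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet; ElementTree would reject the sloppy
# HTML that BeautifulSoup is designed to tolerate.
snippet = (
    "<html><body>"
    "<p>This is a paragraph</p>"
    "<p>Blah<a href='somelink'>blah</a></p>"
    "</body></html>"
)
root = ET.fromstring(snippet)

# Rough analogue of soup.findAll('a'): every <a> element in the tree.
links = list(root.iter("a"))

# Rough analogue of tag.string: the text directly inside the first <p>.
first_paragraph = root.find(".//p")
paragraph_text = first_paragraph.text

# Rough analogue of tag.parent: ElementTree keeps no parent pointers,
# so build a child -> parent lookup table first.
parent_of = {child: parent for parent in root.iter() for child in parent}
link_parent = parent_of[links[0]]

print(len(links), paragraph_text, link_parent.tag)
```

The need to build parent_of by hand is a good illustration of the convenience BeautifulSoup provides: there, every tag already knows its parent.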
2.3 A Full Example
Suppose we wanted to create a text file containing all of the abstracts of Gary King's publications, as listed on his website:
http://gking.harvard.edu/vitae/node7.html
First, we read in the webpage with the list of publications and links to the abstracts:
## Script to get abstracts
import urllib2
from BeautifulSoup import BeautifulSoup
import nltk

url = 'http://gking.harvard.edu/vitae/node7.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
We open the url using urllib2, and then turn the html from that webpage into a soup using BeautifulSoup. Now we can parse the html:
## download_links will be a list containing the links to the webpage for each abstract
download_links = []
links = soup.findAll('a')
for link in links:
    if link.string == '[Abstract]':
        download_url = link['href']
        download_links.append(download_url)
This part of the script uses BeautifulSoup to find all of the hyperlinks (i.e. the a tags in the html). Then, noticing that on his webpage all of the abstracts are linked using the phrase [Abstract], I capture just the links that go to abstracts. I then get the url each of these links points to using the expression link['href']. The idea here is that each tag I've gotten out of findAll can be treated like a dictionary for the purpose of accessing its attributes.
Attributes in html (not to be confused with attributes in Python!) are like optional modifiers to tags. For example, <p> is a paragraph tag with no attributes, whereas <p align="center"> has an attribute called align. <a> tags have an attribute called href that is the actual url of the link. So we get the url of each abstract by accessing the href attribute of the abstract <a> tags. Then we put these urls into a list, so that at the end we have a list containing the urls of all of the abstracts we want to download.
Now we need to decide what we want as our end product. For example, we could write all of the abstracts to a text file, or to a webpage, or do anything else we want with them. For now, to keep things simple, I'm just going to print them to the console.
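The dictionary-style attribute access described above can be sketched with Python 3's standard html.parser, which hands each tag's attributes to you as (name, value) pairs. This is an illustration, not the script from these notes: the class name AbstractLinkCollector, the sample markup, and the example.org addresses are all hypothetical. The call to urljoin is an added precaution for the case where href values are relative; whether that applies to the real page is not something these notes state.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbstractLinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose text is exactly '[Abstract]'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.download_links = []
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; dict() gives the
            # same dictionary-style lookup as link['href'].
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "[Abstract]":
                # urljoin leaves absolute urls alone and resolves relative ones.
                self.download_links.append(urljoin(self.base_url, self._href))
            self._href = None


# Made-up markup imitating a publication list.
sample = (
    '<p><a href="abs/one.html">[Abstract]</a> '
    '<a href="papers/one.pdf">[Paper]</a> '
    '<a href="http://example.org/abs/two.html">[Abstract]</a></p>'
)
collector = AbstractLinkCollector("http://example.org/vitae/node7.html")
collector.feed(sample)
print(collector.download_links)
```

Only the two [Abstract] links end up in download_links; the [Paper] link is skipped because its text does not match.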
for target in download_links:
    content = urllib2.urlopen(target).read()
    raw = nltk.clean_html(content)
    ## print out the abstract after removing the html
    print raw
I loop through the list of urls, opening each one in turn using urllib2 again. Once I have the page open and have called the .read() method, I have a string of unprocessed html. I could BeautifulSoup this and do whatever with it, but to keep things simple, for now I'm just going to use nltk's clean_html function, which will remove all of the html tags and just dump the text content. I then print this to the console, and we're done! Note that the output will be far from perfect; it'll have weird formatting, some weird characters, etc. Such is the nature of scraping with a quick and dirty hack like this script.
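Newer NLTK releases no longer implement clean_html, so the tag-stripping step may need a replacement. The following is a minimal sketch of the same idea using only Python 3's standard html.parser. The names TextExtractor and strip_html are hypothetical, the sample markup is made up, and the output is not claimed to match what clean_html produced.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text content and drop tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a <script> or <style> element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(extractor.parts).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>An  abstract &amp; more.</p></body></html>"
)
print(strip_html(sample))
```

Joining the pieces with a space keeps text from adjacent block tags apart, at the cost of inserting a space where an inline tag splits a word. Like the script above, this is a quick and dirty approach rather than a careful text extractor.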