Let's Scrape the Web with Python 3

Mar 10, 2013

Python Tutorial

By: Brandon Quakkelaar

In the back of my mind I’ve always been intrigued by writing an application that can retrieve web pages over HTTP. It’s a fairly simple thing to do. We have a myriad of web browsers that do it for us. But there is just something about writing an application that operates independently of a browser and reaches out to touch the internet that I find fun and intriguing. So let’s do it… in Python.

First let’s define some specifications for this project. Basically we’re going to “scrape” Craigslist.org listings and display them in our terminal (command line). It should be able to scrape any (or nearly any) of Craigslist’s regions and categories.

Separating the Web Scraper into Pieces

As I see it, there are three pieces to this application:

  1. The user interface, which handles input and displays output
  2. The HTTP client, which actually accesses the HTML page and gets the information therein
  3. The HTML parser, which reads the HTML and collects the parts we want to keep

Now that the application’s functionality is defined and broken down into pieces, we can start thinking about the project’s name and structure. Let’s just call it MyScrape and structure it as a MyScrape folder containing three files: MyScrape.py, MyHttp.py, and MyParser.py.

The HTTP Client and Python 3

Python 3 has a handy little module that we can use to make our lives easy. We’re going to import http.client into our MyHttp.py file and use it in our class.

import http.client

Now that we imported http.client, we can create our class to handle a page. Let’s keep things simple and just call this class Page. Page just needs to connect to a server, request a page using a path, and provide the result to the application. Here is the complete MyHttp.py file:

'''GET a webpage using http.'''

import http.client

class Page:

    def __init__(self, servername, path):
        '''This initialize function sets the servername and path'''
        self.set_target(servername, path)

    def set_target(self, servername, path):
        '''This is a utility function that will reset the servername and the path'''
        self.servername = servername
        self.path = path

    def __get_page(self):
        '''This is a private function that actually goes out 
        and gets the response from the server'''

        server = http.client.HTTPConnection(self.servername)
        server.putrequest('GET', self.path)
        server.putheader('Accept', 'text/html')
        server.endheaders()

        return server.getresponse()        

    def get_as_string(self):
        '''This function provides the webpage as a string'''
        page = ''
        reply = self.__get_page() # gets the page

        if reply.status != 200:
            page = 'Error sending request {0} {1}'.format(reply.status, reply.reason)
        else:
            data = reply.readlines()
            reply.close()
            for line in data:
                page += line.decode('utf-8')
        return page

Now that we have our class, we need to make sure it works by testing it. We can do that by using the Python interpreter to execute our code for us. First, start the Python interpreter by sending the command:

$ python3

to the shell prompt (or the command prompt if you’re on Windows). This should give you a prompt that looks like this:

>>>

To exit the interpreter, just enter exit() and press enter. Note: More information about the Python interpreter can be found on the Python.org website.

To test our code in the Python interpreter, first navigate to the MyScrape folder that has the MyHttp.py file in it.

$ cd path/to/your/MyScrape

Then start the interpreter and enter the following code:

>>> import MyHttp
>>> page = MyHttp.Page('quakkels.com', '')
>>> print(page.get_as_string())

You should now see the HTML source code for quakkels.com in your terminal. It works!
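The check above depends on a remote site being reachable. As a rough sketch, the same kind of GET can be tried against a throwaway server on your own machine. The names HelloHandler and fetch are made up for this example and are not part of MyHttp.py, and the sketch uses http.client’s request() convenience method rather than the putrequest/putheader/endheaders sequence used in the Page class.

```python
import http.client
import http.server
import threading

class HelloHandler(http.server.BaseHTTPRequestHandler):
    '''Hypothetical handler that serves one small HTML page for any GET'''

    def do_GET(self):
        body = b'<html><body><p>hello</p></body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the terminal quiet during the check

def fetch(servername, port, path):
    '''GET a page over plain HTTP and return (status, text)'''
    connection = http.client.HTTPConnection(servername, port, timeout=5)
    try:
        connection.request('GET', path, headers={'Accept': 'text/html'})
        reply = connection.getresponse()
        return reply.status, reply.read().decode('utf-8')
    finally:
        connection.close()

# Port 0 asks the operating system for any free port.
server = http.server.HTTPServer(('127.0.0.1', 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

status, text = fetch('127.0.0.1', port, '/')
server.shutdown()
server.server_close()
print(status, text)
```

Closing the connection in a finally block and passing a timeout are choices made for this sketch; the Page class above does neither.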

HTML Parsing and Python 3

The next part of this project we need to write is the HTML parser that allows us to identify the pieces of the Craigslist page that we want to keep. There are several techniques for doing this, including regular expression matching (don’t use this technique), DOM parsing, and SAX parsing. The DOM (Document Object Model) technique basically involves navigating an XML or HTML document through a tree of nodes. The SAX (Simple API for XML) technique does not involve navigating like the DOM technique does. Rather, it reads the file through once, sending information to the application as the file is read. This means it’s pretty quick, but because there is no navigation the application will need to keep track of the state of the document as the SAX style parser reads it. Our MyParser.py file is going to execute a SAX style parser using the html.parser module.

Python 3 has a handy module called html.parser that we’ll use in our application. Our parser class is going to be designed to just read Craigslist.org listings. I’m going to name the parser class ClParser. ClParser will need to inherit from HTMLParser (which is in the html.parser module) so that we can override the methods that get executed as the file is read in a SAX manner.
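To get a feel for the SAX style before reading the full parser, here is a minimal sketch that only records which callbacks fire and in what order. EventLogger and the one-line HTML snippet are made up for illustration; only HTMLParser and its handle_starttag, handle_endtag, and handle_data methods come from the standard library.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    '''Hypothetical parser that records each callback as it fires'''

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(('data', data))

logger = EventLogger()
logger.feed('<p class="row"><a href="/x.html">A laptop</a></p>')
for event in logger.events:
    print(event)
```

The events arrive in document order: the p start tag, the a start tag, the text, then the two end tags. Nothing tells the parser that the text belongs to a link inside a row, which is why ClParser has to carry its own state flags.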

Here’s the complete MyParser.py file:

'''
Parse html from craigslist
'''

from html.parser import HTMLParser

class ClParser(HTMLParser):

    # parser state
    # These variables store the current state of the parser as it reads the file
    date = ''           # The date for the current listing

    title = ''          # The title of the current listing

    link = ''           # The link to the current listing's details

    collectFor = None   # will use this to keep track of what kind of data we 
                        # are currently collecting for. valid options are:
                        # 'date', 'title', and 'link'

    insideRow = False   # This flag keeps track of whether we are inside a "row"
                        # "rows" have listing information

    # parser output
    results = ''        # the parser's output will be stored here

    def handle_starttag(self, tag, attrs):
        '''This function gets called when the parser encounters a start tag'''
        if tag == 'a' and self.insideRow:
            self.collectFor = 'title'

        for key, value in attrs:

            if(self.collectFor == 'title' 
                and key == 'href'
                and not self.link): # and not self.link makes sure it doesn't overwrite a preexisting value
                self.link = value

            if key == 'class':
                if value == 'row':
                    self.insideRow = True
                if value == 'ban':
                    self.collectFor = 'date'

    def handle_endtag(self,tag):
        '''This function is called when the parser encounters an end tag'''
        if tag == 'p':
            self.insideRow = False

            # is there data to output?
            if self.title + self.link:
                self.results += "\nDate: \t{0}\nTitle:\t{1}\nLink:\t{2}\n".format(
                    self.date, 
                    self.title, 
                    self.link)
            self.__reset_row()

    def handle_data(self, data):
        '''This function is called when the parser encounters data inside of tags'''
        if self.collectFor == 'date':
            self.date = data
        if self.collectFor == 'title' and not self.title:
            self.title = data

        self.collectFor = None # when we're done collecting the data, reset this flag

    def __reset_row(self):
        '''This is a utility function to reset the parser's state after a row'''
        self.title = ''
        self.link = ''
        self.collectFor = None
        self.insideRow = False

The HTMLParser class that we are inheriting from has a feed(string argument) function, which our ClParser class inherits. To execute our parser, we just need to make an instance of the class and call the feed(string argument) function with the HTML as a string.
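One detail about feed() that is easy to miss: it can be called more than once, and the parser buffers incomplete markup until the rest arrives. The sketch below uses a made-up TitleCollector class and made-up HTML to show that; it is not part of MyParser.py.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    '''Hypothetical parser that gathers the text inside <a> tags'''

    def __init__(self):
        super().__init__()
        self.inside_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.inside_link = True
            self.titles.append('')  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == 'a':
            self.inside_link = False

    def handle_data(self, data):
        # text may arrive in several pieces, so append rather than assign
        if self.inside_link:
            self.titles[-1] += data

collector = TitleCollector()
# The markup may arrive in pieces; a tag can even be split across two calls.
collector.feed('<p><a href="/1.html">First li')
collector.feed('sting</a></p><p><a hre')
collector.feed('f="/2.html">Second listing</a></p>')
collector.close()  # process anything still buffered
print(collector.titles)
```

Because handle_data can fire more than once for a single run of text, this sketch appends each piece to the current title instead of overwriting it.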

We can test ClParser in the Python interpreter in the same way that we tested MyHttp. In the interpreter enter the following code:

>>> import MyHttp, MyParser
>>> page = MyHttp.Page('milwaukee.craigslist.org', '/sya/')
>>> parser = MyParser.ClParser()
>>> parser.feed(page.get_as_string())
>>> print(parser.results)

This should print a list of nicely formatted Craigslist listings for computers in the Milwaukee area. We’re almost done!

The Last Piece!

Alright, we have two of our three pieces built. The last thing to do is handle user input and display results. We’re going to implement these features in the MyScrape.py file. Here’s the whole MyScrape.py file:

import sys, MyParser, MyHttp

# try to assign the subdomain and path values
# if the assignment fails, just use default values
try:
    subdomain, path = sys.argv[1:]
except ValueError: # raised when there are not exactly two arguments
    subdomain, path = 'milwaukee', '/sya/'

# instantiate the parser
parser = MyParser.ClParser()

# instantiate the page
page = MyHttp.Page(subdomain + '.craigslist.org', path)

# get the page and feed it to the parser
parser.feed(page.get_as_string())

# display the results
print('################\n    Results:\n################\n', parser.results)

There you have it. MyScrape.py is the entry point to our application. It allows the user to set a subdomain and a path when calling the script, it brings the MyHttp and MyParser modules together, and it displays results to the screen. To use this application, enter the following command in your shell or command prompt:

$ python3 MyScrape.py

…or…

$ python3 MyScrape.py sierravista /ata/
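The argument handling in MyScrape.py relies on tuple unpacking failing when the wrong number of arguments is given. As a small sketch, that logic can be pulled into a function so it is easy to try on its own. The name parse_target and the two DEFAULT_ constants are made up for this example; the default values are the ones used in MyScrape.py.

```python
DEFAULT_SUBDOMAIN = 'milwaukee'
DEFAULT_PATH = '/sya/'

def parse_target(arguments):
    '''Return (subdomain, path) from a list of arguments, or the defaults'''
    try:
        subdomain, path = arguments
    except ValueError:
        # unpacking fails unless there are exactly two items
        return DEFAULT_SUBDOMAIN, DEFAULT_PATH
    return subdomain, path

print(parse_target([]))
print(parse_target(['sierravista', '/ata/']))
print(parse_target(['sierravista']))
```

In the script itself this would be called as parse_target(sys.argv[1:]). Note that a single argument, or more than two, falls back to the defaults without any message, just as in MyScrape.py.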

You can download the entire source here.

Improving the Scrape

Feel free to take this code and experiment with it. Expand on it. Make it spider sub pages. Make it return a list of dictionaries instead of a string. Save the data in a sqlite database, or to a text file. Maybe make it into a web service. Do whatever you want with it. (Keep it legal.)
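As a starting point for the “list of dictionaries” idea, here is one possible sketch. ListingParser is a made-up variant, not the ClParser from this tutorial, and the sample HTML is invented. It assumes the same markup ClParser looks for: a date inside an element with class "ban" and each listing inside a p tag with class "row".

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    '''Hypothetical variant of ClParser that collects dictionaries'''

    def __init__(self):
        super().__init__()
        self.listings = []       # one dict per listing
        self.date = ''           # most recent date banner seen
        self.current = None      # dict being filled, or None outside a row
        self.collect_for = None  # 'date', 'title', or None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get('class')
        if css_class == 'ban':
            self.collect_for = 'date'
        elif tag == 'p' and css_class == 'row':
            self.current = {'date': self.date, 'title': '', 'link': ''}
        elif tag == 'a' and self.current is not None and not self.current['link']:
            # only the first link in a row is kept
            self.current['link'] = attributes.get('href') or ''
            self.collect_for = 'title'

    def handle_data(self, data):
        if self.collect_for == 'date':
            self.date = data.strip()
        elif self.collect_for == 'title' and self.current is not None:
            self.current['title'] = data.strip()
        self.collect_for = None

    def handle_endtag(self, tag):
        if tag == 'p' and self.current is not None:
            if self.current['title'] or self.current['link']:
                self.listings.append(self.current)
            self.current = None
            self.collect_for = None

# Made-up sample markup in the shape this sketch assumes.
sample = (
    '<h4 class="ban">Sun Mar 10</h4>'
    '<p class="row"><a href="/sys/1.html">Used laptop</a></p>'
    '<p class="row"><a href="/sys/2.html">Desktop tower</a></p>'
)
parser = ListingParser()
parser.feed(sample)
for listing in parser.listings:
    print(listing)
```

Keeping the state on the instance (set in __init__) rather than on the class is a choice made for this sketch. With a list of dictionaries in hand, writing the rows to a text file or a sqlite table becomes a loop over parser.listings.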