Writing A Scraper

Web scraping is when you “scrape” content from another site and make it your own. Please note the omission of the word “STEAL”. Anyway, the process is fantastically easy.

First, find what you want. Let’s take my blog for example.

Second, find the tags you want. Generally speaking, if you can find a tag (a div, say) with a class that appears nowhere else on the page, you are set. You can pass a dictionary as the second argument to findAll to grab the exact chunk you are looking for. The dictionary's keys and values are the attributes of the HTML tag; see the example (a Python 2 session using BeautifulSoup 3) for more details.

>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
 
>>> soup = BeautifulSoup(urlopen('http://blackcodeseo.com').read())
 
>>> # Grab all the links on the page
>>> links = soup.findAll('a')
>>> links[0].string
u'Black Code SEO'
 
>>> postCaption = soup.findAll('h1', {'class':'post-caption'})
>>> post = postCaption[0]
>>> post
<h1 class="post-caption"><a href="http://blackcodeseo.com/testing-a-proxy-in-python/" rel="bookmark" title="Permanent Link to Testing A Proxy In Python">Testing A Proxy In Python</a></h1>
 
>>> post.a.text
u'Testing A Proxy In Python'
 
>>> post.a['href']
u'http://blackcodeseo.com/testing-a-proxy-in-python/'
