Iterating Your Scraper
Typically, when you write a scraper, you are in the following position:
1) You have many pages to iterate over
2) Each page is full of links to the data that you wish to scrape
3) You need to scrape each sub-page linked from the main page before moving on to the next page
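The three-step situation above can be sketched as a loop. This is only a minimal sketch, not code from this post: `crawl`, `get_links`, `get_next`, `scrape`, and `max_pages` are hypothetical names, and the helpers are passed in as callables so the shape of the loop is visible without any network access. It is written for Python 3.

```python
def crawl(start_url, get_links, get_next, scrape, max_pages=100):
    """Walk a paginated listing and scrape every sub-page it links to.

    get_links(url) -> list of sub-page URLs found on a listing page
    get_next(url)  -> URL of the next listing page, or None on the last page
    scrape(url)    -> whatever data you extract from one sub-page
    All three are hypothetical callables supplied by the caller.
    """
    results = []
    seen = set()
    url = start_url
    # Stop on the last page, on a link back to a page already visited,
    # or after max_pages listing pages as a safety limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        for link in get_links(url):
            results.append(scrape(link))
        url = get_next(url)
    return results


# Tiny in-memory "site" so the sketch runs without a network connection.
site = {
    "/page/1": {"links": ["/a", "/b"], "next": "/page/2"},
    "/page/2": {"links": ["/c"], "next": None},
}
data = crawl(
    "/page/1",
    get_links=lambda url: site[url]["links"],
    get_next=lambda url: site[url]["next"],
    scrape=lambda url: url.upper(),
)
print(data)
```

In a real scraper, `get_next` is the piece that finds the "next page" link, and the `seen` set guards against a site whose last page links back to an earlier one.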
In my previous post, I explained how to find the exact data within the tag that you are looking for. In this post, I’m going to explain how to keep moving from one page to the next, so that you can continue collecting data.
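I wrote a general-purpose function that returns the URL of the next page, with optional regex support for matching the link text. It targets Python 2 (urllib2 and BeautifulSoup).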
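def getNextPage(url, nextAnchorText, useRegex=False):
    import re
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Download the page and parse it.
    f = urllib2.urlopen(url)
    data = f.read()
    f.close()
    soup = BeautifulSoup(data)

    # Optionally treat the anchor text as a regular expression.
    if useRegex:
        nextAnchorText = re.compile(nextAnchorText)

    # The match is the link's text, so the href is read from its parent tag.
    next = soup.find('a', text=nextAnchorText, href=True)
    if next:
        next = next.parent['href']
    # Returns None when no matching link is found.
    return next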
Here’s an example of that function in use:
>>> getNextPage('http://blackcodeseo.com/', '←Older')
u'http://blackcodeseo.com/page/2/'
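For readers on Python 3, where urllib2 is not available, the same idea can be sketched with only the standard library. This is an assumption-laden sketch rather than a port of the function above: `find_next_href`, `_AnchorCollector`, and the `base_url` parameter are hypothetical names, it works on an HTML string you have already downloaded, and its matching rules (exact match on the stripped link text, or a regex search) may differ slightly from BeautifulSoup's.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_next_href(html, anchor_text, use_regex=False, base_url=""):
    """Return the href of the first link whose text matches, else None."""
    parser = _AnchorCollector()
    parser.feed(html)
    parser.close()
    if use_regex:
        pattern = re.compile(anchor_text)
        matches = lambda text: pattern.search(text) is not None
    else:
        matches = lambda text: text == anchor_text
    for href, text in parser.anchors:
        if matches(text):
            # Resolve relative links against base_url when one is given.
            return urljoin(base_url, href)
    return None


# Made-up markup for illustration only.
sample = '<div><a href="/page/0/">Newer</a> <a href="/page/2/">&larr;Older</a></div>'
print(find_next_href(sample, "←Older", base_url="http://example.com/"))
print(find_next_href(sample, r"Older$", use_regex=True))
```

Keeping the download separate from the parsing makes the link-finding logic easy to try out on saved HTML before pointing it at a live site.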
Good luck, and keep on hacking!
