Zipcode to County in Python

My most excellent colleague pointed out the need for county data in a particular application. I had some geo-data for it, but not the county data that we needed. After some google-fu, I found a web app that did just what I wanted. I fired up Firebug to search for the tags I needed. Upon discovering that the page used HTML TABLES, I fell ill. I then pulled myself together and wrote a function to grab the county name for a zip code in Python. Enjoy!

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
 
def lookup(zip):
    try:
        f = urlopen('http://www.getzips.com/CGI-BIN/ziplook.exe?What=1&Zip=%s&Submit=Look+It+Up' % (zip))
        data = f.read()
        f.close()
        soup = BeautifulSoup(data)
        table = soup.findAll('table')[-1]
        return table.findAll('p')[-2].string
    except Exception:  # network or parsing failure; a bare except would also swallow KeyboardInterrupt
        return None
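The lookup URL above is built with plain string formatting. As a minimal sketch (Python 3 standard library, not the code from the post), the same query string could be assembled with urlencode, which handles the escaping. The helper name build_lookup_url and the five-digit check are assumptions added for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.getzips.com/CGI-BIN/ziplook.exe"


def build_lookup_url(zipcode):
    """Return the lookup URL for a 5-digit zip code given as a string."""
    zipcode = str(zipcode)
    # Assumption: only plain 5-digit US zip codes are accepted.
    if not (zipcode.isdigit() and len(zipcode) == 5):
        raise ValueError("expected a 5-digit zip code, got %r" % zipcode)
    # urlencode escapes the values; the space in "Look It Up" becomes "+".
    query = urlencode({"What": 1, "Zip": zipcode, "Submit": "Look It Up"})
    return BASE_URL + "?" + query


print(build_lookup_url("15213"))
```

Passing the zip code as a string keeps leading zeros intact, which an integer would lose.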

Resizing an Image in Python

This is another tool to add to your swiss-army knife collection. This function resizes an image and overwrites the original file. It uses anti-aliasing and defaults to 100% of the original quality. You can specify the width, the height, or both, and the aspect ratio is maintained.

You will need the Python Imaging Library (PIL) installed.

Download PyImg

>>> from PyImg import Resize
>>> Resize('sample.jpg', 300)
True
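The PyImg source is not shown here, so the following is only a sketch of how the aspect-ratio arithmetic could work, not the actual implementation. The function name fit_size is hypothetical, and so is the rule used when both width and height are given (fit inside the box).

```python
def fit_size(orig_w, orig_h, width=None, height=None):
    """Return a (width, height) pair that keeps the original aspect ratio."""
    if width is None and height is None:
        return orig_w, orig_h
    scales = []
    if width is not None:
        scales.append(width / orig_w)
    if height is not None:
        scales.append(height / orig_h)
    # Assumption: with both limits given, use the smaller scale so the
    # result fits inside the requested box.
    scale = min(scales)
    return max(1, round(orig_w * scale)), max(1, round(orig_h * scale))


print(fit_size(1200, 800, width=300))
print(fit_size(1200, 800, height=100))
```

The resulting pair is the kind of size you would hand to PIL's Image.resize.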

Google Pagerank API for Python

I decided to write this post because the PageRank tools for Python lack functionality. Checking the PageRank of a URL is fairly simple, thanks to netscripter. However, the major feature of this PageRank tool is collecting the URLs to check. Like 90% of internet users, you probably use Google to search. Using my Google Search API for Python, you can get a list of URLs for a given query, and now, with the Google Pagerank API in Python, you can get a dictionary of URLs, keyed by their numeric PageRank.

Download Google Search API for Python

Download Google Pagerank API for Python

>>> from PageRank import getUrlsWithPageRank
>>> prUrls = getUrlsWithPageRank('blackcodeseo.com', 10)
 
>>> prUrls.keys()
[0, 1, 2]
 
>>> prUrls[2]
['http://blackcodeseo.com/']
 
>>> prUrls[1]
['http://blackcodeseo.com/black-code-seo-forum/comment-page-1/']
 
>>> prUrls[0]
['http://forums.blackcodeseo.com/', 'http://forums.blackcodeseo.com/list.php?6', 'http://chexmedia.com/technology/blackcodeseocom/', 'http://whois.domaintools.com/blackcodeseo.com', 'http://www.w3who.com/blackcodeseo.com', 'http://www.mail-archive.com/python-list@python.org/msg223994.html', 'http://pipl.com/directory/name/Code/Black', 'http://www.alexa.com/siteinfo/www.blackcodeseo.com']
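The dictionary above groups URLs under their rank. Here is a minimal sketch of that grouping step, assuming you already have some function that returns a rank for a URL. Both group_by_rank and the fake rank table are made up for illustration and are not part of the PageRank module.

```python
from collections import defaultdict


def group_by_rank(urls, get_rank):
    """Return {rank: [urls...]}, keeping the original URL order per rank."""
    groups = defaultdict(list)
    for url in urls:
        groups[get_rank(url)].append(url)
    return dict(groups)


# Made-up ranks standing in for a real PageRank lookup.
FAKE_RANKS = {
    "http://blackcodeseo.com/": 2,
    "http://forums.blackcodeseo.com/": 0,
    "http://forums.blackcodeseo.com/list.php?6": 0,
}

grouped = group_by_rank(FAKE_RANKS, FAKE_RANKS.get)
print(sorted(grouped))
print(grouped[0])
```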

Mashup Api in Python

I’ve done the rss2html automated blogging. It’s not bad, but you lack quite a bit of content. When you are putting a mashup together, you wind up with a little bit of many sources. Granted, this approach works, but if you think about it, you are in competition with all of the aggregators (which generally have a HIGH page rank). This is where my Mashup Api in Python comes into play.

Let me first point out that this ONLY works with wordpress blogs. If this api catches on, I’ll add support for other blog platforms. What it specifically does is…

1) Grabs all of the posts on the homepage of a wordpress blog
2) Continues that process for as many Blogs as you supply
3) Mashes all of the content together and gives you lots of content (with the ability to randomize it)

Download Mashup Api in Python

sentencesPerBlogPostBody (int) – Tells Mashup how many sentences to include in your mashup
randomize (boolean) – Tells Mashup to randomize the sentences
createMashup – Returns the text of all blogs/posts

Here’s an example:

>>> from mashup import Mashup, getMashup
 
>>> dir(Mashup)
['__doc__', '__init__', '__module__', 'addBlog', 'blogs', 'createMashup', 'randomize', 'removeBlog', 'sentencesPerBlogPostBody', 'setBlogs', 'setRandomize', 'setSentencesPerBlogPostBody']
 
>>> urls = ['http://blackcodeseo.com', 'http://blog.5ubliminal.com', 'http://blackhatseo-blog.com']
>>> mashup = getMashup(urls)
>>> mashup.setRandomize(True)
>>> print mashup.createMashup()
"""I’m talking about a blog that blogs for you.In my previous post, I explained how to find the exact data within the tag that you are looking for.  I’ve put together simple script to test to assert if a proxy is up or down.  Using the alternate email saves me from spam, but I still need to physically log into the email account and grab the confirmationIt’s open to the public, please become a member and share your knowledge.Wondering why I've written this post.  Since MediaWiki disallows scrapers, I used Mechanize.  Having said that, I?m working on a framework *platform independent*, that will allow you to automate and ?form filling?/?web submission? process, with user agent emulation.  This wasmy non-media, friends of the family bias.It doesn’t get much more simple than thisI believe in community, I believe in groups, I believe in support. I use textareas a lot in my Control Panels and lately I got so annoyed with the lackof TAB character insertion support that I went out to find a fix. Me too.Growing up, I always made the assumption that people were fairly intelligent and able to make good decisions.  Anyways, this process is so easy, it’s fantasticProxies are easy to find, but often not working. I laugh my s off everytime I seeit.  Whatever that means.  The idea is fairly simple, however, there’s a fine line between “stealing” and “syndicating”Freedom.  Please note the omission of the word “STEAL”.Typically, when you write a scraper you are in the following position.This is a short one pointing you to an excellent jQuery script.  But it’s been beat into my head that we have itThis is just a basic MediaWiki Scraper, just pulling out all readable strings in “p” tags.  
I’m going to explain how to keep incrementingyour pages, so that you are able to continue collecting dataWeb Scraping is when you “scrape” content from another site, and make it your own.I hope you'll enjoy the time off at least as much as I will and get back to work in 2010 with your batteries fully charged, ready to rumble. Not sure why No comment.The ?engine? parses an XML rule-set for a given siteNot cars.Pending support requests will be solved in the next 24 hours.  I have a junk email address that I use for such purposes.. Got to get re-acquainted with the 'office' and .  I realized that people made a decision and followed through with it, based on some form of decision makingWhen you sign up for a website, there’s a good change that you need to validate your email account"""
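For the curious, here is a rough sketch of the mashing idea itself: take a few sentences from each post body, then optionally shuffle them. This is not the Mashup module's code. The function names, the naive sentence splitter and the seed parameter are all assumptions for illustration.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def create_mashup(post_bodies, sentences_per_post=2, randomize=False, seed=None):
    """Join the first few sentences of every post, optionally shuffled."""
    picked = []
    for body in post_bodies:
        picked.extend(split_sentences(body)[:sentences_per_post])
    if randomize:
        # A seeded Random makes the shuffle repeatable (handy for testing).
        random.Random(seed).shuffle(picked)
    return " ".join(picked)


posts = ["One. Two. Three.", "Alpha! Beta?"]
print(create_mashup(posts, sentences_per_post=2))
```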

Iterating Your Scraper

Typically, when you write a scraper you are in the following position.

1) You have many pages to iterate
2) Each page is full of links to the data that you wish to scrape
3) You need to scrape a sub-page of the main page, before iterating to the next page

In my previous post, I explained how to find the exact data within the tag that you are looking for. I’m going to explain how to keep incrementing your pages, so that you are able to continue collecting data.

I wrote a universal function to get the next page, with regex support.

def getNextPage(url, nextAnchorText, useRegex=False):
    import re
    import urllib2
    from BeautifulSoup import BeautifulSoup
    f = urllib2.urlopen(url)
    data = f.read()
    f.close()
    soup = BeautifulSoup(data)
    if useRegex:
        nextAnchorText = re.compile(nextAnchorText)
    next = soup.find('a', text=nextAnchorText, href=True)
    if next:
        next = next.parent['href']
    return next

Here’s an example of that function.

>>> getNextPage('http://blackcodeseo.com/', '←Older')
u'http://blackcodeseo.com/page/2/'
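To actually walk every page, you can keep calling the function until it returns nothing. Below is a small sketch of that loop in Python 3. The name iterate_pages, the max_pages cap and the loop guard are my own additions, and the page links are faked with a dict so that nothing touches the network.

```python
def iterate_pages(start_url, get_next, max_pages=50):
    """Yield page URLs, following get_next(url) until it returns None."""
    url = start_url
    seen = set()
    # Stop on a missing link, a repeated page (loop), or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = get_next(url)


# Fake site: each page maps to the page its "older" link points at.
FAKE_LINKS = {"page1": "page2", "page2": "page3", "page3": None}

for page in iterate_pages("page1", FAKE_LINKS.get):
    print(page)
```

In real use, get_next could be a small wrapper that calls getNextPage with your anchor text, and the scraping of sub-pages would happen inside the for loop.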

Good luck, and keep on hacking!

Writing A Scraper

Web Scraping is when you “scrape” content from another site, and make it your own. Please note the omission of the word “STEAL”. Anyways, this process is so easy, it’s fantastic.

First, find what you want. Let’s take my blog for example.

Second, find the tags you want. Generally speaking, if you can find a div tag with a class that’s unique on the page, you are set. You can pass a dictionary to findAll to grab the exact chunk that you are looking for. The dict uses the attributes of the HTML tag; see the example for more details.

>>> from urllib2 import urlopen
>>> from BeautifulSoup import BeautifulSoup
 
>>> soup = BeautifulSoup(urlopen('http://blackcodeseo.com').read())
 
>>> # Grab all the links on the page
>>> links = soup.findAll('a')
>>> links[0].string
u'Black Code SEO'
 
>>> postCaption = soup.findAll('h1', {'class':'post-caption'})
>>> post = postCaption[0]
>>> post
<h1 class="post-caption"><a href="http://blackcodeseo.com/testing-a-proxy-in-python/" rel="bookmark" title="Permanent Link to Testing A Proxy In Python">Testing A Proxy In Python</a></h1>
 
>>> post.a.text
u'Testing A Proxy In Python'
 
>>> post.a['href']
u'http://blackcodeseo.com/testing-a-proxy-in-python/'
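If you don't have BeautifulSoup handy, the same "find the tag by its class" idea can be sketched with the html.parser module from the Python 3 standard library. This is a swapped-in technique, not what the session above uses. The class name PostCaptionParser is made up, and so is the "About" link in the sample HTML, which is only there to show that it gets skipped.

```python
from html.parser import HTMLParser


class PostCaptionParser(HTMLParser):
    """Collect (title, href) for links inside <h1 class="post-caption">."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_caption = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and "post-caption" in (attrs.get("class") or "").split():
            self._in_caption = True
        elif tag == "a" and self._in_caption:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a caption link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.posts.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h1":
            self._in_caption = False


SAMPLE = (
    '<div><a href="http://blackcodeseo.com/about/">About</a></div>'
    '<h1 class="post-caption"><a href="http://blackcodeseo.com/testing-a-proxy-in-python/" '
    'rel="bookmark">Testing A Proxy In Python</a></h1>'
)

parser = PostCaptionParser()
parser.feed(SAMPLE)
print(parser.posts)
```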

Testing A Proxy In Python

Proxies are easy to find, but often not working. I’ve put together a simple script to test whether a proxy is up or down.

import os
from urllib2 import urlopen
 
# example proxy
os.environ['HTTP_PROXY']='http://169.229.50.9:3128'
try:
    f = urlopen('http://www.google.com')
    data = f.read()
    f.close()
    print 'Pass'
except Exception:  # any failure to connect or read counts as a dead proxy
    print 'Fail'

It doesn’t get much simpler than this. Granted, the read is probably overkill: if the URL cannot be opened, the script fails before it ever gets there.
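Here is the same idea as a reusable function, sketched for Python 3. Instead of setting an environment variable, it builds an opener with an explicit ProxyHandler, and it adds a timeout so a dead proxy doesn't hang forever. The function name proxy_works and its default values are assumptions, not part of the original script.

```python
import http.client
import urllib.request


def proxy_works(proxy_url, test_url="http://www.google.com", timeout=5):
    """Return True if test_url can be fetched through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            response.read(1)  # one byte is enough to prove data flows
        return True
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Call it as proxy_works('http://169.229.50.9:3128') with the example proxy from the script above. Whether that prints Pass or Fail depends on the proxy still being alive.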

Auto-Fill Web Forms

I believe in community, I believe in groups, I believe in support. Having said that, I’m working on a *platform independent* framework that will allow you to automate any “form filling”/“web submission” process, with user agent emulation.

The “engine” parses an XML rule-set for a given site.  Ex. myspace.com.xml

You supply a configuration XML.

The script runs, and follows the rules in the XML rule-set.

This project will be limited by captchas. I’m very excited to have a framework that will do automatic submissions; however, time doesn’t allow for a captcha decoder. More news to come later.

Update!

I started working on the configuration framework. Here’s a sample of the framework.

<config domain="value" proxy="value" stdout="value">

<define>

<user_defined_x>
<user_data_variable_1>value</user_data_variable_1>
<user_data_variable_2>value</user_data_variable_2>
</user_defined_x>

</define>

<sequence>

<form define="value">
<input type="value" name="value" id="value" class="value">value</input>
<input type="value" name="value" id="value" class="value">value</input>
<click type="value" name="value" id="value" class="value"/>
</form>

<navigate>
<click type="value" id="value" class="value"/>
</navigate>

</sequence>

</config>

In the define tag, a user will be able to “describe” a form. For example, if I’m using a login form, I could do something like this.

<define>

<login>
<username>MyUserName</username>
<password>MyPassword</password>
</login>

</define>

It is important to note that the tags “username” and “password” are spelled exactly the same way as they are on the form. So, when I call the sequence tag, it would look like this.

<sequence>

<form define="login">
<input type="text" name="username"/>
<input type="password" name="password"/>
<click type="submit" name="submit"/>
</form>

</sequence>

In the form tag, I reference the login tag from define (the example just before this one). Calling the input tags without a value tells the engine to use the data provided from the define section.
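The engine isn't written yet, so treat this as a guess at how that lookup could work rather than the real thing. It is sketched in Python 3 with xml.etree.ElementTree. The function name resolve_form_values and the example.com domain are assumptions. An input with its own text keeps it, and an empty input falls back to the matching tag in the define section.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<config domain="example.com">
  <define>
    <login>
      <username>MyUserName</username>
      <password>MyPassword</password>
    </login>
  </define>
  <sequence>
    <form define="login">
      <input type="text" name="username"/>
      <input type="password" name="password"/>
      <click type="submit" name="submit"/>
    </form>
  </sequence>
</config>
"""


def resolve_form_values(config_xml):
    """Return [(define_name, {field_name: value})] for each form."""
    root = ET.fromstring(config_xml)
    defined = root.find("define")
    filled = []
    for form in root.iter("form"):
        ref = form.get("define")
        block = defined.find(ref) if (defined is not None and ref) else None
        values = {}
        for field in form.findall("input"):
            name = field.get("name")
            text = (field.text or "").strip()
            if text:
                values[name] = text  # explicit value wins
            elif block is not None:
                values[name] = block.findtext(name)  # None if not defined
            else:
                values[name] = None
        filled.append((ref, values))
    return filled


print(resolve_form_values(CONFIG))
```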

The navigate tag is used to automatically click links. More to come later.

Make Money With An Auto-Blog

Not cars.

I’m talking about a blog that blogs for you.  The idea is fairly simple, however, there’s a fine line between “stealing” and “syndicating”.  My understanding of this principle is…You can “reblog” anything in an RSS feed, as long as you provide a back link.  So, let’s continue.

Pick a niche, I’m going to pick the “Pittsburgh Pirates”.

Now that I’ve identified a niche, I register a domain with “Pittsburgh Pirates” in the url.  Always include your main keyword in your domain.

Content time! Head on over to http://blogsearch.google.com/ and grab some feeds. Do some searches: “Pittsburgh Pirates”, “Pirates”, “Pirates Baseball”, “Pittsburgh Baseball”…you get the idea. Make sure that the RSS feeds have enough information in them to keep people coming back.

Next step, install WP for your blog. Grab a plugin that lets you syndicate posts. “FeedWordPress” or “WP-Autoblog” will do the trick.

Insert rss feeds into the plugin that you’ve downloaded, and let it do the magic.

This is the most important part of the whole process; skip this step and you can expect to make $0. SEO-optimize your site. Change the footer, and add some anchor tags for your topic and your tags. Before all of your posts, write one static, SEO-optimized post that includes every search term that you are trying to own. Don’t get too aggressive on this, just pick 5 or so. Anchor link. Make images that link to your categories. Include a cool image on your main page. All of a sudden, people spend more time on your site, and you’ve got people clicking links.

Last thing, socially submit your site, and submit your feeds to RSS aggregators. Do ~1 per day. Initially, be very slow to add your URL anyplace; Google will penalize you for adding your domain too fast.

I hope you make millions!

Rethinking Society

Freedom.  Whatever that means.  But it’s been beat into my head that we have it.  Freedom hints at the notion of “anything”.  Wow…Anything!?  I can be/do/go anything I want, as long as I’m not hurting anybody, and it’s cool.  Sounds awesome?

Every time I get to know somebody, I find that they are a lover of substance.  Whether they are a drunk, a pill popper or a hard core street junkie, they NEED it.  They really don’t know why.  Nobody knows why, but we need a crutch.  Maybe we need to rethink our social structure.  Maybe “anything” isn’t so awesome.

Capitalism!  Democracy!  Freedom!  Let’s revisit “anything”.  I’m brought up to believe that I’ll be rich/famous/beautiful/awesome/INSERT WORD HERE/ and I’m eating it up.  But wait!  I get my first job, and realize that I’m but a pawn in a huge SCHEME.  Yes, a SCHEME.  It appears that my hard work goes to furnish a SWEET LIFE for a very small percentage of the population.  Bizarre?  Yeah, that’s not fair.  I was the one that developed that flagship product, and you gave me a $0.23 raise, while making yourself millions.  But you didn’t forget about your FLUNKIES.  You found it in your heart to give them $250,000.  The “taker of the progress report” is a HIGH paid position.  I like how you have a special lunch in my team’s honor.  You assure us that WE MADE THE COMPANY WHAT IT IS!  I wonder how our names were omitted from the publications.  Oh wait, there it is.  Is that a footnote?  Well, at least it’s there.

Let me give you 100%.  Take your 20% cut of the profits.  I’ll take my <1% cut of the profits.  I question you about the fairness, and you retaliate: “YOU ARE LUCKY TO HAVE A JOB!”  Wow, I feel greedy.  Why do I ask for so much?  I’m lucky, I’m fortunate, that’s what you tell me.  In the back of my mind, I KNOW that I can go anywhere else and they’ll pay me what I’m worth.  I must have forgotten about capitalism.  Did I forget that I was but a pawn in a far “greater” scheme?  I’m not worth a percentage, but rather the “competitive salary”.  Translation: “We don’t pay much, but if we word it like this, you’ll think we do”.

So, let’s go to our jobs, work our best, and make a grand future for the “haves” of society.  For without us, they’d be one of us.
