Spell Checking in Python

I was looking into spell checking in Python. I found spell4py and downloaded the zip, but couldn't get it to build on my system. Maybe it would have built if I had tried a bit longer, but my eventual solution worked out fine, and that library was overkill for my needs anyway.

I found this article here: http://code.activestate.com/recipes/117221/

This seemed to work well for my purposes, but I wanted to test out other spell-checking libraries. Mozilla Firefox, Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I'm testing the spelling of words on the Internet). Here are some Python snippets to get you up and running with the popular spelling checkers. I modified them to take more than one word, split the input up, and return a list of suggestions. They do require each spelling checker to be installed; I did this through the openSuSE package manager.

Ispell

import popen2

class ispell:
    def __init__(self):
        self._f = popen2.Popen3("ispell")
        self._f.fromchild.readline()  # skip the version/credit line
    def __call__(self, words):
        words = words.split()  # split on any whitespace, avoiding empty strings
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline()  # skip the blank line after each response
            if s[:8] == "word: ok":
                output.append(None)  # spelled correctly
            else:
                # pull the comma-separated suggestion list out of the response
                output.append((s[17:-1]).strip().split(', '))
        return output

Aspell

import popen2

class aspell:
    def __init__(self):
        self._f = popen2.Popen3("aspell -a")  # -a is ispell-compatible pipe mode
        self._f.fromchild.readline()  # skip the version/credit line
    def __call__(self, words):
        words = words.split()
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline()  # skip the blank line after each response
            if s == "*":
                output.append(None)  # spelled correctly
            elif s[0] == '#':
                output.append("No Suggestions")
            else:
                # "& word n offset: sugg1, sugg2, ..." -> keep the part after the colon
                output.append(s.split(':')[1].strip().split(', '))
        return output

Hunspell

import popen2

class hunspell:
    def __init__(self):
        self._f = popen2.Popen3("hunspell")
        self._f.fromchild.readline()  # skip the version/credit line
    def __call__(self, words):
        words = words.split()
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip().lower()
            self._f.fromchild.readline()  # skip the blank line after each response
            if s == "*":
                output.append(None)  # spelled correctly
            elif s[0] == '#':
                output.append("No Suggestions")
            elif s[0] == '+':
                # recognized through an affix root; treat as correctly spelled
                # so the output list stays aligned with the input words
                output.append(None)
            else:
                output.append(s.split(':')[1].strip().split(', '))
        return output
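All three classes are used the same way. A quick sanity check with aspell (a sketch; the exact suggestions vary by dictionary, and this is Python 2, since popen2 never made it to Python 3):

checker = aspell()
print checker("helo world")  # [[suggestions for 'helo'...], None]
# None means the word was found in the dictionary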

Now, after doing this and seeing the suggestions, I decided a spell checker isn't really what I was looking for. A spelling checker always tries to make a suggestion, and I wanted to filter things out of a database. I started this with the hope that I would be able to take misspellings and convert them into the correct word. In the end, I just removed words that were not spelled correctly using WordNet through NLTK. WordNet had a bigger dictionary than most of the spell checkers, which also helped in the filtering task. NLTK has a simple how-to for getting started with WordNet.
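The WordNet filter itself is tiny. A sketch (the function name is mine; it assumes NLTK and its WordNet corpus are installed):

from nltk.corpus import wordnet

def known_words(words):
    # keep only the words WordNet has at least one synset for
    return [w for w in words if wordnet.synsets(w)]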

Posted on April 11, 2009 by Joe

NLTK vs MontyLingua Part of Speech Taggers

This is a comparison of the part-of-speech taggers available in Python. As far as I know, these are the most prominent Python taggers; let me know if you think another tagger should be added to the comparison.

MontyLingua includes several natural language processing (NLP) tools; the ones I used in this comparison were the stemmer, tagger, and sentence tokenizer. The Natural Language Toolkit (NLTK) is another set of Python tools for NLP, with a much greater breadth: it has taggers, parsers, tokenizers, chunkers, and stemmers, and it usually offers a few different implementations of each, giving its users options. For stemming, for example, it has the Porter stemmer and the WordNet lemmatizer. Both of these tools are written to aid in NLP using Python.

Taggers

For those that don't know, a tagger is an NLP tool that marks the part of speech of each word.

Example:
Input: “A dog walks”
Output: “A/DT dog/NN walks/VBZ”

The tags after the / come from the Penn Treebank tagset.

For NLTK, I'm comparing the built-in tagger to MontyLingua's. I didn't do any training at all and just called nltk.tag.pos_tag(). I used the taggers mostly as-is, with some slight modifications: I added a RegExp tagger in front of the NLTK tagger and made the default tagger the backoff. It always marks "A", "An", and "The" as DT, because having them tagged NNP was annoying and messing up my results. They were capitalized, so I suppose the tagger took them for initials or proper names. A sketch of that override follows.
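This sketch post-processes the tagger's output rather than wiring up an actual RegexpTagger chain, but the effect is the same; the function name is mine:

import nltk

ARTICLES = set(['a', 'an', 'the'])

def tag_sentence(sentence):
    tagged = nltk.tag.pos_tag(nltk.word_tokenize(sentence))
    # force articles to DT, even when capitalized at the start of a sentence
    return [(w, 'DT') if w.lower() in ARTICLES else (w, t) for w, t in tagged]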

MontyLingua, on the other hand, was always marking "US" as a pronoun. This was a problem when scanning sentences that said "US Pint" or "US Gallon." I look at the word before "US"; if it's an article, I allow the phrase to continue being processed (sketched below). Neither tagger is perfect, but it becomes clear that one may be better than the other for my use case: scanning sentences from the web. It may be different for yours.
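The "US" workaround amounts to a check like this (the function name is mine):

def keep_us(tokens, i):
    # only let "US" through when an article precedes it, as in "a US Pint"
    return i > 0 and tokens[i - 1].lower() in ('a', 'an', 'the')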

Stemmers

A stemmer is a tool that takes a word with a suffix attached and returns its 'stem', or base word.

Example:
Input: dogs
Output: dog

While neither stemmer is perfect, both do a decent job. MontyLingua is more inclined to take the 's' off the end of a word, while the NLTK WordNetLemmatizer doesn't always remove it: 'cows' is an example of a word the WordNetLemmatizer will not stem to 'cow' but MontyLingua will. On the other hand, MontyLingua is more likely to take the 's' off the end of an acronym, so I wrote code to correct that in some cases: if a word is shorter than four characters or is all consonants, I don't run it through the MontyLingua stemmer. The all-consonants check is there to catch acronyms. When using MontyLingua on a specific part of speech, it's important to specify whether it's a noun or a verb with the 'pos' parameter; since I'm only stemming nouns, I used pos='noun'. The guard is sketched below.
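Here, monty_stem stands in for whatever MontyLingua stemming call is in use, and the names are mine:

import re
from nltk.stem.wordnet import WordNetLemmatizer

VOWELS = re.compile(r'[aeiouy]', re.IGNORECASE)

def stem_noun(word, monty_stem):
    # short or all-consonant words are likely acronyms ("ABS", "US");
    # leave them alone rather than letting the stemmer strip a trailing 's'
    if len(word) < 4 or not VOWELS.search(word):
        return word
    return monty_stem(word)

# note the NLTK lemmatizer takes 'n' rather than 'noun' for its pos argument:
print WordNetLemmatizer().lemmatize('dogs', pos='n')  # 'dog'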

Results

The first results reflect not only a change of tagger but also changes in the stemmer and sentence tokenizer, so I ran a second comparison using the MontyLingua tagger together with the NLTK stemmer and sentence tokenizer.

A phrase found by one algorithm and not the other is shown first; each was able to find some phrases the other missed. "Hits" is the number of times a phrase comes up, and it is displayed only when there is a discrepancy: if MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected there. The first numbers are totals over every discrepancy. The graph below shows how many discrepancies there were of each size. For example, there were 157 times where the discrepancy was 1 hit with MontyLingua on top, and 78 times where the hit counts differed by 1 with NLTK ahead. An interesting outlier: there was one word where MontyLingua had 40 more hits than NLTK. That word was elephant.
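The tallying behind these numbers amounts to something like the following, where the arguments map each phrase to its hit count (the names are mine):

from collections import defaultdict

def discrepancy_histogram(monty_hits, nltk_hits):
    # bucket each phrase's hit-count difference by its size and its winner
    monty_wins, nltk_wins = defaultdict(int), defaultdict(int)
    for phrase in set(monty_hits) | set(nltk_hits):
        diff = monty_hits.get(phrase, 0) - nltk_hits.get(phrase, 0)
        if diff > 0:
            monty_wins[diff] += 1
        elif diff < 0:
            nltk_wins[-diff] += 1
    return monty_wins, nltk_wins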

MontyLingua toolchain vs NLTK toolchain
In MontyLingua but not NLTK: 514
In NLTK but not MontyLingua: 403

Total Hits: MontyLingua: 1421 vs NLTK: 1184

[Graph: MontyLingua vs NLTK, number of discrepancies at each hit-count difference]

Hit Count    MontyLingua    NLTK
    1            157          78
    2             35          10
    3             10           0
    4              4           1
    5              1           0
    6              1           0
   13              0           1
   14              2           0
   40              1           0

On average, MontyLingua had more hits than NLTK.

MontyLingua Tagger + NLTK Stemmer & Tokenizer (ML-NLTK) vs MontyLingua Toolchain
For the sake of completeness, here are the results of the MontyLingua tagger combined with the NLTK stemmer and tokenizer.

In ML-NLTK but not in MontyLingua: 65
In MontyLingua but not in ML-NLTK: 68

Total Hits: ML-NLTK: 290 vs MontyLingua: 299

Hit Count    MontyLingua    ML-NLTK
    1             20           17
    2              0            2
   10              1            0

Total Phrases Found By

Name           Phrase Count
NLTK           3777
ML-NLTK        3885
MontyLingua    3888

At the end of the day, I'll be using the MontyLingua toolchain with the slight modifications mentioned above. I'm definitely still using NLTK, just for different tasks: it has a great, easy-to-use regexp chunker that I'll continue to use (a minimal example below).
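That chunker is nltk.RegexpParser. A sketch with a made-up noun-phrase grammar:

import nltk

# chunk an optional determiner plus adjectives and nouns into an NP
chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
tagged = nltk.tag.pos_tag(nltk.word_tokenize("A US pint is a unit of volume"))
print chunker.parse(tagged)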

Again, a tagger's performance can vary greatly based on the data used to train and test it. I was testing on about 12,000 web pages I had downloaded, looking for specific phrases. On a different data set, NLTK may turn out to be better.

Posted on March 28, 2009 by Joe

The ease of Python, SQLite, and Storm

I began learning Python this spring, and I must say, the more I program in it the more I like it. I chose the language because of the libraries that are available for it. There is a library for everything. :) Also, there are tools for Natural Language Processing that are a great help, but that’s for another time and another post.

I was originally thinking about using Postgres, which would probably give me better speed and scalability. But then I began to wonder whether I really needed a full RDBMS for my application. After all, I'm not expecting the project to get too large, and being able to move it from one computer to another by copying a single file sounds very convenient. SQLite has a great page to help you decide if it's right for you. I ended up settling on SQLite, and so far I'm happy with the decision.

Installing SQLite was a breeze. I just opened the package manager in openSuSE and installed the packages, along with the Python Storm package. There is no daemon process, as there is with Postgres, because you're just accessing one file on your filesystem. There is also a great tool for setting up a SQLite database called SQLite Manager: it lets you create tables, view your data, and run queries, and the fact that it's available as a Firefox extension makes it easy to install on many platforms.

Now is when the real fun begins. Enter Storm.

Storm is an object-relational mapping (ORM) tool for Python. It allows you to manipulate the database through the manipulation of Python objects: after you map your Python classes to database tables, you work with the objects and your changes show up in the database for you. I've used other ORM tools in the past (Hibernate for Java), but I was amazed at the simplicity of the setup/configuration step when using Storm.

Connecting to your SQLite database takes two lines (plus an import):

from storm.locals import create_database, Store

DATABASE = create_database('sqlite:db_name')
# or simply create_database('sqlite:') for an in-memory database
STORE = Store(DATABASE)

Mapping a class to a table can be done with ease. Here is an example of one of my classes:

from storm.locals import Int, Unicode, Reference

class Sentence(object):
    __storm_table__ = "TABLE_SENT"
    sent_id = Int(primary=True)
    loc_id = Int()
    # Location is another mapped class of mine (not shown), keyed by loc_id
    location = Reference(loc_id, Location.loc_id)
    sentence = Unicode()

    def __init__(self, sent, loc=None):
        if loc: self.location = loc

        # sent cannot be None; coerce plain strings to unicode
        if not isinstance(sent, unicode):
            self.sentence = unicode(sent, "utf-8")
        else:
            self.sentence = sent

If you access loc_id, it gives you the raw database id; if you access location, the Reference hands you the corresponding Location object.
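Using the mapping is just as short. A sketch, assuming loc is a Location instance already in the store:

sent = Sentence("The cow jumped over the moon.", loc)
STORE.add(sent)
STORE.commit()  # sent.sent_id is assigned by the database on flush/commit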

Now, I set this up in about 45 minutes from start to finish, so it might need some more fiddling, but overall it seems to work pretty well. I needed to set something up in one night to keep moving on other parts of the project, and Storm allowed me to do exactly that.

It can't all be sunshine and rainbows; one thing did trip me up a bit. Being new to Python, I wasn't aware of the u"String" syntax for Unicode literals. It was used in Storm's examples, and only after I got an error did I work out what it was for. As you can see in my constructor, I added some code to handle the case where a non-Unicode string is passed in.

As I get into the more advanced aspects of SQLite/Storm, I hope I continue to be impressed.

Posted on March 8, 2009 by Joe

Regular Expressions Review

A hobby of mine is learning new programming languages. I try to learn at least one a year, and use it for more than just a hello-world app. This year is the year of Python: if I'm required to write a script, Python is the go-to language. That said, I recently had a need for regular expressions, so Python it was.

Being most familiar with Java and Ruby, Python seems a little in between, but one feature that stuck out was the regular expression syntax. Python lets you put an 'r' in front of the quote to denote a raw string, which means you don't have to escape backslashes twice (à la Java).

For those not familiar with regular expressions, here is an example of a regular expression in a few languages:

Java:
Find Slashes: "[\\\\/]"

Python:
Find Slashes: r"[\\/]"

Ruby:
Find Slashes: /[\\\/]/

This is just a simple example to illustrate a point, but look at Java: finding slashes takes four backslashes. Even if regular expressions were typically this simple (they're not), that seems unnecessary. Any time a backslash must make it to the regular-expression engine, there must be two in the string literal. Another quick example: to find a period (.), the Java regular expression would be "\\.". This compounds very quickly and makes it painful to use regular expressions in Java.
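A quick demonstration of the difference in Python:

import re

path = "C:\\docs/file.txt"
print re.findall(r"[\\/]", path)    # ['\\', '/'] -- one escape level, for the engine
print re.findall("[\\\\/]", path)   # same matches, but escaped for the string literal too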

I can't understand why Java wouldn't adopt the Python syntax of putting an 'r' in front of a string to denote raw input. At a quick glance, it doesn't seem as though it would break any regular expressions currently in use, because the old syntax would continue working as expected, and it would make them better going forward.

Here are some Regular Expression resources that I find useful:

http://www.rubular.com/ A Ruby Regular Expression Tester

http://www.fileformat.info/tool/regex.htm A Java Regular Expression Tester

Posted on February 14, 2009 by Joe

Getting Started with Git (and the Hub)

I'm very new to Git, but I find some of its features interesting and worth a post. This is possibly part 1 of more to come as I learn more about Git and how to use it effectively. The comments below come from someone who has mostly used SVN for source control in the past.

Features (that appeal to me):

These are features of DVCS in general; bisect may be Git-specific.

Starting out:

Make sure you have Git installed before starting here; I'm assuming it is, and that you're a little familiar with the commands. Here are some links to excellent resources to get you started if you're not at that point yet:

Well, now that you're somewhat familiar with Git, let's get started with GitHub. The main reason for posting this is that, as a new user, it didn't seem clear what I needed to do to actually push to a repository. After you've been given access to, or created, a GitHub repository, you must set your username and email for GitHub to use.

Instructions on that can be found here: http://github.com/guides/tell-git-your-user-name-and-email-address

Now, here is the part that really wasn't clear to me: you need to set up an SSH key with GitHub as part of its authentication. I don't know if this is exactly right, but I think of it as a layer of encryption that makes sure my files get to GitHub unaltered (similar to PGP for email).

Instructions for this can be found here: http://github.com/guides/providing-your-ssh-key

Now that you're all set up, you're ready to do some commits and then a push to the server. Have fun in the world of Git.

Posted on January 20, 2009 by Joe