Elixir and Storm are very similar: both are object-relational mappers that provide an easy way to map your objects to database tables. In a future post, I'll do a more in-depth comparison between the two.
Starting out, Elixir uses SQLAlchemy as its backend, and it does show through in certain instances. While working with the tool you will probably run into things you won't understand if you're not familiar with SQLAlchemy, so keeping a tab open in Firefox pointed at the SQLAlchemy documentation can be useful.
There are two main starting points for an ORM tool: either you're starting with an existing database, or you're setting the database up from scratch. Mapping to a table that already exists with Elixir can be a little tricky, depending on the relationships.
It's as simple as this to connect to a database:

```python
from elixir import metadata

metadata.bind = "sqlite:///../sizedb.sqlite"
```
Here is a simple example:
```python
class Location(Entity):
    using_options(tablename='TABLE_LOC')

    loc_id = Field(Integer, primary_key=True)
    location = Field(UnicodeText)
```
And here is a more complex example of connecting to an existing database table:
```python
class Comparison(Entity):
    using_options(tablename='TABLE_COMP')

    comp_id = Field(Integer, primary_key=True)
    date_added = Field(DateTime, default=datetime.datetime.now)
    hits = Field(Integer, default=1)
    smaller = ManyToOne('Phrase', colname='smaller_id')
    larger = ManyToOne('Phrase', colname='larger_id')
    sentences = ManyToMany('Sentence', tablename='TABLE_COMP_SENT',
                           local_side='comp_id', remote_side='sent_id',
                           column_format="%(key)s")
```
I left out most of the class-specific code to focus on Elixir. One thing that took a while to figure out was how to set up a ManyToMany relationship with specific columns in my database; the column_format parameter is the key to specifying your column names directly. I really didn't have to use any options beyond what you see above when connecting to an existing database. Overall, I had about five database tables to connect.
Now, if you're not working against an existing database, many of the parameters on the relationships are unnecessary. For comparison, here is the same example when Elixir is used to create the database tables itself:
```python
class Comparison(Entity):
    comp_id = Field(Integer, primary_key=True)
    date_added = Field(DateTime, default=datetime.datetime.now)
    hits = Field(Integer, default=1)
    smaller = ManyToOne('Phrase')
    larger = ManyToOne('Phrase')
    sentences = ManyToMany('Sentence')
```
As you can see, it gets quite a bit simpler: the underlying table information is no longer needed. Elixir created tables that were very similar to the hand-created tables I had used with Storm. When it comes to queries on the database, SQLAlchemy shows through.
I found the documentation on the Elixir webpage to be a little lacking when it comes to queries; SQLAlchemy has a page that describes the query functions more fully. The AND and OR operators are named and_ and or_, respectively, presumably because and and or are reserved words in Python. I thought this was worth mentioning because they are such common SQL operators.
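To make the reserved-word point concrete, Python's standard keyword module confirms that and and or can never be used as identifiers, which is why SQLAlchemy appends the underscore. (The filter line in the comment below is only a sketch of what such a query looks like, not code from this project.)

```python
import keyword

# "and" and "or" are reserved words, so a library cannot use them as names
print(keyword.iskeyword("and"))   # True
print(keyword.iskeyword("or"))    # True
print(keyword.iskeyword("and_"))  # False: the trailing underscore frees the name

# a query against an entity might then read roughly like:
# Comparison.query.filter(or_(Comparison.hits > 5, Comparison.hits == 1))
```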
In: Python · Tagged with: elixir, sqlite
I was looking into spell checking in Python. I found spell4py and downloaded the zip, but couldn't get it to build on my system. Maybe I could have gotten it working if I had tried a bit longer, but in the end my solution worked out fine, and the library was overkill for my needs anyway.
I found this article here: http://code.activestate.com/recipes/117221/
This seemed to work well for my purposes, but I wanted to test out other spell-checking libraries. Mozilla Firefox, Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (since I'm testing the spelling of words from the Internet). Here are some Python snippets to get you up and running with the popular spelling checkers. I modified these to take more than one word, split the input up, and return a list of suggestions. They do require each spelling checker to be installed; I was able to do that through the openSUSE package manager.
```python
import popen2

class ispell:
    def __init__(self):
        self._f = popen2.Popen3("ispell")
        self._f.fromchild.readline()  # skip the credit line

    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline()  # skip the blank line
            if s[:8] == "word: ok":
                output.append(None)
            else:
                output.append(s[17:-1].strip().split(', '))
        return output
```
```python
import popen2

class aspell:
    def __init__(self):
        self._f = popen2.Popen3("aspell -a")
        self._f.fromchild.readline()  # skip the credit line

    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip()
            self._f.fromchild.readline()  # skip the blank line
            if s == "*":
                output.append(None)
            elif s == '#':
                output.append("No Suggestions")
            else:
                # suggestions follow the colon, comma-separated
                output.append(s.split(':')[1].strip().split(', '))
        return output
```
```python
import popen2

class hunspell:
    def __init__(self):
        self._f = popen2.Popen3("hunspell")
        self._f.fromchild.readline()  # skip the credit line

    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.tochild.write(word + '\n')
            self._f.tochild.flush()
            s = self._f.fromchild.readline().strip().lower()
            self._f.fromchild.readline()  # skip the blank line
            if s == "*":
                output.append(None)
            elif s == '#':
                output.append("No Suggestions")
            elif s == '+':
                pass
            else:
                # suggestions follow the colon, comma-separated
                output.append(s.split(':')[1].strip().split(', '))
        return output
```
Now, after doing this and seeing the suggestions, I decided a spelling checker isn't really what I was looking for. A spelling checker always tries to make a suggestion, and what I wanted was to filter things out of a database. I started this with the hope that I would be able to take misspellings and convert them into the correct words. In the end, I just removed words that were not spelled correctly, using WordNet through NLTK. WordNet has a bigger dictionary than most of the spelling checkers, which also helped in the filtering task. NLTK has a simple how-to on getting started with WordNet.
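To illustrate the filtering idea without pulling in NLTK, here is a minimal sketch where a plain Python set stands in for WordNet's lexicon. The word list is made up for the example; the real check goes through NLTK's WordNet interface.

```python
# a plain set standing in for WordNet's dictionary (made-up sample data)
LEXICON = {"cow", "elephant", "gallon", "pint", "sentence"}

def filter_misspellings(words):
    """Keep only words found in the lexicon, dropping likely misspellings."""
    return [w for w in words if w.lower() in LEXICON]

print(filter_misspellings(["Elephant", "elefant", "pint", "pnit"]))
# ['Elephant', 'pint']
```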
In: Python · Tagged with: nlp, spelling checker
This is a comparison of the part-of-speech taggers available in Python. As far as I know, these are the most prominent Python taggers; let me know if you think another tagger should be added to the comparison.
MontyLingua includes several natural language processing (NLP) tools; the ones I used in this comparison were the stemmer, the tagger, and the sentence tokenizer. The Natural Language Toolkit (NLTK) is another set of Python tools for natural language processing, with a much greater breadth: it has taggers, parsers, tokenizers, chunkers, and stemmers, and it usually has a few different implementations of each, giving users different options. For stemmers, for example, it has the Porter stemmer and the WordNet lemmatizer. Both of these tools are written to aid in NLP using Python.
For those that don't know, a tagger is an NLP tool that marks the part of speech of each word.
Input: “A dog walks”
Output: “A/DT dog/NN walks/VBZ”
The meanings of the tokens after the / can be found here.
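The word/TAG notation above is just a joined-up list of (word, tag) pairs; a quick helper shows how the two forms relate. The tags here are hard-coded for the example, not produced by a tagger.

```python
def format_tagged(pairs):
    """Render (word, tag) pairs in the word/TAG style shown above."""
    return " ".join("%s/%s" % (word, tag) for word, tag in pairs)

print(format_tagged([("A", "DT"), ("dog", "NN"), ("walks", "VBZ")]))
# A/DT dog/NN walks/VBZ
```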
For NLTK, I'm comparing the built-in tagger to MontyLingua. I didn't do any training at all and just called nltk.tag.pos_tag(). I used the taggers mostly as-is, with some slight modifications: I added a regexp tagger in front of the NLTK tagger and made the NLTK tagger the backoff. The regexp tagger always marks A, An, and The as DT. It was annoying, and was messing up my results, to have them marked as NNP; they were capitalized, so I suppose the tagger thought they were either initials or proper names.
MontyLingua, on the other hand, was always marking "US" as a pronoun, which was a problem when scanning sentences that said "US Pint" or "US Gallon." I look at the word before "US", and if it's an article, I allow the sentence to continue being processed. Neither tagger is perfect, but it becomes clear that one may be better than the other for a given use case; mine is scanning sentences from the web, and yours may be different.
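The "US" workaround boils down to a small check on the preceding token. This is a sketch of the logic, not my exact code:

```python
ARTICLES = {"a", "an", "the"}

def keep_us_token(tokens, i):
    """Allow a 'US' token through only when the previous word is an article,
    as in 'a US Pint'; otherwise trust the tagger's pronoun reading."""
    return i > 0 and tokens[i - 1].lower() in ARTICLES

print(keep_us_token(["a", "US", "Pint"], 1))       # True
print(keep_us_token(["US", "stocks", "rose"], 0))  # False
```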
A stemmer is a tool that takes a word with a suffix attached and returns the 'stem', or base form, of the word.
While neither stemmer is perfect, they both do a decent job. MontyLingua is more inclined to take the 'S' off the end of something, and the NLTK WordNetLemmatizer doesn't always take it off; 'Cows' is an example of a word the WordNetLemmatizer will not stem to 'Cow' but MontyLingua will. On the other hand, MontyLingua is more likely to take the 'S' off the end of an acronym, and I wrote code to correct that in some cases: if a word is less than 4 characters or all consonants, I don't run it through the MontyLingua stemmer. The all-consonants check is there to catch acronyms. When using MontyLingua on a specific part of speech, it's important to specify whether it's a noun or a verb with the 'pos' parameter; since I'm only stemming nouns, I used pos='noun'.
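The guard in front of the MontyLingua stemmer is simple. Roughly (a sketch, treating 'aeiou' as the vowel set):

```python
VOWELS = set("aeiou")

def should_stem(word):
    """Skip short words and all-consonant words (likely acronyms)
    before handing them to the stemmer."""
    if len(word) < 4:
        return False
    if not any(ch in VOWELS for ch in word.lower()):
        return False  # all consonants: probably an acronym
    return True

print(should_stem("cows"))   # True
print(should_stem("US"))     # False (too short)
print(should_stem("NLTKs"))  # False (no vowels)
```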
The first results don't reflect only a change in taggers, but changes in the stemmer and sentence tokenizer as well, so I did another comparison using the MontyLingua tagger with the NLTK stemmer and sentence tokenizer.
A phrase found by one algorithm and not by the other is shown first; each was able to find some phrases that the other missed. Hits is the number of times a phrase comes up, and it is displayed only when there is a discrepancy: if MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected there. The first numbers are the totals of every discrepancy summed. There is also a graph below showing how many of each size of difference there were. For example, there were 157 times that the discrepancy was 1 hit with MontyLingua on top, and 78 times the hit counts differed by 1 with NLTK ahead. An interesting one: there was one phrase where MontyLingua had 40 more hits than NLTK. That word was elephant.
MontyLingua toolchain vs NLTK toolchain
In MontyLingua but not NLTK: 514
In NLTK but not MontyLingua: 403
Total Hits: MontyLingua: 1421 vs NLTK: 1184
On average, MontyLingua had more hits than NLTK on the phrases it found.
MontyLingua Tagger NLTK Stemmer & Tokenizer (ML-NLTK) vs MontyLingua Toolchain
For the sake of completeness here are the results of the MontyLingua tagger with the NLTK stemmer and tokenizer.
In ML-NLTK but not in MontyLingua: 65
In MontyLingua but not in ML-NLTK: 68
Total Hits: ML-NLTK: 290 vs MontyLingua: 299
[Graph: total phrases found by each toolchain]
At the end of the day, I'll be using the MontyLingua toolchain with the slight modifications I've made (mentioned above). I'm definitely still using NLTK, just for different tasks; NLTK has a great, easy-to-use regexp chunker that I'll continue to use.
Again, a tagger's performance can vary greatly based on the data used to train and test it. I was testing them on about 12,000 webpages I downloaded, looking for specific phrases; on a different data set, NLTK may turn out to be better.
In: Python · Tagged with: benchmarks, nlp, taggers
I began learning Python this spring, and I must say, the more I program in it the more I like it. I chose the language because of the libraries that are available for it. There is a library for everything. :) Also, there are tools for Natural Language Processing that are a great help, but that’s for another time and another post.
I was originally thinking about using Postgres, which would probably give me better speed and scalability. But then I began to wonder whether I really needed a full RDBMS for my application. After all, I'm not expecting the project to get too large, and being able to move it from one computer to another by just moving a single file sounds very convenient. SQLite has a great page to help you decide if it's right for you. I ended up settling on SQLite, and so far I'm happy with the decision.
Installing SQLite was a breeze: I just opened the package manager in openSUSE and installed the packages, along with the Python Storm package. There is no daemon process, as there is with Postgres, because you're just accessing one file on your filesystem. There is a great tool for setting up a SQLite database called SQLite Manager. It will let you create tables, view your data, and run queries, and the fact that it's available as a Firefox extension makes it easy to install on many platforms.
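Because SQLite support ships in Python's standard library, you can poke at a database before any ORM is involved. A quick in-memory example using the built-in sqlite3 module (the table and row here are invented for the example):

```python
import sqlite3

# ":memory:" gives a throwaway database; a filename would give the single
# portable file mentioned above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (sent_id INTEGER PRIMARY KEY, sentence TEXT)")
conn.execute("INSERT INTO sentences (sentence) VALUES (?)", ("A dog walks",))
conn.commit()

rows = conn.execute("SELECT sentence FROM sentences").fetchall()
print(rows)  # [('A dog walks',)]
conn.close()
```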
Now is when the real fun begins. Enter Storm.
Storm is an object-relational mapping (ORM) tool for Python. It allows you to manipulate the database through the manipulation of Python objects: after you map your Python classes to database tables, you work with the objects, and your changes show up in the database for you. I've used other ORM tools in the past (Hibernate for Java), but I was amazed at the simplicity of the setup/configuration step when using Storm.
It takes two lines to connect to your SQLite database:

```python
from storm.locals import create_database, Store

DATABASE = create_database('sqlite:db_name')  # or simply create_database('sqlite:') for in-memory
STORE = Store(DATABASE)
```
Mapping a class to a table can be done with ease. Here is an example of one of my classes:
```python
class Sentence(object):
    __storm_table__ = "TABLE_SENT"

    sent_id = Int(primary=True)
    loc_id = Int()
    location = Reference(loc_id, Location.loc_id)
    sentence = Unicode()

    def __init__(self, sent, loc=None):
        if loc:
            self.location = loc
        # sent cannot be None
        if not isinstance(sent, unicode):
            self.sentence = unicode(sent, "utf-8")
        else:
            self.sentence = sent
```
If you access loc_id, it will give you the database id. If you access location, which goes through a Reference, it will hand you the corresponding database object.
Now, I set this up in about 45 minutes from start to finish, so it might need some more fiddling, but overall it seems to work pretty well. I needed to set something up in one night to keep moving on other parts of the project, and Storm allowed me to.
It can't all be sunshine and rainbows: there was one thing that tripped me up a bit. Being new to Python, I wasn't aware of the u"String" syntax for Unicode literals. It was used in their examples, and only after I got an error did I work out what it was for. As you can see in my constructor, I added some code to handle the case when a string that isn't Unicode is passed in.
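The fix in the constructor amounts to coercing byte strings to text. The same idea in modern Python 3 terms, where bytes/str mirrors Python 2's str/unicode split, looks like this (a sketch; Storm code of this era was Python 2):

```python
def to_text(value):
    """Return a text string, decoding UTF-8 bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value

print(to_text(b"caf\xc3\xa9"))  # café
print(to_text("already text"))  # already text
```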
As I get into the more advanced aspects of SQLite/Storm, I hope I continue to be impressed.
In: Python · Tagged with: sqlite, storm
A hobby of mine is learning new programming languages. I try to learn at least one a year, and use it for more than just a hello-world app. So this year is the year of Python: if I'm required to write a script, Python is the go-to guy. That said, I recently had a need for regular expressions, so Python it was.
Being most familiar with Java and Ruby, Python seems a little in between, but one feature that stuck out was the regular expression syntax. Python lets you put an 'r' in front of the quote to denote raw input. This means you don't have to escape backslashes twice (à la Java).
For those not familiar with regular expressions, here is an example of a regular expression in a few languages:
Find slashes:

```
Java:   "[\\\\/]"
Python: r"[\\/]"
Ruby:   /[\\\/]/
```
This is just a simple example to illustrate a point, but look at Java: finding slashes takes 4 backslashes. Even if regular expressions were typically this simple (they're not), that seems unnecessary. Any time a backslash needs to make it to the regular expression processor, there must be two in the string. Another quick example: to find a period (.), the regular expression would be "\\.". This compounds very quickly and makes it painful to use regular expressions in Java.
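You can verify the counting with Python's re module: the raw form and the doubled form are literally the same string, the raw form just saves a level of escaping. The patterns match the examples above.

```python
import re

# a raw string saves one level of escaping: both spellings are the pattern [\\/]
assert r"[\\/]" == "[\\\\/]"

print(re.findall(r"[\\/]", r"C:\dir/file"))  # ['\\', '/']

# matching a literal period: r"\." instead of "\\."
assert r"\." == "\\."
print(re.findall(r"\.", "a.b.c"))  # ['.', '.']
```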
I can't understand why Java wouldn't adopt the Python syntax of putting an 'r' in front of a string to denote raw input. At quick glance, it doesn't seem as though it would break any regular expressions currently in use, because the old syntax would continue working as expected. It'd make them better for the future.
Here are some Regular Expression resources that I find useful:
http://www.rubular.com/ A Ruby Regular Expression Tester
http://www.fileformat.info/tool/regex.htm A Java Regular Expression Tester