NLTK Regular Expression Parser (RegexpParser)
The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language. One such tool is the Regular Expression Parser. If you’re familiar with regular expressions, it can be a useful tool in natural language processing.
Background Information
You must first be familiar with regular expressions to be able to fully utilize the RegexpParser/RegexpChunkParser. If you need to learn about regular expressions, here is a site with an abundance of information to get you started: http://www.regular-expressions.info. It is also necessary to know how to use a tagger, and what the tags mean. A tagger is a tool that marks each word in a sentence with its part of speech. Here is a small comparison I did of python taggers: NLTK vs MontyLingua Part of Speech Taggers. The NLTK RegexpParser works by running regular expressions on top of the part of speech tags added by a tagger. The Brown Corpus tags will be the tags used throughout the rest of this post, and are commonly used by taggers in general. On a side note, the RegexpParser can be used with either the NLTK or MontyLingua tagger.
Basic RegexpParser Usage
Let me start by going over the “how to” provided in the NLTK documentation. The source of this information is here: NLTK RegexParser HowTo. The documentation goes through how you could use the RegexParser/RegexpChunkParser to do a traditional parse of a sentence.
The RegexParser/RegexChunkParser works by defining rules for grouping different words together. A simple example would be: “NP: {<DT>? <JJ>* <NN>*}”. This is a definition for a rule to group of words into a noun phrase. It will group one determinant (usually an article), then zero or more adjectives followed by zero or more nouns. In the how to, they go over prepositions and creating prepositional phrases from a preposition and noun phrase. It’s important to note that earlier regular expressions can be used in later ones. Also, the regular expression syntax can occur within the tags or apply to the tags themselves.
Here is the example from the NLTK website:
parser = RegexpParser(''' NP: {<DT>? <JJ>* <NN>*} # NP P: {<IN>} # Preposition V: {<V.*>} # Verb PP: {<P> <NP>} # PP -> P NP VP: {<V> <NP|PP>*} # VP -> V (NP|PP)* ''')
Alternative RegexpParser Usage
I call this an alternate usage because it can be used to find patterns that aren’t necessarily related to grammatical phrases in English. It can be used to find any pattern in a sentence. Let me start by showing the regular expression grammar from my program.
grammar = """ NP: {<PRP>?<JJ.*>*<NN.*>+} CP: {<JJR|JJS>} VERB: {<VB.*>} THAN: {<IN>} COMP: {<DT>?<NP><RB>?<VERB><DT>?<CP><THAN><DT>?<NP>} """ self.chunker = RegexpParser(grammar)
I was using it to look for a specific pattern in a sentence. The first part, NP, is looking for a noun phrase. The <PRP>? is there because of a bug found in the tagger I was using. It was marking An with a capital ‘A’ as a PRP (Pronoun) rather than a DT (Determinant/Article). I found another workaround for the bug, but left the PRP in there to catch anything that might have slipped through.
Then it moves onto the CP, which is the comparison word. JJR tagged words are comparative adjectives. They include words bigger, smaller, and larger. JJS words are words that signify the most or chief. JJS words include biggest, smallest, and largest.
The next two a simply the VERB and the word THAN. The VERB could be a compound verb, so there would be one or more verbs present. The IN tag denotes a preposition. In this case, I was looking specifically for the word than.
The last line is COMP. This is the regular expression that puts it all together. This was looking for a size comparison of two objects. It might be easier to look at the output of this part of the expression than trying to explain it piece by piece. The only tag not explained above is RB, which is an adverb.
Here is the parse for the sentence “Everyone knows an elephant is larger than a dog.”:
(S
(NP everyone/NN)
(VERB knows/VBZ)
(COMP
an/DT
(NP elephant/NN)
(VERB is/VBZ)
(CP larger/JJR)
(THAN than/IN)
a/DT
(NP dog/NN))
./.)
The output is a simple tree, that makes to easy data extraction. It’s easy to see there are many possibilities that open up when looking for patterns in English text. May this help you in your data mining endeavors.
In: Uncategorized · Tagged with: nlp, nltk, python
Sort Optimization (Part 2) with JDK 6 vs JDK 7
In part 1, I went over my first foray into the world of sorting algorithms. Since then, I’ve had some other ideas on how to improve my quicksort implementation. One idea that I had while originally working on the sorting algorithm, was to rework the partition function to take into account duplicate elements. I had a few different working implementations, but all of them came with severe performance penalty. I finally figured out a way to get performance close to the previous algorithm.
The partition function needs to perform the minimal number of swaps possible. So moving towards the center from both ends and only swapping when both are out of order is the best approach I’ve found so far. When grouping duplicate elements, they are swapped to the beginning of the partition area as they are found. Then at the end, a pass is run to move them to their correct location in the final list. Then instead of returning one number from the partition function, it returns two. It returns the minimum and maximum indices on the range that has the pivot value.
Another area that I was able to get some performance gain out of was getting rid of the shell sort form the first algorithm. While that was there to make sure the quicksort did not recurse too deeply, in practice the shell sort algorithm doesn’t run.
Results
Here are the results of JDK 6 MergeSort, Tim Sort, QSort, QSortv2, and Dual Pivot sort 2 benchmarked on the same set of files. Overall, the new version doesn’t outperform the old version, but I thought it was worth posting my findings. On most data sets with duplicates it does perform better. I ran these benchmarks on OpenJDK 7 because I was curious as to how they would compare to one another.
It’s important to note that the tables are speedup relative the Java implementation on the given JDK. The graphs are the average runtimes for each algorithm. The reason for doing the average runtime is that it could show the performance difference between Sun’s JDK 6 and OpenJDK 7 build 73.
Conclusions
Overall the new version of the Qsort implementation doesn’t improve greatly over the previous implementation. While it didn’t work out to be the performace improvement I was looking for. I think the last graph with 1000 iterations of warmup for each algorithm is the most interesting. The Qsort v2 implementation apparently doesn’t get handled any better by OpenJDK 7. The partition function is larger after my changes, so perhaps it didn’t JIT very well. What is interesting is the boost that Tim Sort saw with the change of JDK’s. Running these benchmarks made me realize that upgrading my Java Runtime will increase the performance of all my Java applications. It will be interesting to see if the performance carries over to Netbeans and Eclipse; I expect it will.
In: Uncategorized · Tagged with: benchmarks, java, shootout, sorting
Sorting Algorithm Shootout
Since I did my Sort Optimization post, I’ve been keeping an eye on things that happen in the sorting world. Recently an article popped up on Reddit about someone wanting to replace the JDK sorting algorithm with a Dual Pivot Quick Sort. This lead to the discovery that Tim Sort would be replacing Merge Sort in the JDK starting with version 7. This probably got some attention because of the OpenJDK project. It’s nice to see that allowing more developers to work on different areas of the JDK. First I’ll do a quick overview of the algorithms, then show some benchmarks. All algorithms are written in Java.
JDK 6 Sort
The JDK6 implements a fairly standard Merge Sort. It will switch to an insertion sort at a specific depth.
QSort
This is the implementation of quicksort I outlined in the earlier blog post. It performed admirably at the time, but how will it hold up against tougher competition. It’s pretty much an iterative quicksort, that short-circuits to a shell sort if it’s going too deep.
Original QSort Post:
Sort Optimization
Tim Sort
This is an optimized in place variation of a merge sort. Tim Peters developed this sorting algorithm for the Python programming language. It is in use by Python and will be used by Java starting with JDK 7. It takes advantage of partially sorted parts of the list.
Available here:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124
http://hg.openjdk.java.net/jdk7/tl/jdk/rev/bfd7abda8f79
Dual Pivot Quick sort
This is a newcomer to the sorting table. Developed by Vladimir Yaroslavskiy for the inclusion into the Java language. The premise is the same as quick sort, only it will choose two pivot points rather than one. He did a full writeup detailing the algorithm, and its benefits. I did modify it to take the comparable interface, and Vladimir explicitly said this was not the intended target of the algorithm. He has stated it is designed to work directly with primitive types. I don’t see how doing an int comparison vs a Integer.compareTo() would be different, as long as they are used uniformly between all algorithms. Since my sorting algorithm works with comparable, as does Tim Sort, I chose to convert this algorithm to use the Comparable interface also.
Available here:
http://article.gmane.org/gmane.comp.java.openjdk.core-libs.devel/2628
Results
These tables show the speedup relative to JDK 6 with and without warm up.
Here is the original text data if you’re interested in that. These are in simple table format. The columns store the runtime in seconds for each algorithm. The number in parenthesis is the speedup relative to JDK 6.
Tim Sort is definitely the way to go if you’re interested in a stable sorting algorithm. I was pretty amazed when I first looked at the results with how well it actually it did. It really takes advantage of any presorted parts of the lists. Overall, I’d say my optimized quicksort does fairly well, but maybe it could do better. I may have to look into that again.
In: Uncategorized · Tagged with: java, shootout, sorting
Starting Python, Elixir, and SQLite
When I did the post about Storm, someone suggested that I look into Elixir. Since I didn’t have time to at the time, I made a note of looking into it at a later time. That time is now. :)
Elixir and Storm are very similar, they’re both object relational mappers that provide an easy way to map your objects to database tables. In a future post, I’ll do a more in depth comparison between the two in a future post.
Starting out, Elixir uses SQL Alchemy as a backend. While working with the tool you will probably find yourself running into things you may not understand if you’re not familiar with SQL Alchemy. Keeping open a tab in firefox pointed at the SQL Alchemy documentation can be useful. It does show through in certain instances.
There are two main starting points for an ORM tool. There is the case where you’re starting with an existing database, and the case where you’re setting up the database from scratch. Mapping to a table that already exists with Elixir can be a little tricky depending on the relationships.
It’s as simple as this to connect to a database:
metadata.bind = "sqlite:///../sizedb.sqlite"
Here is a simple example:
class Location(Entity): using_options(tablename='TABLE_LOC') loc_id = Field(Integer, primary_key=True) location = Field(UnicodeText)
And here is a more complex example of connecting to an existing database table:
class Comparison(Entity): using_options(tablename='TABLE_COMP') comp_id = Field(Integer, primary_key=True) date_added = Field(DateTime, default=datetime.datetime.now) hits = Field(Integer, default=1) smaller = ManyToOne('Phrase', colname='smaller_id') larger = ManyToOne('Phrase', colname='larger_id') sentences = ManyToMany('Sentence', tablename='TABLE_COMP_SENT', local_side='comp_id', remote_side='sent_id', column_format="%(key)s")
I left out most of the class specific code to focus on Elixir. One thing that took a while to figure out was how to setup a ManyToMany relationship with specific columns in my database. The column_format parameter is the key to being able to specify your column names directly. I really didn’t have to use any other options besides what you see above when connecting to an existing database. Overall, I had about five database tables to connect.
Now if it was not being setup with an existing database, many of the parameters in the different relationships are unnecessary. For comparison here is the same example if Elixir is used to create the database tables:
class Comparison(Entity): comp_id = Field(Integer, primary_key=True) date_added = Field(DateTime, default=datetime.datetime.now) hits = Field(Integer, default=1) smaller = ManyToOne('Phrase') larger = ManyToOne('Phrase') sentences = ManyToMany('Sentence')
As you can see, it gets quite a bit simpler. The underlying table information is no longer needed. It created tables that were very similar to my hand-created tables that I had used with Storm. When it comes to queries on the database, SQL Alchemist shows through.
I found the documentation on the Elixir webpage to be a little bit lacking in terms of queries. SQL Alchemist has a page that more fully describes the query functions. AND and OR operators are named and_ and or_, respectively, probably because and and or are reserved in Python. I thought this was worth mentioning because they are common SQL operators.
In: Uncategorized · Tagged with: elixir, python, sqlite
Spell Checking in Python
I was looking into spell checking in Python. I found spell4py, and downloaded the zip, but couldn’t get it to build on my system. If I tried a bit longer maybe, but in the end my solution worked out fine. This library was overkill for my needs too.
I found this article here: http://code.activestate.com/recipes/117221/
This seemed to work well for my purposes, but I wanted to test out other spell checking libraries. Mozilla Firefox , Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I’m testing the spelling of words on the Internet). Here are some python snippets to get you up and running with the popular spelling checkers. I modified these to take more than 1 word, split them up, and then return a list of suggestions. They do require each spelling checker to be installed. I was able to do this through the openSuSE package manager.
import popen2 class ispell: def __init__(self): self._f = popen2.Popen3("ispell") self._f.fromchild.readline() #skip the credit line def __call__(self, words): words = words.split(' ') output = [] for word in words: self._f.tochild.write(word+'\n') self._f.tochild.flush() s = self._f.fromchild.readline().strip() self._f.fromchild.readline() #skip the blank line if s[:8] == "word: ok": output.append(None) else: output.append((s[17:-1]).strip().split(', ')) return output
import popen2 class aspell: def __init__(self): self._f = popen2.Popen3("aspell -a") self._f.fromchild.readline() #skip the credit line def __call__(self, words): words = words.split(' ') output = [] for word in words: self._f.tochild.write(word+'\n') self._f.tochild.flush() s = self._f.fromchild.readline().strip() self._f.fromchild.readline() #skip the blank line if s == "*": output.append(None) elif s[0] == '#': output.append("No Suggestions") else: output.append(s.split(':')[1].strip().split(', ')) return output
import popen2 class hunspell: def __init__(self): self._f = popen2.Popen3("hunspell") self._f.fromchild.readline() #skip the credit line def __call__(self, words): words = words.split(' ') output = [] for word in words: self._f.tochild.write(word+'\n') self._f.tochild.flush() s = self._f.fromchild.readline().strip().lower() self._f.fromchild.readline() #skip the blank line if s == "*": output.append(None) elif s[0] == '#': output.append("No Suggestions") elif s[0] == '+': pass else: output.append(s.split(':')[1].strip().split(', ')) return output
Now, after doing this and seeing the suggestions. I decided a spell checker isn’t really what I was looking for. A spelling checker always tries to make a suggestion, and I wanted to filter out things from a database. I started this with the hope that I would be able to take misspellings and convert them into the correct word. In the end, I just removed words that were not spelled correctly using WordNET through NLTK. WordNET had a bigger dictionary than most of the spell checkers which also helped in the filtering task. NLTK has a simple how to on how to get started using WordNET.
In: Uncategorized · Tagged with: nlp, python, spelling checker








