NLTK vs MontyLingua Part of Speech Taggers

This is a comparison of the part-of-speech taggers available in Python. As far as I know, these are the two most prominent Python taggers. Let me know if you think another tagger should be added to the comparison.

MontyLingua includes several natural language processing (NLP) tools. The ones I used in this comparison were the stemmer, tagger, and sentence tokenizer. The Natural Language Toolkit (NLTK) is another set of Python tools for natural language processing, with a much greater breadth than MontyLingua: it has taggers, parsers, tokenizers, chunkers, and stemmers, and it usually offers a few different implementations of each, giving its users options. In the case of stemmers, it has the Porter and Lancaster stemmers as well as the WordNet lemmatizer (Punkt, by contrast, is its sentence tokenizer). Both toolkits are written to aid in NLP using Python.

Taggers

For those who don’t know, a tagger is an NLP tool that marks each word with its part of speech.

Example:
Input: “A dog walks”
Output: “A/DT dog/NN walks/VBZ”

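The tokens after the / are Penn Treebank part-of-speech tags: DT is a determiner, NN is a singular noun, and VBZ is a third-person singular present-tense verb.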
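To make the word/TAG notation concrete, here is a minimal sketch that converts between that notation and a list of (word, tag) pairs, which is the shape nltk.pos_tag() returns. The function names format_tagged and parse_tagged are made up for this illustration and are not part of either library.

```python
def format_tagged(pairs):
    """Join (word, tag) pairs into the word/TAG notation shown above."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def parse_tagged(text):
    """Split word/TAG notation back into (word, tag) pairs.

    rpartition splits on the last slash, so a word that itself
    contains a slash keeps everything before the final one.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = [("A", "DT"), ("dog", "NN"), ("walks", "VBZ")]
print(format_tagged(tagged))  # A/DT dog/NN walks/VBZ
```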

For NLTK, I’m comparing its built-in tagger to MontyLingua’s. I didn’t do any training at all and just called nltk.tag.pos_tag(). I used both taggers mostly as is, with some slight modifications. I put a RegExp tagger in front of the NLTK tagger and made the default tagger its backoff. The RegExp tagger always marks A, An, and The as DT. Having them marked as NNP was annoying and was skewing my results. They were capitalized, so I suppose the tagger thought they were either initials or proper names.
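The effect of that article rule can be sketched in plain Python. This is not the RegExp tagger setup itself, only an approximation of the same outcome: the fallback tagger runs over the whole sentence first (so it still sees context), and the regular expression rules then override its answer. OVERRIDES, tag_with_overrides, and stub_tagger are hypothetical names; in real use the fallback would be something like nltk.pos_tag.

```python
import re

# Hypothetical override rules: (compiled pattern, tag), checked in order.
OVERRIDES = [
    (re.compile(r"^(a|an|the)$", re.IGNORECASE), "DT"),
]


def tag_with_overrides(tokens, fallback_tagger, overrides=OVERRIDES):
    """Tag tokens, letting regex rules win over the fallback tagger.

    fallback_tagger is any callable that maps a list of tokens to a
    list of (token, tag) pairs.
    """
    result = []
    for token, tag in fallback_tagger(tokens):
        for pattern, forced_tag in overrides:
            if pattern.match(token):
                tag = forced_tag
                break
        result.append((token, tag))
    return result


def stub_tagger(tokens):
    """Stand-in for a real tagger: calls every capitalized word NNP."""
    return [(t, "NNP" if t[:1].isupper() else "NN") for t in tokens]


print(tag_with_overrides(["The", "dog"], stub_tagger))
# [('The', 'DT'), ('dog', 'NN')]
```

Keeping the rules in a list makes it easy to add further overrides without touching the tagging loop.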

MontyLingua, on the other hand, was always marking “US” as a pronoun. This was a problem when scanning sentences that said “US Pint” or “US Gallon.” I look at the word before “US” to see if it’s an article; if it is, I allow it to continue being processed. Neither tagger is perfect, but it becomes clear that one may be better than the other for my use case, which is scanning sentences from the web. It may be different for yours.
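One reading of that “US” check, as a rough sketch: when “US” has been tagged as a pronoun but follows an article, treat it as part of a name instead. The helper name fix_us_pronoun and the replacement tag NNP are assumptions made for this example; the actual correction may differ.

```python
ARTICLES = {"a", "an", "the"}


def fix_us_pronoun(tagged):
    """Retag 'US' when it follows an article.

    tagged is a list of (word, tag) pairs. The replacement tag NNP is
    an assumption chosen for illustration.
    """
    fixed = []
    for i, (word, tag) in enumerate(tagged):
        follows_article = i > 0 and tagged[i - 1][0].lower() in ARTICLES
        if word == "US" and tag == "PRP" and follows_article:
            tag = "NNP"
        fixed.append((word, tag))
    return fixed


sample = [("a", "DT"), ("US", "PRP"), ("Gallon", "NNP")]
print(fix_us_pronoun(sample))
# [('a', 'DT'), ('US', 'NNP'), ('Gallon', 'NNP')]
```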

Stemmers

A stemmer is a tool that takes a word with a suffix attached and returns its ‘stem’, or base word.

Example:
Input: dogs
Output: dog

While neither stemmer is perfect, they both do a decent job. MontyLingua is more inclined to take the ‘S’ off the end of a word, while the NLTK WordNetLemmatizer doesn’t always take it off. ‘Cows’ is an example of a word that the WordNetLemmatizer will not stem to ‘Cow’ but MontyLingua will. On the other hand, MontyLingua is more likely to take the ‘S’ off the end of an acronym, and I wrote code to correct that in some cases: if a word is shorter than 4 characters or is all consonants, I don’t run it through the MontyLingua stemmer. The all-consonants check is there to catch some acronyms. When using MontyLingua on a specific part of speech, it’s important to specify whether the word is a noun or a verb with the ‘pos’ parameter. Since I’m only stemming nouns, I used pos=’noun’.
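The acronym guard can be sketched as follows. The names should_stem, safe_stem, and strip_s are hypothetical, and strip_s is only a toy stand-in for a real noun stemmer. The sketch also assumes that “all consonants” means “contains none of a, e, i, o, u”, so y counts as a consonant.

```python
VOWELS = set("aeiou")


def should_stem(word):
    """Return False for words the guard described above would skip."""
    if len(word) < 4:
        return False
    if not any(ch in VOWELS for ch in word.lower()):
        return False  # no vowels at all: likely an acronym
    return True


def safe_stem(word, stemmer):
    """Apply stemmer only when the guard allows it."""
    return stemmer(word) if should_stem(word) else word


def strip_s(word):
    """Toy stand-in for a real noun stemmer: drops one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


for w in ["dogs", "PCs", "DVDs"]:
    print(w, "->", safe_stem(w, strip_s))
```

Here “dogs” is stemmed, while “PCs” (too short) and “DVDs” (no vowels) pass through unchanged.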

Results

The first set of results reflects not only a change in taggers but also changes in the stemmer and sentence tokenizer. I also ran a second comparison using the MontyLingua tagger with the NLTK stemmer and sentence tokenizer.

Phrases found by one toolchain and not by the other are shown first; each was able to find some phrases that the other missed. Hits is the number of times a phrase comes up, and it is shown only where there is a discrepancy: if MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected there. The first numbers are totals summed over every discrepancy. There is also a graph below showing how many of each difference there are. For example, there were 157 cases where the discrepancy was 1 hit and MontyLingua came out on top, and 78 cases where the hit counts differed by 1 and NLTK had more. An interesting case is the one word for which MontyLingua had 40 more hits than NLTK. That word was elephant.
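A comparison of this kind can be sketched with two phrase-to-hits dictionaries. This is only an illustration under assumptions: the function name compare_hits and the sample data are invented, and the sketch assumes the hit-difference tally covers only phrases that both toolchains found.

```python
from collections import Counter


def compare_hits(hits_a, hits_b):
    """Compare two phrase -> hit-count mappings.

    Returns (only_a, only_b, a_ahead, b_ahead): the phrases unique to
    each side, plus Counters mapping a hit difference to the number of
    shared phrases where that side was ahead by that much.
    """
    only_a = set(hits_a) - set(hits_b)
    only_b = set(hits_b) - set(hits_a)
    a_ahead, b_ahead = Counter(), Counter()
    for phrase in set(hits_a) & set(hits_b):
        diff = hits_a[phrase] - hits_b[phrase]
        if diff > 0:
            a_ahead[diff] += 1
        elif diff < 0:
            b_ahead[-diff] += 1
    return only_a, only_b, a_ahead, b_ahead


# Invented sample data, not the real results.
ml_hits = {"elephant": 45, "dog": 3, "cat": 2, "emu": 1}
nltk_hits = {"elephant": 5, "dog": 4, "cat": 2, "yak": 1}

only_ml, only_nltk, ml_ahead, nltk_ahead = compare_hits(ml_hits, nltk_hits)
print(sorted(only_ml), sorted(only_nltk))  # ['emu'] ['yak']
print(dict(ml_ahead), dict(nltk_ahead))    # {40: 1} {1: 1}
```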

MontyLingua toolchain vs NLTK toolchain
In MontyLingua but not NLTK: 514
In NLTK but not MontyLingua: 403

Total Hits: MontyLingua: 1421 vs NLTK: 1184

[Graph: MontyLingua vs NLTK]

MontyLingua vs NLTK

Hit Difference   MontyLingua Ahead   NLTK Ahead
1                157                 78
2                35                  10
3                10                  0
4                4                   1
5                1                   0
6                1                   0
13               0                   1
14               2                   0
40               1                   0

On average, MontyLingua had more hits than NLTK on words.
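As a quick sanity check on the table above, weighting each row by its hit difference should reproduce the gap between the two hit totals (1421 and 1184). The snippet below does that arithmetic with the rows copied from the table; the variable names are just for illustration.

```python
# Rows copied from the table above: (hit difference, MontyLingua, NLTK)
ROWS = [
    (1, 157, 78), (2, 35, 10), (3, 10, 0), (4, 4, 1), (5, 1, 0),
    (6, 1, 0), (13, 0, 1), (14, 2, 0), (40, 1, 0),
]

# Extra hits each side gained across all discrepancies.
ml_extra = sum(diff * ml for diff, ml, _ in ROWS)
nltk_extra = sum(diff * nltk for diff, _, nltk in ROWS)

print(ml_extra, nltk_extra, ml_extra - nltk_extra)  # 352 115 237
print(1421 - 1184)                                  # 237
```

The net of 237 extra hits for MontyLingua matches the difference between the reported totals.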

MontyLingua Tagger NLTK Stemmer & Tokenizer (ML-NLTK) vs MontyLingua Toolchain
For the sake of completeness here are the results of the MontyLingua tagger with the NLTK stemmer and tokenizer.

In ML-NLTK but not in MontyLingua: 65
In MontyLingua but not in ML-NLTK: 68

Total Hits: ML-NLTK: 290 vs MontyLingua: 299

Hit Difference   MontyLingua Ahead   ML-NLTK Ahead
1                20                  17
2                0                   2
10               1                   0

Total Phrases Found by Each Toolchain

Name          Phrase Count
NLTK          3777
ML-NLTK       3885
MontyLingua   3888

At the end of the day, I’ll be using the MontyLingua toolchain with the slight modifications I’ve made (mentioned above). I’m definitely still using NLTK, just for different tasks. NLTK has a great, easy-to-use regexp chunker that I’ll continue to use.

Again, a tagger’s performance can vary greatly based on the data used to train and test it. I was testing them on about 12,000 webpages I downloaded, looking for specific phrases. On a different data set, NLTK may turn out to be better.

Posted on March 28, 2009 at 10:23 pm by Joe
In: Python

5 Responses


  1. Written by Daniel Loureiro
    on November 13, 2009 at 9:13 pm

    Thank you for sharing your results. Saved me a lot of work and possibly headaches!

    • Written by Joe
      on November 15, 2009 at 4:19 pm

      You’re welcome. I was hoping it would be useful to other people, but keep in mind NLTK is still active while MontyLingua’s last release was in 2004. At some point NLTK could surpass it.

  2. Written by Justin
    on December 12, 2009 at 4:38 pm

    Hi Joe,

    Yes, thanks so much for writing this! :-)

    I’ve used MontyPython for other projects but I’m intrigued about the NLTK.

    Can you explain how you used the NLTK’s regex chunker?

    Did you use a grammar that came with the NLTK module or did you write your own? And would you mind sharing either the code that accesses the NLTK chunker & grammar or the grammar you used?

    cheers,

    Justin

    • Written by Joe
      on December 20, 2009 at 10:18 am

      Sure, I’ll do it in another post. Since I’m using MontyLingua for the tagger and NLTK’s regex chunker there was a bit of conversion that needed to happen to go from one to the other. I’ll post the whole process. I’ll send you an email when it’s posted, but don’t expect anything too soon because of the time of year and all.

  3. Written by NLTK Regular Expression Parser (RegexpParser) - Blog::Quibb
    on January 27, 2010 at 10:04 am

    […] word in a sentence with its part of speech.  Here is a small comparison I did of python taggers: NLTK vs MontyLingua Part of Speech Taggers.  The NLTK RegexpParser works by running regular expressions on top of the part of speech tags […]
