<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog::Quibb &#187; nlp</title>
	<atom:link href="http://blog.quibb.org/tag/nlp/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.quibb.org</link>
	<description>Software development and more.</description>
	<lastBuildDate>Tue, 10 Aug 2010 14:11:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>NLTK Regular Expression Parser (RegexpParser)</title>
		<link>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/</link>
		<comments>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 13:53:35 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=122</guid>
		<description><![CDATA[The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language.  One such tool is the Regular Expression Parser.  If you&#8217;re familiar with regular expressions, it can be a useful tool in natural language processing. Background Information You must first be familiar with regular expressions to be able to fully utilize [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a> provides a variety of tools for dealing with natural language.  One such tool is the Regular Expression Parser.  If you&#8217;re familiar with regular expressions, it can be a useful tool in <a title="natural language processing" href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a>.</p>
<h2>Background Information</h2>
<p>You must first be familiar with regular expressions to be able to fully utilize the <a title="RegexpParser" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk.regexp.RegexpParser-class.html">RegexpParser</a>/<a title="RegexpChunkParser" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk.regexp.RegexpChunkParser-class.html">RegexpChunkParser</a>.  If you need to learn about regular expressions, here is a site with an abundance of information to get you started: <a title="Regular Expressions" href="http://www.regular-expressions.info">http://www.regular-expressions.info</a>.  It is also necessary to know how to use a tagger, and what the tags mean.  A <a title="tagger" href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">tagger</a> is a tool that marks each word in a sentence with its part of speech.  Here is a small comparison I did of Python taggers: <a title="NLTK vs MontyLingua Part of Speech Taggers" href="http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/">NLTK vs MontyLingua Part of Speech Taggers</a>.  The NLTK RegexpParser works by running regular expressions on top of the part of speech tags added by a tagger.  The tags used throughout the rest of this post are in the style of the <a title="Brown Corpus Tags" href="http://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used">Brown Corpus tags</a>; strictly speaking, tags such as DT and PRP belong to the closely related Penn Treebank tag set, which is what NLTK&#8217;s default tagger emits.  On a side note, the RegexpParser can be used with either the NLTK or <a title="MontyLingua" href="http://en.wikipedia.org/wiki/MontyLingua">MontyLingua</a> tagger.</p>
<h2>Basic RegexpParser Usage</h2>
<p>Let me start by going over the &#8220;how to&#8221; provided in the NLTK documentation.  The source of this information is here: <a title="NLTK RegexpParser HowTo" href="http://nltk.googlecode.com/svn/trunk/doc/howto/chunk.html">NLTK RegexpParser HowTo</a>.  The documentation goes through how you could use the RegexpParser/RegexpChunkParser to do a traditional parse of a sentence.</p>
<p>The RegexpParser/RegexpChunkParser works by defining rules for grouping words together.  A simple example would be: &#8220;NP: {&lt;DT&gt;? &lt;JJ&gt;* &lt;NN&gt;*}&#8221;.  This rule groups words into a noun phrase: an optional determiner (usually an article), then zero or more adjectives, followed by zero or more nouns.  In the how-to, they go over prepositions and creating prepositional phrases from a preposition and a noun phrase.  It&#8217;s important to note that chunks defined by earlier rules can be used in later ones (PP, for example, is built from P and NP).  Also, regular expression syntax can be used inside a tag (as in &lt;V.*&gt;) or applied to whole tags (as in &lt;JJ&gt;*).</p>
<p>Here is the example from the NLTK website:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #dc143c;">parser</span> = RegexpParser<span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
    NP: {&lt;DT&gt;? &lt;JJ&gt;* &lt;NN&gt;*} # NP
    P: {&lt;IN&gt;}           # Preposition
    V: {&lt;V.*&gt;}          # Verb
    PP: {&lt;P&gt; &lt;NP&gt;}      # PP -&gt; P NP
    VP: {&lt;V&gt; &lt;NP|PP&gt;*}  # VP -&gt; V (NP|PP)*
    '</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span></pre></div></div>
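<p>To make the idea of &#8220;regular expressions over tags&#8221; concrete, here is a minimal standard-library sketch.  It is not NLTK code and makes no claim about how NLTK is implemented: the function name <code>chunk_noun_phrases</code> is hypothetical, the tags are written out as a plain string, and the rule is a slightly stricter variant of NP that requires at least one noun.</p>

```python
import re


def chunk_noun_phrases(tagged):
    """Group words whose tags match DT? JJ* NN+ (a simplified NP rule)."""
    # Write the tag sequence as one string, each tag followed by a space,
    # e.g. "DT JJ NN VBD ".
    tag_string = "".join(tag + " " for _, tag in tagged)
    pattern = re.compile(r"\b(?:DT )?(?:JJ )*(?:NN )+")
    chunks = []
    for match in pattern.finditer(tag_string):
        # One space per tag, so counting spaces converts character
        # offsets back into token positions.
        first = tag_string.count(" ", 0, match.start())
        last = first + match.group().count(" ")
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(chunk_noun_phrases(sentence))
# [['the', 'little', 'cat'], ['the', 'mat']]
```

<p>The RegexpParser adds a lot on top of this, such as named chunks, rules that build on earlier rules, and a tree as output, but the matching is still driven by the tags rather than by the words.</p>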

<h2>Alternative RegexpParser Usage</h2>
<p>I call this an alternate usage because it can be used to find patterns that aren&#8217;t necessarily related to grammatical phrases in English.  It can be used to find any pattern in a sentence.  Let me start by showing the regular expression grammar from my program.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">grammar = <span style="color: #483d8b;">&quot;&quot;&quot;
	NP:   {&lt;PRP&gt;?&lt;JJ.*&gt;*&lt;NN.*&gt;+}
	CP:   {&lt;JJR|JJS&gt;}
	VERB: {&lt;VB.*&gt;}
	THAN: {&lt;IN&gt;}
	COMP: {&lt;DT&gt;?&lt;NP&gt;&lt;RB&gt;?&lt;VERB&gt;&lt;DT&gt;?&lt;CP&gt;&lt;THAN&gt;&lt;DT&gt;?&lt;NP&gt;}
	&quot;&quot;&quot;</span>
<span style="color: #008000;">self</span>.<span style="color: black;">chunker</span> = RegexpParser<span style="color: black;">&#40;</span>grammar<span style="color: black;">&#41;</span></pre></div></div>

<p>I was using it to look for a specific pattern in a sentence.  The first part, NP, is looking for a noun phrase.  The &lt;PRP&gt;? is there because of a bug found in the tagger I was using.  It was marking An with a capital &#8216;A&#8217; as a PRP (Pronoun) rather than a DT (Determiner/Article).  I found another workaround for the bug, but left the PRP in there to catch anything that might have slipped through.</p>
<p>Then it moves on to the CP, which is the comparison word.  JJR tagged words are comparative adjectives, such as bigger, smaller, and larger.  JJS tagged words are superlative adjectives, such as biggest, smallest, and largest.</p>
<p>The next two are simply the VERB and the word THAN.  The VERB rule matches any tag that starts with VB, so it covers every verb form (VB, VBD, VBZ, and so on).  The IN tag denotes a preposition or subordinating conjunction, so THAN can match more than the word than.  In this case, I was looking specifically for the word than.</p>
<p>The last line is COMP.  This is the regular expression that puts it all together.  This was looking for a size comparison of two objects.  It might be easier to look at the output of this part of the expression than trying to explain it piece by piece.  The only tag not explained above is RB, which is an adverb.</p>
<p>Here is the parse for the sentence &#8220;Everyone knows an elephant is larger than a dog.&#8221;:</p>
<pre>
(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
</pre>
<p>The output is a simple tree, which makes data extraction easy.  It&#8217;s easy to see there are many possibilities that open up when looking for patterns in English text.  May this help you in your data mining endeavors.</p>
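<p>As a rough illustration of that extraction step, the sketch below reads the printed tree from above back into nested lists and pulls out the two nouns being compared.  It uses only the standard library, and the helper names (<code>parse_tree</code>, <code>find</code>, <code>compared_nouns</code>) are made up for this example.  With NLTK itself you would more likely walk the Tree object returned by the parser, for example with its <code>subtrees()</code> method.</p>

```python
import re

TREE_TEXT = """
(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
"""


def parse_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    stack = [[]]
    for token in re.findall(r"\(|\)|[^\s()]+", text):
        if token == "(":
            stack.append([])
        elif token == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    return stack[0][0]


def find(node, label):
    """Yield every subtree whose label (first element) equals `label`."""
    if isinstance(node, list):
        if node and node[0] == label:
            yield node
        for child in node[1:]:
            yield from find(child, label)


def compared_nouns(tree):
    """Return a (first, second) noun pair for each COMP chunk."""
    pairs = []
    for comp in find(tree, "COMP"):
        # Take the last word of each noun phrase and drop its "/TAG" suffix.
        nouns = [np[-1].rsplit("/", 1)[0] for np in find(comp, "NP")]
        if len(nouns) == 2:
            pairs.append(tuple(nouns))
    return pairs


print(compared_nouns(parse_tree(TREE_TEXT)))
# [('elephant', 'dog')]
```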
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Spell Checking in Python</title>
		<link>http://blog.quibb.org/2009/04/spell-checking-in-python/</link>
		<comments>http://blog.quibb.org/2009/04/spell-checking-in-python/#comments</comments>
		<pubDate>Sat, 11 Apr 2009 15:38:10 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spelling checker]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=54</guid>
		<description><![CDATA[I was looking into spell checking in Python.  I found spell4py, and downloaded the zip, but couldn&#8217;t get it to build on my system.  If I tried a bit longer maybe, but in the end my solution worked out fine.  This library was overkill for my needs too. I found this article here: http://code.activestate.com/recipes/117221/ This [...]]]></description>
			<content:encoded><![CDATA[<p>I was looking into spell checking in Python.  I found <a href="http://www.keyphrene.com/products/4py/">spell4py</a>, and downloaded the zip, but couldn&#8217;t get it to build on my system.  If I tried a bit longer maybe, but in the end my solution worked out fine.  This library was overkill for my needs too.</p>
<p>I found this article here: <a href="http://code.activestate.com/recipes/117221/">http://code.activestate.com/recipes/117221/</a></p>
<p>This seemed to work well for my purposes, but I wanted to test out other spell checking libraries.  Mozilla Firefox, Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I&#8217;m testing the spelling of words on the Internet).  Here are some Python snippets to get you up and running with the popular spelling checkers.  I modified them to take a string of several words, split it up, and return a list of suggestions for each word.  They do require each spelling checker to be installed.  I was able to do this through the openSuSE package manager.</p>
<p><a href="http://en.wikipedia.org/wiki/Ispell"><strong>Ispell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> ispell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ispell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">8</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">&quot;word: ok&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>s<span style="color: black;">&#91;</span><span style="color: #ff4500;">17</span>:-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/GNU_Aspell"><strong>Aspell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> aspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;aspell -a&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/Hunspell"><strong>Hunspell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> hunspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;hunspell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">lower</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'+'</span>:
                <span style="color: #ff7700;font-weight:bold;">pass</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>
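<p>The three classes above rely on the <code>popen2</code> module, which was deprecated in Python 2.6 and removed in Python 3.  Below is a minimal sketch of the same idea using <code>subprocess</code>.  It assumes the checker speaks the ispell pipe protocol (as <code>aspell -a</code> does); the names <code>parse_reply</code> and <code>PipeSpeller</code> are hypothetical, and the parsing is kept separate so it can be tried without any spell checker installed.</p>

```python
import subprocess


def parse_reply(line):
    """Interpret one reply line of the ispell pipe protocol.

    Returns None for a correctly spelled word, otherwise a (possibly
    empty) list of suggestions.
    """
    line = line.strip()
    if not line or line[0] in "*+-":
        return None  # correct: exact match, via a root word, or a compound
    if line[0] == "#":
        return []    # misspelled, and the checker has no suggestions
    # Remaining replies look like "& word count offset: first, second".
    _, _, rest = line.partition(":")
    return [s for s in rest.strip().split(", ") if s]


class PipeSpeller:
    """Feed words to a spell checker process, one per line."""

    def __init__(self, command=("aspell", "-a")):
        self._proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
        self._proc.stdout.readline()  # skip the version banner

    def __call__(self, words):
        output = []
        for word in words.split():
            self._proc.stdin.write(word + "\n")
            self._proc.stdin.flush()
            output.append(parse_reply(self._proc.stdout.readline()))
            self._proc.stdout.readline()  # skip the blank line
        return output


print(parse_reply("& teh 3 0: the, tea, ten"))
# ['the', 'tea', 'ten']
```

<p>With aspell installed, <code>PipeSpeller()("teh dog")</code> would return one entry per word.  Unlike the hunspell class above, this sketch appends None for a &#8216;+&#8217; reply as well, so the result always lines up with the input words.</p>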

<p>After doing this and seeing the suggestions, I decided a spell checker isn&#8217;t really what I was looking for.  A spelling checker always tries to make a suggestion, and I wanted to filter out things from a database.  I started this with the hope that I would be able to take misspellings and convert them into the correct word.  In the end, I just removed words that were not spelled correctly using <a href="http://wordnet.princeton.edu/">WordNet</a> through <a href="http://www.nltk.org/">NLTK</a>.  WordNet had a bigger dictionary than most of the spell checkers, which also helped in the filtering task.  NLTK has a simple <a href="http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html">how to</a> on how to get started using WordNet.</p>
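<p>The filtering step can be sketched in a few lines.  This is an illustration under assumptions, not the original code: <code>keep_known_words</code> is a hypothetical helper, and a toy vocabulary stands in for WordNet.  With NLTK and its WordNet data installed, the predicate could instead check whether <code>wordnet.synsets(word)</code> (from <code>nltk.corpus</code>) returns anything.</p>

```python
def keep_known_words(words, is_known):
    """Keep only the words that the `is_known` predicate recognises."""
    return [word for word in words if is_known(word.lower())]


# Toy stand-in for a real dictionary lookup.
toy_vocabulary = {"elephant", "dog", "larger"}

print(keep_known_words(["Elephant", "elefant", "dog"],
                       toy_vocabulary.__contains__))
# ['Elephant', 'dog']
```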
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/04/spell-checking-in-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>NLTK vs MontyLingua Part of Speech Taggers</title>
		<link>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/</link>
		<comments>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/#comments</comments>
		<pubDate>Sun, 29 Mar 2009 02:23:50 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[taggers]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=37</guid>
		<description><![CDATA[This is a comparison of the part of speech taggers available in python. As far as I know, these are the most prominent python taggers. Let me know if you think another tagger should be added to the comparison. MontyLingua includes several natural language processing (NLP) tools. The ones that I used in this comparison [...]]]></description>
			<content:encoded><![CDATA[<p>This is a comparison of the part of speech taggers available in python.  As far as I know, these are the most prominent python taggers.  Let me know if you think another tagger should be added to the comparison.</p>
<p><a href="http://web.media.mit.edu/~hugo/montylingua/">MontyLingua</a> includes several natural language processing (NLP) tools.  The ones that I used in this comparison were the stemmer, tagger, and sentence tokenizer.  <a href="http://www.nltk.org/">The Natural Language Toolkit (NLTK)</a> is another set of Python tools for natural language processing.  It has a much greater breadth of tools than MontyLingua.  It has taggers, parsers, tokenizers, chunkers, and stemmers.  It usually has a few different implementations of each, giving its users different options.  For stemming, for example, it has the Porter and Lancaster stemmers as well as the WordNet lemmatizer (Punkt, by contrast, is its sentence tokenizer).  Both of these tools are written to aid in NLP using Python.</p>
<h2 style="text-align: left;"><strong>Taggers</strong></h2>
<p style="text-align: left;">For those that don&#8217;t know, a tagger is an NLP tool that will mark the part of speech of a word.</p>
<p>Example:<br />
Input: &#8220;A dog walks&#8221;<br />
Output: &#8220;A/DT dog/NN walks/VBZ&#8221;</p>
<p>The meanings of the tokens after the / can be <a href="http://en.wikipedia.org/wiki/Brown_Corpus">found here</a>.</p>
<p>For NLTK, I&#8217;m comparing the built-in tagger to MontyLingua.  I didn&#8217;t do any training at all and just called nltk.tag.pos_tag().  I used the taggers mostly as is, with some slight modifications.  I added a RegExp tagger in front of the NLTK tagger, and made the default tagger the backoff tagger.  It always marks A, An, and The as DT.  Having them marked as NNP was annoying and was messing up my results.  They were capitalized, and I suppose the tagger thought they were either initials or proper names.</p>
<p>MontyLingua, on the other hand, was always marking &#8220;US&#8221; as a pronoun.  This was a problem when scanning sentences that said &#8220;US Pint&#8221; or &#8220;US Gallon.&#8221;  I look at the word before &#8220;US&#8221; and see if it&#8217;s an article; if it is, I allow it to continue being processed.  Neither tagger is perfect, but it becomes clear that one may be better than the other for my use-case.  It may be different for yours.  I&#8217;m scanning sentences from the web.</p>
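<p>The effect of that override can be sketched without NLTK.  The code below is only an approximation of the setup described above: it lets a backoff tagger tag the whole sentence and then forces the capitalized articles to DT.  The function name and the stand-in tagger are hypothetical; in NLTK the equivalent pieces would be a RegexpTagger with its <code>backoff</code> argument set.</p>

```python
CAPITALIZED_ARTICLES = {"A", "An", "The"}


def tag_with_article_override(tokens, backoff_tagger):
    """Tag with `backoff_tagger`, then force A, An and The to DT."""
    return [(word, "DT") if word in CAPITALIZED_ARTICLES else (word, tag)
            for word, tag in backoff_tagger(tokens)]


def stand_in_tagger(tokens):
    """Hypothetical tagger that mislabels every capitalized word as NNP."""
    return [(t, "NNP" if t[:1].isupper() else "NN") for t in tokens]


print(tag_with_article_override(["An", "elephant"], stand_in_tagger))
# [('An', 'DT'), ('elephant', 'NN')]
```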
<h2 style="text-align: left;"><strong>Stemmers</strong></h2>
<p style="text-align: left;">A stemmer is a tool that will take a word with a suffix attached to it, and return the &#8216;stem&#8217; or base word of it.</p>
<p>Example:<br />
Input: dogs<br />
Output: dog</p>
<p>While neither stemmer is perfect, they both do a decent job.  MontyLingua is more inclined to take the &#8216;S&#8217; off the end of something, and the NLTK WordNetLemmatizer doesn&#8217;t always take it off.  &#8216;Cows&#8217; is an example of a word the WordNetLemmatizer will not stem to &#8216;Cow&#8217; but MontyLingua will.  On the other hand, MontyLingua is more likely to take the &#8216;S&#8217; off the end of an acronym, and I wrote code to correct that in some cases.  If a word is shorter than 4 characters or consists only of consonants, I don&#8217;t run it through the MontyLingua stemmer.  The all-consonants check is there to catch some acronyms.  When using the MontyLingua stemmer on a specific part of speech, it&#8217;s important to specify whether it&#8217;s a <em>noun</em> or a <em>verb</em> with the &#8216;pos&#8217; parameter.  Since I&#8217;m only stemming nouns, I used pos='noun'.</p>
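<p>The guard in front of the stemmer might look something like the sketch below.  The names are hypothetical, the stemmer is passed in as a plain function rather than being a real MontyLingua call, and treating only a, e, i, o and u as vowels is an assumption.</p>

```python
VOWELS = set("aeiou")


def should_stem(word):
    """Skip short words and all-consonant words (often acronyms)."""
    if len(word) < 4:
        return False
    return any(ch in VOWELS for ch in word.lower())


def safe_stem(word, stem):
    """Apply `stem` only to words that pass the guard."""
    return stem(word) if should_stem(word) else word


def strip_s(word):
    """Toy stemmer used for the demonstration: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


print([safe_stem(w, strip_s) for w in ["dogs", "US", "DVDs", "cows"]])
# ['dog', 'US', 'DVDs', 'cow']
```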
<h2 style="text-align: left;"><strong>Results</strong></h2>
<p>The first results don&#8217;t only reflect a change in taggers, but changes in the stemmer and sentence tokenizer also.  I did another comparison using the MontyLingua tagger with the NLTK stemmer and sentence tokenizer for comparison.</p>
<p>The number of phrases found by one toolchain but not the other is shown first.  They both were able to find some words that were not found by the other.  Hits is the number of times a phrase comes up; it is displayed only if there is a discrepancy.  If MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected there.  The first numbers are totals for every discrepancy summed.  There is also a graph below showing how many of each difference there is.  For example, there were 157 times that there was a discrepancy of 1 hit and MontyLingua came out on top.  There were 78 times the number of hits differed by 1 and NLTK had more.  An interesting case: for one word, MontyLingua had 40 more hits than NLTK.  That word was elephant.</p>
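<p>One way to compute this kind of summary is sketched below.  It is an interpretation of the description above, not the script that produced the numbers: it assumes each toolchain yields a mapping from phrase to hit count, and that the per-difference table covers only phrases found by both.  The function name and the toy data are hypothetical.</p>

```python
from collections import Counter


def compare_hits(hits_a, hits_b):
    """Summarise where two phrase-to-hit-count mappings disagree."""
    only_a = sorted(p for p in hits_a if p not in hits_b)
    only_b = sorted(p for p in hits_b if p not in hits_a)
    a_ahead, b_ahead = Counter(), Counter()
    for phrase in hits_a.keys() & hits_b.keys():
        diff = hits_a[phrase] - hits_b[phrase]
        if diff > 0:
            a_ahead[diff] += 1    # A found this phrase `diff` more times
        elif diff < 0:
            b_ahead[-diff] += 1   # B found this phrase `-diff` more times
    return only_a, only_b, a_ahead, b_ahead


toy_a = {"elephant": 5, "dog": 2, "cat": 1}
toy_b = {"elephant": 3, "dog": 3, "mouse": 1}
only_a, only_b, a_ahead, b_ahead = compare_hits(toy_a, toy_b)
print(only_a, only_b, dict(a_ahead), dict(b_ahead))
# ['cat'] ['mouse'] {2: 1} {1: 1}
```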
<p>MontyLingua toolchain vs NLTK toolchain<br />
In MontyLingua but not NLTK: 514<br />
In NLTK but not MontyLingua: 403</p>
<p>Total Hits: MontyLingua: 1421 vs NLTK: 1184</p>
<p style="text-align: center;">
<div id="attachment_38" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.quibb.org/wp-content/uploads/2009/03/monty_v_nltk.png"><img class="size-medium wp-image-38" title="monty_v_nltk" src="http://blog.quibb.org/wp-content/uploads/2009/03/monty_v_nltk-300x266.png" alt="MontyLingua vs NLTK Graph" width="300" height="266" /></a><p class="wp-caption-text">MontyLingua vs NLTK</p></div>
<table style="height: 222px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Hit Count</td>
<td>MontyLingua</td>
<td>NLTK</td>
</tr>
<tr>
<td>1</td>
<td>157</td>
<td>78</td>
</tr>
<tr>
<td>2</td>
<td>35</td>
<td>10</td>
</tr>
<tr>
<td>3</td>
<td>10</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>13</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>14</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>40</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>On average, MontyLingua had more hits than NLTK on words.</p>
<p>MontyLingua Tagger NLTK Stemmer &amp; Tokenizer (ML-NLTK) vs MontyLingua Toolchain<br />
For the sake of completeness here are the results of the MontyLingua tagger with the NLTK stemmer and tokenizer.</p>
<p>In ML-NLTK but not in MontyLingua: 65<br />
In MontyLingua but not in ML-NLTK: 68</p>
<p>Total Hits: ML-NLTK: 290 vs MontyLingua: 299</p>
<table style="height: 90px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Hit Count</td>
<td>MontyLingua</td>
<td>ML-NLTK</td>
</tr>
<tr>
<td>1</td>
<td>20</td>
<td>17</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p style="text-align: center;">
<p style="text-align: center;"><span style="text-decoration: underline;">Total Phrases Found By</span></p>
<table style="height: 90px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Name</td>
<td>Phrase Count</td>
</tr>
<tr>
<td>NLTK</td>
<td>3777</td>
</tr>
<tr>
<td>ML-NLTK</td>
<td>3885</td>
</tr>
<tr>
<td>MontyLingua</td>
<td>3888</td>
</tr>
</tbody>
</table>
<p>At the end of the day, I&#8217;ll be using the MontyLingua toolchain with some slight modifications I&#8217;ve made (mentioned above).  I&#8217;m definitely still using NLTK, just for different tasks.  NLTK has a great and easy to use regexp chunker that I&#8217;ll continue to use.</p>
<p>Again, a tagger&#8217;s performance can vary greatly based on the data used to train and test it.  I was testing them on about 12,000 webpages I downloaded and looking for specific phrases.  On a different data set NLTK may turn out to be better.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
