<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blog::Quibb &#187; python</title>
	<atom:link href="http://blog.quibb.org/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.quibb.org</link>
	<description>Software development and more.</description>
	<lastBuildDate>Tue, 10 Aug 2010 14:11:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>NLTK Regular Expression Parser (RegexpParser)</title>
		<link>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/</link>
		<comments>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 13:53:35 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=122</guid>
		<description><![CDATA[The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language.  One such tool is the Regular Expression Parser.  If you&#8217;re familiar with regular expressions, it can be a useful tool in natural language processing. Background Information You must first be familiar with regular expressions to be able to fully utilize [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a> provides a variety of tools for dealing with natural language.  One such tool is the Regular Expression Parser.  If you&#8217;re familiar with regular expressions, it can be a useful tool in <a title="natural language processing" href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a>.</p>
<h2>Background Information</h2>
<p>To make full use of the <a title="RegexpParser" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk.regexp.RegexpParser-class.html">RegexpParser</a>/<a title="RegexpChunkParser" href="http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk.regexp.RegexpChunkParser-class.html">RegexpChunkParser</a>, you first need to be familiar with regular expressions.  If you need to learn about regular expressions, here is a site with an abundance of information to get you started: <a title="Regular Expressions" href="http://www.regular-expressions.info">http://www.regular-expressions.info</a>.  It is also necessary to know how to use a tagger, and what the tags mean.  A <a title="tagger" href="http://en.wikipedia.org/wiki/Part-of-speech_tagging">tagger</a> is a tool that marks each word in a sentence with its part of speech.  Here is a small comparison I did of Python taggers: <a title="NLTK vs MontyLingua Part of Speech Taggers" href="http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/">NLTK vs MontyLingua Part of Speech Taggers</a>.  The NLTK RegexpParser works by running regular expressions over the part of speech tags added by a tagger.  The <a title="Brown Corpus Tags" href="http://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used">Brown Corpus tags</a> will be the tags used throughout the rest of this post, and are commonly used by taggers in general.  On a side note, the RegexpParser can be used with either the NLTK or <a title="MontyLingua" href="http://en.wikipedia.org/wiki/MontyLingua">MontyLingua</a> tagger.</p>
<h2>Basic RegexpParser Usage</h2>
<p>Let me start by going over the &#8220;how to&#8221; provided in the NLTK documentation.  The source of this information is here: <a title="NLTK RegexpParser HowTo" href="http://nltk.googlecode.com/svn/trunk/doc/howto/chunk.html">NLTK RegexpParser HowTo</a>.  The documentation goes through how you could use the RegexpParser/RegexpChunkParser to do a traditional parse of a sentence.</p>
<p>The RegexpParser/RegexpChunkParser works by defining rules for grouping different words together.  A simple example would be: &#8220;NP: {&lt;DT&gt;? &lt;JJ&gt;* &lt;NN&gt;*}&#8221;.  This defines a rule that groups words into a noun phrase.  It will group an optional determiner (usually an article), then zero or more adjectives, followed by zero or more nouns.  In the how to, they go over prepositions and creating prepositional phrases from a preposition and a noun phrase.  It&#8217;s important to note that chunks defined by earlier rules can be used in later ones.  Also, regular expression syntax can occur within the tags (as in &lt;V.*&gt;) or apply to the tags themselves (as in &lt;JJ&gt;*).</p>
<p>Here is the example from the NLTK website:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #dc143c;">parser</span> = RegexpParser<span style="color: black;">&#40;</span><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
    NP: {&lt;DT&gt;? &lt;JJ&gt;* &lt;NN&gt;*} # NP
    P: {&lt;IN&gt;}           # Preposition
    V: {&lt;V.*&gt;}          # Verb
    PP: {&lt;P&gt; &lt;NP&gt;}      # PP -&gt; P NP
    VP: {&lt;V&gt; &lt;NP|PP&gt;*}  # VP -&gt; V (NP|PP)*
    '</span><span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span></pre></div></div>
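<p>To make the idea of &#8220;regular expressions over tags&#8221; concrete, here is a minimal sketch that uses only the standard <em>re</em> module.  It is not NLTK&#8217;s implementation: the helper names <em>tag_pattern_to_regex</em> and <em>chunk</em> and the hand-tagged sentence are assumptions made for illustration.  The tags are written out as one string, the tag pattern is turned into an ordinary regular expression, and each match is mapped back to the words it covers.</p>

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>? <JJ>* <NN>*' into a regex over a tag string."""
    pattern = re.sub(r"\s+", "", tag_pattern)  # whitespace in a tag pattern is ignored
    # Wrap every <...> in a group so that ?, * and + apply to whole tags.
    pattern = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)
    # Inside a tag, '.' must not be allowed to run across a tag boundary.
    pattern = pattern.replace(".", "[^<>]")
    return re.compile(pattern)

def chunk(tagged, tag_pattern):
    """Return the groups of words whose tags match tag_pattern, left to right."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    chunks = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # every part of the pattern is optional, so skip empty matches
        # Map character offsets back to token positions by counting '<'.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

# A hand-tagged sentence, so no tagger is needed for the sketch.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]

noun_phrases = chunk(tagged, "<DT>? <JJ>* <NN>*")
verbs = chunk(tagged, "<V.*>")
print(noun_phrases)
print(verbs)
```

<p>The real RegexpParser does more than this, such as building a tree and letting later rules refer to earlier chunks, but the matching step works along the same lines.</p>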

<h2>Alternative RegexpParser Usage</h2>
<p>I call this an alternative usage because it can be used to find patterns that aren&#8217;t necessarily related to grammatical phrases in English.  It can be used to find any pattern in a sentence.  Let me start by showing the regular expression grammar from my program.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">grammar = <span style="color: #483d8b;">&quot;&quot;&quot;
	NP:   {&lt;PRP&gt;?&lt;JJ.*&gt;*&lt;NN.*&gt;+}
	CP:   {&lt;JJR|JJS&gt;}
	VERB: {&lt;VB.*&gt;}
	THAN: {&lt;IN&gt;}
	COMP: {&lt;DT&gt;?&lt;NP&gt;&lt;RB&gt;?&lt;VERB&gt;&lt;DT&gt;?&lt;CP&gt;&lt;THAN&gt;&lt;DT&gt;?&lt;NP&gt;}
	&quot;&quot;&quot;</span>
<span style="color: #008000;">self</span>.<span style="color: black;">chunker</span> = RegexpParser<span style="color: black;">&#40;</span>grammar<span style="color: black;">&#41;</span></pre></div></div>

<p>I was using it to look for a specific pattern in a sentence.  The first part, NP, is looking for a noun phrase.  The &lt;PRP&gt;? is there because of a bug found in the tagger I was using.  It was marking &#8220;An&#8221; with a capital &#8216;A&#8217; as a PRP (Pronoun) rather than a DT (Determiner/Article).  I found another workaround for the bug, but left the PRP in there to catch anything that might have slipped through.</p>
<p>Then it moves on to the CP, which is the comparison word.  JJR tagged words are comparative adjectives, such as bigger, smaller, and larger.  JJS tagged words are superlative adjectives, which signify the most or chief, such as biggest, smallest, and largest.</p>
<p>The next two are simply the VERB and the word THAN.  VERB matches any tag that starts with VB, one verb at a time, so a compound verb would need &lt;VB.*&gt;+ instead.  The IN tag denotes a preposition.  In this case, I was looking specifically for the word than, although the rule as written matches any word tagged IN.</p>
<p>The last line is COMP.  This is the regular expression that puts it all together.  This was looking for a size comparison of two objects.  It might be easier to look at the output of this part of the expression than to try to explain it piece by piece.  The only tag not explained above is RB, which is an adverb.</p>
<p>Here is the parse for the sentence &#8220;Everyone knows an elephant is larger than a dog.&#8221;:</p>
<pre>
(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
</pre>
<p>The output is a simple tree, which makes for easy data extraction.  It&#8217;s easy to see that many possibilities open up when looking for patterns in English text.  May this help you in your data mining endeavors.</p>
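<p>As a rough illustration of that extraction step, the sketch below reads the bracketed parse shown above back into nested lists and pulls out the two things being compared.  It uses only the standard library; the helper names (<em>parse_tree</em>, <em>subtrees</em>, <em>words</em>, <em>comparisons</em>) are made up for this example.  When working with NLTK directly, the tree object returned by the parser can be walked in a similar way without going through text.</p>

```python
import re

PARSE = """
(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
"""

def parse_tree(text):
    """Parse '(LABEL child ...)' text into nested [label, child, ...] lists."""
    stack = [["ROOT"]]
    for token in re.findall(r"[()]|[^\s()]+", text):
        if token == "(":
            node = []
            stack[-1].append(node)  # attach the new node to its parent
            stack.append(node)
        elif token == ")":
            stack.pop()
        else:
            stack[-1].append(token)  # a label or a word/TAG leaf
    return stack[0][1]

def subtrees(node, label):
    """Yield every subtree with the given label, in document order."""
    if isinstance(node, list):
        if node[0] == label:
            yield node
        for child in node[1:]:
            yield from subtrees(child, label)

def words(node):
    """Return the words under a node, with the /TAG suffix removed."""
    if isinstance(node, str):
        return [node.rsplit("/", 1)[0]]
    return [word for child in node[1:] for word in words(child)]

def comparisons(tree):
    """Yield (first noun phrase, comparison word, second noun phrase) per COMP chunk."""
    for comp in subtrees(tree, "COMP"):
        nps = [" ".join(words(np)) for np in subtrees(comp, "NP")]
        cps = [" ".join(words(cp)) for cp in subtrees(comp, "CP")]
        yield nps[0], cps[0], nps[-1]

found = list(comparisons(parse_tree(PARSE)))
print(found)
```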
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2010/01/nltk-regular-expression-parser-regexpparser/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Starting Python, Elixir, and SQLite</title>
		<link>http://blog.quibb.org/2009/05/starting-python-elixir-and-sqlite/</link>
		<comments>http://blog.quibb.org/2009/05/starting-python-elixir-and-sqlite/#comments</comments>
		<pubDate>Mon, 18 May 2009 22:41:09 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[elixir]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sqlite]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=59</guid>
		<description><![CDATA[When I did the post about Storm, someone suggested that I look into Elixir. Since I didn&#8217;t have time to at the time, I made a note of looking into it at a later time.  That time is now. :) Elixir and Storm are very similar, they&#8217;re both object relational mappers that provide an easy [...]]]></description>
			<content:encoded><![CDATA[<p>When I did the <a href="http://blog.quibb.org/2009/03/the-ease-of-python-sqlite-and-storm/">post</a> about <a href="https://storm.canonical.com/">Storm</a>, someone suggested that I look into Elixir.  Since I didn&#8217;t have time to at the time, I made a note of looking into it at a later time.  That time is now. :)</p>
<p><a href="http://elixir.ematia.de/trac/wiki">Elixir</a> and <a href="https://storm.canonical.com/">Storm</a> are very similar, they&#8217;re both object relational mappers that provide an easy way to map your objects to database tables.  In a future post, I&#8217;ll do a more in depth comparison between the two in a future post.</p>
<p>Starting out, Elixir uses <a href="http://www.sqlalchemy.org/">SQL Alchemy</a> as a backend.  While working with the tool you will probably find yourself running into things you may not understand if you&#8217;re not familiar with SQL Alchemy.  Keeping open a tab in firefox pointed at the SQL Alchemy documentation can be useful.  It does show through in certain instances.</p>
<p>There are two main starting points for an ORM tool.  There is the case where you&#8217;re starting with an existing database, and the case where you&#8217;re setting up the database from scratch.  Mapping to a table that already exists with Elixir can be a little tricky depending on the relationships.</p>
<p>It&#8217;s as simple as this to connect to a database:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">metadata.<span style="color: black;">bind</span> = <span style="color: #483d8b;">&quot;sqlite:///../sizedb.sqlite&quot;</span></pre></div></div>

<p>Here is a simple example:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> Location<span style="color: black;">&#40;</span>Entity<span style="color: black;">&#41;</span>:
    using_options<span style="color: black;">&#40;</span>tablename=<span style="color: #483d8b;">'TABLE_LOC'</span><span style="color: black;">&#41;</span>
    loc_id = Field<span style="color: black;">&#40;</span>Integer, primary_key=<span style="color: #008000;">True</span><span style="color: black;">&#41;</span>
    location = Field<span style="color: black;">&#40;</span>UnicodeText<span style="color: black;">&#41;</span></pre></div></div>

<p>And here is a more complex example of connecting to an existing database table:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> Comparison<span style="color: black;">&#40;</span>Entity<span style="color: black;">&#41;</span>:
    using_options<span style="color: black;">&#40;</span>tablename=<span style="color: #483d8b;">'TABLE_COMP'</span><span style="color: black;">&#41;</span>
    comp_id = Field<span style="color: black;">&#40;</span>Integer, primary_key=<span style="color: #008000;">True</span><span style="color: black;">&#41;</span>
    date_added = Field<span style="color: black;">&#40;</span>DateTime, default=<span style="color: #dc143c;">datetime</span>.<span style="color: #dc143c;">datetime</span>.<span style="color: black;">now</span><span style="color: black;">&#41;</span>
    hits = Field<span style="color: black;">&#40;</span>Integer, default=<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
    smaller = ManyToOne<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Phrase'</span>, colname=<span style="color: #483d8b;">'smaller_id'</span><span style="color: black;">&#41;</span>
    larger = ManyToOne<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Phrase'</span>, colname=<span style="color: #483d8b;">'larger_id'</span><span style="color: black;">&#41;</span>
    sentences = ManyToMany<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Sentence'</span>, tablename=<span style="color: #483d8b;">'TABLE_COMP_SENT'</span>,
                           local_side=<span style="color: #483d8b;">'comp_id'</span>, remote_side=<span style="color: #483d8b;">'sent_id'</span>, column_format=<span style="color: #483d8b;">&quot;%(key)s&quot;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>I left out most of the class specific code to focus on Elixir.  One thing that took a while to figure out was how to set up a ManyToMany relationship with specific columns in my database.  The <em>column_format</em> parameter is the key to being able to specify your column names directly.  I really didn&#8217;t have to use any other options besides what you see above when connecting to an existing database.  Overall, I had about five database tables to connect.</p>
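<p>For readers who have not met many-to-many mappings before, here is a rough sketch of the kind of table layout the ManyToMany relationship above points at, written with the standard <em>sqlite3</em> module instead of Elixir.  TABLE_COMP, TABLE_COMP_SENT, comp_id and sent_id come from the class above; the TABLE_SENT table, its <em>sentence</em> column and the sample rows are assumptions, since the post does not show them.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE TABLE_COMP (comp_id INTEGER PRIMARY KEY, hits INTEGER DEFAULT 1);
    -- Hypothetical sentence table: its name and columns are assumed.
    CREATE TABLE TABLE_SENT (sent_id INTEGER PRIMARY KEY, sentence TEXT);
    -- Association table: one row per (comparison, sentence) pair.
    CREATE TABLE TABLE_COMP_SENT (
        comp_id INTEGER REFERENCES TABLE_COMP (comp_id),
        sent_id INTEGER REFERENCES TABLE_SENT (sent_id),
        PRIMARY KEY (comp_id, sent_id)
    );
""")

conn.execute("INSERT INTO TABLE_COMP (comp_id) VALUES (1)")
conn.executemany(
    "INSERT INTO TABLE_SENT (sent_id, sentence) VALUES (?, ?)",
    [(1, "An elephant is larger than a dog."),
     (2, "A dog is smaller than an elephant.")])
conn.executemany(
    "INSERT INTO TABLE_COMP_SENT (comp_id, sent_id) VALUES (?, ?)",
    [(1, 1), (1, 2)])

# Roughly what reading comparison.sentences has to do behind the scenes.
sentences = [row[0] for row in conn.execute(
    """SELECT s.sentence
       FROM TABLE_SENT AS s
       JOIN TABLE_COMP_SENT AS cs ON cs.sent_id = s.sent_id
       WHERE cs.comp_id = ?
       ORDER BY s.sent_id""", (1,))]
hits = conn.execute("SELECT hits FROM TABLE_COMP WHERE comp_id = 1").fetchone()[0]
print(sentences, hits)
```

<p>The association table uses the plain column names comp_id and sent_id, which is the layout that had to be spelled out for Elixir when mapping onto the existing database.</p>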
<p>If Elixir is not being set up against an existing database, many of the parameters in the different relationships are unnecessary.  For comparison, here is the same example if Elixir is used to create the database tables:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> Comparison<span style="color: black;">&#40;</span>Entity<span style="color: black;">&#41;</span>:
    comp_id = Field<span style="color: black;">&#40;</span>Integer, primary_key=<span style="color: #008000;">True</span><span style="color: black;">&#41;</span>
    date_added = Field<span style="color: black;">&#40;</span>DateTime, default=<span style="color: #dc143c;">datetime</span>.<span style="color: #dc143c;">datetime</span>.<span style="color: black;">now</span><span style="color: black;">&#41;</span>
    hits = Field<span style="color: black;">&#40;</span>Integer, default=<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
    smaller = ManyToOne<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Phrase'</span><span style="color: black;">&#41;</span>
    larger = ManyToOne<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Phrase'</span><span style="color: black;">&#41;</span>
    sentences = ManyToMany<span style="color: black;">&#40;</span><span style="color: #483d8b;">'Sentence'</span><span style="color: black;">&#41;</span></pre></div></div>

<p>As you can see, it gets quite a bit simpler.  The underlying table information is no longer needed.  It created tables that were very similar to my hand-created tables that I had used with Storm.  When it comes to queries on the database, SQLAlchemy shows through.</p>
<p>I found the documentation on the Elixir webpage to be a little bit lacking in terms of queries.  SQLAlchemy has a <a href="http://www.sqlalchemy.org/docs/05/ormtutorial.html#querying">page</a> that more fully describes the query functions.  The <strong>AND</strong> and <strong>OR</strong> operators are named <strong>and_</strong> and <strong>or_</strong>, respectively, probably because <em>and</em> and <em>or</em> are reserved words in Python.  I thought this was worth mentioning because they are common SQL operators.</p>
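<p>The naming issue is easy to check for yourself.  The toy sketch below is not SQLAlchemy&#8217;s API: the <em>Condition</em> class and the two functions are hypothetical stand-ins that only build a string.  It shows that <em>and</em> is a keyword while <em>and_</em> is an ordinary name, and that a library can also offer the &amp; and | operators, because those can be overloaded while the keywords cannot.</p>

```python
import keyword

class Condition:
    """A hypothetical stand-in for a SQL expression; it only wraps a string."""
    def __init__(self, sql):
        self.sql = sql

    def __and__(self, other):  # lets callers write a & b
        return and_(self, other)

    def __or__(self, other):   # lets callers write a | b
        return or_(self, other)

def and_(*conditions):
    """Join conditions with AND; named and_ because 'and' cannot be a function name."""
    return Condition("(" + " AND ".join(c.sql for c in conditions) + ")")

def or_(*conditions):
    """Join conditions with OR."""
    return Condition("(" + " OR ".join(c.sql for c in conditions) + ")")

reserved = (keyword.iskeyword("and"), keyword.iskeyword("and_"))
where = or_(and_(Condition("hits > 5"), Condition("smaller_id = 1")),
            Condition("larger_id = 1"))
same = (Condition("hits > 5") & Condition("smaller_id = 1")) | Condition("larger_id = 1")
print(reserved)
print(where.sql)
```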
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/05/starting-python-elixir-and-sqlite/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Spell Checking in Python</title>
		<link>http://blog.quibb.org/2009/04/spell-checking-in-python/</link>
		<comments>http://blog.quibb.org/2009/04/spell-checking-in-python/#comments</comments>
		<pubDate>Sat, 11 Apr 2009 15:38:10 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[spelling checker]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=54</guid>
		<description><![CDATA[I was looking into spell checking in Python.  I found spell4py, and downloaded the zip, but couldn&#8217;t get it to build on my system.  If I tried a bit longer maybe, but in the end my solution worked out fine.  This library was overkill for my needs too. I found this article here: http://code.activestate.com/recipes/117221/ This [...]]]></description>
			<content:encoded><![CDATA[<p>I was looking into spell checking in Python.  I found <a href="http://www.keyphrene.com/products/4py/">spell4py</a>, and downloaded the zip, but couldn&#8217;t get it to build on my system.  If I tried a bit longer maybe, but in the end my solution worked out fine.  This library was overkill for my needs too.</p>
<p>I found this article here: <a href="http://code.activestate.com/recipes/117221/">http://code.activestate.com/recipes/117221/</a></p>
<p>This seemed to work well for my purposes, but I wanted to test out other spell checking libraries.  Mozilla Firefox, Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I&#8217;m testing the spelling of words on the Internet).  Here are some Python snippets to get you up and running with the popular spelling checkers.  I modified these to take more than one word, split them up, and then return a list of suggestions.  They do require each spelling checker to be installed.  I was able to do this through the openSuSE package manager.</p>
<p><a href="http://en.wikipedia.org/wiki/Ispell"><strong>Ispell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> ispell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ispell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">8</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">&quot;word: ok&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>s<span style="color: black;">&#91;</span><span style="color: #ff4500;">17</span>:-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/GNU_Aspell"><strong>Aspell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> aspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;aspell -a&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>

<p><a href="http://en.wikipedia.org/wiki/Hunspell"><strong>Hunspell</strong></a></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> hunspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;hunspell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">lower</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'+'</span>:
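                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#found via its root word, so keep the output aligned with the input</span>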
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div>

<p>After doing this and seeing the suggestions, I decided a spell checker wasn&#8217;t really what I was looking for.  A spell checker always tries to make a suggestion, whereas I wanted to filter entries out of a database.  I started with the hope that I could take misspellings and convert them into the correct words.  In the end, I simply removed words that were not spelled correctly, using <a href="http://wordnet.princeton.edu/">WordNet</a> through <a href="http://www.nltk.org/">NLTK</a> as the dictionary.  WordNet had a bigger dictionary than most of the spell checkers, which also helped with the filtering task.  NLTK has a simple <a href="http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html">how-to</a> on getting started with WordNet.</p>
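<p>As a rough sketch of that filtering step (not the code I actually ran), the idea is to keep only the words a dictionary lookup recognises.  The names filter_known_words, is_known, and TOY_DICTIONARY are made up for illustration, and the toy set stands in for WordNet so the snippet runs without NLTK installed.</p>

```python
def filter_known_words(words, is_known):
    """Keep only the words that the dictionary lookup recognises."""
    return [word for word in words if is_known(word.lower())]


# Stand-in dictionary (an assumption) so the sketch runs without NLTK.
TOY_DICTIONARY = {"elephant", "gallon", "pint"}

kept = filter_known_words(["Elephant", "elefant", "pint"], TOY_DICTIONARY.__contains__)
print(kept)
```

<p>With NLTK and its WordNet data installed, is_known could instead be something like lambda w: bool(wordnet.synsets(w)), after from nltk.corpus import wordnet.</p>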
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/04/spell-checking-in-python/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>NLTK vs MontyLingua Part of Speech Taggers</title>
		<link>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/</link>
		<comments>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/#comments</comments>
		<pubDate>Sun, 29 Mar 2009 02:23:50 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[taggers]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=37</guid>
		<description><![CDATA[This is a comparison of the part of speech taggers available in python. As far as I know, these are the most prominent python taggers. Let me know if you think another tagger should be added to the comparison. MontyLingua includes several natural language processing (NLP) tools. The ones that I used in this comparison [...]]]></description>
			<content:encoded><![CDATA[<p>This is a comparison of the part-of-speech taggers available in Python.  As far as I know, these are the most prominent Python taggers.  Let me know if you think another tagger should be added to the comparison.</p>
<p><a href="http://web.media.mit.edu/~hugo/montylingua/">MontyLingua</a> includes several natural language processing (NLP) tools.  The ones that I used in this comparison were the stemmer, tagger, and sentence tokenizer.  <a href="http://www.nltk.org/">The Natural Language Toolkit (NLTK)</a> is another set of Python tools for natural language processing.  It has a much greater breadth of tools than MontyLingua.  It has taggers, parsers, tokenizers, chunkers, and stemmers.  It usually has a few different implementations of each, giving its users different options.  In the case of stemmers, it has the Porter stemmer and the WordNet lemmatizer; Punkt is its sentence tokenizer.  Both toolkits are written to aid in NLP using Python.</p>
<h2 style="text-align: left;"><strong>Taggers</strong></h2>
<p style="text-align: left;">For those who don&#8217;t know, a tagger is an NLP tool that marks the part of speech of each word.</p>
<p>Example:<br />
Input: &#8220;A dog walks&#8221;<br />
Output: &#8220;A/DT dog/NN walks/VBZ&#8221;</p>
<p>The meanings of the tokens after the / can be <a href="http://en.wikipedia.org/wiki/Brown_Corpus">found here</a>.</p>
<p>For NLTK, I&#8217;m comparing the built-in tagger to MontyLingua.  I didn&#8217;t do any training at all and just called nltk.tag.pos_tag().  I used the taggers mostly as is, with some slight modifications.  I added a RegExp tagger in front of the NLTK tagger and made the default tagger its backoff tagger.  The RegExp tagger always marks A, An, and The as DT.  Having them marked as NNP was annoying and was skewing my results.  They were capitalized, so I suppose the tagger thought they were either initials or proper names.</p>
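<p>The article override can be sketched in plain Python.  This is not NLTK&#8217;s RegexpTagger or its backoff mechanism, just the same idea under simplified assumptions: a fallback tagger tags everything, and articles are then forced to DT.  The names tag_with_article_override and toy_tagger are hypothetical, and toy_tagger only imitates the NNP mistake on capitalized words.</p>

```python
import re

# Articles, in any capitalisation, that should always be tagged DT.
ARTICLE_PATTERN = re.compile(r"^(a|an|the)$", re.IGNORECASE)


def tag_with_article_override(tokens, fallback_tagger):
    """Tag articles as DT and leave every other decision to the fallback tagger."""
    tagged = fallback_tagger(tokens)
    return [
        (word, "DT") if ARTICLE_PATTERN.match(word) else (word, tag)
        for word, tag in tagged
    ]


def toy_tagger(tokens):
    # Hypothetical stand-in that mimics the NNP mistake on capitalised words.
    return [(t, "NNP" if t[0].isupper() else "NN") for t in tokens]


result = tag_with_article_override(["The", "dog", "walks"], toy_tagger)
print(result)
```

<p>Any callable that takes a list of tokens and returns (word, tag) pairs can serve as the fallback, which is the shape of what nltk.pos_tag returns.</p>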
<p>MontyLingua, on the other hand, was always marking &#8220;US&#8221; as a pronoun.  This was a problem when scanning sentences that said &#8220;US Pint&#8221; or &#8220;US Gallon.&#8221;  I look at the word before &#8220;US&#8221; and check whether it&#8217;s an article; if it is, I allow it to continue being processed.  Neither tagger is perfect, but it becomes clear that one may be better than the other for my use case, which is scanning sentences from the web.  It may be different for yours.</p>
<h2 style="text-align: left;"><strong>Stemmers</strong></h2>
<p style="text-align: left;">A stemmer is a tool that will take a word with a suffix attached to it, and return the &#8216;stem&#8217; or base word of it.</p>
<p>Example:<br />
Input: dogs<br />
Output: dog</p>
<p>While neither stemmer is perfect, they both do a decent job.  MontyLingua is more inclined to take the &#8216;S&#8217; off the end of a word, while the NLTK WordNetLemmatizer doesn&#8217;t always take it off.  &#8216;Cows&#8217; is an example of a word the WordNetLemmatizer will not stem to &#8216;Cow&#8217; but MontyLingua will.  On the other hand, MontyLingua is more likely to take the &#8216;S&#8217; off the end of an acronym, and I wrote code to correct that in some cases.  If a word is shorter than 4 characters or consists entirely of consonants, I don&#8217;t run it through the MontyLingua stemmer.  The all-consonants check is there to catch some acronyms.  When using MontyLingua on a specific part of speech, it&#8217;s important to specify whether it&#8217;s a <em>noun</em> or a <em>verb</em> with the &#8216;pos&#8217; parameter.  Since I&#8217;m only stemming nouns, I used pos=&#8217;noun&#8217;.</p>
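<p>The guard in front of the stemmer might look roughly like this.  It is a sketch under assumptions: should_stem, safe_stem, and strip_s are made-up names, strip_s is a toy stand-in for the real stemmer, and &#8216;y&#8217; is treated as a consonant.</p>

```python
VOWELS = set("aeiouAEIOU")


def should_stem(word):
    """Return False for short words and all-consonant words (likely acronyms)."""
    if len(word) < 4:
        return False
    return any(ch in VOWELS for ch in word)


def safe_stem(word, stemmer):
    """Apply the stemmer only when the word passes the guard."""
    return stemmer(word) if should_stem(word) else word


def strip_s(word):
    # Hypothetical stand-in for the real stemmer: drops one trailing "s".
    return word[:-1] if word.endswith("s") else word


for w in ["cows", "DVDs", "gas", "gallons"]:
    print(w, safe_stem(w, strip_s))
```

<p>Here &#8216;DVDs&#8217; and &#8216;gas&#8217; are left alone, while &#8216;cows&#8217; and &#8216;gallons&#8217; are passed on to the stemmer.</p>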
<h2 style="text-align: left;"><strong>Results</strong></h2>
<p>The first results don&#8217;t only reflect a change in taggers, but changes in the stemmer and sentence tokenizer also.  I did another comparison using the MontyLingua tagger with the NLTK stemmer and sentence tokenizer for comparison.</p>
<p>Phrases found by one toolchain and not by the other are shown first.  Each toolchain found some phrases that the other missed.  Hits is the number of times a phrase comes up; it is displayed only if there is a discrepancy.  If MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected too.  The first numbers are totals for every discrepancy summed.  There is also a graph below showing how many of each difference there is.  For example, there were 157 phrases where the hit counts differed by 1 and MontyLingua came out on top, and 78 where they differed by 1 and NLTK had more.  An interesting case is the single phrase for which MontyLingua had 40 more hits than NLTK.  That word was elephant.</p>
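<p>For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of the tally, assuming each toolchain&#8217;s output has already been reduced to a mapping from phrase to hit count.  The function name compare_hits and the sample counts are invented for illustration; they are not my data.</p>

```python
from collections import Counter


def compare_hits(hits_a, hits_b):
    """Compare two phrase-to-hit-count mappings.

    Returns (only_a, only_b, ahead_a, ahead_b): the phrase sets unique to each
    side, plus histograms mapping a hit difference to the number of shared
    phrases where that side came out ahead by that amount.
    """
    only_a = set(hits_a) - set(hits_b)
    only_b = set(hits_b) - set(hits_a)
    ahead_a, ahead_b = Counter(), Counter()
    for phrase in set(hits_a) & set(hits_b):
        diff = hits_a[phrase] - hits_b[phrase]
        if diff > 0:
            ahead_a[diff] += 1
        elif diff < 0:
            ahead_b[-diff] += 1
    return only_a, only_b, ahead_a, ahead_b


# Made-up counts, purely for illustration.
monty = {"elephant": 45, "us pint": 3, "dog": 2}
nltk_hits = {"elephant": 5, "us pint": 4, "cat": 1}

only_monty, only_nltk, monty_ahead, nltk_ahead = compare_hits(monty, nltk_hits)
print(only_monty, only_nltk, dict(monty_ahead), dict(nltk_ahead))
```

<p>Each histogram corresponds to one column of a difference table: the key is the size of the discrepancy and the value is how many phrases showed it.</p>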
<p>MontyLingua toolchain vs NLTK toolchain<br />
In MontyLingua but not NLTK: 514<br />
In NLTK but not MontyLingua: 403</p>
<p>Total Hits: MontyLingua: 1421 vs NLTK: 1184</p>
<p style="text-align: center;">
<div id="attachment_38" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.quibb.org/wp-content/uploads/2009/03/monty_v_nltk.png"><img class="size-medium wp-image-38" title="monty_v_nltk" src="http://blog.quibb.org/wp-content/uploads/2009/03/monty_v_nltk-300x266.png" alt="MontyLingua vs NLTK Graph" width="300" height="266" /></a><p class="wp-caption-text">MontyLingua vs NLTK</p></div>
<table style="height: 222px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Hit Difference</td>
<td>MontyLingua</td>
<td>NLTK</td>
</tr>
<tr>
<td>1</td>
<td>157</td>
<td>78</td>
</tr>
<tr>
<td>2</td>
<td>35</td>
<td>10</td>
</tr>
<tr>
<td>3</td>
<td>10</td>
<td>0</td>
</tr>
<tr>
<td>4</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>13</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>14</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>40</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p>On average, MontyLingua had more hits than NLTK.</p>
<p>MontyLingua Tagger with NLTK Stemmer &amp; Tokenizer (ML-NLTK) vs MontyLingua Toolchain<br />
For the sake of completeness, here are the results of the MontyLingua tagger with the NLTK stemmer and tokenizer.</p>
<p>In ML-NLTK but not in MontyLingua: 65<br />
In MontyLingua but not in ML-NLTK: 68</p>
<p>Total Hits: ML-NLTK: 290 vs MontyLingua: 299</p>
<table style="height: 90px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Hit Difference</td>
<td>MontyLingua</td>
<td>ML-NLTK</td>
</tr>
<tr>
<td>1</td>
<td>20</td>
<td>17</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>
<p style="text-align: center;">
<p style="text-align: center;"><span style="text-decoration: underline;">Total Phrases Found By</span></p>
<table style="height: 90px;" border="1" cellspacing="0" cellpadding="3" width="260" align="center">
<tbody>
<tr>
<td>Name</td>
<td>Phrase Count</td>
</tr>
<tr>
<td>NLTK</td>
<td>3777</td>
</tr>
<tr>
<td>ML-NLTK</td>
<td>3885</td>
</tr>
<tr>
<td>MontyLingua</td>
<td>3888</td>
</tr>
</tbody>
</table>
<p>At the end of the day, I&#8217;ll be using the MontyLingua toolchain with the slight modifications I&#8217;ve made (mentioned above).  I&#8217;m definitely still using NLTK, just for different tasks.  NLTK has a great, easy-to-use regexp chunker that I&#8217;ll continue to use.</p>
<p>Again, a tagger&#8217;s performance can vary greatly based on the data used to train and test it.  I was testing them on about 12,000 webpages I downloaded and looking for specific phrases.  On a different data set NLTK may turn out to be better.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/03/nltk-vs-montylingua-part-of-speech-taggers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The ease of Python, SQLite, and Storm</title>
		<link>http://blog.quibb.org/2009/03/the-ease-of-python-sqlite-and-storm/</link>
		<comments>http://blog.quibb.org/2009/03/the-ease-of-python-sqlite-and-storm/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 00:43:21 +0000</pubDate>
		<dc:creator>Joe</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sqlite]]></category>
		<category><![CDATA[storm]]></category>

		<guid isPermaLink="false">http://blog.quibb.org/?p=31</guid>
		<description><![CDATA[I began learning Python this spring, and I must say, the more I program in it the more I like it. I chose the language because of the libraries that are available for it. There is a library for everything. :) Also, there are tools for Natural Language Processing that are a great help, but [...]]]></description>
			<content:encoded><![CDATA[<p>I began learning Python this spring, and I must say, the more I program in it the more I like it.  I chose the language because of the libraries that are available for it.  There is a library for everything. :)  Also, there are <a href="http://www.nltk.org">tools</a> for Natural Language Processing that are a great help, but that&#8217;s for another time and another post.</p>
<p>I was originally thinking about using Postgres, and it would probably give me better speed and scalability. But then I began to wonder whether I really needed a full RDBMS for my application.  After all, I&#8217;m not expecting the project to get too large, and being able to easily move it from one computer to another by just moving a single file sounds very convenient.  SQLite has a <a href="http://www.sqlite.org/whentouse.html">great page</a> to help you decide if it&#8217;s right for you. I ended up settling on SQLite, and so far am happy with the decision.</p>
<p>Installing SQLite was a breeze.  I just opened the package manager in openSuSE and installed the packages.  I also installed the Python Storm package.  There is no daemon process, as there is with Postgres, because you&#8217;re just accessing one file on your filesystem.  There is a great tool for setting up a SQLite database called <a href="http://code.google.com/p/sqlite-manager/">SQLite Manager</a>.  It will let you create tables, view your data, and run queries.  The fact that it&#8217;s available as a Firefox extension makes it easy to install on many platforms.</p>
<p>Now is when the real fun begins. Enter <a href="https://storm.canonical.com/">Storm</a>.</p>
<p>Storm is an object-relational mapping (ORM) tool for Python.  It allows you to manipulate the database through the manipulation of Python objects.  After you map your Python objects to database tables, you manipulate them, and your changes will show up in the database for you.  I&#8217;ve used other ORM tools in the past (Hibernate for Java), but I was amazed at the simplicity of the setup/configuration step when using Storm.</p>
<p>It takes two lines, plus an import, to connect to your SQLite database:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> storm.<span style="color: black;">locals</span> <span style="color: #ff7700;font-weight:bold;">import</span> create_database, Store
&nbsp;
DATABASE = create_database<span style="color: black;">&#40;</span><span style="color: #483d8b;">'sqlite:db_name'</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># or simply create_database('sqlite:') for in-memory</span>
STORE = Store<span style="color: black;">&#40;</span>DATABASE<span style="color: black;">&#41;</span></pre></div></div>

<p>Mapping a class to a table can be done with ease.  Here is an example of one of my classes:</p>

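<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> storm.<span style="color: black;">locals</span> <span style="color: #ff7700;font-weight:bold;">import</span> Int, Reference, Unicode
&nbsp;
<span style="color: #808080; font-style: italic;"># Location is another Storm-mapped class, defined elsewhere</span>
<span style="color: #ff7700;font-weight:bold;">class</span> Sentence<span style="color: black;">&#40;</span><span style="color: #008000;">object</span><span style="color: black;">&#41;</span>: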
    __storm_table__ = <span style="color: #483d8b;">&quot;TABLE_SENT&quot;</span>
    sent_id = Int<span style="color: black;">&#40;</span>primary=<span style="color: #008000;">True</span><span style="color: black;">&#41;</span>
    loc_id = Int<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    location = Reference<span style="color: black;">&#40;</span>loc_id, Location.<span style="color: black;">loc_id</span><span style="color: black;">&#41;</span>
    sentence = Unicode<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, sent, loc = <span style="color: #008000;">None</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">if</span> loc: <span style="color: #008000;">self</span>.<span style="color: black;">location</span> = loc
&nbsp;
        <span style="color: #808080; font-style: italic;"># sent cannot be None</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>sent, <span style="color: #008000;">unicode</span><span style="color: black;">&#41;</span>:
            <span style="color: #008000;">self</span>.<span style="color: black;">sentence</span> = <span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span>sent, <span style="color: #483d8b;">&quot;utf-8&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">else</span>:
            <span style="color: #008000;">self</span>.<span style="color: black;">sentence</span> = sent</pre></div></div>

<p>If you access loc_id, it will give you the database id.  If you access location, which is declared through a Reference, Storm will hand you the corresponding Location object.</p>
<p>Now, I set this up in about 45 minutes from start to finish, so it might need some more fiddling, but overall it seems to work pretty well.  I needed to set something up in one night to keep moving on other parts of the project, and this allowed me to.</p>
<p>It can&#8217;t all be sunshine and rainbows; there was one thing that tripped me up a bit.  Being new to Python, I wasn&#8217;t aware of the u&quot;String&quot; syntax for unicode literals.  It was used in their examples, and after I got an error I assumed that&#8217;s what it was for, but it still tripped me up.  As you can see in my constructor, I added some code to handle the case when a string that isn&#8217;t unicode is passed in.</p>
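<p>The constructor above is Python 2 code.  For comparison, a rough Python 3 equivalent of the same check, where text is str and byte strings are bytes, might look like the sketch below.  The helper name to_text is made up, and UTF-8 is assumed as the default encoding, matching the constructor.</p>

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


decoded = to_text(b"caf\xc3\xa9")
unchanged = to_text("caf\u00e9")
print(decoded == unchanged)
```

<p>Rejecting anything that is neither text nor bytes is a design choice in this sketch; the original constructor does not do that.</p>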
<p>As I get into the more advanced aspects of SQLite/Storm, I hope I continue to be impressed.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.quibb.org/2009/03/the-ease-of-python-sqlite-and-storm/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
