<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Blog::Quibb &#187; spelling checker</title> <atom:link href="http://blog.quibb.org/tag/spelling-checker/feed/" rel="self" type="application/rss+xml" /><link>http://blog.quibb.org</link> <description>Software development and more.</description> <lastBuildDate>Mon, 21 Nov 2011 05:12:26 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>Spell Checking in Python</title><link>http://blog.quibb.org/2009/04/spell-checking-in-python/</link> <comments>http://blog.quibb.org/2009/04/spell-checking-in-python/#comments</comments> <pubDate>Sat, 11 Apr 2009 15:38:10 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Python]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[spelling checker]]></category><guid isPermaLink="false">http://blog.quibb.org/?p=54</guid> <description><![CDATA[I was looking into spell checking in Python.  I found spell4py and downloaded the zip, but couldn&#8217;t get it to build on my system.  I might have gotten it working with more effort, but in the end my own solution worked out fine.  This library was overkill for my needs too. I found this article here: http://code.activestate.com/recipes/117221/ This [...]]]></description> <content:encoded><![CDATA[<p>I was looking into spell checking in Python.  I found <a href="http://www.keyphrene.com/products/4py/">spell4py</a> and downloaded the zip, but couldn&#8217;t get it to build on my system.  I might have gotten it working with more effort, but in the end my own solution worked out fine.  
This library was overkill for my needs too.</p><p>I then found this recipe: <a href="http://code.activestate.com/recipes/117221/">http://code.activestate.com/recipes/117221/</a></p><p>It worked well for my purposes, but I wanted to test other spell checking libraries.  Mozilla Firefox, Google Chrome, and OpenOffice all use Hunspell, so I wanted to try that one (as I&#8217;m checking the spelling of words found on the Internet).  Here are some Python snippets to get you up and running with the popular spelling checkers.  I modified the recipe to accept several space-separated words and return a list holding the suggestions for each one (None marks a correctly spelled word).  Each snippet requires the corresponding spelling checker to be installed; I installed them through the openSUSE package manager.  The snippets use the popen2 module, which is deprecated as of Python 2.6 and was removed in Python 3, where subprocess.Popen replaces it.</p><p><a href="http://en.wikipedia.org/wiki/Ispell"><strong>Ispell</strong></a></p><div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> ispell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;ispell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">8</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">&quot;word: ok&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>s<span style="color: black;">&#91;</span><span style="color: #ff4500;">17</span>:-<span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div><p><a href="http://en.wikipedia.org/wiki/GNU_Aspell"><strong>Aspell</strong></a></p><div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> aspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;aspell -a&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div><p><a href="http://en.wikipedia.org/wiki/Hunspell"><strong>Hunspell</strong></a></p><div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">popen2</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> hunspell:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">self</span>._f = <span style="color: #dc143c;">popen2</span>.<span style="color: black;">Popen3</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;hunspell&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the credit line</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__call__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, words<span style="color: black;">&#41;</span>:
        words = words.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">' '</span><span style="color: black;">&#41;</span>
        output = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> word <span style="color: #ff7700;font-weight:bold;">in</span> words:
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>word+<span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\n</span>'</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">tochild</span>.<span style="color: black;">flush</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            s = <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">lower</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>._f.<span style="color: black;">fromchild</span>.<span style="color: #dc143c;">readline</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#skip the blank line</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> s == <span style="color: #483d8b;">&quot;*&quot;</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'#'</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;No Suggestions&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">elif</span> s<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'+'</span>:
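                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;">#'+' means the word was found via its root, so it is spelled correctly</span>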
            <span style="color: #ff7700;font-weight:bold;">else</span>:
                output.<span style="color: black;">append</span><span style="color: black;">&#40;</span>s.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">':'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">', '</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> output</pre></div></div><p>After doing this and seeing the suggestions, I decided a spell checker wasn&#8217;t really what I was looking for.  A spelling checker always tries to make a suggestion, whereas I wanted to filter entries out of a database.  I started with the hope of converting misspellings into the correct word.  In the end, I just used <a href="http://wordnet.princeton.edu/">WordNet</a> through <a href="http://www.nltk.org/">NLTK</a> to remove words that were not spelled correctly.  WordNet had a bigger dictionary than most of the spell checkers, which also helped with the filtering task.  NLTK has a simple <a href="http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html">how-to</a> on getting started with WordNet.</p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2009/04/spell-checking-in-python/feed/</wfw:commentRss> <slash:comments>2</slash:comments> </item> </channel> </rss>
