<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Blog::Quibb</title> <atom:link href="http://blog.quibb.org/feed/" rel="self" type="application/rss+xml" /><link>http://blog.quibb.org</link> <description>Software development and more.</description> <lastBuildDate>Mon, 21 Nov 2011 05:12:26 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.2</generator> <item><title>Configuring Console2 with Cygwin</title><link>http://blog.quibb.org/2011/11/configuring-console2-with-cygwin/</link> <comments>http://blog.quibb.org/2011/11/configuring-console2-with-cygwin/#comments</comments> <pubDate>Mon, 21 Nov 2011 05:12:26 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Configuration]]></category> <category><![CDATA[console2]]></category> <category><![CDATA[cygwin]]></category><guid isPermaLink="false">http://blog.quibb.org/?p=353</guid> <description><![CDATA[When using Windows, Console2 does a great job of managing my console windows, but it&#8217;s not intuitive how to configure it with Cygwin (my console of choice).  It&#8217;s not that hard to simply get Cygwin to open in Console2, but it can be tricky to get it open to a startup directory. Here are the [...]]]></description> <content:encoded><![CDATA[<p>When using Windows, <a title="Console 2" href="http://sourceforge.net/projects/console/">Console2</a> does a great job of managing my console windows, but it&#8217;s not intuitive how to configure it with <a title="Cygwin" href="http://www.cygwin.com/">Cygwin</a> (my console of choice).  
It&#8217;s not that hard to simply get Cygwin to open in Console2, but it can be tricky to get it to open in a specific startup directory.</p><p>Here are the steps to get Console2 to open to a specific startup directory:</p><ol><li>Launch Console2</li><li>Open settings through Edit &gt; Settings</li><li>Click &#8216;Tabs&#8217; in the tree on the left</li><li>Click the &#8216;Add&#8217; button to add a new tab</li><li>Set the title to &#8216;Cygwin&#8217; (or another appropriate name)</li><li>In the &#8216;Shell&#8217; field put: <strong>C:\cygwin\bin\bash.exe --login -i -c &quot;cd /cygdrive/c/Users/&lt;username&gt;/&lt;path&gt;/; exec /bin/bash&quot;</strong></li><ul><li>Replace &lt;username&gt; with your login name and &lt;path&gt; with the path you want to use, relative to your home directory.  The same approach can be used to start in other paths on the system.</li><li>This launches Cygwin with a command that changes the directory and then launches bash again, so the console window stays open.</li><li>You don&#8217;t have to put anything in &#8216;Startup dir&#8217;.</li></ul><li>You should be set to open Cygwin tabs to a specific directory in Console2.</li></ol><p>Feel free to make other customizations to the console.  I like to give each type of console a different color so it&#8217;s easy to tell at a glance which one I&#8217;m looking at.</p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2011/11/configuring-console2-with-cygwin/feed/</wfw:commentRss> <slash:comments>3</slash:comments> </item> <item><title>unitbench 0.1 released!</title><link>http://blog.quibb.org/2011/03/unitbench-0-1-released/</link> <comments>http://blog.quibb.org/2011/03/unitbench-0-1-released/#comments</comments> <pubDate>Tue, 08 Mar 2011 13:23:34 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Announcement]]></category> <category><![CDATA[Python]]></category> <category><![CDATA[benchmarks]]></category> <category><![CDATA[unitbench]]></category><guid 
isPermaLink="false">http://blog.quibb.org/?p=331</guid> <description><![CDATA[After writing cppbench, a C++ benchmark framework, I felt inspired to take the next step.  I wanted a benchmark library that was similar to unittest in Python.  I started working on unitbench.  I ended up with a library that allows you to start a benchmark with &#8216;bench&#8217; and it will be run and timed.  It [...]]]></description> <content:encoded><![CDATA[<p>After writing <a title="cppbench" href="http://blog.quibb.org/cppbench/">cppbench</a>, a C++ benchmark framework, I felt inspired to take the next step.  I wanted a benchmark library that was similar to <a title="unittest" href="http://docs.python.org/library/unittest.html">unittest</a> in Python.  I started working on <a title="unitbench" href="http://blog.quibb.org/unitbench">unitbench</a>.  I ended up with a library that lets you start a benchmark&#8217;s name with &#8216;bench&#8217;, and it will be run and timed.  It has functions that can be overridden to set the number of warmup runs to perform and the input to the benchmarks.  See the documentation below for full examples and all functions.</p><p><strong>Features:</strong></p><ul><li>BSD License</li><li>Supports Python 2.6 &#8211; 3.2</li><li>Output formatters</li><li>Cross platform</li><li>Fully tested</li></ul><p><a title="unitbench" href="http://blog.quibb.org/unitbench">unitbench documentation</a></p><p><a title="unitbench pypi page" href="http://pypi.python.org/pypi/unitbench">unitbench PyPI page (with downloads)</a></p><p><a title="unitbench source code" href="https://bitbucket.org/qbproger/unitbench">unitbench source code</a></p><p>I&#8217;d love to hear any feedback.  All comments are appreciated.  If you notice any bugs or weird behavior, post a bug report to Bitbucket or leave a comment here.  I&#8217;ll try to address it as soon as possible.  
The same goes for feature requests.</p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2011/03/unitbench-0-1-released/feed/</wfw:commentRss> <slash:comments>3</slash:comments> </item> <item><title>Announcement: cppbench 0.2 released!</title><link>http://blog.quibb.org/2011/01/announcement-cppbench-0-2-released/</link> <comments>http://blog.quibb.org/2011/01/announcement-cppbench-0-2-released/#comments</comments> <pubDate>Wed, 05 Jan 2011 01:44:36 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Announcement]]></category> <category><![CDATA[benchmarks]]></category> <category><![CDATA[cppbench]]></category> <category><![CDATA[open source]]></category><guid isPermaLink="false">http://blog.quibb.org/?p=321</guid> <description><![CDATA[The second release of cppbench is here.  Cppbench is an open source C++ benchmark framework.  Someone on reddit suggested including user and system time.  After some research that seemed like a good idea, so here is a new release with some additional features. New Features include: Reporting of user and system time in addition [...]]]></description> <content:encoded><![CDATA[<p>The second release of <a title="cppbench" href="http://blog.quibb.org/cppbench/">cppbench</a> is here.  Cppbench is an open source C++ benchmark framework.  Someone on <a title="reddit" href="http://www.reddit.com/r/cpp/comments/epxu5/cppbench_released_a_lightweight_benchmark/">reddit</a> suggested including user and system time.  
After some research that seemed like a good idea, so here is a new release with some additional features.</p><p>New Features include:</p><ul><li>Reporting of user and system time in addition to wall clock time.<ul><li>User time is the amount of CPU time spent in user code.</li><li>System time is the amount of CPU time spent in the operating system kernel.</li></ul></li><li>XML output with all the information recorded by the system.</li><li>Now tested on mingw and cygwin.</li></ul><p>cppbench webpage: <a href="http://blog.quibb.org/cppbench/">http://blog.quibb.org/cppbench/</a></p><p>cppbench on bitbucket: <a href="https://bitbucket.org/qbproger/cppbench/overview">https://bitbucket.org/qbproger/cppbench/overview</a></p><p>cppbench api: <a href="http://blog.quibb.org/cppbench-api/">http://blog.quibb.org/cppbench-api/</a></p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2011/01/announcement-cppbench-0-2-released/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Announcement: cppbench released!</title><link>http://blog.quibb.org/2010/12/announcement-cppbench-released/</link> <comments>http://blog.quibb.org/2010/12/announcement-cppbench-released/#comments</comments> <pubDate>Wed, 22 Dec 2010 14:30:06 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Announcement]]></category> <category><![CDATA[C++]]></category> <category><![CDATA[benchmarks]]></category> <category><![CDATA[cppbench]]></category> <category><![CDATA[open source]]></category><guid isPermaLink="false">http://blog.quibb.org/?p=308</guid> <description><![CDATA[While working on the sqlite benchmarks, I ended up writing a lightweight C++ benchmark framework to make the task easier.  I thought other people might find it useful too.  Then to prepare it for other people to use I wrote documentation and did some cleanup. 
Some of the features include: Simplified BSD License High fidelity [...]]]></description> <content:encoded><![CDATA[<p>While working on the sqlite benchmarks, I ended up writing a lightweight C++ benchmark framework to make the task easier.  I thought other people might find it useful too.  Then to prepare it for other people to use I wrote documentation and did some cleanup.</p><p>Some of the features include:</p><ul><li>Simplified BSD License</li><li>High fidelity stopwatch</li><li>Output formatters</li><li>Cross platform</li></ul><p>For a more complete list of features see the <a title="cppbench" href="http://blog.quibb.org/cppbench">cppbench</a> webpage.  With the mindset &#8220;release early, release often&#8221; here it is:</p><p>Webpage: <a title="http://blog.quibb.org/cppbench" href="http://blog.quibb.org/cppbench">http://blog.quibb.org/cppbench</a></p><p>Bitbucket: <a href="https://bitbucket.org/qbproger/cppbench/overview">https://bitbucket.org/qbproger/cppbench/overview</a></p><p>API: <a href="http://blog.quibb.org/cppbench-api/">http://blog.quibb.org/cppbench-api/</a></p><p>This being the first open source project I&#8217;ve released, any feedback is appreciated.</p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2010/12/announcement-cppbench-released/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> <item><title>Crawling the Web With Lynx</title><link>http://blog.quibb.org/2010/11/crawling-the-web-with-lynx/</link> <comments>http://blog.quibb.org/2010/11/crawling-the-web-with-lynx/#comments</comments> <pubDate>Tue, 09 Nov 2010 14:38:27 +0000</pubDate> <dc:creator>Joe</dc:creator> <category><![CDATA[Python]]></category> <category><![CDATA[nlp]]></category> <category><![CDATA[web crawling]]></category><guid isPermaLink="false">http://blog.quibb.org/?p=275</guid> <description><![CDATA[Introduction There are a few reasons you&#8217;d want to use a text based browser to crawl the web.  
For example, it makes it easier to do natural language processing on web pages.  I was doing this a year or two ago, and at the time I was unable to find a Python library that would [...]]]></description> <content:encoded><![CDATA[<h2>Introduction</h2><p>There are a few reasons you&#8217;d want to use a text-based browser to crawl the web.  For example, it makes it easier to do <a title="Natural Language Processing" href="http://en.wikipedia.org/wiki/Natural_language_processing">natural language processing</a> on web pages.  I was doing this a year or two ago, and at the time I was unable to find a Python library that would remove HTML reliably.  Since there was a looming deadline, I only tried BeautifulSoup and <a title="NLTK" href="http://www.nltk.org/">NLTK</a> for removing HTML.  They eventually crashed on websites after running for a while.</p><p>Keep in mind this information is targeted at general crawling.  If you&#8217;re crawling a specific site or similarly formatted sites, different techniques can be used.  You&#8217;ll be able to use the formatting to your advantage, so HTML parsers, such as <a title="BeautifulSoup" href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, <a title="html5lib" href="http://code.google.com/p/html5lib/">html5lib</a>, or <a title="lxml" href="http://codespeak.net/lxml/">lxml</a>, become more useful.</p><h2>Lynx</h2><p>Using <a title="Lynx" href="http://lynx.isc.org/">Lynx</a> is pretty straightforward.  If you type lynx into a command prompt on a machine with lynx installed, it will bring up the browser.  There is help to get you started, and you&#8217;ll be able to browse the Internet.  As you&#8217;ll see, it does a good job of removing the formatting while keeping the text intact.  That&#8217;s the information we&#8217;re after.</p><p>Lynx has several command line parameters that we&#8217;ll use to do this.  You can list the command line parameters with the command: <em>lynx -help</em>.  
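The most important command line parameter is <em>-dump</em>.  It causes lynx to write the rendered page to standard output rather than displaying it in its browser interface, which allows the information to be captured and processed.</p><h2>Python Code</h2><div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">os</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">signal</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">subprocess</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">threading</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> kill_lynx<span style="color: black;">&#40;</span>pid<span style="color: black;">&#41;</span>: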
    <span style="color: #dc143c;">os</span>.<span style="color: black;">kill</span><span style="color: black;">&#40;</span>pid, <span style="color: #dc143c;">signal</span>.<span style="color: black;">SIGKILL</span><span style="color: black;">&#41;</span>
    <span style="color: #dc143c;">os</span>.<span style="color: black;">waitpid</span><span style="color: black;">&#40;</span>-<span style="color: #ff4500;">1</span>, <span style="color: #dc143c;">os</span>.<span style="color: black;">WNOHANG</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">print</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;lynx killed&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> get_url<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>:
    web_data = <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
    <span style="color: #808080; font-style: italic;"># Pass the arguments as a list (no shell) so the URL is never shell-interpreted</span>
    <span style="color: #808080; font-style: italic;"># and lynx.pid is the pid of lynx itself rather than of a wrapping shell.</span>
    <span style="color: #dc143c;">cmd</span> = <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;lynx&quot;</span>, <span style="color: #483d8b;">&quot;-dump&quot;</span>, <span style="color: #483d8b;">&quot;-nolist&quot;</span>, <span style="color: #483d8b;">&quot;-notitle&quot;</span>, url<span style="color: black;">&#93;</span>
    lynx = <span style="color: #dc143c;">subprocess</span>.<span style="color: black;">Popen</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">cmd</span>, stdout=<span style="color: #dc143c;">subprocess</span>.<span style="color: black;">PIPE</span><span style="color: black;">&#41;</span>
    t = <span style="color: #dc143c;">threading</span>.<span style="color: black;">Timer</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">300.0</span>, kill_lynx, args=<span style="color: black;">&#91;</span>lynx.<span style="color: black;">pid</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
    t.<span style="color: black;">start</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    web_data = lynx.<span style="color: black;">stdout</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    t.<span style="color: black;">cancel</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
    web_data = web_data.<span style="color: black;">decode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;utf-8&quot;</span>, <span style="color: #483d8b;">'replace'</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> web_data</pre></div></div><p>As you can see, in addition to the <em>-dump</em> flag, <em>-nolist</em> and <em>-notitle</em> are used too.  The <em>-notitle</em> parameter drops the page title: in most cases the title is already included in the text of the website, and most of the time it isn&#8217;t a complete sentence.  The <em>-nolist</em> parameter removes the list of links from the bottom of the dump.  I wanted to parse just the information on the page, and on some pages this greatly decreases the amount of text to process.</p><p>One other thing to notice is that lynx is killed after 300 seconds.  While crawling I found that some sites were huge and slow but wouldn&#8217;t time out.  Killing lynx helped if it was taking too long on one site.  An alternate approach would have been to have different threads running lynx to capture information, so one thread wouldn&#8217;t block everything.  However, killing it worked well enough for my use case.</p><h2>Clean Up</h2><p>Now that the information has been obtained, it needs to be run through some cleanup to help with natural language processing.  Before showing the code, let me just say that if the project were ongoing, I&#8217;d look into better ways of doing it.  The best way to describe it is, &#8220;It worked for me at the time.&#8221;</p><div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">re</span>
&nbsp;
_LINK_BRACKETS = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\[</span><span style="color: #000099; font-weight: bold;">\d</span>+]&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
_LEFT_BRACKETS = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\[</span>&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
_RIGHT_BRACKETS = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;]&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
_NEW_LINE = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;([^<span style="color: #000099; font-weight: bold;">\r</span><span style="color: #000099; font-weight: bold;">\n</span>])<span style="color: #000099; font-weight: bold;">\r</span>?<span style="color: #000099; font-weight: bold;">\n</span>([^<span style="color: #000099; font-weight: bold;">\r</span><span style="color: #000099; font-weight: bold;">\n</span>])&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
_SPECIAL_CHARS = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\f</span>|<span style="color: #000099; font-weight: bold;">\r</span>|<span style="color: #000099; font-weight: bold;">\t</span>|_&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
_WHITE_SPACE = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot; [ ]+&quot;</span>, <span style="color: #dc143c;">re</span>.<span style="color: black;">U</span><span style="color: black;">&#41;</span>
&nbsp;
MS_CHARS = <span style="color: black;">&#123;</span>u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>2018&quot;</span>:<span style="color: #483d8b;">&quot;'&quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>2019&quot;</span>:<span style="color: #483d8b;">&quot;'&quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>201c&quot;</span>:<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\&quot;</span>&quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>201d&quot;</span>:<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\&quot;</span>&quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>2020&quot;</span>:<span style="color: #483d8b;">&quot; &quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>2026&quot;</span>:<span style="color: #483d8b;">&quot; &quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>25BC&quot;</span>:<span style="color: #483d8b;">&quot; &quot;</span>,
            u<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\u</span>2665&quot;</span>:<span style="color: #483d8b;">&quot; &quot;</span><span style="color: black;">&#125;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> clean_lynx<span style="color: black;">&#40;</span><span style="color: #008000;">input</span><span style="color: black;">&#41;</span>:
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> MS_CHARS.<span style="color: black;">keys</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
        <span style="color: #008000;">input</span> = <span style="color: #008000;">input</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span>i,MS_CHARS<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #008000;">input</span> = _NEW_LINE.<span style="color: black;">sub</span><span style="color: black;">&#40;</span>r<span style="color: #483d8b;">&quot;<span style="color: #000099; font-weight: bold;">\g</span>&lt;1&gt; <span style="color: #000099; font-weight: bold;">\g</span>&lt;2&gt;&quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
    <span style="color: #008000;">input</span> = _LINK_BRACKETS.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;&quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
    <span style="color: #008000;">input</span> = _LEFT_BRACKETS.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;(&quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
    <span style="color: #008000;">input</span> = _RIGHT_BRACKETS.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;)&quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
    <span style="color: #008000;">input</span> = _SPECIAL_CHARS.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot; &quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
    <span style="color: #008000;">input</span> = _WHITE_SPACE.<span style="color: black;">sub</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot; &quot;</span>, <span style="color: #008000;">input</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">input</span></pre></div></div><p>This cleanup was done for the natural language processing algorithms.  Certain algorithms had problems if there were line breaks in the middle of a sentence.  Smart quotes, or curved quotes, were also problematic.  I remember square brackets (i.e., [ or ]) caused problems with the tagger or sentence tokenizer I was using, so I replaced them with parentheses.  There were a few other rules, and you may need more for your projects.</p><p>I hope you find this useful.  It should get you started with crawling the web.  Be prepared to spend time debugging, and it&#8217;s a good idea to spot check the output.  The entire time I was working on the project, I was tweaking different parts of it.  The web is pretty good at throwing every case at you.</p> ]]></content:encoded> <wfw:commentRss>http://blog.quibb.org/2010/11/crawling-the-web-with-lynx/feed/</wfw:commentRss> <slash:comments>6</slash:comments> </item> </channel> </rss>
