nlp – Blog::Quibb

Crawling the Web With Lynx

Joe — Tue, 09 Nov 2010 14:38:27 +0000

Introduction

There are a few reasons you’d want to use a text-based browser to crawl the web. For example, it makes it easier to do natural language processing on web pages. I was doing this a year or two ago, and at the time I was unable to find a Python library that would remove HTML reliably. Since there was a looming deadline, I only tried BeautifulSoup and NLTK for removing HTML. Both eventually crashed on some website after running for a while.

Keep in mind this information is targeted at general crawling. If you’re crawling a specific site or similarly formatted sites, different techniques can be used. You’ll be able to use the formatting to your advantage, so HTML parsers, such as BeautifulSoup, html5lib, or lxml, become more useful.

Lynx

Using Lynx is pretty straightforward. If you type lynx into a command prompt with lynx installed, it will bring up the browser. There is help to get you started, and you’ll be able to browse the Internet. As you’ll see once a page loads, it does a good job of removing the formatting while keeping the text intact. That’s the information we’re after.

Lynx has several command line parameters that we’ll use to do this. You can list them with the command: lynx -help. The most important command line parameter is -dump. It causes lynx to write the web page to standard output rather than showing it in its browser interface, which allows the information to be captured and processed.

Python Code

import os
import signal
import subprocess
import threading

def kill_lynx(pid):
    os.kill(pid, signal.SIGKILL)
    os.waitpid(-1, os.WNOHANG)
    print("lynx killed")

def get_url(url):
    web_data = ""

    cmd = "lynx -dump -nolist -notitle \"{0}\"".format(url)
    lynx = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    t = threading.Timer(300.0, kill_lynx, args=[lynx.pid])
    t.start()

    web_data = lynx.stdout.read()
    t.cancel()

    web_data = web_data.decode("utf-8", 'replace')
    return web_data

As you can see, the -nolist and -notitle flags are used in addition to -dump. I excluded the title because in most cases it is repeated in the text of the website, and most of the time it isn’t a complete sentence. The -nolist parameter removes the list of links from the bottom of the dump. I wanted to parse just the information on the page, and on some pages this greatly decreases the amount of text to process.

One other thing to notice is that lynx is killed after 300 seconds. While crawling I found that some sites were huge and slow but wouldn’t time out. Killing lynx helped if it was taking too long on one site. An alternate approach would have been to have different threads running lynx to capture information, so one thread wouldn’t block everything. However, killing it worked well enough for my use case.
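On Python 3.5 and later, subprocess.run can enforce the time limit on its own: when the timeout expires it kills the child and raises TimeoutExpired. Here is a rough sketch of that, together with the thread-based alternative mentioned above. The names lynx_command, run_dump, and get_urls are made up for this example, and passing the command as a list instead of a shell string is my own assumption about what is safest for arbitrary URLs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def lynx_command(url):
    # Same flags as get_url above; a list avoids quoting problems in URLs.
    return ["lynx", "-dump", "-nolist", "-notitle", url]


def run_dump(cmd, timeout=300.0):
    """Run cmd and return its decoded stdout, or None if it timed out."""
    try:
        # subprocess.run kills the child itself once the timeout expires.
        result = subprocess.run(cmd, stdout=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.decode("utf-8", "replace")


def get_urls(urls, fetch, workers=4):
    """Fetch several URLs at once so one slow site does not block the rest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

With lynx installed, something like get_urls(urls, lambda u: run_dump(lynx_command(u))) would return a dictionary of page text keyed by URL, with None for any site that ran past the limit.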

Clean Up

Now that the information has been obtained, it needs to be run through some cleanup to help with the natural language processing of it. Before showing the code, let me just say if the project was ongoing I’d look into better ways of doing it. The best way to describe it is, “It worked for me at the time.”

import re

_LINK_BRACKETS = re.compile(r"\[\d+]", re.U)
_LEFT_BRACKETS = re.compile(r"\[", re.U)
_RIGHT_BRACKETS = re.compile(r"]", re.U)
_NEW_LINE = re.compile(r"([^\r\n])\r?\n([^\r\n])", re.U)
_SPECIAL_CHARS = re.compile(r"\f|\r|\t|_", re.U)
_WHITE_SPACE = re.compile(r" [ ]+", re.U)

MS_CHARS = {u"\u2018":"'",
            u"\u2019":"'",
            u"\u201c":"\"",
            u"\u201d":"\"",
            u"\u2020":" ",
            u"\u2026":" ",
            u"\u25BC":" ",
            u"\u2665":" "}

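def clean_lynx(text):
    # Replace smart quotes and other problem characters with plain stand-ins.
    for char, replacement in MS_CHARS.items():
        text = text.replace(char, replacement)

    # Join lines broken mid-sentence. The raw string keeps \g<1> a group reference.
    text = _NEW_LINE.sub(r"\g<1> \g<2>", text)
    text = _LINK_BRACKETS.sub("", text)  # drop lynx link markers such as [12]
    text = _LEFT_BRACKETS.sub("(", text)
    text = _RIGHT_BRACKETS.sub(")", text)
    text = _SPECIAL_CHARS.sub(" ", text)
    text = _WHITE_SPACE.sub(" ", text)

    return text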

This cleanup was done for the natural language processing algorithms. Certain algorithms had problems if there were line breaks in the middle of a sentence. Smart quotes, or curved quotes, were also problematic. I remember square brackets (i.e., [ or ]) caused problems with the tagger or sentence tokenizer I was using, so I replaced them with parentheses. There were a few other rules, and you may need more for your projects.

I hope you find this useful. It should get you started with crawling the web. Be prepared to spend time debugging, and it’s a good idea to spot check. The entire time I was working on the project, I was tweaking different parts of it. The web is pretty good at throwing every case at you.
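If I were revisiting the cleanup, one direction might look like the sketch below. It is not the code I ran. It assumes Python 3, covers only the quote characters, and tidy and the underscore-prefixed names are invented for the example. It uses str.translate for the character swaps and a lookaround version of the line-joining pattern.

```python
import re

# Shortened stand-in for MS_CHARS: code points mapped to ASCII replacements.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
_SQUARE_TO_ROUND = str.maketrans("[]", "()")

# A single line break with text on both sides. The lookarounds leave the
# neighbouring characters unconsumed, so back-to-back short lines all join.
_MID_SENTENCE_BREAK = re.compile(r"(?<=[^\r\n])\r?\n(?=[^\r\n])")
_LINK_MARKER = re.compile(r"\[\d+\]")
_RUN_OF_SPACES = re.compile(r" {2,}")


def tidy(text):
    text = text.translate(_QUOTES)             # curly quotes -> straight quotes
    text = _MID_SENTENCE_BREAK.sub(" ", text)  # join lines, keep blank-line breaks
    text = _LINK_MARKER.sub("", text)          # drop link markers such as [12]
    text = text.translate(_SQUARE_TO_ROUND)    # [ ] -> ( )
    return _RUN_OF_SPACES.sub(" ", text)
```

The lookarounds matter for input like "a\nb\nc": a pattern that consumes the character on each side of the break joins only the first pair, because the "b" it just consumed can’t start the next match.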

NLTK Regular Expression Parser (RegexpParser)

Joe — Wed, 27 Jan 2010 13:53:35 +0000

The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language. One such tool is the Regular Expression Parser. If you’re familiar with regular expressions, it can be a useful tool in natural language processing.

Background Information

You must first be familiar with regular expressions to make full use of the RegexpParser/RegexpChunkParser. If you need to learn about regular expressions, here is a site with an abundance of information to get you started: http://www.regular-expressions.info. It is also necessary to know how to use a tagger, and what the tags mean. A tagger is a tool that marks each word in a sentence with its part of speech. Here is a small comparison I did of Python taggers: NLTK vs MontyLingua Part of Speech Taggers. The NLTK RegexpParser works by running regular expressions on top of the part of speech tags added by a tagger. The Penn Treebank tags (DT, NN, VBZ, and so on) will be the tags used throughout the rest of this post, and are commonly used by taggers in general. On a side note, the RegexpParser can be used with either the NLTK or MontyLingua tagger.

Basic RegexpParser Usage

Let me start by going over the “how to” provided in the NLTK documentation. The source of this information is here: NLTK RegexpParser HowTo. The documentation goes through how you could use the RegexpParser/RegexpChunkParser to do a traditional parse of a sentence.

The RegexpParser/RegexpChunkParser works by defining rules for grouping different words together. A simple example would be: “NP: {<DT>? <JJ>* <NN>*}”. This is a definition for a rule to group words into a noun phrase. It will group an optional determiner (usually an article), then zero or more adjectives followed by zero or more nouns. In the how to, they go over prepositions and creating prepositional phrases from a preposition and noun phrase. It’s important to note that earlier regular expressions can be used in later ones. Also, the regular expression syntax can occur within the tags or apply to the tags themselves.

Here is the example from the NLTK website:

from nltk import RegexpParser

parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>*} # NP
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')
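To make the idea of regular expressions over tags concrete, here is a small stdlib-only illustration. It is not NLTK’s code, just a sketch of the principle: write the tag sequence out as one string of <TAG> markers, run an ordinary regular expression over it, and map the matches back to the words. The function chunk and its (label, words) output shape are invented for this example.

```python
import re


def chunk(tagged, label, tag_pattern):
    """Group runs of (word, tag) pairs whose tags match tag_pattern.

    tag_pattern is a regex over <TAG> markers, e.g. "(<DT>)?(<JJ>)*(<NN>)+".
    """
    # Encode the tag sequence as one string, e.g. "<DT><NN><VBZ>".
    encoded = "".join("<{0}>".format(tag) for _, tag in tagged)

    # Map character offsets in that string back to token positions.
    starts, pos = {}, 0
    for index, (_, tag) in enumerate(tagged):
        starts[pos] = index
        pos += len(tag) + 2
    starts[pos] = len(tagged)

    result, last = [], 0
    for match in re.finditer(tag_pattern, encoded):
        if match.start() == match.end():
            continue  # ignore empty matches
        first, end = starts[match.start()], starts[match.end()]
        result.extend(tagged[last:first])            # words outside any chunk
        result.append((label, tagged[first:end]))    # the chunk itself
        last = end
    result.extend(tagged[last:])
    return result
```

Calling chunk([("A", "DT"), ("dog", "NN"), ("walks", "VBZ")], "NP", r"(<DT>)?(<JJ>)*(<NN>)+") groups “A dog” under NP and leaves “walks” on its own. I used + on the noun here so that the pattern can’t match nothing at all.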

Alternative RegexpParser Usage

I call this an alternate usage because it can be used to find patterns that aren’t necessarily related to grammatical phrases in English. It can be used to find any pattern in a sentence. Let me start by showing the regular expression grammar from my program.

grammar = """
	NP:   {?*+}
	CP:   {}
	VERB: {}
	THAN: {}
	COMP: {??
?
?}
	"""
self.chunker = RegexpParser(grammar)

I was using it to look for a specific pattern in a sentence. The first part, NP, is looking for a noun phrase. The <PRP>? is there because of a bug found in the tagger I was using. It was marking An with a capital ‘A’ as a PRP (Pronoun) rather than a DT (Determiner/Article). I found another workaround for the bug, but left the PRP in there to catch anything that might have slipped through.

Then it moves on to the CP, which is the comparison word. JJR tagged words are comparative adjectives. They include words such as bigger, smaller, and larger. JJS tagged words are superlative adjectives, which signify the most or chief. They include biggest, smallest, and largest.

The next two are simply the VERB and the word THAN. The VERB could be a compound verb, so there would be one or more verbs present. The IN tag denotes a preposition. In this case, I was looking specifically for the word than.

The last line is COMP. This is the regular expression that puts it all together. This was looking for a size comparison of two objects. It might be easier to look at the output of this part of the expression than trying to explain it piece by piece. The only tag not explained above is RB, which is an adverb.

Here is the parse for the sentence “Everyone knows an elephant is larger than a dog.”:

(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)

The output is a simple tree that makes for easy data extraction. It’s easy to see there are many possibilities that open up when looking for patterns in English text. May this help you in your data mining endeavors.
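As an illustration of how little work the extraction takes, here is a stdlib-only sketch that reads the bracketed parse shown above and pulls out the two things being compared. NLTK’s own Tree objects have traversal helpers for this, so treat read_tree, subtrees, leaf_words, and size_comparison as hypothetical stand-ins written for this post, not as part of NLTK.

```python
import re

PARSE = """
(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)
"""


def read_tree(text):
    """Turn the bracketed parse into nested (label, [children]) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                word, tag = tokens[pos].rsplit("/", 1)
                children.append((word, tag))
                pos += 1
        pos += 1  # step over ")"
        return (label, children)

    return node()


def subtrees(tree, label):
    """Yield every subtree carrying the given label, depth first."""
    name, children = tree
    if name == label:
        yield tree
    for child in children:
        if isinstance(child[1], list):  # subtrees hold a list, leaves a tag string
            yield from subtrees(child, label)


def leaf_words(tree):
    return " ".join(word for word, tag in tree[1] if isinstance(tag, str))


def size_comparison(tree):
    """Return (first thing, comparison word, second thing), or None."""
    for comp in subtrees(tree, "COMP"):
        things = [leaf_words(np) for np in subtrees(comp, "NP")]
        words = [leaf_words(cp) for cp in subtrees(comp, "CP")]
        if len(things) == 2 and len(words) == 1:
            return things[0], words[0], things[1]
    return None
```

For the parse above, size_comparison(read_tree(PARSE)) gives ("elephant", "larger", "dog").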

Spell Checking in Python

Joe — Sat, 11 Apr 2009 15:38:10 +0000

I was looking into spell checking in Python. I found spell4py, and downloaded the zip, but couldn’t get it to build on my system. I might have gotten it working if I had tried a bit longer, but in the end my solution worked out fine. The library was overkill for my needs anyway.

I found this article here: http://code.activestate.com/recipes/117221/

This seemed to work well for my purposes, but I wanted to test out other spell checking libraries. Mozilla Firefox, Google Chrome, and OpenOffice all use hunspell, so I wanted to try that one (as I’m testing the spelling of words on the Internet). Here are some Python snippets to get you up and running with the popular spelling checkers. I modified these to take more than one word, split them up, and then return a list of suggestions. They do require each spelling checker to be installed. I was able to do this through the openSUSE package manager.

Ispell

import subprocess

class ispell:
    def __init__(self):
        # popen2 no longer exists in Python 3; subprocess gives the same two-way pipe.
        self._f = subprocess.Popen(["ispell"], stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE, universal_newlines=True)
        self._f.stdout.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.stdin.write(word+'\n')
            self._f.stdin.flush()
            s = self._f.stdout.readline().strip()
            self._f.stdout.readline() #skip the blank line
            if s[:8] == "word: ok":
                output.append(None)
            else:
                #strip() already removed the newline, so keep the last character
                output.append(s[17:].strip().split(', '))
        return output

Aspell

import subprocess

class aspell:
    def __init__(self):
        # popen2 no longer exists in Python 3; subprocess gives the same two-way pipe.
        self._f = subprocess.Popen(["aspell", "-a"], stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE, universal_newlines=True)
        self._f.stdout.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.stdin.write(word+'\n')
            self._f.stdin.flush()
            s = self._f.stdout.readline().strip()
            self._f.stdout.readline() #skip the blank line
            if s == "*":
                output.append(None)
            elif s[0] == '#':
                output.append("No Suggestions")
            else:
                output.append(s.split(':')[1].strip().split(', '))
        return output

Hunspell

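import subprocess

class hunspell:
    def __init__(self):
        # popen2 no longer exists in Python 3; subprocess gives the same two-way pipe.
        self._f = subprocess.Popen(["hunspell"], stdin=subprocess.PIPE,
                                   stdout=subprocess.PIPE, universal_newlines=True)
        self._f.stdout.readline() #skip the credit line
    def __call__(self, words):
        words = words.split(' ')
        output = []
        for word in words:
            self._f.stdin.write(word+'\n')
            self._f.stdin.flush()
            s = self._f.stdout.readline().strip().lower()
            self._f.stdout.readline() #skip the blank line
            if s == "*":
                output.append(None)
            elif s[0] == '#':
                output.append("No Suggestions")
            elif s[0] == '+':
                #found through affix rules, so the word is spelled correctly;
                #append None to keep the output lined up with the input words
                output.append(None)
            else:
                output.append(s.split(':')[1].strip().split(', '))
        return output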
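The aspell and hunspell classes both read replies in the ispell -a pipe format, where the first character of each line says what happened to the word. Pulling that parsing into one small function makes it testable without any spell checker installed. The sketch below assumes that format; parse_reply is a hypothetical helper, and it returns an empty list where the classes above return “No Suggestions”.

```python
def parse_reply(line):
    """Interpret one reply line from a spell checker running in -a pipe mode.

    Returns None for a correctly spelled word, otherwise a (possibly empty)
    list of suggestions.
    """
    line = line.strip()
    if line[:1] in ("*", "+", "-"):
        # "*" exact match, "+" found via affix rules, "-" accepted compound
        return None
    if line[:1] == "#":
        return []  # misspelled, nothing to suggest
    if line[:1] in ("&", "?"):
        # "& word count offset: first, second, ..."
        return [s.strip() for s in line.split(":", 1)[1].split(",")]
    raise ValueError("unexpected reply: {0!r}".format(line))
```

Returning a list in every misspelled case means callers don’t have to check for a special string before looping over the suggestions.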

After doing this and seeing the suggestions, I decided a spell checker isn’t really what I was looking for. A spelling checker always tries to make a suggestion, and I wanted to filter out things from a database. I started this with the hope that I would be able to take misspellings and convert them into the correct word. In the end, I just removed words that were not spelled correctly using WordNet through NLTK. WordNet had a bigger dictionary than most of the spell checkers, which also helped in the filtering task. NLTK has a simple how-to on getting started with WordNet.
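The filtering itself is only a few lines once there is a way to ask whether a word is known. Here is a sketch with the lookup passed in as a function, so a plain set can stand in for WordNet. With NLTK, the lookup could be something like checking whether wordnet.synsets(word) returns anything, though that is an assumption and not necessarily what my code did. keep_known and VOCABULARY are made-up names.

```python
VOCABULARY = {"elephant", "dog", "pint", "gallon"}  # toy stand-in for WordNet


def keep_known(words, is_known):
    """Drop every word the lookup does not recognise."""
    return [word for word in words if is_known(word.lower())]
```

For example, keep_known(["Elephant", "elefant", "dog"], VOCABULARY.__contains__) keeps “Elephant” and “dog” and silently drops the misspelling.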

NLTK vs MontyLingua Part of Speech Taggers

Joe — Sun, 29 Mar 2009 02:23:50 +0000

This is a comparison of the part of speech taggers available in python. As far as I know, these are the most prominent python taggers. Let me know if you think another tagger should be added to the comparison.

MontyLingua includes several natural language processing (NLP) tools. The ones that I used in this comparison were the stemmer, tagger, and sentence tokenizer. The Natural Language Toolkit (NLTK) is another set of Python tools for natural language processing. It has a much greater breadth of tools than MontyLingua. It has taggers, parsers, tokenizers, chunkers, and stemmers. It usually has a few different implementations of each, giving users different options. In the case of stemmers, it has the Porter stemmer and the WordNet lemmatizer, and for sentence tokenizing it has Punkt. Both of these tools are written to aid in NLP using Python.

Taggers

For those who don’t know, a tagger is an NLP tool that will mark the part of speech of a word.

Example:
Input: “A dog walks”
Output: “A/DT dog/NN walks/VBZ”

The meanings of the tokens after the / can be found here.

For NLTK, I’m comparing the built-in tagger to MontyLingua. I didn’t do any training at all and just called nltk.tag.pos_tag(). I used the taggers mostly as is, with some slight modifications. I added a RegExp tagger in front of the NLTK tagger, and made the default tagger the backoff tagger. It will always mark A, An, and The as DT. Having them marked as NNP was annoying and was messing up my results. They were capitalized, and I suppose the tagger thought they were either initials or proper names.

MontyLingua on the other hand was always marking “US” as a pronoun. This was a problem when scanning sentences that said “US Pint” or “US Gallon.” I look at the word before “US” and see if it’s an article; if it is, I allow it to continue being processed. Neither tagger is perfect, but it becomes clear that one may be better than the other for my use-case. It may be different for yours. I’m scanning sentences from the web.
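The effect of that override is easy to show without NLTK. The sketch below takes a different route to the same result: it fixes the tags after the tagger has run instead of chaining a regexp tagger in front of it. force_determiners is a made-up name, and the input is any list of (word, tag) pairs.

```python
ALWAYS_DETERMINERS = {"a", "an", "the"}


def force_determiners(tagged):
    """Re-tag A, An and The as DT, whatever the tagger decided."""
    return [(word, "DT") if word.lower() in ALWAYS_DETERMINERS else (word, tag)
            for word, tag in tagged]
```

So [("An", "NNP"), ("elephant", "NN")] comes back as [("An", "DT"), ("elephant", "NN")].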

Stemmers

A stemmer is a tool that will take a word with a suffix attached to it, and return the ‘stem’ or base word of it.

Example:
Input: dogs
Output: dog

While neither stemmer is perfect, they both do a decent job. MontyLingua is more inclined to take the ‘S’ off the end of something, and the NLTK WordNetLemmatizer doesn’t always take it off. ‘Cows’ is an example of a word the WordNetLemmatizer will not stem to ‘Cow’ but MontyLingua will. On the other hand, MontyLingua is more likely to take the ‘S’ off the end of an acronym, and I wrote code to correct that in some cases. If a word is shorter than 4 characters or is all consonants, I don’t run it through the MontyLingua stemmer. The all-consonants check is there to catch some acronyms. When using MontyLingua on a specific part of speech, it’s important to specify whether it’s a noun or a verb with the ‘pos’ parameter. Since I’m only stemming nouns, I used pos=’noun’.
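The guard in front of the stemmer can be sketched like this. The stemmer is passed in as a plain function so the example runs without MontyLingua, should_stem and guarded_stem are hypothetical names, and counting only a, e, i, o, and u as vowels is my assumption here.

```python
VOWELS = set("aeiou")


def should_stem(word):
    """Skip short words and vowel-free words, which are often acronyms."""
    if len(word) < 4:
        return False
    return any(ch in VOWELS for ch in word.lower())


def guarded_stem(word, stem):
    # Only hand the word to the stemmer when the guard allows it.
    return stem(word) if should_stem(word) else word
```

With a toy stemmer that just drops the last letter, guarded_stem("dogs", lambda w: w[:-1]) gives "dog", while "DVDs" and "US" come back unchanged.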

Results

The first results don’t only reflect a change in taggers, but changes in the stemmer and sentence tokenizer also. I did another comparison using the MontyLingua tagger with the NLTK stemmer and sentence tokenizer for comparison.

Phrases found by one algorithm and not by the other are shown first. Each toolchain was able to find some phrases that the other did not. Hits is the number of times a phrase comes up; it is displayed only if there is a discrepancy. If MontyLingua and NLTK both found a phrase but found it a different number of times, that is reflected there. The first numbers are totals for every discrepancy summed. There is also a table below showing how many of each difference there is. For example, there were 157 times that there was a discrepancy of 1 hit and MontyLingua came out on top. There were 78 times the number of hits differed by 1 and NLTK had more. An interesting one is that there was one phrase for which MontyLingua had 40 more hits than NLTK. That word was elephant.

MontyLingua toolchain vs NLTK toolchain
In MontyLingua but not NLTK: 514
In NLTK but not MontyLingua: 403

Total Hits: MontyLingua: 1421 vs NLTK: 1184

MontyLingua vs NLTK

Hit Count	MontyLingua	NLTK
1	157	78
2	35	10
3	10	0
4	4	1
5	1	0
6	1	0
13	0	1
14	2	0
40	1	0

On average, MontyLingua had more hits than NLTK on words.
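For anyone wanting to run the same kind of comparison, the counts above can be produced from two phrase-to-hits mappings roughly as follows. This is a reconstruction of the idea, not the script I used. compare_hits is a made-up name, and the sketch assumes that only phrases found by both toolchains go into the hit-difference table.

```python
from collections import Counter


def compare_hits(hits_a, hits_b):
    """Compare two phrase -> hit-count mappings.

    Returns (phrases only in a, phrases only in b, ahead_a, ahead_b), where
    ahead_a maps a hit difference to the number of shared phrases on which
    a had that many more hits, and ahead_b does the same for b.
    """
    only_a = set(hits_a) - set(hits_b)
    only_b = set(hits_b) - set(hits_a)
    ahead_a, ahead_b = Counter(), Counter()
    for phrase in set(hits_a) & set(hits_b):
        diff = hits_a[phrase] - hits_b[phrase]
        if diff > 0:
            ahead_a[diff] += 1
        elif diff < 0:
            ahead_b[-diff] += 1
    return only_a, only_b, ahead_a, ahead_b
```

Each row of the table then corresponds to one key of ahead_a and ahead_b, with the two columns holding the counts for each side.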

MontyLingua Tagger NLTK Stemmer & Tokenizer (ML-NLTK) vs MontyLingua Toolchain
For the sake of completeness here are the results of the MontyLingua tagger with the NLTK stemmer and tokenizer.

In ML-NLTK but not in MontyLingua: 65
In MontyLingua but not in ML-NLTK: 68

Total Hits: ML-NLTK: 290 vs MontyLingua: 299

Hit Count	MontyLingua	ML-NLTK
1	20	17
2	0	2
10	1	0

Total Phrases Found By

Name	Phrase Count
NLTK	3777
ML-NLTK	3885
MontyLingua	3888

At the end of the day, I’ll be using the MontyLingua toolchain with some slight modifications I’ve made (mentioned above). I’m definitely still using NLTK, just for different tasks. NLTK has a great and easy to use regexp chunker that I’ll continue to use.

Again, a tagger’s performance can vary greatly based on the data used to train and test it. I was testing them on about 12,000 webpages I downloaded and looking for specific phrases. On a different data set NLTK may turn out to be better.