Crawling the Web With Lynx

Introduction

There are a few reasons you’d want to use a text-based browser to crawl the web.  For example, it makes it easier to do natural language processing on web pages.  I was doing this a year or two ago, and at the time I was unable to find a Python library that would remove HTML reliably.  Since there was a looming deadline, I only tried BeautifulSoup and NLTK for removing HTML.  Both eventually crashed on certain websites after running for a while.

Keep in mind this information is targeted at general crawling.  If you’re crawling a specific site or similarly formatted sites, different techniques can be used.  You’ll be able to use the formatting to your advantage, so HTML parsers, such as BeautifulSoup, html5lib, or lxml, become more useful.

Lynx

Using Lynx is pretty straightforward.  If you type lynx into a command prompt with lynx installed, it will bring up the browser.  There is help to get you started, and you’ll be able to browse the Internet.  If you browse around, you’ll see it does a good job of removing the formatting while keeping the text intact.  That’s the information we’re after.

Lynx has several command line parameters that we’ll use to do this.  You can see the full list with the command lynx -help.  The most important parameter is -dump, which makes lynx write the rendered page to standard output rather than display it in the browser interface.  That lets the output be captured and processed.
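
For example, this dumps the text of a page straight to a file (with example.com standing in for a real URL):

lynx -dump "http://www.example.com/" > page.txt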

Python Code

import os
import signal
import subprocess
import threading

def kill_lynx(pid):
    # Kill lynx if it's still running; if it already exited,
    # there's nothing to do.
    try:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, os.WNOHANG)  # reap the child so it doesn't linger
        print("lynx killed")
    except OSError:
        pass

def get_url(url):
    # Passing the arguments as a list avoids the shell, so lynx.pid
    # is lynx itself and the URL can't be mangled by shell quoting.
    cmd = ["lynx", "-dump", "-nolist", "-notitle", url]
    lynx = subprocess.Popen(cmd, stdout=subprocess.PIPE)

    # Kill lynx if it hasn't finished after 300 seconds.
    t = threading.Timer(300.0, kill_lynx, args=[lynx.pid])
    t.start()

    web_data = lynx.stdout.read()
    t.cancel()

    return web_data.decode("utf-8", "replace")

As you can see, in addition to the -dump flag, -nolist and -notitle are used too.  In most cases the title is included in the text of the website anyway; another reason for excluding it is that most of the time it isn’t a complete sentence.  The -nolist parameter removes the list of links from the bottom of the dump.  I wanted to parse just the information on the page, and on some pages this greatly decreases the amount of text to process.

One other thing to notice is that lynx is killed after 300 seconds.  While crawling I found that some sites were huge and slow but wouldn’t time out.  Killing lynx kept the crawler from hanging on one site for too long.  An alternative approach would have been to run lynx from multiple threads, so one slow site wouldn’t block everything.  However, killing it worked well enough for my use case.
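
To tie it together, here’s a minimal sketch of how get_url might be called; example.com is a placeholder for wherever your crawler gets its URLs:

if __name__ == "__main__":
    # example.com is a placeholder; a real crawler would pull URLs
    # from a queue or a list of seeds.
    page_text = get_url("http://www.example.com/")
    print(page_text)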

Clean Up

Now that the text has been obtained, it needs to be run through some cleanup to help with the natural language processing.  Before showing the code, let me just say that if the project were ongoing, I’d look into better ways of doing it.  The best way to describe it is, “It worked for me at the time.”

import re

# Numbered link markers like [12] that lynx can leave in the text
_LINK_BRACKETS = re.compile(r"\[\d+\]", re.U)
_LEFT_BRACKETS = re.compile(r"\[", re.U)
_RIGHT_BRACKETS = re.compile(r"\]", re.U)
# A single line break inside a paragraph (not a blank line)
_NEW_LINE = re.compile(r"([^\r\n])\r?\n([^\r\n])", re.U)
_SPECIAL_CHARS = re.compile(r"[\f\r\t_]", re.U)
_WHITE_SPACE = re.compile(r" {2,}", re.U)

# Smart quotes and a few other characters that tripped up the NLP tools
MS_CHARS = {"\u2018": "'",
            "\u2019": "'",
            "\u201c": '"',
            "\u201d": '"',
            "\u2020": " ",
            "\u2026": " ",
            "\u25bc": " ",
            "\u2665": " "}

def clean_lynx(text):
    for char, replacement in MS_CHARS.items():
        text = text.replace(char, replacement)

    # Join lines that break in the middle of a sentence
    text = _NEW_LINE.sub(r"\g<1> \g<2>", text)
    text = _LINK_BRACKETS.sub("", text)
    # Square brackets confused the tagger, so turn them into parentheses
    text = _LEFT_BRACKETS.sub("(", text)
    text = _RIGHT_BRACKETS.sub(")", text)
    text = _SPECIAL_CHARS.sub(" ", text)
    text = _WHITE_SPACE.sub(" ", text)

    return text

This cleanup was done for the natural language processing algorithms.  Certain algorithms had problems if there were line breaks in the middle of a sentence.  Smart quotes, or curved quotes, were also problematic.  I remember square brackets (i.e., [ or ]) caused problems with the tagger or sentence tokenizer I was using, so I replaced them with parentheses.  There were a few other rules, and you may need more for your own projects.
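
To make the effect concrete, here’s a small made-up sample in the style of a lynx dump, run through clean_lynx:

sample = ("Here is a [1]link and some _emphasis_ that\n"
          "wraps onto a second line with \u201csmart quotes.\u201d")
print(clean_lynx(sample))
# Here is a link and some emphasis that wraps onto a second line with "smart quotes."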

I hope you find this useful.  It should get you started with crawling the web.  Be prepared to spend time debugging, and it’s a good idea to spot-check the output.  The entire time I was working on the project, I was tweaking different parts of it.  The web is pretty good at throwing every case at you.

Posted on November 9, 2010 by Joe · In: Python

6 Responses


  1. Written by yuxcer on August 18, 2011 at 5:19 am

    Hi, I have a problem. When I open some web pages encoded with gb2312, it shows:
    Warn gb2312 !

    application/x-zip D)ownload or C)ancel

    Then I download the file, but I have no idea how to use it.

    • Written by Joe on August 19, 2011 at 12:16 am

      Do you have an example web page? I haven’t done too much with lynx and gb2312.

      • Written by yuxcer on August 19, 2011 at 12:30 pm

        For example, I encounter this problem when I visit the site http://www.126.com.
        Before I saw your blog, I used NLTK or BeautifulSoup for HTML parsing. This article is so useful, thanks a lot!

  2. Written by yuxcer on August 22, 2011 at 12:00 pm

    Thanks, it still doesn’t work with that page, but it does work with other web pages encoded in gb2312. I don’t know why…

    PS: I have another question about NLTK, since I see you use NLTK too.
    Today I viewed the code of nltk.clean_html, and I can’t understand the following line, which is used to remove inline JavaScript/CSS:

    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())

    What’s the meaning of (?is) and </\1>?

    I use this instead:
    cleaned = re.sub(r"<(script|style).*?>.*?(</\1>)", "", html.strip())

    It also works, which puzzles me a lot.

  3. Written by yuxcer on August 22, 2011 at 12:06 pm

    The page didn’t show the code correctly, so I pasted it here:

    http://dpaste.com/hold/600537/
