Configuring Console2 with Cygwin

When using Windows, Console2 does a great job of managing my console windows, but it’s not obvious how to configure it for Cygwin (my console of choice).  Getting Cygwin to open in Console2 is simple enough, but getting it to open to a specific startup directory can be tricky.

Here are the steps to get Console2 to open to a specific startup directory:

  1. Launch Console2
  2. Open settings through Edit > Settings
  3. Click ‘Tabs’ in the tree on the left
  4. Click the ‘Add’ button to add a new tab
  5. Set the title to ‘Cygwin’ (or another appropriate name)
  6. In the ‘Shell’ field put: C:\cygwin\bin\bash.exe --login -i -c "cd /cygdrive/c/Users/<username>/<path>/; exec /bin/bash"
    • Replace <username> with your login name and <path> with the path you want under your Windows user directory.  This can also be used to start in other paths on the system; see the example after these steps.
    • This launches Cygwin with a command that changes to that directory and then execs bash again, so the console window stays open.
    • You don’t have to put anything in ‘Startup dir’.
  7. You should now be set to open Cygwin tabs to a specific directory in Console2.
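
For example, to open a tab somewhere outside your user directory, the ‘Shell’ field might look like this (the path here is just an illustration):

C:\cygwin\bin\bash.exe --login -i -c "cd /cygdrive/d/projects; exec /bin/bash"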

Feel free to make other customizations to the console; I like to give each type of console a different color so I can tell at a glance which one I’m looking at.

Posted on November 21, 2011 at 1:12 am by Joe · Permalink · 11 Comments
In: Configuration

unitbench 0.1 released!

After writing cppbench, a C++ benchmark framework, I felt inspired to take the next step: I wanted a benchmark library similar to unittest in Python, so I started working on unitbench.  I ended up with a library where any method whose name starts with ‘bench’ is run and timed.  It also has functions that can be overridden to set the number of warmup runs to perform and the input to the benchmarks.  See the documentation below for full examples and all functions.
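
To give a feel for it, here’s a minimal sketch.  The ‘bench’ prefix, warmup runs, and input values come straight from the description above; the base class name and exact method signatures are assumptions, so check the documentation below for the real interface.

import unitbench  # module name assumed from the project name

class SumBenchmark(unitbench.Benchmark):  # base class name is an assumption
    def warmup(self):
        return 2  # number of untimed warmup runs

    def input(self):
        return [1000, 100000]  # each value is fed to the benchmark methods

    def bench_sum(self, value):
        # any method starting with 'bench' gets run and timed
        sum(range(value))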

unitbench documentation

unitbench pypi page (with downloads)

unitbench source code

I’d love to hear any feedback; all comments are appreciated.  If you notice any bugs or weird behavior, post a bug report on Bitbucket or leave a comment here, and I’ll try to address it as soon as possible.  The same goes for feature requests.

Posted on March 8, 2011 at 9:23 am by Joe · Permalink · 3 Comments
In: Announcement, Python

Announcement: cppbench 0.2 released!

The second release of cppbench, an open source C++ benchmark framework, is here.  Someone on Reddit suggested including user and system time; after some research that seemed like a good idea, so here is a new release with some additional features.
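
For anyone unfamiliar with the distinction: wall time is elapsed real time, user time is CPU time spent in your own code, and system time is CPU time spent in the kernel on your behalf.  Here’s a quick way to see all three; it’s sketched in Python with the Unix-only resource module rather than in cppbench’s C++, just to show the idea.

import resource
import time

wall_start = time.time()
usage_start = resource.getrusage(resource.RUSAGE_SELF)

total = sum(i * i for i in range(10 ** 6))  # CPU-bound work, mostly user time

usage_end = resource.getrusage(resource.RUSAGE_SELF)
wall_end = time.time()

print("user:   %.3fs" % (usage_end.ru_utime - usage_start.ru_utime))
print("system: %.3fs" % (usage_end.ru_stime - usage_start.ru_stime))
print("wall:   %.3fs" % (wall_end - wall_start))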

cppbench webpage: http://blog.quibb.org/cppbench/

cppbench on bitbucket: https://bitbucket.org/qbproger/cppbench/overview

cppbench api: http://blog.quibb.org/cppbench-api/

Posted on January 4, 2011 at 9:44 pm by Joe · Permalink · Leave a comment
In: Announcement

Announcement: cppbench released!

While working on the SQLite benchmarks, I ended up writing a lightweight C++ benchmark framework to make the task easier.  I thought other people might find it useful too, so to prepare it for release I wrote documentation and did some cleanup.

For a list of features, see the cppbench webpage.  With the mindset “release early, release often,” here it is:

Webpage: http://blog.quibb.org/cppbench

Bitbucket: https://bitbucket.org/qbproger/cppbench/overview

API: http://blog.quibb.org/cppbench-api/

This being the first open source project I’ve released, any feedback is appreciated.

Posted on December 22, 2010 at 10:30 am by Joe · Permalink · One Comment
In: Announcement, C++

Crawling the Web With Lynx

Introduction

There are a few reasons you’d want to use a text-based browser to crawl the web.  For example, it makes it easier to do natural language processing on web pages.  I was doing this a year or two ago, and at the time I was unable to find a Python library that would remove HTML reliably.  Since there was a looming deadline, I only tried BeautifulSoup and NLTK for removing HTML, and both eventually crashed on some websites after running for a while.

Keep in mind this information is targeted at general crawling.  If you’re crawling a specific site or a set of similarly formatted sites, different techniques can be used: you’ll be able to use the formatting to your advantage, so HTML parsers such as BeautifulSoup, html5lib, or lxml become more useful.

Lynx

Using Lynx is pretty straightforward.  If you type lynx at a command prompt on a machine with it installed, the browser comes up.  There is built-in help to get you started, and you can browse the Internet with it.  Lynx does a good job of removing the formatting while keeping the text intact, and that text is the information we’re after.

Lynx has several command line parameters that we’ll use to do this; you can list them with lynx -help.  The most important is -dump, which makes lynx write the rendered page to standard output instead of displaying it in its interactive interface, so the output can be captured and processed.
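
For example, this writes the rendered text of a page to a file (the URL is just a placeholder):

lynx -dump "http://example.com/" > page.txt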

Python Code

import os
import signal
import subprocess
import threading

def kill_lynx(pid):
    # Kill lynx if it has been running too long, and reap the process.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(-1, os.WNOHANG)
    print("lynx killed")

def get_url(url):
    # Dump the page as plain text, without the title or the link list.
    cmd = "lynx -dump -nolist -notitle \"{0}\"".format(url)
    lynx = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)

    # Give lynx 300 seconds to finish before killing it.
    t = threading.Timer(300.0, kill_lynx, args=[lynx.pid])
    t.start()

    web_data = lynx.stdout.read()
    t.cancel()

    return web_data.decode("utf-8", "replace")

In addition to the -dump flag, -nolist and -notitle are used too.  In most cases the title is repeated in the text of the website, and most of the time it isn’t a complete sentence anyway, so it’s excluded.  The -nolist parameter removes the list of links from the bottom of the dump.  I only wanted to parse the information on the page, and on some pages this greatly decreases the amount of text to process.

One other thing to notice is that lynx is killed after 300 seconds.  While crawling I found that some sites were huge and slow but never timed out, and killing lynx kept a single site from holding everything up.  An alternative would have been to run lynx from several worker threads so one slow site couldn’t block the rest, but killing it worked well enough for my use case.

Clean Up

Now that the information has been obtained, it needs some cleanup to help with the natural language processing.  Before showing the code, let me just say that if the project were ongoing I’d look into better ways of doing this.  The best way to describe it is, “It worked for me at the time.”

import re

# Patterns for lynx artifacts and whitespace cleanup.
_LINK_BRACKETS = re.compile(r"\[\d+]", re.U)   # numbered link markers like [12]
_LEFT_BRACKETS = re.compile(r"\[", re.U)
_RIGHT_BRACKETS = re.compile(r"]", re.U)
_NEW_LINE = re.compile(r"([^\r\n])\r?\n([^\r\n])", re.U)  # line breaks inside a sentence
_SPECIAL_CHARS = re.compile(r"\f|\r|\t|_", re.U)
_WHITE_SPACE = re.compile(r" [ ]+", re.U)

# Smart quotes and other special characters to normalize away.
MS_CHARS = {u"\u2018": "'",
            u"\u2019": "'",
            u"\u201c": "\"",
            u"\u201d": "\"",
            u"\u2020": " ",
            u"\u2026": " ",
            u"\u25BC": " ",
            u"\u2665": " "}

def clean_lynx(text):
    # Replace smart quotes and other special characters first.
    for char, replacement in MS_CHARS.items():
        text = text.replace(char, replacement)

    text = _NEW_LINE.sub(r"\g<1> \g<2>", text)  # join lines broken mid-sentence
    text = _LINK_BRACKETS.sub("", text)         # drop the [12]-style link markers
    text = _LEFT_BRACKETS.sub("(", text)        # brackets confused the NLP tools,
    text = _RIGHT_BRACKETS.sub(")", text)       # so swap them for parentheses
    text = _SPECIAL_CHARS.sub(" ", text)
    text = _WHITE_SPACE.sub(" ", text)          # collapse runs of spaces

    return text
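
Putting the two pieces together, a run over a single page looks something like this (again, the URL is a placeholder):

text = clean_lynx(get_url("http://example.com/"))
print(text)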

This cleanup was done for the natural language processing algorithms.  Certain algorithms had problems with line breaks in the middle of a sentence.  Smart quotes, or curved quotes, were also problematic.  I remember square brackets (i.e. [ or ]) caused problems with the tagger or sentence tokenizer I was using, so I replaced them with parentheses.  There were a few other rules, and you may need more for your own projects.

I hope you find this useful; it should get you started with crawling the web.  Be prepared to spend time debugging, and it’s a good idea to spot-check the output.  The entire time I was working on the project, I was tweaking different parts of it.  The web is pretty good at throwing every possible case at you.

Posted on November 9, 2010 at 10:38 am by Joe · Permalink · 6 Comments
In: Python