Configuring Console2 with Cygwin

When using Windows, Console2 does a great job of managing my console windows, but it’s not intuitive how to configure it to run Cygwin (my shell of choice).  Getting Cygwin to open in Console2 is simple enough, but getting it to open in a specific startup directory can be tricky.

Here are the steps to get Console2 to open to a specific startup directory:

  1. Launch Console2
  2. Open the settings through Edit > Settings
  3. Click ‘Tabs’ in the tree on the left
  4. Click the ‘Add’ button to add a new tab
  5. Set the title to ‘Cygwin’ (or another appropriate name)
  6. In the ‘Shell’ field, put: C:\cygwin\bin\bash.exe --login -i -c "cd /cygdrive/c/Users/<username>/<path>/; exec /bin/bash"
    • Replace <username> with your login name, and <path> with the path you want to use relative to your home directory (see the filled-in example below).  The same technique can be used to start in other paths on the system.
    • This launches bash with a command that changes to the desired directory and then execs bash again, so the console window stays open.
    • You don’t have to put anything in ‘Startup dir’.
  7. You should now be set to open Cygwin tabs to a specific directory in Console2.
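
For example, a hypothetical user jdoe who wants new tabs to open in a projects folder under their Windows home directory might use (both the name and the path here are placeholders):

C:\cygwin\bin\bash.exe --login -i -c "cd /cygdrive/c/Users/jdoe/projects/; exec /bin/bash"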

Feel free to make other customizations to the console.  I like to give each type of console a different color so I can tell at a glance which one I’m looking at.



Posted on November 21, 2011 by Joe

Announcement: cppbench 0.2 released!

The second release of cppbench is here.  Cppbench is an open source C++ benchmark framework.  Someone on reddit suggested including user and system time, and after some research that seemed like a good idea, so here is a new release with some additional features.
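
For readers unfamiliar with the distinction: user time is CPU time spent running your own code, while system time is CPU time the kernel spends on your behalf (system calls, I/O, and so on).  The sketch below shows one way to measure both from Python using the Unix-only resource module; it only illustrates the concept and is not cppbench’s API:

import resource
import time

def busy_work():
    # Burn some CPU in user space so there is something to measure.
    total = 0
    for i in range(10 ** 6):
        total += i * i
    return total

wall_start = time.time()
usage_start = resource.getrusage(resource.RUSAGE_SELF)
busy_work()
usage_end = resource.getrusage(resource.RUSAGE_SELF)

print("wall:   {0:.3f}s".format(time.time() - wall_start))
print("user:   {0:.3f}s".format(usage_end.ru_utime - usage_start.ru_utime))
print("system: {0:.3f}s".format(usage_end.ru_stime - usage_start.ru_stime))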

New Features include:

cppbench webpage:

cppbench on bitbucket:

cppbench API:

Posted on January 4, 2011 by Joe

Announcement: cppbench released!

While working on the sqlite benchmarks, I ended up writing a lightweight C++ benchmark framework to make the task easier.  I thought other people might find it useful too, so to prepare it for release I wrote documentation and did some cleanup.

Some of the features include:

For a more complete list of features, see the cppbench webpage.  In the spirit of “release early, release often,” here it is:




This being the first open source project I’ve released, any feedback is appreciated.

Posted on December 22, 2010 by Joe

Crawling the Web With Lynx


There are a few reasons you’d want to use a text-based browser to crawl the web.  For example, it makes it easier to do natural language processing on web pages.  I was doing this a year or two ago, and at the time I was unable to find a Python library that would remove HTML reliably.  Since there was a looming deadline, I only tried BeautifulSoup and NLTK for removing HTML, and both eventually crashed on some websites after running for a while.

Keep in mind this information is targeted at general crawling.  If you’re crawling a specific site or similarly formatted sites, different techniques can be used.  You’ll be able to use the formatting to your advantage, so HTML parsers, such as BeautifulSoup, html5lib, or lxml, become more useful.


Using Lynx is pretty straightforward.  If you type lynx into a command prompt with lynx installed, it will bring up the browser.  There is built-in help to get you started, and you’ll be able to browse the Internet.  As you can see, it does a good job of removing the formatting while keeping the text intact.  That’s the information we’re after.

Lynx has several command line parameters that we’ll use to do this.  You can see all of them with the command: lynx -help.  The most important one is -dump, which causes lynx to write the rendered page to standard output instead of displaying it in the browser interface.  This allows the text to be captured and processed.
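
For example, this dumps a rendered page to a file (example.com is just a stand-in URL):

lynx -dump "http://example.com" > page.txt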

Python Code

import os
import signal
import subprocess
import threading

def kill_lynx(pid):
    # Kill the lynx process and reap it so it doesn't become a zombie.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(-1, os.WNOHANG)
    print("lynx killed")

def get_url(url):
    # -dump writes the rendered page to stdout; -nolist and -notitle
    # strip the link list and the page title from the output.
    cmd = "lynx -dump -nolist -notitle \"{0}\"".format(url)
    lynx = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    # Kill lynx if it spends more than 300 seconds on a single page.
    t = threading.Timer(300.0, kill_lynx, args=[lynx.pid])
    t.start()
    web_data = lynx.stdout.read()
    t.cancel()
    return web_data.decode("utf-8", "replace")

As you can see, in addition to the -dump flag, -nolist and -notitle are used too.  In most cases the title is already included in the text of the website, and most of the time it isn’t a complete sentence anyway, so -notitle drops it.  The -nolist parameter removes the list of links from the bottom of the dump.  I only wanted to parse the information on the page, and on some pages this greatly decreases the amount of text to process.

One other thing to notice is that lynx is killed after 300 seconds.  While crawling I found that some sites were huge and slow but wouldn’t time out, and killing lynx kept a single site from stalling the crawl.  An alternate approach would have been to run lynx from multiple threads so one slow site wouldn’t block everything, but killing it worked well enough for my use case.

Clean Up

Now that the information has been obtained, it needs to be run through some cleanup to help with the natural language processing.  Before showing the code, let me just say that if the project were ongoing, I’d look into better ways of doing this.  The best way to describe it is, “It worked for me at the time.”

import re

_LINK_BRACKETS = re.compile(r"\[\d+\]", re.U)
_LEFT_BRACKETS = re.compile(r"\[", re.U)
_RIGHT_BRACKETS = re.compile(r"\]", re.U)
_NEW_LINE = re.compile(r"([^\r\n])\r?\n([^\r\n])", re.U)
_SPECIAL_CHARS = re.compile(r"\f|\r|\t|_", re.U)
_WHITE_SPACE = re.compile(r" [ ]+", re.U)

# Characters that confused the NLP tools, mapped to safe replacements.
MS_CHARS = {u"\u2018": "'",
            u"\u2020": " ",
            u"\u2026": " ",
            u"\u25BC": " ",
            u"\u2665": " "}

def clean_lynx(text):
    # Replace smart quotes and other problematic characters.
    for char, replacement in MS_CHARS.items():
        text = text.replace(char, replacement)
    # Join lines that break in the middle of a sentence.
    text = _NEW_LINE.sub(r"\g<1> \g<2>", text)
    # Drop lynx's [n] link markers, then swap brackets for parentheses.
    text = _LINK_BRACKETS.sub("", text)
    text = _LEFT_BRACKETS.sub("(", text)
    text = _RIGHT_BRACKETS.sub(")", text)
    # Collapse remaining special characters and runs of spaces.
    text = _SPECIAL_CHARS.sub(" ", text)
    text = _WHITE_SPACE.sub(" ", text)
    return text

This cleanup was done for the natural language processing algorithms.  Certain algorithms had problems if there were line breaks in the middle of a sentence.  Smart quotes, or curved quotes, were also problematic.  I remember square brackets (i.e., [ or ]) caused problems with the tagger or sentence tokenizer I was using, so I replaced them with parentheses.  There were a few other rules, and you may need more for your projects.
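
Putting the two pieces together, fetching and cleaning a single page might look like this (example.com is just a stand-in URL):

text = clean_lynx(get_url("http://example.com"))
print(text)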

I hope you find this useful.  It should get you started with crawling the web.  Be prepared to spend time debugging, and it’s a good idea to spot check the output.  The entire time I was working on the project, I was tweaking different parts of it.  The web is pretty good at throwing every case at you.

Posted on November 9, 2010 by Joe

Some SQLite 3.7 Benchmarks

Since I wrote the benchmarks for insertions in my last post, SQLite 3.7 has been released. I figured it’d be interesting to see if 3.7 changed the situation at all.

Prepared Statements

The specific versions compared here are the release benchmarked in my last post and 3.7.3.  I ran the prepared statements benchmark as-is, without changing any source code.  Both versions use a rollback journal in this case.

[Chart: Prepared Statements Runtime in Seconds]

[Chart: Prepared Statements Inserts Per Second]

As you can see, the new version of SQLite definitely provides better performance, with a speedup of about 3 seconds.

Journal Mode Comparison

One of the main features SQLite 3.7 added was write-ahead logging (WAL).  Its main advantage is that it allows more concurrent access to the database than a rollback journal.  These benchmarks don’t show the true potential of write-ahead logging: they are single threaded, and they insert a large amount of data in one transaction.  The SQLite documentation lists large transactions as a disadvantage of write-ahead logging; they can be slow and even have the potential to return an error.  I wanted to evaluate write-ahead logging as a drop-in replacement.

I ran the prepared statements benchmark with the default, memory, and WAL journal modes.  I also ran each journal mode with synchronous at its default and with synchronous off.  The synchronous setting controls how often SQLite waits for data to be physically written to the hard disk.  The default setting is full, which is the safest because it waits for data to be written to the hard disk most frequently.  With synchronous off, the operating system decides when information is written to the hard disk, so if there is a software crash it’s more likely the database could become corrupt.
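
The benchmarks themselves are C++, but these settings are plain pragmas and look the same from any language binding.  As a minimal sketch, here is how they could be set from Python’s sqlite3 module (the filename is just a placeholder):

import sqlite3

conn = sqlite3.connect("bench.db")  # placeholder database file
# Journal mode: DELETE is the rollback-journal default; MEMORY and WAL
# are the other two modes compared in these benchmarks.
conn.execute("PRAGMA journal_mode=WAL")
# synchronous: FULL (the default) waits for writes to reach the disk;
# OFF hands that decision to the operating system.
conn.execute("PRAGMA synchronous=OFF")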

[Chart: Journal Mode Runtime Comparison]

[Chart: Journal Mode Inserts Per Second]

Up until about 100,000 insertions, all the journal modes and synchronous settings perform about evenly.  After 100,000, the runs with synchronous off pull ahead of their synchronous-full counterparts.  Journal mode set to memory with synchronous off offered the best performance in this benchmark.

Posted on October 14, 2010 by Joe