The ease of Python, SQLite, and Storm

I began learning Python this spring, and I must say, the more I program in it the more I like it. I chose the language because of the libraries that are available for it. There is a library for everything. :) Also, there are tools for Natural Language Processing that are a great help, but that’s for another time and another post.

I was originally thinking about using Postgres, and it would probably give me better speed and scalability. But then, I began to think if I really needed a full RDBMS for my application. After all, I’m not expecting the project to get too large, and being able to easily move it from one computer to another by just moving a single files sounds very convenient. SQLite has a great page to see if it’s right for you. I ended up settling on SQLite, and so far am happy with the decision.

Installing SQLite was a breeze. I just opened the package manager in openSuSE and installed the packages. I also installed the python Storm package. There is no daemon process, as there is with Postgres, because you’re just accessing one file on your filesystem. There is a great tool for setting up a SQLite database called SQLite Manager. It will let you create tables, view your data, and run queries. The fact that it’s available as a firefox extention makes it easy to install on many platforms.

Now is when the real fun begins. Enter Storm.

Storm is an object relation mapping (ORM) tool for Python. It allows you to manipulate the database through the manipulation of Python objects. After you map your python objects to database tables, you manipulate them, and your changes will show up in the database for you. I’ve used other ORM tools in the past (Hibernate for Java), but I was amazed at the simplicity of the setup/configuration step when using Storm.

It takes two lines to connect to your sqlite database:

DATABASE = create_database('sqlite:db_name')
# or simply create_database('sqlite:') for in-memory

Mapping a class to a table can be done with ease. Here is an example of one of my classes:

class Sentence(object):
    __storm_table__ = "TABLE_SENT"
    sent_id = Int(primary=True)
    loc_id = Int()
    location = Reference(loc_id, Location.loc_id)
    sentence = Unicode()
    def __init__(self, sent, loc = None):
        if loc: self.location = loc
        # sent cannot be None
        if not isinstance(sent, unicode):
            self.sentence = unicode(sent, "utf-8")
            self.sentence = sent

If you access loc_id, it will give you the database id. If you access the variable without location through a Reference, it will hand you the corresponding database object.

Now, I set this up in about 45 minutes from start to finish, so it might need some more fiddling, but overall it seems to work pretty well. I needed to set something up in one night to keep moving on other parts of the project, and this allowed me to.

It can’t all be sunshine and rainbows, there was one thing that tripped me up a bit. Being new to Python, I wasn’t aware of the u”String” for unicode. It was used in their examples, and after I got an error assumed that’s what it was for, but it tripped me up. As you can see in my constructor, I added some code to handle the case when a string that isn’t unicode is passed in.

As I get into the more advanced aspects of SQLite/Storm, I hope I continue to be impressed.

Posted on March 8, 2009 at 7:43 pm by Joe · Permalink · 4 Comments
In: Python · Tagged with: ,

Regular Expressions Review

A hobby of mine is to learn new programming languages.  I try and learn at least one a year, and use it for more than just a hello world app.  So this year is the year of python, where if I’m required to write a script Python is the go to guy.  Having that said, I recently had a need for regular expressions, so python was used.

Being most familiar with Java and Ruby, Python seems a little in between, but one feature that stuck out was the regular expression syntax.  Python let’s you put a ‘r’ in front of the quote to denote raw input.  This means you don’t have to escape back slashes twice (ala Java).

For those not familiar with regular expressions, here is an example of a regular expression in a few languages:

Find Slashes: "[\\\\/]"

Find Slashes: r"[\\/]"

Find Slashes: /[\\\/]/

This is just a simple example to illustrate a point, but look at Java.  Find slashes has 4 backslashes.  Even if regular expressions were typically this simple (they’re not), that seems unnecessary.  Anytime it’s necessary for a backslash to make it to the regular expression processor, there much be two in the regular expression.  Another quick example, to find a period (.) the regular expression would be “\\.”.  This compounds very quickly, and makes it painful to use regular expressions in Java.

I can’t understand why Java wouldn’t adopt the python syntax of putting a ‘r’ in from of a string to denote raw input.  At quick glance, it doesn’t seem as though it would break any currently in use regular expressions because the old syntax would continue working as expected.  It’d make them better for the future.

Here are some Regular Expression resources that I find useful: A Ruby Regular Expression Tester A Java Regular Expression Tester

Posted on February 14, 2009 at 2:18 pm by Joe · Permalink · Leave a comment
In: Java, Python, Ruby · Tagged with: 

Getting Started with Git (and the Hub)

I’m very new to Git, but I find some of its features interesting and worth a post about.  This is possibly a part 1 of more to come as I learn more about Git and how to use it effectively.  The comments below are made with having mostly used SVN in the past for source control.

Features (that appeal to me):

These features are of DVCS in general, Bisect may be git specific.

Starting out:

Make sure you have git installed before starting here.  I’m assuming git is installed, and you are a little familiar with the commands.  Here are some links to some excellent resources to get you started if you’re not at that point yet:

Well, now that you’re somewhat familiar with git, let’s get started on with github.  Now, the main reason for posting this is because as a new user it didn’t seem that clear what I need to do to actually push to a repository.  After you’ve been given access to, or created, a github repository, you must set your username and email for github to use.

Instructions on that can be found here:

Now, here is the part that really wasn’t clear to me.  You need to setup an ssh key to github as part of their authentication.  I don’t know if this is correct, but I think of this as a layer of encryption that happens to make sure my files get to github unaltered. (similar to PGP for email)

Instructions for this can be found here:

Now that you’re all setup, you’re ready to do some commits and then a push to the server.  Have fun in the world of git.

Posted on January 20, 2009 at 9:30 pm by Joe · Permalink · One Comment
In: Uncategorized · Tagged with: , ,

Netbeans Debugger Not Stopping at Breakpoints

I use Netbeans for doing development for JRuby.  When I say that I don’t mean I’m developing with JRuby, I mean for JRuby.  I’m writing Java code, and wanted to use the debugger for Java code.  Everytime I mentioned JRuby people would assume I was using it and developing Ruby code, that’s not the case at all.  By the way, an awesome way to learn the ins and outs of a language is to develop another language in it.  I’ve learned quite a bit about Java since I started working on this, and I’d have called myself proficient in Java before starting.

Anyway, when I was trying to use the Netbeans debugger on JRuby it wasn’t stopping at my breakpoints, and I couldn’t figure out why.  I thought it might have something to do with me using linux, and after people ran out of ideas they seemed to think the same thing.  This turned out to not be the case.  It was a problem with the ant script for running the debugger.

It was as simple as changing this:

    <nbjpdastart addressproperty="jpda.address" name="JRuby" transport="dt_socket">
        <pathelement path="${jruby.classes.dir}"/>
        <path refid="build.classpath"/>

to this:

    <nbjpdastart addressproperty="jpda.address" name="JRuby" transport="dt_socket">
        <pathelement path="${jruby.classes.dir}"/>
        <path refid="build.classpath"/>
      <sourcepath path="${src.dir}"/>

The sourcepath makes all the difference. I’m posting this because nbjpdastart has very little documentation, and it took me quite a while to figure this out. I hope this saves you some time.

Posted on December 2, 2008 at 9:03 pm by Joe · Permalink · Leave a comment
In: Uncategorized · Tagged with: , , ,

Sort Optimization

This all started one night when I was in the #JRuby channel on  A channel user was complaining about JRuby’s sorting algorithm being slow.  I thought to myself, I should be able to speed it up.  At the time I was thinking since sorting has been so well researched it’d likely be easy to find documentation about it.  That didn’t turn out to be entirely the case.

I started by looking around on wikipedia, to see what that had to offer.  IntroSort caught my eye.  I thought it was interesting that after recursing to a certain depth it would switch from quicksort to heapsort.  I don’t think this optimization turned out to be that needed in the end though.  Other than switching to heapsort, it was pretty much a median of three quicksort with an insertion sort added.

It was a good starting point though.  I switched the heapsort for a shell sort and that dropped the number of comparisons needed by a good amount.  One thing I saw was only my “Median of 3 Killer” test case was affected by that.  I searched around the Internet often during this, and came across this page: QuickSort.  It had some interesting ideas like grouping the same element together when running the partition function.  I tried implementing that several times and it always ended up increasing the number of comparisons and the runtime.  I’m not 100% sure why.  If someone who knows more about sorting that I do has any input on that, I’d be happy to hear it.

Anyway, after working on that for a while I took a look at the competition.  I looked at Ruby’s C code.  They do some interesting things, which I hadn’t thought of up to that point.  First, they take the median of 7 if it has more than 200 elements.  Second, they don’t sample the end elements, this helps me out later in my optimizations.  I didn’t copy exactly what they were doing, but did a similar idea.  One thing they also did was look at the order of the 3 values that they compared last (I’ll refer to these as v1, v2, and v3).  v1 is before v2 and v3 in the list’s current state.  v2 is in the middle of the other two, and so on.

They had a lot more checks than what I ended up using, but I check if v1 <= v2 <= v3.  If this is the case I run the sequential test.  I also check if v1 >= v2 >= v3 to see if the list is in reversed order, and if it is I reverse the list before continuing.  While running these tests, I don’t check the first or last element of list because I have a separate test for that.  If it passes the sequential or reverse test that means the entire list is sorted except for potentially the first and/or the last element.  I then do a test on them and if they’re out of sequence I do bubble sort style swaps until they’re in the correct location.

Checking the end was one of the last optimizations I performed.  The main reason I added it is that case can be slow with the normal sorting algorithm, and I don’t think it’s that uncommon a case.  I’ve seen it happen where an element is appended to a sorted list and then the list is sorted again.  Overall, this catches a case with the potential to be slow in a fairly cheap manner.  These cases weren’t weeded out by the v1 <= v2 <= v3 style checks because the median of 7 that I use doesn’t check the end elements.

One of the last optimizations I performed was converting it to use a stack rather than operating recursively.  To be honest, this provided more speedup than I was expecting.  I guess all the function calls were taking a toll on the performance that I just didn’t realize at the time.

Another implementation technique that I ran tests to figure out which was better was whether to do insertion sort at the end on the entire list, or to do it as I find sections that are smaller than the threshold value.  After running benchmarks it turned out that putting it at the end was better.  I suspect that the extra function calls required for it to be done during the quicksort loop was more overhead than the potential gain of having more cache locality.

On with the benchmarks: (The numbers are time in seconds.  In parenthesis is speedup over Java.)

                          Java                 My Qsort
1245.repeat.1000.txt      1.6552300e-05 (1.00) 8.1874300e-06   (2.02 )
1245.repeat.10000.txt     0.00026284956 (1.00) 0.00012648017   (2.08 )
end.0.1000.txt            5.8904900e-06 (1.00) 1.7295600e-06   (3.41 )
end.0.10000.txt           8.3629030e-05 (1.00) 2.3148840e-05   (3.61 )
identical.1000.txt        3.5435500e-06 (1.00) 5.9168000e-07   (5.99 )
identical.10000.txt       3.7097050e-05 (1.00) 6.5003600e-06   (5.71 )
med.3.killer.1000.txt     1.0182050e-05 (1.00) 7.2132600e-06   (1.41 )
med.3.killer.10000.txt    0.00013968449 (1.00) 9.4941470e-05   (1.47 )
rand.dups.100.txt         1.2044400e-06 (1.00) 6.3194000e-07   (1.91 )
rand.dups.1000.txt        1.9847760e-05 (1.00) 1.0417630e-05   (1.91 )
rand.dups.10000.txt       0.00031415178 (1.00) 0.00020365385   (1.54 )      1.2328200e-06 (1.00) 8.1612000e-07   (1.51 )     1.9309830e-05 (1.00) 1.1890280e-05   (1.62 )    0.00027436851 (1.00) 0.00017424722   (1.57 )
rand.steps.1000.txt       1.6057600e-05 (1.00) 1.0023700e-05   (1.60 )
rand.steps.10000.txt      0.00019955369 (1.00) 0.00017004971   (1.17 )
rev.ends.1000.txt         1.2306600e-05 (1.00) 2.8749300e-06   (4.28 )
rev.ends.10000.txt        9.4499880e-05 (1.00) 2.7255840e-05   (3.47 )
rev.partial.1000.txt      1.7107210e-05 (1.00) 8.7564000e-06   (1.95 )
rev.partial.10000.txt     0.00024198949 (1.00) 0.00013045623   (1.85 )
rev.saw.1000.txt          1.6793840e-05 (1.00) 9.4294600e-06   (1.78 )
rev.saw.10000.txt         0.00025133096 (1.00) 0.00014296088   (1.76 )
reverse.1000.txt          1.4600270e-05 (1.00) 1.1565800e-06   (12.6 )
reverse.10000.txt         0.00020535965 (1.00) 1.3798890e-05   (14.9 )    6.5672500e-06 (1.00) 1.4837800e-06   (4.43 )   4.7335130e-05 (1.00) 7.2842900e-06   (6.50 )
seq.partial.1000.txt      1.5497830e-05 (1.00) 6.0515300e-06   (2.56 )
seq.partial.10000.txt     0.00022936435 (1.00) 9.1791120e-05   (2.50 )
seq.saw.1000.txt          1.1645670e-05 (1.00) 4.6621150e-05   (0.250)
seq.saw.10000.txt         0.00019771144 (1.00) 0.00011184647   (1.77 )
sequential.1000.txt       3.4216700e-06 (1.00) 5.8689000e-07   (5.83 )
sequential.10000.txt      3.7812430e-05 (1.00) 6.1648500e-06   (6.13 )

These benchmarks were taken by running the sorting algorithm on each dataset 10000 times (to warm up the JVM), and then running it on the data 10 times to time it.  I took the average of those 10 runs.  The only case where My Qsort was slower than the built-in Java Arrays.sort() was seq.saw.1000.txt.  I attribute this to noise.  I ran it again and got the following:

seq.saw.1000.txt          0.00013980571 (1.00) 0.00012667049   (1.10 )

Hopefully this makes JRuby’s sorting comparable to Ruby’s.  One note, while Java’s sort is stable, the quicksort I wrote is not.  All that means is that if multiple entries have the same value they may get sorted differently.

You can do whatever you’d like with the source code.  If you do find it useful some credit and/or a link to here would be nice.

Here are the test files: Test Files

Here is the sort itself: Sorting Algorithm

Posted on November 15, 2008 at 1:31 pm by Joe · Permalink · 3 Comments
In: Java, Ruby · Tagged with: , , ,