Passing Parameters Between Builds in Hudson

In my last post, I talked about setting up Hudson to run nightly benchmarks.  While trying to take that to the next step and get the results recorded in a graph, I discovered that passing parameters between builds is not as easy as it first seemed.  The Hudson Parameterized Trigger plugin gets you part of the way there, but I was left wanting more flexibility than it offered.

I wanted to set environment variables in an Execute Shell step and then pass them as parameters to the benchmarks build: specifically, the git commit id and timestamp, so they could be recorded alongside each benchmark run.  The Git SCM plugin doesn’t expose that information to Hudson.  The Parameterized Trigger plugin can pass along environment variables that Hudson itself sets, but it didn’t pick up variables set in an Execute Shell step.  At this point I looked through the available options and saw that the Parameterized Trigger could read from a parameters file.  So I tried writing out a parameters file in the Execute Shell step and reading it back in with the Parameterized Trigger plugin.  Success!

Here are the commands I used to write out the properties file:

echo "COMMIT_ID=$(git rev-parse HEAD)" > params.properties
echo "COMMIT_TIME=$(git log -1 --pretty=format:%ad)" >> params.properties

In the end, it worked out pretty well.  After these commands run, a params.properties file is created.  The Parameterized Trigger plugin is set up to read params.properties, and the information moves on to the next build.
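To make the flow concrete, here is a sketch of what the step produces, with placeholder values standing in for the real output of the two git commands above:

```shell
# Placeholder values standing in for the real `git rev-parse` / `git log` output.
commit_id="a1b2c3d4"
commit_time="Tue Apr 20 21:15:00 2010 -0700"

# Same shape as the Execute Shell step above:
echo "COMMIT_ID=$commit_id" > params.properties
echo "COMMIT_TIME=$commit_time" >> params.properties

cat params.properties
```

The resulting file is in the simple KEY=value format the Parameterized Trigger plugin expects, so each line becomes a parameter in the triggered build.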

Posted on April 20, 2010 at 11:06 pm by Joe · Permalink · One Comment
In: Uncategorized

Nightly Benchmarks: Setting up Hudson

For some projects, finding out about performance regressions is important.  I’m going to write a two-part series about setting up a nightly build machine and displaying the generated data.  This part covers installing Hudson and getting the benchmarks running nightly.

I decided to give Hudson a try because I had heard good things about it.  Also, after hearing coworkers complain about CruiseControl and CDash, I thought I’d try something new.  Since Hudson already has pretty extensive documentation, I’ll just walk you through setting up the JRuby project to build with Hudson and getting benchmarks running on it.

Hudson Installation

On Ubuntu it’s as simple as:

sudo apt-get install hudson

While I didn’t install it on Windows, the installation should require little more than installing Tomcat, downloading the Hudson war file, and putting it in the webapps directory.

After installation, browsing to http://127.0.0.1:8080 should show the Hudson Dashboard.

Hudson Configuration

After the Hudson installation is complete, very little configuration is needed before setting up your first project.  One thing that may be necessary is going to the plugins page and making sure your version control system is covered.  For setting up a continuous integration machine to build JRuby, the Git plugin is necessary.

To install the Hudson Git Plugin, click Manage Hudson on the left-hand side, then click Manage Plugins in the list in the middle of the screen.  Click the Available tab and find the Hudson GIT plugin in the list.  After it’s installed, it will show up in the Installed tab.

After installing all the necessary plugins for your project go back to the Hudson Dashboard by clicking the Hudson logo, or the Back to Dashboard link.

Setting up a Project to Build

A good first step is to make sure the project will build on the given machine without going through Hudson.  There may be some dependencies that were overlooked, and this is a good way to make sure everything is set up to build your project.

Now, click the New Job link on the left-hand side.  For the JRuby project, Build a free-style software project is the type of job to create, and I imagine that’s the correct choice for most projects.

Unless you plan on keeping every build produced on the server, Discard Old Builds is a good option to check; set how long you want builds to remain on the server.  Then choose the source code management tool your project uses, which is Git for JRuby, and fill in the appropriate settings.

JRuby settings:

URL of Repository: git://github.com/jruby/jruby.git
Branch Specifier (blank for default): master
Repository browser (Auto)

There are several types of Build Triggers by default.  More Build Triggers can be added through plugins, if you’re looking for another way to trigger a build.  For a nightly build at midnight select the Build periodically option, and put @midnight in the field.
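The Build periodically field accepts standard cron syntax as well; @midnight is simply an alias.  For example, to build every night at 2:00 AM instead, the field could contain:

```
# minute hour day-of-month month day-of-week
0 2 * * *
```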

For the build step, if you’re building a Java project, select Invoke Ant; otherwise, Execute shell may be a good option for you.  For JRuby, select Invoke Ant and set the target to jar.

At this point you can click the Save button at the bottom of the page, then click Build Now on the next page to build your project.  It’s a good idea to make sure your project builds correctly before trying to add nightly benchmarks; it’s easier to debug problems before you have too much going on.  Clicking on the build in the active builds list lets you watch the console output from the browser.

Running the Benchmarks

If your benchmarks are in the same repository, you’re mostly done: add another build step and set it up to run your benchmarks.  While JRuby does have benchmarks in its repository, the benchmarks I plan on running are in a different repository, so I created another job in Hudson to check out and run the benchmarks.

Its setup is very similar to the JRuby job: it checks out the source and runs the benchmarks.  The main difference is that a parameter is passed into the job to tell it which Ruby VM to use.  The Parameterized Trigger plugin is needed to pass a parameter from one job to another.  It works like this: in the job receiving the parameter, you define the parameter near the top of its configuration page (in my case, a RUBY_PATH parameter), and then you set up the build job to send that parameter to the benchmarks job.

To do this, I went back to the JRuby job and turned on the Trigger parameterized build on other projects option; it should be the last option at the bottom of the page.  I set it to trigger the benchmarks job by name, and in the predefined parameters field I put the following:

 RUBY_PATH=$WORKSPACE/bin/jruby

After this is in place, when a JRuby build finishes it will kick off a benchmarks run.  Now that your benchmarks are up and running, the next part of this series will go over how to display the information in a way that makes it easy to spot regressions.
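On the receiving end, the benchmarks job sees RUBY_PATH as an ordinary environment variable in its Execute Shell step.  A minimal sketch (the benchmark driver script is hypothetical):

```shell
# RUBY_PATH is injected by the Parameterized Trigger plugin;
# the default here is only a fallback for running the script by hand.
RUBY_PATH="${RUBY_PATH:-/usr/bin/ruby}"
echo "Running benchmarks with: $RUBY_PATH"
# "$RUBY_PATH" bench/run_benchmarks.rb   # hypothetical benchmark driver
```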

If you have any questions or if I went over something too quickly, post a comment and/or ask a question.

Posted on April 8, 2010 at 9:36 am by Joe · Permalink · 2 Comments
In: Uncategorized

JSR-166: The Java fork/join Framework

JSR-166 is the set of concurrency utilities that was included in Java 5.  The fork/join framework is the piece of it that didn’t make it into Java 5.  After all this time, the fork/join framework is finally making it into JDK 7.  What surprised me about the framework is how easy it is to use.

The fork/join framework is designed to make divide-and-conquer algorithms easy to parallelize: more specifically, recursive algorithms where control branches out over a few paths that each process an equal part of the data set.  The typical setup is to create a new class that extends either RecursiveAction or RecursiveTask.  The parameters that were passed into the recursive function become member variables of the newly defined class, and the recursive calls are replaced by a call to invokeAll(…).

In writing this post, I kept going back and forth on whether I should use Fibonacci numbers as an example or something with more meat to it.  The computation done by each recursive call of a Fibonacci algorithm is too small to matter; not only that, but there are much better non-parallel algorithms for computing Fibonacci numbers.  In the end, I decided on a merge sort.  It is used as the example in the fork/join documentation, but this will be a more complete example showing both the sequential algorithm and the changes made for the parallel version.  You’ll see that it’s not that hard.

First let me start by showing the source code for a typical MergeSort:

public class MergeSort {
 
    private static final int SIZE_THRESHOLD = 16;
 
    public static void sort(Comparable[] a) {
        sort(a, 0, a.length-1);
    }
 
    public static void sort(Comparable[] a, int lo, int hi) {
        if (hi - lo < SIZE_THRESHOLD) {
            insertionsort(a, lo, hi);
            return;
        }
 
        Comparable[] tmp = new Comparable[((hi - lo) / 2) + 1];
        mergeSort(a, tmp, lo, hi);
    }
 
    private static void mergeSort(Comparable[] a, Comparable[] tmp, int lo, int hi) {
        if (hi - lo < SIZE_THRESHOLD) {
            insertionsort(a, lo, hi);
            return;
        }
 
        int m = (lo + hi) / 2;
        mergeSort(a, tmp, lo, m);
        mergeSort(a, tmp, m + 1, hi);
        merge(a, tmp, lo, m, hi);
    }
 
    private static void merge(Comparable[] a, Comparable[] b, int lo, int m, int hi) {
        if (a[m].compareTo(a[m+1]) <= 0)
            return;
 
        System.arraycopy(a, lo, b, 0, m-lo+1);
 
        int i = 0;
        int j = m+1;
        int k = lo;
 
        // copy back next-greatest element at each time
        while (k < j && j <= hi) {
            if (b[i].compareTo(a[j]) <= 0) {
                a[k++] = b[i++];
            } else {
                a[k++] = a[j++];
            }
        }
 
        // copy back remaining elements of first half (if any)
        System.arraycopy(b, i, a, k, j-k);
 
    }
 
    private static void insertionsort(Comparable[] a, int lo, int hi) {
        for (int i = lo+1; i <= hi; i++) {
            int j = i;
            Comparable t = a[j];
            while (j > lo && t.compareTo(a[j - 1]) < 0) {
                a[j] = a[j - 1];
                --j;
            }
            a[j] = t;
        }
    }
}

Now here is the code for the parallel version of MergeSort:

import jsr166y.ForkJoinPool;     // java.util.concurrent.ForkJoinPool on JDK 7
import jsr166y.RecursiveAction;  // java.util.concurrent.RecursiveAction on JDK 7

public class ParallelMergeSort {
 
    private static final ForkJoinPool threadPool = new ForkJoinPool();
    private static final int SIZE_THRESHOLD = 16;
 
    public static void sort(Comparable[] a) {
        sort(a, 0, a.length-1);
    }
 
    public static void sort(Comparable[] a, int lo, int hi) {
        if (hi - lo < SIZE_THRESHOLD) {
            insertionsort(a, lo, hi);
            return;
        }
 
        Comparable[] tmp = new Comparable[a.length];
        threadPool.invoke(new SortTask(a, tmp, lo, hi));
    }
 
    /**
     * This class replaces the recursive function that was
     * previously here.
     */
    static class SortTask extends RecursiveAction {
        Comparable[] a;
        Comparable[] tmp;
        int lo, hi;
        public SortTask(Comparable[] a, Comparable[] tmp, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
            this.tmp = tmp;
        }
 
        @Override
        protected void compute() {
            if (hi - lo < SIZE_THRESHOLD) {
                insertionsort(a, lo, hi);
                return;
            }
 
            int m = (lo + hi) / 2;
            // the two recursive calls are replaced by a call to invokeAll
            invokeAll(new SortTask(a, tmp, lo, m), new SortTask(a, tmp, m+1, hi));
            merge(a, tmp, lo, m, hi);
        }
    }
 
    private static void merge(Comparable[] a, Comparable[] b, int lo, int m, int hi) {
        if (a[m].compareTo(a[m+1]) <= 0)
            return;
 
        System.arraycopy(a, lo, b, lo, m-lo+1);
 
        int i = lo;
        int j = m+1;
        int k = lo;
 
        // copy back next-greatest element at each time
        while (k < j && j <= hi) {
            if (b[i].compareTo(a[j]) <= 0) {
                a[k++] = b[i++];
            } else {
                a[k++] = a[j++];
            }
        }
 
        // copy back remaining elements of first half (if any)
        System.arraycopy(b, i, a, k, j-k);
 
    }
 
    private static void insertionsort(Comparable[] a, int lo, int hi) {
        for (int i = lo+1; i <= hi; i++) {
            int j = i;
            Comparable t = a[j];
            while (j > lo && t.compareTo(a[j - 1]) < 0) {
                a[j] = a[j - 1];
                --j;
            }
            a[j] = t;
        }
    }
}

As you can see, the majority of the algorithm has remained intact.  As stated above, a new class is created that extends RecursiveAction, and the parameters of the function are passed into that class on creation.  One thing to note is that previously only an array half the size of the original was created as secondary storage; now a temporary array the full length of the original is allocated.  This avoids different threads writing to the same region of the temporary array at the same time.

Changes to the algorithm may be needed, but the framework definitely makes it easier to move to parallel processing.  One other thing to note is the presence of the ForkJoinPool.  Its default constructor looks at the number of available processors and determines the appropriate level of parallelism.

I have a quad-core CPU, so the ForkJoinPool can spawn up to four threads if necessary.  That said, I’ve seen cases where only two threads were spawned because more were not necessary for the given task.  The ForkJoinPool spawns more threads as deemed necessary rather than starting right at the maximum.
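The parallelism level is easy to inspect.  A small sketch, using the java.util.concurrent package name from JDK 7 (with Java 6 and the jsr166y jar, the imports would use the jsr166y package instead):

```java
import java.util.concurrent.ForkJoinPool;

public class PoolInfo {
    public static void main(String[] args) {
        // The default constructor sizes the pool from the number of available processors.
        ForkJoinPool defaultPool = new ForkJoinPool();
        System.out.println("Default parallelism: " + defaultPool.getParallelism());

        // A specific level of parallelism can also be requested explicitly.
        ForkJoinPool smallPool = new ForkJoinPool(2);
        System.out.println("Requested parallelism: " + smallPool.getParallelism());
    }
}
```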

A complete API for the fork/join framework can be found at the Concurrency JSR-166 Interest Site.  All that is needed to use it with Java 6 is the jsr166y package.

Some other algorithms suited for parallelism that I’ve been thinking about are graph-searching algorithms such as depth-first and breadth-first search.  Whether they run on a tree or a graph determines how much the underlying data structure needs to change to support the parallelism.  I also plan to look at making a parallel version of the quicksort algorithm using this framework.  Most divide-and-conquer algorithms can be adapted fairly easily to be multi-threaded using this method, but remember: for a performance benefit to be seen, the task must be sufficiently large.
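To show how mechanical the transformation is, here is a rough sketch of quicksort adapted the same way.  The threshold and the simple Lomuto partition are illustrative choices, not tuned ones, and the java.util.concurrent package name is the JDK 7 location (jsr166y on Java 6):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelQuickSort {

    private static final ForkJoinPool threadPool = new ForkJoinPool();
    private static final int SIZE_THRESHOLD = 16;

    public static void sort(int[] a) {
        threadPool.invoke(new SortTask(a, 0, a.length - 1));
    }

    static class SortTask extends RecursiveAction {
        final int[] a;
        final int lo, hi;

        SortTask(int[] a, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo < SIZE_THRESHOLD) {
                insertionsort(a, lo, hi);
                return;
            }
            int p = partition(a, lo, hi);
            // As with the merge sort, the two recursive calls become invokeAll;
            // no merge step is needed afterwards because quicksort partitions in place.
            invokeAll(new SortTask(a, lo, p - 1), new SortTask(a, p + 1, hi));
        }
    }

    // Simple Lomuto partition around the last element.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    private static void insertionsort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int j = i;
            int t = a[j];
            while (j > lo && t < a[j - 1]) {
                a[j] = a[j - 1];
                --j;
            }
            a[j] = t;
        }
    }
}
```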

Posted on March 9, 2010 at 10:53 pm by Joe · Permalink · 14 Comments
In: Java

Setting up a VirtualBox LAMP Server

Introduction

I recently decided to play around with web development a little bit. Not being familiar with setting up a web server, I decided to set up a VirtualBox LAMP server. Since I couldn’t find a good guide that went through all the steps in one place, I decided to write about my experience. I wanted a LAMP server that I could access from any machine on my local network. In retrospect, it isn’t very hard to do, but having all the information in one place is nice.

Installing VirtualBox

Start by installing VirtualBox. The Open Source Edition (OSE) should be good enough for the purposes of this guide. I installed the full edition on openSUSE 11.2 and ran into an issue, which was solved by running sudo chmod +x /usr/lib/virtualbox/VirtualBox and removing /tmp/*vbox*. Generally speaking, it’s fairly easy to install VirtualBox on Windows. When creating the new virtual machine, I allocated 512MB of RAM and 12GB of hard drive space.

Installing LAMP

I chose Ubuntu for my LAMP server largely because there are many documents on how to set up a LAMP server on top of Ubuntu. I’ll do a quick overview here and provide a link to a guide for setting up a LAMP server on Ubuntu (Hardy Heron). I liked that guide more than the one for Jaunty because it tells you to install the OpenSSH server, and being able to administer the VM remotely is a good idea.

Start by downloading the Ubuntu Server Edition. I tried downloading and installing Hardy Heron, the latest Long Term Support (LTS) release, but I kept getting a kernel panic when trying to boot it in VirtualBox 3.1.2; it may have been that particular combination of VirtualBox and Ubuntu versions. Eventually, I went with Karmic Koala. The installation process is almost identical to the Hardy Heron installation, and it provides both the LAMP and OpenSSH options that the guide suggests.

Network Configuration

Click on your newly created virtual machine and open the Settings dialog, then click the Network settings area. For Adapter 1, make sure the Enable Network Adapter checkbox is checked. Adapter 1 is attached to NAT by default; switch it to Bridged Adapter so the VM looks like a regular PC to the rest of your network. It will acquire an IP address from your router, like a normal computer. If you only want your host computer to be able to access it, the Host-only Adapter option seems like an appropriate choice, but I did not use or test that option. After changing the setting, start up the virtual machine. To get the IP of the virtual machine, use the ifconfig command. If you point your browser at that IP, you should see the Apache welcome page.

Static IP

Setting a static IP address for the virtual machine is a good idea so you can always reach it at the same address. These are the instructions for setting a static IP in Ubuntu.

Edit the /etc/network/interfaces file using vim or nano:

sudo [your_editor] /etc/network/interfaces

Find this line:

iface eth0 inet dhcp

Change it to:

iface eth0 inet static
address 192.168.1.99
netmask 255.255.255.0
network 192.168.1.0
broadcast 192.168.1.255
gateway 192.168.1.1

These settings worked well for my Linksys router. By default, the router’s DHCP service hands out IP addresses starting at 192.168.1.100, so using 192.168.1.99 keeps the VM outside that range. The Linksys router defaults to 192.168.1.1 for its own IP, which is why the gateway is set to that. After saving the file, restart networking (sudo /etc/init.d/networking restart) or reboot the VM for the change to take effect.

Pure-FTPd FTP Server

The last step is setting up an FTP server so you can easily transfer files.  For this I chose Pure-FTPd because the project prides itself on being easy to configure.  It largely worked right out of the box without any configuration.

To install it:

sudo apt-get install pure-ftpd

Some Pure-FTPd configuration:

cd to the configuration directory:
cd /etc/pure-ftpd/conf

Turn on the display of dot files (so you can see your .htaccess file):
echo yes | sudo tee DisplayDotFiles

Restart Pure-FTPd:
sudo /etc/init.d/pure-ftpd restart

Get your user connected to the /var/www directory:
cd to your home folder and create a symbolic link to /var/www:
ln -s /var/www www

Change ownership of /var/www to your user so you can write to the directory:
sudo chown -R $USER /var/www

Change to 755 permissions:
sudo chmod -R 755 /var/www

You should now be able to connect to the FTP server from anywhere on your network by pointing your FTP client at: 192.168.1.99 (or any IP you may have chosen). It should have no problem running PHP files.

If any part of this short guide was confusing or didn’t work, leave a comment so I can look into it and update the guide.

Posted on February 16, 2010 at 11:20 pm by Joe · Permalink · 9 Comments
In: Uncategorized

NLTK Regular Expression Parser (RegexpParser)

The Natural Language Toolkit (NLTK) provides a variety of tools for dealing with natural language.  One such tool is the Regular Expression Parser.  If you’re familiar with regular expressions, it can be a useful tool in natural language processing.

Background Information

You must first be familiar with regular expressions to fully utilize the RegexpParser.  If you need to learn about regular expressions, here is a site with an abundance of information to get you started: http://www.regular-expressions.info.  It is also necessary to know how to use a tagger and what the tags mean.  A tagger is a tool that marks each word in a sentence with its part of speech.  Here is a small comparison I did of Python taggers: NLTK vs MontyLingua Part of Speech Taggers.  The NLTK RegexpParser works by running regular expressions on top of the part-of-speech tags added by a tagger.  The Brown Corpus tags will be used throughout the rest of this post, and are commonly used by taggers in general.  On a side note, the RegexpParser can be used with either the NLTK or the MontyLingua tagger.

Basic RegexpParser Usage

Let me start by going over the how-to provided in the NLTK documentation; the source of this information is the NLTK RegexpParser HowTo.  The documentation shows how you can use the RegexpParser to do a traditional parse of a sentence.

The RegexpParser works by defining rules for grouping different words together.  A simple example is: “NP: {<DT>? <JJ>* <NN>*}”.  This rule groups words into a noun phrase: an optional determiner (usually an article), followed by zero or more adjectives and zero or more nouns.  In the how-to, they go on to build prepositional phrases from a preposition and a noun phrase.  It’s important to note that earlier rules can be used in later ones. Also, the regular expression syntax can occur within the tags or apply to the tags themselves.

Here is the example from the NLTK website:

parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>*} # NP
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')
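To see the grammar in action without involving a tagger, it can be run over a hand-tagged sentence.  This is a sketch using the classic example sentence, with the tags supplied by hand:

```python
from nltk import RegexpParser

# The grammar from the NLTK how-to shown above.
parser = RegexpParser('''
    NP: {<DT>? <JJ>* <NN>*} # NP
    P: {<IN>}           # Preposition
    V: {<V.*>}          # Verb
    PP: {<P> <NP>}      # PP -> P NP
    VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
    ''')

# Tags supplied by hand so no tagger is needed for the demo.
sentence = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
            ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
tree = parser.parse(sentence)
print(tree)
```

Each rule is applied in order, and chunks built by earlier rules (like NP) can be referenced by later ones (like PP and VP).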

Alternative RegexpParser Usage

I call this an alternative usage because it can be used to find patterns that aren’t necessarily related to grammatical phrases in English; it can find any pattern in a sentence.  Let me start by showing the regular expression grammar from my program.

grammar = """
	NP:   {<PRP>?<JJ.*>*<NN.*>+}
	CP:   {<JJR|JJS>}
	VERB: {<VB.*>}
	THAN: {<IN>}
	COMP: {<DT>?<NP><RB>?<VERB><DT>?<CP><THAN><DT>?<NP>}
	"""
self.chunker = RegexpParser(grammar)

I was using it to look for a specific pattern in a sentence.  The first part, NP, looks for a noun phrase.  The <PRP>? is there because of a bug in the tagger I was using: it was marking An with a capital ‘A’ as a PRP (pronoun) rather than a DT (determiner/article).  I found another workaround for the bug, but left the PRP in to catch anything that might have slipped through.

Then it moves on to CP, the comparison word.  JJR-tagged words are comparative adjectives; they include words like bigger, smaller, and larger.  JJS words are superlatives, signifying the most or chief; they include biggest, smallest, and largest.

The next two are simply the VERB and the word THAN.  The VERB could be a compound verb, so there may be one or more verbs present.  The IN tag denotes a preposition; in this case, I was looking specifically for the word than.

The last line is COMP.  This is the regular expression that puts it all together; it looks for a size comparison between two objects.  It might be easier to look at the output of this part of the expression than to explain it piece by piece. The only tag not explained above is RB, which is an adverb.

Here is the parse for the sentence “Everyone knows an elephant is larger than a dog.”:

(S
  (NP everyone/NN)
  (VERB knows/VBZ)
  (COMP
    an/DT
    (NP elephant/NN)
    (VERB is/VBZ)
    (CP larger/JJR)
    (THAN than/IN)
    a/DT
    (NP dog/NN))
  ./.)

The output is a simple tree, which makes data extraction easy. It’s easy to see there are many possibilities that open up when looking for patterns in English text.  May this help you in your data mining endeavors.
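As a sketch of that extraction, here is how the two compared noun phrases can be pulled out of the COMP subtree, with the tags supplied by hand to match the parse shown above:

```python
from nltk import RegexpParser

grammar = """
    NP:   {<PRP>?<JJ.*>*<NN.*>+}
    CP:   {<JJR|JJS>}
    VERB: {<VB.*>}
    THAN: {<IN>}
    COMP: {<DT>?<NP><RB>?<VERB><DT>?<CP><THAN><DT>?<NP>}
    """
chunker = RegexpParser(grammar)

# Hand-tagged input matching the parse shown above.
tagged = [('everyone', 'NN'), ('knows', 'VBZ'), ('an', 'DT'),
          ('elephant', 'NN'), ('is', 'VBZ'), ('larger', 'JJR'),
          ('than', 'IN'), ('a', 'DT'), ('dog', 'NN'), ('.', '.')]
tree = chunker.parse(tagged)

# Pull the two compared noun phrases out of each COMP subtree.
for comp in tree.subtrees(lambda t: t.label() == 'COMP'):
    nps = [' '.join(word for word, tag in np.leaves())
           for np in comp.subtrees(lambda t: t.label() == 'NP')]
    print(nps)   # -> ['elephant', 'dog']
```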

Posted on January 27, 2010 at 9:53 am by Joe · Permalink · One Comment
In: Python