Sort Optimization
This all started one night when I was in the #JRuby channel on irc.freenode.net. A channel user was complaining about JRuby’s sorting algorithm being slow. I thought to myself, I should be able to speed it up. At the time I was thinking since sorting has been so well researched it’d likely be easy to find documentation about it. That didn’t turn out to be entirely the case.
I started by looking around on wikipedia, to see what that had to offer. IntroSort caught my eye. I thought it was interesting that after recursing to a certain depth it would switch from quicksort to heapsort. I don’t think this optimization turned out to be that needed in the end though. Other than switching to heapsort, it was pretty much a median of three quicksort with an insertion sort added.
It was a good starting point though. I switched the heapsort for a shell sort and that dropped the number of comparisons needed by a good amount. One thing I saw was only my “Median of 3 Killer” test case was affected by that. I searched around the Internet often during this, and came across this page: QuickSort. It had some interesting ideas like grouping the same element together when running the partition function. I tried implementing that several times and it always ended up increasing the number of comparisons and the runtime. I’m not 100% sure why. If someone who knows more about sorting that I do has any input on that, I’d be happy to hear it.
Anyway, after working on that for a while I took a look at the competition. I looked at Ruby’s C code. They do some interesting things, which I hadn’t thought of up to that point. First, they take the median of 7 if it has more than 200 elements. Second, they don’t sample the end elements, this helps me out later in my optimizations. I didn’t copy exactly what they were doing, but did a similar idea. One thing they also did was look at the order of the 3 values that they compared last (I’ll refer to these as v1, v2, and v3). v1 is before v2 and v3 in the list’s current state. v2 is in the middle of the other two, and so on.
They had a lot more checks than what I ended up using, but I check if v1 <= v2 <= v3. If this is the case I run the sequential test. I also check if v1 >= v2 >= v3 to see if the list is in reversed order, and if it is I reverse the list before continuing. While running these tests, I don’t check the first or last element of list because I have a separate test for that. If it passes the sequential or reverse test that means the entire list is sorted except for potentially the first and/or the last element. I then do a test on them and if they’re out of sequence I do bubble sort style swaps until they’re in the correct location.
Checking the end was one of the last optimizations I performed. The main reason I added it is that case can be slow with the normal sorting algorithm, and I don’t think it’s that uncommon a case. I’ve seen it happen where an element is appended to a sorted list and then the list is sorted again. Overall, this catches a case with the potential to be slow in a fairly cheap manner. These cases weren’t weeded out by the v1 <= v2 <= v3 style checks because the median of 7 that I use doesn’t check the end elements.
One of the last optimizations I performed was converting it to use a stack rather than operating recursively. To be honest, this provided more speedup than I was expecting. I guess all the function calls were taking a toll on the performance that I just didn’t realize at the time.
Another implementation technique that I ran tests to figure out which was better was whether to do insertion sort at the end on the entire list, or to do it as I find sections that are smaller than the threshold value. After running benchmarks it turned out that putting it at the end was better. I suspect that the extra function calls required for it to be done during the quicksort loop was more overhead than the potential gain of having more cache locality.
On with the benchmarks: (The numbers are time in seconds. In parenthesis is speedup over Java.)
Java My Qsort 1245.repeat.1000.txt 1.6552300e-05 (1.00) 8.1874300e-06 (2.02 ) 1245.repeat.10000.txt 0.00026284956 (1.00) 0.00012648017 (2.08 ) end.0.1000.txt 5.8904900e-06 (1.00) 1.7295600e-06 (3.41 ) end.0.10000.txt 8.3629030e-05 (1.00) 2.3148840e-05 (3.61 ) identical.1000.txt 3.5435500e-06 (1.00) 5.9168000e-07 (5.99 ) identical.10000.txt 3.7097050e-05 (1.00) 6.5003600e-06 (5.71 ) med.3.killer.1000.txt 1.0182050e-05 (1.00) 7.2132600e-06 (1.41 ) med.3.killer.10000.txt 0.00013968449 (1.00) 9.4941470e-05 (1.47 ) rand.dups.100.txt 1.2044400e-06 (1.00) 6.3194000e-07 (1.91 ) rand.dups.1000.txt 1.9847760e-05 (1.00) 1.0417630e-05 (1.91 ) rand.dups.10000.txt 0.00031415178 (1.00) 0.00020365385 (1.54 ) rand.no.dups.100.txt 1.2328200e-06 (1.00) 8.1612000e-07 (1.51 ) rand.no.dups.1000.txt 1.9309830e-05 (1.00) 1.1890280e-05 (1.62 ) rand.no.dups.10000.txt 0.00027436851 (1.00) 0.00017424722 (1.57 ) rand.steps.1000.txt 1.6057600e-05 (1.00) 1.0023700e-05 (1.60 ) rand.steps.10000.txt 0.00019955369 (1.00) 0.00017004971 (1.17 ) rev.ends.1000.txt 1.2306600e-05 (1.00) 2.8749300e-06 (4.28 ) rev.ends.10000.txt 9.4499880e-05 (1.00) 2.7255840e-05 (3.47 ) rev.partial.1000.txt 1.7107210e-05 (1.00) 8.7564000e-06 (1.95 ) rev.partial.10000.txt 0.00024198949 (1.00) 0.00013045623 (1.85 ) rev.saw.1000.txt 1.6793840e-05 (1.00) 9.4294600e-06 (1.78 ) rev.saw.10000.txt 0.00025133096 (1.00) 0.00014296088 (1.76 ) reverse.1000.txt 1.4600270e-05 (1.00) 1.1565800e-06 (12.6 ) reverse.10000.txt 0.00020535965 (1.00) 1.3798890e-05 (14.9 ) seq.0.is.1000.1000.txt 6.5672500e-06 (1.00) 1.4837800e-06 (4.43 ) seq.0.is.1000.10000.txt 4.7335130e-05 (1.00) 7.2842900e-06 (6.50 ) seq.partial.1000.txt 1.5497830e-05 (1.00) 6.0515300e-06 (2.56 ) seq.partial.10000.txt 0.00022936435 (1.00) 9.1791120e-05 (2.50 ) seq.saw.1000.txt 1.1645670e-05 (1.00) 4.6621150e-05 (0.250) seq.saw.10000.txt 0.00019771144 (1.00) 0.00011184647 (1.77 ) sequential.1000.txt 3.4216700e-06 (1.00) 5.8689000e-07 (5.83 ) sequential.10000.txt 3.7812430e-05 (1.00) 6.1648500e-06 (6.13 )
These benchmarks were taken by running the sorting algorithm on each dataset 10000 times (to warm up the JVM), and then running it on the data 10 times to time it. I took the average of those 10 runs. The only case where My Qsort was slower than the built-in Java Arrays.sort() was seq.saw.1000.txt. I attribute this to noise. I ran it again and got the following:
seq.saw.1000.txt 0.00013980571 (1.00) 0.00012667049 (1.10 )
Hopefully this makes JRuby’s sorting comparable to Ruby’s. One note, while Java’s sort is stable, the quicksort I wrote is not. All that means is that if multiple entries have the same value they may get sorted differently.
You can do whatever you’d like with the source code. If you do find it useful some credit and/or a link to here would be nice.
Here are the test files: Test Files
Here is the sort itself: Sorting Algorithm
In: Java, Ruby · Tagged with: benchmarks, jruby, optimization, sorting
on October 8, 2009 at 10:09 am