Final Report Plus Code Library

I’ve now completed my final report for this project, which can be found here.

There’s also a release of all the code, which should be quite friendly to use here.

Apart from some work still to do at StringPedia, that’s pretty much we done for this project. It’s generally been a good experience and I’ve enjoyed it. I’m now looking forward to a little bit of a break before university starts again in October.

Positive Results At Last

Until now I’d not run any tests comparing the two methods that can report the hamming distance at all alignments of text and pattern (A special case of k-mismatches where k >= m). The methods are the naive O(nm) approach and the algorithm of Abrahamson/Kosaraju which runs in O(n\sqrt{m \log{m} }). Early results are quite pleasing though! The following shows a graph of run-times for varying pattern sizes with a text size of 4,345,138 (the full length of the source). The text and pattern were taken from an English text sourced from Project Gutenberg:

It’s a little hard to see down at the lower pattern sizes (in fact it’s pretty hard to see at all unless you click for a larger version!), but the Abrahamson/Kosaraju method is faster right down to pattern sizes of about 5! It’s interesting to see that there are some slight bumps in the run time of the algorithm; though this can be attributed to the number of “frequent” characters — for each frequent character matching is done using FFTs, which adds some overhead. It’s further complicated by the fact that a character which occurs only just enough times to be considered frequent will have more relative overhead than a character which is far above the “frequent threshold” — in each case we’ll do the same amount of work, but in the latter case, we should see less use of the “counting” part of the algorithm.

I’ll run some tests on larger text sizes over night and also use some different sources; it’ll be interesting to see how it fares with a smaller alphabet size, perhaps using theĀ  DNA corpus from Pizza and Chili

Working, but slow

After some work, I’ve now got the new library working — I’m using LCA queries on the suffix tree in order to calculate pattern-pattern LCEs. In addition, I’ve now converted the p-representation generated to work with the new library.

After all that, I was a bit disappointed to find out that new library isn’t performing so well; the old library builds the suffix tree quite a bit faster and is much much faster at creating the p-representations of the text. Things aren’t quite as bad as I first found: it turns out that the new library is affected quite heavily by optimization levels (perhaps unsurprising, since it’s OO and C++ so inlining, unrolling etc are going to be needed for speed) and also that I was using the slower of the two suffix tree methods. I’m currently investigating ways of making if faster : for example, it provides various implementations of suffix arrays, one of which is uncompressed, so I believe should be faster than the default, compressed array. I’m also expecting an update on the library from the author, who tells me that the latest version is faster than the one I’m currently using.

In general, I think it may always be a little slower than the old library: it provides more functionality and focuses on size as well, but I’m still surprised at how much slower it is. The best I’ve seen from it is still 6-7x slower at generating suffix trees and the p-representation generated is even worse.

Let’s hope I can fix this otherwise the best I’ll be able to do is to use the old library to create the p-representations and the new library for LCE calculations. That would involve creating two suffix trees though, so it seems a bit pointless!

Finally getting somewhere (maybe)

For the last few days I’ve been fiddling around trying to get this library working. It supports LCA queries on suffix trees, which is quite nice. However, it’s been a bit of a pain to compile and it’s also written in C++ which I have no experience with and have to make work with my existing C code. Oh, and Valgrind doesn’t seem to like it too much (but only if I compile it; the examples compiled with the supplied Makefile run fine through Valgrind — odd) but I’ve emailed the developer about this.

But no matter, I think I’m getting there with actually performing LCE queries with it; certainly I’ve managed LCE queries on a few simple strings. A few bugs have come up trying to integrate it into the rest of my work, but lets hope (and pray) they’re easy to fix.

On a side note, I’m going to have to do a little more work on the rest of the O(n \sqrt{k} \log{m}) method anyway. I’ve been able to generate inputs that have the same text and pattern lengths and the same k but fall into different cases becase one has more frequent symbols than the other and, excluding pre-processing time, case 1 has been up to 10x as slow as case 2! Whether or not it’s worth trying to reduce the time complexity to O(n \sqrt{k \log{k}}) or it would be better to simply change the maximum number of frequent symbols allowed for case 1 to be used (I would assume that changing it from 2\sqrt{k} to some fraction of that wouldn’t affect the time complexity, but I’ll need to check) I don’t know.

Food for thought I guess.

Randomised Results

I’ve been meaning to get around to benchmarking a couple of implementations that solve the “matching with don’t cares” problem by using some randomisation. This reduces the number of convolutions we need at the sacrifice of the small chance of a false-positive. More specifically, we can reduce the number of convolutions from 3 to 2, or even 1 if we consider the case where wild-cards are only allowed in the pattern.

The implementation allowing wild-cards in either follows straight from Adam Kalai’s paper; the method only allowing wild-cards in the pattern is a little different though follows the same broad idea. If I get time I guess I should probably implement Kalai’s method for this for completeness.

Anywho, onto the results which are quite pleasing:

  • In both cases, the point at which the algorithms become faster than the naive method is somewhere between a half and two thirds of the deterministic equivalent. This follows nicely from the reduction in the number of convolutions.
  • With wild cards in the text and pattern, this is a reduction from around 80 to 45-50.
  • With no wild cards in the text, we get a reduction from about 50 to 25.

You can see some graphs of this either in SVG format or in PNG format.

These results were repeats for a few different text sizes in the range of 1-4 million; I might repeat it overnight with a larger text size but I don’t expect to see any difference. You can read more about the testing and results over at the page on StringPedia.

A Day Of Frustrating Edge Cases

So I’ve spent most of the day fixing a few bugs and dealing with some edge cases. That’s the nice thing about theory: you can ignore all the slight caveats and small problems. The two problems today have been: 1.) rounding of the values of \sqrt{k} — I ran into a problem because I was rounding down, resulting in it not being possible to find enough matches for a given alignment to surpass the threshold at which potential mis-matches are verified with LCEs and 2.) producing p-representations when not all of the characters in the text are in the pattern — to deal with this we change the last character of the pattern to some special character and set all characters that don’t occur in the pattern to this character as well. You have to be careful if the only occurrence of a character in the pattern is the last character though!

On a more positive note, some tests that enforce the worst case for the naive method (text and pattern are both mostly a’s with the necessary number of b’s at the end to ensure a k-mismatch only occurs right at the end) have shown that all the methods are faster even for pretty small pattern sizes. With a text size of 1,000,000, the O(n\sqrt{m\log{m}}) method is faster from pattern sizes greater than 30 or so. Interestingly, at these small pattern sizes the O(n\sqrt{k}\log{m}) is running at least 1.5x as slow as this, even if we exclude the pre-processing time and use a very small k. This contradicts what I was looking at yesterday, though I suspect those times may be have been a bit skewed by a couple of bugs.

Perhaps for larger pattern sizes we’ll see the latter method performing better.

First Results For K-mismatches

I’ve just finished putting together something to time a few of the different methods for solving the k-mismatches problem. It looks like the results could be quite interesting. There’s quite a few variables and it will be interesting to see how that affects the relative performance of the methods. It’s already looking like that, for randomly generated texts and patterns, the naive method might be hard to beat. I guess this was also the case for the “exact matching with don’t cares problem”, though I’ve not really run any proper tests with randomized inputs for that.

It may be worthwhile putting more effort into the LCE generation. At the moment the pre-processing time is O(n + m^2) and so is taking a fair while. But I’m seeing results that show that, if I exclude the pre-processing from the timing, the O(n\sqrt{k}\log{m}) method is running quite a lot faster than the O(n\sqrt{m\log{m}}) algorithm with k=1. I guess I should wait till morning and see more results before drawing too many conclusions though! In any case, it will be interesting to time the creation of the suffix tree; in particular I’ll be interested to see how the creation of a suffix tree for text and pattern compares with just creating the suffix free for the pattern and using the “p-representation” for the text.

Something New

I’ve now begun work on the k-mismatches problem in which we want to find all alignments of the pattern with the text which give at most k differences between the symbols in the text and pattern. I’ve completed a naive method which runs in O(nm) time and also the O(n\sqrt{m \log m}) method in which we match naively for ‘infrequent’ characters and use an FFT-based approach for ‘frequent’ characters. Perhaps something that needs to be improved here is that, for simplicity, the memory overhead currently depends on the alphabet size to allow a test of whether or not a given character is frequent to take constant time. Given that the char data type is only 1 byte this isn’t too much of a problem but if we wanted to include support for multi-bye characters and hence larger alphabet sizes something slightly smarter would be needed.

So the next step then is to reduce that time complexity of O(n\sqrt{k}\log m). Hopefully this shouldn’t be too difficult; though I suspect that the difficulty largely depends on how easily (or otherwise) it is to implement/find a library to do constant time LCE queries. I also need to give some consideration to how testing should be done for this. It will probably be necessary to ensure worst case performance regardless of the input (e.g. by making the naive implementation carry on looking for mismatches after it’s found more than k) otherwise run-times are going to be quite hard to analyse.

A page at StringPedia also needs to be created and I’ve a feeling that explaining both the theory and implementation details clearly and concisely will require some thought.

Final Results (Kind of)

The wiki page has now been updated with the final set of results for small pattern sizes. I’ve upped the number of repeats for each test from 3 to 7, so everything’s taking a little longer, but I’ll set up some new tests for larger pattern sizes to run over the weekend. Fortunately, I’ve now modified the test harness to output a data file with which I can plot a graph with gnuplot nice and quickly. Not quite sure why I wasn’t doing that before to be honest. This also has the bonus of producing SVG output.

The latest set of results include real-to-real transforms, the O(n \log m) algorithm with a minimum sub-string size, a variation where no wild cards are allowed in the text and combinations of the above. To summarize some results:

  • All methods seem to be performing better wrt to the naive approach. I can only assume this is relation to improvements in the compiler after I updated earlier in the week.
  • The absolute best seems to be a combination of the above (using every trick in the book you might say): we use the O(n \log m) algorithm, with a minimum sub-string size, disallowing wild-cards in the text, and use the real-to-real transform. This beats the naive method on all but the first sample point at m=32.
  • Forcing a minimum size has completely gotten rid of the erratic behaviour for small m
  • If we allow wild cards in the text, the O(n \log n) and O(n \log m) versions become faster than the naive approach when m is greater than approximately 300 and 80 respectively.
  • These results seem to hold fairly well for different n — i’m currently running a test with n=2^{21}=2097152 which should verify that.
  • FLINT is the exception to the consistency rule — it gets noticeably worse wrt to the naive method as m increases (ranging from being faster for m > 600 to m > 900 so far)

There’s plenty of graphs on the wiki page here, as well as updated results. The FFT page has also been updated.

So, still to do:

  • Produce final results for large m
  • Put the library of methods together so it’s a bit more usable and upload it to the wiki
  • Maybe give some consideration to memory usage

A Day Of Wiki-ing

So most of the day has been spent updating the wiki with the findings of the last set of tests. The page for exact matching with don’t cares and all the pages it links to are starting to look a lot more complete. It’s taken a bit longer than I’d hoped, but I think there’s a few interesting results. There’s still more to be updated on the wiki; I think I may just try to add a little more to the section on comparing all the methods and then update the rest side-by-side with whatever I start to look at next…it’s quite hard and time consuming trying to write clearly and concisely for the wiki, but mixing it in with some research and coding should provide a nice break.