A Day Of Frustrating Edge Cases

So I’ve spent most of the day fixing a few bugs and dealing with some edge cases. That’s the nice thing about theory: you can ignore all the slight caveats and small problems. The two problems today have been:

  • Rounding the values of \sqrt{k}. I was rounding down, which meant it wasn’t always possible to find enough matches at a given alignment to reach the threshold beyond which potential mismatches are verified with LCEs.
  • Producing p-representations when not all of the characters in the text occur in the pattern. To deal with this we change the last character of the pattern to some special character and map every text character that doesn’t occur in the pattern to this character as well. You have to be careful if the only occurrence of a character in the pattern is that last character though! There’s a sketch of the replacement step below.
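A minimal sketch of the replacement step as I think of it (the sentinel value and the function name are made up, and whatever compensation the matching code needs afterwards isn’t shown):

    #include <stddef.h>

    #define SIGMA    256      /* alphabet size for plain 1-byte characters    */
    #define SENTINEL '\1'     /* assumed not to appear in the original inputs */

    /* Map every text character that doesn't occur in the (modified) pattern
       to a sentinel that is guaranteed to occur in the pattern, so that a
       p-representation of the text can always be built. */
    static void apply_sentinel(unsigned char *text, size_t n,
                               unsigned char *pattern, size_t m)
    {
        int in_pattern[SIGMA] = {0};

        /* Overwrite the last pattern character first...                      */
        pattern[m - 1] = SENTINEL;

        /* ...and only build the "occurs in the pattern" table afterwards.
           If a character's only occurrence was that final position, it is
           now treated as absent and mapped to the sentinel in the text too;
           building the table from the unmodified pattern would let such
           characters slip through. */
        for (size_t i = 0; i < m; i++)
            in_pattern[pattern[i]] = 1;

        for (size_t i = 0; i < n; i++)
            if (!in_pattern[text[i]])
                text[i] = SENTINEL;
    }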

On a more positive note, some tests that enforce the worst case for the naive method (text and pattern are both mostly a’s, with the necessary number of b’s at the end to ensure a k-mismatch only occurs right at the end) have shown that all the methods are faster even for pretty small pattern sizes. With a text size of 1,000,000, the O(n\sqrt{m\log{m}}) method is faster for pattern sizes greater than 30 or so. Interestingly, at these small pattern sizes the O(n\sqrt{k}\log{m}) method runs at least 1.5 times slower than this, even if we exclude the pre-processing time and use a very small k. This contradicts what I was looking at yesterday, though I suspect those times may have been skewed a bit by a couple of bugs.

Perhaps for larger pattern sizes we’ll see the latter method performing better.

First Results For K-mismatches

I’ve just finished putting together something to time a few of the different methods for solving the k-mismatches problem. It looks like the results could be quite interesting. There are quite a few variables, and it will be interesting to see how they affect the relative performance of the methods. It’s already looking like, for randomly generated texts and patterns, the naive method might be hard to beat. I guess this was also the case for the “exact matching with don’t cares” problem, though I’ve not really run any proper tests with randomized inputs for that.

It may be worthwhile putting more effort into the LCE generation. At the moment the pre-processing time is O(n + m^2) and so is taking a fair while. But I’m seeing results showing that, if I exclude the pre-processing from the timing, the O(n\sqrt{k}\log{m}) method runs quite a lot faster than the O(n\sqrt{m\log{m}}) algorithm with k=1. I guess I should wait till morning and see more results before drawing too many conclusions though! In any case, it will be interesting to time the creation of the suffix tree; in particular I’ll be interested to see how creating a suffix tree for text and pattern together compares with just creating the suffix tree for the pattern and using the “p-representation” for the text.

Something New

I’ve now begun work on the k-mismatches problem, in which we want to find all alignments of the pattern with the text that give at most k differences between the symbols of the text and pattern. I’ve completed a naive method which runs in O(nm) time and also the O(n\sqrt{m \log m}) method, in which we match naively for ‘infrequent’ characters and use an FFT-based approach for ‘frequent’ characters. Perhaps something that needs to be improved here is that, for simplicity, the memory overhead currently depends on the alphabet size, so that testing whether or not a given character is frequent takes constant time. Given that the char data type is only 1 byte this isn’t too much of a problem, but if we wanted to support multi-byte characters, and hence larger alphabets, something slightly smarter would be needed.
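For the record, the constant-time “is this character frequent?” test is just a lookup table over the alphabet, something like the sketch below (the \sqrt{m \log m} threshold is the usual balancing point from the analysis; the names here are illustrative rather than the actual code):

    #include <math.h>
    #include <stdlib.h>

    #define SIGMA 256                   /* one entry per possible char value */

    /* Build a lookup table so that "is character c frequent?" is an O(1)
       test.  The table costs O(SIGMA) memory, which is where the dependence
       on the alphabet size comes from. */
    static unsigned char *build_frequent_table(const unsigned char *pattern,
                                               size_t m)
    {
        size_t count[SIGMA] = {0};
        unsigned char *frequent = calloc(SIGMA, 1);
        double threshold = sqrt((double)m * log2((double)m));

        for (size_t i = 0; i < m; i++)
            count[pattern[i]]++;
        for (int c = 0; c < SIGMA; c++)
            frequent[c] = (count[c] > threshold);
        return frequent;
    }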

So the next step is to reduce the time complexity to O(n\sqrt{k}\log m). Hopefully this shouldn’t be too difficult, though I suspect that the difficulty largely depends on how easy (or otherwise) it is to implement, or find a library for, constant-time LCE queries. I also need to give some consideration to how testing should be done for this. It will probably be necessary to ensure worst-case behaviour regardless of the input (e.g. by making the naive implementation carry on looking for mismatches after it’s found more than k), otherwise run-times are going to be quite hard to analyse.
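To sketch where the LCE queries come in, the verification step is the standard “kangaroo jumping” idea. The naive lce() below is only a stand-in so the snippet is self-contained; the whole point of the suffix-tree machinery is to answer that query in constant time.

    #include <stddef.h>

    /* Stand-in LCE: length of the longest common extension of text[i..] and
       pattern[j..].  In the real thing this should be an O(1) query. */
    static size_t lce(const char *text, size_t n, size_t i,
                      const char *pat,  size_t m, size_t j)
    {
        size_t l = 0;
        while (i + l < n && j + l < m && text[i + l] == pat[j + l])
            l++;
        return l;
    }

    /* Count the mismatches at alignment i, giving up (and returning k + 1)
       as soon as more than k have been seen.  Note that this early exit is
       exactly what the naive code must NOT do if we want fair worst-case
       timings. */
    static size_t mismatches_at(const char *text, size_t n,
                                const char *pat,  size_t m,
                                size_t i, size_t k)
    {
        size_t mism = 0, j = 0;
        while (j < m) {
            size_t l = lce(text, n, i + j, pat, m, j);
            if (j + l >= m)
                break;                /* matched all the way to the end      */
            j += l + 1;               /* jump the run of matches, count one  */
            if (++mism > k)
                break;                /* more than k: this alignment fails   */
        }
        return mism;
    }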

A page at StringPedia also needs to be created and I’ve a feeling that explaining both the theory and implementation details clearly and concisely will require some thought.

Final Results (Kind of)

The wiki page has now been updated with the final set of results for small pattern sizes. I’ve upped the number of repeats for each test from 3 to 7, so everything’s taking a little longer, but I’ll set up some new tests for larger pattern sizes to run over the weekend. Fortunately, I’ve now modified the test harness to output a data file with which I can plot a graph with gnuplot nice and quickly. Not quite sure why I wasn’t doing that before to be honest. This also has the bonus of producing SVG output.

The latest set of results includes real-to-real transforms, the O(n \log m) algorithm with a minimum sub-string size, a variation where no wild-cards are allowed in the text, and combinations of the above. To summarize some results:

  • All methods seem to be performing better relative to the naive approach. I can only assume this is related to improvements in the compiler after I updated it earlier in the week.
  • The absolute best seems to be a combination of the above (using every trick in the book, you might say): we use the O(n \log m) algorithm with a minimum sub-string size, disallow wild-cards in the text, and use the real-to-real transform. This beats the naive method on all but the first sample point at m=32.
  • Forcing a minimum size has completely gotten rid of the erratic behaviour for small m
  • If we allow wild cards in the text, the O(n \log n) and O(n \log m) versions become faster than the naive approach when m is greater than approximately 300 and 80 respectively.
  • These results seem to hold fairly well for different n — I’m currently running a test with n=2^{21}=2,097,152 which should verify that.
  • FLINT is the exception to the consistency rule — it gets noticeably worse relative to the naive method as n increases (the crossover point has ranged from m > 600 to m > 900 so far)

There are plenty of graphs on the wiki page here, as well as updated results. The FFT page has also been updated.

So, still to do:

  • Produce final results for large m
  • Put the library of methods together so it’s a bit more usable and upload it to the wiki
  • Maybe give some consideration to memory usage

A Day Of Wiki-ing

So most of the day has been spent updating the wiki with the findings of the last set of tests. The page for exact matching with don’t cares and all the pages it links to are starting to look a lot more complete. It’s taken a bit longer than I’d hoped, but I think there are a few interesting results. There’s still more to be updated on the wiki; I think I may just try to add a little more to the section comparing all the methods and then update the rest side-by-side with whatever I start to look at next. It’s quite hard and time-consuming trying to write clearly and concisely for the wiki, but mixing it in with some research and coding should provide a nice break.

Wisdom

The tests that ran last night showed that the naive method bettered the O(n \log n) version of the algorithm for pattern sizes up to somewhere between 800 and 1500 (depending on text size), whilst the O(n \log m) version ran faster once m was greater than about 750-800:

Graph showing run-time of various algorithms for n=524,288 and varying pattern size

However, after running those tests, FFTW’s “wisdom” generator ran overnight too (it basically produces the information required for plans of the specified sizes), and with the wisdom in place we can get even greater speed-ups from FFTW. So bad news for the naive method, and for FLINT, which has only occasionally bettered FFTW, and then only when the pattern size was so small that the naive method wins by far anyway.
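For anyone following along, the wisdom round-trip is just the standard FFTW import/export calls; something along these lines (the file name and transform size are placeholders, not what the harness actually uses):

    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        /* Load any previously generated wisdom so planning can reuse it. */
        FILE *f = fopen("wisdom.dat", "r");
        if (f) { fftw_import_wisdom_from_file(f); fclose(f); }

        int n = 1 << 20;
        double       *in  = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));

        /* With matching wisdom already loaded this returns almost instantly;
           without it, FFTW_PATIENT planning is what the overnight run spends
           its time on. */
        fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_PATIENT);

        for (int i = 0; i < n; i++)   /* planning may have clobbered `in`  */
            in[i] = 0.0;
        fftw_execute(p);

        /* Save the (possibly extended) wisdom for next time. */
        f = fopen("wisdom.dat", "w");
        if (f) { fftw_export_wisdom_to_file(f); fclose(f); }

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }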

So I’ve set up some even more thorough tests to run over the weekend which should give us a really detailed view of both small pattern sizes and how things scale (though I’m not bothering to run the naive method on large pattern sizes — it takes too long!). That should give us the final results. In particular, it’s still not that clear when the O(n \log m) version (padded to a power of 2 or otherwise) beats the O(n \log n) version. This is a particularly tricky question given that the run-times for the O(n \log m) version are quite unstable as m (or the transform size, in the padded case) varies. Fortunately, I think I may have found the reason for this variation and the strangely long run-times for small m: FFTW’s own benchmarks show a great deal of speed variation too!

They have a whole set of benchmarks on various platforms and for various transforms here. Of particular note is that, for transforms whose sizes are powers of 2, FFTW (and most other FFT implementations) is quite slow for small transforms in comparison to larger ones, as the following image shows:

FFTW’s benchmarks for single precision, real 1D data

So in the case of the O(n \log m) version, when m is small, we’re doing lots and lots of quite slow transforms! In addition, when the size of the transform isn’t a power of 2, there’s a huge variation in the speed of the transform:

Both of these effects show up across the different platforms, though in slightly different ways, so I think this probably goes a long way towards explaining the occasional oddities in run-times.

The plan now then is to produce some final graphs on Monday and that should hopefully wrap things up for this particular problem.

Nearly There

So I think I’m nearly there with the matching with don’t cares algorithms, though I’ll need to do some work on the wiki pages once I’ve produced some statistics from the tests I’ve got running overnight. There’s still the unanswered question of whether or not using floating-point arithmetic is going to cause a problem (with false positives, for example), but it’s probably time to move on. If there’s an opportunity to look at this again later at some point, that would be nice…especially since the new number theory library (FLINT) seems slower than FFTW.

Getting to the bottom of some of the oddities in the timing (though the O(n\log m) version without any padding is still throwing up a few surprising results) really does highlight the need to be careful when implementing some of the theory; all the stuff we ignore or “hide” in the asymptotic notation suddenly becomes a problem.

Also, we now have LaTeX on the wiki!

Or maybe it’s to do with FFTW

So after having spent a while fixing my implementation of the Fourier Transform, a few early results seem to be:

  • Using either FFTW or my own Fourier Transform, the run-time is remarkably unaffected by the pattern size. I guess the theory tells us this much, but it’s still surprising to see an average run-time of 0.36 across about 7 different pattern sizes!
  • Using my own Fourier Transform, the O(n log m) version is always quicker than the O(n log n) version, and the larger n/m is, the greater the difference.
  • Using FFTW, this doesn’t quite follow. A graph of the O(n log m) version would show a curve, starting and finishing higher (as m varies) than the O(n log n) version, but it does appear to dip below in the middle.

Some more tests will run overnight, which should firm up these initial findings. It would seem, then, that the overhead is with FFTW and not the Fourier Transform per se, though FFTW does, unsurprisingly, out-perform my code by a very long way, except when the pattern is very short. I hope to get to the bottom of this bizarre time increase for very small patterns tomorrow.

Back To Basics

It really does seem like there is some kind of sweet spot in which the O(n log m) version of the FFT algorithm is faster than the O(n log n) version. Perhaps this makes sense…if m is very small in comparison to n, the sheer number of Fourier Transforms and their associated overhead outweighs the asymptotic benefit, but if m is large compared to n, O(n log m) is roughly the same as O(n log n) and the increased complexity means the former takes longer. In any case, it’s back to basics and I’m currently implementing the Fourier Transform myself, for two reasons: firstly so that I can verify that the above results aren’t due to FFTW’s “plans”, and secondly because it should provide a starting point for implementing the NTT myself, which I think may be necessary in order to fully understand what’s going on. Some more research on the NTT has yielded some potential bounds on the input needed to prevent any overflow, but the more I read the less clear the workings of the library become. Understanding the NTT has also required a bit of a “back to basics” approach…much of this morning was spent trying to get some basic number theory concepts clear in my head, which should hopefully help with the NTT as a whole.
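As a note to self, the “basic number theory” in question is mostly about what an NTT modulus has to look like: a prime p = c·2^s + 1 that is larger than any convolution value can get (roughly m times the square of the largest input value), together with a 2^s-th root of unity modulo p. A toy example of that picture, which is just the textbook view and nothing to do with how the library does it internally:

    #include <stdio.h>
    #include <stdint.h>

    /* Modular exponentiation; p < 2^31, so the products fit in 64 bits. */
    static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t mod)
    {
        uint64_t result = 1 % mod;
        base %= mod;
        while (exp > 0) {
            if (exp & 1)
                result = result * base % mod;
            base = base * base % mod;
            exp >>= 1;
        }
        return result;
    }

    int main(void)
    {
        /* 998244353 = 119 * 2^23 + 1 is a well-known NTT-friendly prime and
           3 is a primitive root, so w is a 2^23-th root of unity mod p,
           usable for transforms of length up to 2^23. */
        const uint64_t p = 998244353ULL;
        const uint64_t g = 3ULL;
        uint64_t w = modpow(g, (p - 1) >> 23, p);

        printf("omega        = %llu\n", (unsigned long long)w);
        printf("omega^(2^23) = %llu (should be 1)\n",
               (unsigned long long)modpow(w, 1ULL << 23, p));
        printf("omega^(2^22) = %llu (should be p - 1)\n",
               (unsigned long long)modpow(w, 1ULL << 22, p));
        return 0;
    }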

So at the end of the first week, I feel it’s been a bit up and down. At times you seem to be getting a lot done but there are also times when it’s hard going. I guess that’s the nature of anything like this.

Plan for next week: get some definitive results on the algorithms for exact pattern matching with don’t cares.

Not so fast

Progress made in all of the intended areas today. Managed to get something working with NTTs using the library, though there’s a question over false positives. I do seem to be getting some of those now, but that may be an implementation error. Something to look into tomorrow I think.

I’ve been putting together a test harness today as well in order to run the algorithms and benchmark them. Early results are a little unexpected. So far it seems that the theoretically faster O(n log m) FFT-based algorithm is generally slower than the O(n log n) version (though it does have some advantages — the smaller memory footprint means I’ve been able to test a few larger inputs). However, there do seem to be *some* input sizes that reverse the trend. Perhaps there’s a sweet spot of pattern size in relation to the text size. More work is needed, but I think the problem may be down to the sheer number of Fourier Transforms being performed. Perhaps some tweaking with FFTW is required. Annoyingly, I thought I’d realised my mistake…I was re-computing the Fourier Transforms of the pattern for each substring of the text (as I was using the original algorithm as a black box)…but after fixing that there was only a fairly modest improvement.
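For what it’s worth, the structural fix is just to hoist the pattern’s transform out of the per-block loop. The toy program below shows the shape of it for a plain cross-correlation (the block size, layout and printed values are illustrative; the actual matching algorithms do rather more with the correlations than print them):

    #include <stdio.h>
    #include <string.h>
    #include <complex.h>
    #include <fftw3.h>

    int main(void)
    {
        const char *text    = "abracadabraabracadabra";
        const char *pattern = "abra";
        int n  = (int)strlen(text), m = (int)strlen(pattern);
        int B  = 2 * m;                 /* block / transform length          */
        int nc = B / 2 + 1;             /* length of an r2c transform        */

        double       *buf = fftw_malloc(sizeof(double) * B);
        double       *out = fftw_malloc(sizeof(double) * B);
        fftw_complex *P   = fftw_malloc(sizeof(fftw_complex) * nc);
        fftw_complex *T   = fftw_malloc(sizeof(fftw_complex) * nc);

        fftw_plan fwd = fftw_plan_dft_r2c_1d(B, buf, T, FFTW_ESTIMATE);
        fftw_plan inv = fftw_plan_dft_c2r_1d(B, T, out, FFTW_ESTIMATE);

        /* The pattern's transform doesn't depend on the block, so compute
           it ONCE, before the loop over text blocks (this was the mistake). */
        for (int i = 0; i < B; i++)
            buf[i] = (i < m) ? (double)pattern[i] : 0.0;
        fftw_execute_dft_r2c(fwd, buf, P);

        for (int start = 0; start + m <= n; start += m) {
            /* Load a block of length 2m, zero-padded past the text's end.   */
            for (int i = 0; i < B; i++)
                buf[i] = (start + i < n) ? (double)text[start + i] : 0.0;
            fftw_execute_dft_r2c(fwd, buf, T);

            /* Multiply by the conjugate of the pattern's transform and
               invert: out[j] / B is then sum_i pattern[i]*text[start+j+i].  */
            for (int k = 0; k < nc; k++)
                T[k] *= conj(P[k]);
            fftw_execute(inv);

            for (int j = 0; j < m && start + j + m <= n; j++)
                printf("alignment %2d: correlation %.0f\n",
                       start + j, out[j] / B);
        }

        fftw_destroy_plan(fwd);
        fftw_destroy_plan(inv);
        fftw_free(buf); fftw_free(out); fftw_free(P); fftw_free(T);
        return 0;
    }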