Something New

I’ve now begun work on the k-mismatches problem in which we want to find all alignments of the pattern with the text which give at most k differences between the symbols in the text and pattern. I’ve completed a naive method which runs in O(nm) time and also the O(n\sqrt{m \log m}) method in which we match naively for ‘infrequent’ characters and use an FFT-based approach for ‘frequent’ characters. Perhaps something that needs to be improved here is that, for simplicity, the memory overhead currently depends on the alphabet size to allow a test of whether or not a given character is frequent to take constant time. Given that the char data type is only 1 byte this isn’t too much of a problem but if we wanted to include support for multi-bye characters and hence larger alphabet sizes something slightly smarter would be needed.

So the next step then is to reduce that time complexity of O(n\sqrt{k}\log m). Hopefully this shouldn’t be too difficult; though I suspect that the difficulty largely depends on how easily (or otherwise) it is to implement/find a library to do constant time LCE queries. I also need to give some consideration to how testing should be done for this. It will probably be necessary to ensure worst case performance regardless of the input (e.g. by making the naive implementation carry on looking for mismatches after it’s found more than k) otherwise run-times are going to be quite hard to analyse.

A page at StringPedia also needs to be created and I’ve a feeling that explaining both the theory and implementation details clearly and concisely will require some thought.

Leave a Reply

Your email address will not be published. Required fields are marked *


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">