Something New |

I’ve now begun work on the k-mismatches problem in which we want to find all alignments of the pattern with the text which give at most k differences between the symbols in the text and pattern. I’ve completed a naive method which runs in $O(nm)$ time and also the $O(n\sqrt{m \log m})$ method in which we match naively for ‘infrequent’ characters and use an FFT-based approach for ‘frequent’ characters. Perhaps something that needs to be improved here is that, for simplicity, the memory overhead currently depends on the alphabet size to allow a test of whether or not a given character is frequent to take constant time. Given that the char data type is only 1 byte this isn’t too much of a problem but if we wanted to include support for multi-bye characters and hence larger alphabet sizes something slightly smarter would be needed.

So the next step then is to reduce that time complexity of $O(n\sqrt{k}\log m)$ . Hopefully this shouldn’t be too difficult; though I suspect that the difficulty largely depends on how easily (or otherwise) it is to implement/find a library to do constant time LCE queries. I also need to give some consideration to how testing should be done for this. It will probably be necessary to ensure worst case performance regardless of the input (e.g. by making the naive implementation carry on looking for mismatches after it’s found more than k) otherwise run-times are going to be quite hard to analyse.

A page at StringPedia also needs to be created and I’ve a feeling that explaining both the theory and implementation details clearly and concisely will require some thought.

Ben Smithers

Rambling thoughts on Programming / Bioinformatics / Personal Life

Something New

Leave a Reply Cancel reply