Python2 & Python3 on the same Apache server

Posted on October 26, 2017 by Ben

Notes on setting up a python 3 application on an apache server already running python 2 applications. This covers Flask, mod_wsgi-express, Apache2 & HTTPS support.

For my latest personal project – a web app allowing my wife to fetch activities tracked on her garmin GPS watch & upload them to runkeeper – I finally decided to take the plunge & switch to Python 3.⁽¹⁾When I began my PhD way back in 2012 and started learning Python, library support for Python 3 wasn’t yet complete enough for me and I’ve been stuck on Python 2 since. All was well when testing locally using Flask’s built in webserver, but things got tricky when I tried to deploy on my VPS where Apache was already serving Python 2 applications.

Running both isn’t directly possible, though the linked page starts to explain the solution. I found bits of this scattered across the web, but no complete descriptions. Hopefully this post might help others in a similar position. Throughout, I will show the setup using paths on my system, with a Flask application deployed in /var/www/apps/garmin with a wsgi script called flask.wsgi, which we want to serve on the URL /garmin .

mod_wsgi-express

What makes this possible at all is mod_wsgi-express. Using the standard mod_wsgi, Apache uses an embedded Python, which prevents running two versions simultaneously. Instead, mod_wsgi-express configures its own apache instance, which can target whatever Python version we like. There is a pip module, which provides mod_wsgi-express though you will need python & apache headers installed (e.g. apt-get install python3-dev apache2-dev).

We can then start a new web server using the following:

mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin &

1	mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin &

This of course needs to be run as a user that can access these files. The –reload-on-changes means that we do not have to restart the server after making changes to our application. A new apache server will be listening on localhost:8000 by default, with the configuration & service scripts available in a directory something like /tmp/mod_wsgi-localhost:8000:106.

Apache2 reverse proxy

With mod_wsgi-express providing a local Apache server, we now need our public-facing Apache instance to forward requests. We set up a Reverse Proxy to forward requests to the appropriate resource to the internal webserver. We can do so this like so:

    ProxyPass         /garmin  http://localhost:8000/
    ProxyPassReverse  /garmin  http://localhost:8000/

1 2	ProxyPass /garmin http://localhost:8000/ ProxyPassReverse /garmin http://localhost:8000/

Unlike standard mod_wsgi, there is no need for us to configure WSGIDaemonProcess, WSGIScriptAlias, WSGIProcessGroup, WSGIApplicationGroup etc. Great!

Fixing the URLs

But wait! Although this appears to work and our site is being served to the world, all of our links that are relative to our site root (i.e. beginning with a /) are broken and point to the public-facing site root. This includes URLs generated by Flask. Fortunately, the fix is quite straightforward – though quite a few guides seem to have this slightly wrong. First, we update the public-facing apache configuration:

    ProxyPass         /garmin  http://localhost:8000/garmin
    ProxyPassReverse  /garmin  http://localhost:8000/garmin

1 2	ProxyPass /garmin http://localhost:8000/garmin ProxyPassReverse /garmin http://localhost:8000/garmin

We then provide the –mount-point flag to mod_wsgi-express:

mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin --mount-point /garmin &

1	mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin --mount-point /garmin &

Finally, we need to make sure xhr requests understand where to route to. We can do this by passing through the script root in our base Flask template:

    <script type=application/javascript>
        var SCRIPT_ROOT = {{ request.script_root|tojson|safe }};
    </script>

var SCRIPT_ROOT = {{ request.script_root|tojson|safe }};

</script>

We then simply prefix this to any site-relative links generated Javascript.

HTTPS support

The final challenge was providing HTTPS support – since we’re forced to scrape Garmin’s website, users must provide their garmin username and password. With an SSL certificate in place, at least they’re not sent in the clear⁽²⁾Yes, this really isn’t nice – originally I’d planned to keep this a private application for my wife and keep her credentials on the server, but this at least means it could be used by more than one person and means I don’t have to authenticate users myself. Using the absolutely awesome Let’s Encrypt, I setup a free, signed HTTPS certificate.

The problem is that the internal, forwarded request is a standard HTTP request (not really a security issue; all internal traffic). However, this means URLs constructed by Flask will use HTTP and not HTTPS. This might not be a problem for most people if you’re forwarding HTTP requests to HTTPS but it’s a huge issue if the URL were to be used in, say, a hash function as in the Oauth redirect_uri! To fix, make sure the public-facing Apache sets an appropriate header:

    ProxyPass         /garmin  http://localhost:8000/
    ProxyPassReverse  /garmin  http://localhost:8000/
    RequestHeader set X-Forwarded-Proto "https"

ProxyPass /garmin http://localhost:8000/

ProxyPassReverse /garmin http://localhost:8000/

RequestHeader set X-Forwarded-Proto "https"

and mod_wsgi-express knows to trust such a header:

mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin --mount-point /garmin --trust-proxy-header "X-Forwarded-Proto" &

1	mod_wsgi-express start-server /var/www/apps/garmin/flask.wsgi --reload-on-changes --working-directory /var/www/apps/garmin --mount-point /garmin --trust-proxy-header "X-Forwarded-Proto" &

In newer versions of Flask – I’m running 0.12.2 — this is good enough and Flask will take care of the rest. It appears that older versions might not to so automatically and you may need to experiment. However, AFAICT the ProxyFix shouldn’t be necessary as our http server is correctly setting the headers for us.

Aside: certbot doesn’t understand WSGI directives

certbot, the tool provided by Let’s Encrypt to auto-configure your setup, doesn’t yet properly understand WSGI directives. In particular, the name given to each WSGIDaemonProcess must be unique. certbot attempts to duplicate these, causing an Apache config error & certbot to rollback the changes. For now, the simplest fix is to comment out such lines before using certbot & then uncomment one of each (or both, renaming the HTTPS daemon process).

[ + ]

1.	↑	When I began my PhD way back in 2012 and started learning Python, library support for Python 3 wasn’t yet complete enough for me and I’ve been stuck on Python 2 since
2.	↑	Yes, this really isn’t nice – originally I’d planned to keep this a private application for my wife and keep her credentials on the server, but this at least means it could be used by more than one person and means I don’t have to authenticate users myself

Analysing the UK General Election Results

Posted on May 11, 2015 by Ben

On Thursday, 7th May, the UK voted in a Conservative government by a small majority, to the surprise of most. The surprise was not so much that the Conservatives won the largest number of seats, but that a majority government could be formed at all – a hung parliament was widely expected. As the final votes were being counted on Friday morning, I set about analysing some of the results. In particular I was interested in how the UK system – in which 650 MPs are elected into parliament by separate votes in each constituency – affected things. I created a webpage to present this; this post is about how I did that.

Getting the data

The dataset I wanted included both the overall results – number of seats and votes won be each party – and the per constituency results. I couldn’t find any website providing this data in an easily accessible form, so I set about scraping the results from the BBC’s live coverage.

Obtaining the overall results proved to be very straightforward, the above page contains the results in a JSON string. So far, so good.

Per constituency results turned out to be a little more tricky. The BBC provided a results page for each constituency (e.g. the constituency I voted in). Unfortunately these results weren’t in a nice JSON format, so I’d have to parse the HTML. The BeautifulSoup Python module is great for this:

#!/usr/bin/env python2
"""
Take the HTML from the constituency page and extract the results into a tsv
"""

import argparse
import json

import BeautifulSoup

parser = argparse.ArgumentParser()
parser.add_argument('filename')
args = parser.parse_args()

with open(args.filename) as handle:
    html_string = handle.read()
    html = BeautifulSoup.BeautifulSoup(html_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)

# Get the constituency name
constituency = html.find("h1", attrs={"class": "constituency-title__title"}).text

# To extract the data, we can just target the table columns with the required
# css classes; they will be in the same order as each other so we don't
# need to worry about getting the parent

names = html.findAll("div", attrs={"class":'party__name--long'})
names = [name.text for name in names]

votes = html.findAll("li", attrs={"class":'party__result--votes essential'})
votes = [int(vote.contents[0].replace(',','')) for vote in votes]

# Construct a datastructure to json encode
data = {}
data["name"] = constituency
data["results"] = zip(names, votes)

# Write data out to a json file
with open(args.filename + ".json", "w") as handle:
    json.dump(data, handle)

#!/usr/bin/env python2

"""

Take the HTML from the constituency page and extract the results into a tsv

"""

import argparse

import json

import BeautifulSoup

parser = argparse.ArgumentParser()

parser.add_argument('filename')

args = parser.parse_args()

with open(args.filename) as handle:

html_string = handle.read()

html = BeautifulSoup.BeautifulSoup(html_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)

# Get the constituency name

constituency = html.find("h1", attrs={"class": "constituency-title__title"}).text

# To extract the data, we can just target the table columns with the required

# css classes; they will be in the same order as each other so we don't

# need to worry about getting the parent

names = html.findAll("div", attrs={"class":'party__name--long'})

names = [name.text for name in names]

votes = html.findAll("li", attrs={"class":'party__result--votes essential'})

votes = [int(vote.contents[0].replace(',','')) for vote in votes]

# Construct a datastructure to json encode

data = {}

data["name"] = constituency

data["results"] = zip(names, votes)

# Write data out to a json file

with open(args.filename + ".json", "w") as handle:

json.dump(data, handle)

Cool. But I still needed to get the data for all the individual constituencies. The BBC results pages clearly had an ID in the title (e.g. http://www.bbc.co.uk/news/politics/constituencies/E140006020), so I just needed to find a list of IDs somewhere. After a bit of digging, I found that the winner of each constituency was being fetched on the results page in a nice JSON string. The network traffic analyser in the Chrome developer tools was great for finding this. A quick bit of manipulation of this JSON structure later, and we have a textfile containing all 650 constituency IDs. A simple bash pipeline later…

cat ../codes.txt | parallel -n1 -j2 'wget "http://www.bbc.co.uk/news/politics/constituencies/{}" -O {}.html'

1	cat ../codes.txt \| parallel -n1 -j2 'wget "http://www.bbc.co.uk/news/politics/constituencies/{}" -O {}.html'

…and we have all the pages we need, fetched two at a time with parallel and wget. By the way, I love GNU Parallel.

With the data now in an easy format to work with, it was pretty straightforward to knock together some quick⁽¹⁾Time was of the essence; I wanted to have this ready to share whilst interest in the results was high Python scripts to extract some results.

Generating graphs

Initially I planned to create a static image with a few graphs. But I decided that it would be easier to share the analysis as webpage. First up, I would need a Javascript graphing library⁽²⁾Ok, so I could have created the images offline, but I was committed to the web endeavour! Plus this would give a little interactivity to the plots. I’ve used Flot in the past, but I decided to try out Google Charts for no reason other than that it’s good to experiment. The documentation was pretty clear, as were the examples, so it was easy to get started. First-up, a bar-chart of how the result would be affected if seats were allocated based on the proportion of the national result won by the party:

function draw_seats(){
    var data = google.visualization.arrayToDataTable([
        ['Party', 'Difference',{ role: 'style' }],
        ['Conservative', -90, COLOURS['con']],
        ['Labour', -33, COLOURS['lab']],
        ['UKIP', 82, COLOURS['ukip']],
        ['Lib. Dem.', 43, COLOURS['ld']],
        ['SNP', -25, COLOURS['snp']],
        ['Green', 24, COLOURS['grn']],
        ['DUP', -4, COLOURS['dup']],
        ['Plaid Cymru', 1, COLOURS['pc']],
    ]);

    var options = {
        animation: {
            "startup": true,
            "duration": 1000
        },
        title: 'Change in seat allocation if number of seats were determined by percentage of vote',
        chartArea: {width: '60%'},
        legend: { position: 'none' },
        height: 250,
        hAxis: {
            title: 'Seats gained/lost if proportional to vote percentage',
            minValue: 0
        },

        vAxis: {
            title: 'Party'
        }
  };

    var chart = new google.visualization.BarChart(document.getElementById('seats_chart_div'));
    chart.draw(data, options);   
}

function draw_seats(){

var data = google.visualization.arrayToDataTable([

['Party', 'Difference',{ role: 'style' }],

['Conservative', -90, COLOURS['con']],

['Labour', -33, COLOURS['lab']],

['UKIP', 82, COLOURS['ukip']],

['Lib. Dem.', 43, COLOURS['ld']],

['SNP', -25, COLOURS['snp']],

['Green', 24, COLOURS['grn']],

['DUP', -4, COLOURS['dup']],

['Plaid Cymru', 1, COLOURS['pc']],

]);

var options = {

animation: {

"startup": true,

"duration": 1000

title: 'Change in seat allocation if number of seats were determined by percentage of vote',

chartArea: {width: '60%'},

legend: { position: 'none' },

height: 250,

hAxis: {

title: 'Seats gained/lost if proportional to vote percentage',

minValue: 0

vAxis: {

title: 'Party'

}

};

var chart = new google.visualization.BarChart(document.getElementById('seats_chart_div'));

chart.draw(data, options);

}

Getting a PNG image of the graph from Google Charts for this post was also straightforward:

I also generated a pie chart of how seats are allocated under the current system. Setting the colours of the pie slices could not be done in the same way as with the colours of the bars. I’ve no idea why; the latter method seemed much simpler and less bloated.

function draw_result(){
    var data = google.visualization.arrayToDataTable([
        ['Party', 'Seats'],
        ['Conservative', 331],
        ['Labour', 232],
        ['SNP', 56],
        ['Lib. Dem.', 8],
        ['DUP', 8],
        ['Plaid Cymru', 3],
        ['UKIP', 1],
        ['Green', 1],
        ['Other', 10],
    ]);

    var options = {
        animation: {
            "startup": true,
            "duration": 1000
        },
        pieSliceText: 'value',
        pieHole: 0.3,
        chartArea:{
            width: '80%',
            height: '80%'
        },
        title: "Number of seats won",
        legend: { position: 'none' },
        slices: [
        {color: COLOURS['con']},
        {color: COLOURS['lab']},
        {color: COLOURS['snp']},
        {color: COLOURS['ld']},
        {color: COLOURS['dup']},
        {color: COLOURS['pc']},
        {color: COLOURS['ukipn']},
        {color: COLOURS['grn']},
        {color: COLOURS['other']},
        ]
  };

    var chart = new google.visualization.PieChart(document.getElementById('results_chart_div'));
    chart.draw(data, options);   
}

function draw_result(){

var data = google.visualization.arrayToDataTable([

['Party', 'Seats'],

['Conservative', 331],

['Labour', 232],

['SNP', 56],

['Lib. Dem.', 8],

['DUP', 8],

['Plaid Cymru', 3],

['UKIP', 1],

['Green', 1],

['Other', 10],

]);

var options = {

animation: {

"startup": true,

"duration": 1000

pieSliceText: 'value',

pieHole: 0.3,

chartArea:{

width: '80%',

height: '80%'

title: "Number of seats won",

legend: { position: 'none' },

slices: [

{color: COLOURS['con']},

{color: COLOURS['lab']},

{color: COLOURS['snp']},

{color: COLOURS['ld']},

{color: COLOURS['dup']},

{color: COLOURS['pc']},

{color: COLOURS['ukipn']},

{color: COLOURS['grn']},

{color: COLOURS['other']},

]

};

var chart = new google.visualization.PieChart(document.getElementById('results_chart_div'));

chart.draw(data, options);

}

Creating the webpage

Since (web)design is really not my forte, I thought I’d try out out Bootstrap to help ensure everything looked halfway reasonable. This was the first time I’ve tried using one of these frameworks, but it was pretty straightforward to get going with for my simple use case. I used the panel components to separate out the different parts of analysis, with a simple grid to present tables and graphs with simple analysis alongside. I was pleased that this seemed to look ok on mobile devices pretty much straight away, although the table is at risk of being cut-off on smaller screens.

Summing up

Despite not having previously used Google Charts or Bootstrap, I would it pretty straightforward to generate this simple webpage. Both of these tools passed the test of being easy to get started with. Overall, I found this a good way to present the information and made it easy to share with friends.

Finally, putting any personal political viewpoints aside, I think it demonstrates how crazy the current system is.

[ + ]

1.	↑	Time was of the essence; I wanted to have this ready to share whilst interest in the results was high
2.	↑	Ok, so I could have created the images offline, but I was committed to the web endeavour! Plus this would give a little interactivity to the plots

WiFi is not always best

Posted on November 16, 2014 by Ben

I own a Nexus 4. Aside from a few niggles with the camera and distinctly average battery life, my phone is great. Particularly as someone who uses Google services, the user experience is pleasant and if feels like everything is working for you.

Well, almost everything. I find myself constantly fighting the phone over network connections – specifically when the phone should connect to available WiFi networks. The switch to WiFi appears extremely aggressive: as soon as a known network is within reach the phone automatically connects, regardless of the strength of the connection, the availability of internet access from the network, or – and here’s the kicker – what I’m currently doing.

Often I will be walking around Bristol, chatting with friends and family on WhatsApp or Google Hangouts whilst using mobile internet. I’ll walk past a cafe I’ve visited before and my phone switches to its WiFi network, stopping that message I was about to send. I swipe down and turn off WiFi. Later on, I’ll be home checking Facebook/Reddit/Hacker News/Any Other Time Sink and be frustrated that all these images are loading so slowly – I need to see that cute kitten now damnit – until I realise WiFi is still off. So I swipe down and re-enable.

Why is this so frustrating? Well, firstly because it is so common: it probably happens multiple times a day. Secondly, it all feels very preventable:

My phone is aware I’m moving⁽¹⁾This doesn’t even require high battery GPS; the constant change of networks is sufficient. Why jump on a new WiFi network that is likely to drop out of range?
Instant messaging is a very low bandwidth activity. I really don’t care about saving my data allowance. I’m with giffgaff and have a generous data plan⁽²⁾My allowance is 3GB/month. I used to be on an unlimited data plan but I don’t mind the new cap. I rarely hit 1GB and a limited data plan allows me to tether, which is occasionally useful to me
Surely the phone could at least test that WiFi network actually provides internet access before connecting? Many networks are unprotected but require a login after connection. This is particularly true of the large, national networks with many hotspots around a city. Inevitably, these are the biggest problem.

Although the phone provides an option to ‘avoid poor connections’, this doesn’t seem to help much⁽³⁾If anything, it just makes thigns worse. It feels like another setting to fight with: there’s been too many times when I have wanted to connect to a network with a weak signal.. Perhaps I could use one of these apps that only enables WiFi at home and and work? Maybe, but this is definitely not tackling the problem and only fixes some of the symptoms. When I visit my friends or parents, I want to use their WiFi network. When I’m actually in a cafe, I might want to use their network. And particularly at work, which for me is actually an entire university campus, WiFi may not be providing me with a good internet connection at any given time. For me, the fix is this:

Use mobile internet unless there is a real reason not to

Where reasons include:

The user has indicated that they much prefer WiFi networks, e.g. because they have a small data plan
There is a stable connection to a known network that also actually provides a better internet connection
The activity I want to perform is high bandwidth. This is the time for my phone to say “hey, there’s a connection that might work, but you’ll need to log in”
Mobile internet coverage or speeds become poor

Doing this requires the phone to take account of how I’m currently using my phone – Am I moving? What’s my network usage look like at the moment? – as well as testing a WiFi connection before switching to that. Is that really so hard?

[ + ]

1.	↑	This doesn’t even require high battery GPS; the constant change of networks is sufficient
2.	↑	My allowance is 3GB/month. I used to be on an unlimited data plan but I don’t mind the new cap. I rarely hit 1GB and a limited data plan allows me to tether, which is occasionally useful to me
3.	↑	If anything, it just makes thigns worse. It feels like another setting to fight with: there’s been too many times when I have wanted to connect to a network with a weak signal.

My running progress over the last year

Posted on October 1, 2014 by Ben

I started running again just over a year ago and have been using RunKeeper for all of that time. I find it pretty motivating to track activities and compare with previous runs. However, RunKeeper doesn’t make it so easy to track your progress over time – at least not in the free version*

While RunKeeper allows you to export your data, they also have an API – so I knocked together a graph (using Flot) of my running history that will automatically update as I do more exercise. Perhaps I’ll write about how I made this in a later blog post and share the code. The graph below shows distance and pace for my running activities – either as raw data, averaged by month or a moving average

Data type:

Show distance as:

When I first made this, I’d run just shy of 200km in 40 running sessions. I’m pretty happy with that! What’s particularly pleasing is that I’ve been managing to run consistently – apart from a blip in January. Over the last few years, I’ve often tried to get back into running but usually had niggly knee issues. It seems that reasonably regular gym sessions have helped to keep other muscles strong – though I should also get back to yoga

It’s also nice to see that my distance has generally been increasing and pace has improved a little. Here’s to the next 200km!

*I suppose the Pro version (which is crazy expensive, but that’s another matter) might have support for this. Since I’m just using this for my own personal data, I don’t think I’m doing against their terms

Back to blogging

Posted on October 1, 2014 by Ben

I have never really got into blogging. The last blog posts I wrote are now over four years old and were just to document a summer research project.

I’ve decided to try now though, for a few reasons:

Presenting some of my PhD work in a more accessible way is a valuable skill
It will hopefully encourage me to work on side projects in my spare time

A quick overhaul of this old WordPress had the lower barrier to entry. A new theme (Decemberable – slightly customised, without the mug shot it seems a little impersonal) makes things look more modern. I was surprised by the number of ‘popular’ WordPress themes that weren’t responsive; looking OK on mobile/tablet is really a must these days. Maybe in the future I’ll look at different blog software or roll something custom though.

Final Report Plus Code Library

Posted on September 15, 2010 by Ben

I’ve now completed my final report for this project, which can be found here.

There’s also a release of all the code, which should be quite friendly to use here.

Apart from some work still to do at StringPedia, that’s pretty much we done for this project. It’s generally been a good experience and I’ve enjoyed it. I’m now looking forward to a little bit of a break before university starts again in October.

Positive Results At Last

Posted on August 24, 2010 by Ben

Until now I’d not run any tests comparing the two methods that can report the hamming distance at all alignments of text and pattern (A special case of k-mismatches where k >= m). The methods are the naive $O(nm)$ approach and the algorithm of Abrahamson/Kosaraju which runs in $O(n\sqrt{m \log{m} })$ . Early results are quite pleasing though! The following shows a graph of run-times for varying pattern sizes with a text size of 4,345,138 (the full length of the source). The text and pattern were taken from an English text sourced from Project Gutenberg:

It’s a little hard to see down at the lower pattern sizes (in fact it’s pretty hard to see at all unless you click for a larger version!), but the Abrahamson/Kosaraju method is faster right down to pattern sizes of about 5! It’s interesting to see that there are some slight bumps in the run time of the algorithm; though this can be attributed to the number of “frequent” characters — for each frequent character matching is done using FFTs, which adds some overhead. It’s further complicated by the fact that a character which occurs only just enough times to be considered frequent will have more relative overhead than a character which is far above the “frequent threshold” — in each case we’ll do the same amount of work, but in the latter case, we should see less use of the “counting” part of the algorithm.

I’ll run some tests on larger text sizes over night and also use some different sources; it’ll be interesting to see how it fares with a smaller alphabet size, perhaps using the DNA corpus from Pizza and Chili

Working, but slow

Posted on August 20, 2010 by Ben

After some work, I’ve now got the new library working — I’m using LCA queries on the suffix tree in order to calculate pattern-pattern LCEs. In addition, I’ve now converted the p-representation generated to work with the new library.

After all that, I was a bit disappointed to find out that new library isn’t performing so well; the old library builds the suffix tree quite a bit faster and is much much faster at creating the p-representations of the text. Things aren’t quite as bad as I first found: it turns out that the new library is affected quite heavily by optimization levels (perhaps unsurprising, since it’s OO and C++ so inlining, unrolling etc are going to be needed for speed) and also that I was using the slower of the two suffix tree methods. I’m currently investigating ways of making if faster : for example, it provides various implementations of suffix arrays, one of which is uncompressed, so I believe should be faster than the default, compressed array. I’m also expecting an update on the library from the author, who tells me that the latest version is faster than the one I’m currently using.

In general, I think it may always be a little slower than the old library: it provides more functionality and focuses on size as well, but I’m still surprised at how much slower it is. The best I’ve seen from it is still 6-7x slower at generating suffix trees and the p-representation generated is even worse.

Let’s hope I can fix this otherwise the best I’ll be able to do is to use the old library to create the p-representations and the new library for LCE calculations. That would involve creating two suffix trees though, so it seems a bit pointless!

Finally getting somewhere (maybe)

Posted on August 17, 2010 by Ben

For the last few days I’ve been fiddling around trying to get this library working. It supports LCA queries on suffix trees, which is quite nice. However, it’s been a bit of a pain to compile and it’s also written in C++ which I have no experience with and have to make work with my existing C code. Oh, and Valgrind doesn’t seem to like it too much (but only if I compile it; the examples compiled with the supplied Makefile run fine through Valgrind — odd) but I’ve emailed the developer about this.

But no matter, I think I’m getting there with actually performing LCE queries with it; certainly I’ve managed LCE queries on a few simple strings. A few bugs have come up trying to integrate it into the rest of my work, but lets hope (and pray) they’re easy to fix.

On a side note, I’m going to have to do a little more work on the rest of the $O(n \sqrt{k} \log{m})$ method anyway. I’ve been able to generate inputs that have the same text and pattern lengths and the same k but fall into different cases becase one has more frequent symbols than the other and, excluding pre-processing time, case 1 has been up to 10x as slow as case 2! Whether or not it’s worth trying to reduce the time complexity to $O(n \sqrt{k \log{k}})$ or it would be better to simply change the maximum number of frequent symbols allowed for case 1 to be used (I would assume that changing it from $2\sqrt{k}$ to some fraction of that wouldn’t affect the time complexity, but I’ll need to check) I don’t know.

Food for thought I guess.

Randomised Results

Posted on August 12, 2010 by Ben

I’ve been meaning to get around to benchmarking a couple of implementations that solve the “matching with don’t cares” problem by using some randomisation. This reduces the number of convolutions we need at the sacrifice of the small chance of a false-positive. More specifically, we can reduce the number of convolutions from 3 to 2, or even 1 if we consider the case where wild-cards are only allowed in the pattern.

The implementation allowing wild-cards in either follows straight from Adam Kalai’s paper; the method only allowing wild-cards in the pattern is a little different though follows the same broad idea. If I get time I guess I should probably implement Kalai’s method for this for completeness.

Anywho, onto the results which are quite pleasing:

In both cases, the point at which the algorithms become faster than the naive method is somewhere between a half and two thirds of the deterministic equivalent. This follows nicely from the reduction in the number of convolutions.
With wild cards in the text and pattern, this is a reduction from around 80 to 45-50.
With no wild cards in the text, we get a reduction from about 50 to 25.

You can see some graphs of this either in SVG format or in PNG format.

These results were repeats for a few different text sizes in the range of 1-4 million; I might repeat it overnight with a larger text size but I don’t expect to see any difference. You can read more about the testing and results over at the page on StringPedia.

Ben Smithers

Rambling thoughts on Programming / Bioinformatics / Personal Life

Python2 & Python3 on the same Apache server

mod_wsgi-express

Apache2 reverse proxy

Fixing the URLs

HTTPS support

Aside: certbot doesn’t understand WSGI directives

Analysing the UK General Election Results

Getting the data

Generating graphs

Creating the webpage

Summing up

WiFi is not always best

My running progress over the last year

Back to blogging

Final Report Plus Code Library

Positive Results At Last

Working, but slow

Finally getting somewhere (maybe)

Randomised Results