Precise removal of host DNA sequences

August 2023

In search of a lightweight approach for accurately removing human reads from microbial genomes and metagenomes, I found that existing methods removed surprisingly many bacterial reads, even from genomes with negligible similarity to the human genome. This motivated me to develop Hostile (preprint), which removes host reads from FASTQ files with an order of magnitude fewer false positives than existing approaches.

Memorable hash-based identifiers for SARS-CoV-2 sequences

January 2021

The emergence of increasingly convoluted ‘constellations’ of different SARS-CoV-2 variants is proving challenging for those attempting to organise lineage naming. Assigning pronounceable names following a coherently organised structure whilst at the same time acknowledging clinically significant mutations is a thorny problem, stimulating interesting Twitter discussion in recent days. Delay and confusion over SARS-CoV-2 lineage naming have caused headaches for many, including myself, slowing discourse about emerging variants when it matters most.

4G+ in a Somerset swamp

September 2020

The UK suffers from a long tail of woeful rural connectivity, where in 2019 a third of households received <10Mb/s, including 10% below 2Mb/s. Unfortunately my mother’s address is one such location. Halfway between telephone exchanges in the Somerset Levels between Yeovil and Glastonbury, Openreach ADSL2+ gets us roughly one megabit of internet plumbing. An eagerly anticipated Fibre To The Cabinet upgrade improved the situation, but the fibre cabinet turned out to be in a village two miles away, actually nearer to the exchange than our address. Speeds increased to 5Mb/s on a good day, but the connection is flaky and dislikes precipitation. This doesn’t cut it for working from home and has been driving us loopy. Openreach apparently wants £100k to lay fibre to our village and has no plans to improve the situation. 4G (LTE) mobile coverage is poor but existent in our (thankfully flat) area. Given that no one can be bothered to lay fibre here, I also doubt that there is much danger of 5G becoming an option for another decade or so, much to the relief of the weirdly powerful local tin foil hat brigade.

Plotting lineage persistence with Bokeh

March 2020

Lineage persistence plots can be useful for exploring trends in infectious disease outbreaks over time. In some recent work on bugs growing in hospital sinks, I used the one below to help show that sink drains are colonised by a handful of E. coli lineages, which occasionally overlap with infections seen in patients staying on the same wards. This is a small dataset, but the interactivity of Bokeh is proving useful for exploring a larger version of this dataset, where room for annotations is limited. The code below clusters a dataframe of SNP distances (produced here from a recombination-adjusted SNP phylogeny), and uses a dataframe of sampling dates to produce a plot much like the one below.
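As a rough illustration of the clustering step only, here is a minimal sketch and not the code from the post itself. It assumes single-linkage clustering at a fixed SNP threshold, uses plain dictionaries in place of the dataframes, and runs on made-up sample names, distances and dates. The function names `cluster_lineages` and `lineage_spans` and the threshold of 10 SNPs are all hypothetical.

```python
from datetime import date


def cluster_lineages(distances, threshold):
    """Group samples into lineages by single linkage: two samples share a
    lineage if a chain of pairwise distances <= threshold connects them.
    Assumes a full symmetric matrix stored as a dict of dicts."""
    parent = {sample: sample for sample in distances}

    def find(sample):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]
            sample = parent[sample]
        return sample

    for a, row in distances.items():
        for b, snps in row.items():
            if a != b and snps <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for sample in distances:
        clusters.setdefault(find(sample), []).append(sample)
    return sorted(sorted(members) for members in clusters.values())


def lineage_spans(clusters, dates):
    """Return (members, first_seen, last_seen) for each lineage, i.e. the
    horizontal extent of each line in a persistence plot."""
    return [
        (members, min(dates[s] for s in members), max(dates[s] for s in members))
        for members in clusters
    ]


# Made-up toy data for illustration only.
distances = {
    "sink_1": {"sink_1": 0, "sink_2": 3, "patient_1": 8, "sink_3": 250},
    "sink_2": {"sink_1": 3, "sink_2": 0, "patient_1": 12, "sink_3": 240},
    "patient_1": {"sink_1": 8, "sink_2": 12, "patient_1": 0, "sink_3": 260},
    "sink_3": {"sink_1": 250, "sink_2": 240, "patient_1": 260, "sink_3": 0},
}
dates = {
    "sink_1": date(2019, 1, 10),
    "sink_2": date(2019, 3, 2),
    "patient_1": date(2019, 2, 14),
    "sink_3": date(2019, 1, 20),
}

clusters = cluster_lineages(distances, threshold=10)
spans = lineage_spans(clusters, dates)
for members, first_seen, last_seen in spans:
    print(members, first_seen, last_seen)
```

Each span gives the first and last sampling date of a lineage, which is roughly what a plotting library such as Bokeh would need to draw one line per lineage with a marker per sample.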