PopPUNK can be used from your browser, Galaxy servers or the command line.
Install/Use PopPUNK
Use existing databases to assign new genomes to clusters, and place them in the context of these populations.
Browse databases
See how to use PopPUNK to solve common genomic epidemiology problems.
See all guides
PopPUNK is software for bacterial genomic epidemiology. It can rapidly calculate core and accessory distances between genomes, use these distances to cluster genomes, assign clusters to new genomes using an existing database, and produce visualisations of these outputs.
PopPUNK uses comparisons of sequence content at a range of k-mer (substring) lengths to fit a relationship between core and accessory distances (the 'database'). Sketching is used to reduce the number of k-mers that need to be kept, massively increasing efficiency.
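As a rough illustration of the fit described above, the sketch below assumes the relationship used in the PopPUNK method, where the probability that a k-mer of length k matches between two genomes is approximately (1 - a)(1 - c)^k, with c the core distance and a the accessory distance. Taking logs gives a straight line in k, so the slope yields c and the intercept yields a. The function names are hypothetical, exact k-mer sets stand in for sketches, and reverse complements are ignored; this is not PopPUNK's own implementation.

```python
import math


def kmer_set(seq, k):
    """All substrings of length k in seq (exact set; no sketching, no reverse complement)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fit_core_accessory(ks, jaccards):
    """Least-squares line through (k, log J); returns (core, accessory).

    Assumes log J = log(1 - a) + k * log(1 - c) and that every J is > 0.
    """
    ys = [math.log(j) for j in jaccards]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    core = 1.0 - math.exp(slope)           # slope = log(1 - c)
    accessory = 1.0 - math.exp(intercept)  # intercept = log(1 - a)
    return core, accessory
```

In use, one would compute jaccard(kmer_set(x, k), kmer_set(y, k)) for each k in a range of lengths and pass the results to fit_core_accessory. Sketching replaces the full k-mer sets with a small sample of hashed k-mers, which is where the efficiency gain comes from.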
Machine learning methods (Gaussian mixture model or HDBSCAN) are used to form spatial clusters of these distances in core-accessory space, from which a 'within-strain' component is then identified (the 'model').
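The step of picking out the 'within-strain' component can be sketched as follows. This assumes that each pairwise distance has already been given a cluster label by some mixture model or HDBSCAN fit, and that the within-strain component is the one whose centroid lies closest to the origin of core-accessory space. The function name is hypothetical, and the distances are used unscaled here, whereas a real fit would normally rescale the two axes first.

```python
import math
from collections import defaultdict


def within_strain_label(distances, labels):
    """Return the label of the spatial cluster whose centroid is nearest the origin.

    distances: iterable of (core, accessory) pairs, one per genome pair.
    labels: cluster label for each pair, from a previously fitted model.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # label -> [sum core, sum accessory, count]
    for (core, acc), lab in zip(distances, labels):
        entry = sums[lab]
        entry[0] += core
        entry[1] += acc
        entry[2] += 1

    def centroid_norm(lab):
        core_sum, acc_sum, n = sums[lab]
        return math.hypot(core_sum / n, acc_sum / n)

    return min(sums, key=centroid_norm)
```

Pairs carrying the returned label are the ones treated as belonging to the same strain.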
These within-strain distances are used to connect samples in a network (the 'network'), in which the connected components become the PopPUNK clusters. The network can be further reduced in size (to 'references') by removing redundant samples, which increases the speed of future assignment without loss of accuracy.
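The network step above can be illustrated with a small union-find sketch: samples are nodes, each within-strain distance adds an edge, and every connected component becomes one cluster. The function name and the ordering of the output (largest cluster first) are choices made for this example rather than PopPUNK's own code, which uses a graph library.

```python
def connected_components(samples, edges):
    """Group samples into clusters given within-strain links.

    samples: list of sample names.
    edges: iterable of (sample_a, sample_b) pairs linked by a within-strain distance.
    Returns a list of sorted clusters, largest first.
    """
    parent = {s: s for s in samples}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))
```

Because a single within-strain link is enough to merge two groups, one spurious edge can join otherwise distinct clusters, which is why checking the model fit matters.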
For most users, if an existing species database is available we highly recommend you use it to assign queries: the model fit has already been verified, assignment will be quicker, you can use the existing nomenclature, and your input is placed in the context of the species. You can do this through your browser, a Galaxy server, or on the command line with poppunk_assign.
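For command-line assignment, a minimal sketch of preparing the inputs might look like the following. It assumes the query list is a tab-separated file of sample name and sequence file path, and that poppunk_assign accepts --db, --query, --output and --threads; check these against poppunk_assign --help for your installed version. The helper names, file names and database name are hypothetical.

```python
from pathlib import Path


def write_query_file(samples, path):
    """Write one 'name<TAB>sequence file' line per sample and return the path."""
    lines = [f"{name}\t{fasta}" for name, fasta in samples]
    Path(path).write_text("\n".join(lines) + "\n")
    return path


def assign_command(db, query_file, output, threads=1):
    """Build (but do not run) the poppunk_assign argument list.

    Flag names are assumptions to verify with `poppunk_assign --help`.
    """
    return [
        "poppunk_assign",
        "--db", str(db),
        "--query", str(query_file),
        "--output", str(output),
        "--threads", str(threads),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once PopPUNK is installed and the database has been downloaded.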
If you have a new species, want to look at the core and accessory distances in your population, or are doing something unusual, it will be best to create your own database and model fit. You will typically need to do this via the command line.
Potential benefits of using PopPUNK (compared to cgMLST or statistical models such as BAPS):
The main thing to check is that the 'within-strain' component of the model has been correctly identified. K-mer length issues should now be automatically flagged, but you may wish to add --plot-fit to check these fits are roughly straight lines. Looking at the clusters compared to a tree (possibly the NJ tree output by PopPUNK) will tell you whether clusters are too tight or too diverse. You can also plot intra-cluster distances.
Using an existing model, if available, has fewer risks. You may still wish to check the results in one of the visualisation tools, and ensure there were no warning messages (which can occur if the database, model and clusters come from different versions).
Please report issues with the code on the GitHub issue tracker. When reporting a bug, please include the command run, a description of the data, and a copy of the error message so we can do our best to fix it. For broader questions please contact us directly.
Use PopPUNK to assign Global Pneumococcal Sequence Clusters in S. pneumoniae using an interactive browser interface