PopPUNK: Population partitioning using nucleotide k-mers

Calculate core and accessory distances, cluster genomes, assign new genomes to clusters, make visualisations

Use PopPUNK

Use/install via three interfaces

Available interfaces

PopPUNK can be used from your browser, Galaxy servers or the command line.

Install/Use PopPUNK

Species databases

Three species currently available

Assign clusters

Use existing databases to assign new genomes to clusters, and place them in the context of these populations.

Browse databases

Guides

View common usage guides

Browse documentation

See how to use PopPUNK to solve common genomic epidemiology problems.

See all guides

Questions

PopPUNK is software for bacterial genomic epidemiology. It can rapidly calculate core and accessory distances between genomes, use these distances to cluster genomes, assign clusters to new genomes using an existing database, and produce visualisations of these outputs.

PopPUNK uses comparisons of sequence content at a range of k-mer (substring) lengths to fit a relationship between core and accessory distances (the 'database'). Sketching is used to reduce the number of k-mers that need to be kept, massively increasing efficiency.
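
As a rough illustration of this idea, the sketch below computes exact k-mer Jaccard similarities at several k-mer lengths and fits a straight line to log(Jaccard) against k. It assumes the log-linear relationship described in the PopPUNK publication, where the slope reflects the core distance and the intercept the accessory distance. The function names are hypothetical and this is not PopPUNK's implementation: PopPUNK works on sketches rather than full k-mer sets, and handles details such as reverse complements that are ignored here.

```python
import math

def kmer_set(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def core_accessory(seq1, seq2, ks):
    """Estimate (core, accessory) distances from a fit of log(Jaccard) on k.

    Assumed model: Jaccard(k) = (1 - accessory) * (1 - core) ** k.
    """
    points = []
    for k in ks:
        j = jaccard(kmer_set(seq1, k), kmer_set(seq2, k))
        if j > 0:  # log is undefined when no k-mers are shared
            points.append((k, math.log(j)))
    if len(points) < 2:
        raise ValueError("need shared k-mers at two or more k-mer lengths")
    slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
    return 1 - math.exp(slope), 1 - math.exp(intercept)
```

For example, `core_accessory(genome_a, genome_b, range(13, 30, 4))` would return one point in core-accessory space for that pair of sequences; the k-mer lengths in that call are arbitrary illustrative values.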

Machine learning methods (Gaussian mixture model or HDBSCAN) are used to form spatial clusters of these distances in core-accessory space, from which a 'within-strain' component is then identified (the 'model').

These within-strain distances are used to connect samples in a network (the 'network'), in which the connected components become the PopPUNK clusters. The network can be further reduced in size (to 'references') by removing redundant samples, which increases the speed of future assignment without loss of accuracy.
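
The network step can be sketched as follows: samples are joined whenever their pairwise distance is classed as within-strain, and each connected component becomes a cluster. This is a minimal sketch, not PopPUNK's code. The `is_within_strain` callback stands in for the fitted model, and the rectangular cutoff used in the usage note is a made-up placeholder rather than how the model actually draws its boundary.

```python
def cluster_components(samples, distances, is_within_strain):
    """Group samples into connected components of the within-strain network.

    samples: iterable of sample names
    distances: dict mapping (sample_a, sample_b) -> (core, accessory)
    is_within_strain: function (core, accessory) -> bool (hypothetical model)
    """
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), (core, accessory) in distances.items():
        if is_within_strain(core, accessory):
            parent[find(a)] = find(b)  # add an edge: merge the two components

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())
```

With toy distances such as `{("A", "B"): (0.001, 0.05), ("B", "C"): (0.002, 0.10), ("C", "D"): (0.020, 0.40)}` and a placeholder rule `lambda c, a: c < 0.005 and a < 0.2`, samples A, B and C fall into one cluster and D is left as a singleton. Note that A and C are clustered together through B even though no direct A-C distance is given.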

For most users, if an existing species database is available we highly recommend using it to assign queries: the model fit has already been verified, assignment is quicker, you can use the existing nomenclature, and your samples are placed in the context of the species. You can do this through your browser, a Galaxy server, or on the command line with --assign-query.

If you have a new species, want to look at the core and accessory distances in your population, or are doing something unusual, it will be best to create your own database and model fit. These tasks can be done either on the command line or on a Galaxy server.

Potential benefits of using PopPUNK (compared to cgMLST or statistical models such as BAPS):

  • Fast to run, and no alignment needed for input.
  • Clusters represent biology, not arbitrary thresholds.
  • Existing schemes can be used, with consistent nomenclature.
  • Small clusters and singletons are not binned together.
But try it yourself to find out what you do or don't like, then let us know!

The main thing to check is that the 'within-strain' component of the model has been correctly identified. K-mer length issues should now be automatically flagged, but you may wish to add --plot-fit to check these fits are roughly straight lines. Looking at the clusters compared to a tree (possibly the NJ tree output by PopPUNK) will tell you whether clusters are too tight or too diverse. You can also plot intra-cluster distances.

See troubleshooting in the docs for more details and examples.

Using an existing model, if available, has fewer risks. You may still wish to check the results within one of the visualisation tools, and ensure there were no warning messages (which can occur if the database/model/clusters are different versions).

We would generally recommend adding --microreact or another visualisation option to get more output options.

If you are refining a model, try adding --indiv-refine for more clustering options (using only core or accessory distances).

If you are having issues with QC that are not solved by removing invalid samples, try increasing --max-a-dist and/or adding --ignore-length.

Please report issues with the code on the GitHub issue tracker. For any bug, please include the command you ran, a description of the data, and a copy of the error message so we can do our best to fix it. For broader questions please contact us directly.

NEW: Try using PopPUNK interactively

Use PopPUNK to assign Global Pneumococcal Sequence Clusters in S. pneumoniae using an interactive browser interface