PopPUNK compares the sequence content of each pair of samples at a range of k-mer (substring) lengths, and fits a relationship
to these comparisons to estimate core and accessory distances (the 'database'). Sketching is used to reduce the number of k-mers that need to be kept,
which makes these comparisons much more efficient.
Machine learning methods (a Gaussian mixture model or HDBSCAN) are used to
cluster these distances in core-accessory space, and a 'within-strain' component is then
identified among the clusters (the 'model').
These within-strain distances are used to connect samples in a network (the 'network'),
in which the connected components become the PopPUNK clusters. The network can be further reduced in size (to 'references')
by removing redundant samples, which increases the speed of future assignment without loss of accuracy.
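
The step from within-strain distances to clusters can be illustrated with a minimal sketch. This is not PopPUNK's implementation: the sample names and distances are made up, and the hypothetical `is_within_strain` function uses fixed thresholds as a stand-in for the fitted Gaussian mixture or HDBSCAN model. The sketch only shows how pairs classified as within-strain become network edges, and how connected components of that network become clusters.

```python
def connected_components(samples, edges):
    """Group samples into clusters using union-find over the network edges."""
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


def is_within_strain(core, accessory):
    # ASSUMPTION: fixed thresholds stand in for the fitted model, which
    # would instead assign each distance to a mixture component or cluster.
    return core < 0.005 and accessory < 0.1


# Hypothetical pairwise (core, accessory) distances for five samples.
distances = {
    ("s1", "s2"): (0.001, 0.02),
    ("s1", "s3"): (0.002, 0.03),
    ("s2", "s3"): (0.001, 0.01),
    ("s4", "s5"): (0.002, 0.04),
    ("s1", "s4"): (0.020, 0.30),
    ("s1", "s5"): (0.021, 0.29),
    ("s2", "s4"): (0.021, 0.28),
    ("s2", "s5"): (0.022, 0.31),
    ("s3", "s4"): (0.019, 0.31),
    ("s3", "s5"): (0.020, 0.30),
}

samples = sorted({name for pair in distances for name in pair})

# Only within-strain distances become edges in the network.
edges = [pair for pair, (core, acc) in distances.items()
         if is_within_strain(core, acc)]

clusters = connected_components(samples, edges)
print(clusters)
```

Because samples only need to be linked through a chain of within-strain edges, two samples can share a cluster even if their own pairwise distance was not classified as within-strain.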