
Supporting data for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets"

Dataset type: Software
Data released on October 14, 2016

Narechania A; Baker R; DeSalle R; Mathema B; Kolokotronis S; Kreiswirth B; Planet PJ (2016): Supporting data for "Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets" GigaScience Database. https://doi.org/10.5524/100247

DOI: 10.5524/100247

Collective animal behavior, such as the flocking of birds or the shoaling of fish, has inspired a class of algorithms designed to optimize distance-based clusters in various applications including document analysis and DNA microarrays. In the flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages, including the potential to decrease the number of required comparisons.
In Clusterflock, we implement a flocking algorithm designed to find groups (flocks) of orthologous gene families (OGFs) that share a common evolutionary history. Pairwise distances that measure the phylogenetic incongruence between OGFs guide flock formation. We test this approach on several simulated datasets varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also demonstrate its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signal, we can pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold.
Clusterflock is an open source tool that can be used to discover horizontally transferred genes, recombined areas of chromosomes, and the phylogenetic “core” of a genome. Though we use it in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval and use these distances to “flock” any type of data.
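The flocking idea described above can be illustrated with a minimal Python sketch. This is not Clusterflock's implementation; the actual rules are in the publication and the GitHub repository. The function name `flock`, its parameters, and the choice to attract neighbours whose distance is below 0.5 and repel those above it are all assumptions made for illustration. The sketch only shows how pairwise distances on the unit interval can drive agents that react solely to neighbours within a local radius.

```python
import random


def flock(dist, n_iter=200, radius=0.3, step=0.1, jitter=0.01, seed=0):
    """Toy flocking on the unit square (illustrative, not Clusterflock itself).

    dist[i][j] is a pairwise distance in [0, 1]. Each agent looks only at
    neighbours within `radius`; it moves toward those with a small distance
    and away from those with a large distance.
    """
    rng = random.Random(seed)
    n = len(dist)
    # Agents start at random positions, so no cluster seeds are supplied.
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    for _ in range(n_iter):
        new_pos = []
        for i, (xi, yi) in enumerate(pos):
            dx = dy = 0.0
            neighbours = 0
            for j, (xj, yj) in enumerate(pos):
                if i == j:
                    continue
                vx, vy = xj - xi, yj - yi
                if (vx * vx + vy * vy) ** 0.5 > radius:
                    continue  # outside the agent's immediate environment
                # Assumed rule: weight is +1 at distance 0 and -1 at distance 1.
                weight = 1.0 - 2.0 * dist[i][j]
                dx += weight * vx
                dy += weight * vy
                neighbours += 1
            if neighbours:
                dx, dy = dx / neighbours, dy / neighbours
            # A small random walk lets isolated agents eventually meet others.
            x = xi + step * dx + rng.uniform(-jitter, jitter)
            y = yi + step * dy + rng.uniform(-jitter, jitter)
            # Keep every agent inside the unit square.
            new_pos.append((min(1.0, max(0.0, x)), min(1.0, max(0.0, y))))
        pos = new_pos  # synchronous update: all agents move at once
    return pos
```

Calling `flock(matrix)` with a square matrix of distances returns one `(x, y)` position per item. Groups would then be read off the final positions with a spatial clustering step, which this sketch leaves out.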

Additional details

Read the peer-reviewed publication(s):

  • Narechania, A., Baker, R., DeSalle, R., Mathema, B., Kolokotronis, S.-O., Kreiswirth, B., & Planet, P. J. (2016). Clusterflock: a flocking algorithm for isolating congruent phylogenomic datasets. GigaScience, 5(1). https://doi.org/10.1186/s13742-016-0152-3

Additional information:

https://github.com/narechan/clusterflock

https://hub.docker.com/r/narechan/clusterflock-0.1/

https://youtu.be/ELZTVOiqKn8


| Description | Data Type | File Format | Size | Release Date | File Attributes |
| --- | --- | --- | --- | --- | --- |
| | Readme | TEXT | 1.86 kB | 2016-10-10 | MD5 checksum: 8654176624d3c4d0f28b6859ab0bd479 |
| GitHub archival copy, downloaded 30-09-2016. See the GitHub repository for the most recent updates: https://github.com/narechan/clusterflock | GitHub archive | archive | 17.86 MB | 2016-10-10 | MD5 checksum: 9fda1ccf7a02b4316de628621d5c6101 |
| This archive contains sequences, LD matrices, and R scripts used to construct the simulation curves in Figures 3 and 4, and the S. aureus data used to test Clusterflock's performance on a known large-scale hybridization event. | Mixed archive | TAR | 247.38 MB | 2016-10-10 | MD5 checksum: 4e1e73b2297b8d9b5ce6ba1ca2312712 |
| Auto-detected flocks per frame. The video shows the average number of flocks detected at each point along a 1000-frame S. aureus simulation. The OPTICS spatial clustering algorithm was used to auto-detect flocks in the 100 replicate frames at each point along the simulation. Seed clusters form very early and later move to intercept one another. Congruent flocks absorb one another, while incongruent flocks repel. | Video | UNKNOWN | 42.29 MB | 2016-10-10 | MD5 checksum: 04a6b61cd0b0f77dac0c94a521deb7e2 |
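The MD5 checksums listed for each file can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the path passed in is whatever local name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the value listed for it on this page."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum(local_path, "4e1e73b2297b8d9b5ce6ba1ca2312712")` should return `True` for an uncorrupted copy of the mixed TAR archive, where `local_path` is a placeholder for the saved file.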
| Date | Action |
| --- | --- |
| October 14, 2016 | Dataset publish |
| October 25, 2016 | External Link updated: https://youtu.be/ELZTVOiqKn8 |
| October 25, 2016 | Manuscript Link added: 10.1186/s13742-016-0152-3 |