Learn how to upload and analyze your own data here.
phylogenize is a web tool that allows users to link
microbial genes to environments, accounting for phylogeny.
More specifically, given community composition data,
phylogenize links genes in microbial genomes to either microbial
prevalence in, or specificity for, a given environment, while also
taking into account an important potential confounder: the phylogenetic
relationships between microbes. The method is described fully in Bradley,
Nayfach, and Pollard (2018).
To use phylogenize, you need to upload the following
You also need to select the following:
Here is a visual explainer of the phylogenize user interface:
Example of form
phylogenize requires taxonomic abundances. These can be derived
from either shotgun or 16S data, and can be read
counts, relative abundances, or presence-absence data.
More specifically, your data should be provided as a matrix where:
Shotgun data should be processed with
MIDAS, with database
version either v1.2 (recommended) or v1.0 (deprecated). You do not need
to run the full MIDAS pipeline; you will just need to do metagenomic
species profiling as described
The output file you need should be called species_profile.txt
16S data should be processed using a denoising
algorithm. Two common options are Deblur and DADA2. Both of these
algorithms will result in abundances for "amplicon sequence
variants" (ASVs), which correspond to the distinct DNA
sequences that are obtained by denoising.
phylogenize works by mapping these ASVs back to MIDAS genome
clusters using the fast aligner BURST. The default
threshold for matching a genome cluster is 98.5% sequence identity,
which we found gives a reasonable trade-off between sensitivity and
specificity. Ambiguous matches are discarded, and reads/abundances
mapping to the same genome cluster are summed together. (In
practice, though, we have never actually seen ambiguous
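The collapsing step described above (keep matches at or above the identity threshold, discard ambiguous ones, sum what remains per genome cluster) can be illustrated with a minimal sketch. This is not phylogenize's own code: the function name, the input dictionaries, and the definition of "ambiguous" (more than one cluster passing the threshold) are assumptions made for illustration, and the hit list is a stand-in for aligner output rather than BURST's real format.

```python
from collections import defaultdict

IDENTITY_THRESHOLD = 98.5  # percent identity; the default quoted in the text


def collapse_asvs(asv_counts, asv_hits, threshold=IDENTITY_THRESHOLD):
    """Sum ASV abundances per genome cluster (hypothetical helper).

    asv_counts: {asv_id: {sample_id: count}}
    asv_hits:   {asv_id: [(cluster_id, percent_identity), ...]}
    """
    cluster_counts = defaultdict(lambda: defaultdict(float))
    for asv, per_sample in asv_counts.items():
        # Clusters this ASV matches at or above the threshold.
        passing = {c for c, ident in asv_hits.get(asv, []) if ident >= threshold}
        if len(passing) != 1:
            continue  # no match, or ambiguous (assumed definition): discard
        (cluster,) = passing
        for sample, count in per_sample.items():
            cluster_counts[cluster][sample] += count
    return {c: dict(s) for c, s in cluster_counts.items()}


# Made-up example data.
asv_counts = {
    "asv1": {"S1": 10, "S2": 0},
    "asv2": {"S1": 5, "S2": 3},
    "asv3": {"S1": 7, "S2": 7},
    "asv4": {"S1": 1, "S2": 1},
}
asv_hits = {
    "asv1": [("cluster_A", 99.2)],
    "asv2": [("cluster_A", 98.5)],
    "asv3": [("cluster_A", 99.0), ("cluster_B", 99.0)],  # ambiguous
    "asv4": [("cluster_B", 97.0)],  # below threshold
}
collapsed = collapse_asvs(asv_counts, asv_hits)
print(collapsed)
```

Here only asv1 and asv2 contribute, so cluster_A ends up with their summed counts in each sample.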
Your metadata should be provided in the form of a matrix where:
Your data and metadata should be in either tabular
or BIOM format. Tabular format is simpler, but
takes up more space. The maximum upload size is 35 MB, so if your data
are larger, you will need to convert them to BIOM format (see below)
or, alternatively, to run phylogenize locally.
You can upload either:
Tabular data and metadata files should look like this:
Top corner of data
Top corner of metadata
Note that the metadata table needs to have a column labeled sample
(matching the columns of your data table).
The table should also have a column
giving the environment of each sample (by default,
phylogenize looks for a column named "env", but
you can change this in the "Environment column" field).
If your data represent samples from different datasets, batches, or
studies, one column should also say which dataset, batch, or study
each sample is taken from (by default: "dataset").
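Before uploading, it can help to check a tabular metadata file against the requirements above. The sketch below is only an illustration: check_metadata is a hypothetical helper, not part of phylogenize, and it assumes tab-delimited text with a header row. The example values are made up.

```python
import csv
import io


def check_metadata(text, data_samples, env_col="env", dataset_col=None):
    """Return a list of problems found in tab-delimited metadata text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    problems = []
    # "sample" and the environment column are always needed; the dataset
    # column only matters when samples come from several datasets.
    required = ["sample", env_col] + ([dataset_col] if dataset_col else [])
    for col in required:
        if col not in columns:
            problems.append(f"missing column: {col}")
    if "sample" in columns:
        described = {row["sample"] for row in rows}
        for s in sorted(set(data_samples) - described):
            problems.append(f"no metadata for sample: {s}")
    return problems


metadata = "sample\tenv\tdataset\nS1\tStool\tHMP\nS2\tSkin\tHMP\n"
print(check_metadata(metadata, ["S1", "S2"], dataset_col="dataset"))
print(check_metadata(metadata, ["S1", "S3"], env_col="site"))
```

The first call reports no problems; the second reports the missing "site" column and the undescribed sample S3.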
BIOM format is more
compact than tabular format and is better suited to larger datasets.
This is because (1) BIOM represents matrices that are mostly zeroes
(i.e., "sparse") more efficiently, and most taxonomic
matrices are sparse, and (2) BIOM files can use a binary
representation (HDF5) that takes up less space than plain text.
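To see why a sparse representation saves space, the following sketch stores only the nonzero cells of a small made-up count matrix. It illustrates the general idea only; it does not reproduce BIOM's actual on-disk layout, and to_sparse is a hypothetical helper.

```python
def to_sparse(matrix):
    """Return {(row, col): value} for the nonzero entries of a list of lists."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


# Made-up taxa-by-samples counts, mostly zeroes.
counts = [
    [0, 0, 12, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
]
sparse = to_sparse(counts)
n_cells = sum(len(row) for row in counts)
print(f"{len(sparse)} nonzero entries out of {n_cells} cells")
```

The dense table has to write out all 16 cells, while the sparse form keeps only the 3 that carry information.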
A BIOM file can contain multiple tables, so you will only need to
upload one file (though if you prefer, you can upload a separate
metadata file as described above). This one file needs to have the
The sample metadata matrix should include a column with environment
annotations (by default called env) and, if there are
multiple datasets, dataset annotations (by default called
dataset). Whatever these columns are named, they should
match the names you provide in the "Environment column"
and "Dataset column" fields respectively, just as with
To convert your tabular data into a single BIOM file,
first install the biom utility. Then run the conversion as
follows (adapted from the biom-format.org documentation):
biom convert -i your_data.tab -o your_data_and_metadata.biom --to-hdf5 --table-type="OTU table" --sample-metadata-fp metadata.tab
Replace your_data.tab and metadata.tab with
the names of your tab-delimited data files and, optionally,
your_data_and_metadata with the name of your dataset.
Important note: before doing this step, edit your metadata
file and rename the "sample" column to
"#SampleID", otherwise the metadata won't actually be
added (!) and phylogenize will throw a "metadata not
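The header rename described in this note can be done in any text editor. As a minimal sketch, it can also be scripted; rename_sample_header below is a hypothetical helper that assumes tab-delimited text with Unix line endings and a header on the first line.

```python
def rename_sample_header(text, old="sample", new="#SampleID"):
    """Rename one column in the header line of tab-delimited text."""
    header, sep, body = text.partition("\n")
    fields = [new if field == old else field for field in header.split("\t")]
    return "\t".join(fields) + sep + body


metadata = "sample\tenv\nS1\tStool\nS2\tSkin\n"
fixed = rename_sample_header(metadata)
print(fixed)

# To apply this to a file (here the metadata.tab used in the command above):
#   from pathlib import Path
#   path = Path("metadata.tab")
#   path.write_text(rename_sample_header(path.read_text()))
```

Only the header line is changed; the sample rows are passed through untouched.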
For more help using the biom utility, try looking at the
"Converting between file formats" and
"Adding sample and observation metadata to biom
phylogenize can associate genes with either microbial
prevalence or microbial specificity for a given
environment. Prevalence gives how often a microbe is found in an
environment. Specificity compares this prevalence to prevalences in
all other provided environments. The following toy example gives an
intuition for what this means:
Cartoon of prevalence and specificity calculations
Microbe 1 is detected (black boxes) often in samples from both
healthy (A-G) and sick (H-N) individuals. Microbe 2 is detected more
frequently in healthy than sick samples, and microbe 3 is detected
relatively seldom in either. This means that microbes 1 and 2 will
have high prevalence in the "healthy"
environment, while microbe 3 will have low prevalence.
However, microbe 2 is the only microbe to have high
specificity for "healthy" over "sick,"
because of the difference in prevalence between these
Prevalence will tend to capture both cosmopolitan microbes and ones
that are particularly well-adapted to a given environment.
Specificity is useful for contrasting two sites (like gut vs. skin)
or states (healthy vs. sick).
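The intuition from the toy example can be reproduced with a few lines of arithmetic. The presence/absence patterns below are made up to match the description (seven healthy and seven sick samples), and the difference in prevalence is used only as a crude stand-in for specificity; it is not the estimator phylogenize actually computes.

```python
def prevalence(detected):
    """Fraction of samples in which a microbe was detected (1) vs. not (0)."""
    return sum(detected) / len(detected)


# Made-up detections: healthy = samples A-G, sick = samples H-N.
presence = {
    "microbe 1": {"healthy": [1, 1, 1, 0, 1, 1, 1], "sick": [1, 1, 0, 1, 1, 1, 1]},
    "microbe 2": {"healthy": [1, 1, 1, 1, 0, 1, 1], "sick": [0, 0, 1, 0, 0, 0, 0]},
    "microbe 3": {"healthy": [0, 1, 0, 0, 0, 0, 0], "sick": [0, 0, 0, 0, 1, 0, 0]},
}

summary = {}
for microbe, by_env in presence.items():
    p_healthy = prevalence(by_env["healthy"])
    p_sick = prevalence(by_env["sick"])
    # Naive contrast: prevalence in "healthy" minus prevalence in "sick".
    summary[microbe] = (p_healthy, p_sick, p_healthy - p_sick)

for microbe, (p_healthy, p_sick, contrast) in summary.items():
    print(f"{microbe}: healthy={p_healthy:.2f} sick={p_sick:.2f} contrast={contrast:+.2f}")
```

Microbes 1 and 2 both have high prevalence in "healthy", but only microbe 2 shows a large contrast between the two states.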
The exact details of how prevalence and specificity are calculated
can be read in our
Basically, we perform additional steps to re-weight datasets equally
(for prevalence), to make sure that small numbers of observations
don't skew the results (for specificity), and to transform these
quantities into something approximately normally-distributed
Note: with low numbers of samples, it is possible that no microbes
will be specific to a given environment, in which case
phylogenize will return an error because there is not enough
signal to compute specificity.
Note: while you do not have to rarefy your data (i.e., resample to
the same read depth), you should make sure that read depth does not
systematically differ across environments. If it does, then
rarefying may be a good idea because some microbes will simply
appear more by chance in deeper-sequenced data, and if read depth is
correlated with environment, you may get spurious associations.
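One simple way to check for this is to compare typical read depth per environment before uploading. The sketch below assumes read-count data held in plain dictionaries; depth_by_environment is a hypothetical helper and the numbers are made up.

```python
from statistics import median


def depth_by_environment(counts, sample_env):
    """Median read depth per environment.

    counts:     {sample_id: {taxon: reads}}
    sample_env: {sample_id: environment}
    """
    depths = {}
    for sample, taxa in counts.items():
        depths.setdefault(sample_env[sample], []).append(sum(taxa.values()))
    return {env: median(values) for env, values in depths.items()}


counts = {
    "S1": {"taxon_a": 900, "taxon_b": 100},
    "S2": {"taxon_a": 700, "taxon_b": 500},
    "S3": {"taxon_a": 8000, "taxon_b": 1000},
    "S4": {"taxon_a": 6000, "taxon_b": 5000},
}
sample_env = {"S1": "Skin", "S2": "Skin", "S3": "Stool", "S4": "Stool"}
depths = depth_by_environment(counts, sample_env)
print(depths)
```

In this made-up case the "Stool" samples are sequenced roughly ten times deeper than the "Skin" samples, which is the kind of systematic difference where rarefying may be worth considering.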
Note: in order to be considered, each environment and/or
dataset must have at least two samples associated with it. This is
because it doesn't really make sense to calculate prevalence based on a
single presence/absence value (and because specificity, in turn,
depends partly on calculating prevalence in each environment).
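A quick count of samples per environment and per dataset shows which groups meet this minimum. The sketch below is illustrative only: usable_groups is a hypothetical helper, and the metadata rows are made up.

```python
from collections import Counter


def usable_groups(metadata, column, min_samples=2):
    """Return the values of `column` that occur in at least min_samples rows."""
    counts = Counter(row[column] for row in metadata)
    return {group for group, n in counts.items() if n >= min_samples}


metadata = [
    {"sample": "S1", "env": "Stool", "dataset": "study1"},
    {"sample": "S2", "env": "Stool", "dataset": "study1"},
    {"sample": "S3", "env": "Skin", "dataset": "study1"},
    {"sample": "S4", "env": "Skin", "dataset": "study2"},
    {"sample": "S5", "env": "Soil", "dataset": "study1"},
]
envs = usable_groups(metadata, "env")
datasets = usable_groups(metadata, "dataset")
print(sorted(envs), sorted(datasets))
```

Here "Soil" and "study2" each have a single sample, so they fall below the two-sample minimum.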
Simply enter the name of the environment for which you are interested
in calculating a phenotype. This should match an entry in the
env column of the sample metadata (or the other column you
named in the "Environment column" field) exactly. At least
two samples per environment are necessary.
"Data type" is relatively self-explanatory: is your data
16S amplicon sequencing data, or shotgun metagenomics data processed
For most people, the MIDAS database version will be the latest, v1.2.
We also provide the option to use a previous version of the
database, since species are named differently in MIDAS v1.2 and
v1.0, but for most people the default (1.2) will be appropriate.
You can see what a sample results page looks like
Your results will be delivered at a URL that should look like:
www.phylogenize.org/results/<result_id> with a
long string replacing <result_id>. This is where
you should eventually be able to pick up your results. We do not
make the result IDs searchable, so it is a good idea to bookmark or
otherwise note this URL.
Anyone with this URL will be able to see your results, so only share
it with members of your team and collaborators. (Also, even though we
don't make these URLs publicly searchable, you still should not upload
any data with personally-identifiable information, sensitive personal
information, or any other information that needs to remain 100%
The results have two components. The first is an HTML report
summarizing and visualizing the data. The second is a .tgz file
(compressed archive) that should contain the full set of results,
including the following files:
To make phylogenize available for other users, results will
be deleted approximately 7 days after completion, so it is a good
idea to save the report and results to disk before then.
If you would like to take phylogenize for a test drive, we
have provided a BIOM dataset containing 16S data
and metadata from the Human Microbiome Project, re-analyzed with
DADA2. (The data are read counts, merged by individual. The various
sequenced regions of the 16S gene are provided in a single file.
Reads corresponding to different samples from the same individual
have been summed.) Note: this is a large dataset, so it may
take a long time to run (on the order of an hour).
You will have to provide an environment to calculate either
specificity or prevalence. The various environments represented in
this dataset are:
A sample report generated on these data, containing the results for
associating genes with specificity for the "Stool" environment, can
be found here.
In our preprint,
we use data from the Earth Microbiome Project together with
phylogenize to identify microbial genes associated with the
plant rhizosphere. The full EMP data are very large, so we only
provide the report file
Here are a few areas where you may run into trouble:
If there's still a problem, e-mail the webmaster
(see About/Contact in the navigation bar). Add
your result ID (the string of characters after
"results" in the URL) and attach your
input files and the output that appears when you
click "monitor warnings/errors".
We also encourage you to report a bug in phylogenize using
Besides being available as a web tool, phylogenize can also be run on a
laptop, desktop, or server. In particular, if you are getting an
"out of memory" error, you have a lot of jobs, or you have
a more specific use case that is not covered by the server, we
recommend that you download phylogenize from our code
The core of phylogenize is an R package, with a web app written
in Python using Flask. The web app provides a user interface and a
scheduling system (using Beanstalk). phylogenize can also
be run in QIIME2 using the plugin
This means that there are three ways to run phylogenize
Instructions on how to use phylogenize as an R package or
through a local web server are found on its Bitbucket site.
Instructions on how to use phylogenize with QIIME2 can be
found on the
q2-phylogenize Bitbucket repository.
Bradley PH and Pollard KS, "phylogenize: a web tool to identify
microbial genes underlying environment associations." In review.
Bradley PH, Nayfach S, and Pollard KS, "Phylogeny-corrected
identification of microbial gene families relevant to human gut
colonization." PLOS Comput Biol 14 (8), e1006242 (2018).