What is phylogenize?
phylogenize is a web tool that allows users to link
microbial genes to environments, accounting for phylogeny.
More specifically, given community composition data,
phylogenize links genes in microbial genomes to either microbial
prevalence in, or specificity for, a given environment, while also
taking into account an important potential confounder: the phylogenetic
relationships between microbes. The method is described fully in Bradley,
Nayfach, and Pollard (2018).
Using phylogenize
Overview
To use phylogenize, you need to upload the following
things:
- Community composition data, either from 16S or
  shotgun sequencing data.
- Sample metadata, which maps sample IDs in your
  community composition data to environments.
You also need to select the following:
- Whether to link genes to prevalence or
  specificity, which are defined as follows:
  - prevalence: how frequently is a microbe observed in a
    given environment?
  - specificity: how specific is a microbe for a
    given environment compared to all others?
  - Note: to calculate specificity, there must be more than
    one environment present.
- A choice of environment: prevalence and
  specificity are defined with respect to a given environment, like
  "soil" or "stool" or "marine."
Here is a visual explainer of the phylogenize user interface:
Data
phylogenize requires taxonomic abundances. These can either
be derived from shotgun data or 16S data, and can be read
counts, relative abundances, or presence-absence data.
More specifically, your data should be provided as a matrix where:
- each row is a species ID (shotgun data) or a sequence variant (16S data)
- each column is a sample ID of your choosing
- each entry is any of the following:
  - a number of reads
  - a relative abundance
  - 0 for absence and 1 for presence
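To make the three accepted entry types concrete, here is a minimal sketch that starts from a toy read-count matrix and derives the corresponding relative-abundance and presence-absence matrices. The taxon and sample names are invented for illustration, and the helper functions are not part of phylogenize.

```python
# Toy abundance matrix: rows are taxa, columns are samples.
# All names and numbers are made up for illustration.
samples = ["sampleA", "sampleB", "sampleC"]
counts = {
    "taxon_1": [10, 0, 5],
    "taxon_2": [0, 0, 20],
    "taxon_3": [30, 8, 0],
}


def to_relative_abundance(table):
    """Divide each entry by the total of its sample (column)."""
    n = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n)]
    return {
        taxon: [row[j] / totals[j] if totals[j] else 0.0 for j in range(n)]
        for taxon, row in table.items()
    }


def to_presence_absence(table):
    """Record 1 where a taxon was observed at all, 0 otherwise."""
    return {taxon: [1 if x > 0 else 0 for x in row] for taxon, row in table.items()}


relative = to_relative_abundance(counts)
presence = to_presence_absence(counts)
```

Any one of `counts`, `relative`, or `presence` has the shape described above: one row per taxon, one column per sample.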
Shotgun data
Shotgun data should be processed with
MIDAS, with database
version either v1.2 (recommended) or v1.0 (deprecated). You do not need
to run the full MIDAS pipeline; you will just need to do metagenomic
species profiling as described
here.
The output file you need should be called species_profile.txt.
16S data
16S data should be processed using a denoising
algorithm. Two common options are Deblur and DADA2. Both of these
algorithms will result in abundances for "amplicon sequence
variants" (ASVs), which correspond to the distinct DNA
sequences that are obtained by denoising.
phylogenize works by mapping these ASVs back to MIDAS genome
clusters using the fast aligner BURST. The default
threshold for matching a genome cluster is 98.5% sequence identity,
since we found that this gives a reasonable trade-off between
sensitivity and specificity. Ambiguous matches are discarded, and
reads/abundances mapping to the same genome cluster are summed
together. (In practice, though, we have never actually seen ambiguous
matches.)
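The collapsing step described above can be sketched as follows. This is only an illustration of the logic, not phylogenize's or BURST's actual code: the hit table, the cluster IDs, and the rule that "ambiguous" means more than one genome cluster passing the threshold are all assumptions made for the example.

```python
from collections import defaultdict

IDENTITY_THRESHOLD = 98.5  # percent identity; the default quoted above

# Hypothetical alignment hits: ASV -> list of (genome cluster, percent identity).
hits = {
    "asv_1": [("cluster_100", 99.2)],
    "asv_2": [("cluster_100", 98.7)],
    "asv_3": [("cluster_200", 97.0)],                         # below threshold
    "asv_4": [("cluster_200", 99.0), ("cluster_300", 99.0)],  # ambiguous
}
# Hypothetical ASV read counts in two samples.
asv_counts = {"asv_1": [5, 0], "asv_2": [3, 2], "asv_3": [7, 7], "asv_4": [1, 4]}


def collapse_to_clusters(hits, asv_counts, threshold=IDENTITY_THRESHOLD):
    """Sum ASV abundances per genome cluster, dropping unmatched or ambiguous ASVs."""
    n_samples = len(next(iter(asv_counts.values())))
    totals = defaultdict(lambda: [0] * n_samples)
    for asv, counts in asv_counts.items():
        passing = {c for c, identity in hits.get(asv, []) if identity >= threshold}
        if len(passing) != 1:
            continue  # no match, or (by our assumed definition) an ambiguous one
        (cluster,) = passing
        for j, value in enumerate(counts):
            totals[cluster][j] += value
    return dict(totals)


cluster_counts = collapse_to_clusters(hits, asv_counts)
```

In this toy input, only `asv_1` and `asv_2` survive, and their counts are added together under `cluster_100`.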
Your metadata should be provided in the form of a matrix where:
- each row corresponds to a sample in your data;
- there are at least the following two columns:
  - a column containing the sample IDs, matching the sample IDs
    in your data (named sample for tabular data and #SampleID
    for BIOM data);
  - a column containing the environment labels corresponding to
    each sample (by default called env, but you can specify a
    different column name);
- and, optionally:
  - a column containing the batch or dataset to which each sample
    belongs. This is used for adjusting prevalence estimates across
    studies with unequal numbers of samples (by default called
    dataset, but you can specify a different column name).
  - If you only have one dataset, you don't need to
    specify a dataset column. Just select 1 under
    "Number of datasets".
  - Note: right now, you can only calculate
    specificity if there is just one dataset.
Your data and metadata should be in either tabular
or BIOM format. Tabular format is simpler, but
takes up more space. The maximum file upload size is 35 MB, so if your
data are larger, you will need to either convert them to BIOM format
(see below) or run phylogenize locally.
You can upload either:
- two tab-delimited files (one data file and one metadata
  file);
- one BIOM file (containing data and metadata);
- or one tab-delimited metadata file and one BIOM-formatted
  abundance file.
Tabular format data
Tabular data and metadata files should look like this:
Note that the metadata table needs to have a column labeled sample
whose entries match the column names of your data table.
The table should also have a column
giving the environment of each sample (by default,
phylogenize looks for a column named "env", but
you can change this in the "Environment column" field).
If your data represent samples from different datasets, batches, or
studies, one column should also say which dataset, batch, or study
each sample is taken from (by default: "dataset").
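Before uploading, it can help to check these requirements yourself. The sketch below is one hypothetical way to do so with the Python standard library. The file contents are invented, the `check_tables` helper is not part of phylogenize, and the assumption that the first column of the data table holds taxon IDs (with sample IDs in the remaining header fields) is ours.

```python
import csv
import io

# Made-up example tables; real ones would be read from your own files.
data_text = "taxon\ts1\ts2\ts3\ts4\ntaxon_1\t10\t0\t5\t2\ntaxon_2\t0\t3\t0\t7\n"
metadata_text = (
    "sample\tenv\tdataset\n"
    "s1\tstool\tstudyA\n"
    "s2\tstool\tstudyA\n"
    "s3\tsaliva\tstudyA\n"
    "s4\tsaliva\tstudyA\n"
)


def check_tables(data_text, metadata_text, env_column="env", dataset_column=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    data_header = next(csv.reader(io.StringIO(data_text), delimiter="\t"))
    data_samples = set(data_header[1:])  # assumes the first column holds taxon IDs
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    columns = reader.fieldnames or []
    required = ["sample", env_column] + ([dataset_column] if dataset_column else [])
    for name in required:
        if name not in columns:
            problems.append(f"metadata is missing column '{name}'")
    if "sample" in columns:
        meta_samples = {row["sample"] for row in reader}
        missing = sorted(data_samples - meta_samples)
        if missing:
            problems.append(f"samples without metadata: {missing}")
    return problems


problems = check_tables(data_text, metadata_text, dataset_column="dataset")
```

An empty `problems` list means the column names and sample IDs line up. Pass a different `env_column` or `dataset_column` if you use non-default names.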
BIOM format data
BIOM format data is more
compact than tabular data and is more suitable for larger datasets.
This is because (1) BIOM represents matrices that are mostly zeroes
(i.e., "sparse") more efficiently, and most taxonomic
matrices are sparse, and (2) BIOM files can use a binary
representation (HDF5) that takes up less space than plain text.
A BIOM file can contain multiple tables, so you will only need to
upload one file (though if you prefer, you can upload a separate
metadata file as described above). This one file needs to have the
following tables:
- Observation matrix or "OTU table". This is where the community
  composition data will go. Despite the name, rows should actually
  be either MIDAS species IDs or amplicon sequence
  variants (ASVs), not OTUs.
- Sample metadata matrix (unless uploading a
  separate tabular file).
The sample metadata matrix should include a column with environment
annotations (by default called env) and, if there are
multiple datasets, dataset annotations (by default called
dataset). Whatever these columns are named, they should
match the names you provide in the "Environment column"
and "Dataset column" fields respectively, just as with
tabular data.
To convert your tabular data into a single BIOM file,
install the biom utility. Then run the conversion as
follows (adapted from the biom-format.org documentation):
biom convert -i your_data.tab -o your_data_and_metadata.biom --to-hdf5 --table-type="OTU table" --sample-metadata-fp metadata.tab
Replace your_data.tab and metadata.tab with
the names of your tab-delimited data and metadata files and,
optionally, your_data_and_metadata with the name of your dataset.
Important note: before doing this step, edit your metadata
file and rename the "sample" column to
"#SampleID", otherwise the metadata won't actually be
added (!) and phylogenize will throw a "metadata not
found" error.
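The renaming step can be done in any text editor. As one hypothetical alternative, the short sketch below rewrites only the header line of a tab-delimited metadata table. The function name and the example table are invented for illustration.

```python
def rename_sample_column(metadata_text):
    """Rename the 'sample' header field to '#SampleID'; data rows are untouched."""
    header, sep, rest = metadata_text.partition("\n")
    fields = ["#SampleID" if f == "sample" else f for f in header.split("\t")]
    return "\t".join(fields) + sep + rest


# Made-up metadata table used only to demonstrate the renaming.
original = "sample\tenv\ns1\tstool\ns2\tsaliva\n"
converted = rename_sample_column(original)
```

To apply this to a real file, read its contents into a string, pass the string through `rename_sample_column`, and write the result to a new file that you then hand to `--sample-metadata-fp`.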
For more help using the biom utility, try looking at the
pages entitled "Converting between file formats" and
"Adding sample and observation metadata to biom files".
Choosing a phenotype
phylogenize can associate genes with either microbial
prevalence or microbial specificity for a given
environment. Prevalence gives how often a microbe is found in an
environment. Specificity compares this prevalence to prevalences in
all other provided environments. The following toy example gives an
intuition for what this means:
Microbe 1 is detected (black boxes) often in samples from both
healthy (A-G) and sick (H-N) individuals. Microbe 2 is detected more
frequently in healthy than sick samples, and microbe 3 is detected
relatively seldom in either. This means that microbes 1 and 2 will
have high prevalence in the "healthy"
environment, while microbe 3 will have low prevalence.
However, microbe 2 is the only microbe to have high
specificity for "healthy" over "sick,"
because of the difference in prevalence between these
environments.
Prevalence will tend to capture both cosmopolitan microbes and ones
that are particularly well-adapted to a given environment.
Specificity is useful for contrasting two sites (like gut vs. skin)
or states (healthy vs. sick).
The exact details of how prevalence and specificity are calculated
can be found in our paper.
Briefly, we perform additional steps to re-weight datasets equally
(for prevalence), to make sure that small numbers of observations
don't skew the results (for specificity), and to transform these
quantities into something approximately normally distributed
(both).
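Leaving those additional steps aside, the intuition behind the two phenotypes can be sketched with raw detection frequencies. The detection patterns below are invented in the spirit of the toy example. The simple difference in prevalence is only a stand-in for specificity, not the estimator phylogenize actually uses.

```python
healthy = list("ABCDEFG")
sick = list("HIJKLMN")

# Invented detection patterns (1 = detected), ordered as healthy then sick samples.
detected = {
    "microbe_1": dict(zip(healthy + sick, [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1])),
    "microbe_2": dict(zip(healthy + sick, [1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0])),
    "microbe_3": dict(zip(healthy + sick, [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])),
}


def raw_prevalence(microbe, samples):
    """Fraction of the given samples in which the microbe was detected."""
    return sum(detected[microbe][s] for s in samples) / len(samples)


for microbe in detected:
    p_healthy = raw_prevalence(microbe, healthy)
    p_sick = raw_prevalence(microbe, sick)
    # A large positive difference is a crude proxy for specificity to "healthy".
    print(f"{microbe}: healthy={p_healthy:.2f} sick={p_sick:.2f} "
          f"difference={p_healthy - p_sick:+.2f}")
```

With these made-up patterns, microbes 1 and 2 are both frequent in healthy samples, but only microbe 2 shows a large gap between healthy and sick.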
Note: it is possible with low numbers of samples that no microbes
will be specific to a given environment, in which case
phylogenize will return an error because there is not enough
signal to compute specificity.
Note: while you do not have to rarefy your data (i.e., resample to
the same read depth), you should make sure that read depth does not
systematically differ across environments. If it does, then
rarefying may be a good idea because some microbes will simply
appear more by chance in deeper-sequenced data, and if read depth is
correlated with environment, you may get spurious associations.
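One rough way to look for such a systematic difference is to compare total reads per sample across environments. The sketch below does this for a made-up count table. The names, numbers, and the choice of the median as a summary are all illustrative assumptions.

```python
from statistics import median

# Made-up read counts: rows are taxa, columns follow the order of `samples`.
samples = ["s1", "s2", "s3", "s4"]
counts = {"taxon_1": [100, 120, 900, 1100], "taxon_2": [50, 30, 600, 400]}
environment = {"s1": "skin", "s2": "skin", "s3": "stool", "s4": "stool"}


def depth_by_environment(samples, counts, environment):
    """Median total reads per sample, grouped by environment label."""
    depth = {s: sum(row[j] for row in counts.values()) for j, s in enumerate(samples)}
    grouped = {}
    for s, d in depth.items():
        grouped.setdefault(environment[s], []).append(d)
    return {env: median(ds) for env, ds in grouped.items()}


depths = depth_by_environment(samples, counts, environment)
```

Here the "stool" samples are sequenced about ten times deeper than the "skin" samples, which is the kind of imbalance where rarefying may be worth considering.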
Note: in order to be considered, each environment and/or
dataset must have at least two samples associated with it. This is
because it doesn't really make sense to calculate prevalence based on a
single presence/absence value (and because specificity, in turn,
depends partly on calculating prevalence in each environment).
Choosing an environment
Simply enter the name of the environment for which you are interested
in calculating a phenotype. This should match an entry in the
env column of the sample metadata (or the other column you
named in the "Environment column" field) exactly. At least
two samples per environment are necessary.
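Because the match must be exact, a stray capital letter or space is an easy mistake to make. The following sketch shows one hypothetical pre-flight check. The `check_environment` helper and the labels are invented, and phylogenize's own error messages may differ.

```python
from collections import Counter


def check_environment(env_labels, chosen, min_samples=2):
    """Return None if `chosen` is usable, otherwise a message describing the problem."""
    counts = Counter(env_labels)
    if chosen not in counts:
        # Look for labels that differ only in case or surrounding whitespace.
        near = [e for e in counts if e.strip().lower() == chosen.strip().lower()]
        hint = f" (did you mean {near[0]!r}?)" if near else ""
        return f"environment {chosen!r} not found{hint}"
    if counts[chosen] < min_samples:
        return f"environment {chosen!r} has only {counts[chosen]} sample(s)"
    return None


# Made-up environment column from a metadata table.
labels = ["Stool", "Stool", "Saliva", "Saliva", "Throat"]
```

With these labels, `check_environment(labels, "Stool")` returns `None`, whereas `"stool"` is reported as not found and `"Throat"` as having too few samples.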
Additional options
"Data type" is relatively self-explanatory: is your data
16S amplicon sequencing data, or shotgun metagenomics data processed
with MIDAS?
For most people, the MIDAS database version will be the latest, v1.2.
We also provide the option to use a previous version of the
database, since species are named differently in MIDAS v1.2 and
v1.0, but for most people the default (1.2) will be appropriate.
Interpreting results
You can see what a sample results page looks like here.
Your results will be delivered at a URL that should look like
www.phylogenize.org/results/<result_id>, with a
long string replacing <result_id>. This is where
you should eventually be able to pick up your results. We do not
make the result IDs searchable, so it is a good idea to bookmark or
otherwise note this URL.
Anyone with this URL will be able to see your results, so only share
it with members of your team and collaborators. (Also, even though we
don't make these URLs publicly searchable, you still should not upload
any data with personally-identifiable information, sensitive personal
information, or any other information that needs to remain 100%
completely secure.)
The results have two components. The first is an HTML report
summarizing and visualizing the data. The second is a .tgz file
(compressed archive) that should contain the full set of results,
including the following files:
- phenotype.tab: Tab-delimited file giving the
  calculated phenotype values (i.e., prevalence or specificity)
  for every microbe.
- pos-sig-thresholded.csv: Comma-delimited file
  giving all FIGfam gene families significantly positively
  associated with the phenotype, within a given phylum. This
  file also contains descriptions of the significant
  genes.
- all-results.csv: Comma-delimited file giving
  the effect size and uncorrected p-value for the association
  of every FIGfam gene family in every phylum with the
  calculated phenotype. This can be useful if, for example,
  you want to apply your own threshold for significance or a
  different p-value correction method, or if you want to
  further filter your results by effect size.
- enr-table.csv: FDR-corrected SEED subsystem
  enrichments for the significantly positively associated
  FIGfam gene families. These are pre-computed for three
  gene-wise levels of significance: "strong" (0.05),
  "med" (0.1), and "weak" (0.25). SEED
  subsystems significant at an FDR of 25% are returned.
- enr-overlaps.csv and enr-overlaps-sorted.csv: These files give the
  individual significant FIGfams underlying any SEED subsystem
  enrichments, which can aid interpretation.
- progress.txt and stderr.txt: These
  are the outputs of phylogenize; if your job
  finished without an error message, you probably don't need
  to refer to them.
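As an illustration of applying your own correction to the uncorrected p-values in all-results.csv, here is a minimal sketch of the Benjamini-Hochberg procedure. We chose this method for the example; it is not necessarily the correction phylogenize applies. The p-values are made up rather than read from the file, because the file's column names are not given here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up uncorrected p-values standing in for one column of all-results.csv.
p_values = [0.001, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
```

You could then keep only the gene families whose adjusted value falls below a threshold of your choosing, optionally combined with a filter on effect size.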
To make phylogenize available for other users, results will
be deleted approximately 7 days after completion, so it is a good
idea to save the report and results to disk before then.
Examples
Human Microbiome Project
If you would like to take phylogenize for a test drive, we
have provided a BIOM dataset containing 16S data
and metadata from the Human Microbiome Project, re-analyzed with
DADA2. (The data are read counts, merged by individual. The various
regions sequenced of the 16S gene are provided in a single file.
Reads corresponding to different samples from the same individual
have been summed.) Note: this is a large dataset, so it may
take a long time to run (on the order of an hour). We have also
provided a reduced dataset consisting of only the Bacteroidetes,
in tabular form with its own metadata,
which should run faster, particularly when calculating specificity.
You will have to provide an environment to calculate either
specificity or prevalence. The various environments represented in
this dataset are:
- Anterior nares
- Attached/Keratinized gingiva
- Buccal mucosa
- Hard palate
- L_Antecubital fossa
- Mid vagina
- Palatine Tonsils
- Posterior fornix
- R_Antecubital fossa
- Retroauricular crease
- Saliva
- Stool
- Subgingival plaque
- Supragingival plaque
- Throat
- Tongue dorsum
- Vaginal introitus
A sample report generated on these data, containing the results for
associating genes with specificity for the "Stool" environment, can
be found here.
Earth Microbiome Project
In our preprint,
we use data from the Earth Microbiome Project together with
phylogenize to identify microbial genes associated with the
plant rhizosphere. The full EMP data are very large, so we only
provide the report file
here.
Troubleshooting
Here are a few areas where you may run into trouble:
- phylogenize appears to be stuck a little over halfway through.
  - This is probably totally normal; the progress bar is
    approximate and this is likely to be the step where
    associations are actually calculated. If you check
    back, you should eventually see more subtle signs of
    progress under "monitor warnings/errors."
- phylogenize doesn't appear to be running my
  job: the bar is stuck at 0%.
  - Because phylogenize takes a lot of memory
    to run, only one job can run at a time. If you
    reload the page and look under "monitor
    warnings/errors" you should see information
    about how many other jobs are in the queue. If you
    don't see this information and the job is still
    stuck at 0% for a prolonged period of time, contact
    support (see About/Contact in the navigation
    bar).
- phylogenize gave me an error that I don't
  understand.
  - Check that your data tables satisfy the above
    requirements (in particular, are your metadata
    columns named correctly, and do your sample IDs
    match up?).
  - If your job ran out of memory, try converting your data
    to BIOM format (see above).
  - If there's still a problem, e-mail the webmaster
    (see About/Contact in the navigation bar). Include
    your result ID (the string of characters after
    "results" in the URL) and attach your
    input files and the output that appears when you
    click "monitor warnings/errors".
We also encourage you to report bugs in phylogenize using
the issue tracker.
Running phylogenize locally
In addition to being available as a web tool, phylogenize can be run on a
laptop, desktop, or server. In particular, if you are getting an
"out of memory" error, you have a lot of jobs, or you have
a more specific use case that is not covered by the server, we
recommend that you download phylogenize from our code
repository at
https://bitbucket.com/pbradz/phylogenize.
The core of phylogenize is an R package, with a web app written
in Python using Flask. The web app provides a user interface and a
scheduling system (using Beanstalk). phylogenize can also
be run in QIIME2 using the plugin
q2-phylogenize.
This means that there are three ways to run phylogenize
locally:
- In R, calling functions from the R package directly;
- In QIIME2, using the plugin q2-phylogenize;
- In a web browser, by running the web application locally.
For most people, the first two approaches are probably the easiest.
The third option is recommended only if you want multiple people to
be able to run phylogenize on the same local machine.
Instructions on how to use phylogenize as an R package or
through a local web server are found on its Bitbucket site.
Instructions on how to use phylogenize with QIIME2 can be
found on the
q2-phylogenize Bitbucket repository.
Citing phylogenize
Bradley PH and Pollard KS, "phylogenize: a web tool to identify
microbial genes underlying environment associations." In review.
Bradley PH, Nayfach S, and Pollard KS, "Phylogeny-corrected
identification of microbial gene families relevant to human gut
colonization." PLOS Comput Biol 14 (8), e1006242 (2018).