Tutorial

Learn how to upload and analyze your own data here.

What is phylogenize?

phylogenize is a web tool that allows users to link microbial genes to environments, accounting for phylogeny.

More specifically, given community composition data, phylogenize links genes in microbial genomes to either microbial prevalence in, or specificity for, a given environment, while also taking into account an important potential confounder: the phylogenetic relationships between microbes. The method is described fully in Bradley, Nayfach, and Pollard (2018).

Using phylogenize

Overview

To use phylogenize, you need to upload the following things:

  • Community composition data, either from 16S or shotgun sequencing data.
  • Sample metadata, which maps sample IDs in your community composition data to environments.

You also need to select the following:

  • Whether to link genes to prevalence or specificity, which are defined as follows:
    • prevalence: how frequently is a microbe observed in a given environment?
    • specificity: how specific is a microbe for a given environment compared to all others?
    • Note: to calculate specificity, there must be more than one environment present.
  • A choice of environment: prevalence and specificity are defined with respect to a given environment, like "soil" or "stool" or "marine."

Here is a visual explainer of the phylogenize user interface:

Example of form

Data

phylogenize requires taxonomic abundances. These can either be derived from shotgun data or 16S data, and can be read counts, relative abundances, or presence-absence data.

More specifically, your data should be provided as a matrix where:

  • each row is a species ID (shotgun data) or a sequence variant (16S data)
  • each column is a sample ID of your choosing
  • each entry is any of the following:
    • a number of reads
    • a relative abundance
    • 0 for absence and 1 for presence
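
The three kinds of entries are closely related: relative abundances and presence-absence values can both be derived from read counts. The following is a minimal sketch of that relationship, not part of phylogenize itself. The taxon names, sample names, and counts are made up for illustration.

```python
# Hypothetical read counts: rows are taxa (species IDs or ASVs), columns are samples.
counts = {
    "taxon_1": {"sampleA": 12, "sampleB": 0},
    "taxon_2": {"sampleA": 4, "sampleB": 7},
}
samples = ["sampleA", "sampleB"]


def relative_abundance(counts, samples):
    """Divide each count by the total number of reads in its sample."""
    totals = {s: sum(row[s] for row in counts.values()) for s in samples}
    return {
        taxon: {s: (row[s] / totals[s] if totals[s] else 0.0) for s in samples}
        for taxon, row in counts.items()
    }


def presence_absence(counts, samples):
    """Record 1 where a taxon has at least one read in a sample, else 0."""
    return {
        taxon: {s: int(row[s] > 0) for s in samples}
        for taxon, row in counts.items()
    }


rel = relative_abundance(counts, samples)
pa = presence_absence(counts, samples)
```

Any one of the three matrices (`counts`, `rel`, or `pa`) has the shape described above: one row per taxon and one column per sample.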

Shotgun data

Shotgun data should be processed with MIDAS, with database version either v1.2 (recommended) or v1.0 (deprecated). You do not need to run the full MIDAS pipeline; you will just need to do metagenomic species profiling as described here. The output file you need should be called species_profile.txt .

16S data

16S data should be processed using a denoising algorithm. Two common options are Deblur and DADA2. Both of these algorithms will result in abundances for "amplicon sequence variants" (ASVs), which correspond to the distinct DNA sequences that are obtained by denoising.

phylogenize works by mapping these ASVs back to MIDAS genome clusters using the fast aligner BURST. The default threshold for matching a genome cluster is 98.5% sequence identity, since we found that this gives a reasonable trade-off between sensitivity and specificity. Ambiguous matches are discarded, and reads/abundances mapping to the same genome cluster are summed together. (In practice, though, we have never actually seen ambiguous matches.)
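
The collapsing step can be sketched roughly as follows. This is an illustration only, not phylogenize's actual code: the hit list stands in for aligner output, all identifiers and numbers are invented, and "ambiguous" is assumed here to mean that one ASV passes the threshold against more than one genome cluster.

```python
from collections import defaultdict

IDENTITY_THRESHOLD = 98.5  # percent identity, the default described above


def collapse_asvs(hits, asv_counts, threshold=IDENTITY_THRESHOLD):
    """Sum ASV abundances per genome cluster, dropping weak and ambiguous matches.

    hits: iterable of (asv_id, cluster_id, percent_identity) tuples (hypothetical layout).
    asv_counts: {asv_id: {sample_id: count}}.
    """
    passing = defaultdict(set)
    for asv, cluster, identity in hits:
        if identity >= threshold:
            passing[asv].add(cluster)

    cluster_counts = defaultdict(lambda: defaultdict(int))
    for asv, clusters in passing.items():
        if len(clusters) != 1:
            continue  # assumed definition of an ambiguous match: discard it
        (cluster,) = clusters
        for sample, n in asv_counts.get(asv, {}).items():
            cluster_counts[cluster][sample] += n
    return {c: dict(per_sample) for c, per_sample in cluster_counts.items()}


hits = [
    ("asv1", "cluster_X", 99.2),
    ("asv2", "cluster_X", 98.6),
    ("asv3", "cluster_Y", 97.0),  # below threshold
    ("asv4", "cluster_Y", 99.0),  # passes against two clusters: ambiguous
    ("asv4", "cluster_Z", 99.0),
]
asv_counts = {
    "asv1": {"s1": 5, "s2": 1},
    "asv2": {"s1": 2, "s2": 0},
    "asv3": {"s1": 9, "s2": 9},
    "asv4": {"s1": 4, "s2": 4},
}
collapsed = collapse_asvs(hits, asv_counts)
```

Here `asv1` and `asv2` are summed into `cluster_X`, while `asv3` and `asv4` contribute nothing.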

Metadata

Your metadata should be provided in the form of a matrix where:

  • each row corresponds to a sample in your data;
  • there are at least the following two columns:
    • a column containing the sample IDs, matching the sample IDs in your data (named sample for tabular data and #SampleID for BIOM data);
    • a column containing the environment labels corresponding to each sample (by default called env, but you can specify a different column name);
  • and, optionally:
    • a column containing the batch or dataset to which each sample belongs. This is used for adjusting prevalence estimates across studies with unequal numbers of samples (by default called dataset, but you can specify a different column name).
    • If you only have one dataset, you don't need to specify a dataset column. Just select 1 under "Number of datasets".
    • Note: right now, you can only calculate specificity if there is just one dataset.
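
Before uploading, it can help to check a tab-delimited metadata file against these requirements. The sketch below is a hypothetical pre-flight check, not something phylogenize provides. It assumes the default column names sample and env, and it also applies the rule, described later in this tutorial, that each environment needs at least two samples.

```python
import csv
import io
from collections import Counter


def check_metadata(metadata_text, data_sample_ids, env_column="env", id_column="sample"):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []

    problems = [f"missing column: {c}" for c in (id_column, env_column) if c not in columns]
    if problems:
        return problems

    data_ids = set(data_sample_ids)
    meta_ids = {row[id_column] for row in rows}
    for sample in sorted(data_ids - meta_ids):
        problems.append(f"sample {sample} has no metadata row")

    # Count only samples that are actually present in the data matrix.
    env_sizes = Counter(row[env_column] for row in rows if row[id_column] in data_ids)
    for env, n in sorted(env_sizes.items()):
        if n < 2:
            problems.append(f"environment {env} has fewer than two samples")
    return problems


metadata_text = "sample\tenv\ns1\tStool\ns2\tStool\ns3\tSaliva\n"
problems = check_metadata(metadata_text, ["s1", "s2", "s3", "s4"])
```

In this made-up example, `problems` reports that `s4` has no metadata row and that the Saliva environment has only one sample.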

File formats

Your data and metadata should be in either tabular or BIOM format. Tabular format is simpler, but takes up more space. The maximum upload size is 35 MB, so if your data are larger, you will need to either convert them to BIOM format (see below) or run phylogenize locally.

You can upload either:

  • two tab-delimited files (one data file and one metadata file);
  • one BIOM file (containing data and metadata);
  • or one tab-delimited metadata file and one BIOM-formatted abundance file.

Tabular format data

Tabular data and metadata files should look like this:

Top corner of data

Top corner of metadata

Note that the metadata table needs to have a column labeled sample (matching the columns of your data table).

The table should also give the environment of each sample (by default, phylogenize looks for a column named "env", but you can change this in the "Environment column" field).

If your data represents samples from different datasets, batches, or studies, one column should also say which dataset, batch, or study each sample is taken from (by default: "dataset").

BIOM format data

BIOM format data is more compact than tabular data and is more suitable for larger datasets. This is because (1) BIOM represents matrices that are mostly zeroes (i.e., "sparse") more efficiently, and most taxonomic matrices are sparse, and (2) BIOM files can use a binary representation (HDF5) that takes up less space than plain text.

A BIOM file can contain multiple tables, so you will only need to upload one file (though if you prefer, you can upload a separate metadata file as described above). This one file needs to have the following tables:

  • Observation matrix or "OTU table". This is where the community composition data will go. Despite the name, rows should actually be either MIDAS species IDs or amplicon sequence variants (ASVs), not OTUs.
  • Sample metadata matrix (unless uploading a separate tabular file).

The sample metadata matrix should include a column with environment annotations (by default called env) and, if there are multiple datasets, dataset annotations (by default called dataset). Whatever these columns are named, they should match the names you provide in the "Environment column" and "Dataset column" fields respectively, just as with tabular data.

To convert your tabular data into a single BIOM file, install the biom utility. Then run the conversion as follows (adapted from the biom-format.org documentation):

biom convert -i your_data.tab -o your_data_and_metadata.biom --to-hdf5 --table-type="OTU table" --sample-metadata-fp metadata.tab

Replace your_data.tab and metadata.tab with the names of your tab-delimited data files and, optionally, your_data_and_metadata with the name of your dataset. Important note: before doing this step, edit your metadata file and rename the "sample" column to "#SampleID", otherwise the metadata won't actually be added (!) and phylogenize will throw a "metadata not found" error.
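
The column rename can be done in any text editor. As one possible alternative, here is a minimal Python sketch that rewrites only the header line of a tab-delimited metadata table. The function name and the example table are hypothetical.

```python
def rename_sample_column(metadata_text, old="sample", new="#SampleID"):
    """Rename one column in the header line of a tab-delimited table; data rows are untouched."""
    header, newline, rest = metadata_text.partition("\n")
    fields = header.split("\t")
    if old not in fields:
        raise ValueError(f"no column named {old!r} in header: {fields}")
    fields = [new if name == old else name for name in fields]
    return "\t".join(fields) + newline + rest


original = "sample\tenv\tdataset\ns1\tStool\tstudy1\ns2\tSaliva\tstudy1\n"
renamed = rename_sample_column(original)
```

To apply this to a file, read its contents into a string, pass the string through `rename_sample_column`, and write the result to a new file that you then give to `--sample-metadata-fp`.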

For more help using the biom utility, try looking at the pages entitled "Converting between file formats" and "Adding sample and observation metadata to biom files".

Choosing a phenotype

phylogenize can associate genes with either microbial prevalence or microbial specificity for a given environment. Prevalence gives how often a microbe is found in an environment. Specificity compares this prevalence to prevalences in all other provided environments. The following toy example gives an intuition for what this means:

Cartoon of prevalence and specificity calculations

Microbe 1 is detected (black boxes) often in samples from both healthy (A-G) and sick (H-N) individuals. Microbe 2 is detected more frequently in healthy than in sick samples, and microbe 3 is detected relatively seldom in either. This means that microbes 1 and 2 will have high prevalence in the "healthy" environment, while microbe 3 will have low prevalence. However, microbe 2 is the only microbe to have high specificity for "healthy" over "sick," because of the difference in prevalence between these environments.

Prevalence will tend to capture both cosmopolitan microbes and ones that are particularly well-adapted to a given environment. Specificity is useful for contrasting two sites (like gut vs. skin) or states (healthy vs. sick).

The exact details of how prevalence and specificity are calculated can be read in our paper. Basically, we perform additional steps to re-weight datasets equally (for prevalence), to make sure that small numbers of observations don't skew the results (for specificity), and to transform these quantities into something approximately normally-distributed (both).
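
Leaving those additional steps aside, the raw intuition can be sketched in a few lines of Python. This is deliberately naive: prevalence is taken as the plain fraction of samples in which a microbe is detected, and the difference between environments serves as a crude stand-in for specificity. It is not the statistic phylogenize computes, and the detection patterns below are made up to mimic the toy example rather than read off the figure.

```python
HEALTHY = set("ABCDEFG")  # samples A-G
SICK = set("HIJKLMN")     # samples H-N

# Hypothetical detections: the samples in which each microbe was observed.
detected = {
    "microbe_1": set("ABCDEFHIJKLM"),  # common in both groups
    "microbe_2": set("ABCDEFI"),       # common in healthy, rare in sick
    "microbe_3": set("BK"),            # rare in both groups
}


def prevalence(microbe, env_samples):
    """Naive prevalence: fraction of an environment's samples with the microbe detected."""
    return len(detected[microbe] & env_samples) / len(env_samples)


for microbe in detected:
    p_healthy = prevalence(microbe, HEALTHY)
    p_sick = prevalence(microbe, SICK)
    print(f"{microbe}: healthy={p_healthy:.2f} sick={p_sick:.2f} "
          f"difference={p_healthy - p_sick:+.2f}")
```

With these invented values, microbes 1 and 2 have the same high prevalence in "healthy", but only microbe 2 shows a large difference between the two environments.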

Note: it is possible with low numbers of samples that no microbes will be specific to a given environment, in which case phylogenize will return an error because there is not enough signal to compute specificity.

Note: while you do not have to rarefy your data (i.e., resample to the same read depth), you should make sure that read depth does not systematically differ across environments. If it does, then rarefying may be a good idea because some microbes will simply appear more by chance in deeper-sequenced data, and if read depth is correlated with environment, you may get spurious associations.
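
One simple way to look for such a difference is to compare the typical read depth per environment before uploading. The sketch below assumes read-count data held in plain dictionaries. The taxon, sample, and environment names are hypothetical, and what counts as a worrying difference is left to your judgment.

```python
from statistics import median


def depth_by_environment(counts, sample_env):
    """Return the median per-sample read depth for each environment.

    counts: {taxon: {sample: reads}}; sample_env: {sample: environment}.
    """
    depths = {}
    for sample, env in sample_env.items():
        depth = sum(row.get(sample, 0) for row in counts.values())
        depths.setdefault(env, []).append(depth)
    return {env: median(values) for env, values in depths.items()}


counts = {
    "taxon_1": {"s1": 900, "s2": 1100, "s3": 90, "s4": 110},
    "taxon_2": {"s1": 100, "s2": 100, "s3": 10, "s4": 10},
}
sample_env = {"s1": "gut", "s2": "gut", "s3": "skin", "s4": "skin"}
medians = depth_by_environment(counts, sample_env)
```

In this made-up example the gut samples are sequenced about ten times deeper than the skin samples, which is the kind of systematic difference where rarefying may be worth considering.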

Note: in order to be considered, each environment and/or dataset must have at least two samples associated with it. This is because it doesn't really make sense to calculate prevalence based on a single presence/absence value (and because specificity, in turn, depends partly on calculating prevalence in each environment).

Choosing an environment

Simply enter the name of the environment for which you are interested in calculating a phenotype. This should match an entry in the env column of the sample metadata (or the other column you named in the "Environment column" field) exactly. At least two samples per environment are necessary.

Additional options

"Data type" is relatively self-explanatory: is your data 16S amplicon sequencing data, or shotgun metagenomics data processed with MIDAS?

For most people, the default MIDAS database version (the latest, v1.2) will be appropriate. We also provide the option to use a previous version of the database, since species are named differently in MIDAS v1.2 and v1.0.

Interpreting results

You can see what a sample results page looks like here.

Your results will be delivered at a URL that should look like: www.phylogenize.org/results/<result_id> with a long string replacing <result_id>. This is where you should eventually be able to pick up your results. We do not make the result IDs searchable, so it is a good idea to bookmark or otherwise note this URL.

Anyone with this URL will be able to see your results, so only share it with members of your team and collaborators. (Also, even though we don't make these URLs publicly searchable, you still should not upload any data with personally-identifiable information, sensitive personal information, or any other information that needs to remain completely secure.)

The results have two components. The first is an HTML report summarizing and visualizing the data. The second is a .tgz file (compressed archive) that should contain the full set of results, including the following files:

  • phenotype.tab: Tab-delimited file giving the calculated phenotype values (i.e., prevalence or specificity) for every microbe.
  • pos-sig-thresholded.csv: Comma-delimited file giving all FIGfam gene families significantly positively associated with the phenotype, within a given phylum. This file also contains descriptions of the significant genes.
  • all-results.csv: Comma-delimited file giving the effect size and uncorrected p-value for the association of every FIGfam gene family in every phylum with the calculated phenotype. This can be useful if, for example, you want to apply your own threshold for significance or a different p-value correction method, or if you want to further filter your results by effect size.
  • enr-table.csv: FDR-corrected SEED subsystem enrichments for the significantly positively associated FIGfam gene families. These are pre-computed for three gene-wise levels of significance: "strong" (0.05), "med" (0.1), and "weak" (0.25). SEED subsystems significant at an FDR of 25% are returned.
  • enr-overlaps.csv and enr-overlaps-sorted.csv: These files give the individual significant FIGfams underlying any SEED subsystem enrichments, which can aid interpretation.
  • progress.txt and stderr.txt: These are the outputs of phylogenize; if your job finished without an error message, you probably don't need to refer to them.
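
If you want to apply your own multiple-testing correction to the uncorrected p-values in all-results.csv, the Benjamini-Hochberg procedure is one common choice. The sketch below implements it on a plain list of p-values. The column layout of all-results.csv is not described here, so reading the file is left out, and the p-values shown are invented. This is not necessarily the correction phylogenize applies internally.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the minimum
    # of p * m / rank over all equal or larger ranks (this enforces monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # hypothetical uncorrected p-values
qvalues = benjamini_hochberg(pvalues)
significant = [q <= 0.05 for q in qvalues]
```

The adjusted values can then be thresholded at whatever FDR you prefer, optionally combined with a filter on effect size.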

To make phylogenize available for other users, results will be deleted approximately 7 days after completion, so it is a good idea to save the report and results to disk before then.

Examples

Human Microbiome Project

If you would like to take phylogenize for a test drive, we have provided a BIOM dataset containing 16S data and metadata from the Human Microbiome Project, re-analyzed with DADA2. (The data are read counts, merged by individual. The various regions sequenced of the 16S gene are provided in a single file. Reads corresponding to different samples from the same individual have been summed.) Note: this is a large dataset, so it may take a long time to run (on the order of an hour).

You will have to provide an environment to calculate either specificity or prevalence. The various environments represented in this dataset are:

  • Anterior nares
  • Attached/Keratinized gingiva
  • Buccal mucosa
  • Hard palate
  • L_Antecubital fossa
  • Mid vagina
  • Palatine Tonsils
  • Posterior fornix
  • R_Antecubital fossa
  • Retroauricular crease
  • Saliva
  • Stool
  • Subgingival plaque
  • Supragingival plaque
  • Throat
  • Tongue dorsum
  • Vaginal introitus

A sample report generated on these data, containing the results for associating genes with specificity for the "Stool" environment, can be found here.

Earth Microbiome Project

In our preprint, we use data from the Earth Microbiome Project together with phylogenize to identify microbial genes associated with the plant rhizosphere. The full EMP data are very large, so we only provide the report file here.

Troubleshooting

Here are a few areas where you may run into trouble:

  • phylogenize appears to be stuck a little over halfway through.
    • This is probably totally normal; the progress bar is approximate and this is likely to be the step where associations are actually calculated. If you check back, you should eventually see more subtle signs of progress under "monitor warnings/errors."
  • phylogenize doesn't appear to be running my job: the bar is stuck at 0%.
    • Because phylogenize takes a lot of memory to run, only one job can run at a time. If you reload the page and look under "monitor warnings/errors" you should see information about how many other jobs are in the queue. If you don't see this information and the job is still stuck at 0% for a prolonged period of time, contact support (see About/Contact in the navigation bar).
  • phylogenize gave me an error that I don't understand.
    • Check that your data tables satisfy the above requirements (in particular, are your metadata columns named correctly, and do your sample IDs match up?).
    • If your job ran out of memory, try converting your data to BIOM format (see above).

If there's still a problem, e-mail the webmaster (see About/Contact in the navigation bar). Add your result ID (the string of characters after "results" in the URL) and attach your input files and the output that appears when you click "monitor warnings/errors".

We also encourage you to report a bug in phylogenize using the issue tracker.

Running phylogenize locally

In addition to being available as a web tool, phylogenize can be run on a laptop, desktop, or server. In particular, if you are getting an "out of memory" error, you have a lot of jobs, or you have a more specific use case that is not covered by the server, we recommend that you download phylogenize from our code repository at https://bitbucket.com/pbradz/phylogenize.

The core of phylogenize is an R package, with a web app written in Python using Flask. The web app provides a user interface and a scheduling system (using Beanstalk). phylogenize can also be run in QIIME2 using the plugin q2-phylogenize.

This means that there are three ways to run phylogenize locally:

  • In R, calling functions from the R package directly;
  • In QIIME2, using the plugin q2-phylogenize;
  • In a web browser, by running the web application locally.

For most people, the first two approaches are probably the easiest. The third option is recommended only if you want multiple people to be able to run phylogenize on the same local machine.

Instructions on how to use phylogenize as an R package or through a local web server are found on its Bitbucket site. Instructions on how to use phylogenize with QIIME2 can be found on the q2-phylogenize Bitbucket repository.

Citing phylogenize

Bradley PH and Pollard KS, "phylogenize: a web tool to identify microbial genes underlying environment associations." In review.

Bradley PH, Nayfach S, and Pollard KS, "Phylogeny-corrected identification of microbial gene families relevant to human gut colonization." PLOS Comput Biol 14 (8), e1006242.