How to estimate TE¶
Approaches to estimate TE range from a simple ratio calculation, ignoring variance, low expression of RPFs, batch effects, etc. to more complex methodologies such as Ribodiff, Xtail, Riborex, or Anota2Seq, to name but a few. Ribotools uses a simple, but flexible approach that builds on the well-established differential expression software DEseq2, allowing to take into account sample-to-sample variance and/or covariates such as batch effect. With this approach, TE can be estimated in a few minutes, even for large datasets.
Ribotools automatically classifies features into four broad classes, according to the terminology defined in the deltaTE protocol. Translationally forwarded features have a significant change in RNA and RPF at the same rate, with no significant change in TE. It is assumed that changes in RNA abundance drive changes in RPFs. Exclusive features are differentially translated features that have a significant change in RPF, with no change in RNA, i.e. changes in TE are driven by changes in RPFs exclusively. These are classified based on significance only, although for the Wald tests, we set a threshold based on log fold change. Intensified and buffered features take into account the direction of fold change. Translationally intensified features have a significant change in TE that acts with the effect of transcription. Translationally buffered features have a significant change in TE that counteracts the change in RNA (buffering transcription).
How to prepare the sample table for TE¶
The following keys are required (configuration file):
contrastsA dictionary key: value, where key is a name for the contrast to be tested, and value contains 2 items, the first item is the condition to be tested against the second (reference).tea_dataThe base output location for all created files.
For example
riboseq_samples:
d01-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-1.fastq.gz
d01-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-2.fastq.gz
d05-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-1.fastq.gz
d05-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-2.fastq.gz
riboseq_sample_name_map:
d01-1: Ribo-d1-1
d01-2: Ribo-d1-2
d05-1: Ribo-d5-1
d05-2: Ribo-d5-2
rnaseq_samples:
d01-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-1.fastq.gz
d01-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-2.fastq.gz
d05-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-1.fastq.gz
d05-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-2.fastq.gz
rnaseq_sample_name_map:
d01-1: RNA-d1-1
d01-2: RNA-d1-2
d05-1: RNA-d5-1
d05-2: RNA-d5-2
matching_samples:
d01-1: d01-1
d01-2: d01-2
d05-1: d05-1
d05-2: d05-2
tea_data: /path/to/tea-results
# second is always reference level - here "d1"
contrasts:
d5_vs_d1:
- d5
- d1
HTSeq workflow¶
After run-htseq-workflow, count tables are available under <riboseq_data>/count-tables and <rnaseq_data>/count-tables, see General usage. To create the sample table
get-sample-table [--help] config
This will create a file named sample-table<-project_name>.csv, where project_name is the value from that key in the configuration file (or none if this key is not present). Each row describes one sample, the first column is the sample name, the second the file path to the count table generated by htseq-count, and remaining columns are metadata, for example
sampleName |
fileName |
assay |
condition |
|---|---|---|---|
Ribo-d1-1 |
/path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv |
ribo |
d1 |
Ribo-d1-2 |
/path/to/riboseq-results/count-tables/Ribo-d1-2-unique.length-29-30.tsv |
ribo |
d1 |
Ribo-d5-1 |
/path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv |
ribo |
d5 |
Ribo-d5-2 |
/path/to/riboseq-results/count-tables/Ribo-d5-2-unique.length-29-30-31.tsv |
ribo |
d5 |
RNA-d1-1 |
/path/to/rnaseq-results/count-tables/RNA-d1-1-unique.length-31.tsv |
rna |
d1 |
RNA-d1-2 |
/path/to/rnaseq-results/count-tables/RNA-d1-2-unique.length-30.tsv |
rna |
d1 |
RNA-d5-1 |
/path/to/rnaseq-results/count-tables/RNA-d5-1-unique.length-31.tsv |
rna |
d5 |
RNA-d5-2 |
/path/to/rnaseq-results/count-tables/RNA-d5-2-unique.length-31.tsv |
rna |
d5 |
Important
Values from riboseq_samples and rnaseq_samples (required) are used in replacement for riboseq_sample_name_map
and rnaseq_sample_name_map (optional). Whether you use samples or sample_name_map, we strongly recommend to
assign meaningful names, otherwise you may end up with duplicate row.names. Use keywords such as Ribo and RNA
to facilitate data integration for the analysis. The conditions are assigned using the contrasts values for each contrast,
and these values must match either the samples or the sample_name_map values (no regex). If a match is not possible, entries will be NA.
If count-tables or files are missing, or if values from samples or sample_name_map do not match those used to assign the file names,
the fileName column will be missing. If you did not run-htseq-workflow, this is fine, otherwise check your configuration file
and make sure the workflow completed successfully.
If you have batches, you should add a columm to this file, and the header must be named batch. The assay column must contain two levels: ribo and rna. In all cases, before proceeding further, always proof-read this file!
General workflow¶
To estimate TE with data prepared from a different workflow, the sample table must conform to the Ribotools specs (that derive from DESeq2). In it’s current format, it must have, minimally, the following header sampleName,assay,condition, in this same order (see above). The assay is either ribo or rna, and the condition must match the list of contrasts from the config. The format should be CSV.
The count table must include integer counts for both RNA and RPFs, and column names (samples) must match sampleName from the sample table. The first column must be feature ids or symbols.
The configuration must include additionally the following keys:
sample_tableThe path to a sample table.count_tableThe path to a count table.
General usage¶
run-tea [options] config
For all options, consult the API for run-tea. See also How to add a config file. To estimate TE for Ribo-seq ORFs instead of genes, use --symbolCol, --orfCol, and/or --delim, see Estimate TE or DE using Ribo-seq ORFs for details.
Note
The count tables can be TAB- or CSV-formatted. The default --delim option is TAB. Anything else will fall back to white space (one or
more spaces, tabs, newlines or carriage returns). The sample table must be CSV-formatted.
Tip
To run the program in the background, and redirect the output to log, simply run-tea [options] config > log.out 2>&1 &
Required input¶
Output (count tables) from run-htseq-workflow, see How to estimate abundance, and sample table, see How to prepare the sample table for TE.
Alternatively, existing sample and count tables can be given via the configuration file.
Expected output¶
Output files are written to <tea_data>/<method>/<contrasts>, where tea_data is the path given in the configuration file, method is either LRT or deltaTE, and contrasts are the names given to the contrasts in the configuration file. Files include the full results table in xlxs format, a single PDF output showing the different classes of features that are listed in forwarded.txt, exclusive.txt, intensified.txt, and buffered.txt.
More about the model¶
Two approaches are implemented that provide similar output. The default method is an LRT test where a full model ~assay+condition+assay:condition is tested against a reduced model ~assay+condition, ignoring the contrasts for the p-value calculation, i.e the p-value is for the difference between the full and reduced model, and not the fold change. This essentially tests for the interaction term, or the condition effect across assays, e.g.
Independently of the classification of features into forwarded, exclusive, intensified and buffered, you may be interested in features with a significant padj.dte. Standard Wald tests are used to test for the main effect of condition on RNA and RPF (with log fold change threshold).
We also re-implemented the deltaTE method, which only uses Wald tests, testing for the interaction term using a full model, and re-fitting the data separately to test for RPF and RNA with a reduced model of the form ~condition.
In all cases, independent filtering and p-value adjustment is done using Independent Hypothesis Weighting (IHW). Even without IHW, BH is used by default, and rankings are preserved “up to NA” (due to independent filtering e.g. adjusted p-values are set to NA for outliers). We replace these with 1. Shrunken log2 fold changes are reported, using ashr (adaptive shrinkage estimator) for LRT, and apeglm (adaptive Student’s t prior shrinkage estimator) for deltaTE, although this can be changed with --lfcShrinkMethod.
Note
Currently testing with covariates only allows to include a batch effect, and this is only implemented with the deltaTE method.