How to estimate TE¶

Approaches to estimate TE range from a simple ratio calculation, ignoring variance, low expression of RPFs, batch effects, etc. to more complex methodologies such as Ribodiff, Xtail, Riborex, or Anota2Seq, to name but a few. Ribotools uses a simple, but flexible approach that builds on the well-established differential expression software DEseq2, allowing to take into account sample-to-sample variance and/or covariates such as batch effect. With this approach, TE can be estimated in a few minutes, even for large datasets.

Ribotools automatically classifies features into four broad classes, according to the terminology defined in the deltaTE protocol. Translationally forwarded features have a significant change in RNA and RPF at the same rate, with no significant change in TE. It is assumed that changes in RNA abundance drive changes in RPFs. Exclusive features are differentially translated features that have a significant change in RPF, with no change in RNA, i.e. changes in TE are driven by changes in RPFs exclusively. These are classified based on significance only, although for the Wald tests, we set a threshold based on log fold change. Intensified and buffered features take into account the direction of fold change. Translationally intensified features have a significant change in TE that acts with the effect of transcription. Translationally buffered features have a significant change in TE that counteracts the change in RNA (buffering transcription).

How to prepare the sample table for TE¶

The following keys are required (configuration file):

contrasts A dictionary key: value, where key is a name for the contrast to be tested, and value contains 2 items, the first item is the condition to be tested against the second (reference).
tea_data The base output location for all created files.

For example

riboseq_samples:
 d01-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-1.fastq.gz
 d01-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-2.fastq.gz
 d05-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-1.fastq.gz
 d05-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-2.fastq.gz

riboseq_sample_name_map:
 d01-1: Ribo-d1-1
 d01-2: Ribo-d1-2
 d05-1: Ribo-d5-1
 d05-2: Ribo-d5-2

rnaseq_samples:
 d01-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-1.fastq.gz
 d01-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-2.fastq.gz
 d05-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-1.fastq.gz
 d05-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-2.fastq.gz

rnaseq_sample_name_map:
 d01-1: RNA-d1-1
 d01-2: RNA-d1-2
 d05-1: RNA-d5-1
 d05-2: RNA-d5-2

matching_samples:
 d01-1: d01-1
 d01-2: d01-2
 d05-1: d05-1
 d05-2: d05-2

tea_data: /path/to/tea-results

# second is always reference level - here "d1"
contrasts:
 d5_vs_d1:
  - d5
  - d1

HTSeq workflow¶

After run-htseq-workflow, count tables are available under <riboseq_data>/count-tables and <rnaseq_data>/count-tables, see General usage. To create the sample table

get-sample-table [--help] config

This will create a file named sample-table<-project_name>.csv, where project_name is the value from that key in the configuration file (or none if this key is not present). Each row describes one sample, the first column is the sample name, the second the file path to the count table generated by htseq-count, and remaining columns are metadata, for example

sample-table¶
sampleName	fileName	assay	condition
Ribo-d1-1	/path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv	ribo	d1
Ribo-d1-2	/path/to/riboseq-results/count-tables/Ribo-d1-2-unique.length-29-30.tsv	ribo	d1
Ribo-d5-1	/path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv	ribo	d5
Ribo-d5-2	/path/to/riboseq-results/count-tables/Ribo-d5-2-unique.length-29-30-31.tsv	ribo	d5
RNA-d1-1	/path/to/rnaseq-results/count-tables/RNA-d1-1-unique.length-31.tsv	rna	d1
RNA-d1-2	/path/to/rnaseq-results/count-tables/RNA-d1-2-unique.length-30.tsv	rna	d1
RNA-d5-1	/path/to/rnaseq-results/count-tables/RNA-d5-1-unique.length-31.tsv	rna	d5
RNA-d5-2	/path/to/rnaseq-results/count-tables/RNA-d5-2-unique.length-31.tsv	rna	d5

Important

Values from riboseq_samples and rnaseq_samples (required) are used in replacement for riboseq_sample_name_map and rnaseq_sample_name_map (optional). Whether you use samples or sample_name_map, we strongly recommend to assign meaningful names, otherwise you may end up with duplicate row.names. Use keywords such as Ribo and RNA to facilitate data integration for the analysis. The conditions are assigned using the contrasts values for each contrast, and these values must match either the samples or the sample_name_map values (no regex). If a match is not possible, entries will be NA.

If count-tables or files are missing, or if values from samples or sample_name_map do not match those used to assign the file names, the fileName column will be missing. If you did not run-htseq-workflow, this is fine, otherwise check your configuration file and make sure the workflow completed successfully.

If you have batches, you should add a columm to this file, and the header must be named batch. The assay column must contain two levels: ribo and rna. In all cases, before proceeding further, always proof-read this file!

General workflow¶

To estimate TE with data prepared from a different workflow, the sample table must conform to the Ribotools specs (that derive from DESeq2). In it’s current format, it must have, minimally, the following header sampleName,assay,condition, in this same order (see above). The assay is either ribo or rna, and the condition must match the list of contrasts from the config. The format should be CSV.

The count table must include integer counts for both RNA and RPFs, and column names (samples) must match sampleName from the sample table. The first column must be feature ids or symbols.

The configuration must include additionally the following keys:

sample_table The path to a sample table.
count_table The path to a count table.

General usage¶

run-tea [options] config

For all options, consult the API for run-tea. See also How to add a config file. To estimate TE for Ribo-seq ORFs instead of genes, use --symbolCol, --orfCol, and/or --delim, see Estimate TE or DE using Ribo-seq ORFs for details.

Note

The count tables can be TAB- or CSV-formatted. The default --delim option is TAB. Anything else will fall back to white space (one or more spaces, tabs, newlines or carriage returns). The sample table must be CSV-formatted.

Tip

To run the program in the background, and redirect the output to log, simply run-tea [options] config > log.out 2>&1 &

Required input¶

Output (count tables) from run-htseq-workflow, see How to estimate abundance, and sample table, see How to prepare the sample table for TE. Alternatively, existing sample and count tables can be given via the configuration file.

Expected output¶

Output files are written to <tea_data>/<method>/<contrasts>, where tea_data is the path given in the configuration file, method is either LRT or deltaTE, and contrasts are the names given to the contrasts in the configuration file. Files include the full results table in xlxs format, a single PDF output showing the different classes of features that are listed in forwarded.txt, exclusive.txt, intensified.txt, and buffered.txt.

More about the model¶

Two approaches are implemented that provide similar output. The default method is an LRT test where a full model ~assay+condition+assay:condition is tested against a reduced model ~assay+condition, ignoring the contrasts for the p-value calculation, i.e the p-value is for the difference between the full and reduced model, and not the fold change. This essentially tests for the interaction term, or the condition effect across assays, e.g.

\[\frac{\left(RPF/RNA\right)_{treatment}}{\left(RPF/RNA\right)_{control}} = \frac{\left(RPF_{treatment}/RPF_{control}\right)}{\left(RNA_{treatment}/RNA_{control}\right)}\]

Independently of the classification of features into forwarded, exclusive, intensified and buffered, you may be interested in features with a significant padj.dte. Standard Wald tests are used to test for the main effect of condition on RNA and RPF (with log fold change threshold).

We also re-implemented the deltaTE method, which only uses Wald tests, testing for the interaction term using a full model, and re-fitting the data separately to test for RPF and RNA with a reduced model of the form ~condition.

In all cases, independent filtering and p-value adjustment is done using Independent Hypothesis Weighting (IHW). Even without IHW, BH is used by default, and rankings are preserved “up to NA” (due to independent filtering e.g. adjusted p-values are set to NA for outliers). We replace these with 1. Shrunken log2 fold changes are reported, using ashr (adaptive shrinkage estimator) for LRT, and apeglm (adaptive Student’s t prior shrinkage estimator) for deltaTE, although this can be changed with --lfcShrinkMethod.

Note

Currently testing with covariates only allows to include a batch effect, and this is only implemented with the deltaTE method.