.. _running_te:

How to estimate TE
==================

Approaches to estimate TE range from a simple ratio calculation, ignoring variance, low expression of RPFs, batch effects, *etc.* to more complex methodologies such as `Ribodiff <https://academic.oup.com/bioinformatics/article/33/1/139/2525694?login=true>`_, `Xtail <https://www.nature.com/articles/ncomms11194>`_, `Riborex <https://academic.oup.com/bioinformatics/article/33/11/1735/2964727?login=true>`_, or `Anota2Seq <https://academic.oup.com/nar/article/47/12/e70/5423604?login=true>`_, to name but a few. **Ribotools** uses a simple, but flexible approach that builds on the well-established differential expression software `DEseq2 <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8>`_, allowing to take into account sample-to-sample variance and/or covariates such as batch effect. With this approach, TE can be estimated in a few minutes, even for large datasets.

**Ribotools** automatically classifies features into four broad classes, according to the terminology defined in the `deltaTE protocol <https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpmb.108>`_. Translationally *forwarded* features have a significant change in RNA and RPF at the same rate, with no significant change in TE. It is assumed that changes in RNA abundance drive changes in RPFs. *Exclusive* features are differentially translated features that have a significant change in RPF, with no change in RNA, *i.e.* changes in TE are driven by changes in RPFs exclusively. These are classified based on significance only, although for the Wald tests, we set a threshold based on log fold change. *Intensified* and *buffered* features take into account the direction of fold change. Translationally *intensified* features have a significant change in TE that acts with the effect of transcription. Translationally *buffered* features have a significant change in TE that counteracts the change in RNA (buffering transcription).

.. _prep_tables_te:

How to prepare the sample table for TE
--------------------------------------

The following keys are required (configuration file):

* ``contrasts`` A dictionary *key: value*, where *key* is a name for the contrast to be tested, and *value* contains 2 items, the first item is the condition to be tested against the second (reference).
* ``tea_data`` The base output location for all created files.

For example

.. code-block:: yaml

   riboseq_samples:
    d01-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-1.fastq.gz
    d01-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-2.fastq.gz
    d05-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-1.fastq.gz
    d05-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-2.fastq.gz

   riboseq_sample_name_map:
    d01-1: Ribo-d1-1
    d01-2: Ribo-d1-2
    d05-1: Ribo-d5-1
    d05-2: Ribo-d5-2

   rnaseq_samples:
    d01-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-1.fastq.gz
    d01-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d01-2.fastq.gz
    d05-1: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-1.fastq.gz
    d05-2: /path/to/hiPSC-CM.rna.test-chr1.rep-d05-2.fastq.gz

   rnaseq_sample_name_map:
    d01-1: RNA-d1-1
    d01-2: RNA-d1-2
    d05-1: RNA-d5-1
    d05-2: RNA-d5-2

   matching_samples:
    d01-1: d01-1
    d01-2: d01-2
    d05-1: d05-1
    d05-2: d05-2

   tea_data: /path/to/tea-results

   # second is always reference level - here "d1"
   contrasts:
    d5_vs_d1:
     - d5
     - d1


HTSeq workflow
^^^^^^^^^^^^^^

After ``run-htseq-workflow``, count tables are available under *<riboseq_data>/count-tables* and *<rnaseq_data>/count-tables*, see :ref:`ribotools_usage`. To create the sample table

.. code-block:: bash

    get-sample-table [--help] config

This will create a file named *sample-table<-project_name>.csv*, where ``project_name`` is the value from that key in the configuration file (or none if this key is not present). Each row describes one sample, the first column is the sample name, the second the file path to the count table generated by ``htseq-count``, and remaining columns are metadata, for example

.. csv-table:: sample-table
   :header: sampleName,fileName,assay,condition

   Ribo-d1-1,/path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv,ribo,d1
   Ribo-d1-2,/path/to/riboseq-results/count-tables/Ribo-d1-2-unique.length-29-30.tsv,ribo,d1
   Ribo-d5-1,/path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv,ribo,d5
   Ribo-d5-2,/path/to/riboseq-results/count-tables/Ribo-d5-2-unique.length-29-30-31.tsv,ribo,d5
   RNA-d1-1,/path/to/rnaseq-results/count-tables/RNA-d1-1-unique.length-31.tsv,rna,d1
   RNA-d1-2,/path/to/rnaseq-results/count-tables/RNA-d1-2-unique.length-30.tsv,rna,d1
   RNA-d5-1,/path/to/rnaseq-results/count-tables/RNA-d5-1-unique.length-31.tsv,rna,d5
   RNA-d5-2,/path/to/rnaseq-results/count-tables/RNA-d5-2-unique.length-31.tsv,rna,d5


.. important::

   Values from `riboseq_samples` and `rnaseq_samples` (required) are used in replacement for `riboseq_sample_name_map`
   and `rnaseq_sample_name_map` (optional). Whether you use *samples* or *sample_name_map*, we strongly recommend to
   assign meaningful names, otherwise you may end up with duplicate ``row.names``. Use keywords such as *Ribo* and *RNA*
   to facilitate data integration for the analysis. The conditions are assigned using the `contrasts` values for each contrast,
   and these values must match either the *samples* or the *sample_name_map* values (no regex). If a match is not possible, entries will be NA.

   If *count-tables* or files are missing, or if values from *samples* or *sample_name_map* do not match those used to assign the file names,
   the fileName column will be missing. If you did not ``run-htseq-workflow``, this is fine, otherwise check your configuration file
   and make sure the workflow completed successfully.

   If you have batches, you should add a columm to this file, and the header must be named *batch*. The assay column must contain
   two levels: *ribo* and *rna*. In all cases, before proceeding further, always proof-read this file!


.. _prep_tables_te_general:

General workflow
^^^^^^^^^^^^^^^^

To estimate TE with data prepared from a different workflow, the sample table must conform to the **Ribotools** specs (that derive from **DESeq2**). In it's current format, it must have, minimally, the following header ``sampleName,assay,condition``, in this same order (see above). The ``assay`` is either *ribo* or *rna*, and the ``condition`` must match the list of ``contrasts`` from the config. The format should be CSV.

The count table must include integer counts for both RNA and RPFs, and column names (samples) must match ``sampleName`` from the sample table. The first column must be feature ids or symbols.

The configuration must include additionally the following keys:

* ``sample_table`` The path to a sample table.
* ``count_table`` The path to a count table.

.. _general_usage_te:

General usage
-------------

.. code-block:: bash

    run-tea [options] config

For all options, consult the API for :ref:`api_te`. See also :ref:`howto_config`. To estimate TE for Ribo-seq ORFs instead of genes, use ``--symbolCol``, ``--orfCol``, and/or  ``--delim``, see :ref:`using_orfs_tede` for details.

.. note::

   The count tables can be TAB- or CSV-formatted. The default ``--delim`` option is TAB. Anything else will fall back to white space (one or
   more spaces, tabs, newlines or carriage returns). The sample table must be CSV-formatted.

.. tip::

    To run the program in the background, and redirect the output to log, simply ``run-tea [options] config > log.out 2>&1 &``

Required input
^^^^^^^^^^^^^^

Output (count tables) from ``run-htseq-workflow``, see :ref:`running_htseq_workflow`, and sample table, see :ref:`prep_tables_te`.
Alternatively, existing sample and count tables can be given via the configuration file.

Expected output
^^^^^^^^^^^^^^^

Output files are written to *<tea_data>/<method>/<contrasts>*, where ``tea_data`` is the path given in the configuration file, ``method`` is either *LRT* or *deltaTE*, and ``contrasts`` are the names given to the contrasts in the configuration file. Files include the full results table in xlxs format, a single PDF output showing the different classes of features that are listed in *forwarded.txt*, *exclusive.txt*, *intensified.txt*, and *buffered.txt*.

More about the model
^^^^^^^^^^^^^^^^^^^^

Two approaches are implemented that provide similar output. The default method is an *LRT* test where a full model ``~assay+condition+assay:condition`` is tested against a reduced model ``~assay+condition``, ignoring the contrasts for the p-value calculation, *i.e* the p-value is for the difference between the full and reduced model, and not the fold change. This essentially tests for the interaction term, or the condition effect across assays, *e.g.*

.. math::

   \frac{\left(RPF/RNA\right)_{treatment}}{\left(RPF/RNA\right)_{control}} = \frac{\left(RPF_{treatment}/RPF_{control}\right)}{\left(RNA_{treatment}/RNA_{control}\right)}

Independently of the classification of features into *forwarded*, *exclusive*, *intensified* and *buffered*, you may be interested in features with a significant *padj.dte*. Standard *Wald* tests are used to test for the main effect of condition on RNA and RPF (with log fold change threshold).

We also re-implemented the `deltaTE method <https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpmb.108>`_, which only uses *Wald* tests, testing for the interaction term using a full model, and re-fitting the data separately to test for RPF and RNA with a *reduced* model of the form ``~condition``.

In all cases, independent filtering and p-value adjustment is done using *Independent Hypothesis Weighting (IHW)*. Even without IHW, BH is used by default, and rankings are preserved "up to NA" (due to independent filtering *e.g.* adjusted p-values are set to NA for outliers). We replace these with 1. Shrunken log2 fold changes are reported, using *ashr* (adaptive shrinkage estimator) for *LRT*, and *apeglm* (adaptive Student's t prior shrinkage estimator) for *deltaTE*, although this can be changed with ``--lfcShrinkMethod``.

.. note::

    Currently testing with covariates only allows to include a *batch* effect, and this is only implemented with the *deltaTE* method.