.. _running_de:

How to estimate DE
==================

Differential expression (DE) analysis is a statistical method to detect quantitative changes in expression levels between experimental groups. Genes are typically used as features, but if Ribo-seq ORFs are available, DE can be estimated for Ribo-seq ORFs instead.

**Ribotools** uses Ribo-seq periodic fragment lengths by default, *i.e.* only Ribo-seq reads from periodic fragment lengths are used to determine which features (genes or Ribo-seq ORFs) are differentially regulated.

.. note::

    Although DE can be performed on RNA-seq samples only, **Ribotools** is most useful when dealing with Ribo-seq data. To use **Ribotools** for RNA-seq only, replace the relevant keys and use the appropriate options.


.. _prep_tables_de:

How to prepare the sample table for DE
--------------------------------------

The following keys are required (configuration file):

* ``contrasts`` A dictionary of *key: value* pairs, where *key* is the name of the contrast to be tested and *value* contains 2 items: the first is the condition to be tested against the second (the reference).
* ``dea_data`` The base output location for all created files.

For example:

.. code-block:: yaml

    riboseq_samples:
     d01-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-1.fastq.gz
     d01-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-2.fastq.gz
     d05-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-1.fastq.gz
     d05-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-2.fastq.gz

    riboseq_sample_name_map:
     d01-1: Ribo-d1-1
     d01-2: Ribo-d1-2
     d05-1: Ribo-d5-1
     d05-2: Ribo-d5-2

    dea_data: /path/to/dea-results

    # second is always reference level - here "d1"
    contrasts:
     d5_vs_d1:
      - d5
      - d1
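
The structure of ``contrasts`` can be checked before running anything. The following is a minimal sketch, not part of **Ribotools**: ``check_contrasts`` is a hypothetical helper, the configuration is copied by hand into a dictionary, and a plain substring match between condition and sample name is an assumption about how conditions are assigned.

.. code-block:: python

    # Hypothetical in-memory copy of the configuration keys shown above.
    config = {
        "riboseq_sample_name_map": {
            "d01-1": "Ribo-d1-1",
            "d01-2": "Ribo-d1-2",
            "d05-1": "Ribo-d5-1",
            "d05-2": "Ribo-d5-2",
        },
        "dea_data": "/path/to/dea-results",
        "contrasts": {"d5_vs_d1": ["d5", "d1"]},
    }


    def check_contrasts(config):
        """Return a list of problems found in the 'contrasts' key."""
        problems = []
        names = list(config.get("riboseq_sample_name_map", {}).values())
        for contrast, levels in config.get("contrasts", {}).items():
            # Each contrast needs exactly 2 items: [tested, reference].
            if len(levels) != 2:
                problems.append(f"{contrast}: expected 2 items, got {len(levels)}")
                continue
            for level in levels:
                # Assumption: a condition matches if it occurs in a sample name.
                if not any(level in name for name in names):
                    problems.append(f"{contrast}: no sample name contains '{level}'")
        return problems


    # An empty list means that no problem was found.
    print(check_contrasts(config))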

HTSeq workflow
^^^^^^^^^^^^^^

After ``run-htseq-workflow``, count tables are available under *<riboseq_data>/count-tables* (or *<rnaseq_data>/count-tables*), see :ref:`ribotools_usage`. To create the sample table:

.. code-block:: bash

    get-sample-table [--help] [--ribo] [--rna] --dea config

This creates a file named *sample-table<-project_name>.csv*, where ``project_name`` is the value of that key in the configuration file (the suffix is omitted if the key is not present). Each row describes one sample: the first column is the sample name, the second is the file path to the count table generated by ``htseq-count``, and the remaining columns are metadata, for example:


.. csv-table:: sample-table
   :header: sampleName,fileName,assay,condition

    Ribo-d1-1,/path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv,ribo,d1
    Ribo-d1-2,/path/to/riboseq-results/count-tables/Ribo-d1-2-unique.length-29-30.tsv,ribo,d1
    Ribo-d5-1,/path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv,ribo,d5
    Ribo-d5-2,/path/to/riboseq-results/count-tables/Ribo-d5-2-unique.length-29-30-31.tsv,ribo,d5


.. important::

   Values from ``riboseq_samples`` (or ``rnaseq_samples``) are used as a replacement for ``riboseq_sample_name_map`` (or ``rnaseq_sample_name_map``).
   Whether you use *samples* or *sample_name_map*, we strongly recommend assigning meaningful names. The conditions are assigned using
   the ``contrasts`` values for each contrast, and these values must match either the *samples* or the *sample_name_map* values (no regex).
   If a match is not possible, entries will be NA.

   If *count-tables* or files are missing, or if values from *samples* or *sample_name_map* do not match those used to assign the file names,
   the fileName column will be missing. If you did not run ``run-htseq-workflow``, this is fine; otherwise, check your configuration file
   and make sure the workflow completed successfully.

   If you have batches, you should add a column to this file, and the header must be named *batch*. The assay column can be ignored, but
   if present, it must contain only one value, *e.g.* ribo. In all cases, before proceeding further, always proofread this file!
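
Part of this proofreading can be automated. The following is a minimal sketch using only the Python standard library; ``proofread_sample_table`` is a hypothetical helper, not a **Ribotools** command, and the shortened paths in the example are placeholders.

.. code-block:: python

    import csv
    import io


    def proofread_sample_table(handle):
        """Return a list of warnings for a sample table opened as text."""
        reader = csv.DictReader(handle)
        rows = list(reader)
        header = reader.fieldnames or []
        warnings = []
        # sampleName and condition are the minimal required columns.
        for column in ("sampleName", "condition"):
            if column not in header:
                warnings.append(f"missing column: {column}")
        for row in rows:
            name = row.get("sampleName", "?")
            # Conditions that could not be matched are written as NA.
            if row.get("condition") in (None, "", "NA"):
                warnings.append(f"{name}: condition is NA")
            if "fileName" in header and row.get("fileName") in (None, "", "NA"):
                warnings.append(f"{name}: fileName is missing")
        # If present, the assay column must contain only one value.
        if "assay" in header:
            assays = {row.get("assay") or "" for row in rows}
            if len(assays) > 1:
                warnings.append(f"assay column has several values: {sorted(assays)}")
        return warnings


    # Placeholder table with one unmatched condition.
    example = io.StringIO(
        "sampleName,fileName,assay,condition\n"
        "Ribo-d1-1,/path/to/Ribo-d1-1.tsv,ribo,d1\n"
        "Ribo-d5-1,/path/to/Ribo-d5-1.tsv,ribo,NA\n"
    )
    print(proofread_sample_table(example))

For a file on disk, pass ``open("sample-table.csv", newline="")`` instead of the ``io.StringIO`` object. Such a check does not replace reading the file.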

.. _prep_tables_de_general:

General workflow
^^^^^^^^^^^^^^^^

To estimate DE with data prepared from a different workflow, the sample table must conform to the **Ribotools** specs (which derive from **DESeq2**). In its current format, it must have, minimally, the header ``sampleName,condition``, in this order (see above). The ``condition`` values must match the list of ``contrasts`` from the config. The format should be CSV.

The count table must include integer counts for RPFs (or RNA abundance), and column names (samples) must match ``sampleName`` from the sample table. The first column must be feature ids or symbols.

The configuration must additionally include the following keys:

* ``sample_table`` The path to a sample table.
* ``count_table`` The path to a count table.
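
The requirements on the two tables can be cross-checked before running the analysis. The following is a minimal sketch, assuming a TAB-delimited count table whose header row names every column, including the first one; ``check_count_table`` and the feature ids are hypothetical.

.. code-block:: python

    import csv
    import io


    def check_count_table(sample_handle, count_handle, delim="\t"):
        """Return a list of problems found when comparing the two tables."""
        sample_names = [row["sampleName"] for row in csv.DictReader(sample_handle)]
        reader = csv.reader(count_handle, delimiter=delim)
        header = next(reader)
        # The first column holds feature ids, the others are samples.
        count_columns = header[1:]
        problems = []
        missing = [name for name in sample_names if name not in count_columns]
        if missing:
            problems.append(f"samples without a count column: {missing}")
        for fields in reader:
            for value in fields[1:]:
                # Counts must be non-negative integers.
                if not value.isdigit():
                    problems.append(f"{fields[0]}: non-integer count '{value}'")
        return problems


    # Placeholder tables for two samples and two features.
    samples = io.StringIO("sampleName,condition\nRibo-d1-1,d1\nRibo-d5-1,d5\n")
    counts = io.StringIO(
        "feature\tRibo-d1-1\tRibo-d5-1\n"
        "geneA\t10\t25\n"
        "geneB\t0\t3\n"
    )
    # An empty list means that no problem was found.
    print(check_count_table(samples, counts))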


.. _general_usage_de:

General usage
-------------

.. code-block:: bash

    run-dea [options] config

For all options, consult the API for :ref:`api_de`. See also :ref:`howto_config`. To estimate DE for Ribo-seq ORFs instead of genes, use ``--symbolCol``, ``--orfCol``, and/or ``--delim``, see :ref:`using_orfs_tede` for details.

.. note::

   The count tables can be TAB- or CSV-formatted. The default ``--delim`` option is TAB. Anything else will fall back to white space (one or
   more spaces, tabs, newlines or carriage returns). The sample table must be CSV-formatted.

.. tip::

    To run the program in the background and redirect the output to a log file, use ``run-dea [options] config > log.out 2>&1 &``

Required input
^^^^^^^^^^^^^^

Output (count tables) from ``run-htseq-workflow``, see :ref:`running_htseq_workflow`, and a sample table, see :ref:`prep_tables_de`.
Alternatively, existing sample and count tables can be given via the configuration file.

Output files
^^^^^^^^^^^^

Output files are written to *<dea_data>/<contrasts>*, where ``dea_data`` is the path given in the configuration file and ``contrasts`` are the names given to the contrasts in the configuration file.
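
The output locations can therefore be derived from the configuration alone. The following is a minimal sketch; ``expected_output_dirs`` is a hypothetical helper, and the configuration values are copied by hand from the example in :ref:`prep_tables_de`.

.. code-block:: python

    from pathlib import PurePosixPath

    # Hypothetical in-memory copy of the relevant configuration keys.
    config = {
        "dea_data": "/path/to/dea-results",
        "contrasts": {"d5_vs_d1": ["d5", "d1"]},
    }


    def expected_output_dirs(config):
        """Return one output directory per contrast name."""
        base = PurePosixPath(config["dea_data"])
        return [str(base / name) for name in config["contrasts"]]


    print(expected_output_dirs(config))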