How to estimate DE

Differential expression (DE) analysis is a statistical method for detecting quantitative changes in expression levels between experimental groups. Genes are typically used as features, but if Ribo-seq ORFs are available, DE can be estimated for Ribo-seq ORFs instead of genes.

By default, Ribotools uses only Ribo-seq reads from periodic fragment lengths to determine which features (genes or Ribo-seq ORFs) are differentially regulated.

Note

Although DE can be performed on RNA-seq samples only, Ribotools is most useful when dealing with Ribo-seq data. If you want to use Ribotools for RNA-seq only, replace the necessary keys (for example, rnaseq_samples instead of riboseq_samples) and use the appropriate options (for example, --rna).

How to prepare the sample table for DE

The following keys are required (configuration file):

  • contrasts A dictionary of key: value pairs, where each key is the name of a contrast to be tested and each value contains 2 items: the condition to be tested, followed by the reference condition it is tested against.

  • dea_data The base output location for all created files.

For example

riboseq_samples:
 d01-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-1.fastq.gz
 d01-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d01-2.fastq.gz
 d05-1: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-1.fastq.gz
 d05-2: /path/to/hiPSC-CM.ribo.test-chr1.rep-d05-2.fastq.gz

riboseq_sample_name_map:
 d01-1: Ribo-d1-1
 d01-2: Ribo-d1-2
 d05-1: Ribo-d5-1
 d05-2: Ribo-d5-2

dea_data: /path/to/dea-results

# second is always reference level - here "d1"
contrasts:
 d5_vs_d1:
  - d5
  - d1
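The two required keys can be sanity-checked before running anything. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary; check_contrasts is a hypothetical helper for illustration and is not part of Ribotools.

```python
# Hypothetical helper (not part of Ribotools): checks the 'dea_data' and
# 'contrasts' keys of a configuration that was already loaded into a dict.
def check_contrasts(config):
    """Return a list of problems found; an empty list means no problem."""
    problems = []
    if not config.get("dea_data"):
        problems.append("missing key: dea_data")
    contrasts = config.get("contrasts")
    if not isinstance(contrasts, dict) or not contrasts:
        problems.append("missing or empty key: contrasts")
        return problems
    for name, levels in contrasts.items():
        # Each contrast holds 2 items: tested condition, then reference.
        if not isinstance(levels, (list, tuple)) or len(levels) != 2:
            problems.append(f"{name}: expected 2 items [tested, reference]")
        elif levels[0] == levels[1]:
            problems.append(f"{name}: tested and reference are identical")
    return problems


# Same values as in the example configuration above.
config = {
    "dea_data": "/path/to/dea-results",
    "contrasts": {"d5_vs_d1": ["d5", "d1"]},
}
print(check_contrasts(config))
```

An empty list means that both keys are present and that every contrast lists a tested condition followed by a reference.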

HTSeq workflow

After running run-htseq-workflow, count tables are available under <riboseq_data>/count-tables (or <rnaseq_data>/count-tables); see General usage. To create the sample table, run

get-sample-table [--help] [--ribo] [--rna] --dea config

This will create a file named sample-table<-project_name>.csv, where project_name is the value of that key in the configuration file (omitted if the key is not present). Each row describes one sample: the first column is the sample name, the second is the path to the count table generated by htseq-count, and the remaining columns are metadata. For example

sample-table

sampleName  fileName                                                                    assay  condition
Ribo-d1-1   /path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv  ribo   d1
Ribo-d1-2   /path/to/riboseq-results/count-tables/Ribo-d1-2-unique.length-29-30.tsv     ribo   d1
Ribo-d5-1   /path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv  ribo   d5
Ribo-d5-2   /path/to/riboseq-results/count-tables/Ribo-d5-2-unique.length-29-30-31.tsv  ribo   d5

Important

Values from riboseq_samples (or rnaseq_samples) are used as a replacement for riboseq_sample_name_map (or rnaseq_sample_name_map). Whether you use samples or sample_name_map, we strongly recommend assigning meaningful names. Conditions are assigned using the values listed under each contrast, and these values must match either the samples or the sample_name_map values (no regex). Where no match is possible, entries will be NA.

If count tables or files are missing, or if values from samples or sample_name_map do not match those used to assign the file names, the fileName column will be missing. If you did not run run-htseq-workflow, this is fine; otherwise, check your configuration file and make sure the workflow completed successfully.

If you have batches, add a column to this file with the header batch. The assay column can be ignored, but if present, it must contain only one value, e.g. ribo. In all cases, always proofread this file before proceeding further!
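Part of this proofreading can be automated. The following is a minimal sketch using only the Python standard library; proofread_sample_table is a hypothetical helper, not a Ribotools command, and it only covers the points mentioned above (unassigned conditions, a missing fileName column, more than one assay value).

```python
import csv
import io


# Hypothetical helper (not part of Ribotools): flags the issues described
# above in a sample table opened as a text handle.
def proofread_sample_table(handle, conditions):
    """Return a list of warnings for a CSV-formatted sample table."""
    rows = list(csv.DictReader(handle))
    warnings = []
    if rows and "fileName" not in rows[0]:
        warnings.append("fileName column is missing")
    for row in rows:
        # Any value that is not a known condition (e.g. NA) is reported.
        if row.get("condition") not in conditions:
            warnings.append(
                f"{row.get('sampleName')}: condition "
                f"{row.get('condition')!r} not in contrasts"
            )
    assays = {row["assay"] for row in rows if "assay" in row}
    if len(assays) > 1:
        warnings.append(f"assay column has several values: {sorted(assays)}")
    return warnings


# Illustrative table in which one condition could not be assigned.
table = """sampleName,fileName,assay,condition
Ribo-d1-1,/path/to/riboseq-results/count-tables/Ribo-d1-1-unique.length-29-30-31.tsv,ribo,d1
Ribo-d5-1,/path/to/riboseq-results/count-tables/Ribo-d5-1-unique.length-29-30-31.tsv,ribo,NA
"""
print(proofread_sample_table(io.StringIO(table), {"d5", "d1"}))
```

A check like this does not replace reading the file: it cannot tell whether a sample was assigned to the wrong condition.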

General workflow

To estimate DE with data prepared from a different workflow, the sample table must conform to the Ribotools specifications (which derive from DESeq2). In its current format, it must have, at a minimum, the header sampleName,condition, in this order (see above). The condition values must match the list of contrasts from the configuration file. The format must be CSV.

The count table must include integer counts for RPFs (or RNA abundance), and column names (samples) must match sampleName from the sample table. The first column must be feature ids or symbols.
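These requirements can be checked before calling run-dea. The following is a minimal sketch, assuming a TAB-delimited count table whose header row includes a label for the first (feature) column; check_count_table and the feature names geneA and geneB are hypothetical and only serve as illustration.

```python
import csv
import io


# Hypothetical helper (not part of Ribotools): checks that every sample has
# a column in the count table and that all counts are non-negative integers.
def check_count_table(count_handle, sample_names, delimiter="\t"):
    """Return a list of problems found in a delimited count table."""
    reader = csv.reader(count_handle, delimiter=delimiter)
    header = next(reader)
    columns = header[1:]  # assumption: first header cell labels the features
    problems = []
    missing = [name for name in sample_names if name not in columns]
    if missing:
        problems.append(f"samples missing from count table: {missing}")
    for row in reader:
        if not row:
            continue  # skip blank lines
        for name, value in zip(columns, row[1:]):
            if not value.isdigit():
                problems.append(
                    f"{row[0]}: non-integer count {value!r} for {name}"
                )
    return problems


# Illustrative count table with one non-integer value.
counts = "feature\tRibo-d1-1\tRibo-d5-1\ngeneA\t10\t25\ngeneB\t3.5\t0\n"
print(check_count_table(io.StringIO(counts), ["Ribo-d1-1", "Ribo-d5-1"]))
```

The sample names passed to the check would normally be read from the sampleName column of the sample table.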

The configuration must additionally include the following keys:

  • sample_table The path to a sample table.

  • count_table The path to a count table.

General usage

run-dea [options] config

For all options, consult the API for run-dea. See also How to add a config file. To estimate DE for Ribo-seq ORFs instead of genes, use --symbolCol, --orfCol, and/or --delim; see Estimate TE or DE using Ribo-seq ORFs for details.

Note

The count tables can be TAB- or CSV-formatted. The default --delim option is TAB. Anything else will fall back to white space (one or more spaces, tabs, newlines or carriage returns). The sample table must be CSV-formatted.

Tip

To run the program in the background and redirect the output to a log file, use run-dea [options] config > log.out 2>&1 &

Required input

Count tables produced by run-htseq-workflow (see How to estimate abundance) and a sample table (see How to prepare the sample table for DE). Alternatively, existing sample and count tables can be given via the configuration file.

Output files

Output files are written to <dea_data>/<contrasts>, where dea_data is the path given in the configuration file and contrasts are the names given to the contrasts in the configuration file.
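To locate the results programmatically, the output directories can be derived from the configuration. The following is a minimal sketch, assuming the configuration has been loaded into a Python dictionary; expected_output_dirs is a hypothetical helper, and the names of the files written inside each directory are not covered here.

```python
from pathlib import PurePosixPath


# Hypothetical helper (not part of Ribotools): lists one output directory
# per contrast, following the <dea_data>/<contrasts> layout described above.
def expected_output_dirs(config):
    base = PurePosixPath(config["dea_data"])
    return [str(base / name) for name in config["contrasts"]]


config = {
    "dea_data": "/path/to/dea-results",
    "contrasts": {"d5_vs_d1": ["d5", "d1"]},
}
print(expected_output_dirs(config))
```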