How to estimate abundance¶

See General usage for a short description of required input and expected output. Output from Flexbar, Bowtie2, and STAR are written in FASTQ or BAM formats. If you want to QC your Ribo-seq data as part of the pipeline, see Ribo-seq quality control.

Attention

All Ribo- and/or RNA-seq samples in the configuration file must be from the same organism and use the same genome_base_path, star_index, ribosomal_index, etc. Samples from different organisms or using different annotations must be “split” into different configuration files, and run separately.

Important

Ribo-seq abundance is estimated from ribosome-protected RNA fragments (RPFs) using periodic reads only, unless the --skip-periodicity-estimation option is used. Ribotools does not currently handle paired-end RNA-seq. In the config file, specify only one pair (typically the first) for rnaseq_samples. But you should specify whether the data is from a strand-specific assay for counting reads, see below for further details. Use the option --trim-rna-to-max-fragment-size to trim RNA-seq reads to the maximum periodic fragment length of a matched Ribo-seq sample, to minimize mapping bias. Matching samples are specified in the config file via the matching_samples key.

Tip

If some Ribo-seq samples fail to pass the periodicity estimation quality control, you can still recover matching RNA-seq samples, provided there is enough replicates to run the TE analysis. Simply replace the corresponding Ribo-seq sample name for the matching RNA-seq sample (value from matching_samples) with an integer e.g. !!int 31. The value will be used for --trim-rna-to-max-fragment-size, and can correspond to the mean or median periodic fragment size across all available libraries. Alternatively, you can use fixed lengths and offsets, although this is not recommended.

General usage¶

Ribotools can be called for Ribo-seq only with the ribo option, RNA-seq only with the rna option, or for both Ribo- and RNA-seq at once, using the ribo --run-all option.

# Ribo-seq only (estimate periodicity and generate count tables)
# e.g. stranded protocol (default is "no", i.e. not from a strand-specific protocol)
run-htseq-workflow ribo --htseq-options "--stranded yes" [options] config

# RNA-seq only (generate count tables)
# e.g. pair 1 (reverse protocol) for RNA
# with matching Ribo-seq data use [--trim-rna-to-max-fragment-size] [--ribo-config RIBO_CONFIG]
run-htseq-workflow rna --htseq-options "--stranded reverse" [options] config

# Ribo- and RNA-seq at once, i.e. one after the other
# e.g. trim RNA post adapter removal using max fragment size from matching Ribo-seq samples
run-htseq-workflow ribo --run-all --trim-rna-to-max-fragment-size --ribo-config CONFIG --rna-config CONFIG --htseq-options "--stranded yes" --rna-stranded reverse [options] config

For all options, consult the API for run-htseq-workflow. Even if you use the same configuration file for Ribo- and RNA-seq, you may have to pass it multiple times depending on the selected options, e.g. via --ribo-config and/or --rna-config, in addition to the required positional argument config. See also How to add a config file.

Caution

If using Ribotools with a de novo assembly generated with Rp-Bp, specifying --htseq-options --type=exon, or type other than CDS can have unexpected results! This is because the GTF file created under genome_base_path is a concatenation of gtf and de_novo_gtf, and possibly contains repeated features (see More about de novo ORF discovery). For mapping this is not a problem. For abundance estimation, however, this can be problematic. Unless this GTF file is manually curated, only CDS features should be used (default).

Hint

To estimate abundance for Ribo-seq ORFs instead of genes, you need to prepare a GTF file with Ribo-seq ORFs before, and then run-htseq-workflow with additional options, see How to use Ribo-seq ORFs.

Tip

Use --star-options "--quantMode GeneCounts" to get count tables. You can check counts for unstranded data (column 2), counts for the 1st read strand (column 3, htseq-count -s yes), and counts for the 2nd read strand (column 4, htseq-count -s reverse). The stranded column (3 or 4) with the lowest N_noFeature count should correspond to the correct strand option.

Required input¶

All the input files are given in the configuration file.

Expected output¶

All output files follow the conventions described in Rp-Bp ORF profile construction. For RNA-seq, the same conventions and nomenclature are used. Count tables are written to <riboseq_data>/count-tables and <rnaseq_data>/count-tables.

Ribo-seq quality control¶

Create ORF profiles to facilitate QC! For this, you need the Rp-Bp index files, see How to prepare annotations. Run the pipeline with the --create-orf-profiles option. When completed, summarize the output and visualize your data

summarize-rpbp-profile-construction [options] config
rpbp-profile-construction-dashboard [options] --config CONFIG

See Visualization and QC for more information.

If you use this option, but some of the required index files are missing, the main program run-htseq-workflow will not fail, but you will not be able to run summarize-rpbp-profile-construction. Check the logs to find out wich input files were missing.