# RNA-seq Analysis Walkthrough
This tutorial walks through a complete bulk RNA-seq analysis pipeline — from raw FASTQ reads to a count matrix — using oxo-call to generate every command. This mirrors a real-world analysis workflow.
**Time to complete:** 30–45 minutes (reading) + compute time
**Prerequisites:** oxo-call configured, tools installed: `fastp`, `STAR`, `featureCounts` (or `salmon`)
**You will learn:** end-to-end pipeline construction, workflow integration, skill-driven accuracy
---
## RNA-seq Pipeline Overview
```
Raw FASTQ reads
│
▼ Quality Control (fastp)
Trimmed FASTQ + QC report
│
▼ Alignment (STAR)
Aligned BAM (coordinate-sorted)
│
▼ QC Aggregation (MultiQC)
Combined QC report
│
▼ Quantification (featureCounts)
Count matrix (gene × sample)
```
We will use oxo-call at each step. The commands shown use example filenames — adapt them to your data.
---
## Setup
### Sample data assumptions
This tutorial assumes:
- Paired-end RNA-seq data: `sample1_R1.fastq.gz`, `sample1_R2.fastq.gz`
- STAR genome index at: `/data/star_hg38/`
- GTF annotation at: `/data/gencode.v44.gtf`
### Verify tools and skills
```bash
# Check tools are installed
fastp --version
STAR --version
featureCounts -v
# Check built-in skills
oxo-call skill list | grep -E "fastp|star|featurecounts"
# fastp ✓ built-in
# star ✓ built-in
# featurecounts ✓ built-in
```
### Add remote documentation (optional but recommended)
```bash
oxo-call docs add fastp --url https://github.com/OpenGene/fastp#usage
oxo-call docs add star --url https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
```
---
## Step 1: Quality Control with fastp
fastp trims adapters, removes low-quality reads, and generates a QC report in one step.
### Dry-run first
```bash
oxo-call dry-run fastp \
"trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"
```
Expected:
```
Command: fastp \
--in1 sample1_R1.fastq.gz --in2 sample1_R2.fastq.gz \
--out1 trimmed/sample1_R1.fastq.gz --out2 trimmed/sample1_R2.fastq.gz \
--thread 4 \
--html qc/sample1.html --json qc/sample1.json
Explanation: fastp auto-detects adapters; --thread 4 for parallelism; HTML/JSON for MultiQC compatibility.
```
### Execute
```bash
mkdir -p trimmed qc
oxo-call run fastp \
"trim adapters from paired-end reads sample1_R1.fastq.gz and sample1_R2.fastq.gz, \
output trimmed reads to trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz, \
use 4 threads, generate HTML report at qc/sample1.html and JSON at qc/sample1.json"
```
> **Why no `--adapter_sequence`?**
> The fastp skill includes the concept: *"fastp auto-detects adapters — do not hardcode adapter sequences unless they are being missed."* This is a common mistake when migrating from trimmomatic.
---
## Step 2: Alignment with STAR
STAR is the gold-standard splice-aware aligner for RNA-seq. It requires a pre-built genome index.
### Build the index (if you have not already)
```bash
oxo-call dry-run STAR \
"build a genome index for hg38, genome fasta at /data/hg38.fa, \
GTF at /data/gencode.v44.gtf, output to /data/star_hg38, use 8 threads, \
sjdbOverhang 100 (read length minus 1)"
```
Expected:
```
Command: STAR \
--runMode genomeGenerate \
--genomeDir /data/star_hg38 \
--genomeFastaFiles /data/hg38.fa \
--sjdbGTFfile /data/gencode.v44.gtf \
--sjdbOverhang 100 \
--runThreadN 8
```
### Align trimmed reads
```bash
oxo-call dry-run STAR \
"align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
to genome index at /data/star_hg38, output to aligned/sample1/, \
sort BAM by coordinate, use 8 threads, generate a BAM file"
```
Expected:
```
Command: STAR \
--genomeDir /data/star_hg38 \
--readFilesIn trimmed/sample1_R1.fastq.gz trimmed/sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix aligned/sample1/ \
--runThreadN 8
Explanation: --readFilesCommand zcat handles .gz files; SortedByCoordinate produces a ready-to-use BAM.
```
> **The `--readFilesCommand zcat` pitfall:**
> Forgetting this flag when using `.gz` input is one of the most common STAR mistakes. The STAR skill includes it as a pitfall: *"Always add `--readFilesCommand zcat` for compressed FASTQ."*
Execute:
```bash
mkdir -p aligned/sample1
oxo-call run STAR \
"align trimmed paired-end reads trimmed/sample1_R1.fastq.gz and trimmed/sample1_R2.fastq.gz \
to genome index at /data/star_hg38, output to aligned/sample1/, \
sort BAM by coordinate, use 8 threads, generate a BAM file"
```
> **STAR Two-Pass Mode:**
> For novel splice junction discovery (e.g., tumor RNA-seq, rare transcripts), consider using STAR's two-pass mode. In the first pass, STAR discovers splice junctions; in the second pass, it re-maps reads using the discovered junctions for improved sensitivity. Add `--twopassMode Basic` to your alignment task description. Note that two-pass mode increases runtime and memory usage. For standard differential expression analyses with well-annotated genomes (e.g., human, mouse), one-pass mode with a comprehensive GTF annotation is usually sufficient.
### Index the BAM
```bash
oxo-call run samtools "index aligned/sample1/Aligned.sortedByCoord.out.bam"
```
---
## Step 3: Aggregate QC with MultiQC
MultiQC reads all QC outputs and creates a single interactive report. It should be run after all samples are processed.
```bash
oxo-call dry-run multiqc \
"aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"
```
Expected:
```
Command: multiqc qc/ aligned/ -o results/multiqc/
Explanation: multiqc automatically discovers fastp JSON reports and STAR Log.final.out files in the specified directories.
```
```bash
mkdir -p results/multiqc
oxo-call run multiqc \
"aggregate QC reports from qc/ and STAR log files from aligned/ into a single report at results/multiqc/"
```
Open `results/multiqc/multiqc_report.html` in a browser to review:
- Per-sample read quality before/after trimming
- Alignment rates
- Duplication rates
- Any samples that need attention
---
## Step 4: Quantification with featureCounts
featureCounts counts reads per gene. It is part of the `subread` package.
```bash
oxo-call dry-run featureCounts \
"count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
using GTF at /data/gencode.v44.gtf, for paired-end data, \
use 4 threads, output to results/counts.txt"
```
Expected:
```
Command: featureCounts \
-a /data/gencode.v44.gtf \
-o results/counts.txt \
-p \
-T 4 \
aligned/sample1/Aligned.sortedByCoord.out.bam
Explanation: -p specifies paired-end mode; -a is the GTF annotation; -T sets threads.
```
> **Using Salmon for quantification instead?**
> featureCounts works on aligned BAMs (alignment-based). Salmon uses pseudo-alignment directly on FASTQ files (alignment-free) and is faster for many use cases. Try:
> `oxo-call dry-run salmon "quantify sample1_R1.fastq.gz and sample1_R2.fastq.gz against index at /data/salmon_hg38_index, output to quant/sample1/"`
Execute:
```bash
mkdir -p results
oxo-call run featureCounts \
"count reads per gene in aligned/sample1/Aligned.sortedByCoord.out.bam \
using GTF at /data/gencode.v44.gtf, for paired-end data, \
use 4 threads, output to results/counts.txt"
```
---
## Step 5: Running Multiple Samples
For a real experiment you will have many samples. Instead of repeating commands manually, use the workflow engine:
```bash
# View the built-in RNA-seq template
oxo-call workflow show rnaseq
# Dry-run the complete pipeline
oxo-call workflow dry-run rnaseq
```
To customize for your samples and paths, see the [Workflow Builder tutorial](./workflow-builder.md).
---
## Checking the Full Run History
```bash
oxo-call history list | head -20
```
This shows every command in the order it was executed, with exit codes, timestamps, and provenance metadata.
---
## What You Learned
- How to run a complete RNA-seq pipeline step-by-step with oxo-call
- How built-in skills prevent common mistakes (`--readFilesCommand zcat`, `-p` for paired-end)
- How to add remote documentation for richer LLM context
- How MultiQC integrates naturally into the oxo-call workflow
- Where to go next: the workflow engine for multi-sample automation
**Next steps:**
- [Workflow Builder tutorial](./workflow-builder.md) — automate this for multiple samples
- [featureCounts command reference](../commands/run.md) — full options
- [Skill System reference](../reference/skill-system.md) — how skills improve accuracy