TSG - Transcript Segment Graph
TSG is a Rust library and command-line tool for creating, manipulating, and analyzing transcript segment graphs. It provides a comprehensive framework for modeling segmented transcript data, analyzing alternative splicing events, and working with genomic structural variants.
Features
- Parse and write TSG format files
- Build and manipulate transcript segment graphs
- Analyze paths and connectivity between transcript segments
- Support for various element types: nodes, edges, groups, and chains
- Export graphs to DOT format for visualization
- Traverse the graph to identify valid transcript paths
- Read identity tracking to ensure biological validity
- Build graphs from chains and validate path traversals
- Support for genomic coordinates with strand information
- Support for typed read evidence on nodes (SO, IN, SI)
Installation
Library
Add this to your Cargo.toml:
[dependencies]
= "0.1.0"
Command-line Tool
Install the CLI tool:
Library Usage
Loading a TSG file
use TSGraph;
use std::path::Path;
Creating a Graph Programmatically
use ;
use BString;
Building a Graph from Chains
use ;
use std::collections::HashMap;
Finding Valid Paths Through the Graph
use TSGraph;
CLI Usage
The TSG command-line tool provides a convenient interface for common operations:
# Display help
# Parse and validate a TSG file
# Convert a TSG file to DOT format for visualization
# Extract statistics from a TSG file
# Find all paths through the graph
TSG File Format
The TSG format is a tab-delimited text format representing transcript assemblies as graphs.
Record Types
Each line in a TSG file starts with a letter denoting the record type:
- H: Header information
- N: Node definition (exon or transcript segment)
- E: Edge definition (splice junction or structural variant)
- U: Unordered group (set of elements)
- O: Ordered group (path through the graph)
- C: Chain (alternating nodes and edges)
- A: Attribute for any element (metadata)
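As a rough illustration of how a reader might dispatch on the leading letter, here is a minimal sketch. The `RecordType` enum and `record_type` function are hypothetical names, not part of the TSG library API, and treating `#` lines as comments is an assumption based on the example file in this README.

```rust
/// Record kinds named in the list above (hypothetical enum, not the library's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecordType {
    Header,
    Node,
    Edge,
    UnorderedGroup,
    OrderedGroup,
    Chain,
    Attribute,
}

/// Classify one line by its first whitespace-delimited field.
/// Returns `None` for blank lines, `#` comment lines (an assumption),
/// and any unrecognized leading letter.
pub fn record_type(line: &str) -> Option<RecordType> {
    let first = line.split_whitespace().next()?;
    match first {
        "H" => Some(RecordType::Header),
        "N" => Some(RecordType::Node),
        "E" => Some(RecordType::Edge),
        "U" => Some(RecordType::UnorderedGroup),
        "O" => Some(RecordType::OrderedGroup),
        "C" => Some(RecordType::Chain),
        "A" => Some(RecordType::Attribute),
        _ => None,
    }
}
```

For example, `record_type("C\tchain1\tn1\te1\tn2")` yields `Some(RecordType::Chain)`.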
Conceptual Model
In the TSG model:
- Chains (C) are used to build the graph structure. They define the nodes and edges that make up the graph.
- Paths (O) are traversals through the constructed graph.
- The complete TSG is built by combining all nodes and edges from all chains.
- After constructing the graph from chains, paths can be defined to represent ways of traversing the graph.
This distinction is important: chains define what the graph is, while paths define ways to traverse the graph.
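The chain/path distinction can be sketched in a few lines of standalone Rust. This is not the library's implementation: `SketchGraph`, `build_from_chains`, and `is_valid_path` are hypothetical names, and the sketch assumes that an edge in a chain connects the node before it to the node after it, and that the trailing `+`/`-` on path elements is an orientation marker that can be ignored when checking connectivity.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical in-memory graph assembled from chains.
#[derive(Debug, Default)]
pub struct SketchGraph {
    pub nodes: BTreeSet<String>,
    /// edge id -> (source node id, sink node id)
    pub edges: HashMap<String, (String, String)>,
}

/// Chains define the graph: every node and edge from every chain is added.
/// A chain must alternate node, edge, node, ... and start and end with a node.
pub fn build_from_chains(chains: &[Vec<&str>]) -> Result<SketchGraph, String> {
    let mut g = SketchGraph::default();
    for chain in chains {
        if chain.len() % 2 == 0 {
            return Err(format!("chain must start and end with a node: {:?}", chain));
        }
        for (i, id) in chain.iter().enumerate() {
            if i % 2 == 0 {
                g.nodes.insert(id.to_string());
            } else {
                // Assumption: the edge links its neighbors within the chain.
                let ends = (chain[i - 1].to_string(), chain[i + 1].to_string());
                g.edges.insert(id.to_string(), ends);
            }
        }
    }
    Ok(g)
}

/// Paths traverse the graph: each element must already exist, and each
/// edge must connect the node before it to the node after it.
pub fn is_valid_path(g: &SketchGraph, path: &[&str]) -> bool {
    // Strip the orientation suffix (assumed to be a trailing '+' or '-').
    let ids: Vec<&str> = path
        .iter()
        .map(|s| s.trim_end_matches(|c: char| c == '+' || c == '-'))
        .collect();
    if ids.len() % 2 == 0 {
        return false;
    }
    ids.iter().enumerate().all(|(i, id)| {
        if i % 2 == 0 {
            g.nodes.contains(*id)
        } else {
            match g.edges.get(*id) {
                Some((src, dst)) => src.as_str() == ids[i - 1] && dst.as_str() == ids[i + 1],
                None => false,
            }
        }
    })
}
```

With a graph built from the single chain `n1 e1 n2 e2 n3`, the path `n1+ e1+ n2+ e2+ n3+` passes the check, while `n1+ e2+ n3+` does not, because `e2` does not leave `n1`.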
Example
# Header information
H TSG 1.0
H reference GRCh38
# Nodes (exons)
N n1 chr1:+:1000-1200,1500-1700 read1:SO,read2:SO ACGTACGT
N n2 chr1:+:2000-2200 read4:SO,read5:SO TGCATGCA
N n3 chr1:+:2500-2700 read1:IN,read2:IN,read3:IN,read4:IN CTGACTGA
# Edges (splice junctions)
E e1 n1 n2 chr1,chr1,1700,2000,splice
E e2 n2 n3 chr1,chr1,2200,2500,splice
# Chains (building the graph)
C chain1 n1 e1 n2 e2 n3
# Paths (traversals)
O transcript1 n1+ e1+ n2+ e2+ n3+
# Sets (grouping elements)
U exon_set n1 n2 n3
# Attributes (metadata)
A N n1 expression:f:10.5
A O transcript1 tpm:f:8.2
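To give a feel for the DOT export mentioned under Features, the sketch below renders the nodes and edges of the example above as a Graphviz digraph. It is an illustration only: `to_dot` is a hypothetical helper, the output layout is an assumption rather than what the TSG tool emits, and identifiers are not escaped.

```rust
/// Render node ids and (edge id, source id, sink id) triples as Graphviz DOT.
/// Hypothetical helper; ids are assumed to contain no double quotes.
pub fn to_dot(nodes: &[&str], edges: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("digraph tsg {\n");
    for n in nodes {
        out.push_str(&format!("  \"{}\";\n", n));
    }
    for (id, src, dst) in edges {
        // The edge id is shown as the label on the arrow.
        out.push_str(&format!("  \"{}\" -> \"{}\" [label=\"{}\"];\n", src, dst, id));
    }
    out.push_str("}\n");
    out
}
```

Calling `to_dot(&["n1", "n2", "n3"], &[("e1", "n1", "n2"), ("e2", "n2", "n3")])` produces text that can be saved to a `.dot` file and rendered with Graphviz.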
Node Format
Nodes represent exons or transcript segments with the format:
N <id> <genomic_location> <reads> [<seq>]
Where:
- genomic_location is in the format chromosome:strand:coordinates (e.g., chr1:+:1000-1200,1500-1700)
- reads is a comma-separated list of read IDs with types (e.g., read1:SO,read2:IN)
- Read types include:
  - SO: Source Node
  - IN: Intermediary Node
  - SI: Sink Node
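The two node fields described above can be parsed with the standard library alone. The following is a minimal sketch, not the library's parser: `Location`, `ReadType`, `parse_location`, and `parse_reads` are hypothetical names, and the sketch assumes that the strand is either `+` or `-`, that coordinates are unsigned integers, and that any read type other than SO, IN, or SI should be rejected.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ReadType {
    Source,       // SO
    Intermediary, // IN
    Sink,         // SI
}

#[derive(Debug, PartialEq, Eq)]
pub struct Location {
    pub chromosome: String,
    pub strand: char,
    /// (start, end) pairs in the order they appear in the field.
    pub intervals: Vec<(u64, u64)>,
}

/// Parse `chromosome:strand:start-end[,start-end...]`.
pub fn parse_location(s: &str) -> Option<Location> {
    let mut parts = s.splitn(3, ':');
    let chromosome = parts.next()?.to_string();
    let strand = match parts.next()? {
        "+" => '+',
        "-" => '-',
        _ => return None, // assumption: only '+' and '-' are accepted
    };
    let intervals = parts
        .next()?
        .split(',')
        .map(|iv| {
            let (start, end) = iv.split_once('-')?;
            Some((start.parse::<u64>().ok()?, end.parse::<u64>().ok()?))
        })
        .collect::<Option<Vec<(u64, u64)>>>()?;
    Some(Location { chromosome, strand, intervals })
}

/// Parse `read_id:TYPE[,read_id:TYPE...]`.
pub fn parse_reads(s: &str) -> Option<Vec<(String, ReadType)>> {
    s.split(',')
        .map(|r| {
            // Split on the last ':' so read ids may themselves contain ':'.
            let (id, ty) = r.rsplit_once(':')?;
            let ty = match ty {
                "SO" => ReadType::Source,
                "IN" => ReadType::Intermediary,
                "SI" => ReadType::Sink,
                _ => return None, // assumption: unknown types are errors
            };
            Some((id.to_string(), ty))
        })
        .collect()
}
```

Returning `Option` keeps the sketch short; a real parser would more likely report which field failed and why.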
Edge Format
Edges represent splice junctions or structural variants:
E <id> <source_id> <sink_id> <SV>
Where:
- SV is in the format reference_name1,reference_name2,breakpoint1,breakpoint2,sv_type
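A sketch of splitting the SV field into its five components follows. `StructuralVariant` and `parse_sv` are hypothetical names rather than the library's API, and the sketch assumes that both breakpoints are unsigned integers and that the field always has exactly five comma-separated parts.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct StructuralVariant {
    pub reference_name1: String,
    pub reference_name2: String,
    pub breakpoint1: u64,
    pub breakpoint2: u64,
    pub sv_type: String,
}

/// Parse `reference_name1,reference_name2,breakpoint1,breakpoint2,sv_type`.
/// Returns `None` if the field count is wrong or a breakpoint is not a number.
pub fn parse_sv(s: &str) -> Option<StructuralVariant> {
    let f: Vec<&str> = s.split(',').collect();
    if f.len() != 5 {
        return None;
    }
    Some(StructuralVariant {
        reference_name1: f[0].to_string(),
        reference_name2: f[1].to_string(),
        breakpoint1: f[2].parse::<u64>().ok()?,
        breakpoint2: f[3].parse::<u64>().ok()?,
        sv_type: f[4].to_string(),
    })
}
```

Applied to the `e1` edge from the example, `parse_sv("chr1,chr1,1700,2000,splice")` gives breakpoints 1700 and 2000 with type `splice`.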
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.