TSG - Transcript Segment Graph
TSG is a Rust library and command-line tool for creating, manipulating, and analyzing transcript segment graphs. It provides a comprehensive framework for modeling segmented transcript data, analyzing non-linear splicing events, and working with genomic structural variants.
Features
- Parse and write TSG format files
- Build and manipulate transcript segment graphs
- Support for multiple graphs within a single file
- Analyze paths and connectivity between transcript segments
- Support for various element types: nodes, edges, groups, and chains
- Export graphs to DOT format for visualization
- Traverse the graph to identify valid transcript paths
- Read identity tracking to ensure biological validity
- Build graphs from chains and validate path traversals
- Support for genomic coordinates with strand information
- Support for typed read evidence attached to nodes (for example, read1:SO)
- Inter-graph links for fusion events and other cross-graph relationships
Installation
Library
Add this to your Cargo.toml:
[]
= "0.1.0"
Command-line Tool
Install the CLI tool:
Library Usage
Loading a TSG file
use TSGraph;
use Path;
Working with Multiple Graphs
use ;
use BString;
Building Graphs from Chains
use ;
use HashMap;
Finding Valid Paths Through Specific Graphs
use TSGraph;
CLI Usage
The TSG command-line tool provides a convenient interface for common operations:
# Display help
# Parse and validate a TSG file
# List all graphs in a TSG file
# Convert a specific graph to DOT format for visualization
# Extract statistics from a TSG file
# Find all paths through a specific graph
# Find all inter-graph links
TSG File Format
The TSG format is a tab-delimited text format representing transcript assemblies as graphs. It supports multiple independent graphs within a single file.
Multi-Graph Support
TSG supports multiple graphs within a single file using a graph namespace approach. Each element in the file can be associated with a specific graph using a graph ID prefix:
graph_id:element_id
For example, gene_a:n1 refers to node n1 in the graph identified as "gene_a".
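As a rough illustration of the prefix convention, a qualified ID can be split at its first colon. This is a minimal sketch and not the library's API: `QualifiedId` and `split_qualified_id` are hypothetical names, and treating an ID without a colon as unqualified is an assumption. Orientation suffixes such as the trailing `+` used in path records are not handled here.

```rust
/// A parsed element reference (hypothetical type, for illustration only).
/// `graph` is `None` when the ID carries no graph prefix.
#[derive(Debug, PartialEq, Eq)]
pub struct QualifiedId<'a> {
    pub graph: Option<&'a str>,
    pub element: &'a str,
}

/// Split `graph_id:element_id` at the first colon.
/// Assumption: an ID without a colon is treated as unqualified.
pub fn split_qualified_id(raw: &str) -> QualifiedId<'_> {
    match raw.split_once(':') {
        Some((graph, element)) => QualifiedId {
            graph: Some(graph),
            element,
        },
        None => QualifiedId {
            graph: None,
            element: raw,
        },
    }
}
```

Calling `split_qualified_id("gene_a:n1")` yields graph `gene_a` and element `n1`; an ID such as `fusion1` comes back with no graph.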
Record Types
Each line in a TSG file starts with a letter denoting the record type:
- H - Header information (including graph definitions)
- N - Node definition (exon or transcript segment)
- E - Edge definition (splice junction or structural variant)
- U - Unordered group (set of elements)
- P - Path (ordered traversal through the graph)
- C - Chain (alternating nodes and edges)
- A - Attribute for any element (metadata)
- L - Inter-graph link (connections between different graphs)
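The record letters above map naturally onto an enum. The sketch below classifies a line by its first whitespace-separated token; `RecordType` and `from_line` are hypothetical names chosen for this example, not necessarily what the crate exposes. Comment lines (starting with `#`) and blank lines are assumed to carry no record.

```rust
/// Record kinds listed in the TSG format description (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecordType {
    Header,
    Node,
    Edge,
    UnorderedGroup,
    Path,
    Chain,
    Attribute,
    Link,
}

impl RecordType {
    /// Classify a line by its leading record letter.
    /// Returns `None` for blank lines, comments, and unknown letters.
    pub fn from_line(line: &str) -> Option<RecordType> {
        // The first token is the record letter; fields are tab- or space-separated.
        let tag = line.split_whitespace().next()?;
        match tag {
            "H" => Some(RecordType::Header),
            "N" => Some(RecordType::Node),
            "E" => Some(RecordType::Edge),
            "U" => Some(RecordType::UnorderedGroup),
            "P" => Some(RecordType::Path),
            "C" => Some(RecordType::Chain),
            "A" => Some(RecordType::Attribute),
            "L" => Some(RecordType::Link),
            _ => None,
        }
    }
}
```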
Conceptual Model
In the TSG model:
- Graphs represent independent transcript graphs, each with its own set of nodes and edges. They are declared in H header records (for example, H graph gene_a BRCA1 transcripts).
- Chains (C) are used to build each graph's structure.
- Paths (P) are traversals through the constructed graphs.
- Links (L) establish relationships between elements in different graphs.
This distinction is important: chains define what each graph is, paths define ways to traverse each graph, and links define relationships between graphs.
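The chain/path distinction can be sketched in a few lines: chains populate the graph, and a path is valid only if every step follows an edge that some chain introduced. This is an illustrative model under simplifying assumptions, not the crate's implementation. `ChainGraph`, `add_chain`, and `is_valid_path` are hypothetical names, IDs are plain strings, and orientation suffixes and read identity are ignored.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal graph built from chains (hypothetical type, for illustration).
#[derive(Debug, Default)]
pub struct ChainGraph {
    nodes: HashSet<String>,
    /// edge ID -> (source node ID, sink node ID)
    edges: HashMap<String, (String, String)>,
}

impl ChainGraph {
    /// Add a chain given as alternating IDs: node, edge, node, ...
    /// A chain must start and end with a node, so its length must be odd.
    pub fn add_chain(&mut self, elements: &[&str]) -> Result<(), String> {
        if elements.len() % 2 == 0 {
            return Err("chain must alternate node, edge, ..., node".to_string());
        }
        // Even positions hold nodes.
        for node in elements.iter().step_by(2) {
            self.nodes.insert(node.to_string());
        }
        // Every [node, edge, node] triple defines one edge.
        for w in elements.windows(3).step_by(2) {
            self.edges
                .insert(w[1].to_string(), (w[0].to_string(), w[2].to_string()));
        }
        Ok(())
    }

    /// A path is valid if it alternates node, edge, node, every node is known,
    /// and every edge connects exactly the nodes on either side of it.
    pub fn is_valid_path(&self, path: &[&str]) -> bool {
        if path.len() % 2 == 0 {
            return false;
        }
        let nodes_ok = path.iter().step_by(2).all(|n| self.nodes.contains(*n));
        let edges_ok = path.windows(3).step_by(2).all(|w| {
            self.edges
                .get(w[1])
                .map_or(false, |(src, dst)| src == w[0] && dst == w[2])
        });
        nodes_ok && edges_ok
    }
}
```

After adding the chain `n1 e1 n2 e2 n3`, the path `n1 e1 n2` is accepted, while `n1 e2 n3` is rejected because `e2` does not start at `n1`.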
Example with Multiple Graphs
# File header
H TSG 1.0
H reference GRCh38
# Graph definitions
H graph gene_a BRCA1 transcripts
H graph gene_b BRCA2 transcripts
# Nodes for gene_a
N gene_a:n1 chr17:+:41196312-41196402 read1:SO,read2:SO ACGTACGT
N gene_a:n2 chr17:+:41199660-41199720 read2:IN,read3:IN TGCATGCA
N gene_a:n3 chr17:+:41203080-41203134 read1:SI,read2:SI CTGACTGA
# Nodes for gene_b
N gene_b:n1 chr13:+:32315480-32315652 read4:SO,read5:SO GATTACA
N gene_b:n2 chr13:+:32316528-32316800 read4:IN,read5:IN TACGATCG
N gene_b:n3 chr13:+:32319077-32319325 read4:SI,read5:SI CGTACGTA
# Edges for gene_a
E gene_a:e1 gene_a:n1 gene_a:n2 chr17,chr17,41196402,41199660,splice
E gene_a:e2 gene_a:n2 gene_a:n3 chr17,chr17,41199720,41203080,splice
# Edges for gene_b
E gene_b:e1 gene_b:n1 gene_b:n2 chr13,chr13,32315652,32316528,splice
E gene_b:e2 gene_b:n2 gene_b:n3 chr13,chr13,32316800,32319077,splice
# Chains for gene_a
C gene_a:chain1 gene_a:n1 gene_a:e1 gene_a:n2 gene_a:e2 gene_a:n3
# Chains for gene_b
C gene_b:chain1 gene_b:n1 gene_b:e1 gene_b:n2 gene_b:e2 gene_b:n3
# Paths for gene_a
P gene_a:transcript1 gene_a:n1+ gene_a:e1+ gene_a:n2+ gene_a:e2+ gene_a:n3+
# Paths for gene_b
P gene_b:transcript1 gene_b:n1+ gene_b:e1+ gene_b:n2+ gene_b:e2+ gene_b:n3+
# Inter-graph link (e.g., for a fusion transcript)
L fusion1 gene_a:n3 gene_b:n1 fusion type:Z:chromosomal
# Attributes
A N gene_a:n1 expression:f:10.5
A P gene_a:transcript1 tpm:f:8.2
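To make the node fields in the example more concrete, the sketch below parses a location such as `chr17:+:41196312-41196402` and a read-evidence list such as `read1:SO,read2:SO`. The field layout (`chrom:strand:start-end` and comma-separated `read_id:type` pairs) is inferred from the example above rather than from a formal specification, and `GenomicLocation`, `parse_location`, and `parse_read_evidence` are hypothetical names.

```rust
/// A genomic interval with strand (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
pub struct GenomicLocation {
    pub chrom: String,
    pub strand: char,
    pub start: u64,
    pub end: u64,
}

/// Parse `chrom:strand:start-end`, e.g. `chr17:+:41196312-41196402`.
/// Returns `None` if any part is missing or malformed.
pub fn parse_location(s: &str) -> Option<GenomicLocation> {
    let mut parts = s.splitn(3, ':');
    let chrom = parts.next()?;
    let strand = match parts.next()? {
        "+" => '+',
        "-" => '-',
        _ => return None,
    };
    let (start, end) = parts.next()?.split_once('-')?;
    Some(GenomicLocation {
        chrom: chrom.to_string(),
        strand,
        start: start.parse().ok()?,
        end: end.parse().ok()?,
    })
}

/// Parse `read1:SO,read2:SO` into (read ID, evidence type) pairs.
/// Items without a colon are skipped (an assumption of this sketch).
pub fn parse_read_evidence(s: &str) -> Vec<(String, String)> {
    s.split(',')
        .filter_map(|item| item.split_once(':'))
        .map(|(id, kind)| (id.to_string(), kind.to_string()))
        .collect()
}
```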
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.