RIGHOR
Install Rust (potentially slow):
|
Install the library:
In the git folder:
How to use:
Fast generation:
# Create generation model (once only)
=
# Generate productive amino-acid sequence
= # False for unproductive
Inference:
# load the model
=
# define parameters for the alignment and the inference
=
=
# read the file line by line and align each sequence
=
=
=
Differences with IGoR:
- Uses a "dynamic programming" method: instead of summing over all events one by one, we first pre-compute sums over groups of events. This means it can run on sequences containing undefined nucleotides such as N (at least in theory; full support for these still needs to be added).
- The D gene alignment is less constrained
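
To illustrate the first point, here is a minimal sketch (not righor's actual code) of why pre-computing sums helps with undefined nucleotides. It assumes a toy first-order Markov model for a stretch of nucleotides; the `INITIAL` and `TRANSITION` values and both function names are hypothetical. The enumerating version expands every `N` into all four nucleotides, while the forward version carries a running sum per nucleotide and gives the same result in linear time.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Hypothetical toy parameters, not taken from any righor model.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    "G": {"A": 0.3, "C": 0.3, "G": 0.3, "T": 0.1},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def allowed(symbol):
    """Nucleotides compatible with one symbol ('N' means any of the four)."""
    return NUCLEOTIDES if symbol == "N" else symbol


def likelihood_enumerate(seq):
    """Sum over every concrete sequence compatible with seq (exponential in the number of N)."""
    if not seq:
        return 1.0
    total = 0.0
    for concrete in product(*(allowed(s) for s in seq)):
        p = INITIAL[concrete[0]]
        for prev, cur in zip(concrete, concrete[1:]):
            p *= TRANSITION[prev][cur]
        total += p
    return total


def likelihood_forward(seq):
    """Same quantity, computed by carrying partial sums from left to right."""
    if not seq:
        return 1.0
    forward = {n: INITIAL[n] for n in allowed(seq[0])}
    for symbol in seq[1:]:
        forward = {
            cur: sum(p * TRANSITION[prev][cur] for prev, p in forward.items())
            for cur in allowed(symbol)
        }
    return sum(forward.values())
```

Both functions agree on any non-empty sequence over `ACGTN`; only the forward version stays cheap when many positions are undefined.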
Limitations (I think these also apply to IGoR, but it's not clear):
- Any primers/ends on the V gene side need to be removed before running it
- The reads need to be long enough to fully cover the CDR3 (even when it's particularly long)
- still not sure if I should use initial_distribution for the insertion model
Programming stuff:
- The web version lives in a different crate that imports the library; it still needs to be pushed to git.
- Python version: also in a different crate for now (may be merged back in later).
- When adding a model, add it to "models.json". The first model in a category is the default model. Each field is one independent model. The elements in chain and species should always be lower-case.
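
The "models.json" rules above can be sketched as follows. The schema used here (a top-level `models` list whose entries carry `id`, `species` and `chain` fields) is an assumption made for illustration, as are the model ids and both helper names; check the real file before relying on it.

```python
import json

# Hypothetical content: the real models.json may be laid out differently.
MODELS_JSON = """
{
  "models": [
    {"id": "example_human_trb_a", "species": ["human"], "chain": ["trb", "beta"]},
    {"id": "example_human_trb_b", "species": ["human"], "chain": ["trb", "beta"]},
    {"id": "example_mouse_tra", "species": ["mouse"], "chain": ["tra", "alpha"]}
  ]
}
"""


def check_lowercase(entries):
    """Return the ids of entries whose species or chain names are not lower-case."""
    return [
        entry["id"]
        for entry in entries
        if any(name != name.lower() for name in entry["species"] + entry["chain"])
    ]


def default_model(entries, species, chain):
    """Return the first entry matching species and chain, or None if there is none."""
    species, chain = species.lower(), chain.lower()
    for entry in entries:
        if species in entry["species"] and chain in entry["chain"]:
            return entry
    return None


entries = json.loads(MODELS_JSON)["models"]
```

Because the lookup lower-cases the query but compares against the stored names as written, an entry with an upper-case species or chain would never match, which is one reason to keep them lower-case in the file.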
Things to do:
- test the inference in detail
- add more tests
- deal with the "pgen with errors"
- deal with potential insertions in the V/J alignment; remove the sequence from the inference if the insertion overlaps with the delv range.
- test the restricted V gene option for generation.
- write igor file, offer a json export
- StaticEvent / GenEvent
- modify the way I deal with added errors (make it cleaner, with an "ErrorDistribution" thing or something similar)
- deal with amino-acid and generic "undefined" sequences. Strategy: (1) define an extended Dna object that the alignment can handle, and (2) adapt the insertion model so that it can handle it too. The first part is fine, just a bit longer to do. The second is more of a pain: sums need to be added here and there, nothing impossible. In short, some positions must be linked, which complicates the definition of Dna quite a bit (more precisely, this will be a new class). UndefinedDna would contain, for each position, a vec/array of bytes and an int giving the position it is connected with (just need two options for everything). This is very specific to the amino-acid case, but that is acceptable. A bit complicated right now, leaving it for later.
- improve alignment so that it can deal with potential indels.
- add simpler inference (without full VDJ, without V-J...)
- publish cargo package
- json export and loading
- run cargo clippy
- clean up gen event / static event if possible.
- make it work with CDR3 + V gene + J gene (requires implementing some of the Python functions in Rust)
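
As a rough illustration of the extended Dna idea from the list above, here is a minimal sketch of a sequence type that stores a set of allowed nucleotides per position and can be compared against a gene template. The `AmbiguousDna` name and its methods are hypothetical, and the links between positions needed for the amino-acid case are deliberately left out.

```python
from dataclasses import dataclass

NUCLEOTIDES = frozenset("ACGT")


@dataclass(frozen=True)
class AmbiguousDna:
    """Hypothetical type: one set of allowed nucleotides per position."""

    positions: tuple

    @classmethod
    def from_string(cls, text):
        # 'N' allows any nucleotide; every other character allows only itself.
        return cls(
            tuple(NUCLEOTIDES if c == "N" else frozenset(c) for c in text.upper())
        )

    def __len__(self):
        return len(self.positions)

    def count_mismatches(self, template, offset=0):
        """Count positions whose allowed set excludes the template nucleotide."""
        window = template[offset:offset + len(self)]
        if len(window) < len(self):
            raise ValueError("template too short for this offset")
        return sum(
            1 for allowed, base in zip(self.positions, window) if base not in allowed
        )


read = AmbiguousDna.from_string("ANGT")
```

Here `read.count_mismatches("ACGT")` counts no mismatch at the `N` position, since every nucleotide is allowed there. Supporting amino acids would additionally require tying together the positions of one codon, which this sketch does not attempt.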
TODO before v0.2:
- publish pip package
- make an example Python notebook covering: load model, align sequences, display aligned sequences, evaluate, display the evaluation (incl. features), infer model, display inferred model.
- use pgen for the online version
- fix the inference; there is clearly still a problem left. V/J work great (which makes sense), but D and insertions fail. Similarly, delV fails and delJ mostly fails. There's clearly a problem with delv. Best guess: a problem with the normalisation?
Current status:
- speed is OK (roughly 50 seqs/s?). Could be slightly faster; I think some ranges should be replaced by iterators.
- pgen works, but because I consider way more D gene alignments than Quentin, there are some issues when endD and startD are really close to each other.