rustpotion-0.2.0 has been yanked.
RustPotion
Oxidized Tokenlearn for blazingly fast word embeddings.
Example
cargo add rustpotion
use ;
use Path;
Why
> be me
> saw cool project
> said "why not rust"
> Now cool project is in rust
Speed
Because tokenlearn is so blazingly fast (mainly cause it's only just an average of some word vectors), the limiting factor is actually the tokenizer implementation.
That's why it's good news that we get ~27MB/s of input sentences for potion-base-2M, which on par (if not marginally better) with most other high performing tokenizers.
Performance
This project is just a rustified version of Tokenlearn so all the results (should be) the same.
Name | MTEB Score |
---|---|
potion-base-8M | 50.03 |
potion-base-4M | 48.23 |
potion-base-2M | 44.77 |
Limitations
- Only english; unicode slicing is pain
- No python bindings, just use Tokenlearn (it's secretly rust if you look deep enough)
- RustPotion::encode_many is multithreaded and will use all available resources
- No limit on sentence length, but performance starts to dip after 500 tokens (~100 words) so be careful.
Final thoughts
Don't use in production unless you like living on the edge