Robust and fast tokenization alignment library for Rust and Python
Demo: demo
Rust documentation: docs.rs
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly
Usage (Python)
- Installation
- Install from source
This library uses maturin to build the wheel.
$ git clone https://github.com/tamuhey/tokenizations
$ cd tokenizations/python
$ pip install maturin
$ maturin build
The wheel is now created in the python/target/wheels directory, and you can install it with pip install *whl.
get_alignments
...
Returns alignment mappings for two different tokenizations:
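>>> import tokenizations
>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]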
a2b[i] is a list of the indices of the tokens in tokens_b that align with tokens_a[i].
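To make the shape of the mapping concrete, here is a minimal pure-Python sketch that derives the reverse mapping from a forward one. It does not call the library: invert_alignment is a hypothetical helper written for illustration, and the a2b value is hand-written under the assumption that tokens_a = ["å", "BC"] and tokens_b = ["abc"], consistent with the comment in the example above.

```python
from typing import List


def invert_alignment(a2b: List[List[int]], len_b: int) -> List[List[int]]:
    """Build the reverse mapping (b -> a) from a forward alignment (a -> b).

    Hypothetical helper for illustration; not part of the library.
    """
    b2a: List[List[int]] = [[] for _ in range(len_b)]
    for i, targets in enumerate(a2b):
        for j in targets:
            # Token i of tokens_a aligns with token j of tokens_b.
            b2a[j].append(i)
    return b2a


# Assumed alignment: both "å" and "BC" map to the single token "abc".
a2b = [[0], [0]]
b2a = invert_alignment(a2b, len_b=1)
print(b2a)  # [[0, 1]]
```

Each inner list can hold zero, one, or several indices, so a token with no counterpart on the other side simply gets an empty list.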
Usage (Rust)
See here: docs.rs
Related
- Algorithm overview
- Blog post
- seqdiff is used for the diff process.
- textspan
- explosion/spacy-alignments: 💫 A spaCy package for Yohei Tamura's Rust tokenizations library
- Python bindings for this library, maintained by Explosion, the makers of spaCy. If you have difficulty installing pytokenizations, please try this.