udpipe-rs
Rust bindings for UDPipe — a trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing using Universal Dependencies.
Features
- Full parsing pipeline: Tokenization, POS tagging, lemmatization, and dependency parsing
- Universal Dependencies: Output follows the UD annotation scheme
- Model download utility: Easy download of pre-trained models for 65+ languages (optional)
- Thread-friendly: Models are Send (can be moved between threads)
Installation
Add to your Cargo.toml:
[dependencies]
udpipe-rs = "0.1"
Or install via the command line:
cargo add udpipe-rs
Usage
Download and load a model
use udpipe_rs::{download_model, Model};
Output:
1 The DET the 2 <- det
2 quick ADJ quick 5 <- amod
3 brown ADJ brown 5 <- amod
4 fox NOUN fox 5 <- nsubj
5 jumps VERB jump 0 <- root
6 over ADP over 9 <- case
7 the DET the 9 <- det
8 lazy ADJ lazy 9 <- amod
9 dog NOUN dog 5 <- obl
10 . PUNCT . 5 <- punct
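The columns in this output appear to correspond to the id, form, upostag, lemma, head, and deprel fields listed in the API reference. The following is a minimal sketch of that row layout. The `Word` struct here is a local stand-in written for this illustration, not the crate's own type, and `format_row` is a hypothetical helper.

```rust
/// Hypothetical stand-in for the crate's `Word`, limited to the fields printed above.
struct Word {
    id: i32,
    form: String,
    upostag: String,
    lemma: String,
    head: i32,
    deprel: String,
}

/// Formats one word as `id form upostag lemma head <- deprel`.
fn format_row(w: &Word) -> String {
    format!(
        "{} {} {} {} {} <- {}",
        w.id, w.form, w.upostag, w.lemma, w.head, w.deprel
    )
}

fn main() {
    let fox = Word {
        id: 4,
        form: "fox".to_string(),
        upostag: "NOUN".to_string(),
        lemma: "fox".to_string(),
        head: 5,
        deprel: "nsubj".to_string(),
    };
    // Prints one row in the same layout as the output above.
    println!("{}", format_row(&fox));
}
```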
Available languages
Pre-trained models are available for 65+ languages. Use udpipe_rs::AVAILABLE_MODELS to see the full list:
// Some examples:
// "english-ewt", "english-gum", "english-lines", "english-partut"
// "german-gsd", "german-hdt"
// "french-gsd", "french-sequoia", "french-spoken"
// "spanish-ancora", "spanish-gsd"
// "dutch-alpino", "dutch-lassysmall"
// "chinese-gsd", "japanese-gsd", "korean-gsd"
// ... and many more
for lang in udpipe_rs::AVAILABLE_MODELS {
    println!("{lang}");
}
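Since model names follow a language-treebank pattern, it can be handy to list only the treebanks for one language. The sketch below does this over a local array holding a few of the names shown above. `models_for_language` is a hypothetical helper, and passing AVAILABLE_MODELS to it assumes that constant is a slice of string names.

```rust
/// Returns the model names whose language part (the text before the first '-')
/// equals `language`.
fn models_for_language<'a>(models: &[&'a str], language: &str) -> Vec<&'a str> {
    models
        .iter()
        .copied()
        .filter(|name| name.split('-').next() == Some(language))
        .collect()
}

fn main() {
    // A few of the names listed above. With the crate, AVAILABLE_MODELS could be
    // passed instead (assumption: it is a slice of string names).
    let models = [
        "english-ewt",
        "english-gum",
        "german-gsd",
        "german-hdt",
        "french-gsd",
    ];
    for name in models_for_language(&models, "english") {
        println!("{name}");
    }
}
```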
Working with morphological features
use udpipe_rs::Model;
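As a rough illustration of what a feature lookup involves, the sketch below parses a UD feature string such as "Mood=Imp|VerbForm=Fin" by hand. The free functions `get_feature` and `has_feature` are written for this example only. They mirror the names of the helper methods on Word but are not the crate's implementation.

```rust
/// Looks up `key` in a UD feature string such as "Mood=Imp|VerbForm=Fin".
/// Returns `None` when the key is absent or the string is empty ("_").
fn get_feature<'a>(feats: &'a str, key: &str) -> Option<&'a str> {
    feats
        .split('|')
        .filter_map(|pair| pair.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}

/// True when `key` is present with exactly `value`.
fn has_feature(feats: &str, key: &str, value: &str) -> bool {
    get_feature(feats, key) == Some(value)
}

fn main() {
    let feats = "Mood=Imp|VerbForm=Fin";
    println!("Mood = {:?}", get_feature(feats, "Mood"));
    println!("finite verb: {}", has_feature(feats, "VerbForm", "Fin"));
    println!("Tense = {:?}", get_feature(feats, "Tense"));
}
```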
Working with sentence structure
use udpipe_rs::Model;
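One way to work with sentence structure is to group words by sentence_id and then follow head links, where a head of 0 marks the root. The sketch below shows this on hand-built data. The `Word` struct is a local stand-in carrying only the fields needed here, and `word`, `group_by_sentence`, and `dependents` are hypothetical helpers rather than crate functions.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the crate's `Word`, with only the fields used here.
#[derive(Debug)]
struct Word {
    form: String,
    id: i32,
    head: i32,
    sentence_id: i32,
}

/// Small constructor to keep the example data readable.
fn word(form: &str, id: i32, head: i32, sentence_id: i32) -> Word {
    Word { form: form.to_string(), id, head, sentence_id }
}

/// Groups words by `sentence_id`, preserving word order within each sentence.
fn group_by_sentence(words: &[Word]) -> BTreeMap<i32, Vec<&Word>> {
    let mut sentences: BTreeMap<i32, Vec<&Word>> = BTreeMap::new();
    for w in words {
        sentences.entry(w.sentence_id).or_default().push(w);
    }
    sentences
}

/// Returns the words of one sentence whose head is `head_id` (0 selects the root).
fn dependents<'a>(sentence: &[&'a Word], head_id: i32) -> Vec<&'a Word> {
    sentence.iter().copied().filter(|w| w.head == head_id).collect()
}

fn main() {
    let words = vec![
        word("Dogs", 1, 2, 0),
        word("bark", 2, 0, 0),
        word("Cats", 1, 2, 1),
        word("sleep", 2, 0, 1),
    ];
    let sentences = group_by_sentence(&words);
    for (sentence_id, sentence) in &sentences {
        for root in dependents(sentence, 0) {
            let children: Vec<&str> = dependents(sentence, root.id)
                .into_iter()
                .map(|w| w.form.as_str())
                .collect();
            println!(
                "sentence {sentence_id}: root = {}, dependents = {:?}",
                root.form, children
            );
        }
    }
}
```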
Download from custom URL
If you need to download from a different source:
use udpipe_rs::download_model_from_url;
download_model_from_url(...).expect("...");
Thread Safety
Model is Send but not Sync. This means:
- You can move a model to another thread (ownership transfer)
- You cannot share &Model across threads simultaneously
For concurrent access, either:
Option 1: Wrap in Mutex (shared model, serialized access)
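use std::sync::{Arc, Mutex};
use udpipe_rs::Model;
let model = Arc::new(Mutex::new(/* a loaded Model */));
// Clone Arc for each thread
let model_clone = Arc::clone(&model);
std::thread::spawn(move || {
    // Lock the mutex for exclusive access, then parse
    let model = model_clone.lock().unwrap();
});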
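The Mutex pattern can be tried without a real model. The sketch below uses `FakeModel`, a hypothetical stand-in whose `Cell` field makes it Send but not Sync, much like the Model described here. Its `parse` method only counts tokens and is not the crate's API. Only the standard library is used.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a model: the `Cell` makes it `Send` but not `Sync`.
struct FakeModel {
    calls: Cell<u32>,
}

impl FakeModel {
    /// Pretends to parse by counting whitespace-separated tokens.
    fn parse(&self, text: &str) -> usize {
        self.calls.set(self.calls.get() + 1);
        text.split_whitespace().count()
    }
}

/// Parses each text on its own thread while sharing one model behind a mutex.
/// Returns the token counts (in input order) and the number of parse calls.
fn parse_shared(texts: Vec<String>) -> (Vec<usize>, u32) {
    let model = Arc::new(Mutex::new(FakeModel { calls: Cell::new(0) }));
    let handles: Vec<_> = texts
        .into_iter()
        .map(|text| {
            // Each thread gets its own Arc handle to the same model.
            let model = Arc::clone(&model);
            thread::spawn(move || {
                // Access is serialized: only one thread holds the lock at a time.
                let guard = model.lock().unwrap();
                guard.parse(&text)
            })
        })
        .collect();
    let counts: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let calls = model.lock().unwrap().calls.get();
    (counts, calls)
}

fn main() {
    let texts = vec!["The quick brown fox".to_string(), "jumps over".to_string()];
    let (counts, calls) = parse_shared(texts);
    println!("token counts: {counts:?}, parse calls: {calls}");
}
```

Sharing `Arc<FakeModel>` directly would not compile, because `Arc<T>` is only Send when `T` is Sync. Wrapping the value in a Mutex is what makes the shared handle safe to send.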
Option 2: Separate models per thread (parallel access, higher memory)
use udpipe_rs::Model;
std::thread::spawn(|| {
    // Load a separate Model inside this thread and parse with it
});
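The per-thread variant trades memory for parallelism: no lock is needed because nothing is shared. A minimal sketch follows, again with a hypothetical `FakeModel` whose `load` and `parse` functions are placeholders written for this example and not the crate's API.

```rust
use std::thread;

/// Hypothetical stand-in for a model. Loading a real model is the costly step.
struct FakeModel;

impl FakeModel {
    /// Placeholder loader; the name argument is ignored in this sketch.
    fn load(_name: &str) -> FakeModel {
        FakeModel
    }

    /// Pretends to parse by counting whitespace-separated tokens.
    fn parse(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Parses each text on its own thread, with one model instance per thread.
fn parse_parallel(texts: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|text| {
            thread::spawn(move || {
                // The model is created inside the thread, so nothing is shared
                // and no lock is required.
                let model = FakeModel::load("english-ewt");
                model.parse(&text)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let texts = vec!["one two three".to_string(), "four five".to_string()];
    println!("{:?}", parse_parallel(texts));
}
```

With a real model, loading once per worker thread and reusing it for many texts would usually be preferable to loading once per text as this sketch does.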
API Reference
Word struct
Each parsed word contains:
| Field | Type | Description |
|---|---|---|
| form | String | The surface form (actual text) |
| lemma | String | The lemma (dictionary form) |
| upostag | String | Universal POS tag (NOUN, VERB, ADJ, etc.) |
| xpostag | String | Language-specific POS tag |
| feats | String | Morphological features (e.g., "Mood=Imp\|VerbForm=Fin") |
| deprel | String | Dependency relation (root, nsubj, obj, etc.) |
| misc | String | Miscellaneous annotations (e.g., "SpaceAfter=No") |
| id | i32 | 1-based index of this word within its sentence |
| head | i32 | Index of head word (0 = root of sentence) |
| sentence_id | i32 | 0-based index of the sentence this word belongs to |
Helper methods on Word
- has_feature(key, value): Check if a morphological feature is present
- get_feature(key): Get the value of a morphological feature
- is_verb(): Returns true for VERB or AUX tags
- is_noun(): Returns true for NOUN or PROPN tags
- is_adjective(): Returns true for the ADJ tag
- is_punct(): Returns true for the PUNCT tag
- is_root(): Returns true if this word is the sentence root
- has_space_after(): Returns true if there's a space after this word (the default)
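To show how such helpers can be combined, the sketch below rebuilds the original text from a word list using the space-after information, and checks a POS predicate. It assumes that a missing space is signalled by "SpaceAfter=No" in misc, as in the example in the table above. The `Word` struct and the free functions are local stand-ins written for this illustration, not the crate's implementation.

```rust
/// Hypothetical stand-in for the crate's `Word`, with only the fields used here.
struct Word {
    form: String,
    upostag: String,
    misc: String,
}

/// Small constructor to keep the example data readable.
fn word(form: &str, upostag: &str, misc: &str) -> Word {
    Word {
        form: form.to_string(),
        upostag: upostag.to_string(),
        misc: misc.to_string(),
    }
}

/// True for VERB or AUX tags.
fn is_verb(w: &Word) -> bool {
    matches!(w.upostag.as_str(), "VERB" | "AUX")
}

/// True unless `misc` contains "SpaceAfter=No" (assumed convention).
fn has_space_after(w: &Word) -> bool {
    !w.misc.split('|').any(|entry| entry == "SpaceAfter=No")
}

/// Rebuilds the surface text, inserting a space only where one followed the word.
fn detokenize(words: &[Word]) -> String {
    let mut text = String::new();
    for w in words {
        text.push_str(&w.form);
        if has_space_after(w) {
            text.push(' ');
        }
    }
    text.trim_end().to_string()
}

fn main() {
    let words = vec![
        word("The", "DET", "_"),
        word("dog", "NOUN", "_"),
        word("barks", "VERB", "SpaceAfter=No"),
        word(".", "PUNCT", "_"),
    ];
    let verbs = words.iter().filter(|w| is_verb(w)).count();
    println!("{} ({} verb)", detokenize(&words), verbs);
}
```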
Examples
# Download a model
# Parse text
Models
Pre-trained models for 100+ treebanks are available from the LINDAT/CLARIAH-CZ repository. The download_model function fetches from this repository automatically.
Requirements
For users: A C++ compiler with C++11 support. The build script compiles UDPipe as a static library automatically.
For contributors: Just Docker. See CONTRIBUTING.md for details.
License
This crate is dual-licensed under MIT OR Apache-2.0.
UDPipe itself is licensed under the Mozilla Public License 2.0.