// Import `Segmenter` trait.
use crate::segmenter::Segmenter;
// Write a short doc comment for the specialized Segmenter, like the one below.
/// <Script/Language> specialized [`Segmenter`].
///
/// This Segmenter uses [`<UsedLibraryToSegment>`] internally to segment the provided text.
/// <OptionalAdditionalExplanations>
//
//TIP: Name the Segmenter with its purpose and not its internal behavior:
// prefer JapaneseSegmenter (based on the Language) instead of LinderaSegmenter (based on the used Library).
// Same for the filename, prefer `japanese.rs` instead of `lindera.rs`.
pub struct DummySegmenter;
// All specialized segmenters only need to implement the method `segment_str` of the `Segmenter` trait.
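// A minimal sketch of such an implementation, not the crate's actual code.
// Assumptions: the `Segmenter` trait below is a local stand-in so the example compiles on its own,
// and its signature may differ from the real trait (check `segmenter/mod.rs`). `SketchSegmenter`
// is a hypothetical name, and splitting on whitespace is only a placeholder for real segmentation logic.

```rust
// Local stand-in for the crate's `Segmenter` trait (assumed signature).
pub trait Segmenter {
    fn segment_str<'o>(&self, to_segment: &'o str) -> Box<dyn Iterator<Item = &'o str> + 'o>;
}

// Hypothetical segmenter used only for illustration.
pub struct SketchSegmenter;

impl Segmenter for SketchSegmenter {
    fn segment_str<'o>(&self, to_segment: &'o str) -> Box<dyn Iterator<Item = &'o str> + 'o> {
        // Create the iterator that segments the provided text.
        // Each whitespace character stays attached to the end of the segment before it.
        let segment_iterator = to_segment.split_inclusive(char::is_whitespace);
        // Return the iterator wrapped in a Box, since the trait returns a trait object.
        Box::new(segment_iterator)
    }
}
```

// Returning a boxed iterator over `&'o str` slices lets the caller consume segments lazily
// without copying the input text.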
//TIP: Some segmentation libraries need to initialize an instance of their segmenter.
// This initialization can be time-consuming and shouldn't be done on every call to `segment_str`.
// In this case, you may want to store the initialized instance in a lazy static like the one below and call it in `segment_str`.
// Otherwise, just remove the lines below.
//
// Put this import at the top of the file.
// use std::sync::LazyLock;
//
// static LIBRARY_SEGMENTER: LazyLock<LibrarySegmenter> = LazyLock::new(|| LibrarySegmenter::new());
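//
// A self-contained sketch of the lazy static pattern described in the TIP above.
// Assumptions: `LibrarySegmenter`, its `new` and `segment` methods, and `INIT_COUNT` are
// hypothetical stand-ins for a third-party segmenter that is costly to build; they are not a real API.
// `std::sync::LazyLock` requires Rust 1.80 or later.

```rust
use std::str::SplitInclusive;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

// Counts how many times the hypothetical library segmenter has been built.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a third-party segmenter with an expensive constructor.
pub struct LibrarySegmenter {
    delimiter: char,
}

impl LibrarySegmenter {
    pub fn new() -> Self {
        // A real library might load dictionaries or models here.
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        LibrarySegmenter { delimiter: ' ' }
    }

    pub fn segment<'o>(&self, text: &'o str) -> SplitInclusive<'o, char> {
        text.split_inclusive(self.delimiter)
    }
}

// Built on first access, then shared by every later call.
static LIBRARY_SEGMENTER: LazyLock<LibrarySegmenter> = LazyLock::new(|| LibrarySegmenter::new());

// Free function standing in for the body of `segment_str`.
pub fn segment_str<'o>(to_segment: &'o str) -> Box<dyn Iterator<Item = &'o str> + 'o> {
    Box::new(LIBRARY_SEGMENTER.segment(to_segment))
}
```

// In this sketch, calling `segment_str` several times runs `LibrarySegmenter::new` only once.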
// Publish the newly implemented Segmenter:
// - import module by adding `mod dummy;` (filename) in `segmenter/mod.rs`
// - publish Segmenter by adding `pub use dummy::DummySegmenter;` in `segmenter/mod.rs`
// - run `cargo doc --open`; you should see your Segmenter in the segmenter module
// Test the segmenter:
// Include the newly implemented Segmenter in the tokenization pipeline:
// - assign Segmenter to a Script and a Language by adding it in `SEGMENTERS` in `segmenter/mod.rs`
// - check that it didn't break any test or benchmark
// Your Segmenter will now be used on texts of the assigned Script and Language. Thank you for your contribution, and congratulations! 🎉