//! Arabic diacritics (harakat / تشکیل) removal.
//!
//! Diacritics are combining marks that annotate short vowels, gemination,
//! and related pronunciation details in Arabic. They also appear occasionally
//! in Persian text, especially classical poetry, religious texts, and
//! language-learning material. For most NLP tasks (search, matching,
//! sentiment analysis) they add noise rather than information.

/// Remove Arabic harakat, related combining marks, and the tatweel extender
/// from `text`.
///
/// The following Unicode code points are stripped:
///
/// | Code point | Name | Glyph |
/// |------------|------|-------|
/// | U+064B | Arabic Fathatan | ً |
/// | U+064C | Arabic Dammatan | ٌ |
/// | U+064D | Arabic Kasratan | ٍ |
/// | U+064E | Arabic Fatha | َ |
/// | U+064F | Arabic Damma | ُ |
/// | U+0650 | Arabic Kasra | ِ |
/// | U+0651 | Arabic Shadda | ّ |
/// | U+0652 | Arabic Sukun | ْ |
/// | U+0653 | Arabic Maddah Above | ٓ |
/// | U+0654 | Arabic Hamza Above | ٔ |
/// | U+0655 | Arabic Hamza Below | ٕ |
/// | U+0640 | Arabic Tatweel | ـ |
///
/// ```
/// use parsitext::diacritics::remove_diacritics;
///
/// assert_eq!(remove_diacritics("مُحَمَّد"), "محمد");
/// assert_eq!(remove_diacritics("كِتَابٌ"), "كتاب");
/// assert_eq!(remove_diacritics("سلاـم"), "سلام"); // tatweel removed
/// ```
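// NOTE: the function body is not part of this excerpt. What follows is a
// minimal sketch, assuming the function simply filters out the code points
// listed in the table above. The signature (`&str` in, `String` out) is
// inferred from the doctest and may differ from the real implementation.
pub fn remove_diacritics(text: &str) -> String {
    text.chars()
        // Keep every character that is neither a combining mark in
        // U+064B..=U+0655 nor the tatweel (U+0640).
        .filter(|&c| !matches!(c, '\u{064B}'..='\u{0655}' | '\u{0640}'))
        .collect()
}
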
/// Returns `true` if `c` is an Arabic diacritic or tatweel character that
/// should be stripped by [`remove_diacritics`].
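// NOTE: the predicate itself is not part of this excerpt. The name
// `is_diacritic` and the `char -> bool` signature are assumptions made for
// illustration; the range below mirrors the table in the
// `remove_diacritics` documentation (U+064B..=U+0655 plus U+0640).
pub fn is_diacritic(c: char) -> bool {
    matches!(c, '\u{064B}'..='\u{0655}' | '\u{0640}')
}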