libgrammstein 0.1.0

Hybrid language model (N-gram + Embeddings) for WFST text correction
# Language Implementations

This module provides built-in implementations of the `CodeLanguage` trait for various programming languages.

## Overview

libgrammstein supports both mainstream programming languages and domain-specific languages (DSLs). Each implementation provides:

- Tree-sitter grammar integration
- Keyword and operator definitions
- Token classification logic
- Built-in types and standard library functions
- Comment syntax configuration
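
As a rough illustration of what one implementation bundles together, the sketch below defines a small stand-in trait and a toy language. The `LanguageInfo` trait and the `ToyLang` type are hypothetical names invented for this example; they are not the library's `CodeLanguage` trait, which is described in the language framework documentation.

```rust
/// Hypothetical stand-in for a language-description trait.
/// This is NOT the library's `CodeLanguage` trait; all names are illustrative.
trait LanguageInfo {
    fn name(&self) -> &'static str;
    fn keywords(&self) -> &'static [&'static str];
    fn builtin_types(&self) -> &'static [&'static str];
    fn line_comment(&self) -> &'static str;

    /// Default helper built on top of the required methods.
    fn is_keyword(&self, text: &str) -> bool {
        self.keywords().contains(&text)
    }
}

/// A toy language used only to show the shape of an implementation.
struct ToyLang;

impl LanguageInfo for ToyLang {
    fn name(&self) -> &'static str {
        "toy"
    }
    fn keywords(&self) -> &'static [&'static str] {
        &["let", "fn", "if", "else"]
    }
    fn builtin_types(&self) -> &'static [&'static str] {
        &["Int", "Bool"]
    }
    fn line_comment(&self) -> &'static str {
        "//"
    }
}
```

With this in place, `ToyLang.is_keyword("fn")` is true while `ToyLang.is_keyword("Int")` is false, because built-in types are kept in a separate list.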

## Supported Languages

| Language | Feature Flag | Category | Tree-sitter Grammar |
|----------|--------------|----------|---------------------|
| Python | `code-python` | Mainstream | `tree-sitter-python` |
| Rust | `code-rust` | Mainstream | `tree-sitter-rust` |
| JavaScript | `code-javascript` | Mainstream | `tree-sitter-javascript` |
| Rholang | `code-rholang` | DSL (Blockchain) | `rholang-tree-sitter` |
| MeTTa | `code-metta` | DSL (AI/Reasoning) | `tree-sitter-metta` |

## Feature Flags

Enable language support in `Cargo.toml`:

```toml
[dependencies]
libgrammstein = { version = "0.1", features = ["code", "code-python", "code-rust"] }
```

Convenience feature groups:

| Feature | Includes |
|---------|----------|
| `code-mainstream` | Python, Rust, JavaScript |
| `code-dsl` | Rholang, MeTTa |
| `code-full` | All languages + neural features |

---

## Python

Python support with type hints and indentation awareness.

### Usage

```rust
use libgrammstein::code::Python;

let python = Python::new();

// Sample Python source (this snippet does not parse it)
let source = r#"
def greet(name: str) -> str:
    return f"Hello, {name}!"
"#;

// Access language information
assert_eq!(python.name(), "python");
assert!(python.is_whitespace_significant());
```

### Characteristics

| Property | Value |
|----------|-------|
| File Extensions | `.py`, `.pyw`, `.pyi` |
| Whitespace Significant | Yes |
| Comment Syntax | `#` (line), `"""` (docstring, treated as block) |
| Unicode Identifiers | Yes |

### Keywords

```
False   None    True    and     as      assert  async   await
break   class   continue  def   del     elif    else    except
finally for     from    global  if      import  in      is
lambda  nonlocal not     or      pass    raise   return  try
while   with    yield
```

### Built-in Types

```rust
// Primitive types
["int", "float", "complex", "str", "bytes", "bytearray",
 "list", "tuple", "set", "frozenset", "dict", "bool", "object", "type"]

// Type hints (typing module)
["Optional", "Union", "List", "Dict", "Set", "Tuple",
 "Callable", "Any", "Type", "Generic", "Protocol"]

// Exception types
["Exception", "BaseException", "TypeError", "ValueError",
 "KeyError", "IndexError", "AttributeError", "NameError"]
```

### Token Classification

```rust
let python = Python::new();

// Keywords use their text as node_kind
assert_eq!(python.classify_token("def", "def"), TokenType::Keyword);
assert_eq!(python.classify_token("None", "None"), TokenType::Keyword);

// Boolean literals
assert_eq!(python.classify_token("True", "True"), TokenType::BooleanLiteral);

// Identifiers and types
assert_eq!(python.classify_token("foo", "identifier"), TokenType::Identifier);
assert_eq!(python.classify_token("int", "identifier"), TokenType::TypeName);

// Literals
assert_eq!(python.classify_token("42", "integer"), TokenType::NumericLiteral);
assert_eq!(python.classify_token("3.14", "float"), TokenType::NumericLiteral);
```
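
To make the lookup order in these assertions concrete, here is a minimal, self-contained sketch of how a classifier of this shape could be written. It is an assumption about the approach, not the library's code: the local `TokenType` enum only mirrors the variant names used above, the `Other` variant and the `classify` function are invented for illustration, and the keyword and type lists are abbreviated.

```rust
/// Local stand-in mirroring the variant names used in the examples above.
/// `Other` is invented for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TokenType {
    Keyword,
    BooleanLiteral,
    Identifier,
    TypeName,
    NumericLiteral,
    Other,
}

const BOOLEANS: &[&str] = &["True", "False"];
// Abbreviated keyword list; see the Keywords section for the full set.
const KEYWORDS: &[&str] = &["None", "and", "class", "def", "if", "import", "return"];
// Abbreviated built-in type list.
const BUILTIN_TYPES: &[&str] = &["int", "float", "str", "bool", "list", "dict"];

/// Classifies a token from its text and its tree-sitter node kind.
fn classify(text: &str, node_kind: &str) -> TokenType {
    // Booleans are checked first because `True`/`False` are also keywords.
    if BOOLEANS.contains(&text) {
        return TokenType::BooleanLiteral;
    }
    if KEYWORDS.contains(&text) {
        return TokenType::Keyword;
    }
    match node_kind {
        "integer" | "float" => TokenType::NumericLiteral,
        // An identifier naming a built-in type is reported as a type name.
        "identifier" if BUILTIN_TYPES.contains(&text) => TokenType::TypeName,
        "identifier" => TokenType::Identifier,
        _ => TokenType::Other,
    }
}
```

The ordering matters: checking booleans before keywords, and built-in types before plain identifiers, is what lets `"True"` and `"int"` receive the more specific classification.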

### Identifier Validation

```rust
let python = Python::new();

// Valid Python identifiers
assert!(python.is_valid_identifier("foo"));
assert!(python.is_valid_identifier("_private"));
assert!(python.is_valid_identifier("CamelCase"));
assert!(python.is_valid_identifier("snake_case_123"));

// Invalid identifiers
assert!(!python.is_valid_identifier("123abc"));  // Starts with digit
assert!(!python.is_valid_identifier(""));        // Empty
assert!(!python.is_valid_identifier("my-var"));  // Contains hyphen
```
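
The rules exercised above (non-empty, no leading digit, no hyphen) can be approximated with the standard library alone. The sketch below is a hypothetical helper, not the library's implementation: it uses `char::is_alphabetic` and `char::is_alphanumeric` as a rough substitute for the Unicode `XID_Start`/`XID_Continue` classes that Python actually specifies, and it does not reject keywords.

```rust
/// Approximate check for a Python-style identifier.
/// Assumption: `is_alphabetic`/`is_alphanumeric` stand in for XID_Start/XID_Continue.
fn is_identifier_like(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // First character: underscore or a letter (never a digit).
        Some(first) if first == '_' || first.is_alphabetic() => {
            // Remaining characters: underscore, letters, or digits.
            chars.all(|c| c == '_' || c.is_alphanumeric())
        }
        // Empty string or an invalid first character.
        _ => false,
    }
}
```

Because the check is character-class based, a non-ASCII name such as `naïve` is accepted, which matches the "Unicode Identifiers: Yes" entry in the characteristics table.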

---

## Rust

Rust support with macro awareness and raw identifier handling.

### Usage

```rust
use libgrammstein::code::Rust;

let rust = Rust::new();

// Access language information
assert_eq!(rust.name(), "rust");
assert!(!rust.is_whitespace_significant());
```

### Characteristics

| Property | Value |
|----------|-------|
| File Extensions | `.rs` |
| Whitespace Significant | No |
| Comment Syntax | `//` (line), `/* */` (block), `///` (doc) |
| Raw Identifiers | `r#keyword` syntax |

### Keywords

```
as      async   await   break   const   continue  crate   dyn
else    enum    extern  false   fn      for       if      impl
in      let     loop    match   mod     move      mut     pub
ref     return  self    Self    static  struct    super   trait
true    type    unsafe  use     where   while     try
```

### Built-in Types

```rust
// Primitive types
["bool", "char", "str",
 "i8", "i16", "i32", "i64", "i128", "isize",
 "u8", "u16", "u32", "u64", "u128", "usize",
 "f32", "f64"]

// Standard library types
["String", "Vec", "Box", "Rc", "Arc", "Cell", "RefCell",
 "Option", "Result", "Ok", "Err", "Some", "None",
 "HashMap", "HashSet", "BTreeMap", "BTreeSet",
 "Path", "PathBuf", "Cow", "Pin", "PhantomData"]

// Common traits
["Copy", "Clone", "Debug", "Display", "Default",
 "Send", "Sync", "Sized", "Eq", "PartialEq", "Ord", "PartialOrd",
 "Iterator", "IntoIterator", "From", "Into", "Drop", "Fn", "FnMut", "FnOnce"]
```

### Token Classification

```rust
let rust = Rust::new();

// Keywords
assert_eq!(rust.classify_token("fn", "fn"), TokenType::Keyword);
assert_eq!(rust.classify_token("let", "let"), TokenType::Keyword);

// Boolean literals
assert_eq!(rust.classify_token("true", "true"), TokenType::BooleanLiteral);

// Primitive types
assert_eq!(rust.classify_token("i32", "primitive_type"), TokenType::TypeName);

// Macro invocations are special
assert_eq!(rust.classify_token("println!", "macro_invocation"), TokenType::Special);
```

### Raw Identifier Support

```rust
let rust = Rust::new();

// Raw identifiers allow keywords as names
assert!(rust.is_valid_identifier("r#type"));   // Valid: raw identifier
assert!(rust.is_valid_identifier("r#match"));  // Valid: raw identifier
assert!(rust.is_valid_identifier("r#loop"));   // Valid: raw identifier

// Regular identifiers
assert!(rust.is_valid_identifier("foo_bar"));
assert!(rust.is_valid_identifier("_hidden"));
```
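
One way to handle the `r#` prefix is to strip it and validate the remainder as an ordinary identifier. The sketch below is a hypothetical, ASCII-only helper rather than the library's code. It also rejects `r#crate`, `r#self`, `r#super`, and `r#Self`, which the Rust language itself does not allow as raw identifiers; whether libgrammstein enforces that rule is not stated here.

```rust
/// ASCII-only check for a plain (non-raw) identifier.
/// A lone underscore is excluded because `_` is not an identifier in Rust.
fn is_plain_identifier(s: &str) -> bool {
    if s == "_" {
        return false;
    }
    let mut chars = s.chars();
    match chars.next() {
        Some(first) if first == '_' || first.is_ascii_alphabetic() => {
            chars.all(|c| c == '_' || c.is_ascii_alphanumeric())
        }
        _ => false,
    }
}

/// Accepts plain identifiers and `r#`-prefixed raw identifiers.
fn is_rust_identifier_like(s: &str) -> bool {
    match s.strip_prefix("r#") {
        // Raw identifier: the part after `r#` must itself be a plain identifier,
        // and must not be one of the names Rust forbids in raw form.
        Some(rest) => {
            is_plain_identifier(rest) && !matches!(rest, "crate" | "self" | "super" | "Self")
        }
        None => is_plain_identifier(s),
    }
}
```

Note that this sketch does not reject bare keywords such as `match`; a fuller validator would consult the keyword list for the non-raw case.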

---

## JavaScript

JavaScript ES6+ support with JSX awareness.

### Usage

```rust
use libgrammstein::code::JavaScript;

let js = JavaScript::new();

// Access language information
assert_eq!(js.name(), "javascript");
assert_eq!(js.display_name(), "JavaScript");
```

### Characteristics

| Property | Value |
|----------|-------|
| File Extensions | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Whitespace Significant | No |
| Comment Syntax | `//` (line), `/* */` (block), `///` (doc) |
| Dollar Identifiers | `$variable` syntax |

### Keywords

```
async     await     break     case      catch     class     const
continue  debugger  default   delete    do        else      export
extends   false     finally   for       function  if        import
in        instanceof  let     new       null      return    static
super     switch    this      throw     true      try       typeof
undefined var       void      while     with      yield
```

### Built-in Types

```rust
// Primitive wrappers
["Boolean", "Number", "String", "Symbol", "BigInt"]

// Objects
["Object", "Array", "Function", "Date", "RegExp", "Error",
 "Map", "Set", "WeakMap", "WeakSet", "Promise", "Proxy", "Reflect"]

// TypedArrays
["ArrayBuffer", "DataView", "Int8Array", "Uint8Array",
 "Int16Array", "Uint16Array", "Int32Array", "Uint32Array",
 "Float32Array", "Float64Array", "BigInt64Array", "BigUint64Array"]

// Error types
["TypeError", "RangeError", "ReferenceError", "SyntaxError"]
```

### Token Classification

```rust
let js = JavaScript::new();

// Keywords
assert_eq!(js.classify_token("function", "function"), TokenType::Keyword);
assert_eq!(js.classify_token("const", "const"), TokenType::Keyword);

// Null-like keywords
assert_eq!(js.classify_token("null", "null"), TokenType::Keyword);
assert_eq!(js.classify_token("undefined", "undefined"), TokenType::Keyword);

// Boolean literals
assert_eq!(js.classify_token("true", "true"), TokenType::BooleanLiteral);

// JSX elements are special
assert_eq!(js.classify_token("<div>", "jsx_opening_element"), TokenType::Special);
```

### Identifier Validation

```rust
let js = JavaScript::new();

// Valid JavaScript identifiers
assert!(js.is_valid_identifier("foo"));
assert!(js.is_valid_identifier("_private"));
assert!(js.is_valid_identifier("$element"));    // jQuery-style
assert!(js.is_valid_identifier("$$internal"));  // Angular-style

// Invalid identifiers
assert!(!js.is_valid_identifier("123abc"));  // Starts with digit
assert!(!js.is_valid_identifier("my-var"));  // Contains hyphen
```

---

## Rholang

Rholang is a reflective, concurrent programming language based on the rho-calculus, designed for building scalable, secure blockchain applications on the RChain platform.

### Core Concepts

- **Channels (names)**: Communication endpoints prefixed with `@`
- **Processes**: Concurrent computations composed with `|`
- **Contracts**: Persistent receive operations
- **Bundles**: Access control for channels

### Usage

```rust
use libgrammstein::code::Rholang;

let rholang = Rholang::new();

// Access language information
assert_eq!(rholang.name(), "rholang");
assert!(!rholang.is_whitespace_significant());
```

### Characteristics

| Property | Value |
|----------|-------|
| File Extensions | `.rho` |
| Whitespace Significant | No |
| Comment Syntax | `//` (line), `/* */` (block), `///` (doc) |
| Paradigm | Concurrent, process algebra |

### Keywords

```
new     in      if      else    let     match   select  contract  for
or      and     matches not
bundle  bundle- bundle+ bundle0
true    false   Nil
```

### Special Tokens (Operators)

Rholang has a rich set of channel and process operators:

```rust
// Channel operations
"@"   // Quote (process -> name)
"*"   // Eval/dereference (name -> process)

// Send operations
"!"   // Send single
"!!"  // Send persistent
"!?"  // Synchronous send-then-receive

// Receive operations
"<-"  // Linear receive
"<="  // Persistent receive
"<<-" // Peek (non-consuming receive)
"?!"  // Receive-then-send

// Process algebra
"|"   // Parallel composition
"&"   // Concurrent binding
";"   // Sequential composition
"=>"  // Pattern match arm

// Set operations
"++"  // Union/concatenation
"--"  // Difference
"/\\" // Conjunction
"\\/" // Disjunction
"~"   // Negation
```
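
Several of these operators share prefixes (`!`, `!!`, `!?`; `<-`, `<=`, `<<-`), so a scanner has to prefer the longest match. The following sketch shows that "maximal munch" rule in isolation. The `match_operator` function is a hypothetical helper for illustration; it does not describe how the tree-sitter grammar tokenizes Rholang.

```rust
/// Operators copied from the list above.
const OPERATORS: &[&str] = &[
    "<<-", "!!", "!?", "?!", "<-", "<=", "=>", "++", "--", "/\\", "\\/",
    "@", "*", "!", "|", "&", ";", "~",
];

/// Returns the longest operator that `input` starts with, if any.
fn match_operator(input: &str) -> Option<&'static str> {
    OPERATORS
        .iter()
        .copied()
        // Keep every operator that is a prefix of the input...
        .filter(|op| input.starts_with(*op))
        // ...and prefer the longest one, so `!!` wins over `!`.
        .max_by_key(|op| op.len())
}
```

For example, `match_operator("!!(x)")` yields `Some("!!")` rather than `Some("!")`, and `match_operator("<<- ch")` yields `Some("<<-")`.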

### Built-in Types

```rust
["Bool", "Int", "String", "Uri", "ByteArray", "Nil"]
```

### Token Classification

```rust
let rholang = Rholang::new();

// Keywords
assert_eq!(rholang.classify_token("new", "new"), TokenType::Keyword);
assert_eq!(rholang.classify_token("contract", "contract"), TokenType::Keyword);

// Boolean literals
assert_eq!(rholang.classify_token("true", "bool_literal"), TokenType::BooleanLiteral);

// Types
assert_eq!(rholang.classify_token("Int", "simple_type"), TokenType::TypeName);

// Variables
assert_eq!(rholang.classify_token("myVar", "var"), TokenType::Identifier);
```

### Identifier Validation

Rholang identifiers can include apostrophes (for mathematical notation):

```rust
let rholang = Rholang::new();

// Valid Rholang identifiers
assert!(rholang.is_valid_identifier("foo"));
assert!(rholang.is_valid_identifier("bar123"));
assert!(rholang.is_valid_identifier("_foo"));
assert!(rholang.is_valid_identifier("x'"));      // With apostrophe
assert!(rholang.is_valid_identifier("foo'bar")); // Apostrophe in middle

// Invalid identifiers
assert!(!rholang.is_valid_identifier("_"));      // Wildcard only
assert!(!rholang.is_valid_identifier("123foo")); // Starts with digit
assert!(!rholang.is_valid_identifier("@foo"));   // Starts with @
```
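
The cases above suggest three rules: the first character is a letter or underscore, later characters may also be digits or apostrophes, and a lone `_` is reserved as the wildcard. A minimal sketch of those rules follows. The function name is hypothetical, and the treatment of anything not covered by the assertions above (for example a leading apostrophe, which is rejected here) is an assumption.

```rust
/// Approximate check for a Rholang-style variable name.
fn is_rholang_identifier_like(s: &str) -> bool {
    // A lone underscore is the wildcard pattern, not a name.
    if s == "_" {
        return false;
    }
    let mut chars = s.chars();
    match chars.next() {
        // First character: underscore or a letter.
        Some(first) if first == '_' || first.is_alphabetic() => {
            // Remaining characters: underscore, apostrophe, letters, or digits.
            chars.all(|c| c == '_' || c == '\'' || c.is_alphanumeric())
        }
        // Empty string, leading digit, leading `@`, and so on.
        _ => false,
    }
}
```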

### Example Rholang Code

```rholang
// A simple contract that echoes messages
new echo, stdout(`rho:io:stdout`) in {
  contract echo(@msg, return) = {
    return!(msg) |
    stdout!(["Echo:", msg])
  } |
  new ack in {
    echo!("Hello", *ack) |
    for (@response <- ack) {
      stdout!(["Response:", response])
    }
  }
}
```

---

## MeTTa

MeTTa (Meta Type Talk) is a functional meta-programming language designed for knowledge representation, reasoning, and AI systems. It features hypergraph-based data structures and powerful pattern matching.

### Core Concepts

- **Atoms**: Basic units (symbols, variables, expressions)
- **Expressions**: S-expression lists `(expr expr ...)`
- **Variables**: Pattern variables prefixed with `$`
- **Spaces**: Atomspace references prefixed with `&`
- **Types**: Gradual typing with `:` annotations

### Usage

```rust
use libgrammstein::code::MeTTa;

let metta = MeTTa::new();

// Access language information
assert_eq!(metta.name(), "metta");
assert_eq!(metta.display_name(), "MeTTa");
```

### Characteristics

| Property | Value |
|----------|-------|
| File Extensions | `.metta`, `.mt` |
| Whitespace Significant | No |
| Comment Syntax | `;` (line), `;;` (doc) |
| Paradigm | Functional, homoiconic |

### Keywords

```
True    False   match   let     let*    if      case    function  return
empty   Error   Type    Atom    Symbol  Variable  Expression  Grounded
Unit    Number  String  Bool
new-space  add-atom  remove-atom  get-atoms  import!  include  bind!  pragma!
sequential  chain  eval  quote  unquote
```

### Special Tokens

```rust
// Prefix operators
"!"   // Reduction/evaluation
"?"   // Query
"'"   // Quote

// Variable prefix
"$"   // Pattern variable marker

// Space reference prefix
"&"   // Atomspace reference (e.g., &self)

// Type annotation
":"   // Type annotation

// Assignment/binding
"="   // Definition/equality
":="  // Rule definition

// Arrows
"->"  // Function type / transformation
```

### Built-in Types

```rust
// Core types
["Type", "Atom", "Symbol", "Variable", "Expression", "Grounded"]

// Primitive types
["Number", "String", "Bool", "Unit"]

// Collection types
["List", "Tuple"]

// Function types
["Function", "->"]

// Special types
["%Undefined%", "%Irreducible%"]
```

### Token Classification

```rust
let metta = MeTTa::new();

// Boolean literals
assert_eq!(metta.classify_token("True", "boolean_literal"), TokenType::BooleanLiteral);

// Numeric literals
assert_eq!(metta.classify_token("42", "integer_literal"), TokenType::NumericLiteral);
assert_eq!(metta.classify_token("3.14", "float_literal"), TokenType::NumericLiteral);

// Variables
assert_eq!(metta.classify_token("$x", "variable"), TokenType::Identifier);

// Keywords
assert_eq!(metta.classify_token("match", "identifier"), TokenType::Keyword);

// Space references
assert_eq!(metta.classify_token("&self", "space_reference"), TokenType::Special);
```

### Identifier Validation

MeTTa's identifier syntax is permissive: symbols may contain hyphens and operator characters, but not delimiters such as parentheses:

```rust
let metta = MeTTa::new();

// Valid MeTTa identifiers (symbols)
assert!(metta.is_valid_identifier("foo"));
assert!(metta.is_valid_identifier("my-function"));  // Hyphens allowed
assert!(metta.is_valid_identifier("+"));            // Operators as symbols
assert!(metta.is_valid_identifier("->"));           // Arrow as symbol
assert!(metta.is_valid_identifier("_"));            // Wildcard

// Variables
assert!(metta.is_valid_identifier("$x"));
assert!(metta.is_valid_identifier("$var"));
assert!(metta.is_valid_identifier("$"));            // Empty variable

// Space references
assert!(metta.is_valid_identifier("&self"));
assert!(metta.is_valid_identifier("&kb"));
assert!(!metta.is_valid_identifier("&"));           // Empty not allowed

// Invalid identifiers
assert!(!metta.is_valid_identifier(""));            // Empty
assert!(!metta.is_valid_identifier("(foo"));        // Starts with delimiter
assert!(!metta.is_valid_identifier("foo(bar"));     // Contains delimiter
```
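
These rules can be summarized as "anything non-empty without a delimiter, with the `$` and `&` prefixes selecting a kind". The sketch below encodes that reading. The `AtomKind` enum and `atom_kind` function are hypothetical names for this example, and the delimiter set beyond parentheses (whitespace, double quote, semicolon) is an assumption.

```rust
/// Hypothetical classification of a MeTTa token by its prefix.
#[derive(Debug, PartialEq, Eq)]
enum AtomKind {
    Variable,       // `$x`, and also the bare `$`
    SpaceReference, // `&self`, `&kb`
    Symbol,         // everything else, including `+` and `->`
}

/// Characters assumed to end a token. Only parentheses are confirmed above.
fn is_delimiter(c: char) -> bool {
    c.is_whitespace() || matches!(c, '(' | ')' | '"' | ';')
}

/// Returns the kind of a valid token, or `None` if the text is not valid.
fn atom_kind(s: &str) -> Option<AtomKind> {
    if s.is_empty() || s.chars().any(is_delimiter) {
        return None;
    }
    if s.starts_with('$') {
        return Some(AtomKind::Variable);
    }
    if let Some(name) = s.strip_prefix('&') {
        // A space reference needs a name after the `&`.
        return if name.is_empty() {
            None
        } else {
            Some(AtomKind::SpaceReference)
        };
    }
    Some(AtomKind::Symbol)
}
```

Under this sketch `atom_kind("$")` is a `Variable` while `atom_kind("&")` is `None`, matching the asymmetry in the assertions above.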

### Example MeTTa Code

```metta
; Declare the function's type signature
(: add-numbers (-> Number Number Number))

; Define a function
(= (add-numbers $x $y)
   (+ $x $y))

; Recursive function with a conditional
(= (factorial $n)
   (if (== $n 0)
       1
       (* $n (factorial (- $n 1)))))

; Using atomspaces
!(bind! &kb
  (new-space))

!(add-atom &kb (Person "Alice"))
!(add-atom &kb (knows (Person "Alice") (Person "Bob")))

; Query the space
!(match &kb
  (knows (Person "Alice") $who)
  $who)
```

---

## Comparison Table

| Feature | Python | Rust | JavaScript | Rholang | MeTTa |
|---------|--------|------|------------|---------|-------|
| Paradigm | OOP/Functional | Systems | Multi-paradigm | Concurrent | Functional |
| Whitespace | Significant | No | No | No | No |
| Line Comment | `#` | `//` | `//` | `//` | `;` |
| Block Comment | `"""` | `/* */` | `/* */` | `/* */` | None |
| Doc Comment | `#` | `///` | `///` | `///` | `;;` |
| Variable Prefix | None | None | `$` optional | None | `$` |
| Type Annotation | `:` | `:` | TypeScript | `:` | `:` |
| Unicode Identifiers | Yes | Yes | Yes | Yes | Yes |
| Raw Identifiers | No | `r#` | No | No | No |
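
The file extensions listed in each language's Characteristics table are enough to pick a language for a given path. The sketch below shows one way to do that with the standard library. The `language_for_path` function is a hypothetical helper, not part of the documented API, and it matches extensions case-sensitively.

```rust
use std::path::Path;

/// Maps a file path to a language name using the extensions listed in this document.
fn language_for_path(path: &str) -> Option<&'static str> {
    // `extension()` returns None for files without an extension.
    let ext = Path::new(path).extension()?.to_str()?;
    match ext {
        "py" | "pyw" | "pyi" => Some("python"),
        "rs" => Some("rust"),
        "js" | "jsx" | "mjs" | "cjs" => Some("javascript"),
        "rho" => Some("rholang"),
        "metta" | "mt" => Some("metta"),
        _ => None,
    }
}
```

The returned strings match the values reported by `name()` in the usage examples, so the result could be used as a key into a registry of language implementations.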

## Thread Safety

All language implementations are `Send + Sync` and can be safely shared across threads:

```rust
use std::sync::Arc;
use std::thread;

use libgrammstein::code::{Python, Rust};

let python = Arc::new(Python::new());
let rust = Arc::new(Rust::new());

let handles: Vec<_> = vec![
    {
        let lang = Arc::clone(&python);
        thread::spawn(move || lang.keywords().len())
    },
    {
        let lang = Arc::clone(&rust);
        thread::spawn(move || lang.keywords().len())
    },
];

for handle in handles {
    let count = handle.join().unwrap();
    println!("Keywords: {}", count);
}
```
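
A `Send + Sync` guarantee can also be checked at compile time, and a `Sync` value can be shared by reference with scoped threads instead of `Arc`. The sketch below demonstrates both on a stand-in type. `ToyLanguage`, `assert_send_sync`, and `keyword_counts` are hypothetical names used only here; substitute a real language type to apply the same check.

```rust
use std::thread;

/// Stand-in for a language implementation (hypothetical).
struct ToyLanguage {
    keywords: &'static [&'static str],
}

/// Compiles only if `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

/// Reads the keyword count from `workers` threads that all borrow `lang`.
fn keyword_counts(lang: &ToyLanguage, workers: usize) -> Vec<usize> {
    // Compile-time check; this call does nothing at runtime.
    assert_send_sync::<ToyLanguage>();

    // Scoped threads may borrow `lang` because they are joined before
    // `thread::scope` returns.
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| s.spawn(|| lang.keywords.len()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

If a type ever stops being `Send + Sync`, for instance by gaining an `Rc` field, the `assert_send_sync` call fails to compile rather than failing at runtime.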

## See Also

- [Language Framework](language.md) - `CodeLanguage` trait and `TokenType` system
- [AST](ast.md) - Tree-sitter integration
- [Tokenizer](tokenizer.md) - Token extraction from source code
- [Correction](correction.md) - How token types affect correction strategies