embedding 0.1.0

A Rust library and CLI for training embeddings from scratch
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
# SPEC - Embedding Trainer Specification

## 1. Overview

### 1.1. Purpose
This document specifies the requirements, architecture, and implementation details for the Rust Embedding Trainer library and CLI tool. The project provides a comprehensive solution for training word embeddings from scratch using multiple algorithms including Skip-gram, CBOW, and Sentence-BERT approaches.

### 1.2. Scope
- **Core library**: Implementation of embedding training algorithms
- **CLI interface**: Command-line interface for training and inference
- **Data processing**: Text preprocessing and vocabulary management
- **Model persistence**: Save/load models and embeddings
- **Performance**: Efficient training and inference capabilities
- **Extensibility**: Plugin system for custom algorithms

### 1.3. Definitions
- **Embedding**: Dense vector representation of words in continuous space
- **Skip-gram**: Neural network architecture that predicts context words from target words
- **CBOW**: Continuous Bag of Words architecture that predicts target words from context
- **Sentence-BERT**: Transformer-based approach for sentence embeddings
- **Vocabulary**: Mapping of words to unique integer identifiers
- **Context window**: Number of surrounding words used for training

## 2. Requirements

### 2.1. Functional Requirements

#### 2.1.1. Training Algorithms
- **FR-001**: Skip-gram algorithm implementation
  - Support configurable context window size
  - Negative sampling capability
  - Efficient gradient calculation
  - Configurable embedding dimensions

- **FR-002**: CBOW (Continuous Bag of Words) algorithm
  - Support configurable context window size
  - Efficient gradient calculation
  - Configurable embedding dimensions
  - Batch processing support

- **FR-003**: Sentence-BERT style training
  - Mean pooling for sentence embeddings
  - Support for transformer-based architectures
  - Configurable embedding dimensions
  - Contextual word representation

#### 2.1.2. Data Processing
- **FR-004**: Text preprocessing
  - Lowercase conversion
  - Punctuation removal
  - Tokenization by whitespace
  - Empty sentence filtering

- **FR-005**: Vocabulary management
  - Build vocabulary from text data
  - Word-to-index mapping
  - Vocabulary size limits
  - Out-of-vocabulary handling

- **FR-006**: Data loading
  - File-based text input
  - Multiple format support (text, JSON)
  - Streaming support for large datasets

#### 2.1.3. Model Management
- **FR-007**: Model training
  - Configurable training parameters
  - Epoch-based training
  - Learning rate scheduling
  - Progress monitoring

- **FR-008**: Model persistence
  - Save/load complete models
  - Embedding export to multiple formats
  - Metadata storage (training parameters, vocabulary)
  - Compression support for large models

- **FR-009**: Similarity computation
  - Cosine similarity calculation
  - Efficient vector operations
  - Batch similarity queries
  - Similarity ranking

#### 2.1.4. CLI Interface
- **FR-010**: Training command
  - Input file specification
  - Output configuration
  - Training parameter tuning
  - Model type selection

- **FR-011**: Similarity command
  - Word pair similarity
  - Batch similarity queries
  - Similarity ranking output
  - Configuration file support

- **FR-012**: Model inspection
  - Model metadata display
  - Vocabulary inspection
  - Embedding statistics
  - Training history

- **FR-013**: Export functionality
  - Multiple export formats (text, JSON, binary)
  - Embedding format options
  - Vocabulary export
  - Model metadata export

### 2.2. Non-Functional Requirements

#### 2.2.1. Performance
- **NF-001**: Training efficiency
  - Optimize for large datasets
  - Parallel processing support
  - Memory-efficient operations
  - GPU acceleration (future)

- **NF-002**: Memory usage
  - Memory-mapped file support
  - Streaming processing
  - Efficient data structures
  - Garbage collection optimization

- **NF-003**: Speed requirements
  - Real-time similarity queries (< 10ms)
  - Efficient batch processing
  - Fast model loading (< 1s for 1M vocab)

#### 2.2.2. Reliability
- **NF-004**: Error handling
  - Comprehensive error types
  - Graceful failure modes
  - Recovery mechanisms
  - Detailed error messages

- **NF-005**: Data integrity
  - Input validation
  - Consistent model format
  - Checksum verification
  - Atomic file operations

#### 2.2.3. Usability
- **NF-006**: CLI usability
  - Intuitive command structure
  - Comprehensive help system
  - Progress indicators
  - Clear error messages

- **NF-007**: API usability
  - Clear documentation
  - Type-safe API
  - Comprehensive examples
  - Consistent naming conventions

#### 2.2.4. Maintainability
- **NF-008**: Code quality
  - Comprehensive test coverage (> 90%)
  - Clean architecture separation
  - Documentation requirements
  - Code review process

- **NF-009**: Documentation
  - API documentation
  - User guides
  - Development documentation
  - Performance benchmarks

### 2.3. Constraints

#### 2.3.1. Technical Constraints
- **TC-001**: Must use Rust programming language
- **TC-002**: Must use ndarray for numerical computations
- **TC-003**: Must support Windows, macOS, and Linux
- **TC-004**: Must be compatible with Rust 1.70+
- **TC-005**: Must use standard file formats for data exchange

#### 2.3.2. Business Constraints
- **BC-001**: Open source license (MIT)
- **BC-002**: No external API dependencies
- **BC-003**: Must work offline
- **BC-004**: Must be self-contained single binary

## 3. Architecture

### 3.1. System Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CLI Layer     │    │  Library Layer  │    │  Data Layer     │
│                 │    │                 │    │                 │
│┌─────────────┐  │    │┌─────────────┐  │    │┌─────────────┐  │
││   Parser    │  │    ││   Config    │  │    ││   Text Data │  │
│├─────────────┤  │    │├─────────────┤  │    │├─────────────┤  │
││ Commands    │  │    ││ Training    │  │    ││ Vocabulary   │  │
│├─────────────┤  │    ││ Algorithms  │  │    ││ Processing   │  │
││ Interface   │  │    │├─────────────┤  │    │├─────────────┤  │
│└─────────────┘  │    ││ Embedding  │  │    ││ I/O         │  │
│                 │    ││ Model       │  │    │└─────────────┘  │
│┌─────────────┐  │    │└─────────────┘  │    │                 │
││  CLI Logic  │  │    │┌─────────────┐  │    │┌─────────────┐  │
│├─────────────┤  │    ││ Serialization│  │    ││ Persistence  │  │
││ Error       │  │    ││├─────────────┤  │    ││├─────────────┤  │
││ Handling    │  │    │││  JSON/Bin  │  │    │││   Models    │  │
│└─────────────┘  │    │││  Formats   │  │    ││├─────────────┤  │
└─────────────────┘    ││└─────────────┘  │    │││   Embeddings│  │
                       ││┌─────────────┐  │    ││└─────────────┘  │
                       │││ Utilities   │  │    │└─────────────┘  │
                       ││├─────────────┤  │    │                 │
                       │││ Text Process│  │    └─────────────────┘
                       │││ Similarity  │  │
                       │││ Helpers     │  │
                       ││└─────────────┘  │
                       └─────────────────┘
```

### 3.2. Component Architecture

#### 3.2.1. Core Library Components

**TrainingConfig**
```rust
pub struct TrainingConfig {
    pub embedding_dim: usize,          // Output embedding dimension
    pub learning_rate: f64,           // Learning rate for training
    pub epochs: usize,                 // Number of training epochs
    pub batch_size: usize,             // Mini-batch size
    pub context_window: usize,        // Context window size
    pub negative_samples: usize,      // Number of negative samples
    pub model_type: ModelType,        // Algorithm to use
}
```

**TrainingData**
```rust
pub struct TrainingData {
    pub sentences: Vec<Vec<String>>,   // Tokenized sentences
    pub vocab: HashMap<String, usize>, // Word to index mapping
    pub reverse_vocab: Vec<String>,  // Index to word mapping
}
```

**EmbeddingModel**
```rust
pub struct EmbeddingModel {
    pub embeddings: Array2<f32>,      // Word embedding matrix
    pub config: TrainingConfig,       // Training configuration
    pub vocab_size: usize,            // Vocabulary size
}
```

#### 3.2.2. CLI Components

**Command Parser**
```rust
#[derive(Subcommand)]
enum Commands {
    Train { /* Training parameters */ },
    Similarity { /* Similarity query */ },
    Info { /* Model inspection */ },
    Export { /* Export functionality */ },
}
```

#### 3.2.3. Data Processing Components

**Text Processor**
- Tokenization: Split text into words
- Normalization: Lowercase, punctuation removal
- Vocabulary building: Create word-to-index mappings

**Embedding Trainer**
- Skip-gram implementation
- CBOW implementation
- Sentence-BERT implementation

### 3.3. Data Flow

#### 3.3.1. Training Flow
```
Input Text → Text Processing → Vocabulary Building → Model Initialization → Training → Model Output
    ↓               ↓                  ↓                   ↓               ↓              ↓
Raw text → Cleaned text → Word mappings → Random embeddings → Gradient updates → Trained model
```

#### 3.3.2. Inference Flow
```
Query Words → Vocabulary lookup → Embedding retrieval → Similarity calculation → Results
    ↓              ↓                   ↓                  ↓                ↓
Text input → Index mapping → Vector extraction → Mathematical computation → Output
```

## 4. Implementation Details

### 4.1. Algorithm Details

#### 4.1.1. Skip-gram Algorithm
**Objective**: Maximize log probability of context words given target word

**Mathematical Formulation**:
```
maximize Σ_{t=1 to T} Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t; θ)
```

**Gradient Calculation**:
```
∂J/∂θ = -Σ_{t,j} [y - P(w_{t+j} | w_t)] ∂P/∂θ
```

**Implementation Notes**:
- Hierarchical softmax or negative sampling for efficiency
- Gradient descent with configurable learning rate
- Context window parameter for training

#### 4.1.2. CBOW Algorithm
**Objective**: Maximize log probability of target word given context words

**Mathematical Formulation**:
```
maximize Σ_{t=1 to T} log P(w_t | w_{t-m}, ..., w_{t+m}; θ)
```

**Gradient Calculation**:
```
∂J/∂θ = -Σ_{t} [y - P(w_t | context)] ∂P/∂θ
```

**Implementation Notes**:
- Average context word embeddings
- Similar gradient structure to Skip-gram
- Context window parameter for training

#### 4.1.3. Sentence-BERT Algorithm
**Objective**: Generate sentence-level embeddings using transformer architecture

**Mathematical Formulation**:
```
E(sentence) = mean(P(word_i) for word_i in sentence)
```

**Implementation Notes**:
- Mean pooling over word embeddings
- Contextual representation using transformer layers
- Sentence similarity based on cosine distance

### 4.2. Performance Considerations

#### 4.2.1. Memory Optimization
- Use ndarray for efficient array operations
- Implement lazy loading for large datasets
- Memory-mapped file support for embeddings
- Batch processing to reduce memory overhead

#### 4.2.2. Computational Efficiency
- Parallel processing with Rayon
- Vectorized operations using ndarray
- Efficient similarity calculations
- Cache-friendly data access patterns

#### 4.2.3. Algorithm Optimizations
- Negative sampling for Skip-gram
- Hierarchical softmax alternatives
- Learning rate scheduling
- Early stopping mechanisms

### 4.3. Error Handling

#### 4.3.1. Error Types
```rust
pub enum TrainingError {
    InvalidInput(String),
    VocabularyError(String),
    ModelError(String),
    IOError(String),
    ConfigurationError(String),
}
```

#### 4.3.2. Error Recovery
- Graceful handling of missing words
- Fallback mechanisms for corrupted data
- Transactional file operations
- Consistent state management

## 5. Testing Strategy

### 5.1. Unit Testing
- **Algorithm testing**: Verify mathematical correctness
- **Data processing**: Test tokenization and vocabulary building
- **Model operations**: Test embedding operations and similarity
- **Serialization**: Test save/load functionality

### 5.2. Integration Testing
- **End-to-end training**: Complete training workflow
- **CLI functionality**: Test all command interfaces
- **File I/O**: Test loading and saving different formats
- **Error scenarios**: Test edge cases and error handling

### 5.3. Performance Testing
- **Training benchmarks**: Speed and memory usage
- **Query performance**: Similarity calculation speed
- **Large dataset handling**: Scalability testing
- **Memory usage patterns**: Peak memory consumption

### 5.4. Test Data
- **Small datasets**: For unit testing
- **Medium datasets**: For integration testing
- **Large datasets**: For performance testing
- **Real-world data**: For validation testing

## 6. Deployment

### 6.1. Build Requirements
- Rust toolchain 1.70+
- Cargo build system
- Standard C compiler for native dependencies
- Cross-platform compilation support

### 6.2. Distribution
- Binary distribution via crates.io
- Source code on GitHub
- Docker container for easy deployment
- Platform-specific packages for Linux, macOS, Windows

### 6.3. Documentation
- API documentation with rustdoc
- User guides and tutorials
- Performance benchmarks
- Contributing guidelines

## 7. Security Considerations

### 7.1. Input Validation
- File path validation
- Memory safety guarantees
- Buffer overflow protection
- Integer overflow handling

### 7.2. Data Protection
- Secure file handling
- Memory wiping for sensitive data
- Access control mechanisms
- Audit logging capabilities

### 7.3. Supply Chain Security
- Dependency vulnerability scanning
- Secure build process
- Code review requirements
- Integrity verification

## 8. Version Control

### 8.1. Version Numbering
- **Major**: Breaking changes
- **Minor**: New features, backward compatible
- **Patch**: Bug fixes, backward compatible

### 8.2. Change Management
- Semantic versioning compliance
- Detailed changelog maintenance
- Beta release process
- Deprecation policy

## 9. Maintenance

### 9.1. Code Quality
- Static code analysis
- Automated testing
- Code review process
- Refactoring guidelines

### 9.2. Performance Monitoring
- Benchmark tracking
- Memory usage monitoring
- CPU profiling
- Load testing

### 9.3. Community Support
- Issue tracking system
- Documentation updates
- User feedback integration
- Performance optimization

---

**Specification Version**: 1.0  
**Last Updated**: 2026-06-09  
**Review Date**: Quarterly  
**Approvals**: Technical Lead, Project Manager