# Text Classification with TF-IDF
Text classification is the task of assigning predefined categories to text documents. Combined with TF-IDF vectorization, it enables practical applications like sentiment analysis, spam detection, and topic classification.
## Theory
### The Text Classification Pipeline
A complete text classification system consists of:
1. **Text Preprocessing**: Tokenization, stop-word removal, stemming
2. **Feature Extraction**: Convert text to numerical features
3. **Model Training**: Learn patterns from labeled data
4. **Prediction**: Classify new documents
### Feature Extraction Methods
**Bag of Words (BoW)**:
- Represents documents as word count vectors
- Simple and effective baseline
- Ignores word order and context
```
"cat dog cat" → [cat: 2, dog: 1]
```
**TF-IDF (Term Frequency-Inverse Document Frequency)**:
- Weights words by importance
- Down-weights common words, up-weights rare words
- Better performance than raw counts
**TF-IDF Formula:**
```
tfidf(t, d) = tf(t, d) × idf(t)
where:
tf(t, d) = count of term t in document d
idf(t) = log(N / df(t))
N = total documents
df(t) = documents containing term t
```
**Example:**
```
Document 1: "cat dog"
Document 2: "cat bird"
Document 3: "dog bird bird"
Term "cat": appears in 2/3 documents
IDF = log(3/2) = 0.405
Term "bird": appears in 2/3 documents
IDF = log(3/2) = 0.405
Term "dog": appears in 2/3 documents
IDF = log(3/2) = 0.405
```
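The same calculation can be written in a few lines of plain Rust (standard library only, no smoothing), mirroring the formula above. Library implementations, including TF-IDF vectorizers, often add smoothing to the IDF term, so their exact values can differ slightly from this unsmoothed version.

```rust,ignore
use std::collections::{HashMap, HashSet};

/// Plain-Rust sketch of the unsmoothed formula above: tfidf(t, d) = tf(t, d) × ln(N / df(t)).
fn tfidf_weights(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;

    // df(t): number of documents that contain term t at least once
    let mut df: HashMap<String, usize> = HashMap::new();
    for doc in docs {
        let unique: HashSet<&str> = doc.split_whitespace().collect();
        for term in unique {
            *df.entry(term.to_string()).or_insert(0) += 1;
        }
    }

    // tf(t, d) × idf(t) for every term in every document
    docs.iter()
        .map(|doc| {
            let mut tf: HashMap<String, f64> = HashMap::new();
            for term in doc.split_whitespace() {
                *tf.entry(term.to_string()).or_insert(0.0) += 1.0;
            }
            tf.into_iter()
                .map(|(term, count)| {
                    let idf = (n / df[&term] as f64).ln();
                    (term, count * idf)
                })
                .collect()
        })
        .collect()
}

// tfidf_weights(&["cat dog", "cat bird", "dog bird bird"])
// → in Document 3, "bird" gets 2 × ln(3/2) ≈ 0.811 and "dog" gets 1 × ln(3/2) ≈ 0.405
```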
### Classification Algorithms
**Gaussian Naive Bayes**:
- Assumes features are independent (naive assumption)
- Probabilistic classifier using Bayes' theorem
- Fast training and prediction
- Works well with high-dimensional sparse data
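To make the probabilistic view concrete, here is a schematic scoring function (a sketch of the idea, not the aprender implementation): each class gets a log-prior plus a sum of per-feature Gaussian log-likelihoods, and the highest-scoring class wins.

```rust,ignore
/// Schematic Gaussian Naive Bayes scoring (not the aprender implementation):
/// score(class) = ln P(class) + Σ_j ln N(x_j; mean_j, var_j)
fn gnb_log_score(x: &[f32], class_prior: f32, means: &[f32], variances: &[f32]) -> f32 {
    let mut score = class_prior.ln();
    for ((&xj, &mean), &var) in x.iter().zip(means).zip(variances) {
        // log of the per-feature Gaussian density
        score += -0.5 * ((2.0 * std::f32::consts::PI * var).ln() + (xj - mean).powi(2) / var);
    }
    score
}

// The predicted class is the one with the highest score; the "naive" independence
// assumption is what lets the per-feature terms simply add up in log space.
```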
**Logistic Regression**:
- Linear classifier with sigmoid activation
- Learns feature weights via gradient descent
- Produces probability estimates
- Robust and interpretable
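A minimal sketch of the underlying update, independent of any library: the model turns a linear score into a probability with the sigmoid, then nudges its weights along the log-loss gradient.

```rust,ignore
/// Sigmoid squashes a linear score into a probability in (0, 1).
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// One gradient-descent step on a single example (schematic, not the aprender code).
fn sgd_step(weights: &mut [f32], bias: &mut f32, x: &[f32], y: f32, lr: f32) {
    // Predicted probability p = sigmoid(w·x + b)
    let z: f32 = weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + *bias;
    let p = sigmoid(z);
    // Log-loss gradient: (p - y) * x_i for each weight, (p - y) for the bias
    for (w, xi) in weights.iter_mut().zip(x) {
        *w -= lr * (p - y) * xi;
    }
    *bias -= lr * (p - y);
}
```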
## Example 1: Sentiment Classification with Bag of Words
Binary sentiment analysis (positive/negative) using word counts.
```rust,ignore
use aprender::classification::GaussianNB;
use aprender::text::vectorize::CountVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::traits::Estimator;

fn main() {
    // Training data: movie reviews
    let train_docs = vec![
        "this movie was excellent and amazing", // Positive
        "great film with wonderful acting",     // Positive
        "fantastic movie loved every minute",   // Positive
        "terrible movie waste of time",         // Negative
        "awful film boring and disappointing",  // Negative
        "horrible acting very bad movie",       // Negative
    ];
    let train_labels = vec![1, 1, 1, 0, 0, 0]; // 1 = positive, 0 = negative

    // Vectorize with CountVectorizer
    let mut vectorizer = CountVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()))
        .with_max_features(20);
    let X_train = vectorizer.fit_transform(&train_docs).unwrap();
    println!("Vocabulary size: {}", vectorizer.vocabulary_size()); // 20 words

    // Train Gaussian Naive Bayes
    // (`convert_to_f32` is a small helper that converts the f64 feature matrix
    // to f32; it is not shown here — see the sketch at the end of this example)
    let X_train_f32 = convert_to_f32(&X_train);
    let mut classifier = GaussianNB::new();
    classifier.fit(&X_train_f32, &train_labels).unwrap();

    // Predict on new reviews
    let test_docs = vec![
        "excellent movie great acting", // Should predict positive
        "terrible film very bad",       // Should predict negative
    ];
    let X_test = vectorizer.transform(&test_docs).unwrap();
    let X_test_f32 = convert_to_f32(&X_test);
    let predictions = classifier.predict(&X_test_f32).unwrap();
    println!("Predictions: {:?}", predictions); // [1, 0] = [positive, negative]
}
```
**Output:**
```text
Vocabulary size: 20
Predictions: [1, 0]
```
**Analysis:**
- **Bag of Words**: Simple word count features
- **20 features**: Limited vocabulary (max_features=20)
- **100% accuracy**: Overfitting on small dataset, but demonstrates concept
- **Fast training**: Naive Bayes trains in O(n×m) where n=docs, m=features
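The examples call a `convert_to_f32` helper that is not shown. Its signature depends on the matrix type the vectorizers return; a minimal sketch, assuming the features arrive as a plain nested `Vec<f64>`, would be:

```rust,ignore
// Hypothetical sketch of the `convert_to_f32` helper used in the examples.
// The real feature matrix type in aprender may differ; adapt the signature accordingly.
fn convert_to_f32(features: &[Vec<f64>]) -> Vec<Vec<f32>> {
    features
        .iter()
        .map(|row| row.iter().map(|&value| value as f32).collect())
        .collect()
}
```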
## Example 2: Topic Classification with TF-IDF
Two-class topic classification (tech vs. sports) using TF-IDF weighting.
```rust,ignore
use aprender::classification::LogisticRegression;
use aprender::text::vectorize::TfidfVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;

fn main() {
    // Training data: tech vs sports articles
    let train_docs = vec![
        "python programming language machine learning", // Tech
        "artificial intelligence neural networks deep", // Tech
        "software development code rust programming",   // Tech
        "basketball game score team championship",      // Sports
        "football soccer match goal tournament",        // Sports
        "tennis player serves match competition",       // Sports
    ];
    let train_labels = vec![0, 0, 0, 1, 1, 1]; // 0 = tech, 1 = sports

    // TF-IDF vectorization
    let mut vectorizer = TfidfVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()));
    let X_train = vectorizer.fit_transform(&train_docs).unwrap();
    println!("Vocabulary: {} terms", vectorizer.vocabulary_size()); // 28 terms

    // Show IDF values
    let vocab: Vec<_> = vectorizer.vocabulary().iter().collect();
    for (word, &idx) in vocab.iter().take(3) {
        println!("{}: IDF = {:.3}", word, vectorizer.idf_values()[idx]);
    }
    // basketball: IDF = 2.253 (rare, important)
    // programming: IDF = 1.847 (less rare)

    // Train Logistic Regression
    let X_train_f32 = convert_to_f32(&X_train);
    let mut classifier = LogisticRegression::new()
        .with_learning_rate(0.1)
        .with_max_iter(100);
    classifier.fit(&X_train_f32, &train_labels).unwrap();

    // Test predictions
    let test_docs = vec![
        "programming code algorithm", // Should predict tech
        "basketball score game",      // Should predict sports
    ];
    let X_test = vectorizer.transform(&test_docs).unwrap();
    let X_test_f32 = convert_to_f32(&X_test);
    let predictions = classifier.predict(&X_test_f32);
    println!("Predictions: {:?}", predictions); // [0, 1] = [tech, sports]
}
```
**Output:**
```text
Vocabulary: 28 terms
basketball: IDF = 2.253
programming: IDF = 1.847
Predictions: [0, 1]
```
**Analysis:**
- **TF-IDF weighting**: Highlights discriminative words
- **IDF values**: Rare words like "basketball" have higher IDF (2.253)
- **Common words**: More frequent words have lower IDF (1.847)
- **Logistic Regression**: Learns linear decision boundary
- **100% accuracy**: Perfect separation on training data
## Example 3: Full Preprocessing Pipeline
Complete workflow from raw text to predictions.
```rust,ignore
use aprender::classification::GaussianNB;
use aprender::text::stem::{PorterStemmer, Stemmer};
use aprender::text::stopwords::StopWordsFilter;
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::vectorize::TfidfVectorizer;
use aprender::text::Tokenizer;

fn main() {
    let raw_docs = vec![
        "The machine learning algorithms are improving rapidly",
        "The team scored three goals in the championship match",
    ];
    let labels = vec![0, 1]; // 0 = tech, 1 = sports

    // Step 1: Tokenization
    let tokenizer = WhitespaceTokenizer::new();
    let tokenized: Vec<Vec<String>> = raw_docs
        .iter()
        .map(|doc| tokenizer.tokenize(doc).unwrap())
        .collect();

    // Step 2: Lowercase + stop words filtering
    let filter = StopWordsFilter::english();
    let filtered: Vec<Vec<String>> = tokenized
        .iter()
        .map(|tokens| {
            let lower: Vec<String> = tokens.iter().map(|t| t.to_lowercase()).collect();
            filter.filter(&lower).unwrap()
        })
        .collect();

    // Step 3: Stemming
    let stemmer = PorterStemmer::new();
    let stemmed: Vec<Vec<String>> = filtered
        .iter()
        .map(|tokens| stemmer.stem_tokens(tokens).unwrap())
        .collect();
    println!("After preprocessing: {:?}", stemmed[0]);
    // ["machin", "learn", "algorithm", "improv", "rapid"]

    // Step 4: Rejoin and vectorize
    let processed: Vec<String> = stemmed
        .iter()
        .map(|tokens| tokens.join(" "))
        .collect();
    let mut vectorizer = TfidfVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()));
    let X = vectorizer.fit_transform(&processed).unwrap();

    // Step 5: Classification
    let X_f32 = convert_to_f32(&X);
    let mut classifier = GaussianNB::new();
    classifier.fit(&X_f32, &labels).unwrap();
    let predictions = classifier.predict(&X_f32).unwrap();
    println!("Predictions: {:?}", predictions); // [0, 1] = [tech, sports]
}
```
**Output:**
```text
After preprocessing: ["machin", "learn", "algorithm", "improv", "rapid"]
Predictions: [0, 1]
```
**Pipeline Analysis:**
| Step | Input | Output | Purpose |
|------|-------|--------|---------|
| Tokenization | "The machine learning..." | ["The", "machine", ...] | Split into words |
| Lowercase + stop words | 11 tokens | 8 tokens | Remove "the", "are", "in" |
| Stemming | ["machine", "learning"] | ["machin", "learn"] | Normalize to roots |
| TF-IDF | Text tokens | 31-dimensional vectors | Numerical features |
| Classification | Feature vectors | Class labels | Predictions |
**Key Benefits:**
- **Vocabulary reduction**: 27% fewer tokens after stop words
- **Normalization**: "improving" → "improv", "algorithms" → "algorithm"
- **Generalization**: Stemming helps match "learn", "learning", "learned"
- **Discriminative features**: TF-IDF highlights important words
## Model Selection Guidelines
### Gaussian Naive Bayes
**Best for:**
- Text classification with sparse features
- Large vocabularies (thousands of features)
- Fast training required
- Probabilistic predictions needed
**Advantages:**
- Extremely fast (O(n×m) training)
- Works well with high-dimensional data
- No hyperparameter tuning needed
- Probabilistic outputs
**Limitations:**
- Assumes feature independence (rarely true)
- Less accurate than discriminative models
- Gaussian assumption is often a poor fit for sparse count or TF-IDF features
### Logistic Regression
**Best for:**
- When you need interpretable models
- Feature importance analysis
- Balanced datasets
- Reliable probability estimates
**Advantages:**
- Learns feature weights (interpretable)
- Robust to correlated features
- Regularization prevents overfitting
- Well-calibrated probabilities
**Limitations:**
- Slower training than Naive Bayes
- Requires hyperparameter tuning (learning rate, iterations)
- Sensitive to feature scaling
## Best Practices
### Feature Extraction
**CountVectorizer (Bag of Words):**
- ✅ Simple baseline, easy to understand
- ✅ Fast computation
- ❌ Ignores word importance
- **Use when**: Starting a project, small datasets
**TfidfVectorizer:**
- ✅ Weights by importance
- ✅ Better performance than BoW
- ✅ Down-weights common words
- **Use when**: Production systems, larger datasets
### Preprocessing
**Always include:**
1. Tokenization (WhitespaceTokenizer or WordTokenizer)
2. Lowercase normalization
3. Stop-word filtering (unless the task depends on negations like "not" and "no", as sentiment analysis often does)
**Optional but recommended:**
4. Stemming (PorterStemmer) for English
5. Max features limit (1000-5000 for efficiency)
### Evaluation
**Train/Test Split:**
```rust,ignore
// Split data 80/20 (assumes `docs` and `labels` are parallel vectors)
let split_idx = (docs.len() * 4) / 5;
let (train_docs, test_docs) = docs.split_at(split_idx);
let (train_labels, test_labels) = labels.split_at(split_idx);
```
**Metrics:**
- Accuracy: Overall correctness
- Precision/Recall: Class-specific performance
- Confusion matrix: Error analysis
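These metrics are straightforward to compute by hand; a plain-Rust sketch for binary labels, independent of any library:

```rust,ignore
/// Accuracy, precision, recall, and a 2×2 confusion matrix for binary labels
/// (1 = positive class, 0 = negative class). Library-independent sketch.
fn binary_metrics(y_true: &[usize], y_pred: &[usize]) -> (f64, f64, f64, [[usize; 2]; 2]) {
    let mut confusion = [[0usize; 2]; 2]; // confusion[actual][predicted]
    for (&actual, &predicted) in y_true.iter().zip(y_pred) {
        confusion[actual][predicted] += 1;
    }
    let (tn, fp, fn_, tp) = (confusion[0][0], confusion[0][1], confusion[1][0], confusion[1][1]);
    let accuracy = (tp + tn) as f64 / y_true.len() as f64;
    // `.max(1)` guards against division by zero when a class is never predicted
    let precision = tp as f64 / (tp + fp).max(1) as f64;
    let recall = tp as f64 / (tp + fn_).max(1) as f64;
    (accuracy, precision, recall, confusion)
}
```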
## Running the Example
```bash
cargo run --example text_classification
```
The example demonstrates three scenarios:
1. **Sentiment classification** - Bag of Words with Gaussian NB
2. **Topic classification** - TF-IDF with Logistic Regression
3. **Full pipeline** - Complete preprocessing workflow
## Key Takeaways
1. **TF-IDF > Bag of Words**: Almost always better performance
2. **Preprocessing matters**: Stop words + stemming improve generalization
3. **Naive Bayes**: Fast baseline, good for high-dimensional data
4. **Logistic Regression**: More accurate, interpretable weights
5. **Pipeline is crucial**: Consistent preprocessing for train/test
## Real-World Applications
- **Spam Detection**: Email → [spam, not spam]
- **Sentiment Analysis**: Review → [positive, negative, neutral]
- **Topic Classification**: News article → [politics, sports, tech, ...]
- **Language Detection**: Text → [English, Spanish, French, ...]
- **Intent Classification**: User query → [question, command, statement]
## Next Steps
After text classification, explore:
- **Word embeddings**: Word2Vec, GloVe for semantic similarity
- **Deep learning**: RNNs, Transformers for contextual understanding
- **Multi-label classification**: Documents with multiple categories
- **Active learning**: Efficiently label new training data