# A* Search Implementation in ciphey
The A* search algorithm in ciphey prioritizes decoder paths using a multiplicative, multi-factor heuristic. This document details the mathematical foundations and implementation specifics.
## Core Algorithm
The A* implementation uses the standard f = g + h formula where:
- g = depth in search tree (cost so far)
- h = heuristic value (estimated cost to goal)
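Nodes are expanded in ascending order of f. A minimal sketch of that ordering using Rust's `BinaryHeap` (the `Node` type and its fields are illustrative, not ciphey's actual internals):
```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Illustrative A* node; field names are assumptions, not ciphey's types.
#[derive(PartialEq)]
struct Node {
    g: f32, // depth in the search tree (cost so far)
    h: f32, // heuristic estimate of remaining cost
}

impl Node {
    fn f(&self) -> f32 {
        self.g + self.h
    }
}

impl Eq for Node {}

impl Ord for Node {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap; compare the other way round so the
        // node with the lowest f = g + h pops first.
        other.f().partial_cmp(&self.f()).unwrap_or(Ordering::Equal)
    }
}

impl PartialOrd for Node {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut open = BinaryHeap::new();
    open.push(Node { g: 1.0, h: 0.8 });
    open.push(Node { g: 1.0, h: 0.3 });
    let best = open.pop().unwrap(); // the g = 1.0, h = 0.3 node: lowest f
    println!("expanding node with f = {}", best.f());
}
```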
## Execution Flow
1. **Initial Full Run**
- All decoders are run on the input text first
- This ensures we don't miss any simple single-decoder solutions
- Results are checked immediately for success
- Failed results inform initial heuristic calculations
2. **A* Search Phase**
- If no immediate solution found, begin A* search
- Process "decoder-tagged" decoders first at each node
- Queue remaining decoders based on heuristic scores
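A minimal sketch of this two-phase flow (the trait and helper functions are illustrative stand-ins, not ciphey's real API):
```rust
// High-level sketch of the two-phase flow; names are assumptions.
trait Decoder {
    fn attempt(&self, input: &str) -> Option<String>;
}

fn crack(input: &str, decoders: &[Box<dyn Decoder>]) -> Option<String> {
    // Phase 1: initial full run of every decoder on the raw input.
    for decoder in decoders {
        if let Some(out) = decoder.attempt(input) {
            if looks_like_plaintext(&out) {
                return Some(out); // early exit on a single-step solution
            }
        }
    }
    // Phase 2: no direct hit, so fall back to A* over decoder paths.
    astar_search(input, decoders)
}

// Stubs standing in for ciphey's plaintext checker and search loop.
fn looks_like_plaintext(_s: &str) -> bool {
    false
}
fn astar_search(_input: &str, _decoders: &[Box<dyn Decoder>]) -> Option<String> {
    None
}
```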
## Heuristic Calculation
### Mathematical Model
The heuristic function h(n) is calculated through a series of multiplicative penalties applied to a base score:
\[h(n) = b * p_s * p_r * p_p * p_q * p_c\]
Where:
- b = base score from Cipher Identifier (0.0-1.0)
- p_s = sequence penalty
- p_r = success rate penalty
- p_p = popularity penalty
- p_q = quality penalty
- p_c = character set penalty
### Base Score (b)
Generated by Cipher Identifier, normalized to [0.0, 1.0]. For unknown patterns, a random value between 0.5 and 1.0 is assigned to maintain exploration.
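A sketch of that fallback, assuming (purely for illustration) an `Option<f32>` identifier result and the `rand` crate:
```rust
use rand::Rng;

// Sketch of the base-score fallback; the Option<f32> signature is an
// illustrative simplification, not Cipher Identifier's real interface.
fn base_score(identifier_score: Option<f32>) -> f32 {
    match identifier_score {
        Some(s) => s.clamp(0.0, 1.0),
        // Unknown pattern: a random value in [0.5, 1.0) keeps the node
        // explorable without letting it dominate the queue.
        None => rand::thread_rng().gen_range(0.5..1.0),
    }
}
```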
### Sequence Penalty (p_s)
\[p_s = \begin{cases}
1.25 & \text{if sequence is uncommon} \\
1.0 & \text{otherwise}
\end{cases}\]
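A sketch of what the sequence check might look like; the pair table below is hypothetical, and the real list lives in the ciphey source:
```rust
// Hypothetical pair table; the real list lives in the ciphey source.
fn is_common_sequence(prev_decoder: &str, next_cipher: &str) -> bool {
    const COMMON_PAIRS: &[(&str, &str)] = &[
        ("Base64", "Base64"),      // nested base64 is common in practice
        ("Base64", "Hexadecimal"), // hex wrapped in base64
        ("rot13", "Base64"),
    ];
    COMMON_PAIRS.contains(&(prev_decoder, next_cipher))
}
```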
### Success Rate Penalty (p_r)
For a decoder with success rate r:
\[p_r = 1.0 + (1.0 - r)\]
This creates a linear penalty scaling with failure rate:
- r = 1.0 → p_r = 1.0 (no penalty)
- r = 0.5 → p_r = 1.5 (50% penalty)
- r = 0.0 → p_r = 2.0 (100% penalty)
### Popularity Penalty (p_p)
For a decoder with popularity p:
\[p_p = 1.0 + (2.0 * (1.0 - p))\]
This creates an aggressive penalty for unpopular decoders:
- p = 1.0 → p_p = 1.0 (no penalty)
- p = 0.5 → p_p = 2.0 (100% penalty)
- p = 0.1 → p_p = 2.8 (180% penalty)
### Quality Penalty (p_q)
For a string with quality q:
\[p_q = 1.0 + (1.0 - q)\]
Where q is calculated based on string length l:
\[q = \begin{cases}
0.1 & \text{if } l < 3 \\
0.3 & \text{if } l > 5000 \\
1.0 - \frac{|l - 100|}{900} & \text{otherwise}
\end{cases}\]
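A direct transcription of this piecewise definition:
```rust
// Direct transcription of the piecewise quality formula above.
fn calculate_string_quality(text: &str) -> f32 {
    let l = text.len();
    if l < 3 {
        0.1
    } else if l > 5000 {
        0.3
    } else {
        // Peaks at 1.0 for strings of length 100
        1.0 - ((l as f32 - 100.0).abs() / 900.0)
    }
}
```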
### Character Set Penalty (p_c)
For a string with non-printable ratio r:
\[p_c = \begin{cases}
1.0 & \text{if } r = 0 \\
1.0 + e^{100r} & \text{otherwise}
\end{cases}\]
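A sketch of the ratio calculation; the exact definition of "printable" in the ciphey source may differ:
```rust
// Sketch of the non-printable ratio used by the character-set penalty.
fn calculate_non_printable_ratio(text: &str) -> f32 {
    if text.is_empty() {
        return 0.0;
    }
    let total = text.chars().count();
    let non_printable = text
        .chars()
        .filter(|c| !c.is_ascii_graphic() && !c.is_ascii_whitespace())
        .count();
    non_printable as f32 / total as f32
}
```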
### Implementation
The heuristic calculation is implemented as follows:
```rust
fn generate_heuristic(text: &str, path: &[CrackResult]) -> f32 {
    // Base score (b) from Cipher Identifier, normalized to [0.0, 1.0]
    let (cipher, base_score) = get_cipher_identifier_score(text);
    let mut final_score = base_score;

    if let Some(last_result) = path.last() {
        // Sequence penalty (p_s): 25% penalty for uncommon decoder sequences
        if !is_common_sequence(last_result.decoder, &cipher) {
            final_score *= 1.25;
        }

        // Success rate penalty (p_r): scales linearly with failure rate
        let success_rate = get_decoder_success_rate(last_result.decoder);
        final_score *= 1.0 + (1.0 - success_rate);

        // Popularity penalty (p_p): aggressive penalty for unpopular decoders
        let popularity = get_decoder_popularity(last_result.decoder);
        final_score *= 1.0 + (2.0 * (1.0 - popularity));
    }

    // Quality penalty (p_q): penalize strings far from the ideal length
    final_score *= 1.0 + (1.0 - calculate_string_quality(text));

    // Character set penalty (p_c): exponential penalty for non-printable text
    let non_printable_ratio = calculate_non_printable_ratio(text);
    if non_printable_ratio > 0.0 {
        final_score *= 1.0 + (non_printable_ratio * 100.0).exp();
    }

    final_score
}
```
## Decoder Popularity Ratings
Decoders are assigned static popularity ratings based on their frequency of use in real-world scenarios:
| Decoder(s) | Popularity |
|------------|------------|
| Base64, Hexadecimal, Binary, rot13, rot47 | 1.0 |
| Base32, Vigenere | 0.8 |
| Base58 | 0.7 |
| Base85, SimpleSubstitution | 0.5 |
| Base91 | 0.3 |
| Citrix CTX1 | 0.1 |
| Unknown decoders | 0.5 |
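A static lookup matching this table might look like the following sketch (the string keys are assumptions about how ciphey names its decoders):
```rust
// Sketch of a static popularity lookup matching the table above.
fn get_decoder_popularity(decoder: &str) -> f32 {
    match decoder {
        "Base64" | "Hexadecimal" | "Binary" | "rot13" | "rot47" => 1.0,
        "Base32" | "Vigenere" => 0.8,
        "Base58" => 0.7,
        "Base85" | "SimpleSubstitution" => 0.5,
        "Base91" => 0.3,
        "Citrix CTX1" => 0.1,
        _ => 0.5, // unknown decoders get a neutral default
    }
}
```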
## Memory Management
The search space is pruned using a quality-based retention system:
1. When the number of seen strings exceeds a threshold T:
- Calculate quality scores for all strings
- Sort by quality descending
- Keep top 50% highest quality strings
- Adjust T based on search progress:
\[T_{new} = T_{initial} - (5000 * \frac{depth}{MAX\_DEPTH})\]
2. Early string filtering:
- Strings ≤ 2 characters are immediately rejected
- Non-printable character ratio affects priority but doesn't cause rejection
### Implementation
```rust
if seen_count > prune_threshold {
    // Quality-based pruning: score every retained string
    let mut quality_scores: Vec<(String, f32)> = seen_strings
        .iter()
        .map(|s| (s.clone(), calculate_string_quality(s)))
        .collect();

    // Sort by quality descending so the best strings come first
    quality_scores.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)
    });

    // Keep the top 50% highest-quality strings
    let keep_count = quality_scores.len() / 2;
    seen_strings = quality_scores
        .into_iter()
        .take(keep_count)
        .map(|(s, _)| s)
        .collect();

    // Dynamic threshold adjustment: prune more aggressively with depth
    prune_threshold = INITIAL_PRUNE_THRESHOLD - (progress_factor * 5000.0) as usize;
}
```
## Performance Optimizations
1. **Initial Full Run**
- All decoders attempt decoding immediately
- Early exit if simple solution found
- Failed attempts inform heuristic calculations
2. **Decoder Tagging**
- "decoder"-tagged decoders are executed immediately at each node
- Non-tagged decoders are queued for future exploration
- Prevents wasted computation on unlikely paths
3. **Reciprocal Prevention**
- Tracks reciprocal decoders (e.g., encode/decode pairs)
- Prevents consecutive application of reciprocal operations
- Reduces search space by eliminating obvious cycles (see the sketch after this list)
4. **Statistical Learning**
- Success rates are tracked per decoder
- Rates influence future path priorities through p_r penalty
- Adapts to decoder performance during search
5. **Dynamic Pruning**
- Threshold adjusts with search depth
- More aggressive pruning at deeper levels
- Balances memory usage with search completeness
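A sketch of the reciprocal check referenced in item 3 (the pair list and function names are illustrative):
```rust
// Sketch of reciprocal filtering; the decoder list is illustrative.
fn is_reciprocal_pair(prev: &str, next: &str) -> bool {
    // Self-inverse decoders undo themselves when applied twice in a row.
    const SELF_INVERSE: &[&str] = &["rot13", "rot47", "atbash"];
    prev == next && SELF_INVERSE.contains(&prev)
}

// Skip a candidate decoder if it would immediately undo the last step.
fn should_expand(prev_decoder: Option<&str>, next_decoder: &str) -> bool {
    match prev_decoder {
        Some(prev) => !is_reciprocal_pair(prev, next_decoder),
        None => true, // root node: nothing to undo yet
    }
}
```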
## Implementation Notes
Because the penalties are multiplicative, they compound:
- A decoder with poor scores in multiple categories is severely penalized
- Good scores in some categories can partially offset poor scores in others
- The exponential penalty for non-printable characters acts as a strong deterrent for binary/corrupted paths
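As an arbitrary worked example: an uncommon sequence (p_s = 1.25) through a decoder with success rate r = 0.5 (p_r = 1.5) and popularity p = 0.7 (p_p = 1.6), producing a clean string of quality q = 0.9 (p_q = 1.1, p_c = 1.0) from a base score of b = 0.8, gives:
\[h(n) = 0.8 * 1.25 * 1.5 * 1.6 * 1.1 * 1.0 = 2.64\]
That is more than triple the base score, pushing the node well down the priority queue even though no single penalty is extreme.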
The system is tuned to favor:
1. Common, reliable decoders (high popularity, high success rate)
2. Clean text output (low non-printable ratio)
3. Reasonable length strings (near 100 characters)
4. Known decoder sequences (based on common patterns)