# A* Search Implementation in ciphey

The A* search algorithm in ciphey uses a sophisticated heuristic system for prioritizing decoder paths. This document details the mathematical foundations and implementation specifics.

## Core Algorithm

The A* implementation uses the standard f = g + h formula where:
- g = depth in search tree (cost so far)
- h = heuristic value (estimated cost to goal)
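
For example (with purely illustrative numbers), a candidate reached after two decodings with an estimated remaining cost of h = 1.4 is queued with f = 2 + 1.4 = 3.4, while a shallower candidate at depth 1 with a weaker heuristic of h = 2.9 scores f = 3.9 and is expanded later, since lower f values are explored first.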

## Execution Flow

1. **Initial Full Run**
   - All decoders are run on the input text first
   - This ensures we don't miss any simple single-decoder solutions
   - Results are checked immediately for success
   - Failed results inform initial heuristic calculations

2. **A* Search Phase**
   - If no immediate solution found, begin A* search
   - Process "decoder-tagged" decoders first at each node
   - Queue remaining decoders based on heuristic scores
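
The sketch below shows one way the A* phase's priority ordering can be expressed in Rust. The `QueuedNode` type and its fields are illustrative placeholders, not ciphey's actual structs; the point is that the queue pops the candidate with the lowest f = g + h first.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Illustrative queue entry; the type and field names are placeholders,
// not ciphey's actual structs.
#[derive(Debug)]
struct QueuedNode {
    text: String,
    f_score: f32, // f = g + h for this candidate
}

impl PartialEq for QueuedNode {
    fn eq(&self, other: &Self) -> bool {
        self.f_score == other.f_score
    }
}
impl Eq for QueuedNode {}

impl PartialOrd for QueuedNode {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// BinaryHeap is a max-heap, so the comparison is reversed to pop the
// candidate with the lowest f-score (most promising node) first.
impl Ord for QueuedNode {
    fn cmp(&self, other: &Self) -> Ordering {
        other
            .f_score
            .partial_cmp(&self.f_score)
            .unwrap_or(Ordering::Equal)
    }
}

fn main() {
    let mut open_set = BinaryHeap::new();
    open_set.push(QueuedNode { text: "candidate A".into(), f_score: 2.7 });
    open_set.push(QueuedNode { text: "candidate B".into(), f_score: 1.4 });
    // "candidate B" is expanded first because its estimated total cost is lower.
    println!("{:?}", open_set.pop());
}
```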

## Heuristic Calculation

### Mathematical Model

The heuristic function h(n) is calculated through a series of multiplicative penalties applied to a base score:

\[h(n) = b * p_s * p_r * p_p * p_q * p_c\]

Where:
- b = base score from Cipher Identifier (0.0-1.0)
- p_s = sequence penalty
- p_r = success rate penalty
- p_p = popularity penalty
- p_q = quality penalty
- p_c = character set penalty
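
As a purely illustrative calculation (these values are hypothetical, not measured output): a candidate with base score b = 0.6 whose last decoder forms an uncommon sequence (p_s = 1.25), has a 50% success rate (p_r = 1.5) and popularity 0.5 (p_p = 2.0), and whose output has quality q = 0.8 (p_q = 1.2) with no non-printable characters (p_c = 1.0) receives:

\[h(n) = 0.6 \times 1.25 \times 1.5 \times 2.0 \times 1.2 \times 1.0 = 2.7\]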

### Base Score (b)
Generated by Cipher Identifier, normalized to [0.0, 1.0]. For unknown patterns, a random value between 0.5 and 1.0 is assigned to maintain exploration.
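
A minimal sketch of the unknown-pattern fallback, assuming the `rand` crate; the function name is illustrative and the actual Cipher Identifier call is not shown:

```rust
use rand::Rng;

// Hypothetical fallback when Cipher Identifier cannot classify the input.
// A score in [0.5, 1.0) keeps the path in play so it still gets explored.
fn fallback_base_score() -> f32 {
    rand::thread_rng().gen_range(0.5..1.0)
}
```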

### Sequence Penalty (p_s)
\[p_s = \begin{cases} 
1.25 & \text{if sequence is uncommon} \\
1.0 & \text{otherwise}
\end{cases}\]

### Success Rate Penalty (p_r)
For a decoder with success rate r:
\[p_r = 1.0 + (1.0 - r)\]

This creates a linear penalty scaling with failure rate:
- r = 1.0 → p_r = 1.0 (no penalty)
- r = 0.5 → p_r = 1.5 (50% penalty)
- r = 0.0 → p_r = 2.0 (100% penalty)

### Popularity Penalty (p_p)
For a decoder with popularity p:
\[p_p = 1.0 + (2.0 * (1.0 - p))\]

This creates an aggressive penalty for unpopular decoders:
- p = 1.0 → p_p = 1.0 (no penalty)
- p = 0.5 → p_p = 2.0 (100% penalty)
- p = 0.1 → p_p = 2.8 (180% penalty)

### Quality Penalty (p_q)
For a string with quality q:
\[p_q = 1.0 + (1.0 - q)\]

Where q is calculated based on string length l:
\[q = \begin{cases}
0.1 & \text{if } l < 3 \\
0.3 & \text{if } l > 5000 \\
1.0 - \frac{|l - 100|}{900} & \text{otherwise}
\end{cases}\]
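
A direct transcription of the piecewise formula above into Rust; this is a sketch of what `calculate_string_quality` (used in the heuristic listing below) could look like, not necessarily ciphey's exact implementation:

```rust
// Sketch of the piecewise quality formula; mirrors the cases above.
fn calculate_string_quality(text: &str) -> f32 {
    let len = text.len();
    if len < 3 {
        0.1
    } else if len > 5000 {
        0.3
    } else {
        // Peaks at 1.0 for strings close to 100 characters.
        1.0 - ((len as f32 - 100.0).abs() / 900.0)
    }
}
```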

### Character Set Penalty (p_c)
For a string with non-printable ratio r:
\[p_c = \begin{cases}
1.0 & \text{if } r = 0 \\
1.0 + e^{100r} & \text{otherwise}
\end{cases}\]
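
The sketch below pairs a plausible definition of the non-printable ratio with the exponential penalty above; the ratio calculation is an assumption, not necessarily how ciphey counts characters:

```rust
// Assumed definition: the fraction of characters that are neither ASCII
// graphic nor whitespace.
fn calculate_non_printable_ratio(text: &str) -> f32 {
    let total = text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let non_printable = text
        .chars()
        .filter(|c| !c.is_ascii_graphic() && !c.is_ascii_whitespace())
        .count();
    non_printable as f32 / total as f32
}

// The exponential penalty from the formula above: even r = 0.05 already
// yields 1 + e^5 ≈ 149, effectively burying the path in the queue.
fn character_set_penalty(text: &str) -> f32 {
    let r = calculate_non_printable_ratio(text);
    if r == 0.0 {
        1.0
    } else {
        1.0 + (100.0 * r).exp()
    }
}
```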

### Implementation

The heuristic calculation is implemented as follows:

```rust
fn generate_heuristic(text: &str, path: &[CrackResult]) -> f32 {
    // Base score from Cipher Identifier
    let (cipher, base_score) = get_cipher_identifier_score(text);
    
    let mut final_score = base_score;

    if let Some(last_result) = path.last() {
        // Penalize uncommon sequences
        if !is_common_sequence(last_result.decoder, &cipher) {
            final_score *= 1.25; // 25% penalty
        }

        // Penalize low success rates
        let success_rate = get_decoder_success_rate(last_result.decoder);
        final_score *= 1.0 + (1.0 - success_rate); // Penalty scales with failure rate
        
        // Penalize decoders with low popularity
        let popularity = get_decoder_popularity(last_result.decoder);
        final_score *= 1.0 + (2.0 * (1.0 - popularity)); // Penalty scales with unpopularity
    }

    // Penalize low quality strings
    final_score *= 1.0 + (1.0 - calculate_string_quality(text));

    // Penalize non-printable characters
    let non_printable_ratio = calculate_non_printable_ratio(text);
    if non_printable_ratio > 0.0 {
        final_score *= 1.0 + (non_printable_ratio * 100.0).exp();
    }

    final_score
}
```

## Decoder Popularity Ratings

Decoders are assigned static popularity ratings based on their frequency of use in real-world scenarios:

```text
Base64, Hexadecimal, Binary, rot13, rot47 → 1.0
Base32, Vigenere → 0.8
Base58 → 0.7
Base85, SimpleSubstitution → 0.5
Base91 → 0.3
Citrix CTX1 → 0.1
Unknown decoders → 0.5
```
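
One way to express the table above is a simple match on the decoder name; the function name and string matching are illustrative rather than ciphey's actual API:

```rust
// Illustrative lookup for the static popularity table above.
fn decoder_popularity(name: &str) -> f32 {
    match name {
        "Base64" | "Hexadecimal" | "Binary" | "rot13" | "rot47" => 1.0,
        "Base32" | "Vigenere" => 0.8,
        "Base58" => 0.7,
        "Base85" | "SimpleSubstitution" => 0.5,
        "Base91" => 0.3,
        "Citrix CTX1" => 0.1,
        _ => 0.5, // unknown decoders default to a neutral rating
    }
}
```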

## Memory Management

The search space is pruned using a quality-based retention system:

1. When the number of seen strings exceeds the threshold T:
   - Calculate quality scores for all strings
   - Sort by quality descending
   - Keep top 50% highest quality strings
   - Adjust T based on search progress (a worked example follows this list):
     \[T_{new} = T_{initial} - (5000 * \frac{depth}{MAX\_DEPTH})\]

2. Early string filtering:
   - Strings ≤ 2 characters are immediately rejected
   - Non-printable character ratio affects priority but doesn't cause rejection
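
As a worked example with hypothetical numbers (ciphey's actual defaults are not stated here): with T_initial = 10000, depth = 50, and MAX_DEPTH = 100, the adjusted threshold becomes

\[T_{new} = 10000 - (5000 \times \tfrac{50}{100}) = 7500\]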

### Implementation

```rust
if seen_count > prune_threshold {
    // Quality-based pruning
    let mut quality_scores: Vec<(String, f32)> = seen_strings
        .iter()
        .map(|s| (s.clone(), calculate_string_quality(s)))
        .collect();
        
    // Sort by quality descending, then keep the top 50% highest quality strings
    quality_scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    let keep_count = quality_scores.len() / 2;
    seen_strings = quality_scores
        .into_iter()
        .take(keep_count)
        .map(|(s, _)| s)
        .collect();

    // Dynamic threshold adjustment
    prune_threshold = INITIAL_PRUNE_THRESHOLD 
        - (progress_factor * 5000.0) as usize;
}
```

## Performance Optimizations

1. **Initial Full Run**
   - All decoders attempt decoding immediately
   - Early exit if simple solution found
   - Failed attempts inform heuristic calculations

2. **Decoder Tagging**
   - "decoder"-tagged decoders are executed immediately at each node
   - Non-tagged decoders are queued for future exploration
   - Prevents wasted computation on unlikely paths

3. **Reciprocal Prevention**
   - Tracks reciprocal decoders (e.g., encode/decode pairs)
   - Prevents consecutive application of reciprocal operations
   - Reduces search space by eliminating obvious cycles (see the sketch after this list)

4. **Statistical Learning**
   - Success rates are tracked per decoder
   - Rates influence future path priorities through p_r penalty
   - Adapts to decoder performance during search

5. **Dynamic Pruning**
   - Threshold adjusts with search depth
   - More aggressive pruning at deeper levels
   - Balances memory usage with search completeness
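
As a sketch of the reciprocal-prevention rule (item 3 above), assuming decoders are identified by name and a flag marks reciprocal ones; the function and parameter names are illustrative:

```rust
// Skip a candidate decoder if it is reciprocal and identical to the decoder
// that produced the current node: applying it again would only undo that step.
fn allowed_next(previous: Option<&str>, candidate: &str, candidate_is_reciprocal: bool) -> bool {
    match previous {
        Some(prev) if candidate_is_reciprocal && prev == candidate => false,
        _ => true,
    }
}
```

For example, `allowed_next(Some("rot13"), "rot13", true)` returns `false`, pruning the obvious cycle of applying rot13 twice in a row.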

## Implementation Notes

Because the penalties are multiplicative, they compound:
- A decoder with poor scores in multiple categories is severely penalized
- Good scores in some categories can partially offset poor scores in others
- The exponential penalty for non-printable characters acts as a strong deterrent for binary/corrupted paths

The system is tuned to favor:
1. Common, reliable decoders (high popularity, high success rate)
2. Clean text output (low non-printable ratio)
3. Reasonable length strings (near 100 characters)
4. Known decoder sequences (based on common patterns)