§Mechanistic Interpretability Module
“Digitális Agysebészet” (Digital Brain Surgery) - Neuron-level monitoring and activation analysis for understanding AI decision-making at the deepest level.
§Features
- Neuron Activation Tracking: Monitor individual neuron activations
- Attention Head Analysis: Understand what the model is “looking at”
- Circuit Discovery: Find computational circuits in the model
- Feature Attribution: Trace decisions back to input features
- Probing Classifiers: Test for internal representations
- Activation Patching: Surgical intervention in model computation
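To make the last feature concrete, the sketch below illustrates the idea behind activation patching on a toy single layer: record the baseline activations, overwrite one neuron's activation with a counterfactual value, and measure the change in output. All types and values here are illustrative assumptions, not this module's API.

```rust
// Illustrative sketch of activation patching on a toy "layer".
// Hypothetical code, not this crate's API.

/// A toy layer: output = weighted sum of neuron activations.
fn layer_output(activations: &[f32], weights: &[f32]) -> f32 {
    activations.iter().zip(weights).map(|(a, w)| a * w).sum()
}

fn main() {
    let weights = [0.5_f32, -1.0];

    // Baseline forward pass: record activations, then compute the output.
    let baseline_acts = [2.0_f32, 1.0];
    let baseline_out = layer_output(&baseline_acts, &weights);

    // Surgical intervention: patch neuron 1's activation to a counterfactual value.
    let mut patched_acts = baseline_acts;
    patched_acts[1] = 0.0;
    let patched_out = layer_output(&patched_acts, &weights);

    // The patch effect quantifies how much that single neuron drove the output.
    let patch_effect = patched_out - baseline_out;
    println!("baseline={baseline_out} patched={patched_out} effect={patch_effect}");
}
```

The same record/patch/compare loop generalizes to real layers: the effect of a patch is the difference between the patched and baseline forward passes.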
§Philosophy
“Nem elég tudni MIT csinál az AI - tudnunk kell MIÉRT.” (It’s not enough to know WHAT the AI does - we must know WHY.)
Structs§
- ActivationPatch - Activation patch for surgical intervention
- ActivationSnapshot - Snapshot of activations at a point in time
- AnalysisProof - Cryptographic proof of interpretability analysis
- AttentionHead - Attention head representation
- Circuit - A computational circuit in the model
- FeatureAttribution - Feature attribution for a decision
- InterpretabilityEngine - The main interpretability engine
- InterpretabilityReport - Interpretability report
- InterpretabilityStats - Statistics for the interpretability engine
- ModelInfo - Model information
- Neuron - Represents a single neuron in the model
- NeuronStats - Statistics for a single neuron
- PatchEffect - Effect of an activation patch
- ProbeResult - Probing classifier result
- RiskFactor - Risk factor in safety analysis
- SafetyAnalysis - Result of safety analysis
Enums§
- AttentionType - Types of attention patterns we can detect
- CircuitFunction - Types of circuits we can identify
- RiskType - Types of risks we can detect
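To illustrate how an enum like `AttentionType` might be used, the sketch below classifies a head from its attention weights with a crude heuristic. The variants and the classifier are entirely hypothetical; the crate's actual variants are not shown on this page.

```rust
// Hypothetical variants and classifier; not this crate's actual definitions.
#[derive(Debug, PartialEq)]
enum AttentionType {
    /// Head mostly attends to the immediately preceding token.
    PreviousToken,
    /// Head spreads attention roughly uniformly.
    Uniform,
    /// Anything else.
    Other,
}

/// Crude classifier over one head's attention weights
/// (one row of weights per query position).
fn classify_head(rows: &[Vec<f32>]) -> AttentionType {
    // Average mass placed on the previous token across query positions.
    let prev_mass: f32 = rows
        .iter()
        .enumerate()
        .skip(1)
        .map(|(q, row)| row[q - 1])
        .sum::<f32>()
        / (rows.len() - 1) as f32;
    if prev_mass > 0.8 {
        AttentionType::PreviousToken
    } else if rows
        .iter()
        .all(|r| r.iter().all(|&w| (w - 1.0 / r.len() as f32).abs() < 0.05))
    {
        AttentionType::Uniform
    } else {
        AttentionType::Other
    }
}

fn main() {
    // A head that consistently looks at the previous token.
    let head = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.9, 0.1, 0.0],
        vec![0.0, 0.95, 0.05],
    ];
    assert_eq!(classify_head(&head), AttentionType::PreviousToken);
}
```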