Module interpretability


Mechanistic Interpretability Module

“Digitális Agysebészet” (“Digital Brain Surgery”) - Neuron-level monitoring and activation analysis for understanding AI decision-making at the deepest level.

Features

  • Neuron Activation Tracking: Monitor individual neuron activations
  • Attention Head Analysis: Understand what the model is “looking at”
  • Circuit Discovery: Find computational circuits in the model
  • Feature Attribution: Trace decisions back to input features
  • Probing Classifiers: Test for internal representations
  • Activation Patching: Surgical intervention in model computation

Philosophy

“Nem elég tudni MIT csinál az AI - tudnunk kell MIÉRT.” (It’s not enough to know WHAT the AI does - we must know WHY.)

Structs

ActivationPatch
Activation patch for surgical intervention
ActivationSnapshot
Snapshot of activations at a point in time
AnalysisProof
Cryptographic proof of interpretability analysis
AttentionHead
Attention head representation
Circuit
A computational circuit in the model
FeatureAttribution
Feature attribution for a decision
InterpretabilityEngine
The main interpretability engine
InterpretabilityReport
Report summarizing an interpretability analysis
InterpretabilityStats
Statistics for interpretability engine
ModelInfo
Model information
Neuron
Represents a single neuron in the model
NeuronStats
Statistics for a single neuron
PatchEffect
Effect of an activation patch
ProbeResult
Probing classifier result
RiskFactor
Risk factor in safety analysis
SafetyAnalysis
Result of safety analysis

Enums

AttentionType
Types of attention patterns we can detect
CircuitFunction
Types of circuits we can identify
RiskType
Types of risks we can detect