Module interpretability


Mechanistic Interpretability Module

“Digitális Agysebészet” (“Digital Brain Surgery”) - Neuron-level monitoring and activation analysis for understanding AI decision-making at the deepest level.

Features

  • Neuron Activation Tracking: Monitor individual neuron activations
  • Attention Head Analysis: Understand what the model is “looking at”
  • Circuit Discovery: Find computational circuits in the model
  • Feature Attribution: Trace decisions back to input features
  • Probing Classifiers: Test for internal representations
  • Activation Patching: Surgical intervention in model computation

Philosophy

“Nem elég tudni MIT csinál az AI - tudnunk kell MIÉRT.” (It’s not enough to know WHAT the AI does - we must know WHY.)

Structs

ActivationPatch
Activation patch for surgical intervention
ActivationSnapshot
Snapshot of activations at a point in time
AnalysisProof
Cryptographic proof of interpretability analysis
AttentionHead
Attention head representation
Circuit
A computational circuit in the model
FeatureAttribution
Feature attribution for a decision
InterpretabilityEngine
The main interpretability engine
InterpretabilityReport
Report summarizing an interpretability analysis
InterpretabilityStats
Statistics for interpretability engine
ModelInfo
Model information
Neuron
Represents a single neuron in the model
NeuronStats
Statistics for a single neuron
PatchEffect
Effect of an activation patch
ProbeResult
Probing classifier result
RiskFactor
Risk factor in safety analysis
SafetyAnalysis
Result of safety analysis

Enums

AttentionType
Types of attention patterns we can detect
CircuitFunction
Types of circuits we can identify
RiskType
Types of risks we can detect