aprender 0.31.2

<!-- PCU: ml-fundamentals-graph-neural-networks | contract: contracts/apr-page-ml-fundamentals-graph-neural-networks-v1.yaml -->
<!-- Example: cargo run -p aprender-core --example none -->
<!-- Status: enforced -->

# Graph Neural Networks Theory

Graph Neural Networks (GNNs) extend deep learning to graph-structured data, enabling learning on social networks, molecules, knowledge graphs, and more.

## Why Graphs?

Many real-world systems are naturally graphs:

| Domain | Nodes | Edges |
|--------|-------|-------|
| Social Networks | Users | Friendships |
| Molecules | Atoms | Bonds |
| Knowledge Graphs | Entities | Relations |
| Citation Networks | Papers | Citations |
| Traffic | Intersections | Roads |

Traditional neural networks require fixed-size inputs. GNNs handle:
- Variable number of nodes
- Variable node connectivity
- Permutation invariance (node ordering doesn't matter)

## Message Passing Framework

Most GNNs follow the **message passing** paradigm:

```
For each layer:
  1. AGGREGATE: Collect messages from neighbors
  2. UPDATE: Transform aggregated messages
  3. COMBINE: Merge with node's own features
```

Mathematically:

```
h_v^(l+1) = UPDATE(h_v^(l), AGGREGATE({h_u^(l) : u ∈ N(v)}))
```

Where:
- h_v^(l) = node v's representation at layer l
- N(v) = neighbors of node v

## Graph Convolutional Network (GCN)

Kipf & Welling (2017) introduced GCN with symmetric normalization:

```
H^(l+1) = σ(D̃^(-1/2) Ã D̃^(-1/2) H^(l) W^(l))
```

Where:
- Ã = A + I (adjacency with self-loops)
- D̃ = degree matrix of Ã
- W = learnable weight matrix
- σ = activation function

**Per-node formulation:**

```
h_i' = σ(Σⱼ (1/√(dᵢdⱼ)) · W · hⱼ)
```

The normalization 1/√(dᵢdⱼ) prevents feature explosion in high-degree nodes.

## Graph Attention Network (GAT)

Velickovic et al. (2018) introduced attention to learn edge importance:

```
α_ij = softmax_j(LeakyReLU(aᵀ[Wh_i || Wh_j]))
h_i' = σ(Σⱼ α_ij · W · hⱼ)
```

**Multi-head attention:**

```
h_i' = ||ₖ₌₁ᴷ σ(Σⱼ α_ij^k · W^k · hⱼ)
```

Where || denotes concatenation across K attention heads.

**Advantages over GCN:**
- Learns which neighbors are important
- Handles heterogeneous graphs better
- More expressive aggregation

## GraphSAGE

Hamilton et al. (2017) introduced sampling and aggregation:

```
h_N(v) = AGGREGATE({h_u : u ∈ Sample(N(v), k)})
h_v' = σ(W · [h_v || h_N(v)])
```

**Aggregation functions:**
- Mean: h_N(v) = mean({h_u})
- Max-pooling: h_N(v) = max({σ(W_pool · h_u)})
- LSTM: h_N(v) = LSTM({h_u}) (permutation variant)

**Key innovation:** Samples fixed-size neighborhood for scalability.

## Comparison of GNN Architectures

| Architecture | Aggregation | Normalization | Attention |
|--------------|-------------|---------------|-----------|
| GCN | Sum | Symmetric | No |
| GAT | Weighted sum | None | Yes |
| GraphSAGE | Mean/Max/LSTM | None | No |
| GIN | Sum + MLP | None | No |

## Expressive Power

**Weisfeiler-Lehman Test:**
GNNs are at most as powerful as the 1-WL graph isomorphism test.

```
Two nodes get same embedding if and only if
they have the same 1-WL color after k iterations.
```

**Graph Isomorphism Network (GIN):**
Xu et al. (2019) designed maximally expressive GNN:

```
h_v' = MLP((1 + ε) · h_v + Σⱼ h_j)
```

This achieves theoretical maximum expressiveness under the WL framework.

## Over-Smoothing Problem

**Issue:** Deep GNNs make all node embeddings converge:

```
Layer 1: h_v distinct
Layer 2: h_v similar to neighbors
Layer 3: h_v similar to 2-hop neighbors
...
Layer k: All h_v nearly identical
```

**Solutions:**
1. **Skip connections:** h' = h + GNN(h)
2. **Jumping Knowledge:** Concat all layer outputs
3. **DropEdge:** Randomly remove edges during training
4. **PairNorm:** Normalize to maintain separation

## Node Classification

**Task:** Predict labels for nodes given partial labels.

```
Input: Graph G, node features X, labels Y_L for subset L
Output: Labels for unlabeled nodes
```

**Architecture:**

```
X → GCN → ReLU → Dropout → GCN → Softmax → Ŷ
```

**Loss:** Cross-entropy on labeled nodes:

```
L = -Σᵢ∈L Σc y_ic · log(ŷ_ic)
```

## Graph Classification

**Task:** Predict label for entire graphs.

```
Input: Set of graphs {G_1, G_2, ...} with labels
Output: Graph-level classifier
```

**Readout (pooling):**

```
h_G = READOUT({h_v : v ∈ G})
```

Common readouts:
- Mean: h_G = mean(h_v)
- Sum: h_G = Σ h_v
- Set2Set: Attention-based
- DiffPool: Hierarchical clustering

## Link Prediction

**Task:** Predict missing edges.

```
Input: Graph with some edges removed
Output: Score for each potential edge
```

**Scoring function:**

```
score(u, v) = h_u · h_v  (dot product)
score(u, v) = MLP([h_u || h_v])  (neural)
```

## Heterogeneous Graphs

Graphs with multiple node/edge types:

```
RGCN: h_v' = σ(Σᵣ Σⱼ (1/|N_r(v)|) · Wᵣ · hⱼ)
```

Where r indexes relation types.

## Temporal Graphs

Graphs evolving over time:

```
h_v^(t+1) = GNN(h_v^(t), Graph^(t))
```

Combine GNN with sequence models (LSTM, Transformer).

## Computational Considerations

### Mini-batching

Sampling strategies for large graphs:
1. **Node sampling:** Random subset of nodes
2. **Layer sampling:** Sample neighbors per layer (GraphSAGE)
3. **Subgraph sampling:** Extract connected subgraphs

### Sparse Operations

Use sparse matrix operations for efficiency:

```
# Dense: O(N²) memory
H' = A @ H @ W

# Sparse: O(E) memory
H' = sparse_mm(A, H) @ W
```

## Implementation Notes

### Edge Index Format

COO (Coordinate) format:

```
edge_index = [(0, 1), (1, 2), (2, 0), ...]
             source   target
```

### Self-Loops

Adding self-loops (A + I):
- Ensures node's own features contribute
- Prevents information loss in disconnected nodes
- Required for GCN normalization

## References

- Kipf, T. N., & Welling, M. (2017). "Semi-Supervised Classification with Graph Convolutional Networks." ICLR.
- Velickovic, P., et al. (2018). "Graph Attention Networks." ICLR.
- Hamilton, W. L., et al. (2017). "Inductive Representation Learning on Large Graphs." NeurIPS.
- Xu, K., et al. (2019). "How Powerful are Graph Neural Networks?" ICLR.