TensorParallelTrainer: weight-sharded matmul. Each replica owns a
slice of the weight matrix. The activations are split to match,
each shard runs a partial matmul on its slice, and the partial
results are summed via AllReduce to form the full output.
F6 ships the public surface plus a host-side reference
implementation. Each shard implements ShardProtocol, which
receives a partial input slice and returns that shard's partial
output. The trainer collects all partials and sums them.
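
The collect-and-sum flow described above can be sketched on the host as follows. This is a minimal illustration, not the crate's implementation: the `Shard` trait, `RowShard`, `reference_forward`, and `example` are hypothetical names, since the real `ShardProtocol` signature is not shown here. It also assumes the weight matrix is split along its input (row) dimension, so that partial outputs have full width and can be summed element-wise.

```rust
/// Hypothetical stand-in for `ShardProtocol`; the real trait's method
/// name and signature are assumptions made for this sketch.
trait Shard {
    /// Takes this shard's slice of the input and returns a full-width
    /// partial output.
    fn partial_forward(&self, input_slice: &[f32]) -> Vec<f32>;
}

/// Hypothetical shard owning a contiguous block of weight rows,
/// stored row-major as `rows * out_dim` values.
struct RowShard {
    weights: Vec<f32>,
    out_dim: usize,
}

impl Shard for RowShard {
    fn partial_forward(&self, input_slice: &[f32]) -> Vec<f32> {
        let mut out = vec![0.0; self.out_dim];
        for (i, &x) in input_slice.iter().enumerate() {
            let row = &self.weights[i * self.out_dim..(i + 1) * self.out_dim];
            for (o, &w) in out.iter_mut().zip(row) {
                *o += x * w;
            }
        }
        out
    }
}

/// Host-side reference: hand each shard its input slice, then sum the
/// partial outputs. The summation loop plays the role of AllReduce(sum).
fn reference_forward(shards: &[RowShard], input: &[f32]) -> Vec<f32> {
    let out_dim = shards.first().map_or(0, |s| s.out_dim);
    let mut total = vec![0.0f32; out_dim];
    let mut offset = 0;
    for shard in shards {
        // Number of weight rows this shard owns equals its input slice length.
        let rows = shard.weights.len() / shard.out_dim;
        let partial = shard.partial_forward(&input[offset..offset + rows]);
        for (t, p) in total.iter_mut().zip(&partial) {
            *t += p;
        }
        offset += rows;
    }
    total
}

/// Usage: a 3x2 weight matrix [[1,2],[3,4],[5,6]] split across two shards.
fn example() -> Vec<f32> {
    let shards = vec![
        RowShard { weights: vec![1.0, 2.0, 3.0, 4.0], out_dim: 2 }, // rows 0..2
        RowShard { weights: vec![5.0, 6.0], out_dim: 2 },           // row 2
    ];
    reference_forward(&shards, &[1.0, 1.0, 1.0])
}
```

With an all-ones input, the summed partials equal the column sums of the full matrix, which matches the unsharded matmul. Splitting along the output dimension instead would call for concatenating partials rather than summing them, so the summation described above is what points to a row-wise split.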