Module tensor_parallel

TensorParallelTrainer: weight-sharded matmul. Each replica owns a slice of the weight matrix. The activations are split to match, each shard runs a partial matmul on its slice, and the partial results are summed via AllReduce.
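
The summation step can be read off a block decomposition of the matmul. This is a sketch under one assumption: the weight matrix is split along the inner dimension it shares with the activations, which is the layout in which partial outputs are summed rather than concatenated. The symbols X, W, Y and the shard count n are illustrative and do not come from the crate.

```latex
Y = XW
  = \begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}
    \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix}
  = \sum_{i=1}^{n} X_i W_i
```

Here shard i holds the row block W_i and receives the matching column block X_i of the activations. Each product X_i W_i already has the full output shape, so AllReduce only needs an elementwise sum.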

F6 ships the public surface plus a host-side reference implementation. Each shard implements ShardProtocol, which receives a partial input slice and returns that shard's partial output. The trainer collects all partials and sums them.
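
The collect-and-sum flow described above might look roughly like the following on the host. This is a minimal sketch, not the crate's actual API: the trait methods `forward_partial` and `in_features`, the `HostShard` type, and the free function `forward_sum` are hypothetical names chosen for illustration, and the real `ShardProtocol` signature may differ. It handles a single input vector and assumes each shard owns a contiguous block of weight rows.

```rust
/// Hypothetical shape of the shard trait; the real `ShardProtocol` may differ.
pub trait ShardProtocol {
    /// Takes this shard's slice of the input and returns a full-width partial output.
    fn forward_partial(&self, input_slice: &[f32]) -> Vec<f32>;
    /// Number of input features (weight rows) this shard owns.
    fn in_features(&self) -> usize;
}

/// Host-side shard holding a contiguous block of weight rows in row-major order.
pub struct HostShard {
    weights: Vec<f32>, // rows * out_features entries
    rows: usize,
    out_features: usize,
}

impl HostShard {
    pub fn new(weights: Vec<f32>, rows: usize, out_features: usize) -> Self {
        assert_eq!(weights.len(), rows * out_features);
        Self { weights, rows, out_features }
    }
}

impl ShardProtocol for HostShard {
    fn forward_partial(&self, input_slice: &[f32]) -> Vec<f32> {
        assert_eq!(input_slice.len(), self.rows);
        let mut out = vec![0.0f32; self.out_features];
        // Accumulate x[k] * W[k, :] over the rows this shard owns.
        for (k, &x) in input_slice.iter().enumerate() {
            let row = &self.weights[k * self.out_features..(k + 1) * self.out_features];
            for (o, &w) in out.iter_mut().zip(row) {
                *o += x * w;
            }
        }
        out
    }

    fn in_features(&self) -> usize {
        self.rows
    }
}

/// Stand-in for the trainer's forward pass: split the input across shards,
/// collect each partial output, and sum them elementwise (a host-side AllReduce).
pub fn forward_sum(shards: &[Box<dyn ShardProtocol>], input: &[f32]) -> Vec<f32> {
    let mut offset = 0;
    let mut total: Vec<f32> = Vec::new();
    for shard in shards {
        let n = shard.in_features();
        let partial = shard.forward_partial(&input[offset..offset + n]);
        offset += n;
        if total.is_empty() {
            total = partial;
        } else {
            assert_eq!(total.len(), partial.len());
            for (t, p) in total.iter_mut().zip(&partial) {
                *t += *p;
            }
        }
    }
    // Every input feature must be consumed by exactly one shard.
    assert_eq!(offset, input.len());
    total
}
```

To use it, build one `HostShard` per row block of the weight matrix, box the shards into a `Vec<Box<dyn ShardProtocol>>`, and pass the full input vector to `forward_sum`. The result should match an unsharded matmul against the stacked weights, up to floating-point summation order.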