pub fn build_program() -> Program
Builds the default megakernel IR: 256 lanes × 1 workgroup, with no custom opcodes.
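
The excerpt does not show the definition of `Program`, so the following is only a minimal sketch of what the documented defaults could look like. The struct, its fields (`lanes`, `workgroups`, `custom_opcodes`), and the `total_lanes` helper are all hypothetical stand-ins and not the crate's actual API. Only the values (256 lanes, 1 workgroup, no custom opcodes) come from the description above.

```rust
/// Stand-in for the real `Program` type, which is not shown in this excerpt.
/// Every field name below is an assumption made for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Program {
    /// Lanes per workgroup (hypothetical field).
    pub lanes: u32,
    /// Number of workgroups (hypothetical field).
    pub workgroups: u32,
    /// Custom opcode identifiers; empty for the default program (hypothetical field).
    pub custom_opcodes: Vec<u32>,
}

/// Sketch of the documented defaults: 256 lanes x 1 workgroup, no custom opcodes.
pub fn build_program() -> Program {
    Program {
        lanes: 256,
        workgroups: 1,
        custom_opcodes: Vec::new(),
    }
}

/// Hypothetical helper: total lanes across all workgroups.
pub fn total_lanes(program: &Program) -> u32 {
    program.lanes * program.workgroups
}
```

Under these assumptions, a caller would take the returned `Program` as a baseline and check or adjust its dispatch shape before submitting it. The real type may expose this information differently.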