pub fn value_matmul(a: &Value, b: &Value) -> Result<Value, String>
GPU-aware matmul entry point: if both inputs are GpuTensor handles, the call is dispatched to the provider; otherwise it falls back to the CPU implementation.
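
The dispatch rule described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the real definitions are not shown here: the `Tensor`, `GpuHandle`, and `Value` types, the `Provider` trait with its `matmul` and `download` methods, the `ToyProvider` in-memory "device", and the explicit provider parameter on `value_matmul_sketch` (the real `value_matmul` takes only the two values, so it presumably obtains its provider some other way).

```rust
use std::cell::RefCell;

// All types below are hypothetical stand-ins; the real definitions are not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f64>, // row-major storage, an assumption of this sketch
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GpuHandle(pub usize);

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Tensor(Tensor),
    GpuTensor(GpuHandle),
}

/// Assumed provider interface: device-side matmul plus a copy back to the host.
pub trait Provider {
    fn matmul(&self, a: GpuHandle, b: GpuHandle) -> Result<GpuHandle, String>;
    fn download(&self, h: GpuHandle) -> Result<Tensor, String>;
}

/// Naive row-major CPU matmul used as the fallback path.
pub fn cpu_matmul(a: &Tensor, b: &Tensor) -> Result<Tensor, String> {
    if a.cols != b.rows {
        return Err(format!(
            "shape mismatch: {}x{} * {}x{}",
            a.rows, a.cols, b.rows, b.cols
        ));
    }
    let mut data = vec![0.0; a.rows * b.cols];
    for i in 0..a.rows {
        for k in 0..a.cols {
            let aik = a.data[i * a.cols + k];
            for j in 0..b.cols {
                data[i * b.cols + j] += aik * b.data[k * b.cols + j];
            }
        }
    }
    Ok(Tensor { rows: a.rows, cols: b.cols, data })
}

/// Copy a value to the host if it lives on the device.
fn to_host(p: &dyn Provider, v: &Value) -> Result<Tensor, String> {
    match v {
        Value::Tensor(t) => Ok(t.clone()),
        Value::GpuTensor(h) => p.download(*h),
    }
}

/// Sketch of the dispatch; the provider is passed explicitly (an assumption).
pub fn value_matmul_sketch(p: &dyn Provider, a: &Value, b: &Value) -> Result<Value, String> {
    match (a, b) {
        // Both operands are device handles: let the provider do the work.
        (Value::GpuTensor(ha), Value::GpuTensor(hb)) => p.matmul(*ha, *hb).map(Value::GpuTensor),
        // Anything else: bring operands to the host and multiply on the CPU.
        _ => cpu_matmul(&to_host(p, a)?, &to_host(p, b)?).map(Value::Tensor),
    }
}

/// Toy in-memory "device" that exists only to exercise the sketch.
pub struct ToyProvider {
    slots: RefCell<Vec<Tensor>>,
}

impl ToyProvider {
    pub fn new() -> Self {
        ToyProvider { slots: RefCell::new(Vec::new()) }
    }

    /// Store a tensor and hand back its slot index as a handle.
    pub fn upload(&self, t: Tensor) -> GpuHandle {
        let mut slots = self.slots.borrow_mut();
        slots.push(t);
        GpuHandle(slots.len() - 1)
    }
}

impl Provider for ToyProvider {
    fn matmul(&self, a: GpuHandle, b: GpuHandle) -> Result<GpuHandle, String> {
        let out = cpu_matmul(&self.download(a)?, &self.download(b)?)?;
        Ok(self.upload(out))
    }

    fn download(&self, h: GpuHandle) -> Result<Tensor, String> {
        self.slots
            .borrow()
            .get(h.0)
            .cloned()
            .ok_or_else(|| format!("unknown handle {}", h.0))
    }
}
```

In this sketch a mixed pair (one GpuTensor handle, one host tensor) takes the CPU path after downloading the device operand. That is one possible reading of "otherwise fall back to CPU"; the actual function may handle mixed operands differently.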