tensorlogic-infer 0.1.0-beta.1

# Backend Development Tutorial

**Build Your First TensorLogic Backend in 30 Minutes**

This hands-on tutorial walks you through creating a minimal but functional TensorLogic backend from scratch.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Tutorial Overview](#tutorial-overview)
- [Part 1: Project Setup](#part-1-project-setup)
- [Part 2: Define Core Types](#part-2-define-core-types)
- [Part 3: Implement TlExecutor](#part-3-implement-tlexecutor)
- [Part 4: Testing](#part-4-testing)
- [Part 5: Optimization](#part-5-optimization)
- [Part 6: Advanced Features](#part-6-advanced-features)
- [Next Steps](#next-steps)
- [Troubleshooting](#troubleshooting)
- [Conclusion](#conclusion)

## Prerequisites

- Rust 1.70+ installed
- Basic understanding of Rust traits and generics
- Familiarity with tensor operations (optional but helpful)

**Estimated Time**: 30-45 minutes

## Tutorial Overview

We'll build **SimpleTensor**, a minimal CPU-based backend using `ndarray`. By the end, you'll have:

- ✅ A working `TlExecutor` implementation
- ✅ Support for basic operations (einsum, element-wise, reduce)
- ✅ Comprehensive tests
- ✅ Integration with the TensorLogic ecosystem

**What We Won't Cover** (but can be added later):
- GPU acceleration
- Automatic differentiation
- Distributed execution

## Part 1: Project Setup

### Step 1.1: Create the Project

```bash
cargo new --lib simple-tensor-backend
cd simple-tensor-backend
```

### Step 1.2: Add Dependencies

Edit `Cargo.toml`:

```toml
[package]
name = "simple-tensor-backend"
version = "0.1.0"
edition = "2021"

[dependencies]
tensorlogic-ir = "0.1"
tensorlogic-infer = "0.1"
ndarray = "0.15"
thiserror = "1.0"

[dev-dependencies]
tensorlogic-compiler = "0.1"
```

### Step 1.3: Set Up Module Structure

Create `src/lib.rs`:

```rust
//! SimpleTensor - A minimal TensorLogic backend using ndarray

mod tensor;
mod executor;
mod error;

pub use tensor::SimpleTensor;
pub use executor::SimpleExecutor;
pub use error::SimpleError;
```

## Part 2: Define Core Types

### Step 2.1: Define the Tensor Type

Create `src/tensor.rs`:

```rust
use ndarray::ArrayD;

/// A simple tensor backed by ndarray
#[derive(Clone, Debug)]
pub struct SimpleTensor {
    /// The tensor data
    pub data: ArrayD<f64>,
    /// Unique identifier for debugging
    pub id: String,
}

impl SimpleTensor {
    /// Create a new tensor with the given data
    pub fn new(id: impl Into<String>, data: ArrayD<f64>) -> Self {
        Self {
            data,
            id: id.into(),
        }
    }

    /// Create a tensor filled with zeros
    pub fn zeros(id: impl Into<String>, shape: &[usize]) -> Self {
        Self::new(id, ArrayD::zeros(shape))
    }

    /// Create a tensor filled with ones
    pub fn ones(id: impl Into<String>, shape: &[usize]) -> Self {
        Self::new(id, ArrayD::ones(shape))
    }

    /// Create a tensor with specific data
    pub fn with_data(id: impl Into<String>, shape: &[usize], data: Vec<f64>) -> Self {
        let array = ArrayD::from_shape_vec(shape, data)
            .expect("Shape and data length must match");
        Self::new(id, array)
    }

    /// Get the shape of the tensor
    pub fn shape(&self) -> &[usize] {
        self.data.shape()
    }

    /// Get the number of elements
    pub fn size(&self) -> usize {
        self.data.len()
    }
}
```
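`with_data` panics when the shape and the data length disagree, which is fine for tests but awkward for a library. A minimal sketch of a fallible alternative follows. To stay dependency-free it uses a flat `Vec<f64>` stand-in instead of `ArrayD`; `FlatTensor`, `ShapeError`, and `try_with_data` are hypothetical names for illustration and are not part of the tutorial crate.

```rust
/// Dependency-free stand-in for `SimpleTensor`: a flat row-major buffer
/// plus a shape. (`FlatTensor` is a hypothetical name for this sketch.)
#[derive(Debug, Clone, PartialEq)]
pub struct FlatTensor {
    pub shape: Vec<usize>,
    pub data: Vec<f64>,
}

/// Reported when the data length does not match the shape.
#[derive(Debug, PartialEq)]
pub struct ShapeError {
    pub expected_len: usize,
    pub actual_len: usize,
}

impl FlatTensor {
    /// Like `with_data`, but returns an error instead of panicking.
    pub fn try_with_data(shape: &[usize], data: Vec<f64>) -> Result<Self, ShapeError> {
        // The element count is the product of all dimensions (1 for a scalar).
        let expected_len: usize = shape.iter().product();
        if data.len() != expected_len {
            return Err(ShapeError {
                expected_len,
                actual_len: data.len(),
            });
        }
        Ok(Self {
            shape: shape.to_vec(),
            data,
        })
    }
}
```

In the real backend the same idea would likely mean mapping the error returned by `ArrayD::from_shape_vec` into `SimpleError::InvalidInput` rather than calling `expect`.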

### Step 2.2: Define the Error Type

Create `src/error.rs`:

```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum SimpleError {
    #[error("Shape mismatch: expected {expected:?}, got {actual:?}")]
    ShapeMismatch {
        expected: Vec<usize>,
        actual: Vec<usize>,
    },

    #[error("Invalid einsum specification: {0}")]
    InvalidEinsum(String),

    #[error("Unsupported operation: {0}")]
    UnsupportedOperation(String),

    #[error("Invalid input: {0}")]
    InvalidInput(String),

    #[error("Computation error: {0}")]
    ComputationError(String),
}
```
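To see what the `#[error(...)]` attributes buy you, here is a rough hand-written equivalent for two of the variants, using only the standard library. It is a sketch of the general shape of such an implementation, not the exact code `thiserror` generates; `ManualError` is a hypothetical name.

```rust
use std::fmt;

/// Two-variant mirror of `SimpleError`, written without `thiserror`.
/// (`ManualError` is a hypothetical name for this sketch.)
#[derive(Debug)]
pub enum ManualError {
    ShapeMismatch {
        expected: Vec<usize>,
        actual: Vec<usize>,
    },
    InvalidEinsum(String),
}

impl fmt::Display for ManualError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // `{:?}` prints the shape vectors as `[2, 3]`.
            ManualError::ShapeMismatch { expected, actual } => {
                write!(f, "Shape mismatch: expected {:?}, got {:?}", expected, actual)
            }
            ManualError::InvalidEinsum(msg) => {
                write!(f, "Invalid einsum specification: {}", msg)
            }
        }
    }
}

// An empty impl is enough: `Error` only requires `Debug` and `Display`.
impl std::error::Error for ManualError {}
```

The derive removes this boilerplate, which starts to matter as the number of variants grows.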

### Step 2.3: Define the Executor Type

Create `src/executor.rs`:

```rust
use crate::{SimpleTensor, SimpleError};

/// A simple executor for TensorLogic operations
#[derive(Default)]
pub struct SimpleExecutor {
    // Optional: add state for caching, profiling, etc. here.
    // (A `///` doc comment with no field after it does not compile.)
}

impl SimpleExecutor {
    /// Create a new executor
    pub fn new() -> Self {
        Self::default()
    }
}
```

## Part 3: Implement TlExecutor

Now the fun part - implementing the trait!

### Step 3.1: Implement Element-wise Operations

Add to `src/executor.rs`:

```rust
use tensorlogic_infer::{TlExecutor, ElemOp, ReduceOp};
use ndarray::{Array, Axis, Zip};

impl TlExecutor for SimpleExecutor {
    type Tensor = SimpleTensor;
    type Error = SimpleError;

    fn elem_op(&mut self, op: ElemOp, x: &Self::Tensor)
        -> Result<Self::Tensor, Self::Error>
    {
        let result_data = match op {
            ElemOp::Relu => x.data.mapv(|v| v.max(0.0)),
            ElemOp::OneMinus => x.data.mapv(|v| 1.0 - v),
            ElemOp::Sigmoid => x.data.mapv(|v| 1.0 / (1.0 + (-v).exp())),
            _ => return Err(SimpleError::UnsupportedOperation(
                format!("Element-wise operation {:?} not supported", op)
            )),
        };

        Ok(SimpleTensor::new(
            format!("{}_op", x.id),
            result_data
        ))
    }

    fn elem_op_binary(&mut self, op: ElemOp, x: &Self::Tensor, y: &Self::Tensor)
        -> Result<Self::Tensor, Self::Error>
    {
        // Validate shapes match
        if x.shape() != y.shape() {
            return Err(SimpleError::ShapeMismatch {
                expected: x.shape().to_vec(),
                actual: y.shape().to_vec(),
            });
        }

        let result_data = match op {
            ElemOp::Add => &x.data + &y.data,
            ElemOp::Multiply => &x.data * &y.data,
            ElemOp::Max => {
                let mut result = x.data.clone();
                Zip::from(&mut result)
                    .and(&y.data)
                    .for_each(|a, &b| *a = a.max(b));
                result
            },
            ElemOp::Min => {
                let mut result = x.data.clone();
                Zip::from(&mut result)
                    .and(&y.data)
                    .for_each(|a, &b| *a = a.min(b));
                result
            },
            _ => return Err(SimpleError::UnsupportedOperation(
                format!("Binary operation {:?} not supported", op)
            )),
        };

        Ok(SimpleTensor::new(
            format!("{}_{}_op", x.id, y.id),
            result_data
        ))
    }

    // We'll add reduce and einsum next...
    fn reduce(&mut self, op: ReduceOp, x: &Self::Tensor, axes: &[usize])
        -> Result<Self::Tensor, Self::Error>
    {
        todo!("Implement in next step")
    }

    fn einsum(&mut self, spec: &str, inputs: &[Self::Tensor])
        -> Result<Self::Tensor, Self::Error>
    {
        todo!("Implement in next step")
    }
}
```

### Step 3.2: Implement Reduce Operations

Replace the `reduce` stub in the `TlExecutor` impl with:

```rust
fn reduce(&mut self, op: ReduceOp, x: &Self::Tensor, axes: &[usize])
    -> Result<Self::Tensor, Self::Error>
{
    // Validate axes
    for &axis in axes {
        if axis >= x.data.ndim() {
            return Err(SimpleError::InvalidInput(
                format!("Axis {} out of bounds for tensor with {} dimensions",
                    axis, x.data.ndim())
            ));
        }
    }

    let mut result = x.data.clone();

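    // Reduce from the highest axis down so that removing an axis never
    // shifts the index of one still waiting to be reduced. Callers may pass
    // axes in any order (or repeat one), so sort and dedup first.
    let mut sorted_axes = axes.to_vec();
    sorted_axes.sort_unstable();
    sorted_axes.dedup();

    for &axis in sorted_axes.iter().rev() {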
        result = match op {
            ReduceOp::Sum => result.sum_axis(Axis(axis)),
            ReduceOp::Max => result.map_axis(Axis(axis), |view| {
                view.iter().fold(f64::NEG_INFINITY, |a, &b| a.max(b))
            }),
            ReduceOp::Min => result.map_axis(Axis(axis), |view| {
                view.iter().fold(f64::INFINITY, |a, &b| a.min(b))
            }),
            ReduceOp::Product => result.map_axis(Axis(axis), |view| {
                view.iter().fold(1.0, |a, &b| a * b)
            }),
        };
    }

    Ok(SimpleTensor::new(
        format!("{}_reduce", x.id),
        result
    ))
}
```
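Reducing in reverse only keeps indices stable when the axes are visited from highest to lowest, so it helps to normalize the axis list before the loop. The validation and ordering can be factored into a small helper, sketched here with the standard library only; `normalize_axes` is a hypothetical function, not part of `tensorlogic-infer`.

```rust
/// Validate reduction axes and return them sorted from highest to lowest,
/// with duplicates removed. (`normalize_axes` is a hypothetical helper.)
pub fn normalize_axes(axes: &[usize], ndim: usize) -> Result<Vec<usize>, String> {
    // Reject the first axis that does not exist in the tensor.
    if let Some(&bad) = axes.iter().find(|&&a| a >= ndim) {
        return Err(format!(
            "Axis {} out of bounds for tensor with {} dimensions",
            bad, ndim
        ));
    }

    let mut sorted = axes.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a)); // descending order
    sorted.dedup(); // duplicates are adjacent after sorting
    Ok(sorted)
}
```

For example, `normalize_axes(&[0, 2, 0], 3)` yields `[2, 0]`: axis 2 is removed first, so axis 0 still refers to the same dimension when its turn comes.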

### Step 3.3: Implement Einsum (Simplified)

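For this tutorial, we'll implement a simplified einsum that handles a few common cases. The match is on the literal subscript strings, so a spec such as `"ab,bc->ac"` is rejected even though it also describes a matrix product. Replace the `einsum` stub with: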

```rust
fn einsum(&mut self, spec: &str, inputs: &[Self::Tensor])
    -> Result<Self::Tensor, Self::Error>
{
    // Parse einsum spec
    let parts: Vec<&str> = spec.split("->").collect();
    if parts.len() != 2 {
        return Err(SimpleError::InvalidEinsum(
            format!("Invalid einsum spec: {}", spec)
        ));
    }

    let input_specs: Vec<&str> = parts[0].split(',').collect();
    let output_spec = parts[1];

    // Validate input count
    if inputs.len() != input_specs.len() {
        return Err(SimpleError::InvalidEinsum(
            format!("Expected {} inputs, got {}", input_specs.len(), inputs.len())
        ));
    }

    // Handle common cases
    match (input_specs.as_slice(), output_spec) {
        // Identity: "ij->ij"
        (["ij"], "ij") if inputs.len() == 1 => {
            Ok(inputs[0].clone())
        },

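        // Matrix multiplication: "ik,kj->ij"
        (["ik", "kj"], "ij") if inputs.len() == 2 => {
            // `dot` is not implemented for dynamic-dimensional arrays, so
            // view both operands as 2-D first. The conversion fails if
            // either tensor is not rank 2.
            let rank_err = |_: ndarray::ShapeError| SimpleError::InvalidInput(
                "matmul expects two rank-2 tensors".to_string()
            );
            let a = inputs[0].data.view()
                .into_dimensionality::<ndarray::Ix2>().map_err(rank_err)?;
            let b = inputs[1].data.view()
                .into_dimensionality::<ndarray::Ix2>().map_err(rank_err)?;

            // `dot` panics on mismatched inner dimensions; check them here.
            if a.ncols() != b.nrows() {
                return Err(SimpleError::ShapeMismatch {
                    expected: vec![a.ncols(), b.ncols()],
                    actual: b.shape().to_vec(),
                });
            }

            // Convert back to a dynamic-dimensional array for SimpleTensor.
            let result = a.dot(&b).into_dyn();
            Ok(SimpleTensor::new("matmul", result))
        },

        // Batch matrix multiplication: "bik,bkj->bij"
        (["bik", "bkj"], "bij") if inputs.len() == 2 => {
            let rank_err = |_: ndarray::ShapeError| SimpleError::InvalidInput(
                "batch matmul expects two rank-3 tensors".to_string()
            );
            let a = inputs[0].data.view()
                .into_dimensionality::<ndarray::Ix3>().map_err(rank_err)?;
            let b = inputs[1].data.view()
                .into_dimensionality::<ndarray::Ix3>().map_err(rank_err)?;

            let (batch_size, m, k) = a.dim();
            let (b_batch, b_k, n) = b.dim();
            if batch_size != b_batch || k != b_k {
                return Err(SimpleError::ShapeMismatch {
                    expected: vec![batch_size, k, n],
                    actual: b.shape().to_vec(),
                });
            }

            // Simplified batch matmul: one 2-D product per batch entry
            let mut result = Array::<f64, _>::zeros((batch_size, m, n));

            for b_idx in 0..batch_size {
                let a_slice = a.index_axis(Axis(0), b_idx);
                let b_slice = b.index_axis(Axis(0), b_idx);
                let prod = a_slice.dot(&b_slice);
                result.index_axis_mut(Axis(0), b_idx).assign(&prod);
            }

            Ok(SimpleTensor::new("batch_matmul", result.into_dyn()))
        },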

        // Element-wise product: "i,i->i"
        (["i", "i"], "i") if inputs.len() == 2 => {
            self.elem_op_binary(ElemOp::Multiply, &inputs[0], &inputs[1])
        },

        // Add more patterns as needed...
        _ => Err(SimpleError::UnsupportedOperation(
            format!("Einsum pattern '{}' not yet supported", spec)
        )),
    }
}
```
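The spec handling above can be pulled out into a standalone parser that also tolerates whitespace and checks that the output only uses indices present in the inputs. The following is a minimal sketch using only the standard library; `EinsumSpec` and `parse_einsum_spec` are hypothetical names, and the validation rules are assumptions about what a stricter parser might enforce rather than the behaviour of `tensorlogic-infer`.

```rust
/// Parsed form of a spec such as "ik,kj->ij". (`EinsumSpec` and
/// `parse_einsum_spec` are hypothetical names for this sketch.)
#[derive(Debug, PartialEq)]
pub struct EinsumSpec {
    pub inputs: Vec<String>,
    pub output: String,
}

pub fn parse_einsum_spec(spec: &str) -> Result<EinsumSpec, String> {
    // Remove all whitespace so "ik, kj -> ij" parses like "ik,kj->ij".
    fn clean(s: &str) -> String {
        s.chars().filter(|c| !c.is_whitespace()).collect()
    }

    // Exactly one "->" must separate the inputs from the output.
    let (lhs, rhs) = spec
        .split_once("->")
        .ok_or_else(|| format!("missing '->' in '{}'", spec))?;
    if rhs.contains("->") {
        return Err(format!("more than one '->' in '{}'", spec));
    }

    let inputs: Vec<String> = lhs.split(',').map(clean).collect();
    let output = clean(rhs);

    // Every subscript must be an ASCII lowercase letter.
    for term in inputs.iter().chain(std::iter::once(&output)) {
        if let Some(c) = term.chars().find(|c| !c.is_ascii_lowercase()) {
            return Err(format!("invalid index '{}' in '{}'", c, spec));
        }
    }

    // Every output index must appear in at least one input.
    for c in output.chars() {
        if !inputs.iter().any(|term| term.contains(c)) {
            return Err(format!("output index '{}' not found in inputs", c));
        }
    }

    Ok(EinsumSpec { inputs, output })
}
```

With a parsed spec in hand, the `match` can dispatch on structure (for example "two inputs sharing one contracted index") instead of on exact strings.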

## Part 4: Testing

### Step 4.1: Write Unit Tests

Add to `src/executor.rs`:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_elem_op_relu() {
        let mut exec = SimpleExecutor::new();
        let tensor = SimpleTensor::with_data(
            "test",
            &[4],
            vec![-2.0, -1.0, 0.0, 1.0]
        );

        let result = exec.elem_op(ElemOp::Relu, &tensor).unwrap();

        assert_eq!(result.data.as_slice().unwrap(), &[0.0, 0.0, 0.0, 1.0]);
    }

    #[test]
    fn test_elem_op_binary_add() {
        let mut exec = SimpleExecutor::new();
        let t1 = SimpleTensor::with_data("t1", &[2, 2], vec![1.0, 2.0, 3.0, 4.0]);
        let t2 = SimpleTensor::with_data("t2", &[2, 2], vec![5.0, 6.0, 7.0, 8.0]);

        let result = exec.elem_op_binary(ElemOp::Add, &t1, &t2).unwrap();

        assert_eq!(result.data.as_slice().unwrap(), &[6.0, 8.0, 10.0, 12.0]);
    }

    #[test]
    fn test_reduce_sum() {
        let mut exec = SimpleExecutor::new();
        let tensor = SimpleTensor::with_data(
            "test",
            &[2, 3],
            vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
        );

        // Sum along axis 0 (collapses the rows, leaving one sum per column)
        let result = exec.reduce(ReduceOp::Sum, &tensor, &[0]).unwrap();

        assert_eq!(result.shape(), &[3]);
        assert_eq!(result.data.as_slice().unwrap(), &[5.0, 7.0, 9.0]);
    }

    #[test]
    fn test_einsum_matmul() {
        let mut exec = SimpleExecutor::new();
        let a = SimpleTensor::with_data("a", &[2, 3], vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
        let b = SimpleTensor::with_data("b", &[3, 2], vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);

        let result = exec.einsum("ik,kj->ij", &[a, b]).unwrap();

        assert_eq!(result.shape(), &[2, 2]);
        // [[1*1 + 2*3 + 3*5, 1*2 + 2*4 + 3*6],
        //  [4*1 + 5*3 + 6*5, 4*2 + 5*4 + 6*6]]
        // = [[22, 28], [49, 64]]
        assert_eq!(result.data.as_slice().unwrap(), &[22.0, 28.0, 49.0, 64.0]);
    }

    #[test]
    fn test_shape_mismatch_error() {
        let mut exec = SimpleExecutor::new();
        let t1 = SimpleTensor::zeros("t1", &[2, 3]);
        let t2 = SimpleTensor::zeros("t2", &[3, 2]);

        let result = exec.elem_op_binary(ElemOp::Add, &t1, &t2);

        assert!(result.is_err());
        assert!(matches!(result.unwrap_err(), SimpleError::ShapeMismatch { .. }));
    }
}
```

### Step 4.2: Run Tests

```bash
cargo test
```

You should see all tests passing! 🎉
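The tests above compare with `assert_eq!`, which works because every expected value is exactly representable as an `f64`. Operations such as `Sigmoid` produce values that are not, so a tolerance-based comparison is safer there. A minimal sketch, with `approx_eq` as a hypothetical test helper and `sigmoid` repeating the formula from Step 3.1:

```rust
/// True when both slices have the same length and every pair of elements
/// differs by at most `tol`. (`approx_eq` is a hypothetical test helper.)
pub fn approx_eq(actual: &[f64], expected: &[f64], tol: f64) -> bool {
    actual.len() == expected.len()
        && actual
            .iter()
            .zip(expected)
            .all(|(a, e)| (a - e).abs() <= tol)
}

/// The same logistic function used for `ElemOp::Sigmoid` in Step 3.1.
pub fn sigmoid(v: f64) -> f64 {
    1.0 / (1.0 + (-v).exp())
}
```

Inside a test you would then write something like `assert!(approx_eq(result.data.as_slice().unwrap(), &expected, 1e-6))`, choosing the tolerance to suit the operation.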

## Part 5: Optimization

### Step 5.1: Add Benchmarks

Create `benches/benchmarks.rs`:

```rust
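// Criterion passes the closure's return value through its own black box,
// so no explicit `black_box` call is needed here.
use criterion::{criterion_group, criterion_main, Criterion};
use simple_tensor_backend::{SimpleExecutor, SimpleTensor};
use tensorlogic_infer::TlExecutor;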

fn bench_matmul(c: &mut Criterion) {
    let mut exec = SimpleExecutor::new();
    let a = SimpleTensor::ones("a", &[100, 100]);
    let b = SimpleTensor::ones("b", &[100, 100]);

    c.bench_function("matmul_100x100", |bencher| {
        bencher.iter(|| {
            exec.einsum("ik,kj->ij", &[a.clone(), b.clone()]).unwrap()
        });
    });
}

criterion_group!(benches, bench_matmul);
criterion_main!(benches);
```

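Add the `[[bench]]` target to `Cargo.toml`, and add `criterion` to the `[dev-dependencies]` table you already have (TOML rejects a table that is declared twice):

```toml
[dev-dependencies]
tensorlogic-compiler = "0.1"
criterion = "0.5"

[[bench]]
name = "benchmarks"
harness = false
```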

Run benchmarks:

```bash
cargo bench
```

### Step 5.2: Profile and Optimize

Use `cargo flamegraph` to find hotspots:

```bash
cargo install flamegraph
cargo flamegraph --bench benchmarks
```

Common optimizations:
- Enable `ndarray`'s `rayon` feature to parallelize element-wise and reduction loops
- Implement memory pooling for large tensors
- Cache einsum spec parsing
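
As an illustration of the last item, here is a minimal sketch of a parse cache keyed by the spec string, using only the standard library. `SpecCache` is a hypothetical type, and the `misses` counter exists only to make the caching visible; whether this pays off depends on how expensive your parser is, so profile first.

```rust
use std::collections::HashMap;

/// Memoizes the split of an einsum spec into (input subscripts, output).
/// (`SpecCache` is a hypothetical type for this sketch.)
#[derive(Default)]
pub struct SpecCache {
    parsed: HashMap<String, (Vec<String>, String)>,
    /// Number of specs that had to be parsed because they were not cached.
    pub misses: usize,
}

impl SpecCache {
    /// Return the parsed spec, parsing it only the first time it is seen.
    /// Returns `None` (and caches nothing) if the spec has no "->".
    pub fn get(&mut self, spec: &str) -> Option<&(Vec<String>, String)> {
        if !self.parsed.contains_key(spec) {
            let (lhs, rhs) = spec.split_once("->")?;
            let inputs: Vec<String> = lhs.split(',').map(String::from).collect();
            self.parsed
                .insert(spec.to_string(), (inputs, rhs.to_string()));
            self.misses += 1;
        }
        self.parsed.get(spec)
    }
}
```

An executor could hold a `SpecCache` as a field and consult it at the top of `einsum`, which is one reason the trait methods take `&mut self`.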

## Part 6: Advanced Features

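### Step 6.1: Add Profiling Support

The implementation below reads two fields that `SimpleExecutor` does not have yet, so add them to the struct first: a `profiling_enabled: bool` flag and a `profiles` collection whose type matches the `op_profiles` field of `ProfileData`.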

```rust
use tensorlogic_infer::{TlProfiledExecutor, ProfileData, OpProfile};
use std::collections::HashMap;
use std::time::Instant;

impl TlProfiledExecutor for SimpleExecutor {
    fn enable_profiling(&mut self) {
        self.profiling_enabled = true;
    }

    fn disable_profiling(&mut self) {
        self.profiling_enabled = false;
    }

    fn get_profile_data(&self) -> ProfileData {
        ProfileData {
            op_profiles: self.profiles.clone(),
            memory_profile: Default::default(),
        }
    }
}
```
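The trait implementation only toggles and reports profiling; something still has to record timings when an operation runs. A minimal sketch of that recording step, using only the standard library, is shown below. `OpStats` and `Profiler` are hypothetical stand-ins and deliberately do not reuse the `OpProfile` or `ProfileData` types, whose fields are not shown in this tutorial.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Accumulated timing for one operation name.
/// (`OpStats` is a hypothetical stand-in, not `OpProfile`.)
#[derive(Debug, Default, Clone)]
pub struct OpStats {
    pub calls: u64,
    pub total: Duration,
}

/// Hypothetical recorder an executor could embed as a field.
#[derive(Default)]
pub struct Profiler {
    pub enabled: bool,
    pub stats: HashMap<String, OpStats>,
}

impl Profiler {
    /// Run `f`, recording its wall-clock time under `name` when enabled.
    pub fn time<T>(&mut self, name: &str, f: impl FnOnce() -> T) -> T {
        if !self.enabled {
            // Skip the clock reads entirely when profiling is off.
            return f();
        }
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed();

        let entry = self.stats.entry(name.to_string()).or_default();
        entry.calls += 1;
        entry.total += elapsed;
        out
    }
}
```

Each trait method could then wrap its body in a call such as `profiler.time("einsum", || ...)`, and `get_profile_data` would convert the accumulated stats into whatever `ProfileData` expects.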

### Step 6.2: Add Capability Queries

```rust
use tensorlogic_infer::{TlCapabilities, BackendCapabilities, DeviceType, DType, Feature};

impl TlCapabilities for SimpleExecutor {
    fn capabilities(&self) -> BackendCapabilities {
        BackendCapabilities {
            devices: vec![DeviceType::CPU],
            dtypes: vec![DType::F64],
            features: vec![
                Feature::Einsum,
                Feature::ElementWise,
                Feature::Reduction,
            ],
            max_tensor_size: 1_000_000_000, // 1GB
            supports_sparse: false,
            supports_complex: false,
        }
    }
}
```

## Next Steps

Congratulations! You've built a working TensorLogic backend. Here's what to do next:

### Immediate Next Steps

1. **Add More Einsum Patterns**
   - Implement a general einsum parser
   - Support arbitrary contractions
   - Handle broadcasting

2. **Implement TlAutodiff**
   - Add gradient tracking
   - Implement backward passes
   - Support common neural network operations

3. **Optimize Performance**
   - Enable BLAS/LAPACK for matrix operations (for example via `ndarray`'s `blas` feature)
   - Add SIMD support
   - Implement memory pooling

### Long-term Improvements

1. **GPU Support**
   - Use `cudarc` or `wgpu` for GPU operations
   - Implement device placement
   - Handle data transfers

2. **Distributed Execution**
   - Add MPI support
   - Implement tensor sharding
   - Support model parallelism

3. **Production Features**
   - Comprehensive error recovery
   - Checkpointing
   - Monitoring and observability

### Resources

- **TensorLogic Documentation**: https://docs.rs/tensorlogic-infer
- **Reference Backend**: `tensorlogic-scirs-backend` crate
- **Community**: https://github.com/cool-japan/tensorlogic/discussions

## Troubleshooting

### Common Issues

**Issue**: Tests failing with shape mismatches
```rust
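// Solution: Add shape validation.
// `expected_shape` is a `&[usize]` that you supply.
if x.shape() != expected_shape {
    return Err(SimpleError::ShapeMismatch {
        expected: expected_shape.to_vec(),
        actual: x.shape().to_vec(),
    });
}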
```

**Issue**: Out of memory errors
```rust
// Solution: Implement chunked processing.
// `process_chunk` is a placeholder for your own per-chunk logic.
for chunk in inputs.chunks(batch_size) {
    process_chunk(chunk)?;
}
```
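As a concrete instance of the chunking pattern, the sketch below computes a sum of squares while keeping only one chunk's worth of intermediate values alive at a time. It uses only the standard library; `chunked_sum_of_squares` is a hypothetical example function chosen for illustration.

```rust
/// Sum of squares computed one fixed-size chunk at a time, so at most
/// `chunk_size` intermediate values exist at once.
/// (`chunked_sum_of_squares` is a hypothetical example function.)
pub fn chunked_sum_of_squares(values: &[f64], chunk_size: usize) -> Result<f64, String> {
    if chunk_size == 0 {
        // `slice::chunks` panics on a zero chunk size, so reject it up front.
        return Err("chunk_size must be greater than zero".to_string());
    }

    let mut total = 0.0;
    for chunk in values.chunks(chunk_size) {
        // Per-chunk scratch buffer; it is dropped at the end of each iteration.
        let squared: Vec<f64> = chunk.iter().map(|v| v * v).collect();
        total += squared.iter().sum::<f64>();
    }
    Ok(total)
}
```

The last chunk may be shorter than `chunk_size`; `chunks` handles that without any special casing.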

**Issue**: Slow performance
```rust
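// Solution: Enable ndarray's parallel features.
// Requires the `rayon` feature in Cargo.toml:
//     ndarray = { version = "0.15", features = ["rayon"] }
use ndarray::parallel::prelude::*;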
```

## Conclusion

You now have a solid foundation for a TensorLogic backend! The patterns you've learned here scale to more complex backends with GPU support, distributed execution, and advanced optimizations.

Happy coding! 🚀

---

**Version**: 1.0
**Last Updated**: 2025-12-16
**Part of**: [TensorLogic Ecosystem](https://github.com/cool-japan/tensorlogic)