typf-render-opixa 5.0.16

Based on an in-depth analysis of your project files and modern rasterization techniques, this document provides a specific and actionable specification to radically increase the performance of the `typf-render-opixa` rasterizer. The recommendations are structured in phases, starting with foundational improvements and progressing to more advanced parallelization, balancing impact against implementation complexity.[1]

### Phase 1: Foundational Optimizations

This phase focuses on establishing a proper optimization workflow and implementing high-impact, low-complexity changes related to memory management and build configuration.[2][3]

*   **1.1. Establish a Profiling Baseline**: Before any code changes, create a benchmark suite that reflects typical usage (e.g., rendering various fonts at different sizes). Use `cargo flamegraph` to generate a performance profile of the current implementation. This will identify the most significant bottlenecks and serve as a baseline to measure all future improvements against.[3]
*   **1.2. Aggressive Build Configuration**: Modify your `Cargo.toml` to ensure the compiler generates the most optimized code possible for release builds. Enable link-time optimization (LTO), reduce codegen units to allow for more cross-crate inlining, and compile for the native CPU to unlock the best available instruction sets (like AVX2).[2][3]

    ```toml
    [profile.release]
    lto = "fat"
    codegen-units = 1
    panic = "abort"
    ```
    Set the following environment variable when compiling: `RUSTFLAGS="-C target-cpu=native"`.[3]

*   **1.3. Zero-Allocation Scan Conversion**: The core rasterization loop in `scan_converter.rs` must perform zero heap allocations.[2][3]
    *   **Action**: Pre-allocate all necessary buffers outside the main loop, such as the Active Edge Table (AET) and scanline span buffers. Pass mutable slices (`&mut [T]`) into the rasterizer.
    *   **Recommendation**: Use an arena allocator like `bumpalo` for managing the memory of all edges for a given glyph. This allows for extremely fast, contiguous allocations that are all deallocated at once when the glyph rendering is complete.

### Phase 2: Algorithmic and Data Structure Enhancements

This phase refines the core algorithms and data layouts for better cache performance and reduced computational overhead.[3][2]

*   **2.1. Optimized Edge Table Management**: The current approach of sorting edges on every scanline is a major performance bottleneck. This should be replaced with a more efficient method in `edge.rs` and `scan_converter.rs`.[2][3]
    *   **Global Edge Table (GET)**: Implement a bucket sort for all glyph edges. Create an array of lists (buckets), where each bucket corresponds to a Y scanline. During setup, place each edge into the bucket of its starting scanline. This eliminates sorting entirely at this stage.[2]
    *   **Active Edge Table (AET)**: As you iterate through scanlines, add edges from the corresponding GET bucket to the AET. The AET can be kept sorted by insertion. For subsequent scanlines, simply update the X-intercept of each active edge by adding its slope (`dx/dy`), which is much faster than re-sorting.[3]
*   **2.2. Adopt Structure-of-Arrays (SoA) Layout**: Transition the `Edge` struct from an Array-of-Structs (AoS) to a Structure-of-Arrays (SoA) layout. This dramatically improves cache-line utilization and is a prerequisite for effective SIMD vectorization.[3]
    *   **Action**: Instead of `Vec<Edge { x, slope, y_max }>`, use a single struct:
        ```rust
        struct Edges {
            x_coords: Vec<F26Dot6>,
            slopes: Vec<F26Dot6>,
            y_max: Vec<i32>,
        }
        ```
*   **2.3. Implement Adaptive Curve Flattening**: The curve decomposition in `curves.rs` should be made adaptive to reduce the number of line segments for the rasterizer to process.[2]
    *   **Action**: Modify the curve flattening function to accept a tolerance parameter. For smoother curves, this will generate fewer, longer line segments, reducing the total edge count significantly.

### Phase 3: Advanced Parallelism

With a solid foundation, this phase introduces multi-level parallelism to leverage modern multi-core CPUs for radical speedups.[3][2]

*   **3.1. Implement SIMD on the Scanline**: The `simd.rs` module should be implemented to process multiple pixels simultaneously within the scan-conversion loop.[2][3]
    *   **Action**: Use Rust's stable `std::simd` API to operate on vectors of 4 or 8 pixels at a time.
    *   **Grayscale Coverage**: Load edge intersection points into SIMD registers. For each pixel in the vector, calculate coverage in parallel. This is especially effective for anti-aliasing. A 2x2 pixel quad approach can be more robust than a horizontal 4x1 span, as it handles both vertical and horizontal edges well.[3]
    *   **Monochrome Fill**: For 1-bit monochrome output, use SIMD instructions to write entire 8- or 16-byte chunks to the bitmap buffer in a single operation, rather than pixel by pixel.[2]
*   **3.2. Refine `rayon` Parallelism Strategy**: Your `parallel.rs` module can be enhanced with a multi-level approach.[1][2]
    *   **Glyph-Level Parallelism**: For rendering strings or texture atlases, continue using `rayon` to distribute the rendering of individual, independent glyphs across the thread pool. This is highly effective and has low overhead.[2]
    *   **Tile-Based Parallelism**: For very large and complex glyphs (e.g., CJK ideographs), implement a tile-based rendering strategy. Divide the glyph's bounding box into small, cache-friendly tiles (e.g., 32x32 or 64x64 pixels). Distribute these tiles as work units to a `rayon` thread pool. Each thread processes a subset of tiles, significantly improving cache locality and enabling parallel processing of a single glyph.[3]

### Phase 4: High-Level Caching Strategy

While the reference C code includes a complex cache manager (`cachemgr.c`), the `opixa` rasterizer itself should remain a pure, stateless computational kernel. Caching should be the responsibility of the application that *uses* the rasterizer.[4][3]

*   **4.1. Glyph Bitmap Cache**: Implement an application-level LRU cache (using the `lru` crate) to store the final rendered glyph bitmaps. Before requesting a rasterization, check the cache for an existing bitmap. The key should be a struct containing the glyph ID, transformation details, and size.[2]
*   **4.2. Outline Cache**: For maximum performance, implement a secondary cache for storing the pre-processed and scaled glyph outlines (the SoA `Edges` struct from step 2.2). This avoids re-calculating the flattened path if a glyph is needed again with a minor change like a subpixel shift.[2]

[1](https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/attachments/106844374/a00866e0-6112-4b04-bff0-fc3462000b60/llms-opixa.txt)
[2](https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/attachments/106844374/58339bd7-baaf-4230-9995-5369d680f5d3/pplx2.md)
[3](https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/attachments/106844374/6044127f-d143-4c8f-b903-192c2e14d592/pplx1.md)
[4](https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/attachments/106844374/e0fb27bf-08e4-4d4c-a482-f58e22d14b47/llms-raster.txt)