
matrixmultiply
==============

General matrix multiplication for f32, f64 matrices. Operates on matrices with
general layout (they can use arbitrary row and column stride).
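
To illustrate what "general layout" means, here is a minimal sketch of a naive
reference implementation that computes C ← α·A·B + β·C for matrices addressed
through row and column strides. The function ``gemm_ref`` and its parameter
names are hypothetical and written only for this illustration; it is not the
crate's API or its optimized implementation, so consult the API documentation
for the real function signatures.

.. code-block:: rust

    /// Naive reference for C <- alpha * A * B + beta * C.
    /// Hypothetical helper for illustration; not part of the crate.
    /// `rs*` and `cs*` are row and column strides, counted in elements.
    #[allow(clippy::too_many_arguments)]
    fn gemm_ref(
        m: usize, k: usize, n: usize,
        alpha: f32,
        a: &[f32], rsa: usize, csa: usize,
        b: &[f32], rsb: usize, csb: usize,
        beta: f32,
        c: &mut [f32], rsc: usize, csc: usize,
    ) {
        for i in 0..m {
            for j in 0..n {
                // Dot product of row i of A with column j of B.
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
                }
                let idx = i * rsc + j * csc;
                c[idx] = alpha * acc + beta * c[idx];
            }
        }
    }

    fn main() {
        // A is 2×3, row major: row stride 3, column stride 1.
        let a: [f32; 6] = [1.0, 2.0, 3.0,
                           4.0, 5.0, 6.0];
        // B is 3×2, column major: row stride 1, column stride 3.
        let b: [f32; 6] = [1.0, 0.0, 1.0,  // column 0
                           0.0, 1.0, 1.0]; // column 1
        // C is 2×2, row major.
        let mut c = [0.0f32; 4];
        gemm_ref(2, 3, 2, 1.0, &a, 3, 1, &b, 1, 3, 0.0, &mut c, 2, 1);
        println!("{:?}", c);
    }

Because the layout is expressed only through strides, the same routine handles
row major, column major and mixed inputs without copying or transposing.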

Please read the `API documentation here`__

__ https://docs.rs/matrixmultiply/


This crate uses the same macro/microkernel approach to matrix multiplication as
the BLIS_ project.

We presently provide a few good microkernels, both portable and x86-64
specific, and only one operation: the general matrix-matrix multiplication
(“gemm”).

.. _BLIS: https://github.com/flame/blis

Blog posts about this crate:

+ `gemm: a rabbit hole`__

__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/

|build_status|_ |crates|_

.. |build_status| image:: https://travis-ci.org/bluss/matrixmultiply.svg?branch=master
.. _build_status: https://travis-ci.org/bluss/matrixmultiply

.. |crates| image:: https://meritbadge.herokuapp.com/matrixmultiply
.. _crates: https://crates.io/crates/matrixmultiply

Recent Changes
--------------

- 0.2.1

  - Improve matrix packing by taking better advantage of contiguous inputs.

    Benchmark improvement: runtime for a 64×64 problem where inputs are either
    both row major or both column major changed by -5% for sgemm and -1% for
    dgemm.
    (#26)
  
  - In the sgemm AVX kernel, handle column major output arrays the same way
    as row major arrays.

    Benchmark improvement: runtime for a 32×32 problem where output is column
    major changed by -11%. (#27)

- 0.2.0

  - Use runtime feature detection on x86 and x86-64 platforms, to enable
    AVX-specific microkernels at runtime when the CPU the program is running
    on supports them.

    This means no special compiler flags are needed to enable native
    instruction performance!

  - Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up
    matrix multiplication by another 25%.

  - Use ``std::alloc`` for allocation of aligned packing buffers

  - Rust 1.28 is now the minimum required version

- 0.1.15

  - Fix bug where the result matrix C was not updated in the case of an M × K
    by K × N matrix multiplication where K was zero. (This resulted in the
    output C potentially being left uninitialized or with incorrect values in
    this specific scenario.) By @jturner314 (PR #21)

- 0.1.14

  - Avoid an unused code warning

- 0.1.13

  - Pick 8x8 sgemm (f32) kernel when AVX target feature is enabled
    (with Rust 1.14 or later, no effect otherwise).
  - Use ``rawpointer``, a µcrate with raw pointer methods taken from this
    project.

- 0.1.12

  - Internal cleanup with retained performance

- 0.1.11

  - Adjust sgemm (f32) kernel to optimize better on recent Rust.

- 0.1.10

  - Update doc links to docs.rs

- 0.1.9

  - Work around an optimization regression in Rust nightly (1.12-ish) (#9)

- 0.1.8

  - Improved docs

- 0.1.7

  - Reduce overhead slightly for small matrix multiplication problems by using
    only one allocation call for both packing buffers.

- 0.1.6

  - Disable manual loop unrolling in debug mode (quicker debug builds)

- 0.1.5

  - Update sgemm to use a 4x8 microkernel (“still in simplistic rust”),
    which improves throughput by 10%.

- 0.1.4

  - Prepare support for aligned packed buffers
  - Update dgemm to use an 8x4 microkernel, still in simplistic rust,
    which improves throughput by 10-20% when using AVX.

- 0.1.3

  - Silence some debug prints

- 0.1.2

  - Major performance improvement for sgemm and dgemm (20-30% when using AVX).
    Since it all depends on what the optimizer does, I'd love to get
    issue reports describing good or bad performance.
  - Made the kernel masking generic, which is a cleaner design

- 0.1.1

  - Minor improvement in the kernel
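
The runtime feature detection introduced in 0.2.0 can be pictured with a
minimal sketch like the one below. It uses the standard library macro
``is_x86_feature_detected!``; the ``Kernel`` enum and ``select_kernel``
function are hypothetical names chosen for this illustration and are not taken
from the crate's source.

.. code-block:: rust

    /// Hypothetical kernel identifiers, for illustration only.
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum Kernel {
        Avx,
        Fallback,
    }

    /// Pick a kernel based on what the running CPU supports.
    fn select_kernel() -> Kernel {
        // The detection macro only exists on x86 and x86-64 targets,
        // so the check is compiled out everywhere else.
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("avx") {
                return Kernel::Avx;
            }
        }
        Kernel::Fallback
    }

    fn main() {
        println!("selected kernel: {:?}", select_kernel());
    }

Because the check happens while the program runs, a binary built without
special compiler flags can still take the AVX path on capable hardware and
fall back to the portable kernel elsewhere.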