caffe2-tensor 0.1.5-alpha.0

## `caffe2-tensor`

This Rust crate is a translation of the `caffe2`
operator library and provides tensor-related
functionality. It is still in the process of being
translated from the original C++ implementation,
so some function bodies may not be fully
translated yet.

The crate contains several key types, including
`Tensor`, `TensorCPU`, `TensorCUDA`, and
`Int8TensorCPU`, which represent different types
of tensors and their storage. Tensors can be
resized, reshaped, and copied, and their data can
be accessed and modified. The crate also includes
functions for filling tensors with values and for
accessing tensor metadata.
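
As a rough illustration of these ideas (this is
a sketch only, not the crate's actual API, which
is still being translated), a minimal CPU-backed
tensor with a shape, resizable storage, and
element access might look like this:

```rust
// Minimal sketch of a CPU-backed, row-major f32 tensor.
// The names (`Tensor`, `numel`, `resize`) are illustrative,
// not the crate's real API.
#[derive(Debug)]
struct Tensor {
    shape: Vec<usize>,
    data: Vec<f32>,
}

impl Tensor {
    fn new(shape: &[usize]) -> Self {
        let numel = shape.iter().product();
        Tensor { shape: shape.to_vec(), data: vec![0.0; numel] }
    }

    fn numel(&self) -> usize {
        self.shape.iter().product()
    }

    // Change the shape, reallocating only if more elements
    // are needed than are currently allocated.
    fn resize(&mut self, shape: &[usize]) {
        self.shape = shape.to_vec();
        let numel = self.numel();
        if numel > self.data.len() {
            self.data.resize(numel, 0.0);
        }
    }

    // Row-major element access for a 2-D tensor.
    fn get2(&self, row: usize, col: usize) -> f32 {
        self.data[row * self.shape[1] + col]
    }
}

fn main() {
    let mut t = Tensor::new(&[2, 3]);
    t.data[1 * 3 + 2] = 4.0;
    println!("shape = {:?}, t[1][2] = {}", t.shape, t.get2(1, 2));
    t.resize(&[3, 2]); // same element count, so no reallocation
    println!("resized shape = {:?}", t.shape);
}
```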

One important feature of the crate is the ability
to share data between tensors, either through
cloning or through external pointers. Tensors can
also be reshaped without reallocating memory as
long as the new shape has the same number of
elements, and resized without reallocating as long
as the new element count fits within the existing
allocation.
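
To make the sharing aspect concrete, here is
a hedged sketch (hypothetical names, standard
library only) of two tensor views that share one
allocation, with a reshape that never copies:

```rust
use std::sync::Arc;

// Sketch of storage shared between tensor views.
// `SharedTensor` is a hypothetical name used only here.
#[derive(Debug, Clone)]
struct SharedTensor {
    shape: Vec<usize>,
    storage: Arc<Vec<f32>>, // cloning clones the Arc, not the buffer
}

impl SharedTensor {
    fn new(shape: &[usize], fill: f32) -> Self {
        let numel: usize = shape.iter().product();
        SharedTensor {
            shape: shape.to_vec(),
            storage: Arc::new(vec![fill; numel]),
        }
    }

    // Reshape without touching the allocation: only legal
    // when the element count stays the same.
    fn reshape(&self, shape: &[usize]) -> Option<Self> {
        let new_numel: usize = shape.iter().product();
        if new_numel != self.storage.len() {
            return None;
        }
        Some(SharedTensor {
            shape: shape.to_vec(),
            storage: Arc::clone(&self.storage),
        })
    }
}

fn main() {
    let a = SharedTensor::new(&[2, 6], 1.0);
    let b = a.reshape(&[3, 4]).expect("same element count");
    // Both views point at the same buffer; nothing was copied.
    assert!(Arc::ptr_eq(&a.storage, &b.storage));
    println!("a: {:?}, b: {:?}", a.shape, b.shape);
}
```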

The crate also includes several utility functions,
including a tensor printer, a shape function, and
a function for creating tensors filled with
specific values.
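
A hedged sketch of what such utilities boil down
to (again with made-up names, not the crate's
exports):

```rust
// Stand-ins for the utilities described above; the names
// `fill` and `print_tensor` are illustrative only.
fn fill(data: &mut [f32], value: f32) {
    for x in data.iter_mut() {
        *x = value;
    }
}

fn print_tensor(shape: &[usize], data: &[f32]) {
    println!("Tensor of shape {:?}: {:?}", shape, data);
}

fn main() {
    let shape = [2, 3];
    let mut data = vec![0.0f32; shape.iter().product()];
    fill(&mut data, 0.5); // constant fill, as a *Fill operator would do
    print_tensor(&shape, &data);
}
```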

Overall, the `caffe2-tensor` crate provides
a powerful set of tools for working with tensors
in Rust. However, it is still a work in progress,
so some features may not be fully implemented yet.

12> why do we need different tensor types for
different storage locations?

Different tensor types are needed for different
storage locations because the underlying storage
for tensors can be located in different memory
spaces, such as CPU memory or GPU memory. The
tensor type determines the device on which the
tensor's storage lives, while the tensor object
itself carries runtime metadata such as its data
type and shape.

For example, a `TensorCPU` type would be used for
tensors that are stored in CPU memory, while
a `TensorCUDA` type would be used for tensors that
are stored in GPU memory. This distinction matters
because an operation that mixes tensors from
different memory spaces forces data to be copied
between CPU and GPU memory, which adds transfer
latency and slows execution.

Having different tensor types for different
storage locations allows for more efficient use of
memory and processing power, as well as easier
manipulation of tensors on different devices.
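
One way to picture this in Rust (purely a sketch,
not how `caffe2-tensor` actually defines these
types) is to give each device its own struct, so
that mixing devices without an explicit copy is
a compile-time error:

```rust
// Hypothetical sketch: separate types for CPU- and
// GPU-resident tensors, so code that mixes devices fails
// to compile instead of silently copying data around.
struct TensorCpu {
    data: Vec<f32>,
}

struct TensorCuda {
    // A real binding would wrap a device pointer; this is
    // only a placeholder so the example runs without a GPU.
    device_buffer: Vec<f32>,
}

// An operation defined only for CPU tensors.
fn sum_cpu(t: &TensorCpu) -> f32 {
    t.data.iter().sum()
}

fn main() {
    let cpu = TensorCpu { data: vec![1.0, 2.0, 3.0] };
    println!("sum = {}", sum_cpu(&cpu));

    let gpu = TensorCuda { device_buffer: vec![1.0, 2.0, 3.0] };
    // sum_cpu(&gpu); // does not compile: the type system keeps devices apart
    let _ = gpu;
}
```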

13> so in other words we get an explicit point
where we can convert from TensorCPU to TensorCUDA,
and a user who calls this api will understand the
performance cost

Yes, that's correct. The explicit point of
conversion allows the user to understand and
control the performance cost of transferring data
between different storage locations. It also
allows for easier integration with other libraries
that may have different storage requirements.
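
A minimal sketch of such an explicit conversion
point (simulated, with hypothetical names, since
no GPU is involved here):

```rust
// The host-to-device transfer is a deliberate, visible step;
// nothing moves between devices implicitly.
struct TensorCpu {
    data: Vec<f32>,
}

struct TensorCuda {
    device_buffer: Vec<f32>, // placeholder for real device memory
}

// Hypothetical conversion: a real implementation would call
// something like cudaMemcpy; here the copy is only simulated.
fn to_cuda(t: &TensorCpu) -> TensorCuda {
    TensorCuda { device_buffer: t.data.clone() }
}

fn main() {
    let host = TensorCpu { data: vec![1.0, 2.0, 3.0, 4.0] };
    // The cost of the transfer is visible at this call site.
    let device = to_cuda(&host);
    println!("copied {} elements to the (simulated) device",
             device.device_buffer.len());
}
```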

14> are there other reasons?

Yes, another reason for having different tensor
types is to allow for more efficient
implementation of specific operations on different
hardware. For example, some operations may be more
efficient to compute on a CPU, while others may be
more efficient to compute on a GPU. By having
different tensor types, we can write specialized
implementations of these operations for each type
that are optimized for the underlying
hardware. This can result in significant
performance improvements.
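
A sketch of how per-device specialization can be
organized in Rust (the trait and type names here
are invented for illustration):

```rust
// One trait, one specialized implementation per backend.
trait Backend {
    fn dot(&self, a: &[f32], b: &[f32]) -> f32;
}

struct CpuBackend;

impl Backend for CpuBackend {
    // A CPU implementation might lean on auto-vectorization
    // or explicit SIMD intrinsics.
    fn dot(&self, a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

struct FakeGpuBackend;

impl Backend for FakeGpuBackend {
    // A real GPU backend would launch a kernel; this stand-in
    // reuses the CPU path so the example runs anywhere.
    fn dot(&self, a: &[f32], b: &[f32]) -> f32 {
        CpuBackend.dot(a, b)
    }
}

fn main() {
    let (a, b) = (vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]);
    let backends: Vec<Box<dyn Backend>> =
        vec![Box::new(CpuBackend), Box::new(FakeGpuBackend)];
    for backend in &backends {
        println!("dot = {}", backend.dot(&a, &b));
    }
}
```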

15> are there typically different access patterns?

Yes, there can be different access patterns for
CPU and GPU tensors due to the underlying hardware
architecture. For example, GPUs are built for
massive data parallelism, running thousands of
lightweight threads concurrently, whereas CPUs
favor strong single-threaded performance across
a modest number of cores. Therefore, it may
be more efficient to perform certain operations on
the CPU or the GPU depending on the access pattern
and workload. Additionally, the way data is
transferred between CPU and GPU can also affect
performance, and using the appropriate tensor type
can help optimize this process.

16> could you show me some examples of typical
access patterns for both varieties?

Sure, here are some examples:

### TensorCPU access patterns

- Sequential access: Reading the elements of
  a tensor in row-major order (outer loop over
  rows, inner loop over columns), so that
  consecutive reads touch consecutive memory
  addresses.

- Random access: Reading elements of a tensor in
  an unpredictable order, for example an order
  dictated by a separate list of indices rather
  than by the tensor's own layout.

- Vectorized access: Reading a contiguous block of
  elements from a tensor using SIMD instructions.

### TensorCUDA access patterns

- Coalesced access: Reading contiguous blocks of
  elements from a tensor that are aligned with
  memory transactions on the GPU. This minimizes
  the number of memory transactions and increases
  memory bandwidth utilization.

- Strided access: Reading non-contiguous blocks of
  elements from a tensor. This can happen when
  using certain CUDA libraries or when performing
  certain operations on tensors with
  non-contiguous strides.

- Pinned memory access: Reading and writing
  tensors stored in host memory that has been
  pinned to allow faster transfers between host
  and device.

These are just a few examples, and there can be
many other access patterns depending on the
specific use case and the operations being
performed on the tensor.
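
The contrast between contiguous and strided access
is easy to see even on the CPU side; the sketch
below walks the same buffer both ways (on a GPU
the analogous concern is whether neighbouring
threads touch neighbouring addresses, i.e.
coalescing):

```rust
// Contiguous (row-major) traversal vs. strided (column)
// traversal of the same buffer. The contiguous walk is
// cache- and SIMD-friendly; the strided walk jumps `cols`
// elements between reads.
fn main() {
    let (rows, cols) = (4usize, 5usize);
    let data: Vec<f32> = (0..rows * cols).map(|i| i as f32).collect();

    // Sequential access: consecutive memory addresses.
    let mut row_sum = 0.0f32;
    for r in 0..rows {
        for c in 0..cols {
            row_sum += data[r * cols + c];
        }
    }

    // Strided access: the index jumps by `cols` each step.
    let mut col_sum = 0.0f32;
    for c in 0..cols {
        for r in 0..rows {
            col_sum += data[r * cols + c];
        }
    }

    assert_eq!(row_sum, col_sum);
    println!("sum = {}", row_sum);
}
```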

17> thanks, could you show me some others?

Sure, here are a few other examples of access
patterns:

- **Random Access:** This access pattern is
  characterized by accessing elements of the
  tensor in no predictable order. An example use
  case might be an embedding lookup, where rows of
  a table are gathered according to an arbitrary
  list of token indices.

- **Sequential Access:** This access pattern is
  characterized by accessing the elements of
  a tensor sequentially in order. An example use
  case might be audio processing, where each
  sample in an audio signal corresponds to
  a different element of a tensor.

- **Stride Access:** This access pattern is
  characterized by accessing elements that are
  a fixed distance apart in memory rather than
  adjacent. An example use case might be reading
  a single column of a row-major matrix, where
  consecutive reads skip over an entire row's
  worth of elements.

- **Block Access:** This access pattern is
  characterized by accessing the elements of
  a tensor in large contiguous blocks. An example
  use case might be tiled matrix multiplication,
  where sub-blocks of the operands are loaded into
  cache or shared memory and reused many times
  before the next tile is fetched.

These are just a few examples, and there are many
other access patterns that might be relevant
depending on the specific use case.
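
Two of these patterns, gather-style random access
and block access, look like this in plain Rust
(illustrative code only):

```rust
// Gather (random access): elements are pulled out in an
// order dictated by an index list, as in an embedding
// lookup. Block access: a contiguous tile of a matrix is
// processed as a unit, as in tiled matrix multiplication.
fn main() {
    let table: Vec<f32> = (0..10).map(|i| i as f32 * 10.0).collect();

    // Random access driven by an arbitrary index list.
    let indices = [7usize, 2, 2, 9, 0];
    let gathered: Vec<f32> = indices.iter().map(|&i| table[i]).collect();
    println!("gathered = {:?}", gathered);

    // Block access: sum the top-left 2x2 tile of a 4x4
    // row-major matrix.
    let (n, tile) = (4usize, 2usize);
    let m: Vec<f32> = (0..n * n).map(|i| i as f32).collect();
    let mut block_sum = 0.0f32;
    for r in 0..tile {
        for c in 0..tile {
            block_sum += m[r * n + c];
        }
    }
    println!("top-left {}x{} block sum = {}", tile, tile, block_sum);
}
```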

18> do simd instructions apply to gpu
architectures? or are there other instructions
involved?

SIMD (Single Instruction, Multiple Data)
instructions in the usual sense (SSE/AVX on x86,
NEON on ARM) are a CPU feature, executed by CPU
cores. GPUs take a related but different approach,
usually called SIMT (Single Instruction, Multiple
Threads): threads are grouped into warps or
wavefronts that execute the same instruction in
lockstep, so the hardware behaves like very wide
SIMD, but the programmer writes ordinary scalar
code for each thread. GPUs also have their own
specialized instructions, such as packed
half-precision math and dedicated matrix units,
which are optimized for the GPU architecture and
can provide significant speedups for certain types
of computations. So the idea behind SIMD carries
over, but the instruction sets and the programming
model on a GPU differ from CPU SIMD.

19> what are these gpu instructions called?

There is no single name, because each vendor
defines its own instruction set. On NVIDIA GPUs,
CUDA kernels are compiled to PTX, a virtual
instruction set, which the driver lowers to the
native SASS instructions of a particular chip; AMD
GPUs use the GCN/RDNA instruction sets. In
graphics contexts these are often referred to
generically as shader or compute shader
instructions. All of them are designed for
massively parallel, floating-point-heavy workloads
and differ from the SIMD extensions found on CPUs.

20> could you go into more detail?

Sure! GPUs use a variety of specialized
instructions and architectures that are optimized
for parallel processing of large amounts of data.

The best-known platform for NVIDIA GPUs is CUDA
(Compute Unified Device Architecture). CUDA is not
itself an instruction set; it is a parallel
programming model and toolchain in which kernels
are compiled to the PTX virtual instruction set
and then to the native machine code of a specific
GPU. It also exposes the GPU's memory hierarchy
(registers, shared memory, global memory) to the
programmer.

OpenCL (Open Computing Language) plays a similar
role as a vendor-neutral standard for programming
parallel devices such as GPUs, CPUs, and
FPGAs. OpenCL kernels are likewise compiled down
to each vendor's native instructions, with
built-in math and vector operations designed to
exploit the parallelism of the hardware.

There are also other GPU compute platforms, such
as AMD's ROCm (Radeon Open Compute) and Intel's
oneAPI, which target their vendors' GPUs and other
accelerators for high-performance computing tasks.