emd 0.1.1

A library for computing the Earth Mover's Distance
PyEMD: Earth Mover's Distance for Python (and MATLAB)
=====================================================
by Gary Doran (<gary.doran@case.edu>)

Overview
--------
An efficient, accurate, easy-to-use EMD implementation in C with Python
wrapper. New bonus MATLAB wrapper also included.

Installation
------------
*Python:* this package can be installed in two ways (the easy way):

    # If needed:
    # pip install numpy
    # pip install scipy
    pip install -e git+https://github.com/garydoranjr/pyemd.git#egg=pyemd

or by running the setup file manually

    git clone [the url for pyemd]
    cd pyemd
    python setup.py install

Note the code requires Python 2 (Python 3 is not supported) and depends on the
`numpy` and `scipy` packages. So have those installed first. The build will
likely fail if it can't find them. For more information, see:

 + [NumPy]http://www.numpy.org/: Library for efficient matrix math in Python
 + [SciPy]http://www.scipy.org/: Library for more MATLAB-like functionality

*MATLAB:* clone the repository and `cd` to the `matlab` subdirectory. Either
set the `MATLABDIR` environment variable, or edit the first line of the Makefile
to set the path to the desired MATLAB installation, and then run `make`.

After the MEX file has compiled, add the `matlab` subdirectory to the MATLAB
path (e.g., by using the `addpath` command in MATLAB).

Why?
----
Several Python wrappers for C-based EMD implementations already exist, so why is
another one necessary? There are two popular alternative approaches, each with
their limitations:

  - *Solve a LP with GLPK:* Since the transportation problem is a special case
    of the general LP formulation, it can be solved more efficiently and with
    much less use of memory. This implementation is approximately 7-8 times
    faster than a GLPK-based solution, and uses about 500 MB of memory when the
    GLPK-based solution uses over 10 GB before it crashes my machine.  These
    figures are from problems in which samples of size ~1000 with ~100 features
    are compared using the EMD for the multiple-instance learning problems I
    study.

  - *Wrap Yossi Rubner's Implementation:* There exist several wrappers of
    [Yossi Rubner's EMD code]http://robotics.stanford.edu/~rubner/emd/default.htm,
    the most popular of which is in the OpenCV library. The
    first limitation of this code is the use of single-precison versus
    double-precision floating point numbers.  Another issue is a hard-coded
    `MAX_SIG_SIZE`, which limits the size of the samples that can be used in
    the EMD computation.

PyEMD is a more "Pythonic" EMD implementation than other wrappers. Only the
minimal amount of computation is done in C (the core transportation algorithm).
This means that the distance computation is done in Python using the efficient
SciPy library, and a custom, precomputed distance matrix can be easily provided.

Usage
-----
The EMD implementation can be used simply in Python as:

    >>> from emd import emd
    >>> emd(X, Y)

where `X` and `Y` are each n-dimensional samples of points. Each argument should
be a NumPy array with n columns, but possibly different numbers of rows.

If the sample is weighted, the weights can be specified with optional
`X_weights` and `Y_weights` arguments. By default, uniform weights are used.
Because the EMD is a distance between probability measures, the *total weights
of each of the two samples must sum to 1*.

By default, the Euclidean distance between points is used. However, an optional
argument `distance` takes a string that specifies a valid distance type accepted
by the `scipy.spatial.cdist` function. Alternatively, if
`distance='precomputed'`, then a precomputed distance matrix is expected to be
supplied to the optional argument `D`.

Finally, PyEMD can also return the flows between the two samples that are used
to compute the distance. If the `return_flows` argument is `True`, then two
arguments, the distance an array of the flows, are returned.

See the docstring for a more formal description of the functionality. In MATLAB,
the functionality is essentially the same; see the help for details.

Citing
------
If you have used PyEMD for your research and would like to cite it, you can use
the following BibTeX entry:

    @Misc{,
      author =    {Gary Doran},
      title =     {{PyEMD}: Earth Mover's Distance for {Python}},
      year =      {2014--},
      url = "https://github.com/garydoranjr/pyemd",
      note = {[Online; accessed <today>]}
    }

Questions and Issues
--------------------
If you find any bugs or have any questions about this code, please create an
issue on [GitHub](https://github.com/garydoranjr/pyemd/issues), or contact Gary
Doran at <gary.doran@case.edu>. Of course, I cannot guarantee any support for
this software.