//! ## Table of contents
//!
//! + [DataFrame](#dataframe)
//! + [Combinators](#combinators)
//! + [Collection](#collection)
//! + [Mixed Types](#mixed-types)
//!
//! ## DataFrame
//!
//! Utah is a dataframe crate backed by [ndarray](http://github.com/bluss/rust-ndarray) for type-conscious, tabular data manipulation with an expressive, functional interface.
//!
//! The dataframe allows users to access, transform, and compute over two-dimensional data that may or may not have mixed types.
//!
//! Please read [this](http://suchin.co/2016/12/27/Introducing-Utah) blog post for an in-depth introduction to the internals of this project.
//!
//! ### Creating a dataframe
//!
//! There are multiple ways to create a dataframe. The most straightforward is the builder pattern:
//!
//! ```ignore
//! use utah::prelude::*;
//! let c = arr2(&[[2., 6.], [3., 4.], [2., 1.]]);
//! let mut df: DataFrame<f64> = DataFrame::new(c)
//!     .columns(&["a", "b"]).unwrap()
//!     .index(&["1", "2", "3"]).unwrap();
//! ```
//!
//! There's also a `dataframe!` macro which you can use to create new dataframes on the fly.
//!
//! Finally, you can import data from a CSV:
//!
//! ```ignore
//! use utah::prelude::*;
//! let file_name = "test.csv";
//! let df: Result<DataFrame<f64>> = DataFrame::read_csv(file_name);
//! ```
//!
//! Note that utah's `ReadCSV` trait is fairly barebones right now.
//!
//! ## Combinators
//!
//! The user interacts with Utah dataframes by chaining combinators, which are essentially iterator extensions (or _adapters_) over the original dataframe. This means that each operation is lazy by default.
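//! As a quick sketch of the `dataframe!` macro mentioned above: it pairs column labels with `col!` arrays, following the same syntax as the `impute` example later in this document (the labels and values here are purely illustrative):
//!
//! ```ignore
//! use utah::prelude::*;
//! // Columns "a" and "b", each holding three f64 values.
//! let df: DataFrame<f64> = dataframe!(
//!     {
//!         "a" => col!([2., 3., 2.]),
//!         "b" => col!([6., 4., 1.])
//!     });
//! ```
//!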
//! You can chain as many combinators as you want, but nothing happens until you invoke a collection operation like `as_df`, which allocates the results into a new dataframe, or `as_matrix`, which allocates the results into an ndarray matrix.
//!
//! ### Transform combinators
//!
//! Transform combinators change the shape of the data you're working with. Combinators in this class include `select`, `remove`, and `append`.
//!
//! ```ignore
//! use utah::prelude::*;
//! let a = arr2(&[[2, 7], [3, 4], [2, 8]]);
//! let df: DataFrame<i32> = DataFrame::new(a).index(&["1", "2", "3"]).unwrap().columns(&["a", "b"]).unwrap();
//! let res = df.select(&["1", "3"], UtahAxis::Row);
//! ```
//!
//! ### Process combinators
//!
//! Process combinators change the original data you're working with. Combinators in this class include `impute` and `mapdf`. `impute` replaces missing values of a dataframe with the mean of the corresponding column. Note that these operations require a `DataFrameMut`.
//!
//! ```ignore
//! use utah::prelude::*;
//! let mut df: DataFrameMut<f64> = dataframe!(
//!     {
//!         "a" => col!([NAN, 3., 2.]),
//!         "b" => col!([2., NAN, 2.])
//!     });
//! let res = df.impute(ImputeStrategy::Mean, UtahAxis::Column);
//! ```
//!
//! ### Interact combinators
//!
//! Interact combinators operate on interactions between dataframes, and generally take at least two dataframe arguments. Combinators in this class include `inner_left_join`, `outer_left_join`, `inner_right_join`, `outer_right_join`, and `concat`.
//!
//! ```ignore
//! use utah::prelude::*;
//! let a: DataFrame<f64> = dataframe!(
//!     {
//!         "a" => col!([NAN, 3., 2.]),
//!         "b" => col!([2., NAN, 2.])
//!     });
//! let b: DataFrame<f64> = dataframe!(
//!     {
//!         "b" => col!([NAN, 3., 2.]),
//!         "c" => col!([2., NAN, 2.])
//!     });
//! let res = a.inner_left_join(b).as_df()?;
//! ```
//!
//! ### Aggregate combinators
//!
//! Aggregate combinators reduce a chain of combinators to some result. They are usually the last operation in a chain, but don't have to be. Combinators in this class include `sumdf`, `mindf`, `maxdf`, `stdev` (standard deviation), and `mean`. Currently, aggregate combinators are not iterator collection operations, because they do not invoke an iterator chain. This may change in the future.
//!
//! ```ignore
//! use utah::prelude::*;
//! let a = arr2(&[[2.0, 7.0], [3.0, 4.0], [2.0, 8.0]]);
//! let df = DataFrame::new(a);
//! let res = df.mean(UtahAxis::Row);
//! ```
//!
//! ### Chaining combinators
//!
//! The real power of combinators comes from the ability to chain them together into expressive transformations. You can do things like this:
//!
//! ```ignore
//! let result = df.df_iter(UtahAxis::Row)
//!     .remove(&["1"])
//!     .select(&["2"])
//!     .append("8", new_data.view())
//!     .inner_left_join(df_1)
//!     .sumdf()
//!     .as_df().unwrap();
//! ```
//!
//! Because we've built the chain on a row-wise dataframe iterator, each subsequent operation will only operate on the rows of the dataframe. If you want to operate on the columns, invoke a dataframe iterator with `UtahAxis::Column`.
//!
//! ## Collection
//!
//! There are many ways to access or store the result of your chained operations. Because each data transformation is just an iterator, we can naturally collect the output of the chained operations via `collect()` or a `for` loop:
//!
//! ```ignore
//! for x in df.concat(&df_1) {
//!     println!("{:?}", x)
//! }
//! ```
//!
//! But we also have an `AsDataFrame` trait, which dumps the output of chained combinators into a new dataframe, matrix, or array, so we can do something like the following:
//!
//! ```ignore
//! let maximum_values = df.concat(&df_1).maxdf(UtahAxis::Column).as_df()?;
//! ```
//!
//! ## Mixed Types
//!
//! Now, I mentioned in the beginning that most dataframes provide mixed types, and I wanted to provide similar functionality here.
//! In the module `utah::mixedtypes`, I've defined `InnerType`, which is an enum over the various types of data that can coexist in the same dataframe:
//!
//! ```ignore
//! pub enum InnerType {
//!     Float(f64),
//!     Int64(i64),
//!     Int32(i32),
//!     Str(String),
//!     Empty,
//! }
//! ```
//!
//! I've also defined `OuterType`, which is an enum over the various types of *axis labels* that can coexist:
//!
//! ```ignore
//! pub enum OuterType {
//!     Str(String),
//!     Int64(i64),
//!     Int32(i32),
//!     USize(usize),
//! }
//! ```
//!
//! With these wrappers, you can have `String`s and `f64`s in the same dataframe:
//!
//! ```ignore
//! use utah::prelude::*;
//! let file_name = "test.csv";
//! let df: Result<DataFrame<InnerType>> = DataFrame::read_csv(file_name);
//! ```

#![cfg_attr(nightly, test)]
#![cfg_attr(nightly, custom_derive)]
#![cfg_attr(nightly, stmt_expr_attributes)]
#![cfg_attr(nightly, specialization)]
#![recursion_limit = "1024"]

#[macro_use]
extern crate ndarray;
extern crate ndarray_rand;
extern crate rand;
#[cfg(nightly)]
extern crate test;
extern crate num;
#[macro_use]
extern crate error_chain;
extern crate itertools;
extern crate rustc_serialize;
extern crate csv;

pub mod combinators;
pub mod dataframe;
#[macro_use]
pub mod util;
mod implement;
pub mod mixedtypes;
mod bench;
#[macro_use]
mod tests;
pub mod prelude;