# What's in my big data?
This repository contains the code for running What's In My Big Data (WIMBD), and accompanies our [paper](http://arxiv.org/abs/2310.20707) of the same name.

> What is WIMBD?
WIMBD is composed of two components:
1. A set of tools for analyzing and revealing the content of large-scale datasets
2. A set of analyses we apply to those datasets, using the aforementioned tools
*WIMBD tools* consist of two parts:
1. Count
2. Search
Count follows a map-reduce paradigm: the task is divided into smaller chunks, an operation is applied to each chunk (e.g., extracting the domain from a URL), and the per-chunk counts are then aggregated.
We provide two implementations. One uses Python functions (e.g., for [domain counts](wimbd/url_counts/)) and is easily extensible and scalable;
the other is a Rust CLI for faster processing. The [Rust implementation](wimbd/src/) covers the summary statistics (presented in Table 2 in the paper) such as the corpus size, number of tokens, etc. In addition, it approximates the most and least common $n$-grams using counting Bloom filters.
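As a rough illustration of the map-reduce pattern (this is not the repository's actual code; the chunking and extraction logic are simplified), a domain counter in Python might look like this:
```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from urllib.parse import urlparse


def map_chunk(urls):
    """Map step: extract the domain of each URL and count within the chunk."""
    return Counter(urlparse(u).netloc for u in urls)


def count_domains(chunks):
    """Reduce step: merge the per-chunk counters into a global count."""
    total = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(map_chunk, chunks):
            total.update(partial)
    return total


if __name__ == "__main__":
    chunks = [
        ["https://example.com/a", "https://example.com/b"],
        ["https://allenai.org/wimbd"],
    ]
    print(count_domains(chunks).most_common(2))
```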
In practice, we implement search using [elasticsearch](https://www.elastic.co/). We index 5 of the corpora we consider, and provide both a UI and programmatic access to them.
We built wrappers around the ES API that provide `count` and `extract` functionality. More detailed documentation is available [here](./wimbd/es/README.md).
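For illustration only (this uses the official `elasticsearch` Python client directly, not the wimbd wrappers), counting documents that contain an exact phrase could look like the following; the host, API key, index name, and field name are all placeholders:
```python
from elasticsearch import Elasticsearch

# Placeholder connection details; use the credentials you receive from us.
es = Elasticsearch("https://YOUR-ES-HOST:9200", api_key="YOUR-API-KEY")

# Count documents whose text field contains an exact phrase.
# The index name ("c4") and field name ("text") are hypothetical.
resp = es.count(
    index="c4",
    query={"match_phrase": {"text": "deep learning"}},
)
print(resp["count"])
```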
## Getting started
There are two distinct parts of this toolkit: a Python library of functions and a Rust-based CLI.
### Using the Python library
#### Create a Python environment
```sh
conda create -n wimbd python=3.9
conda activate wimbd
pip install -r requirements.txt
export PYTHONPATH="${PYTHONPATH}:/PATH/TO/wimbd/"
```
As an example, the following command computes domain counts per token (Section 4.2.2 in the paper):
```sh
bash wimbd/url_per_tok_counts/run.sh /PATH-TO/c4/en/c4-train.* > data/benchmark/benchmark_url_tok_c4.jsonl
```
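The output is written as one JSON record per line. Here is a quick way to inspect the largest entries; note that the record schema in the snippet is an assumption for illustration, so check your output file for the actual field names:
```python
import json

# Load the map-reduce output, one JSON record per line.
# NOTE: the {"domain": ..., "count": ...} schema below is assumed for
# illustration; inspect the file to confirm the real keys.
with open("data/benchmark/benchmark_url_tok_c4.jsonl") as f:
    records = [json.loads(line) for line in f]

# Print the 10 domains with the highest counts.
records.sort(key=lambda r: r["count"], reverse=True)
for r in records[:10]:
    print(r["domain"], r["count"])
```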
#### Run scheme counts
```sh
./wimbd/scheme_counts/run.sh /PATH-TO/laion2B-en/*.gz > data/scheme_laion2B-en.jsonl
```
This will run the map-reduce scripts and dump the results into a file.
### Using the Rust CLI
This part of the repository is written in Rust, so first you'll have to [install the Rust toolchain](https://www.rust-lang.org/tools/install). There's a simple one-liner for that:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
Then you can either install the latest release from [crates.io](https://crates.io/crates/wimbd) directly or install from source.
To install from `crates.io`, run:
```bash
cargo install wimbd
```
Or to install from source, run:
```bash
make release DIR=./bin
```
(make sure to change `DIR` to a directory of your choice that's on your `PATH`)
And now you should be able to run the `wimbd` CLI:
```bash
wimbd --help
```
For example, find the top 20 3-grams in some c4 files with:
```bash
wimbd topk \
/PATH-TO/c4/en/c4-train.01009-of-01024.json.gz \
/PATH-TO/c4/en/c4-train.01010-of-01024.json.gz \
-n 3 \
-k 20 \
--size 16GiB
```
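For intuition on the approximation: each $n$-gram increments a few hashed counters in a fixed-size array, and its estimated count is the minimum of those counters, which caps memory (the `--size` flag above) at the cost of possible overestimates. Below is a toy Python sketch of this idea; it is illustrative only, and the actual Rust implementation lives in [wimbd/src/](wimbd/src/):
```python
import hashlib


class CountingBloomFilter:
    """Toy counting Bloom filter: one counter array, several hash functions."""

    def __init__(self, size=2**20, num_hashes=4):
        self.counts = [0] * size
        self.size = size
        self.num_hashes = num_hashes

    def _indexes(self, item):
        # Derive num_hashes array positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.blake2b(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.size

    def add(self, item):
        for ix in self._indexes(item):
            self.counts[ix] += 1

    def estimate(self, item):
        # The minimum across the item's counters bounds its true count.
        return min(self.counts[ix] for ix in self._indexes(item))


cbf = CountingBloomFilter()
tokens = "the quick brown fox jumps over the quick brown dog".split()
for i in range(len(tokens) - 2):
    cbf.add(" ".join(tokens[i : i + 3]))
print(cbf.estimate("the quick brown"))  # 2 (may overestimate, never under)
```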
## Search
Due to the nature of Elasticsearch, we cannot release the API keys publicly.
If you are interested in using our Elasticsearch indices, please fill out this [form](https://forms.gle/Mk9uwJibR9H4hh9Y9), and we'll get back to you as soon as we can.
## Issues
If there's an issue with the code, or you have questions, feel free to [open an issue](https://github.com/allenai/wimbd/issues/new/choose)
or send a [PR](https://github.com/allenai/wimbd/compare).