# Test Data
## Downloading Test Data
To download the test data files, you'll need Python with the `requests` and `tqdm` packages installed. Run:
```bash
pip install requests tqdm
python download_testdata.py
```
This script will download CAIDA trace files (0.dat through 10.dat) that are used for testing and benchmarking the Jigsaw Sketch implementation. The files will be downloaded to the current directory.
## Data Description
The downloaded files contain network flow data in binary format from the CAIDA dataset. Each file contains a sequence of network flow records, which are used to test and benchmark the Jigsaw Sketch's ability to identify elephant flows (high-volume network flows) in high-speed networks.
The data files (0.dat through 10.dat) are sourced from the CAIDA dataset and contain anonymized network flow records. Each record represents a network flow with its associated metadata, making them ideal for testing the Jigsaw Sketch's performance in identifying high-volume flows in network traffic analysis.
## Data Format
- Each file (0.dat through 10.dat) contains IPv4 5-tuple flow records
- Each record is 13 bytes with fields in this order:
- Source IP (4 bytes)
- Source Port (2 bytes)
- Destination IP (4 bytes)
- Destination Port (2 bytes)
- Protocol (1 byte)
Example record:
```hex
99 c1 8a e6 01 bb db 0a ab 60 9a 9a 06
[ src_ip ] [sp] [ dst_ip ] [dp] [p]
```
## Data Source
These files are from the CAIDA dataset, provided by [DHS](https://github.com/ZeBraHack0/DHS/tree/main/data) for testing sketch performance. For other uses of CAIDA datasets, please register with CAIDA and apply for access.
## Verification
After downloading, you should have:
- 0.dat through 10.dat
- Each file contains binary 5-tuple IPv4 flow records
- Files are used by the flow_processor example to test sketch performance
### File Format Verification
You can verify the file format using hexdump:
```bash
## Use Synthetic Data
Alternatively, you can generate synthetic data with the script
```bash
python generate.py
```