hawk-data 0.2.3

Modern data analysis tool for structured data (JSON, YAML, CSV)
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.2] - 2025-07-18

### 🎉 New Features

#### Logical Operations

- **NOT operator support**: Added `not` operator for negating conditions in `select()` statements

  - Syntax: `select(not (.condition))` - parentheses are required for clarity
  - Works with all comparison operators and string operations
  - Examples: `select(not (.age > 65))`, `select(not (.email | contains("spam")))`

- **OR operator support**: Added pipe-delimited pattern matching for OR conditions
  - Syntax: `select(.field | contains("pattern1|pattern2|pattern3"))`
  - Multi-pattern matching in a single operation
  - Examples: `select(.level | contains("ERROR|FATAL"))`, `select(.role | contains("admin|manager|supervisor"))`

#### Array Slicing Operations

- **Python-style slicing**: Complete slice notation support for arrays and string split results

  - Basic slicing: `.[start:end]`, `.[start:]`, `.[:end]`, `.[:]`
  - Negative index support: `.[-5:]`, `.[:-3]`, `.[-10:-5]`
  - Field-specific slicing: `.users[0:10]`, `.logs[-100:]`

- **String split slicing**: Advanced slicing support for string split operations

  - Direct slicing after split: `split(",")[1:3]`, `split("/")[-1]`
  - Complex path processing: `split("/")[1:-1]` for middle path components
  - CSV column extraction: `split(",")[2:5]` for specific column ranges

- **Negative indexing**: Support for negative array indices
  - Last element access: `.array[-1]`, `.users[-1]`
  - Reverse indexing: `.array[-3]` for third from last
  - Compatible with all array operations

#### Enhanced Filtering Capabilities

- **Complex logical combinations**: Combine NOT and OR operations for sophisticated filtering

  - Example: `select(not (.status | contains("deleted|suspended|inactive")))`
  - Multi-condition filtering: `select(not (.type | contains("debug|trace|verbose")))`

- **Pattern-based exclusion**: Exclude multiple patterns efficiently
  - File filtering: `select(not (.filename | contains(".tmp|.bak|.swp")))`
  - Content filtering: `select(not (. | contains("DEBUG|INFO|TRACE")))`

### 🔧 Improvements

#### Enhanced String Operations

- **Improved split operations**: Better integration with slicing for complex text processing
- **Optimized pattern matching**: More efficient OR pattern processing with pipe-delimited syntax
- **Better error handling**: Clearer error messages for logical operation syntax errors

#### Performance Optimizations

- **Slice operation efficiency**: Optimized memory usage for large array slicing operations
- **Pattern matching performance**: Improved performance for multi-pattern OR operations
- **Logical operation caching**: Better performance for complex NOT/OR combinations

#### Documentation and Examples

- **Comprehensive logical operation examples**: Real-world use cases for NOT and OR operations
- **Slicing operation guide**: Complete reference for all slicing capabilities
- **Advanced filtering patterns**: Examples combining multiple logical operations

### 📊 New Use Cases Enabled

#### Advanced Log Analysis

```bash
# Exclude multiple log levels efficiently
hawk '. | select(not (.level | contains("DEBUG|INFO|TRACE")))' app.log

# Get recent critical errors only
hawk '.logs[-1000:] | select(.level | contains("ERROR|FATAL|CRITICAL"))' system.log

# Extract specific time ranges with slicing
hawk '.logs[] | .timestamp | split("T")[0] | split("-")[1:3]' logs.json
```

#### Sophisticated Data Filtering

```bash
# Filter active users excluding test accounts
hawk '.users[] | select(not (.email | contains("test|demo|temp")))' users.json

# Find high-priority items excluding archived
hawk '.items[] | select(.priority | contains("high|urgent")) | select(not (.status | contains("archived|deleted")))' items.json

# Process middle sections of arrays
hawk '.data[100:200] | select(not (.type | contains("noise|test")))' dataset.json
```

#### Complex Text Processing

```bash
# Extract domain names efficiently
hawk '.urls[] | split("://")[1] | split("/")[0]' urls.txt

# Get file paths without extension
hawk '.files[] | .path | split(".")[:-1] | join(".")' filelist.json

# Process CSV columns with exclusions
hawk -t '. | split(",")[2:8] | select(not (.[0] | contains("null|empty|N/A")))' data.csv
```

#### Advanced Data Analysis Workflows

```bash
# Multi-step filtering with slicing
hawk '.events[0:5000] | select(not (.type | contains("debug|trace"))) | group_by(.service) | count' events.json

# Recent data analysis with pattern exclusion
hawk '.metrics[-500:] | select(not (.source | contains("test|staging"))) | avg(.value)' metrics.json

# Complex field extraction with logical operations
hawk '.users[] | select(.role | contains("admin|manager")) | select(not (.status | contains("inactive|suspended"))) | count' users.json
```

### 🛠️ Technical Improvements

#### Filter Module Enhancements

- **New function `parse_not_condition_with_parentheses`**: Robust NOT operator parsing with mandatory parentheses
- **Enhanced `apply_filter_with_string_operations`**: Support for NOT operator in string operation pipelines
- **Improved error handling**: Better error messages for missing parentheses and invalid syntax

#### Slicing Infrastructure

- **Universal slicing support**: `apply_universal_slice_operation` handles all data types
- **Negative index processing**: `parse_index_with_negative` for Python-style negative indexing
- **Data structure detection**: `detect_data_structure` for intelligent slicing behavior

#### Pattern Matching Optimization

- **OR pattern preprocessing**: Efficient pipe-delimited pattern parsing
- **Multi-pattern contains operations**: Optimized string matching for multiple patterns
- **Regex-free implementation**: Fast pattern matching without regex overhead

### 🔄 Breaking Changes

None. This release is fully backward compatible with v0.2.x.

### 🐛 Bug Fixes

- Fixed slice boundary checking for out-of-range indices
- Improved pattern matching edge cases with empty patterns
- Enhanced error reporting for malformed logical operations
- Fixed memory usage issues with large slice operations

### 📖 Documentation Updates

- **Complete logical operations reference**: Documentation for NOT and OR operators
- **Comprehensive slicing guide**: All slicing capabilities with examples
- **Advanced filtering patterns**: Real-world use case examples
- **Performance best practices**: Guidelines for efficient query construction

### 🚀 Migration Guide

#### For users upgrading from v0.2.1:

All existing queries continue to work without changes. New features are additive:

**New NOT operator usage:**

```bash
# Old approach (still works)
hawk '.users[] | select(.age <= 65)' users.json

# New approach with NOT operator
hawk '.users[] | select(not (.age > 65))' users.json
```

**New OR operator usage:**

```bash
# Old approach with multiple queries
hawk '.logs[] | select(.level == "ERROR")' logs.json
hawk '.logs[] | select(.level == "FATAL")' logs.json

# New approach with OR operator
hawk '.logs[] | select(.level | contains("ERROR|FATAL"))' logs.json
```

**New slicing capabilities:**

```bash
# Get last 10 users (new)
hawk '.users[-10:]' users.json

# Get middle section of data (new)
hawk '.data[100:200]' data.json

# Extract filename from path (new)
hawk '.files[] | .path | split("/")[-1]' files.json
```

## [0.2.1] - 2024-07-16

### 🐛 Bug Fixes

- Fixed single object field access (e.g., `.Parameters` in CloudFormation templates)
- Corrected info display for single objects ("Single Object" vs "Object Array")
- Enhanced support for YAML/JSON single object files

### 🔧 Improvements

- Better error messages for field access
- Improved CloudFormation, Docker Compose, Kubernetes manifest support

## [0.2.0] - 2025-07-16

### 🎉 Major Features Added

#### Plain Text Support

- **Universal file format support**: Now processes plain text files, log files, configuration files, and any text-based data
- **Automatic format detection**: Intelligently detects JSON, YAML, CSV, and plain text files
- **Unified query syntax**: Same query language works across all supported formats
- **Text-as-array processing**: Each line becomes a string element in an array for consistent processing

#### String Operations

- **Complete string manipulation suite**: `upper`, `lower`, `trim`, `trim_start`, `trim_end`
- **String analysis functions**: `length`, `reverse`
- **Pattern matching**: `contains(pattern)`, `starts_with(pattern)`, `ends_with(pattern)`
- **Text transformation**: `replace(old, new)`, `substring(start, length)`
- **String parsing**: `split(delimiter)` to convert strings to arrays
- **Array joining**: `join(delimiter)` to convert arrays back to strings

#### Enhanced map() Function

- **Data transformation pipeline**: Transform data elements with chained string operations
- **Type-safe operations**: Proper error handling for incompatible data types
- **Complex pipelines**: Support for multi-step transformations like `map(. | trim | upper | replace("old", "new"))`

#### Statistical Functions

- **Descriptive statistics**: `median`, `stddev` (standard deviation)
- **Data manipulation**: `unique` (remove duplicates), `sort` (sort values)
- **Array operations**: `length` for counting elements
- **Field-specific operations**: All statistical functions support field specification (e.g., `median(.price)`)

#### Colored Output

- **Automatic TTY detection**: Colors in terminal, plain text when piped or redirected
- **Beautiful syntax highlighting**:
  - Table headers in blue with bold formatting
  - Numbers in green
  - Boolean values in yellow
  - Null values in gray
  - JSON syntax highlighting with colored keys and values
- **Environment variable support**: Respects `NO_COLOR` environment variable
- **Multiple output formats**: Enhanced table, JSON, and list outputs with appropriate coloring

### 🔧 Improvements

#### Enhanced Error Handling

- **Detailed error messages**: Context-aware error reporting with specific field and operation information
- **Type-safe operations**: Better validation of operations on different data types
- **Pipeline debugging**: Improved error location reporting in complex query pipelines

#### Better File Format Detection

- **Robust detection algorithms**: Improved heuristics for distinguishing between formats
- **Edge case handling**: Better support for malformed or ambiguous files
- **Fallback mechanisms**: Graceful degradation to text processing when format detection fails

#### Performance Optimizations

- **Memory-efficient processing**: Optimized data structures for large datasets
- **Faster pipeline execution**: Improved query parsing and execution engine
- **Reduced startup time**: Optimized initialization and dependency loading

#### Pipeline Processing Improvements

- **Parentheses-aware parsing**: Proper handling of nested operations like `map(. | contains("text") | not)`
- **Complex query support**: Better support for multi-level operations and transformations
- **Operation chaining**: Improved reliability of long pipeline chains

### 📊 New Use Cases Enabled

#### Log File Analysis

```bash
# Extract error logs with timestamps
hawk '. | select(. | contains("ERROR")) | map(. | substring(0, 19))' app.log

# Count log levels
hawk '. | map(. | split(" ")[2]) | group_by(.) | count' application.log

# Find unique IP addresses
hawk '. | map(. | split(" ")[0]) | unique | sort' access.log
```

#### Text Data Processing

```bash
# Clean and normalize text
hawk '. | map(. | trim | lower)' names.txt

# Extract file extensions
hawk '. | map(. | split(".") | last)' filelist.txt

# Statistical text analysis
hawk '. | map(. | split(" ") | length) | median' documents.txt
```

#### Data Cleaning and Normalization

```bash
# Email normalization
hawk '.users[] | map(.email | lower | trim)' users.csv

# Complex string transformations
hawk '.products[] | map(.name | replace("_", " ") | upper)' inventory.json

# Data validation and cleaning
hawk '.records[] | select(.id | length == 8) | map(.status | upper)' data.csv
```

#### Advanced Analytics

```bash
# Statistical analysis
hawk '.measurements[] | group_by(.sensor) | stddev(.temperature)' sensor_data.json

# Median calculations
hawk '.sales[] | group_by(.region) | median(.amount)' sales_data.csv

# Unique value analysis
hawk '.users[] | unique(.department) | sort' employee_data.json
```

### 🛠️ Technical Improvements

#### New Dependencies

- `termcolor ^1.4`: For colored output support
- `is-terminal ^0.4`: For TTY detection

#### Code Architecture

- **New module `string_ops`**: Centralized string operation handling
- **New module `stats_ops`**: Statistical function implementations
- **Enhanced `filter.rs`**: Improved pipeline operation handling
- **Updated `output.rs`**: Comprehensive colored output support
- **Improved `setup.rs`**: Better file format detection and text processing

#### Testing

- **Comprehensive test suite**: Added tests for all new string operations
- **Statistical function testing**: Validation of median, stddev, and other statistical operations
- **Integration testing**: End-to-end testing of complex pipeline operations
- **Edge case coverage**: Testing of malformed inputs and error conditions

### 🔄 Breaking Changes

None. This release is fully backward compatible with v0.1.x.

### 📦 Migration Guide

No migration required. All existing queries and workflows continue to work unchanged.

### 🐛 Bug Fixes

- Fixed pipeline parsing issues with complex nested operations
- Improved CSV type inference accuracy
- Enhanced error reporting for malformed queries
- Fixed memory usage issues with large datasets

### 📖 Documentation Updates

- **Comprehensive README update**: Added extensive documentation for new features
- **String operations guide**: Complete reference for all string manipulation functions
- **Statistical functions documentation**: Usage examples and parameter descriptions
- **Text processing examples**: Real-world use cases for log analysis and text processing
- **Enhanced query syntax reference**: Updated with all new operations and examples

### 🙏 Acknowledgments

- Community feedback on string processing needs
- Performance suggestions from early adopters
- Documentation improvements from user contributions

## [0.1.0] - 2024-07-12

### 🎉 Initial Release

#### Core Features

- **Multi-format support**: JSON, YAML, CSV parsing and processing
- **Pandas-like query language**: Intuitive syntax for data analysis
- **Field access and navigation**: Deep nested field access with array expansion
- **Filtering operations**: `select()` with comparison operators
- **Aggregation functions**: `count`, `sum`, `avg`, `min`, `max`
- **Grouping operations**: `group_by()` with aggregation support
- **Multiple output formats**: Table, JSON, list with automatic format detection

#### Technical Foundation

- **Rust implementation**: Fast, memory-safe data processing
- **serde_json integration**: Robust JSON parsing and manipulation
- **Type-aware processing**: Intelligent handling of numbers, strings, booleans
- **Error handling**: Comprehensive error reporting with thiserror
- **CLI interface**: User-friendly command-line interface with clap

#### Supported Operations

- Field access: `.field`, `.array[0]`, `.array[]`, `.nested.field`
- Filtering: `select(.field > value)`, `select(.field == "value")`
- Aggregation: `sum(.field)`, `avg(.field)`, `min(.field)`, `max(.field)`, `count`
- Grouping: `group_by(.field)` with aggregation support
- Info: `. | info` for data structure exploration

#### Output Formats

- **Table format**: Structured table output for object arrays
- **JSON format**: Pretty-printed JSON output
- **List format**: Simple list output for array data
- **Auto format**: Intelligent format selection based on data structure

---

For more details about any release, please see the [GitHub releases page](https://github.com/kyotalab/hawk/releases).