static USAGE: &str = r#"
Apply a series of transformation functions to a given CSV column. This can be used to
perform typical data-wrangling tasks and/or to harmonize some values, etc.
It has six subcommands:
* operations - 36 string, format, currency, regex & NLP operators.
* emptyreplace - replace empty cells with <--replacement> string.
* datefmt - Formats a recognized date column to a specified format using <--formatstr>.
* dynfmt - Dynamically constructs a new column from other columns using the <--formatstr> template.
* geocode - geocodes a WGS84 location against a static copy of the Geonames cities database.
* calcconv - parse and evaluate math expressions, with support for units and conversions.
OPERATIONS
Multiple operations can be applied, with the comma-delimited operation series
applied in order:
trim => Trim the cell
trim,upper => Trim the cell, then transform to uppercase
lower,simdln => Lowercase the cell, then compute the normalized
Damerau-Levenshtein similarity to --comparand
Operations support multi-column transformations. Just make sure the
number of transformed columns with the --rename option is the same. e.g.:
$ qsv apply operations trim,upper col1,col2,col3 -r newcol1,newcol2,newcol3 file.csv
It has 36 supported operations:
* len: Return string length
* lower: Transform to lowercase
* upper: Transform to uppercase
* squeeze: Compress consecutive whitespaces
* squeeze0: Remove whitespace
* trim: Trim (drop whitespace left & right of the string)
* ltrim: Left trim whitespace
* rtrim: Right trim whitespace
* mtrim: Trims --comparand matches left & right of the string (Rust trim_matches)
* mltrim: Left trim --comparand matches (Rust trim_start_matches)
* mrtrim: Right trim --comparand matches (Rust trim_end_matches)
* strip_prefix: Removes specified prefix in --comparand
* strip_suffix: Remove specified suffix in --comparand
* escape - escape (Rust escape_default)
* encode64: base64 encode
* decode64: base64 decode
* replace: Replace all matches of a pattern (using --comparand)
with a string (using --replacement) (Rust replace)
* regex_replace: Replace all regex matches in --comparand w/ --replacement.
* titlecase - capitalizes English text using Daring Fireball titlecase style
https://daringfireball.net/2008/05/title_case
* censor: profanity filter. Add additional comma-delimited profanities with --comparand.
* censor_check: check if profanity is detected (boolean).
Add additional comma-delimited profanities with -comparand.
* censor_count: count of profanities detected.
Add additional comma-delimited profanities with -comparand.
* round: Round numeric values to the specified number of decimal places using
Midpoint Nearest Even Rounding Strategy AKA "Bankers Rounding."
Specify the number of decimal places with --formatstr (default: 3).
* thousands: Add thousands separators to numeric values.
Specify the separator policy with --formatstr (default: comma). The valid policies are:
comma, dot, space, underscore, hexfour (place a space every four hex digits) and
indiancomma (place a comma every two digits, except the last three digits).
The decimal separator can be specified with --replacement (default: '.')
* currencytonum: Gets the numeric value of a currency. Supports currency symbols
(e.g. $,¥,£,€,֏,₱,₽,₪,₩,ƒ,฿,₫) and strings (e.g. USD, EUR, RMB, JPY, etc.).
Recognizes point, comma and space separators.
* numtocurrency: Convert a numeric value to a currency. Specify the currency symbol
with --comparand. Automatically rounds values to two decimal places. Specify
"euro" formatting (e.g. 1.000,00 instead of 1,000.00 ) by setting --formatstr to "euro".
Specify conversion rate by setting --replacement to a number.
* copy: Mark a column for copying
* simdl: Damerau-Levenshtein similarity to --comparand
* simdln: Normalized Damerau-Levenshtein similarity to --comparand (between 0.0 & 1.0)
* simjw: Jaro-Winkler similarity to --comparand (between 0.0 & 1.0)
* simsd: Sørensen-Dice similarity to --comparand (between 0.0 & 1.0)
* simhm: Hamming distance to --comparand. Num of positions characters differ.
* simod: Optimal String Alignment (OSA) Distance to --comparand.
* eudex: Multi-lingual sounds like --comparand (boolean)
Tested on English, Catalan, German, Spanish, Swedish and Italian dictionaries.
It supports all C1 letters (e.g. ü, ö, æ, ß, é, etc.) and takes their sound into account.
It should work on other European languages that use the Latin alphabet.
* sentiment: Normalized VADER sentiment score (English only - between -1.0 to 1.0).
* whatlang: Language Detection for 87 supported languages, with default confidence threshold
of 0.9, which can be overriden by assigning 0.0 to 1.0 to --comparand.
If language detection confidence is below the threshold, it will still show the best language
guess, followed by the confidence score, ending with a question mark.
If you want to always displays the confidence score, end the --comparand value with a
question mark (e.g. 0.9?)
https://github.com/greyblake/whatlang-rs/blob/master/SUPPORTED_LANGUAGES.md
Examples:
Trim, then transform to uppercase the surname field.
$ qsv apply operations trim,upper surname file.csv
Trim, then transform to uppercase the surname field and rename the column uppercase_clean_surname.
$ qsv apply operations trim,upper surname -r uppercase_clean_surname file.csv
Trim, then transform to uppercase the surname field and
save it to a new column named uppercase_clean_surname.
$ qsv apply operations trim,upper surname -c uppercase_clean_surname file.csv
Trim, then transform to uppercase the firstname and surname fields and
rename the columns ufirstname and usurname.
$ qsv apply operations trim,upper firstname,surname -r ufirstname,usurname file.csv
Trim parentheses & brackets from the description field.
$ qsv apply operations mtrim description --comparand '()<>' file.csv
Replace ' and ' with ' & ' in the description field.
$ qsv apply replace description --comparand ' and ' --replacement ' & ' file.csv
Extract the numeric value of the Salary column in a new column named Salary_num.
$ qsv apply operations currencytonum Salary -c Salary_num file.csv
Convert the USD_Price to PHP_Price using the currency symbol "PHP" with a conversion rate of 60.
$ qsv apply operations numtocurrency USD_Price -C PHP -R 60 -c PHP_Price file.csv
Base64 encode the plaintext_col column and save the encoded value into new column named encoded_col
and then decode it.
$ qsv apply operations encode plaintext_col -c encoded_col file.csv | qsv apply operations decode encode_col
Compute the Normalized Damerau-Levenshtein similarity of the neighborhood column to the string 'Roxbury'
and save it to a new column named dln_roxbury_score.
$ qsv apply operations lower,simdln neighborhood --comparand roxbury -c dln_roxbury_score boston311.csv
You can also use this subcommand command to make a copy of a column:
$ qsv apply operations copy col_to_copy -c col_copy file.csv
EMPTYREPLACE
Replace empty cells with <--replacement> string.
Non-empty cells are not modified. See the `fill` command for more complex empty field operations.
Examples:
Replace empty cells in file.csv Measurement column with 'None'.
$ qsv apply emptyreplace Measurement --replacement None file.csv
Replace empty cells in file.csv Measurement column with 'Unknown Measurement'.
$ qsv apply emptyreplace --replacement 'Unknown Measurement' file.csv
DATEFMT
Formats a recognized date column to a specified format using <--formatstr>.
See https://github.com/jqnatividad/belt/tree/main/dateparser#accepted-date-formats for
recognized date formats.
See https://docs.rs/chrono/latest/chrono/format/strftime/ for
accepted date formats for --formatstr.
Defaults to ISO 8601/RFC 3339 format when --formatstr is not specified.
Datefmt also supports multi-column formatting.
Examples:
Format dates in Open Date column to ISO 8601/RFC 3339 format:
$ qsv apply datefmt 'Open Date' file.csv
Format multiple date columns in file.csv to ISO 8601/RFC 3339 format:
$ qsv apply datefmt 'Open Date,Modified Date,Closed Date' file.csv
Format dates in OpenDate column using '%Y-%m-%d' format:
$ qsv apply datefmt OpenDate --formatstr '%Y-%m-%d' file.csv
Format multiple date columns using '%Y-%m-%d' format:
$ qsv apply datefmt OpenDate,CloseDate,ReopenDate --formatstr '%Y-%m-%d' file.csv
Get the week number for OpenDate and store it in the week_number column:
$ qsv apply dateformat OpenDate --formatstr '%V' --new-column week_number file.csv
Get the day of the week for several date columns and store it in the corresponding weekday columns:
$ qsv apply dateformat OpenDate,CloseDate --formatstr '%u' --rename Open_weekday,Close_weekday file.csv
DYNFMT
Dynamically constructs a new column from other columns using the <--formatstr> template.
The template can contain arbitrary characters. To insert a column value, enclose the
column name in curly braces, replacing all non-alphanumeric characters with underscores.
If you need to dynamically construct a column with more complex formatting requirements and
computed values, check out the py command to take advantage of Python's f-string formatting.
Examples:
Create a new column 'mailing address' from 'house number', 'street', 'city' and 'zip-code' columns:
$ qsv apply dynfmt --formatstr '{house_number} {street}, {city} {zip_code} USA' -c 'mailing address' file.csv
Create a new column 'FullName' from 'FirstName', 'MI', and 'LastName' columns:
$ qsv apply dynfmt --formatstr 'Sir/Madam {FirstName} {MI}. {LastName}' -c FullName file.csv
GEOCODE
Geocodes to the nearest city center point given a location column
[i.e. a column which contains a latitude, longitude WGS84 coordinate] against
an embedded copy of the Geonames city database.
The geocoded information is formatted based on --formatstr, returning
it in 'city-state' format if not specified.
Use the --new-column option if you want to keep the location column:
Examples:
Geocode file.csv Location column and set the geocoded value to a
new column named City.
$ qsv apply geocode Location --new-column City file.csv
Geocode file.csv Location column with --formatstr=city-state and
set the geocoded value a new column named City.
$ qsv apply geocode Location --formatstr city-state --new-column City file.csv
CALCCONV
Parse and evaluate math expressions into a new column, with support for units and conversions.
The math expression is built dynamically using the <--formatstr> template, similar to the DYNFMT
subcommand, with the addition that if the literal '<UNIT>' is found at the end of the template, the
inferred unit will be appended to the result.
For a complete list of supported units, constants, operators and functions, see https://docs.rs/cpc
Examples:
Do simple arithmetic:
$ qsv apply calcconv --formatstr '{col1} + {col2} * {col3}' --new-column result file.csv
With support for operators like % and ^:
$ qsv apply calcconv --formatstr '{col1} % 3' --new-column remainder file.csv
Convert from one unit to another:
$ qsv apply calcconv --formatstr '{col1}mb in gigabytes' -c gb file.csv
$ qsv apply calcconv --formatstr '{col1} Fahrenheit in Celsius" -c metric_temperature file.csv
Mix units and conversions are automatically done for you:
$ qsv apply calcconv --formatstr '{col1}km + {col2}mi in meters' -c meters file.csv
You can append the inferred unit at the end of the result by ending the expression with '<UNIT>':
$ qsv apply calcconv --formatstr '({col1} + {col2})km to light years <UNIT>' -c light_years file.csv
You can even do complex temporal unit conversions:
$ qsv apply calcconv --formatstr '{col1}m/s + {col2}mi/h in kilometers per h' -c kms_per_h file.csv
Use math functions - see https://docs.rs/cpc/latest/cpc/enum.FunctionIdentifier.html for list of functions:
$ qsv apply calcconv --formatstr 'round(sqrt{col1}^4)! liters' -c liters file.csv
Use percentages:
$ qsv apply calcconv --formatstr '10% of abs(sin(pi)) horsepower to watts' -c watts file.csv
And use very large numbers:
$ qsv apply calcconv --formatstr '{col1} Billion Trillion * {col2} quadrillion vigintillion' -c num_atoms file.csv
For more extensive examples, see https://github.com/jqnatividad/qsv/blob/master/tests/test_apply.rs.
Usage:
qsv apply operations <operations> [options] <column> [<input>]
qsv apply emptyreplace --replacement=<string> [options] <column> [<input>]
qsv apply datefmt [--formatstr=<string>] [options] <column> [<input>]
qsv apply dynfmt --formatstr=<string> [options] --new-column=<name> [<input>]
qsv apply geocode [--formatstr=<string>] [options] <column> [<input>]
qsv apply calcconv --formatstr=<string> [options] --new-column=<name> [<input>]
qsv apply --help
The <column> argument can be a list of columns for the operations and datefmt subcommands.
See 'qsv select --help' for the format details.
apply options:
-c, --new-column <name> Put the transformed values in a new column instead.
-r, --rename <name> New name for the transformed column.
-C, --comparand=<string> The string to compare against for replace & similarity operations.
Also used with numtocurrency operation to specify currency symbol.
-R, --replacement=<string> The string to use for the replace & emptyreplace operations.
Also used with numtocurrency operation to conversion rate.
--prefer-dmy Prefer to parse dates in dmy format. Otherwise, use mdy format.
Only used with the DATEFMT subcommand.
--keep-zero-time If a formatted date ends with "T00:00:00+00:00", keep the time
instead of removing it. Only used with the DATEFMT subcommand.
-f, --formatstr=<string> This option is used by several subcommands:
OPERATIONS:
numtocurrency
If set to "euro", will format the currency to use "." instead of ","
as separators (e.g. 1.000,00 instead of 1,000.00 )
thousands
The thousands separator policy to use. The available policies are:
comma, dot, space, underscore, hexfour (place a space every four
hex digits) and indiancomma (place a comma every two digits,
except the last three digits). (default: comma)
round
The number of decimal places to round to (default: 3)
DATEFMT: The date format to use. For formats, see
https://docs.rs/chrono/latest/chrono/format/strftime/
Default to ISO 8601 / RFC 3339 date & time format.
[default: %+]
DYNFMT: the template to use to construct a new column.
GEOCODE: the place format to use with the geocode subcommand.
The available formats are:
- 'city-state' (default) - e.g. Brooklyn, New York
- 'city-country' - Brooklyn, US
- 'city-state-country' | 'city-admin1-country' - Brooklyn, New York US
- 'city' - Brooklyn
- 'county' | 'admin2' - Kings County
- 'state' | 'admin1' - New York
- 'county-country' | 'admin2-country' - Kings County, US
- 'county-state-country' | 'admin2-admin1-country' - Kings County, New York US
- 'country' - US
-j, --jobs <arg> The number of jobs to run in parallel.
When not set, the number of jobs is set to the number of CPUs detected.
-b, --batch <size> The number of rows per batch to load into memory, before running in parallel.
[default: 50000]
Common options:
-h, --help Display this message
-o, --output <file> Write output to <file> instead of stdout.
-n, --no-headers When set, the first row will not be interpreted
as headers.
-d, --delimiter <arg> The field delimiter for reading CSV data.
Must be a single character. (default: ,)
-p, --progressbar Show progress bars. Not valid for stdin.
"#;
use std::str::FromStr;
use cached::proc_macro::cached;
use censor::{Censor, Sex, Zealous};
use cpc::{eval, units::Unit};
use data_encoding::BASE64;
use dynfmt::Format;
use eudex::Hash;
use indicatif::{ProgressBar, ProgressDrawTarget};
use log::debug;
use once_cell::sync::OnceCell;
use qsv_currency::Currency;
use qsv_dateparser::parse_with_preference;
use rayon::prelude::*;
use regex::Regex;
use reverse_geocoder::{Locations, ReverseGeocoder};
use serde::Deserialize;
use strsim::{
damerau_levenshtein, hamming, jaro_winkler, normalized_damerau_levenshtein, osa_distance,
sorensen_dice,
};
use strum_macros::EnumString;
use thousands::{policies, Separable, SeparatorPolicy};
use titlecase::titlecase;
use vader_sentiment::SentimentIntensityAnalyzer;
use whatlang::detect;
use crate::{
clitypes::CliError,
config::{Config, Delimiter},
regex_once_cell,
select::SelectColumns,
util, CliResult,
};
#[derive(EnumString)]
#[strum(ascii_case_insensitive)]
#[allow(non_camel_case_types)]
enum Operations {
Censor,
Censor_Check,
Censor_Count,
Copy,
Currencytonum,
Decode,
Encode,
Escape,
Eudex,
Len,
Lower,
Ltrim,
Mltrim,
Mrtrim,
Mtrim,
Numtocurrency,
Regex_Replace,
Replace,
Round,
Rtrim,
Sentiment,
Simdl,
Simdln,
Simhm,
Simjw,
Simod,
Simsd,
Squeeze,
Squeeze0,
Strip_Prefix,
Strip_Suffix,
Thousands,
Titlecase,
Trim,
Upper,
Whatlang,
}
#[derive(Deserialize, Debug)]
struct Args {
arg_column: SelectColumns,
cmd_operations: bool,
arg_operations: String,
cmd_datefmt: bool,
cmd_dynfmt: bool,
cmd_emptyreplace: bool,
cmd_geocode: bool,
cmd_calcconv: bool,
arg_input: Option<String>,
flag_rename: Option<String>,
flag_comparand: String,
flag_replacement: String,
flag_prefer_dmy: bool,
flag_keep_zero_time: bool,
flag_formatstr: String,
flag_batch: u32,
flag_jobs: Option<usize>,
flag_new_column: Option<String>,
flag_output: Option<String>,
flag_no_headers: bool,
flag_delimiter: Option<Delimiter>,
flag_progressbar: bool,
}
static CENSOR: OnceCell<Censor> = OnceCell::new();
static LOCS: OnceCell<Locations> = OnceCell::new();
static GEOCODER: OnceCell<ReverseGeocoder> = OnceCell::new();
static EUDEX_COMPARAND_HASH: OnceCell<eudex::Hash> = OnceCell::new();
static REGEX_REPLACE: OnceCell<Regex> = OnceCell::new();
static SENTIMENT_ANALYZER: OnceCell<SentimentIntensityAnalyzer> = OnceCell::new();
static THOUSANDS_POLICY: OnceCell<SeparatorPolicy> = OnceCell::new();
static ROUND_PLACES: OnceCell<u32> = OnceCell::new();
static WHATLANG_CONFIDENCE_THRESHOLD: OnceCell<f64> = OnceCell::new();
const DEFAULT_THRESHOLD: f64 = 0.9;
const DEFAULT_ROUND_PLACES: u32 = 3;
static INDIANCOMMA_POLICY: SeparatorPolicy = SeparatorPolicy {
separator: ",",
groups: &[3, 2],
digits: thousands::digits::ASCII_DECIMAL,
};
enum ApplySubCmd {
Operations,
DateFmt,
DynFmt,
Geocode,
EmptyReplace,
CalcConv,
}
#[inline]
fn replace_column_value(
record: &csv::StringRecord,
column_index: usize,
new_value: &str,
) -> csv::StringRecord {
record
.into_iter()
.enumerate()
.map(|(i, v)| if i == column_index { new_value } else { v })
.collect()
}
pub fn run(argv: &[&str]) -> CliResult<()> {
let args: Args = util::get_args(USAGE, argv)?;
let rconfig = Config::new(&args.arg_input)
.delimiter(args.flag_delimiter)
.no_headers(args.flag_no_headers)
.checkutf8(false)
.select(args.arg_column);
let mut rdr = rconfig.reader()?;
let mut wtr = Config::new(&args.flag_output).writer()?;
let headers = rdr.byte_headers()?.clone();
let sel = rconfig.selection(&headers)?;
let column_index = *sel.iter().next().unwrap();
let mut headers = rdr.headers()?.clone();
if let Some(new_name) = args.flag_rename {
let new_col_names = util::ColumnNameParser::new(&new_name).parse()?;
if new_col_names.len() != sel.len() {
return fail!("Number of new columns does not match input column selection.");
}
for (i, col_index) in sel.iter().enumerate() {
headers = replace_column_value(&headers, *col_index, &new_col_names[i]);
}
}
if !rconfig.no_headers {
if let Some(new_column) = &args.flag_new_column {
headers.push_field(new_column);
}
wtr.write_record(&headers)?;
}
let mut dynfmt_fields = Vec::with_capacity(10); let mut dynfmt_template = args.flag_formatstr.clone();
if args.cmd_dynfmt || args.cmd_calcconv {
if args.flag_no_headers {
return fail!("dynfmt/calcconv subcommand requires headers.");
}
let (safe_headers, _) = util::safe_header_names(&headers, false, false, &Vec::new(), "");
let formatstr_re: &'static Regex = crate::regex_once_cell!(r"\{(?P<key>\w+)?\}");
for format_fields in formatstr_re.captures_iter(&args.flag_formatstr) {
dynfmt_fields.push(format_fields.name("key").unwrap().as_str());
}
dynfmt_fields.sort_unstable();
for (i, field) in safe_headers.into_iter().enumerate() {
if dynfmt_fields.binary_search(&field.as_str()).is_ok() {
let field_with_curly = format!("{{{field}}}");
let field_index = format!("{{{i}}}");
dynfmt_template = dynfmt_template.replace(&field_with_curly, &field_index);
}
}
debug!("dynfmt_fields: {dynfmt_fields:?} dynfmt_template: {dynfmt_template}");
}
let mut ops_vec: Vec<Operations> = Vec::new();
let apply_cmd = if args.cmd_operations {
match validate_operations(
&args.arg_operations.split(',').collect(),
&args.flag_comparand,
&args.flag_replacement,
&args.flag_new_column,
&args.flag_formatstr,
) {
Ok(operations_vec) => ops_vec = operations_vec,
Err(e) => return Err(e),
}
ApplySubCmd::Operations
} else if args.cmd_geocode {
ApplySubCmd::Geocode
} else if args.cmd_datefmt {
ApplySubCmd::DateFmt
} else if args.cmd_dynfmt {
ApplySubCmd::DynFmt
} else if args.cmd_emptyreplace {
ApplySubCmd::EmptyReplace
} else if args.cmd_calcconv {
ApplySubCmd::CalcConv
} else {
return fail!("Unknown apply subcommand.");
};
let show_progress =
(args.flag_progressbar || std::env::var("QSV_PROGRESSBAR").is_ok()) && !rconfig.is_stdin();
let progress = ProgressBar::with_draw_target(None, ProgressDrawTarget::stderr_with_hz(5));
if show_progress {
util::prep_progress(&progress, util::count_rows(&rconfig)?);
} else {
progress.set_draw_target(ProgressDrawTarget::hidden());
}
let prefer_dmy = args.flag_prefer_dmy || rconfig.get_dmy_preference();
#[allow(unused_assignments)]
let mut batch_record = csv::StringRecord::new();
let batchsize: usize = args.flag_batch as usize;
let mut batch = Vec::with_capacity(batchsize);
let mut batch_results = Vec::with_capacity(batchsize);
util::njobs(args.flag_jobs);
'batch_loop: loop {
for _ in 0..batchsize {
match rdr.read_record(&mut batch_record) {
Ok(has_data) => {
if has_data {
batch.push(batch_record.clone());
} else {
break;
}
}
Err(e) => {
return fail_clierror!("Error reading file: {e}");
}
}
}
if batch.is_empty() {
break 'batch_loop;
}
batch
.par_iter()
.map(|record_item| {
let mut record = record_item.clone();
match apply_cmd {
ApplySubCmd::Geocode => {
let mut cell = record[column_index].to_owned();
if !cell.is_empty() {
let search_result = search_cached(&cell, &args.flag_formatstr);
if let Some(geocoded_result) = search_result {
cell = geocoded_result;
}
}
if args.flag_new_column.is_some() {
record.push_field(&cell);
} else {
record = replace_column_value(&record, column_index, &cell);
}
}
ApplySubCmd::Operations => {
let mut cell = String::new();
for col_index in sel.iter() {
record[*col_index].clone_into(&mut cell);
apply_operations(
&ops_vec,
&mut cell,
&args.flag_comparand,
&args.flag_replacement,
&args.flag_formatstr,
);
if args.flag_new_column.is_some() {
record.push_field(&cell);
} else {
record = replace_column_value(&record, *col_index, &cell);
}
}
}
ApplySubCmd::EmptyReplace => {
let mut cell = record[column_index].to_owned();
if cell.trim().is_empty() {
cell = args.flag_replacement.clone();
}
if args.flag_new_column.is_some() {
record.push_field(&cell);
} else {
record = replace_column_value(&record, column_index, &cell);
}
}
ApplySubCmd::DateFmt => {
let mut cell = String::new();
for col_index in sel.iter() {
record[*col_index].clone_into(&mut cell);
if !cell.is_empty() {
let parsed_date = parse_with_preference(&cell, prefer_dmy);
if let Ok(format_date) = parsed_date {
let formatted_date =
format_date.format(&args.flag_formatstr).to_string();
if !args.flag_keep_zero_time
&& formatted_date.ends_with("T00:00:00+00:00")
{
cell = formatted_date[..10].to_string();
} else {
cell = formatted_date;
}
}
}
if args.flag_new_column.is_some() {
record.push_field(&cell);
} else {
record = replace_column_value(&record, *col_index, &cell);
}
}
}
ApplySubCmd::DynFmt => {
let mut cell = record[column_index].to_owned();
if !cell.is_empty() {
let mut record_vec: Vec<String> = Vec::with_capacity(record.len());
for field in &record {
record_vec.push(field.to_string());
}
if let Ok(formatted) =
dynfmt::SimpleCurlyFormat.format(&dynfmt_template, record_vec)
{
cell = formatted.to_string();
}
}
if args.flag_new_column.is_some() {
record.push_field(&cell);
} else {
record = replace_column_value(&record, column_index, &cell);
}
}
ApplySubCmd::CalcConv => {
let result = if record[column_index].is_empty() {
String::new()
} else {
let mut cell = record[column_index].to_owned();
let mut record_vec: Vec<String> = Vec::with_capacity(record.len());
for field in &record {
record_vec.push(field.to_string());
}
if let Ok(formatted) =
dynfmt::SimpleCurlyFormat.format(&dynfmt_template, record_vec)
{
cell = formatted.to_string();
}
let mut append_unit = false;
let cell_for_eval = if cell.ends_with("<UNIT>") {
append_unit = true;
cell.trim_end_matches("<UNIT>")
} else {
&cell
};
match eval(cell_for_eval, true, Unit::Celsius, false) {
Ok(answer) => {
if append_unit {
format!("{} {:?}", answer.value, answer.unit)
} else {
answer.value.to_string()
}
}
Err(e) => {
format!("ERROR: {e}")
}
}
};
if args.flag_new_column.is_some() {
record.push_field(&result);
} else {
record = replace_column_value(&record, column_index, &result);
}
}
}
record
})
.collect_into_vec(&mut batch_results);
for result_record in &batch_results {
wtr.write_record(result_record)?;
}
if show_progress {
progress.inc(batch.len() as u64);
}
batch.clear();
}
if show_progress {
if args.cmd_geocode {
util::update_cache_info!(progress, SEARCH_CACHED);
}
util::finish_progress(&progress);
}
Ok(wtr.flush()?)
}
fn validate_operations(
operations: &Vec<&str>,
flag_comparand: &str,
flag_replacement: &str,
flag_new_column: &Option<String>,
flag_formatstr: &str,
) -> Result<Vec<Operations>, CliError> {
let mut censor_invokes = 0_u8;
let mut copy_invokes = 0_u8;
let mut eudex_invokes = 0_u8;
let mut regex_replace_invokes = 0_u8;
let mut replace_invokes = 0_u8;
let mut sentiment_invokes = 0_u8;
let mut sim_invokes = 0_u8;
let mut strip_invokes = 0_u8;
let mut whatlang_invokes = 0_u8;
let mut ops_vec: Vec<Operations> = Vec::with_capacity(operations.len());
for op in operations {
let Ok(operation) = Operations::from_str(op) else {
return fail_clierror!("Unknown '{op}' operation");
};
match operation {
Operations::Censor | Operations::Censor_Check | Operations::Censor_Count => {
if flag_new_column.is_none() {
return fail!("--new_column (-c) is required for censor operations.");
}
if censor_invokes == 0
&& CENSOR
.set({
let mut censored_words = Censor::Standard + Zealous + Sex;
for word in flag_comparand.split(',') {
censored_words += word.trim();
}
censored_words
})
.is_err()
{
return fail!("Cannot initialize Censor engine.");
}
censor_invokes = censor_invokes.saturating_add(1);
}
Operations::Copy => {
if flag_new_column.is_none() {
return fail!("--new_column (-c) is required for copy operation.");
}
copy_invokes = copy_invokes.saturating_add(1);
}
Operations::Eudex => {
if flag_comparand.is_empty() || flag_new_column.is_none() {
return fail!("--comparand (-C) and --new_column (-c) is required for eudex.");
}
if eudex_invokes == 0
&& EUDEX_COMPARAND_HASH
.set(eudex::Hash::new(flag_comparand))
.is_err()
{
return fail!("Cannot initialize Eudex.");
}
eudex_invokes = eudex_invokes.saturating_add(1);
}
Operations::Mtrim | Operations::Mltrim | Operations::Mrtrim => {
if flag_comparand.is_empty() {
return fail!("--comparand (-C) is required for match trim operations.");
}
}
Operations::Regex_Replace => {
if flag_comparand.is_empty() || flag_replacement.is_empty() {
return fail!(
"--comparand (-C) and --replacement (-R) are required for regex_replace \
operation."
);
}
if regex_replace_invokes == 0 {
let re = match regex::Regex::new(flag_comparand) {
Ok(re) => re,
Err(err) => {
return fail_clierror!("regex_replace expression error: {err:?}");
}
};
let _ = REGEX_REPLACE.set(re);
}
regex_replace_invokes = regex_replace_invokes.saturating_add(1);
}
Operations::Replace => {
if flag_comparand.is_empty() || flag_replacement.is_empty() {
return fail!(
"--comparand (-C) and --replacement (-R) are required for replace \
operation."
);
}
replace_invokes = replace_invokes.saturating_add(1);
}
Operations::Sentiment => {
if flag_new_column.is_none() {
return fail!("--new_column (-c) is required for sentiment operation.");
}
sentiment_invokes = sentiment_invokes.saturating_add(1);
}
Operations::Simdl
| Operations::Simdln
| Operations::Simjw
| Operations::Simsd
| Operations::Simhm
| Operations::Simod => {
if flag_comparand.is_empty() || flag_new_column.is_none() {
return fail!(
"--comparand (-C) and --new_column (-c) is required for similarity \
operations."
);
}
sim_invokes = sim_invokes.saturating_add(1);
}
Operations::Strip_Prefix | Operations::Strip_Suffix => {
if flag_comparand.is_empty() {
return fail!("--comparand (-C) is required for strip operations.");
}
strip_invokes = strip_invokes.saturating_add(1);
}
Operations::Thousands => {
let separator_policy = match flag_formatstr {
"dot" => policies::DOT_SEPARATOR,
"space" => policies::SPACE_SEPARATOR,
"underscore" => policies::UNDERSCORE_SEPARATOR,
"hexfour" => policies::HEX_FOUR,
"indiancomma" => INDIANCOMMA_POLICY,
_ => policies::COMMA_SEPARATOR,
};
if THOUSANDS_POLICY.set(separator_policy).is_err() {
return fail!("Cannot initialize Thousands policy.");
};
}
Operations::Round => {
if ROUND_PLACES
.set(
flag_formatstr
.parse::<u32>()
.unwrap_or(DEFAULT_ROUND_PLACES),
)
.is_err()
{
return fail!("Cannot initialize Round precision.");
};
}
Operations::Whatlang => {
if flag_new_column.is_none() {
return fail!("--new_column (-c) is required for whatlang language detection.");
}
if whatlang_invokes == 0
&& WHATLANG_CONFIDENCE_THRESHOLD
.set(if flag_comparand.is_empty() {
DEFAULT_THRESHOLD
} else {
let preparsed_threshold;
let show_confidence = if flag_comparand.ends_with('?') {
preparsed_threshold = flag_comparand.trim_end_matches('?');
true
} else {
preparsed_threshold = flag_comparand;
false
};
let desired_threshold = preparsed_threshold
.parse::<f64>()
.unwrap_or(DEFAULT_THRESHOLD);
let final_threshold = if (0.0..=1.0).contains(&desired_threshold) {
desired_threshold
} else {
DEFAULT_THRESHOLD
};
if show_confidence {
final_threshold * -1.0
} else {
final_threshold
}
})
.is_err()
{
return fail!("cannot initialize WhatLang language detection.");
}
whatlang_invokes = whatlang_invokes.saturating_add(1);
}
_ => {}
}
ops_vec.push(operation);
}
if censor_invokes > 1
|| copy_invokes > 1
|| eudex_invokes > 1
|| regex_replace_invokes > 1
|| replace_invokes > 1
|| sentiment_invokes > 1
|| sim_invokes > 1
|| strip_invokes > 1
|| whatlang_invokes > 1
{
return fail_clierror!(
"you can only use censor({censor_invokes}), copy({copy_invokes}), \
eudex({eudex_invokes}), regex_replace({regex_replace_invokes}), \
replace({replace_invokes}), sentiment({sentiment_invokes}), \
similarity({sim_invokes}), strip({strip_invokes}), and whatlang({whatlang_invokes}) \
ONCE per operation series."
);
};
Ok(ops_vec) }
#[inline]
fn apply_operations(
ops_vec: &Vec<Operations>,
cell: &mut String,
comparand: &str,
replacement: &str,
formatstr: &str,
) {
for op in ops_vec {
match op {
Operations::Len => {
*cell = cell.len().to_string();
}
Operations::Lower => {
*cell = cell.to_lowercase();
}
Operations::Upper => {
*cell = cell.to_uppercase();
}
Operations::Squeeze => {
let squeezer: &'static Regex = regex_once_cell!(r"\s+");
*cell = squeezer.replace_all(cell, " ").to_string();
}
Operations::Squeeze0 => {
let squeezer: &'static Regex = regex_once_cell!(r"\s+");
*cell = squeezer.replace_all(cell, "").to_string();
}
Operations::Trim => {
*cell = String::from(cell.trim());
}
Operations::Ltrim => {
*cell = String::from(cell.trim_start());
}
Operations::Rtrim => {
*cell = String::from(cell.trim_end());
}
Operations::Mtrim => {
let chars_to_trim: &[char] = &comparand.chars().collect::<Vec<_>>();
*cell = String::from(cell.trim_matches(chars_to_trim));
}
Operations::Mltrim => {
*cell = String::from(cell.trim_start_matches(comparand));
}
Operations::Mrtrim => {
*cell = String::from(cell.trim_end_matches(comparand));
}
Operations::Encode => {
*cell = BASE64.encode(cell.as_bytes());
}
Operations::Decode => {
let mut output = vec![0; BASE64.decode_len(cell.len()).unwrap()];
*cell = match BASE64.decode_mut(cell.as_bytes(), &mut output) {
Ok(len) => String::from_utf8(output[0..len].to_vec()).unwrap(),
Err(e) => format!("decoding error: {e:?}"),
};
}
Operations::Escape => {
*cell = cell.escape_default().to_string();
}
Operations::Strip_Prefix => {
if let Some(stripped) = cell.strip_prefix(comparand) {
*cell = String::from(stripped);
}
}
Operations::Strip_Suffix => {
if let Some(stripped) = cell.strip_suffix(comparand) {
*cell = String::from(stripped);
}
}
Operations::Titlecase => {
*cell = titlecase(cell);
}
Operations::Replace => {
*cell = cell.replace(comparand, replacement);
}
Operations::Regex_Replace => {
let regexreplace = REGEX_REPLACE.get().unwrap();
*cell = regexreplace.replace_all(cell, replacement).to_string();
}
Operations::Censor => {
let censor = CENSOR.get().unwrap();
*cell = censor.censor(cell);
}
Operations::Censor_Check => {
let censor = CENSOR.get().unwrap();
*cell = censor.check(cell).to_string();
}
Operations::Censor_Count => {
let censor = CENSOR.get().unwrap();
*cell = censor.count(cell).to_string();
}
Operations::Thousands => {
if let Ok(num) = cell.parse::<f64>() {
let mut temp_string = num.separate_by_policy(*THOUSANDS_POLICY.get().unwrap());
if num.fract() > 0.0 {
*cell = if replacement.is_empty() {
temp_string
} else {
match temp_string.rfind('.') {
Some(last_dot) => {
temp_string.replace_range(last_dot..=last_dot, replacement);
temp_string
}
None => temp_string,
}
};
} else {
*cell = temp_string;
}
}
}
Operations::Round => {
if let Ok(num) = cell.parse::<f64>() {
*cell = util::round_num(num, *ROUND_PLACES.get().unwrap());
}
}
Operations::Currencytonum => {
let fract_3digits: &'static Regex = regex_once_cell!(r"\.\d\d\d$");
let cell_val = if fract_3digits.is_match(cell) {
format!("{cell}0")
} else {
cell.clone()
};
if let Ok(currency_val) = Currency::from_str(&cell_val) {
let currency_coins = currency_val.value();
let coins = format!("{:03}", ¤cy_coins);
let coinlen = coins.len();
if coinlen > 2 && coins != "000" {
let decpoint = coinlen - 2;
let coin_num = &coins[..decpoint];
let coin_frac = &coins[decpoint..];
*cell = format!("{coin_num}.{coin_frac}");
}
}
}
Operations::Numtocurrency => {
let fract_3digits2: &'static Regex = regex_once_cell!(r"\.\d\d\d$");
let cell_val = if fract_3digits2.is_match(cell) {
format!("{cell}0")
} else {
cell.clone()
};
if let Ok(currency_value) = Currency::from_str(&cell_val) {
let currency_wrk = currency_value
.convert(replacement.parse::<f64>().unwrap_or(1.0_f64), comparand);
*cell = if formatstr.contains("euro") {
format!("{currency_wrk:e}")
} else {
format!("{currency_wrk}")
};
}
}
Operations::Simdl => {
*cell = damerau_levenshtein(cell, comparand).to_string();
}
Operations::Simdln => {
*cell = normalized_damerau_levenshtein(cell, comparand).to_string();
}
Operations::Simjw => {
*cell = jaro_winkler(cell, comparand).to_string();
}
Operations::Simsd => {
*cell = sorensen_dice(cell, comparand).to_string();
}
Operations::Simhm => {
let ham_val = hamming(cell, comparand);
match ham_val {
Ok(val) => *cell = val.to_string(),
Err(_) => *cell = String::from("ERROR: Different lengths"),
}
}
Operations::Simod => *cell = osa_distance(cell, comparand).to_string(),
Operations::Eudex => {
let eudex_comparand_hash = EUDEX_COMPARAND_HASH.get().unwrap();
let cell_hash = Hash::new(cell);
*cell = format!("{}", (cell_hash - *eudex_comparand_hash).similar());
}
Operations::Sentiment => {
let sentiment_analyzer = SENTIMENT_ANALYZER
.get_or_init(vader_sentiment::SentimentIntensityAnalyzer::new);
let sentiment_scores = sentiment_analyzer.polarity_scores(cell);
*cell = sentiment_scores.get("compound").unwrap_or(&0.0).to_string();
}
Operations::Whatlang => {
let lang_info = detect(cell);
if let Some(lang_info) = lang_info {
let whatlang_confidence_threshold =
*WHATLANG_CONFIDENCE_THRESHOLD.get().unwrap();
let lang_confidence = lang_info.confidence();
let lang = lang_info.lang();
if lang_confidence >= whatlang_confidence_threshold.abs() {
if whatlang_confidence_threshold >= 0.0 {
*cell = format!("{lang:?}");
} else {
*cell = format!("{lang:?}({lang_confidence:.3})");
}
} else {
*cell = format!("{lang:?}({lang_confidence:.3})?");
}
}
}
Operations::Copy => {} }
}
}
#[cached(
key = "String",
convert = r#"{ format!("{}", cell) }"#,
option = true,
sync_writes = false
)]
fn search_cached(cell: &str, formatstr: &str) -> Option<String> {
let geocoder =
GEOCODER.get_or_init(|| ReverseGeocoder::new(LOCS.get_or_init(Locations::from_memory)));
let locregex: &'static Regex = regex_once_cell!(
r"(?-u)([+-]?[0-9]+\.?[0-9]*|\.[0-9]+),\s*([+-]?[0-9]+\.?[0-9]*|\.[0-9]+)"
);
let loccaps = locregex.captures(cell);
loccaps.and_then(|loccaps| {
let lat = loccaps[1].to_string().parse::<f64>().unwrap_or_default();
let long = loccaps[2].to_string().parse::<f64>().unwrap_or_default();
if (-90.0..=90.0).contains(&lat) && (-180.0..=180.0).contains(&long) {
let search_result = geocoder.search((lat, long));
search_result.map(|locdetails| {
#[allow(clippy::match_same_arms)]
match formatstr {
"%+" | "city-state" => format!(
"{name}, {admin1}",
name = locdetails.record.name,
admin1 = locdetails.record.admin1,
),
"city-country" => format!(
"{name}, {cc}",
name = locdetails.record.name,
cc = locdetails.record.cc
),
"city-state-country" | "city-admin1-country" => format!(
"{name}, {admin1} {cc}",
name = locdetails.record.name,
admin1 = locdetails.record.admin1,
cc = locdetails.record.cc
),
"city" => locdetails.record.name.to_string(),
"county" | "admin2" => locdetails.record.admin2.to_string(),
"state" | "admin1" => locdetails.record.admin1.to_string(),
"county-country" | "admin2-country" => format!(
"{admin2}, {cc}",
admin2 = locdetails.record.admin2,
cc = locdetails.record.cc
),
"county-state-country" | "admin2-admin1-country" => format!(
"{admin2}, {admin1} {cc}",
admin2 = locdetails.record.admin2,
admin1 = locdetails.record.admin1,
cc = locdetails.record.cc
),
"country" => locdetails.record.cc.to_string(),
_ => locdetails.record.name.to_string(),
}
})
} else {
None
}
})
}