Transforms

Fields

Input and output files/objects (also referred to as sources and sinks) are made of rows and columns, also called tuples and fields.

A tuple has a set of fields, and a field has an optional type (and any associated metadata).

Data files, or objects, have paths and names. Field values can be parsed from the paths and embedded in the tuple stream as fields. This is common when data has been partitioned into files and common values (like month and/or day) are embedded in the path name to help select relevant files (many query engines apply pushdown predicates to path values).
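As a rough illustration of parsing field values out of a path, the sketch below assumes a key=value directory layout (for example month=07/day=20). That layout, the example URI, and the helper name partition_fields are assumptions made for illustration only; they are not taken from this document.

```python
from urllib.parse import urlparse


def partition_fields(uri: str) -> dict:
    """Collect key=value path segments as fields (hypothetical helper).

    Assumes partitions are encoded as key=value directory names; other
    layouts would need a different parser.
    """
    fields = {}
    for segment in urlparse(uri).path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:
            fields[key] = value
    return fields


# Hypothetical URI, used only to show the shape of the result.
print(partition_fields("s3://bucket/logs/month=07/day=20/part-0001.csv"))
```

The values come back as strings; any typing would be applied afterwards, in the same way as for other declared fields.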

Declared fields in a pipeline have the following format: <field_name>|<field_type>, where <field_name> is a string or an ordinal (a number representing the position). The <field_name> may be quoted with single quotes (').

<field_type> is optional, depending on the use. <field_type> may further be formatted as <type>|<metadata>.
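The declaration format can be sketched as a small parser. This is a minimal illustration, not the actual implementation: the names FieldDecl and parse_field are hypothetical, and it assumes a quoted name does not itself contain a | character and that a bare run of digits is an ordinal.

```python
from typing import NamedTuple, Optional, Union


class FieldDecl(NamedTuple):
    name: Union[str, int]      # field name, or ordinal position
    type: Optional[str]        # e.g. "Double"; None when omitted
    metadata: Optional[str]    # e.g. "yyyyMMdd"; None when omitted


def parse_field(decl: str) -> FieldDecl:
    """Split '<field_name>|<type>|<metadata>' into its parts (hypothetical helper)."""
    parts = decl.split("|", 2)
    raw = parts[0].strip()
    name: Union[str, int]
    if len(raw) >= 2 and raw[0] == "'" and raw[-1] == "'":
        name = raw[1:-1]                   # quoted names always stay strings
    elif raw.isascii() and raw.isdigit():
        name = int(raw)                    # assumption: bare digits are an ordinal
    else:
        name = raw
    ftype = parts[1].strip() if len(parts) > 1 else ""
    meta = parts[2].strip() if len(parts) > 2 else ""
    return FieldDecl(name, ftype or None, meta or None)


print(parse_field("time|DateTime|yyyyMMdd"))
print(parse_field("'first name'|String"))
```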

The actual supported types and associated metadata are described in Types.

Transforms

Transforms manipulate the tuple stream either by removing (filtering) a given tuple from the stream or by changing the values in any given tuple in the stream.

Filters

A tuple can be retained in a stream if a given predicate expression returns true for any of the filter’s arguments.

Expressions

If the expression is a regular expression, the expression will be matched with every argument individually after being coerced into a String. If any value (after coercion) is null, an empty string ("") will be passed to the regular expression matcher.

All values must match to retain the tuple.

One argument: from_field1 ~/expression/
Many arguments: from_field1 + from_field2 + from_fieldN ~/expression/

Note there is no operator after the ~/expression/; this indicates the statement is a filter.

Note a / can be escaped with // in any expression.
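The regular expression rules above (coerce each argument to a string, substitute an empty string for null, require every argument to match) can be sketched as follows. The function name retain is hypothetical, and using a "find anywhere" match (re.search) rather than a full-string match is an assumption; the document does not say which is used.

```python
import re
from typing import Any, Iterable


def retain(values: Iterable[Any], pattern: str) -> bool:
    """Return True when every argument value matches the pattern."""
    regex = re.compile(pattern)
    # None is passed to the matcher as "", everything else via str().
    return all(
        regex.search("" if value is None else str(value)) is not None
        for value in values
    )


print(retain(["2023-07", "2023-08"], r"^2023-"))
print(retain(["2023-07", None], r"^2023-"))
```

Anchoring the pattern with ^ and $ removes the difference between the two matching styles.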

Operators

Insert literal: Insert a literal value into a field.
Coerce field: Transform a field, in every tuple.
Copy field: Copy a field value to a new field.
Rename field: Rename a field, optionally coercing its type.
Discard field: Remove a field.
Apply function: Apply intrinsic functions against one or more fields.

There are three transform operators:

=>

Assign a literal value to a new field.

Format

literal => new_field|type
=> new_field|type # insert null into the field

+>

Retain the input field, and assign the result value to a new field.

Format: field +> new_field|type

->

Discard the input fields, and assign the result value to a new field.

Format: field -> new_field|type

For example:

US => country|String - assigns the value US to the field country as a string.
0.5 => ratio|Double - assigns the value 0.5 to the field ratio as a double.
1689820455 => time|DateTime|yyyyMMdd - converts the long value to a date time using the format yyyyMMdd and assigns the result to the field time.
ratio +> ratio|Double - Coerces the string field "ratio" to a double, null ok.
ratio|Double - Same as above, coerces the string field "ratio" to a double, null ok.
name +> firstName|String - assigns the value of the field "name" to the field "firstName" as a string. The field name is retained.
name -> firstName|String - assigns the value of the field "name" to the field "firstName" as a string. The field name is discarded (dropped from the tuple stream).
password -> - discards the field password from the tuple stream.
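To make the difference between the three operators concrete, here is a minimal sketch that models a tuple as a Python dict. The function names are hypothetical, Python callables such as float stand in for the declared types, and the position of new fields within the tuple is not modelled.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]
Coercer = Optional[Callable[[Any], Any]]


def _coerce(value: Any, to_type: Coercer) -> Any:
    # Nulls pass through untouched ("null ok"); otherwise apply the type, if any.
    if value is None or to_type is None:
        return value
    return to_type(value)


def assign_literal(row: Row, literal: Any, new_field: str, to_type: Coercer = None) -> Row:
    """literal => new_field|type"""
    out = dict(row)
    out[new_field] = _coerce(literal, to_type)
    return out


def retain_and_assign(row: Row, field: str, new_field: str, to_type: Coercer = None) -> Row:
    """field +> new_field|type (the input field is kept)"""
    out = dict(row)
    out[new_field] = _coerce(row[field], to_type)
    return out


def discard_and_assign(row: Row, field: str, new_field: Optional[str] = None,
                       to_type: Coercer = None) -> Row:
    """field -> new_field|type (the input field is dropped)"""
    out = dict(row)
    value = out.pop(field)
    if new_field is not None:          # "password ->" has no result field
        out[new_field] = _coerce(value, to_type)
    return out


row = {"name": "Ada", "ratio": "0.5", "password": "secret"}
print(retain_and_assign(row, "ratio", "ratio", float))
print(discard_and_assign(row, "name", "firstName", str))
print(discard_and_assign(row, "password"))
```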

Expressions

Expressions are applied to incoming fields and the results are assigned to a new field. Expressions can have zero or more field arguments.

Many more expression types are planned, including native support for regular expressions and JSON paths.

Currently, only intrinsic functions are supported. Intrinsic functions are built-in functions with optional parameters.

No arguments: ^intrinsic{} +> new_field|type // creates a new value with this field name
No arguments, with parameters: ^intrinsic{param1:value1, param2:value2} +> new_field|type
No arguments and no declared results: ^intrinsic{} -> // replace ALL fields with the results
With arguments and no declared results: from_field1 + from_field2 ^intrinsic{} -> // applies only the arguments and replaces them with the results
With arguments: from_field1 + from_field2 + from_fieldN ^intrinsic{} +> new_field|type // appends the new value to the tuple
With arguments, with parameters: from_field1 + from_field2 + from_fieldN ^intrinsic{param1:value1, param2:value2} +> new_field|type

Expressions may retain or discard the argument fields depending on the operator used.

Intrinsic Functions

Many more functions are planned.

Built-in functions on fields can be applied to one or more fields in every tuple in the tuple stream.

ensureFields

Add any missing fields/columns. Use this when there are many files with overlapping but not consistent field names and the files need to be normalized so they can be processed as a single unit. Note any fields not declared as the result fields will be discarded.

Def

^ensureFields{} -> to_field1 + to_field2 - Ensure the results have all declared fields
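The effect described for ensureFields can be sketched on a dict-based tuple. This is an illustration of the described behavior only; the function name ensure_fields is hypothetical, and filling missing fields with null (None) is an assumption.

```python
from typing import Any, Dict, List


def ensure_fields(row: Dict[str, Any], declared: List[str]) -> Dict[str, Any]:
    """Keep only the declared fields, adding any that are missing as None."""
    return {name: row.get(name) for name in declared}


# Two rows from files with overlapping but inconsistent columns.
print(ensure_fields({"id": 1, "city": "Oslo"}, ["id", "city", "zip"]))
print(ensure_fields({"id": 2, "zip": "0150", "extra": True}, ["id", "city", "zip"]))
```

After normalization both rows carry the same three fields, so they can be processed as a single unit.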

fixedWidth

Pads a row/tuple to a fixed width by inserting nulls at a given index. Ensure type information is declared for result fields if required.

Def

^fixedWidth{width:…,insertAt:…} -> - replace ALL fields with the fixed width result
^fixedWidth{width:…,insertAt:…} -> new_field|type + new_field2|type + etc - name the new fields
^fixedWidth{insertAt:…} -> new_field|type + new_field2|type + etc - replace all fields with the new field names

Params

width: The width of the row/tuple, defaults to size of the result fields.
insertAt: The index to begin inserting the null padding, defaults to -1 (last element).
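A minimal sketch of the padding behavior, modelling a tuple as a list of values. The function name fixed_width is hypothetical. Two points are assumptions: insertAt of -1 is read as "append after the last element", and a row that is already at least as wide as width is returned unchanged.

```python
from typing import Any, List


def fixed_width(values: List[Any], width: int, insert_at: int = -1) -> List[Any]:
    """Pad values with None up to width, starting at insert_at."""
    missing = width - len(values)
    if missing <= 0:
        return list(values)            # assumption: never truncate
    index = len(values) if insert_at == -1 else insert_at
    return values[:index] + [None] * missing + values[index:]


print(fixed_width(["a", "b"], 4))
print(fixed_width(["a", "b"], 4, insert_at=1))
```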

fromJson

Converts a JSON string to a row/tuple.

Def

^fromJson{} -> to_field1 + to_field2 - the field names must match the JSON properties at the root node
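The mapping from root-level JSON properties to declared fields can be sketched as below. The function name from_json is hypothetical, and returning None for a declared field that is absent from the JSON is an assumption.

```python
import json
from typing import Any, Dict, List


def from_json(text: str, declared: List[str]) -> Dict[str, Any]:
    """Parse a JSON object and pick the declared root-level properties."""
    root = json.loads(text)
    return {name: root.get(name) for name in declared}


print(from_json('{"id": 7, "city": "Oslo", "tags": ["a"]}', ["id", "city"]))
```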

formatFields

Reformats all the argument fields to a new format. This is especially useful for remaining compatible with Apache Parquet.

Def

^formatFields{format:…} -> - replace ALL fields with the formatted result
^formatFields{format:…} +> - append ALL fields with the formatted result

Params

format

The format string.

lowerUnderscore - converts to lower case and replaces /\ .- with underscores. The default.
upperUnderscore - converts to upper case and replaces /\ .- with underscores
camelCase - converts to camel case
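The two underscore formats are specified precisely enough to sketch: change the case and replace each of the characters / \ space . - with an underscore. The function names below are hypothetical, and camelCase is left out because its exact rules are not described here.

```python
import re

# The separator characters listed above: / \ space . -
_SEPARATORS = re.compile(r"[/\\ .\-]")


def lower_underscore(name: str) -> str:
    """Lower-case the name and replace separators with underscores."""
    return _SEPARATORS.sub("_", name).lower()


def upper_underscore(name: str) -> str:
    """Upper-case the name and replace separators with underscores."""
    return _SEPARATORS.sub("_", name).upper()


print(lower_underscore("First Name"))
print(upper_underscore("order-id"))
```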

sourcePath

Add the URI of the data currently being processed.

Def

^sourcePath{} +> to_field - Assign the current URI to to_field

toJson

Converts a row/tuple to a JSON string.

Def

^toJson{} -> - replace ALL fields with the JSON string with the default field name json
from_field1 + from_field2 ^toJson{} +> node - add the arguments to a new json object named node
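Both forms can be sketched on a dict-based tuple. The function name to_json is hypothetical, and the exact JSON text produced (key order, whitespace) is an assumption of this sketch rather than documented behavior.

```python
import json
from typing import Any, Dict, List, Optional


def to_json(row: Dict[str, Any], args: Optional[List[str]] = None) -> str:
    """Serialize all fields, or only the argument fields, as a JSON object."""
    selected = row if args is None else {name: row[name] for name in args}
    return json.dumps(selected)


row = {"id": 7, "city": "Oslo", "zip": "0150"}
# ^toJson{} -> : every field is replaced by a single field named json
print({"json": to_json(row)})
# from_field1 + from_field2 ^toJson{} +> node : the arguments are kept, node is added
print({**row, "node": to_json(row, ["city", "zip"])})
```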

trimToNull

Convert all arguments to null if the string representation of the value is an empty string or only contains whitespace.

Def

from_field1 + from_field2 ^trimToNull{} -> - Convert any whitespace values to null while retaining the field names
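A minimal sketch of the rule, again modelling a tuple as a dict. The function name trim_to_null is hypothetical; non-blank values are left exactly as they were, which is an assumption.

```python
from typing import Any, Dict, List


def trim_to_null(row: Dict[str, Any], args: List[str]) -> Dict[str, Any]:
    """Set argument fields to None when their string form is empty or blank."""
    out = dict(row)
    for name in args:
        value = out[name]
        if value is not None and str(value).strip() == "":
            out[name] = None
    return out


print(trim_to_null({"a": "  ", "b": "", "c": " x "}, ["a", "b", "c"]))
```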

tsid

Create a unique id as a long or string (using https://github.com/f4b6a3/tsid-creator).

Def

^tsid{node:…,nodeCount:…,epoch:…,format:…,counterToZero:…} +> intoField|type

type: must be string or long, defaults to long. When string, the format is honored.

Params

node

The node id, defaults to a random int.

If a string is provided, it is hashed to an int.
SIP_HASHER.hashString(s, StandardCharsets.UTF_8).asInt() % nodeCount;

nodeCount

The number of nodes, defaults to 1024.

epoch

The epoch, defaults to Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli()
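For reference, the default epoch expression evaluates to a fixed number of milliseconds since the Unix epoch. The sketch below reproduces that value in Python; the helper millis_since_custom_epoch is hypothetical and only illustrates how a timestamp relative to a custom epoch could be derived.

```python
from datetime import datetime, timezone

# Equivalent of Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli()
DEFAULT_EPOCH_MILLIS = int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def millis_since_custom_epoch(unix_millis: int, epoch_millis: int = DEFAULT_EPOCH_MILLIS) -> int:
    """Milliseconds elapsed since the custom epoch (hypothetical helper)."""
    return unix_millis - epoch_millis


print(DEFAULT_EPOCH_MILLIS)  # 1577836800000
```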

format

The format, defaults to null. Example: K%S where %S is a placeholder.

Placeholders:

%S: canonical string in upper case
%s: canonical string in lower case
%X: hexadecimal in upper case
%x: hexadecimal in lower case
%d: base-10
%z: base-62

counterToZero

Resets the counter portion when the millisecond changes, defaults to false.