Transforms
Fields
Input and output files/objects (also referred to as sources and sinks) are made up of rows and columns, also referred to as tuples and fields.
A tuple has a set of fields, and a field has an optional type (and any associated metadata).
Data files, or objects, have paths and names. Field values can be parsed from the paths and embedded in the tuple stream as fields. This is common when data has been partitioned into files and common values (like month and/or day) are embedded in the path name to help select relevant files (many query engines apply pushdown predicates to path values).
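As a rough illustration of parsing field values from a path, the following Python sketch collects partition values from a URI. It assumes Hive-style key=value path segments, which the text above does not specify; the function name partition_values and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def partition_values(uri: str) -> dict:
    """Collect key=value path segments from a URI (assumes Hive-style partitioning)."""
    values = {}
    for segment in urlparse(uri).path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            # Only segments of the form key=value contribute a field.
            values[key] = value
    return values
```

The resulting dictionary could then be merged into every tuple read from that file.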
Declared fields in a pipeline have the following format:
<field_name>|<field_type>, where <field_name> is a string or an ordinal (a number representing the field's position).
The <field_name> may be enclosed in single quotes (').
<field_type> is optional, depending on the use. <field_type> may be further formatted as <type>|<metadata>.
The actual supported types and associated metadata are described in Types.
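To make the declaration format concrete, here is a minimal Python sketch that splits a declaration into its name (or ordinal), type, and metadata. The Field class and parse_field function are hypothetical helpers, not part of the tool, and the sketch assumes the name itself contains no | character.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Field:
    name: Union[str, int]            # field name, or ordinal position
    type: Optional[str] = None       # optional <type>
    metadata: Optional[str] = None   # optional <metadata>


def parse_field(declaration: str) -> Field:
    """Split '<field_name>|<type>|<metadata>' into its parts (hypothetical helper)."""
    parts = declaration.strip().split("|", 2)
    raw_name = parts[0].strip()
    if len(raw_name) >= 2 and raw_name[0] == "'" and raw_name[-1] == "'":
        name = raw_name[1:-1]        # quoted names are always treated as strings
    elif raw_name.isdigit():
        name = int(raw_name)         # bare digits are treated as an ordinal
    else:
        name = raw_name
    field_type = parts[1].strip() if len(parts) > 1 and parts[1].strip() else None
    metadata = parts[2].strip() if len(parts) > 2 and parts[2].strip() else None
    return Field(name, field_type, metadata)
```

For instance, parse_field("time|DateTime|yyyyMMdd") yields a name of time, a type of DateTime, and yyyyMMdd as metadata.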
Transforms
Transforms manipulate the tuple stream either by removing (filtering) a given tuple from the stream or by changing the values in any given tuple in the stream.
Filters
A tuple is retained in the stream if the given predicate expression returns true for the filter’s arguments.
Expressions
If the expression is a regular expression, the expression will be matched with every argument individually after being coerced into a String.
If any value (after coercion) is null, an empty string ("") will be passed to the regular expression matcher.
All values must match to retain the tuple.
- One argument
-
from_field1 ~/expression/
- Many arguments
-
from_field1 + from_field2 + from_fieldN ~/expression/
Note that there is no operator after the ~/expression/; this indicates that the statement is a filter.
Note that a / can be escaped with // in any expression.
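The filter semantics described above (coerce each argument to a string, pass nulls as an empty string, retain only if every argument matches) can be sketched in Python as follows. The retain function is hypothetical, and the use of a partial match (re.search) rather than a full match is an assumption; the text does not say which is used.

```python
import re


def retain(tuple_values: dict, arguments: list, expression: str) -> bool:
    """Return True when every argument value matches the regular expression."""
    pattern = re.compile(expression)
    for name in arguments:
        value = tuple_values.get(name)
        text = "" if value is None else str(value)   # null is passed as ""
        if pattern.search(text) is None:              # assumption: partial match
            return False                              # one miss drops the tuple
    return True
```

Because nulls become empty strings, an expression such as ^$ can be used to retain tuples whose argument is null or empty.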
Operators
- Insert literal
-
Insert a literal value into a field.
- Coerce field
-
Transform a field, in every tuple.
- Copy field
-
Copy a field value to a new field.
- Rename field
-
Rename a field, optionally coercing its type.
- Discard field
-
Remove a field.
- Apply function
-
Apply intrinsic functions against one or more fields.
There are three transform operators:
=>
Assign a literal value to a new field.
- Format
-
literal => new_field|type
=> new_field|type # insert null into the field
+>
Retain the input field, and assign the result value to a new field.
- Format
-
field +> new_field|type
->
Discard the input fields, and assign the result value to a new field.
- Format
-
field -> new_field|type
For example:
- US => country|String - assigns the value US to the field country as a string.
- 0.5 => ratio|Double - assigns the value 0.5 to the field ratio as a double.
- 1689820455 => time|DateTime|yyyyMMdd - converts the long value to a date time using the format yyyyMMdd and assigns the result to the field time.
- ratio +> ratio|Double - coerces the string field "ratio" to a double, null ok.
- ratio|Double - same as above, coerces the string field "ratio" to a double, null ok.
- name +> firstName|String - assigns the value of the field "name" to the field "firstName" as a string. The field name is retained.
- name -> firstName|String - assigns the value of the field "name" to the field "firstName" as a string. The field name is discarded (dropped from the tuple stream).
- password -> - discards the field password from the tuple stream.
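The behavior of the three operators can be modeled on a dictionary-based tuple, as in the Python sketch below. The function names and the coerce callback are hypothetical; the sketch only mirrors the retain/discard semantics described above and does not reflect how the tool is implemented.

```python
def assign_literal(row: dict, literal, new_field: str, coerce=None) -> dict:
    """literal => new_field|type : add a literal; a missing literal becomes None."""
    value = literal if literal is None or coerce is None else coerce(literal)
    return {**row, new_field: value}


def copy_field(row: dict, field: str, new_field: str, coerce=None) -> dict:
    """field +> new_field|type : keep the input field and add the result."""
    value = row[field]
    if value is not None and coerce is not None:
        value = coerce(value)          # nulls pass through uncoerced
    return {**row, new_field: value}


def move_field(row: dict, field: str, new_field=None, coerce=None) -> dict:
    """field -> new_field|type : drop the input field; with no target, just discard."""
    out = {k: v for k, v in row.items() if k != field}
    if new_field is not None:
        value = row[field]
        if value is not None and coerce is not None:
            value = coerce(value)
        out[new_field] = value
    return out
```

With row = {"name": "Ada", "password": "x"}, move_field(row, "password") models password ->, and copy_field(row, "name", "firstName") models name +> firstName|String.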
Expressions
Expressions are applied to incoming fields and the results are assigned to a new field. Expressions can have zero or more field arguments.
Note: Many more expression types are planned, including native support for regular expressions and JSON paths.
Currently, only intrinsic functions are supported. Intrinsic functions are built-in functions with optional parameters.
- No arguments
-
^intrinsic{} +> new_field|type // creates a new value with this field name
- No arguments, with parameters
-
^intrinsic{param1:value1, param2:value2} +> new_field|type
- No arguments and no declared results
-
^intrinsic{} -> // replace ALL fields with the results
- With arguments and no declared results
-
from_field1 + from_field2 ^intrinsic{} -> // applies only the arguments and replaces them with the results
- With arguments
-
from_field1 + from_field2 + from_fieldN ^intrinsic{} +> new_field|type // appends the new value to the tuple
- With arguments, with parameters
-
from_field1 + from_field2 + from_fieldN ^intrinsic{param1:value1, param2:value2} +> new_field|type
Expressions may retain or discard the argument fields, depending on the operator used.
Intrinsic Functions
Note: Many more functions are planned.
Built-in functions on fields can be applied to one or more fields in every tuple in the tuple stream.
ensureFields
Add any missing fields/columns. Use this when there are many files with overlapping but not consistent field names and the files need to be normalized so they can be processed as a single unit. Note that any fields not declared as result fields will be discarded.
- Def
-
^ensureFields{} -> to_field1 + to_field2 - ensure the results have all declared fields
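As a small model of what ensureFields does to each tuple, the Python sketch below normalizes a dictionary-based row to a declared set of result fields. The function ensure_fields is a hypothetical stand-in: missing fields are added as None and undeclared fields are dropped, as described above.

```python
def ensure_fields(row: dict, result_fields: list) -> dict:
    """Return a row containing exactly the declared fields, in declared order."""
    # Missing fields become None; fields not declared are discarded.
    return {name: row.get(name) for name in result_fields}
```

Applying the same result_fields to rows from differently shaped files gives every row an identical set of columns.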
fixedWidth
Pads a row/tuple to a fixed width by inserting nulls at a given index. Ensure type information is declared for result fields if required.
- Def
-
^fixedWidth{width:…,insertAt:…} -> - replace ALL fields with the fixed width result
^fixedWidth{width:…,insertAt:…} -> new_field|type + new_field2|type + etc - name the new fields
^fixedWidth{insertAt:…} -> new_field|type + new_field2|type + etc - replace all fields with the new field names
- Params
-
width
The width of the row/tuple, defaults to the size of the result fields.
insertAt
The index at which to begin inserting the null padding, defaults to -1 (last element).
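The padding behavior can be sketched in Python as shown below. This is only a model under stated assumptions: the function fixed_width is hypothetical, an insert index of -1 is taken to mean "append after the last element", and rows that are already at least as wide as the target are assumed to be left unchanged.

```python
def fixed_width(values: list, width: int, insert_at: int = -1) -> list:
    """Pad a row to `width` values by inserting None at `insert_at`."""
    missing = width - len(values)
    if missing <= 0:
        return list(values)            # assumption: wide-enough rows are unchanged
    # Assumption: -1 means the padding goes after the last element.
    index = len(values) if insert_at == -1 else insert_at
    return values[:index] + [None] * missing + values[index:]
```

For example, padding a two-value row to width 4 at index 1 keeps the first value in place and shifts the second to the end.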
fromJson
Converts a JSON string to a row/tuple.
- Def
-
^fromJson{} -> to_field1 + to_field2 - the field names must match the JSON properties at the root node
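The mapping from root-level JSON properties to fields can be illustrated with the Python sketch below. The function from_json is hypothetical, and treating a missing property as None is an assumption rather than documented behavior.

```python
import json


def from_json(json_string: str, to_fields: list) -> dict:
    """Build a row from the root-level properties named in `to_fields`."""
    root = json.loads(json_string)
    # Assumption: a property absent from the JSON object yields None.
    return {name: root.get(name) for name in to_fields}
```

Only properties at the root node are considered; nested values are carried over as-is.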
formatFields
Reformats all the argument fields to a new format. This is especially useful for remaining compatible with Apache Parquet.
- Def
-
^formatFields{format:…} -> - replace ALL fields with the formatted result
^formatFields{format:…} +> - append ALL fields with the formatted result
- Params
-
format
The format string.
- lowerUnderscore - converts to lower case and replaces the characters /\ .- with underscores. The default.
- upperUnderscore - converts to upper case and replaces the characters /\ .- with underscores
- camelCase - converts to camel case
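The three formats can be approximated in Python as follows. The function names are hypothetical, and the camel case rule (split on the same separators plus underscores, lower-case the first word, capitalize the rest) is an assumption, since the text does not define it precisely.

```python
import re

# The separator characters named above: / \ space . -
_SEPARATORS = re.compile(r"[/\\ .\-]")


def lower_underscore(name: str) -> str:
    """Replace separators with underscores and convert to lower case."""
    return _SEPARATORS.sub("_", name).lower()


def upper_underscore(name: str) -> str:
    """Replace separators with underscores and convert to upper case."""
    return _SEPARATORS.sub("_", name).upper()


def camel_case(name: str) -> str:
    """Join words in camel case (assumed word boundaries: separators and _)."""
    words = [w for w in re.split(r"[/\\ .\-_]+", name) if w]
    if not words:
        return ""
    return words[0].lower() + "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
```

A header such as "Order Date" would become order_date, ORDER_DATE, or orderDate respectively.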
sourcePath
Add the URI of the data currently being processed.
- Def
-
^sourcePath{} +> to_field - assign the current URI to to_field
toJson
Converts a row/tuple to a JSON string.
- Def
-
^toJson{} -> - replace ALL fields with the JSON string, using the default field name json
from_field1 + from_field2 ^toJson{} +> node - add the arguments to a new JSON object named node
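A minimal Python model of the conversion is shown below. The function to_json is hypothetical; it serializes either the whole row or only the named argument fields, leaving the choice of result field name (json by default, or a declared name such as node) to the caller.

```python
import json


def to_json(row: dict, arguments=None) -> str:
    """Serialize the whole row, or only the named argument fields, to JSON."""
    selected = row if arguments is None else {name: row[name] for name in arguments}
    return json.dumps(selected)
```

The returned string would then be stored in the result field, replacing or extending the tuple depending on the operator used.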
trimToNull
Convert all arguments to null if the string representation of the value is an empty string or only contains whitespace.
- Def
-
from_field1 + fromField2 ^trimToNull{} -> - convert any whitespace values to null while retaining the field names
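The rule above is easy to model in Python. The function trim_to_null below is a hypothetical sketch that applies the rule only to the named argument fields and leaves every other field, and all field names, untouched.

```python
def trim_to_null(row: dict, arguments: list) -> dict:
    """Set argument fields to None when their string form is empty or whitespace."""
    out = dict(row)
    for name in arguments:
        value = out.get(name)
        if value is not None and str(value).strip() == "":
            out[name] = None
    return out
```

Values that contain any non-whitespace character are not trimmed or otherwise modified.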
tsid
Create a unique id as a long or string (using https://github.com/f4b6a3/tsid-creator).
- Def
-
^tsid{node:…,nodeCount:…,epoch:…,format:…,counterToZero:…} +> intoField|type
type
Must be string or long, defaults to long. When string, the format is honored.
- Params
-
node
The node id, defaults to a random int.
If a string is provided, it is hashed to an int:
SIP_HASHER.hashString(s, StandardCharsets.UTF_8).asInt() % nodeCount;
nodeCount
The number of nodes, defaults to 1024.
epoch
The epoch, defaults to Instant.parse("2020-01-01T00:00:00.000Z").toEpochMilli().
format
The format, defaults to null. Example: K%S where %S is a placeholder.
Placeholders:
- %S: canonical string in upper case
- %s: canonical string in lower case
- %X: hexadecimal in upper case
- %x: hexadecimal in lower case
- %d: base-10
- %z: base-62
counterToZero
Resets the counter portion when the millisecond changes, defaults to false.
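To show how the node, nodeCount, and epoch parameters might relate to the generated long, here is a Python sketch of a time-sorted id layout. It assumes the layout documented by the tsid-creator project (a 42-bit millisecond timestamp followed by 22 bits shared between node id and counter) and that nodeCount is a power of two; compose_tsid is a hypothetical helper, not the library's API.

```python
from datetime import datetime, timezone

# Default epoch from the parameter list: 2020-01-01T00:00:00.000Z in milliseconds.
DEFAULT_EPOCH_MS = int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
RANDOM_BITS = 22  # assumption: 42 time bits + 22 bits shared by node and counter


def compose_tsid(now_ms: int, node: int, counter: int,
                 node_count: int = 1024, epoch_ms: int = DEFAULT_EPOCH_MS) -> int:
    """Pack time, node id, and counter into one integer (illustrative layout)."""
    node_bits = (node_count - 1).bit_length()      # 1024 nodes -> 10 bits
    counter_bits = RANDOM_BITS - node_bits         # remaining bits for the counter
    time_part = (now_ms - epoch_ms) << RANDOM_BITS
    node_part = (node % node_count) << counter_bits
    counter_part = counter & ((1 << counter_bits) - 1)
    return time_part | node_part | counter_part
```

Under this layout, ids generated later always compare greater than earlier ones, and with the default nodeCount of 1024 each node has 12 counter bits available per millisecond.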