Source and Sink

Pipelines read data from sources and write data to sinks.

Each source or sink has a format and protocol.

The data may also be partitioned by values in the data set, which allows data from new data sets to be interleaved into existing data sets.

This can significantly improve the performance of queries on the data.


Every source and sink supports its own set of formats.

tess --show-source=formats

Lines of text parsed by regex (like Apache or S3 log files).


With or without headers.


Apache Parquet

Regex support is based on regex groups. Groups are matched by ordinal with the declared fields in the schema.
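As an illustration of matching groups by ordinal, the following Python sketch pairs each regex capture group with a declared field by position. The pattern, field names, and sample line are hypothetical and are not taken from tess; they only show the idea.

```python
import re
from typing import Dict, Optional

# Hypothetical schema: the declared fields, in order.
FIELDS = ["host", "status", "size"]

# Hypothetical pattern: one capture group per declared field, in the same order.
PATTERN = re.compile(r"^(\S+) (\d{3}) (\d+)$")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Map regex groups to declared fields by ordinal; return None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Group 1 pairs with FIELDS[0], group 2 with FIELDS[1], and so on.
    return dict(zip(FIELDS, match.groups()))


record = parse_line("example.org 200 5120")
```

Because the pairing is positional, reordering the groups in the pattern without reordering the declared fields would silently assign values to the wrong fields.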

Provided named formats include:

AWS S3 Access Logs


  "source": {
    "schema": {
      "name": "aws-s3-access-log"
    }
  }


Every source and sink supports its own set of protocols.

tess --show-source=protocols

Read/write local files.


Read/write files in AWS S3.


Read/write files on the Apache Hadoop HDFS filesystem.


Every source and sink supports its own set of compression formats.

tess --show-source=compression

Some common formats supported are:

  • none

  • gzip

  • lz4

  • bzip2

  • brotli

  • snappy


Partitioning can be performed on values read or created in the pipeline.

Writing Partitions

Path partitioning

Data can be partitioned by intrinsic values in the data set.

named partitions

e.g. year=2023/month=01/day=01, or

unnamed partitions

e.g. 2023/01/01

Partitions, when declared in the pipeline file, can be simple, or represent a transform.


<field_name> becomes /<field_name>=<field_value>/


<field_name>+><partition_name>|<field_type> becomes /<partition_name>=<transformed_value>/

Note the +> operator.

Consider the following example, where time is either a long timestamp, or an Instant.

  • time+>year|DateTime|yyyy

  • time+>month|DateTime|MM

  • time+>day|DateTime|dd

The above produces a path like /year=2023/month=01/day=01/.
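To make the transform concrete, here is a minimal Python sketch that parses declarations of the form shown above and builds the resulting path. It is not how tess implements partitioning. The helper names, the mapping from `yyyy`/`MM`/`dd` to `strftime` codes, and the assumption that a long timestamp is epoch milliseconds in UTC are all illustrative.

```python
from datetime import datetime, timezone

# Assumed mapping from the pattern letters used above to strftime codes.
PATTERN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def parse_declaration(decl):
    """Split '<field>+><partition>|<type>|<pattern>' into its four parts."""
    field, rest = decl.split("+>", 1)
    partition, field_type, pattern = rest.split("|")
    return field, partition, field_type, pattern


def partition_path(record, declarations):
    """Build a named partition path such as /year=2023/month=01/day=01/."""
    segments = []
    for decl in declarations:
        field, partition, field_type, pattern = parse_declaration(decl)
        if field_type != "DateTime":
            raise ValueError(f"unsupported type in this sketch: {field_type}")
        value = record[field]
        if isinstance(value, datetime):
            moment = value
        else:
            # Assumption: a long timestamp is epoch milliseconds, read as UTC.
            moment = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
        formatted = moment.strftime(PATTERN_TO_STRFTIME[pattern])
        segments.append(f"{partition}={formatted}")
    return "/" + "/".join(segments) + "/"


declarations = [
    "time+>year|DateTime|yyyy",
    "time+>month|DateTime|MM",
    "time+>day|DateTime|dd",
]
# 1672531200000 ms is 2023-01-01T00:00:00Z.
path = partition_path({"time": 1672531200000}, declarations)
```

In this sketch the source field `time` stays in the record; the declarations only derive new partition values from it.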

Reading Partitions

When reading partitions, the partition values are extracted from the path.

Path partitioning

Data that is partitioned can embed the partition values into the data.

named partitions

e.g. year=2023/month=01/day=01, or

unnamed partitions

e.g. 2023/01/01

When named, the key side of key=value becomes a field in the schema. When unnamed, the declared field name is used.

  • y->year|DateTime|yyyy

  • m->month|DateTime|MM

  • d->day|DateTime|dd

Note the -> operator, which implies the original field is discarded and not embedded in the schema.

(A future version may support +> in order to retain the original field and value.)

A path of y=2023/m=01/d=01 results in the following fields, all of type DateTime:

  • year

  • month

  • day
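The renaming described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tess implementation: the function names are hypothetical, the mapping operator is accepted in either its ASCII (`->`) or arrow form, and values are kept as raw strings, with conversion to DateTime left out.

```python
import re


def parse_partition_mapping(decl):
    """Split '<key>-><field>|<type>|<pattern>' into (key, field, type, pattern)."""
    # Accept either the ASCII operator or the rendered arrow character.
    key, rest = re.split(r"->|→", decl, maxsplit=1)
    field, field_type, pattern = rest.split("|")
    return key, field, field_type, pattern


def read_named_partitions(path, declarations):
    """Extract key=value path segments and rename keys to the declared fields."""
    renames = {}
    for decl in declarations:
        key, field, _type, _pattern = parse_partition_mapping(decl)
        renames[key] = field

    values = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # Not a named partition segment.
        key, value = segment.split("=", 1)
        if key in renames:
            # The original key (e.g. 'y') is discarded; only the new name is kept.
            values[renames[key]] = value
    return values


fields = read_named_partitions(
    "y=2023/m=01/d=01",
    ["y->year|DateTime|yyyy", "m->month|DateTime|MM", "d->day|DateTime|dd"],
)
```

Only the renamed fields appear in the result, which mirrors the rule that the original key is not embedded in the schema.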

File naming

Workload processes can fail. When they do, it is important not to overwrite existing files, and to be able to find the files that were created and written before the failure.

The following metadata can help disambiguate files across processing runs and help detect schema changes.

Filename metadata



Prefix: the value part by default.


Fields hash: a hash of the schema (field names and field types), so that schema changes can be detected.


GUID: a random UUID or a provided value.

The JSON model for this metadata is:

 "filename" : {
      "prefix" : null, (1)
      "includeGuid" : false, (2)
      "providedGuid" : null, (3)
      "includeFieldsHash" : false (4)
 }

1 The prefix to use for the filename. Defaults to part.
2 Whether to include a random UUID in the filename. Defaults to false.
3 A provided UUID to use in the filename. Defaults to using a random UUID.
4 Whether to include a hash of the schema (field name + type) in the filename. Defaults to false.
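
To show how these settings might combine, the following Python sketch assembles a filename from the model above. The actual filename layout used by tess is not described here, so the part ordering, the `-` separator, the file extension, the truncated SHA-256 hash, and the rule that `providedGuid` only applies when `includeGuid` is true are all assumptions made for illustration.

```python
import hashlib
import uuid


def fields_hash(fields):
    """Hash field names and types so that a schema change alters the value.

    Assumption: SHA-256 over 'name:type' pairs, truncated to 8 hex characters.
    """
    canonical = ",".join(f"{name}:{ftype}" for name, ftype in fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def build_filename(config, fields, extension="parquet"):
    """Assemble '<prefix>[-<fields hash>][-<guid>].<extension>' (assumed layout)."""
    parts = [config.get("prefix") or "part"]  # null prefix falls back to 'part'
    if config.get("includeFieldsHash"):
        parts.append(fields_hash(fields))
    if config.get("includeGuid"):
        # Use the provided value if present, otherwise a random UUID.
        parts.append(config.get("providedGuid") or str(uuid.uuid4()))
    return "-".join(parts) + "." + extension


schema = [("time", "DateTime"), ("status", "Integer")]
default_name = build_filename({"prefix": None}, schema)
tagged_name = build_filename(
    {"includeGuid": True, "providedGuid": "run-42", "includeFieldsHash": True},
    schema,
)
```

With a provided value, every file from one run shares the same tag, which makes it easier to find the files written before a failure. A random UUID instead keeps separate runs from overwriting each other's files.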