Clusterless

You are reading a DRAFT of the Clusterless documentation. Features documented here may not be available yet, and available features may not be properly documented, or documented at all.

Clusterless is a tool for deploying decentralized, scalable, and secure data-processing workloads for continuously arriving data, across clouds.

That is, as new data arrives (in an object-store bucket, for example), a custom workload (Python, Java, etc.) processes it, then hands the results to another workload by publishing an availability event.

Any developer or team can subscribe their own workload to any upstream availability event.
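
To make the shape of this pattern concrete, here is a minimal sketch, assuming AWS with boto3, of a workload publishing an availability event after writing its results. The bucket name, event bus name, and event fields are hypothetical illustrations, not the actual Clusterless event schema.

  import json

  import boto3

  s3 = boto3.client("s3")
  events = boto3.client("events")

  # Write the workload's results to the object store.
  s3.put_object(
      Bucket="example-results-bucket",                   # hypothetical bucket
      Key="logs/cleaned/20230115/part-0000.parquet",
      Body=b"...",                                        # the processed data
  )

  # Announce that new data is available; downstream workloads react to events
  # like this (for example, via an EventBridge rule) instead of polling.
  events.put_events(
      Entries=[
          {
              "EventBusName": "example-pipeline-events",  # hypothetical bus
              "Source": "example.pipeline.logs",
              "DetailType": "dataset-available",
              "Detail": json.dumps(
                  {"dataset": "logs-cleaned", "interval": "20230115"}
              ),
          }
      ]
  )

A downstream team subscribes by matching on the event source and detail type, so the upstream team never needs to know who consumes its data.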

This allows developers and teams to work relatively independently when building and updating data pipelines.

And because of the rich standardized metadata Clusterless maintains, stakeholders can quickly understand the reliability and availability of a given data pipeline.

Because Clusterless leverages native pay-as-you-go primitives, no runtimes or dedicated services need to be managed. That is, every line of code managing or processing your data is part of this open source project, is provided by you or your team, or is provided by your cloud provider. No third-party hosted runtimes are required.

Consequently, zero data arriving means zero costs (other than storage for historical data).

A Common Pipeline Scenario

Different stakeholders have different needs from any given dataset.

Log data, for example, can be used by developers and SREs to find the root causes of system issues affecting availability.

Product owners may be focused on customer retention, especially across different cohorts of customers and users. Consider a cohort of customers at risk of not renewing a subscription because they frequently complain about performance problems or errors.

Because these stakeholders have different needs, they will access the data differently. To meet these requirements as data arrives, it should be organized, formatted, and exposed to systems suited to each purpose.

Product owners may want the log data broken down by application, release version, and month of the logged event, and made accessible to a nightly report in Superset or Tableau backed by AWS Athena or AWS Redshift.

Developers and SREs may need the logs heavily filtered down to relevant log types and organized by logged event date, region or datacenter, and hostname. An Elasticsearch cluster could then be spun up with just the operational data for the relevant cohort and time range, optimized for the current incident.
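
As an illustration only, the hypothetical object-store prefixes below show how the same log data might be laid out for each branch; neither layout is prescribed by Clusterless.

  Product owners (monthly reporting, Athena/Redshift):
    logs/reporting/app=checkout/version=2.3.1/month=2023-01/...

  Developers and SREs (incident response, Elasticsearch):
    logs/operational/date=2023-01-15/region=us-west-2/host=web-114/...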

Both these paths start with the same dataset, but as the needs and requirements branch, the data pipeline branches to support them. More importantly, the data developers supporting a stakeholder can work independently within their own pipeline branch.

Pipeline Capabilities

Clusterless provides a framework for deploying and running data processing workloads that only execute when data becomes available upstream (or optionally on a schedule). The framework is agnostic to how these workloads are implemented, but is strict about the metadata it maintains about the available data and the runtime state of any workload.

Because of the metadata Clusterless maintains, workloads and their result datasets can be:

  • back-filled - a new workload (or version) can run against historical data

  • replayed - a fixed workload can re-run over a range of data to correct errors in place

  • checked for gaps (gap reporting) - missing data can be accounted for by walking the upstream dependencies

  • enabled/disabled - trivially pause or restart event listening

  • versioned - new versions of workloads can be deployed that introduce incompatibilities or rely on new upstream features

  • scoped - production, development, and testing pipelines can co-exist within a single account

  • tested - workloads can be tested locally; data pipelines can be tested locally or within CI/CD pipelines using our scenario runner
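
To make the role of this metadata concrete, here is a purely illustrative Python sketch of the kind of record that enables back-fill, replay, and gap reporting; the field names are hypothetical, not the actual Clusterless metadata format.

  # Hypothetical illustration only; not the actual Clusterless schema.
  manifest = {
      "dataset": "logs-cleaned",
      "interval": "20230115T00",      # the slice of time this entry covers
      "objects": [
          "s3://example-results-bucket/logs/cleaned/20230115T00/part-0000.parquet",
      ],
      "workload": {"name": "clean-logs", "version": "1.4.0", "state": "complete"},
  }

  # With records like this, a missing interval shows up as a gap, and re-running
  # a fixed workload over a range of intervals is a mechanical operation.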

Heterogeneous Reusable Workloads

Workloads managed by Clusterless can be used to:

  • reformat data (from text to Parquet; a sketch follows these lists)

  • repartition data for improved accessibility and performance (group the data by new partition keys)

  • perform feature extraction via custom code

  • execute training and validation

  • enforce GDPR (privacy) compliance (via identity tokenization and data retention policies)

Workloads can be:

  • Docker images (Python, Java, Node, etc.)

  • Serverless functions (Python, Java, Node, etc.)

  • Native services (AWS SageMaker, AWS Athena, AWS Glue, etc.)
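
As an example of the first use above, here is a minimal sketch of a "reformat" workload that converts delimited text into Parquet. It assumes pyarrow and hypothetical local file names; a deployed workload would be packaged as a Docker image or function and would read from and write to object storage.

  import pyarrow.csv as pacsv
  import pyarrow.parquet as pq


  def reformat(src: str = "logs-20230115.csv", dst: str = "logs-20230115.parquet") -> None:
      table = pacsv.read_csv(src)    # parse the delimited text into an Arrow table
      pq.write_table(table, dst)     # write it back out as columnar Parquet


  if __name__ == "__main__":
      reformat()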

Multi-Cloud

The intent of Clusterless isn’t to have functional parity across providers, but to ease secure and reliable interoperability between them.