This is a quick note on the key considerations I have gathered over the years while designing and developing large-scale distributed real-time data processing systems. These principles follow evolving usage patterns and use cases.
1. Processing Semantics
- “Exactly once” or “at-least once” processing semantics. Choose your semantics based on your use case’s needs; it is possible to configure different semantics for each use case.
- Consistency (a strict — eventual mix) over Availability. Refer to the CAP theorem, and the fact that you must account for network partitioning.
- In-memory Data Grids (IMDGs) or a Caching layer that supports Transactions — preferably ACID compliant. Distributed Transaction support in Microservices via Sagas/3-Phase Commits.
- Programmatic Workflows; independent step re-executions. Go for Apache Airflow.
- Aimed Throughput Support — xx million natural incoming events/sec
- Aimed Latency — End to end (< yy secs, zzth percentile. e.g., <2 secs, 99th percentile)
- Circumvent the need for Backpressure. Backpressure is an archaic idea; modern ultra-low-latency systems must be able to handle spiking loads.
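The “at-least once” semantics above can be sketched in a few lines: the trick is the ordering of processing versus offset commits. This is a minimal stand-in with an in-memory list playing the role of a durable broker log such as Kafka; the names (`log`, `committed_offset`, `process`) are illustrative, not any specific broker API.

```python
# Minimal sketch of at-least-once delivery: the offset is committed only
# AFTER processing succeeds, so a crash mid-stream replays events
# (duplicates possible) rather than losing them.

log = ["evt-1", "evt-2", "evt-3"]   # the durable event log (stand-in)
committed_offset = 0                # last safely processed position
processed = []

def process(event):
    # Side effect goes here; it must be idempotent if replayed
    # duplicates are to be harmless.
    processed.append(event)

while committed_offset < len(log):
    event = log[committed_offset]
    process(event)                  # 1. process first...
    committed_offset += 1           # 2. ...then commit the offset
```

At-most-once would swap steps 1 and 2 (commit first, then process), trading duplicates for potential loss on a crash; exactly-once layers transactional commits on top of this.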
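A latency target like “< 2 secs, 99th percentile” is only meaningful if you continuously check it against measured samples. A minimal sketch using only the standard library, with made-up latency figures:

```python
import statistics

# Hypothetical end-to-end latencies (seconds) for 1,000 events.
latencies = [0.5 + 0.001 * i for i in range(1000)]  # 0.5s .. 1.499s

# statistics.quantiles with n=100 returns 99 cut points;
# the last one is the 99th percentile.
p99 = statistics.quantiles(latencies, n=100)[-1]

assert p99 < 2.0, f"p99 latency {p99:.3f}s breaches the 2s SLO"
```

In production this check would run over a sliding window fed by your metrics pipeline rather than a static list.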
2. Service Design
- Go for Stateless Services. It’s not just a fad: stateless services help you scale, and scale fast (new pods simply being added), versus the time a stateful instance needs to warm up to the current/last cluster state before it is operational.
- Reactive Services — responsive, resilient, elastic and asynchronous message driven Microservices.
- RT Reconciliations: Opt for micro-batched reconciliations — time-bound or by volume.
- Provide support for on-demand discrepancy resolutions and re-delivery requests/fulfilments. Discrepancies include missed events and duplicate events, due to IP/Kernel packet loss or otherwise. Things will go wrong; be ready to mitigate them, from Day 1, in your design.
- Enable ‘backfilling’ from your batch service pipeline (and EOD master reconcile). When your pipeline has a massive intra-day outage, EOD reconciliation/backfilling is your last resort — especially if you are building systems for Banks, where EOD recons are the golden source.
- Provide integrations with Online Feature Stores and Data Catalogs. Make sure you are building for a unified and cohesive ecosystem. Plugging into Online Feature Stores will make you your Data Scientists’ saviour.
- Enable sharing of RT Context and Computation between processes. Too often our RT processors work in isolation.
- Serialisation — Deserialisation: first, try to minimise SerDe. Opt for uniformity, speed, payload size and schema evolution support.
3. Data Composition
- Schema Registry — Opt for a standard format (Avro or Protobuf), Schema Validation and Versioning.
- Define your transformations-as-code.
- Make sure you account for RT Unstructured data processing too.
- Depending upon where your infrastructure is currently, you must be ready to Orchestrate between Cloud and On-Prem.
- Design a Unified Serving layer/interface across batch, near real time and real time end consumers.
- ANSI SQL-compliant, query-able components (e.g., event transport, IMDG).
- Use a standard in-memory format like Apache Arrow. If your on-disk storage is columnar (Parquet/ORC), this is an excellent fit alongside your analytical querying needs.
- Enable self-service. Your service tenants are excited to try out your new glitzy RT framework, don’t keep them waiting ;)
- Perhaps the most important aspect of your design should be “observability”. When dealing with RT events and in-memory stores, it’s easy to lose track of what is “actually” happening and how events are flowing. End-to-end event tracing with Jaeger/OpenTracing is the suggested way forward.
- Support on-demand event replays. If your service allows for ad-hoc querying — chances are you will have to allow for event replays. Do this preferably in replicated, simulated sandbox offloads.
- Some organisations may have a need for a Federated Operations Model. Make sure you enable such intricacies.
- Of course, infra should be agile (containerised) and scalable.
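The Schema Validation and Versioning point above boils down to a compatibility check the registry runs on every registration. This is a minimal sketch of a backward-compatibility rule: no required field may disappear, and new fields must carry a default so old events still deserialise. The dict-based schema format is illustrative, not Avro’s or Protobuf’s actual representation:

```python
def is_backward_compatible(old, new):
    """Can a reader of `new` still consume events written with `old`?"""
    new_fields = {f["name"]: f for f in new["fields"]}
    # 1. No field from the old schema may disappear.
    for f in old["fields"]:
        if f["name"] not in new_fields:
            return False
    # 2. Any field added in the new schema needs a default value.
    old_names = {f["name"] for f in old["fields"]}
    for name, f in new_fields.items():
        if name not in old_names and "default" not in f:
            return False
    return True

v1 = {"fields": [{"name": "id"}, {"name": "price"}]}
v2 = {"fields": [{"name": "id"}, {"name": "price"},
                 {"name": "qty", "default": 0}]}   # additive + default: OK
v3 = {"fields": [{"name": "id"}]}                  # drops "price": rejected

assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v1, v3)
```

Real registries (e.g. Confluent’s) offer several such modes — backward, forward, full — but they are all variations on checks of this shape.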
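“Transformations-as-code” can be as simple as expressing each step as a named, unit-testable function and a pipeline as an ordered list of them, so the whole thing lives in version control and goes through code review. A minimal sketch; the field names and steps are made up:

```python
from functools import reduce

def normalise_symbol(event):
    # Each step takes an event dict and returns a new one.
    return {**event, "symbol": event["symbol"].upper()}

def add_notional(event):
    return {**event, "notional": event["price"] * event["qty"]}

# The pipeline definition IS the code artefact: reviewable, diffable, testable.
PIPELINE = [normalise_symbol, add_notional]

def apply_pipeline(event, steps=PIPELINE):
    return reduce(lambda acc, step: step(acc), steps, event)

out = apply_pipeline({"symbol": "aapl", "price": 101.25, "qty": 100})
assert out["symbol"] == "AAPL"
assert out["notional"] == 10125.0
```

The same idea scales up to declarative DAGs of transformations, which is exactly what tools like Airflow give you at the workflow level.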