Intro

Apache Flink is a stream processing framework that can also handle batch tasks. It considers batches to simply be data streams with finite boundaries, and thus treats batch processing as a subset of stream processing. This stream-first approach to all processing has a number of interesting side effects.

This stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture (where batching is used as the primary processing method with streams used to supplement and provide early but unrefined results). Kappa architecture, where streams are used for everything, simplifies the model and has only recently become possible as stream processing engines have grown more sophisticated.

The four cornerstones on which Flink is built

  1. Streaming
  2. State
  3. Time
  4. Snapshots

Real-time Stream Processors

Advanced stream processors allow you to perform a wide range of tasks including:

  • Windowed aggregations (e.g., count pageviews per minute)
  • Stream-to-table joins (e.g., enrich clickstream data with user profiles)
  • Event filtering and deduplication
  • Resilience to late or out-of-order data

Stream processing frameworks are purpose-built to manage state, ordering, and fault tolerance at scale.
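As a concrete illustration of the first bullet, here is a minimal, framework-free sketch of a tumbling-window aggregation (count pageviews per minute). The event tuples and timestamps are invented for illustration; a real Flink job would express this with keyBy and window operators instead.

```python
from collections import defaultdict

def count_per_minute(events):
    """events: iterable of (epoch_seconds, page) tuples."""
    counts = defaultdict(int)
    for ts, page in events:
        window_start = ts - (ts % 60)   # assign event to its 60-second tumbling window
        counts[(window_start, page)] += 1
    return dict(counts)

events = [(0, "/home"), (30, "/home"), (65, "/home"), (70, "/about")]
result = count_per_minute(events)
# {(0, '/home'): 2, (60, '/home'): 1, (60, '/about'): 1}
```

The key idea a stream processor adds on top of this sketch is that windows are maintained as managed state, fire on event time rather than arrival order, and survive failures via snapshots.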

Streaming

  • A stream is a sequence of events
  • Business data is always a stream: bounded or unbounded
  • For Flink, batch processing is just a special case in the runtime
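The "batch is a special case" point can be sketched without any framework: the same operator works unchanged on a finite list (batch) and on a potentially unbounded generator (stream). The names here are illustrative, not Flink API.

```python
import itertools

def running_sum(stream):
    """One operator, agnostic to whether its input is bounded or unbounded."""
    total = 0
    for value in stream:
        total += value
        yield total

# Bounded input: behaves like batch processing and terminates.
batch_result = list(running_sum([1, 2, 3]))       # [1, 3, 6]

# Unbounded-style input: consume results incrementally as they arrive.
first_three = list(itertools.islice(running_sum(itertools.count(1)), 3))
# first_three == [1, 3, 6]
```

Flink's runtime applies the same principle: a bounded source simply reaches its end, letting the engine apply batch-only optimizations while keeping one programming model.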

Although built as a generic data processor, Flink owes much of its popularity to its native support for unbounded streams. Its common use cases closely mirror Kafka's, although the two serve different roles: Kafka typically provides the event stream, while Flink processes the data from that stream. Flink and Kafka are commonly used together for:

  • Batch processing: Flink is good at processing bounded data sets, suitable for traditional batch processing tasks.
  • Stream processing: Flink handles unbounded data streams for continuous, real-time processing.
  • Event-driven applications: Flink’s event stream capabilities make it valuable for fraud detection, credit card transaction monitoring, and process monitoring.
  • Update stateful applications (savepoints): Flink’s stateful processing and savepoints ensure consistency during updates or failures.
  • Streaming applications: from real-time data transformation to complex event pattern detection.
  • Data analytics (batch, streaming): Ideal for real-time and historical data analysis.
  • Data pipelines/ETL: Flink is used in building pipelines for ETL processes.
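To make the event-driven use case above concrete, here is a hedged sketch of the stateful pattern behind fraud detection: keep per-key state and flag a key that produces too many events in a short span. The threshold, window length, and field names are invented for illustration; in Flink this would be keyed state updated by a process function.

```python
from collections import defaultdict, deque

def flag_bursts(events, max_events=3, window_s=60):
    """events: iterable of (epoch_seconds, account_id), in timestamp order.
    Yields (ts, account) whenever an account exceeds max_events in window_s."""
    recent = defaultdict(deque)   # per-key state, like Flink keyed state
    for ts, account in events:
        q = recent[account]
        q.append(ts)
        # Evict events that fell out of the sliding window.
        while q and q[0] <= ts - window_s:
            q.popleft()
        if len(q) > max_events:
            yield (ts, account)

events = [(0, "A"), (10, "A"), (20, "A"), (30, "A"), (100, "B")]
alerts = list(flag_bursts(events))
# [(30, 'A')] — four events from account "A" within 60 seconds
```

What Flink adds over this sketch is exactly the cornerstones listed earlier: the per-key state is checkpointed in snapshots, and event time handles late or out-of-order data.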

Others

  • A consequence of using a "group by" operation (keyBy) in Flink is that records are repartitioned across the cluster, so that all records with the same key reach the same task.
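The repartitioning happens because each record must be routed to whichever worker owns its key's hash partition, which forces a network shuffle. The following sketch shows the routing idea only; the partition count and hash function are simplified stand-ins for Flink's key-group mechanism.

```python
import zlib

def partition_for(key: str, num_workers: int) -> int:
    # Deterministic hash so routing is stable across processes.
    return zlib.crc32(key.encode()) % num_workers

# All records with the same key land on the same worker,
# regardless of which upstream task produced them.
records = [("user1", 5), ("user2", 7), ("user1", 3)]
routed = {}
for key, value in records:
    routed.setdefault(partition_for(key, 4), []).append((key, value))
```

Because the hash depends only on the key, both "user1" records end up in the same partition, which is what makes per-key aggregations correct after the shuffle.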