When developers create streaming applications, they can choose from several stream processing frameworks such as Apache Flink, Apache Spark, and Apache Kafka Streams. These tools are widely used, from startups to large corporations, to develop near-real-time applications across various industries.
Streaming applications built with these frameworks are typically structured as Directed Acyclic Graphs (DAGs) that break the work into multiple steps. This also applies to Apache Flink jobs, which are therefore scalable and can be executed in a distributed environment.
Getting started with Apache Flink is not always easy for developers without Java knowledge. Flink SQL was therefore created as an alternative interface that makes developing real-time applications with Apache Flink more accessible. This article shows how a production-ready sample application can be developed using SQL code.
Anatomy Of A Flink Application
Apache Flink is a framework for stateful computations in unbounded data streams. Apache Flink applications have three essential components: stream, state, and time.
Apache Flink supports a variety of sources for unbounded data streams, including Apache Kafka, Amazon Kinesis, and RabbitMQ. When the data is processed further with Apache Flink, a distinction is made between trivial and non-trivial applications. Trivial applications perform transformations or calculations on individual elements, such as changing a timestamp format. Non-trivial applications, on the other hand, consider several elements for an analysis or transformation; for example, a value can be summed across several elements, which requires the application to keep state.
In addition to state, time is another critical factor in Flink applications. A distinction is made between two types of timestamps: with processing time, the timestamp is set at the moment an element is processed in the application. More commonly, however, event time is used. Here the timestamp is set when the data originates at the source (e.g., an IoT sensor) and is therefore independent of external factors such as the transmission speed between source and application. This is significant for sources such as mobile devices and IoT sensors, where external factors can delay the arrival time. Within the application, so-called watermarks can define how late events may arrive and how they should be handled.
Development Of A Flink SQL Job
Flink SQL commands can be executed directly or embedded in a Java/Scala application via the Flink library. For the direct method, SQL statements can be executed using the Flink SQL Client or via the Flink interpreter in Apache Zeppelin.
For the Flink application to read data from sources and write results to targets, it must have matching connectors. These are specified as parameters at startup; see the Flink SQL documentation for examples. This example uses the Amazon Kinesis Data Streams and Elasticsearch connectors. The Elasticsearch connector is also compatible with OpenSearch.
The “TLC Trip Records” data set, which describes taxi rides in New York City, USA, is used as the example data set. The relevant fields are PULocationID (pickup location), DOLocationID (drop-off location), tpep_pickup_datetime (pickup time), and tpep_dropoff_datetime (arrival time).
This first Flink job aims to determine how long a trip to Newark Airport (EWR) takes on average. The average is calculated over a one-hour time window.
Define Schema / Create Tables
To work with the data from the data stream, the Flink SQL application must know its schema. For this purpose, a table is created using SQL commands and the Amazon Kinesis connector.
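A minimal sketch of such a table definition is shown below. The stream name taxi_trips, the AWS region, and the JSON format are assumptions for illustration; the exact connector option names can vary between Flink versions.

CREATE TABLE trips (
    PULocationID          INT,
    DOLocationID          INT,
    tpep_pickup_datetime  TIMESTAMP(3),
    tpep_dropoff_datetime TIMESTAMP(3),
    -- event-time attribute with a tolerance of five seconds for late elements
    WATERMARK FOR tpep_dropoff_datetime AS tpep_dropoff_datetime - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kinesis',           -- Amazon Kinesis Data Streams connector
    'stream' = 'taxi_trips',           -- hypothetical stream name
    'aws.region' = 'us-east-1',        -- hypothetical region
    'scan.stream.initpos' = 'LATEST',  -- start reading at the tip of the stream
    'format' = 'json'                  -- records are assumed to arrive as JSON
);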
The WATERMARK line defines the logic for late-arriving elements. In this case, the application waits up to five seconds after the end of the time window before the result is emitted, so elements that arrive late due to higher latency can also be included.
Elasticsearch is used to visualize the results, so a table must be created for it as well. A data output is also referred to as a “sink” in Apache Flink. The average drive time table includes the pickup location, the average drive time, and the start of the time window.
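A possible definition of this sink table is sketched below; the table name, column names, index name, and Elasticsearch endpoint are assumptions.

CREATE TABLE average_drive_time (
    PULocationID           INT,
    avg_drive_time_minutes DOUBLE,
    window_start           TIMESTAMP(3)
) WITH (
    'connector' = 'elasticsearch-7',                   -- also compatible with OpenSearch
    'hosts' = 'https://my-es-domain.example.com:443',  -- hypothetical endpoint
    'index' = 'average_drive_time'                     -- target index in Elasticsearch
);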
Data Processing With Time Windows
To carry out calculations over a defined time window, Flink SQL provides window functions. The most commonly used are tumbling (TUMBLE) and sliding (HOP) windows.
Tumbling windows define time windows that do not overlap, for example 11:00 to 12:00, while sliding windows can overlap. Sliding windows take an additional slide parameter, whereas tumbling windows only take a parameter for the duration of the window. A sliding window with a slide of five minutes and a duration of one hour covers, for example, the periods 11:00 – 12:00, 11:05 – 12:05, and 11:10 – 12:10.
A tumbling window is used for the example. The following SQL command calculates the average duration of trips to Newark Airport (LocationID 1) in a one-hour time window. To exclude potential outliers, only trips that last less than four hours are considered.
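The query below is a sketch of this calculation under the assumptions made above (the table names trips and average_drive_time, LocationID 1 for Newark Airport); it uses the classic group window syntax with TUMBLE.

INSERT INTO average_drive_time
SELECT
    PULocationID,
    -- average trip duration in minutes within the one-hour window
    AVG(CAST(TIMESTAMPDIFF(MINUTE, tpep_pickup_datetime, tpep_dropoff_datetime) AS DOUBLE)) AS avg_drive_time_minutes,
    TUMBLE_START(tpep_dropoff_datetime, INTERVAL '1' HOUR) AS window_start
FROM trips
WHERE DOLocationID = 1  -- trips to Newark Airport (EWR)
  AND TIMESTAMPDIFF(HOUR, tpep_pickup_datetime, tpep_dropoff_datetime) < 4  -- exclude outliers
GROUP BY PULocationID, TUMBLE(tpep_dropoff_datetime, INTERVAL '1' HOUR);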
Kibana, the dashboard component of the Elastic stack, is used to display the data. With Kibana, the data in the Elasticsearch cluster can be accessed directly.
In this visualization, the first week of May 2019 was selected as an example period. A daily increase in driving times between 5:00 p.m. and 6:00 p.m. can be seen. The insights generated can be used in various ways, for example to create forecasts or to enable end customers to plan their trips better.
Apache Flink SQL On AWS
Flink jobs in Java, Scala, or Python are primarily developed in IDEs (Integrated Development Environments) such as IntelliJ. An alternative is Apache Zeppelin notebooks, which support Apache Flink as an interpreter. Since May 2021, AWS has offered Amazon Kinesis Data Analytics Studio (KDA Studio), a managed service for Zeppelin notebooks with the Flink interpreter. Since there is no need to set up a development environment, getting started with Flink development becomes much easier.
Another advantage of KDA Studio is that created jobs can be exported directly to Amazon S3 and run in Kinesis Data Analytics.
Operation Of Flink Applications On AWS
Apache Flink applications can run on clusters such as Kubernetes, YARN, or Apache Mesos, which can be operated both on-premises and in the cloud. A significant advantage over the self-hosted variant is offered by managed services, which hand operation over to the cloud provider: customers only have to provide the code or the application.
Amazon Kinesis Data Analytics (KDA) is a managed service on AWS that runs a Flink application in a highly available environment without user-provided servers. Furthermore, KDA supports autoscaling and integrates Amazon CloudWatch as a logging solution. Apache Flink uses savepoints to save the state of applications. When the application runs on KDA, automatic savepoints (also called snapshots) are created when the application is updated, stopped, or scaled. Manual snapshots can also be created via API calls.
Conclusion
Apache Flink enables the analysis of data streams in real time, regardless of data throughput. With support for SQL commands, Apache Flink makes it much easier to get started with analyzing unbounded data streams. With Amazon Kinesis Data Analytics Studio, AWS provides a managed service to interactively develop Flink jobs in SQL, Python, and Scala. In addition, AWS offers an integration to deploy notebook applications to Amazon Kinesis Data Analytics, which takes care of the operation, scaling, and management of the underlying infrastructure.