Combine two streams in Apache Flink regardless on window time

FLoppix picture FLoppix · Sep 2, 2017 · Viewed 8.3k times · Source

I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other and there are times where one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that is coming?

The only solution I found is using the join function, but you have to specify a common window, where you can apply the join function. This is window is not reached, when one stream is not receiving any events.

Is there a possibility to apply the join function on every event that is coming from either one stream or the other and maintain state of the last consumed event and use this event for the join function?

Thanks in advance for any helpful tips!

Answer

David Anderson picture David Anderson · Sep 5, 2017

There are many different approaches to combining or joining two streams in Flink, depending on requirements of each specific use case. When doing this "by hand", you want to be using Flink's ConnectedStreams with a RichCoFlatMapFunction or CoProcessFunction. Either of these will allow you to keep managed state (i.e. the last element from the infrequently updating stream), and join it with the faster stream. CoProcessFunction adds the ability to work with timers, which you should use to clear state for expired keys, if that's relevant.

There's an exercise on the Flink training site about different approaches for implementing such joins: Enrichment Joins. For a simpler example, see also the exercise about Expiring State.

Each recent release of Flink has included additional built-in join functions, so at this point it is less often necessary to roll your own. See the pages on joining with the DataStream API, joins with the Table API, and joins in SQL for more details.