#questions

Zirui Xu

11/02/2022, 4:28 PM
Hi team. Has anyone explored using Kedro with Spark Structured Streaming? What worked/did not work?
👀 1

Nok Lam Chan

11/02/2022, 7:24 PM
@Deepyaman Datta did a PoC with Kafka 2+ years ago: https://github.com/deepyaman/kedro-streaming

Zirui Xu

11/02/2022, 7:28 PM
Thanks @Nok Lam Chan. @Deepyaman Datta any high-level findings? I saw that you defined new datasets for streaming. Does the whole streaming with Kedro experience feel “natural”?

Deepyaman Datta

11/03/2022, 1:04 PM
Off the top of my head, the dataset was a pretty straightforward implementation (it's definitely still in a hacked-together state from the hackathon Nikhil and I built it in; see the GitHub issues for some of the stuff that we had down as next steps). However, as for being "natural", Kedro is very batch-oriented. One of the challenges as-is, which I think people (@datajoely?) highlighted, is that you have nodes that don't terminate.
What kind of use case is this for, just curious?

datajoely

11/03/2022, 1:07 PM
Yeah I don’t think Kedro can work with reading from structured streaming, but you could consider triggering a kedro pipeline from each streaming event

Deepyaman Datta

11/03/2022, 1:26 PM
Triggering a regular pipeline on each event seems slow and expensive for any sort of real-time use case. I think you do want a stream processing framework in that case.

Zirui Xu

11/03/2022, 1:42 PM
Thanks @Deepyaman Datta @datajoely.
• Our use case is indeed a streaming use case. We want to use Spark structured streaming and thus the same pipeline/transformation for batch and stream. We structured our batch pipelines in kedro so the most straightforward way is to swap out the source and sink (like @Deepyaman Datta did in the github repo).
• Agreed that we would rather have a non-stopping kedro pipeline instead of reinitialising Kedro things at each event trigger. A streaming pipeline naturally runs forever so it’s fine from that aspect.
@datajoely Does Kedro behave weirdly with long-running/non-stopping pipelines?
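To make the "swap out the source and sink" idea concrete, here is a minimal sketch in plain PySpark terms, not the hackathon repo's code. It assumes the shared logic only uses DataFrame methods available on both batch and streaming DataFrames. The `transform`, `run_batch` and `run_stream` names, the `value` column and `out_path` are all made up for illustration.

```python
"""Sketch: one transformation shared by a batch and a streaming Spark job."""


def transform(df):
    """Shared logic (hypothetical). Uses SQL expression strings so the same
    function works on a batch DataFrame and on a streaming DataFrame."""
    return df.filter("value % 2 = 0").selectExpr("value", "value * 10 AS scaled")


def run_batch(spark, out_path):
    """Batch wiring: bounded source, file sink. `out_path` is an assumption."""
    source = spark.range(100).withColumnRenamed("id", "value")
    transform(source).write.mode("overwrite").parquet(out_path)


def run_stream(spark):
    """Streaming wiring: same transform, unbounded source and sink.

    The built-in "rate" test source emits `timestamp` and `value` columns.
    """
    source = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    query = (
        transform(source)
        .writeStream.format("console")
        .outputMode("append")
        .start()
    )
    # start() returns a StreamingQuery immediately; the caller decides
    # whether to block on it.
    return query
```

In a Kedro project the two wiring functions would roughly correspond to catalog entries (source and sink datasets), while `transform` would be the node function that stays identical across both.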

Deepyaman Datta

11/03/2022, 1:45 PM
@Zirui Xu IIRC in Kedro your sink nodes would all have to be terminal, since the DAG doesn't progress otherwise, and you'd need to use some sort of multi-process runner so you can have multiple sinks
👍 1

Zirui Xu

11/03/2022, 1:47 PM
For our use case, Kedro only needs to compose the transformations into a pipeline. I would even go one step further than the hackathon repo by removing sinks for intermediate/primary. IMO round trips to disk are never really needed in Spark pipelines; they are more a leftover from pandas-based pipelines.
👍 1
That’s a good point @Deepyaman Datta. Was the save still blocking even if you .awaitTermination?
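For context on the blocking question, a rough sketch of how a streaming save step behaves in Spark. It is not taken from the hackathon repo: `StreamingSink` is a hypothetical stand-in for a dataset's save logic, and the `block` and `timeout` parameters are assumptions. The Spark calls used are `DataStreamWriter.start()`, which returns a `StreamingQuery` without waiting, and `StreamingQuery.awaitTermination()`, which blocks until the query stops or the optional timeout (in seconds) elapses.

```python
class StreamingSink:
    """Hypothetical stand-in for the save step of a streaming dataset."""

    def __init__(self, fmt="console", block=False, timeout=None):
        self.fmt = fmt          # sink format passed to writeStream.format()
        self.block = block      # whether save() should wait on the query
        self.timeout = timeout  # seconds to wait; None means wait indefinitely
        self.query = None

    def save(self, df):
        # start() launches the query in the background and returns right away,
        # so on its own it does not keep the calling node from finishing.
        self.query = df.writeStream.format(self.fmt).start()
        if self.block:
            # awaitTermination() blocks the caller until the query stops
            # (or the timeout passes), so a node that calls it with no
            # timeout would not return while the stream is running.
            self.query.awaitTermination(self.timeout)
        return self.query
```

With `block=False` the node returns while the stream keeps running in the driver process, so something else (for example `spark.streams.awaitAnyTermination()` at the end of the run) would have to keep that process alive.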

Deepyaman Datta

11/03/2022, 2:05 PM
It's been so long, I don't remember the details. 😅 You can probably find an internal presentation in a DE Coffee.