# questions
z
Hi team. Has anyone explored using Kedro with Spark Structured Streaming? What worked/did not work?
👀 1
n
@Deepyaman Datta did some PoC with Kafka 2+ years ago https://github.com/deepyaman/kedro-streaming
z
Thanks @Nok Lam Chan. @Deepyaman Datta any high-level findings? I saw that you defined new datasets for streaming. Does the whole streaming-with-Kedro experience feel "natural"?
d
Off the top of my head, the dataset was a pretty straightforward implementation (it's definitely still in a hacked-together state from the hackathon Nikhil and I built it in; see the GitHub issues for some of the stuff we had down as next steps). However, as for being "natural", Kedro is very batch-oriented. One of the challenges with Kedro as-is that I think people (@datajoely?) highlighted is that you have nodes that don't terminate.
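Roughly, such a dataset boils down to something like this. A hedged sketch, assuming pyspark and Kedro's `AbstractDataset` base class; the class name, options, and paths are illustrative, not the repo's actual code, and file sources also need a schema or `spark.sql.streaming.schemaInference` enabled:

```python
from pyspark.sql import DataFrame, SparkSession

from kedro.io import AbstractDataset  # AbstractDataSet on older Kedro versions


class SparkStreamingDataset(AbstractDataset[DataFrame, DataFrame]):
    """Hypothetical streaming dataset: readStream on load, writeStream on save."""

    def __init__(self, filepath: str, file_format: str = "json", save_args: dict = None):
        self._filepath = filepath
        self._file_format = file_format
        self._save_args = save_args or {}

    def _load(self) -> DataFrame:
        # Returns an unbounded DataFrame; downstream nodes apply ordinary
        # Spark transformations to it, same as in batch mode.
        spark = SparkSession.builder.getOrCreate()
        return spark.readStream.format(self._file_format).load(self._filepath)

    def _save(self, data: DataFrame) -> None:
        # Starting (and waiting on) the streaming query is what makes the
        # sink node run forever -- the "nodes that don't terminate" problem.
        query = (
            data.writeStream.format(self._file_format)
            .options(**self._save_args)  # e.g. checkpointLocation
            .start(self._filepath)
        )
        query.awaitTermination()

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "file_format": self._file_format}
```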
What kind of use case is this for, just curious?
d
Yeah, I don't think Kedro can work with reading from Structured Streaming, but you could consider triggering a Kedro pipeline from each streaming event
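If you did go that route, the natural hook would be Spark's `foreachBatch`. A rough sketch, assuming a standard Kedro project layout; the project path, schema, and pipeline name are made up, and wiring `batch_df` into the pipeline's catalog is omitted:

```python
from pyspark.sql import DataFrame, SparkSession

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/path/to/kedro-project"  # illustrative
bootstrap_project(PROJECT_PATH)


def run_pipeline_on_batch(batch_df: DataFrame, batch_id: int) -> None:
    # A fresh Kedro session per micro-batch: simple, but the startup cost
    # is paid on every trigger (the overhead discussed below).
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(pipeline_name="events")


spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.schema("id INT, payload STRING").json("/tmp/events")
stream.writeStream.foreachBatch(run_pipeline_on_batch).start().awaitTermination()
```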
d
Triggering a regular pipeline per event seems slow and expensive for any sort of real-time use case. I think you do want a stream processing framework in that case.
z
Thanks @Deepyaman Datta @datajoely.
• Our use case is indeed a streaming use case. We want to use Spark Structured Streaming and thus the same pipeline/transformations for batch and stream. We structured our batch pipelines in Kedro, so the most straightforward way is to swap out the source and sink (like @Deepyaman Datta did in the GitHub repo; see the sketch below).
• Agreed that we would rather have a non-stopping Kedro pipeline than reinitialise Kedro at each event trigger. A streaming pipeline naturally runs forever, so it's fine from that aspect. @datajoely Does Kedro behave weirdly with long-running/non-stopping pipelines?
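To make the first bullet concrete: a minimal sketch of the source/sink swap, assuming the nodes are pure DataFrame transformations (all names made up):

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

from kedro.pipeline import Pipeline, node


def clean_events(events: DataFrame) -> DataFrame:
    # No .write/.writeStream here, so the node doesn't care whether
    # "events" came from spark.read (batch) or spark.readStream (streaming).
    return events.filter(F.col("status") == "ok").withColumn(
        "processed_at", F.current_timestamp()
    )


def create_pipeline() -> Pipeline:
    # Pointing "events"/"clean_events" at batch vs. streaming datasets in
    # the catalog (e.g. per Kedro environment) reuses this pipeline as-is.
    return Pipeline([node(clean_events, inputs="events", outputs="clean_events")])
```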
d
@Zirui Xu IIRC in Kedro your sink nodes would all have to be terminal, since the DAG doesn't progress otherwise, and you'd need to use some sort of multi-process runner so you can have multiple sinks
šŸ‘ 1
z
For our use case, Kedro only needs to compose the transformations into a pipeline. I would even go one step further than the hackathon repo by removing sinks for the intermediate/primary layers (sketched below). IMO round trips to disk are never really needed in Spark pipelines; they're more a leftover from pandas-based pipelines
šŸ‘ 1
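Keeping intermediates in memory looks roughly like this in Kedro (a hedged sketch; the dataset name is made up, and `MemoryDataset` may be `MemoryDataSet` on older Kedro versions):

```python
from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog(
    {
        # Intermediate/primary outputs stay as in-memory (lazy) Spark plans;
        # copy_mode="assign" hands the DataFrame over without copying it.
        "clean_events": MemoryDataset(copy_mode="assign"),
    }
)
```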
That's a good point @Deepyaman Datta. Was the save still blocking even if you called `.awaitTermination()`?
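(For what it's worth, `start()` itself returns immediately; it's `.awaitTermination()` that blocks the calling thread. A standalone sketch with made-up paths and schema:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# File sources need an explicit schema (or schema inference enabled).
df = spark.readStream.schema("id INT, status STRING").json("/tmp/in")

# start() is non-blocking: it launches the query and returns a
# StreamingQuery handle right away.
query = (
    df.writeStream.format("parquet")
    .option("checkpointLocation", "/tmp/chk")
    .start("/tmp/out")
)

# awaitTermination() is the blocking call: the thread parks here until the
# query stops or fails, so a dataset save that calls it never returns.
query.awaitTermination()
```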
d
It's been so long, I don't remember the details. 😅 You can probably find an internal presentation in a DE Coffee.