# questions
z
Hi team. Has anyone explored using Kedro with Spark Structured Streaming? What worked/did not work?
👀 1
n
@Deepyaman Datta did some PoC with Kafka 2+ years ago https://github.com/deepyaman/kedro-streaming
z
Thanks @Nok Lam Chan. @Deepyaman Datta any high-level findings? I saw that you defined new datasets for streaming. Does the whole streaming-with-Kedro experience feel "natural"?
d
Off the top of my head, the dataset was a pretty straightforward implementation (it's definitely still in a hacked-together state from the hackathon Nikhil and I built it in; see the GitHub issues for some of the stuff we had down as next steps). However, as for being "natural", Kedro is very batch-oriented. One of the challenges with Kedro as-is that I think people (@datajoely?) highlighted is that you have nodes that don't terminate.
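Roughly, such a dataset boils down to something like this. A hedged sketch, assuming pyspark and Kedro's `AbstractDataset` base class; the class name, options, and paths are illustrative, not the repo's actual code, and file sources also need a schema or `spark.sql.streaming.schemaInference` enabled:

```python
from pyspark.sql import DataFrame, SparkSession

from kedro.io import AbstractDataset  # AbstractDataSet on older Kedro versions


class SparkStreamingDataset(AbstractDataset[DataFrame, DataFrame]):
    """Hypothetical streaming dataset: readStream on load, writeStream on save."""

    def __init__(self, filepath: str, file_format: str = "json", save_args: dict = None):
        self._filepath = filepath
        self._file_format = file_format
        self._save_args = save_args or {}

    def _load(self) -> DataFrame:
        # Returns an unbounded DataFrame; downstream nodes apply ordinary
        # Spark transformations to it, same as in batch mode.
        spark = SparkSession.builder.getOrCreate()
        return spark.readStream.format(self._file_format).load(self._filepath)

    def _save(self, data: DataFrame) -> None:
        # Starting (and waiting on) the streaming query is what makes the
        # sink node run forever -- the "nodes that don't terminate" problem.
        query = (
            data.writeStream.format(self._file_format)
            .options(**self._save_args)  # e.g. checkpointLocation
            .start(self._filepath)
        )
        query.awaitTermination()

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "file_format": self._file_format}
```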
What kind of use case is this for, just curious?
d
Yeah, I don't think Kedro can work with reading from Structured Streaming, but you could consider triggering a Kedro pipeline from each streaming event
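If you did go that route, the natural hook would be Spark's `foreachBatch`. A rough sketch, assuming a standard Kedro project layout; the project path, schema, and pipeline name are made up, and wiring `batch_df` into the pipeline's catalog is omitted:

```python
from pyspark.sql import DataFrame, SparkSession

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/path/to/kedro-project"  # illustrative
bootstrap_project(PROJECT_PATH)


def run_pipeline_on_batch(batch_df: DataFrame, batch_id: int) -> None:
    # A fresh Kedro session per micro-batch: simple, but the startup cost
    # is paid on every trigger (the overhead discussed below).
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(pipeline_name="events")


spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.schema("id INT, payload STRING").json("/tmp/events")
stream.writeStream.foreachBatch(run_pipeline_on_batch).start().awaitTermination()
```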
d
Triggering a regular pipeline per event seems slow and expensive for any sort of real-time use case. I think you do want a stream processing framework in that case.
z
Thanks @Deepyaman Datta @datajoely.
• Our use case is indeed a streaming use case. We want to use Spark Structured Streaming and thus the same pipeline/transformations for batch and stream. We structured our batch pipelines in Kedro, so the most straightforward way is to swap out the source and sink (like @Deepyaman Datta did in the GitHub repo; see the sketch below).
• Agreed that we would rather have a non-stopping Kedro pipeline than reinitialise Kedro at each event trigger. A streaming pipeline naturally runs forever, so it's fine from that aspect. @datajoely Does Kedro behave weirdly with long-running/non-stopping pipelines?
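To make the first bullet concrete: a minimal sketch of the source/sink swap, assuming the nodes are pure DataFrame transformations (all names made up):

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

from kedro.pipeline import Pipeline, node


def clean_events(events: DataFrame) -> DataFrame:
    # No .write/.writeStream here, so the node doesn't care whether
    # "events" came from spark.read (batch) or spark.readStream (streaming).
    return events.filter(F.col("status") == "ok").withColumn(
        "processed_at", F.current_timestamp()
    )


def create_pipeline() -> Pipeline:
    # Pointing "events"/"clean_events" at batch vs. streaming datasets in
    # the catalog (e.g. per Kedro environment) reuses this pipeline as-is.
    return Pipeline([node(clean_events, inputs="events", outputs="clean_events")])
```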
d
@Zirui Xu IIRC in Kedro your sink nodes would all have to be terminal, since the DAG doesn't progress otherwise, and you'd need to use some sort of multi-process runner so you can have multiple sinks
šŸ‘ 1
z
For our use case, Kedro only needs to compose the transformations into a pipeline. I would even go one step further than the hackathon repo by removing sinks for the intermediate/primary layers (sketched below). IMO round trips to disk are never really needed in Spark pipelines; they're more a leftover from pandas-based pipelines
šŸ‘ 1
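Keeping intermediates in memory looks roughly like this in Kedro (a hedged sketch; the dataset name is made up, and `MemoryDataset` may be `MemoryDataSet` on older Kedro versions):

```python
from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog(
    {
        # Intermediate/primary outputs stay as in-memory (lazy) Spark plans;
        # copy_mode="assign" hands the DataFrame over without copying it.
        "clean_events": MemoryDataset(copy_mode="assign"),
    }
)
```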
That's a good point @Deepyaman Datta. Was the save still blocking even if you called `.awaitTermination()`?
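(For what it's worth, `start()` itself returns immediately; it's `.awaitTermination()` that blocks the calling thread. A standalone sketch with made-up paths and schema:)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# File sources need an explicit schema (or schema inference enabled).
df = spark.readStream.schema("id INT, status STRING").json("/tmp/in")

# start() is non-blocking: it launches the query and returns a
# StreamingQuery handle right away.
query = (
    df.writeStream.format("parquet")
    .option("checkpointLocation", "/tmp/chk")
    .start("/tmp/out")
)

# awaitTermination() is the blocking call: the thread parks here until the
# query stops or fails, so a dataset save that calls it never returns.
query.awaitTermination()
```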
d
It's been so long, I don't remember the details. 😅 You can probably find an internal presentation in a DE Coffee.