# questions
j
Hi, are there any plans to make Kedro natively built for streaming (Spark Streaming for reading, writing, deleting, and merging streaming data) without using custom nodes and hooks?
j
hi @Jamal Sealiti! how do you solve the lack of streaming support at the moment? or, put another way, what do you mean by "custom nodes and hooks"?
j
custom datasets
the disadvantage is that I have to write a lot of Spark logic to handle streaming
As it is, Kedro seems more batch-oriented to me
j
at the moment yes. but it's a question that comes up from time to time, so I collected some earlier examples of that https://github.com/kedro-org/kedro/discussions/4754
👍 1
paging @Deepyaman Datta, he's been interested in this for a while
j
So are there plans in the near future to further develop Kedro into a version that is more streaming-oriented?
j
unclear if in the "near future", but we do want to work on having more flexible I/O for Data Engineering pipelines and streaming would fall within this. we're talking about ~months
🥳 1
j
Good to hear 🙂 I'm looking forward to getting this part in place. I have compared Kedro with Beam, dbt, and other frameworks, and I actually liked Kedro; once streaming handling is in place, Kedro will be the perfect framework for my purposes
K 1
d
@Jamal Sealiti Do you have much existing experience with streaming? Have you tried the approach in https://kedro.org/blog/kedro-dataset-for-spark-structured-streaming? I'd be curious about your thoughts on the gaps.
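For reference, wiring the `spark.SparkStreamingDataset` from that blog post into a project is mostly a catalog-level change. A rough sketch of a catalog entry (dataset name, path, and format here are illustrative, not from the thread):

```yaml
# conf/base/catalog.yml -- hypothetical entry using kedro-datasets'
# spark.SparkStreamingDataset; the name and filepath are placeholders
raw_events_stream:
  type: spark.SparkStreamingDataset
  filepath: data/01_raw/events_stream/
  file_format: json
```

Nodes reading this dataset then receive a streaming Spark DataFrame instead of a batch one.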
So are there plans in the near future to further develop Kedro into a version that is more streaming-oriented?
I'd say, again, this depends on what it entails/what the current gaps are. IMO the biggest issue with Kedro for streaming isn't defining the logic (if you can write it in Python, you can shoehorn it into Kedro 😉 to an extent); the bigger issue is that `kedro run` (or, more generally, running at a point in time) doesn't make sense for streaming, and you need to "deploy" the streaming application. In that regard, there's been little work so far, partially because there also haven't been users looking to do true streaming work. Realistically, what you would do would probably be quite similar to what https://github.com/getindata/dbt-flink-adapter does for dbt. Another powerful approach with Kedro could be to use Ibis (with either a Flink or Spark Streaming backend), if you don't specifically want Spark Streaming.
👍 1
j
@Deepyaman Datta Thank you! What I meant was that it would be nice if Kedro handled streaming without needing a lot of Spark/Flink logic for reading, writing, merging, deleting, decoding Parquet files, schema validation, etc. For example, I had to create a custom dataset that reads from Kafka.
d
Ah, I understand
So are there plans in the near future to further develop Kedro into a version that is more streaming-oriented?
In that case, I highly doubt this would happen in the near future, unless it's driven by the community; Kedro is a fairly unopinionated structurer of Python code, and this works fairly well for batch workflows. I agree streaming requires more things to be built in to work, and this could be a plugin or something, but I don't know how much demand there has been for this to be prioritized by the core team at this point.
👍 1