# questions
j
Hi! I am wondering if anyone has experience with joining dataframes in Kedro and handling updates to the underlying dataframes? I am doing a stream-batch join, and I want to ensure that any updates to the batch dataframe get propagated into my sink containing the joined data. The way I would want to solve this is to have a separate node that takes my batch data as input and merges it into my sink at set intervals. In Kedro it is not possible to have two nodes outputting to the same dataset. Is there a different way to handle this? I thought about creating two instances of the batch dataset in the data catalog, which might get around the restriction Kedro has on several nodes outputting to the same dataset, but I don't know if it would be a good solution. To summarize:
• I have a node that takes a streaming dataframe and a batch dataframe as input.
• The result is written to a sink (format: Delta table).
• I want my sink to reflect any updates to both data sources after the stream has started.
• As of now, if there are any changes in the batch data, rows already existing in the sink will not be updated.
• Also, I want to handle changes no matter when they arrive, so windowing is not an option.
Any input will be appreciated 🙂
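For illustration, a minimal sketch of the two-node idea described above, assuming hypothetical catalog entries `stream_data`, `batch_data`, `joined_sink`, and `joined_sink_alias` (the last two would both point at the same Delta table path, which is the workaround being asked about). All function and dataset names here are made up:

```python
# A sketch, not a confirmed Kedro pattern: one node does the stream-batch
# join, a second node periodically re-emits the batch data towards the same
# physical sink via a second catalog entry.
from kedro.pipeline import node, pipeline


def join_stream_with_batch(stream_df, batch_df):
    # Stream-batch join; the join key "id" is a placeholder.
    return stream_df.join(batch_df, on="id", how="left")


def merge_batch_updates(batch_df):
    # Re-emit the batch data so later updates reach the sink; the actual
    # upsert semantics would live in the sink dataset's save logic.
    return batch_df


def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                join_stream_with_batch,
                inputs=["stream_data", "batch_data"],
                outputs="joined_sink",
                name="join_node",
            ),
            node(
                merge_batch_updates,
                inputs="batch_data",
                outputs="joined_sink_alias",  # second catalog entry, same Delta path
                name="merge_node",
            ),
        ]
    )
```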
d
Hi Julie,
This sounds like an advanced use case - could you clarify a bit more about what streaming DF you're using and how you're integrating it within Kedro?
If I understand correctly, your Kedro node runs for a long duration while the streaming dataset is continuously updating, and that part already works well. The issue is that the batch dataset is only read once at the beginning of the node execution, so any later updates to it aren't reflected. That behaviour is due to how Kedro is designed - datasets are typically loaded once per node execution.
If you want to work around this, I see two possible options:
1. Emulate streaming behaviour for the batch dataset.
2. Bypass the catalog entirely for the batch dataset and perform the read operation directly within the node logic. This breaks the Kedro I/O pattern but gives you more flexibility.
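A rough sketch of what option 2 could look like, assuming the batch data sits in a Delta table whose path is passed to the node (for example as a parameter). The path, join key, and function name are hypothetical:

```python
# Read the batch Delta table inside the node instead of through the catalog,
# so the latest state of the batch source is picked up whenever the node
# logic runs. This trades Kedro's I/O abstraction for flexibility.
from pyspark.sql import DataFrame, SparkSession


def join_with_fresh_batch(stream_df: DataFrame, batch_path: str) -> DataFrame:
    spark = SparkSession.getActiveSession()
    # Re-read the batch table on every execution, bypassing the catalog.
    batch_df = spark.read.format("delta").load(batch_path)
    return stream_df.join(batch_df, on="id", how="left")
```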
j
Hi, thank you for responding. I have created a custom dataset that loads and saves data to a Delta table using Spark Structured Streaming. The stream has the trigger-once option set, so when the node loads the dataframe it should only process the data available at that point in one batch and then terminate. The node then joins any available data from the stream with the batch dataframe, which is loaded in full for each node execution. I think my problem is not in the node execution itself, but arises when the batch source is updated at some later point. Let's say the first node execution processes one row from the streaming dataframe, which is joined with some data from the batch dataframe, and the result is written to the sink. If at some later point the data coming from the batch source is updated and there are no changes in the streaming source, the row will not be reprocessed in the node (due to checkpointing) and therefore the changes will not be reflected in the sink. Does that make sense?
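For reference, a hedged sketch of the "separate merge node" idea using the delta-spark `DeltaTable` API: upsert the current state of the batch source into the joined sink, so rows that were already written get their batch-derived columns refreshed even when no new streaming data arrives. The sink path, join key, and column names below are placeholders, not taken from the setup described above:

```python
# Upsert the latest batch data into the joined Delta sink via MERGE.
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def merge_batch_into_sink(batch_df: DataFrame, sink_path: str) -> None:
    spark = SparkSession.getActiveSession()
    sink = DeltaTable.forPath(spark, sink_path)
    (
        sink.alias("s")
        .merge(batch_df.alias("b"), "s.id = b.id")
        # Only refresh the columns that originally came from the batch source;
        # "batch_value" stands in for those columns.
        .whenMatchedUpdate(set={"batch_value": "b.batch_value"})
        .execute()
    )
```

Run at set intervals (or as a scheduled Kedro pipeline run), this would keep the sink aligned with batch-side updates independently of the streaming checkpoint.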