Hi! Does anyone have experience with joining dataframes in Kedro and handling updates to the underlying dataframes?
I am doing a stream-batch join, and I want to ensure that any updates to the batch dataframe get propagated into my sink containing the joined data. The way I would want to solve this is to have a separate node that takes my batch data as input and merges it into my sink at set intervals. However, in Kedro it is not possible to have two nodes outputting to the same dataset. Is there a different way to handle this? I thought about creating two instances of the dataset in the data catalog, which might sidestep Kedro's restriction on several nodes outputting to the same dataset, but I don't know if it would be a good solution.
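If the point of the duplicate entries is to give each writing node its own name for the same sink, this is roughly what I had in mind in the catalog (entry names and the filepath are made up; I'm assuming kedro-datasets' `spark.SparkDataset` with Delta format):

```yaml
# catalog.yml — two entries pointing at the same Delta table on disk,
# so the join node and the merge node each get their own output name.
# Entry names and filepath are hypothetical.
joined_sink:
  type: spark.SparkDataset
  filepath: data/03_primary/joined_sink
  file_format: delta

joined_sink_for_merge:
  type: spark.SparkDataset
  filepath: data/03_primary/joined_sink
  file_format: delta
```

As far as I can tell, Kedro enforces the one-writer rule per catalog entry name rather than per file path, so this should run; the obvious downside is that the pipeline DAG would no longer show that both nodes touch the same table.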
To summarize:
• I have a node that takes a streaming dataframe and a batch dataframe as input
• The result is written to a sink (format: Delta table)
• I want my sink to reflect any updates to both data sources after the stream has started.
• As of now, if there are any changes in the batch data, rows that already exist in the sink are not updated (see the merge sketch after this list).
• Also, I want to handle changes no matter when they arrive, so windowing is not an option.
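For context, this is roughly the upsert I would want that separate node to perform, so that existing sink rows do get updated (a sketch only; it assumes Spark with delta-spark installed, and the sink path and key column `id` are placeholders):

```python
# Rough sketch of the periodic merge node — assumes Spark + delta-spark,
# with a made-up sink path and join key. Adapt to the real schema.
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def merge_batch_into_sink(batch_df: DataFrame) -> None:
    """Upsert batch rows into the Delta sink so already-joined rows get updated too."""
    spark = SparkSession.builder.getOrCreate()
    sink = DeltaTable.forPath(spark, "data/03_primary/joined_sink")  # hypothetical path
    (
        sink.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")  # "id" is a placeholder key
        .whenMatchedUpdateAll()      # propagate changes to rows already in the sink
        .whenNotMatchedInsertAll()   # insert batch rows the join hasn't emitted yet
        .execute()
    )
```

Run at set intervals, that should keep the sink in sync whenever the batch side changes, without any windowing.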
Any input will be appreciated 🙂