# questions
m
Hi everyone! 🙂 I’m facing an issue with Kedro-Viz. I have a node that performs a merge into a Delta Table. In this node, I pass two inputs:
• the dataframe to be inserted, and
• the destination Delta Table itself.
Inside the node, I execute the merge logic directly. The problem is that Kedro-Viz treats the Delta Table as an input, whereas I’d like it to be represented as the output after the merge, so that the lineage is clearer and reflects the actual data flow. Is there a way to indicate which dataset is the true input and which one should be considered the final output in this kind of use case? Thanks for your help! 🙏
r
Hi @Mohamed El Guendouz, if I understand correctly, your input and output datasets are the same? This looks like an anti-pattern: a node should be a pure Python function, and in-place mutations are discouraged. I would suggest adding a destination dataset that points to the same location under a different catalog entry. That way the node does not modify its inputs. Let me know if it works. Thank you.
m
@Ravi Kumar Pilla Actually, the existing Dataset for managing Delta Tables only supports reading, not updating. The merge logic isn’t handled by the Dataset at all. 😞 So I’m forced to perform the merge myself inside a Python function rather than having it managed at the Dataset level. As a result, the node doesn’t return a dataset as output: it returns either `None` or just a flag to confirm that everything worked correctly.
r
I did not quite understand this. I got that your node returns `None` or just a flag. If the existing dataset only supports reading, how are you merging into the Delta Table? Is this an issue with Kedro-Viz only, or do you have issues with `kedro run` too?
As far as I understand, your setup is like: ds1, ds2 -> func (ds1 = ds1 + ds2) -> None. Mutating datasets this way is not the recommended approach.
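To make the contrast concrete, here is a minimal sketch in plain Python that uses lists of dicts as hypothetical stand-ins for the Delta Table and the incoming dataframe. `merge_in_place` mirrors the `ds1, ds2 -> func -> None` shape, while `merge_pure` returns a new dataset that a node could declare as its output. The function names and the `id` key are assumptions for illustration, not Kedro or Delta APIs.

```python
from typing import Any

Row = dict[str, Any]


def merge_in_place(target: list[Row], source: list[Row]) -> None:
    """Upsert `source` into `target` by mutating `target`; returns nothing."""
    index = {row["id"]: pos for pos, row in enumerate(target)}
    for row in source:
        if row["id"] in index:
            target[index[row["id"]]] = dict(row)  # matched: update all columns
        else:
            index[row["id"]] = len(target)
            target.append(dict(row))  # not matched: insert


def merge_pure(target: list[Row], source: list[Row]) -> list[Row]:
    """Return a new upserted list and leave both inputs untouched."""
    merged = {row["id"]: dict(row) for row in target}
    for row in source:
        merged[row["id"]] = dict(row)  # source rows win, like an upsert
    return list(merged.values())


existing = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
incoming = [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}]

# The pure version produces a value that can be wired to a named output.
result = merge_pure(existing, incoming)
```

With the pure version, the pipeline graph has something to draw as an output; with the in-place version, the only visible edge is the input.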
m
@Ravi Kumar Pilla Thanks for your question! Let me clarify: The merge is done inside the node using the Python `delta` library (outside of Kedro’s Dataset abstraction). I load the existing Delta Table, run the merge logic programmatically, and then commit the changes. Since the current Dataset only supports reading, Kedro treats it purely as an input. That’s why the node technically returns either `None` or a simple flag: the actual update happens internally and is not returned as a Kedro-managed dataset. So there is no issue when running `kedro run`, it works fine. The concern is mainly with Kedro-Viz, because the lineage shows the Delta Table only as an input, while in reality it is also the updated output. Here’s a simplified example of what the node logic looks like:
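from delta.tables import DeltaTable
from pyspark.sql import DataFrame


def merge_into_delta(existing_table: DeltaTable, new_data_df: DataFrame) -> None:
    # The catalog entry is assumed to load `existing_table` as a DeltaTable,
    # so the node can call merge on it directly.
    (
        existing_table.alias("target")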
        .merge(
            new_data_df.alias("source"),
            "target.id = source.id"
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # No dataset returned, just return None or a flag
    return None
r
Even if it is an updated output, since it is not part of the node `outputs`, Kedro-Viz has no way of knowing that this node outputs something. Let me see if there is a workaround. For now, I think Kedro-Viz is working as expected, considering your node does an update in place and returns `None`.
m
Yes, I totally understand. The only issue is that, given the way the Dataset for Delta Tables was implemented, setting the table as an output would raise a `DatasetError` 😞
This is really a problem for our team, both in terms of pipeline design and for Kedro-Viz.
r
To show the lineage in Kedro-Viz, a workaround could be to add an output with a similar name (maybe a memory dataset). But let me think about whether there is a better solution.
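The workaround above could look roughly like the sketch below: the node still performs the merge as a side effect, but returns a small flag registered under a name resembling the table, so Kedro-Viz draws an outgoing edge. The dataset names (`orders_delta`, `new_orders`, `orders_delta_merged`) and the function names are hypothetical; an output that is not declared in the catalog is kept in memory by default. This only changes what the graph shows and does not make Kedro manage the write.

```python
def merge_and_flag(existing_table, new_data_df) -> bool:
    """Merge `new_data_df` into the Delta Table, then return a completion flag.

    Assumes `existing_table` is loaded as a `delta.tables.DeltaTable`.
    """
    (
        existing_table.alias("target")
        .merge(new_data_df.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
    return True  # stored under the output name declared in the pipeline


def create_pipeline(**kwargs):
    # Imported lazily so the node function can be unit-tested without Kedro.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            node(
                func=merge_and_flag,
                inputs=["orders_delta", "new_orders"],
                outputs="orders_delta_merged",  # hypothetical name shown in Kedro-Viz
                name="merge_into_orders_delta",
            )
        ]
    )
```

Downstream nodes that need the merged table can take `orders_delta_merged` as an input to enforce ordering, even though the flag itself carries no data.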
m
Would it be possible to evolve the Dataset to handle merge and write operations for Delta Tables? This would simplify the node design and make Kedro-Viz lineage more accurate.
r
Yes, I am looking at the Delta Tables code now. Could you open an issue describing the pain points? That way we can track and prioritize it in the upcoming sprints. Thank you.
m
Yes 👍 Bug report or feature request?
r
A feature request would be nice. I also see a related spike - https://github.com/kedro-org/kedro-plugins/issues/542
We will try to address these issues in upcoming sprints. Thanks for your patience
m
@Ravi Kumar Pilla I’ve created an issue: https://github.com/kedro-org/kedro-plugins/issues/1223 🙂
@Ravi Kumar Pilla Thank you for your help!
r
Hi @Mohamed El Guendouz, in the meantime you can also create a custom dataset with a save operation, something like this: https://github.com/kedro-org/kedro-plugins/issues/542#issuecomment-1981483776 Thank you.