Hi all!
I am working with a clustering pipeline that I want to rerun regularly to monitor cluster migrations. I am using SnowparkTableDataset to save data directly to the Snowflake data warehouse. Since Kedro does not allow the same dataset to be both an input and an output, I was wondering what the best practice would be for rerunning the clustering and storing the results to the same SnowparkTableDataset, for example under a different timestamp each run. Would appreciate your help here!
👀 1
Ravi Kumar Pilla
10/14/2024, 8:57 PM
Hi @Thomas d'Hooghe, From your use case, I found PartitionedDataset and IncrementalDataset to be helpful. If you haven't tried already, please check the docs here - https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html .
Also, if your clustering pipeline runs on the entire dataset and you want to keep different versions of the output, you can try dataset versioning in the catalog. Thank you
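As a rough illustration (untested sketch; these are file-based datasets with placeholder names, so it may not map one-to-one onto a Snowflake table):

# conf/base/catalog.yml -- hypothetical entries for illustration only

# Versioned dataset: each run saves a new timestamped copy instead of overwriting
clusters_snapshot:
  type: pandas.ParquetDataset
  filepath: data/07_model_output/clusters_snapshot.parquet
  versioned: true

# IncrementalDataset: each run only loads partitions it has not processed before
cluster_runs:
  type: partitions.IncrementalDataset
  path: data/07_model_output/cluster_runs
  dataset: pandas.ParquetDataset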
Thomas d'Hooghe
10/14/2024, 9:12 PM
Hi Ravi, thank you for your quick response! That looks promising indeed. Any chance you have already tested whether this works with a SnowparkTableDataset? Regarding versioning: I thought that, with or without versioning, it is not possible to have the same dataset as both input and output. Are you saying that versioning lifts this constraint?
Ravi Kumar Pilla
10/14/2024, 9:18 PM
Oh yes, I think Kedro does not allow the same dataset to be both an input and an output. I haven't tried incremental datasets before. Also, I was wondering if I understood your question correctly -
1. You have a pipeline which has a node that takes in dataset x -> dataset x ? or
2. You have a pipeline which has a node that takes in dataset x -> dataset x_with_timestamp ? and then the next iteration would take dataset x_with_timestamp as input
Thomas d'Hooghe
10/14/2024, 9:53 PM
I think both would work, but the latter would be a bit cleaner. I am also wondering what the community thinks the best solution would be in this case :)
marrrcin
10/15/2024, 7:16 AM
You cannot have the same input and output dataset, but two differently named data catalog entries can point to the same underlying resource (file/database etc.).
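For example, something like this in catalog.yml (rough untested sketch; table, schema and credential names are placeholders):

# two catalog entries backed by the same Snowflake table
clusters_input:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_creds

clusters_output:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_creds
  save_args:
    mode: append   # or overwrite, depending on whether you want to keep history

A node can then take clusters_input and return clusters_output, so each run reads and writes the same underlying table without Kedro complaining about identical input/output datasets.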