Hi all!
I am working with a clustering pipeline that I want to rerun regularly to monitor cluster migrations. I am using SnowparkTableDataset to save data directly to the Snowflake data warehouse. Since Kedro does not allow the same dataset to be both an input and an output, I was wondering what the best practice would be for rerunning the clustering and storing the results to the same SnowparkTableDataset, for example under a different timestamp each run. Would appreciate your help here!
👀 1
Ravi Kumar Pilla
10/14/2024, 8:57 PM
Hi @Thomas d'Hooghe, From your use case, I found PartitionedDataset and IncrementalDataset to be helpful. If you haven't tried already, please check the docs here - https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html .
Also, if your clustering pipeline runs on the entire dataset and you want to keep different versions of the output, you can try dataset versioning in the catalog. Thank you
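As a rough illustration (untested sketch; these are file-based datasets with placeholder names, so it may not map one-to-one onto a Snowflake table):

# conf/base/catalog.yml -- hypothetical entries for illustration only

# Versioned dataset: each run saves a new timestamped copy instead of overwriting
clusters_snapshot:
  type: pandas.ParquetDataset
  filepath: data/07_model_output/clusters_snapshot.parquet
  versioned: true

# IncrementalDataset: each run only loads partitions it has not processed before
cluster_runs:
  type: partitions.IncrementalDataset
  path: data/07_model_output/cluster_runs
  dataset: pandas.ParquetDataset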
Thomas d'Hooghe
10/14/2024, 9:12 PM
Hi Ravi, thank you for your quick response! That looks promising indeed. Any chance you have already tested whether this works with a SnowparkTableDataset? Regarding versioning: I thought that, with or without versioning, it is not possible to have the same dataset as both input and output. Are you saying that versioning lifts this constraint?
Ravi Kumar Pilla
10/14/2024, 9:18 PM
Oh yes, I think Kedro does not allow the same dataset to be both an input and an output. I haven't tried incremental datasets before. Also, I was wondering if I understood your question correctly -
1. You have a pipeline which has a node that takes in dataset x -> dataset x ? or
2. You have a pipeline which has a node that takes in dataset x -> dataset x_with_timestamp ? and then the next iteration would take dataset x_with_timestamp as input
Thomas d'Hooghe
10/14/2024, 9:53 PM
I think both would work, but the latter would be a bit cleaner. I am also wondering what the community thinks the best solution would be in this case :)
marrrcin
10/15/2024, 7:16 AM
You cannot have the same input and output dataset, but two differently named data catalog entries can point to the same underlying resource (file/database etc.).
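For example, something like this in catalog.yml (rough untested sketch; table, schema and credential names are placeholders):

# two catalog entries backed by the same Snowflake table
clusters_input:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_creds

clusters_output:
  type: snowflake.SnowparkTableDataset
  table_name: CLUSTERS
  database: ANALYTICS
  schema: PUBLIC
  credentials: snowflake_creds
  save_args:
    mode: append   # or overwrite, depending on whether you want to keep history

A node can then take clusters_input and return clusters_output, so each run reads and writes the same underlying table without Kedro complaining about identical input/output datasets.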