#questions

Flavien

02/12/2024, 2:38 PM
Hello fellows, has anyone had the opportunity to design a pipeline using `pyspark` that saves its final results to two different datasets? For example, I perform some calculations and want to save/materialize the resulting `DataFrame` in a Delta table and a SQL table (or anything else). We tried this while running our pipeline on `databricks`, and our naive approach (one node returning a tuple containing the same `DataFrame` twice, mapped to two different datasets) failed, as it seems the computation was performed twice. We therefore added an intermediate node that calls `.cache()` before distributing the result. I am curious to know if you would have alternative implementations. Thanks in advance!

Mark Druffel

02/12/2024, 11:23 PM
Sorry if I'm not understanding your goal, but I think what you're looking for is ManagedTableDataset
Unless you just want to run createOrReplaceTempView every time you write and for some reason don't want to use managed tables... I actually looked into that approach before I saw ManagedTableDataset. I think the most reasonable approach is to create another node, as you said you are doing, but you could probably write hooks to do it. I'm just not sure it would make sense, because your temp tables wouldn't be nodes in Kedro and therefore couldn't be used as inputs in the pipeline 🤷

Flavien

02/13/2024, 1:29 PM
Hi @Mark Druffel, thanks for the reply! I am aware of `ManagedTableDataset`, and it is one of the outputs I use. I am trying to materialize the data into both a `ManagedTableDataset` and other datasets without performing the calculations several times.