Hello fellows I am wondering if some of you had the opportun Kedro #questions

Flavien

02/12/2024, 2:38 PM

Hello fellows, I am wondering whether any of you have had the opportunity to design a pipeline using `pyspark` with the purpose of saving the final results in two different datasets. For example, I perform some calculations and I want to save/materialize the resulting `DataFrame` in a delta table and a SQL table (or anything else). We tried to do so while running our pipeline on `databricks`, and our naive approach (one node returning a tuple of the same `DataFrame`, mapped to two different datasets) failed, as it seems that the computation was performed twice. We therefore chose to include an intermediate node which performs a `.cache()` before distributing the result. I am curious to know if you would have alternative implementations. Thanks in advance!
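For readers following along, here is a minimal sketch of what the intermediate caching node described above might look like. It is one possible reading of the workaround, not the actual code from this thread. The function names and the dataset names (`computed_result`, `cached_result`, `result_delta`, `result_sql`) are hypothetical, and the sketch assumes `cached_result` is not declared in the catalog, so Kedro keeps it in memory.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hints below
    from pyspark.sql import DataFrame


def cache_result(df: DataFrame) -> DataFrame:
    """Mark the DataFrame for caching so that later writes can reuse it."""
    # cache() is lazy: the data is stored when the first action runs
    # (here, the first save), and later actions can read from the cache.
    return df.cache()


def fan_out(df: DataFrame) -> tuple[DataFrame, DataFrame]:
    """Return the same (cached) DataFrame once per output dataset."""
    return df, df


def create_pipeline():
    # Imported here so the node functions above can be used without Kedro.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            # Hypothetical dataset names; "cached_result" is assumed to be
            # absent from the catalog, i.e. an in-memory dataset.
            node(cache_result, inputs="computed_result", outputs="cached_result"),
            node(
                fan_out,
                inputs="cached_result",
                outputs=["result_delta", "result_sql"],
            ),
        ]
    )
```

Whether the second write really reuses the cached data depends on the cluster (for example, on whether the data fits in memory or on disk), so this is worth checking in the Spark UI rather than taking for granted.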

Mark Druffel

02/12/2024, 11:23 PM

Sorry if I'm not understanding your goal, but I think what you're looking for is ManagedTableDataset

Mark Druffel

02/12/2024, 11:29 PM

Unless you just want to run createOrReplaceTempView every time you write and for some reason don't want to use managed tables... I actually looked into that approach before I saw ManagedTableDataset. I think the most reasonable approach is to create another node, as you said you are doing, but you could probably write hooks to do it. I'm just not sure it would make sense, because your temp tables wouldn't be nodes in kedro and therefore couldn't be used as inputs in the pipeline 🤷

Flavien

02/13/2024, 1:29 PM

Hi @Mark Druffel, thanks for the reply! I am aware of `ManagedTableDataset`, and it is one of the outputs I use. I am trying to materialize the data into both a `ManagedTableDataset` and other datasets without performing the calculations several times.
