Flavien (02/12/2024, 2:38 PM)
Is there a way, in pyspark, to save the final results in two different datasets? For example, I perform some calculations and I want to save/materialize the resulting DataFrame in both a Delta table and a SQL table (or anything else).
We tried to do this while running our pipeline on Databricks, and our naive approach (one node returning a tuple containing the same DataFrame twice, mapped to two different datasets) failed: the computation was performed twice. We therefore added an intermediate node that calls .cache() before fanning the result out.
I am curious to know if you would have alternative implementations. Thanks in advance!

Mark Druffel (02/12/2024, 11:23 PM)

Mark Druffel (02/12/2024, 11:29 PM)

Flavien (02/13/2024, 1:29 PM)
ManagedTableDataset, and it is one of the outputs I use. I am trying to materialize the data into both a ManagedTableDataset and other datasets without performing the calculations several times.
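The recomputation discussed in the thread comes from Spark's lazy evaluation: each action (such as a write) re-executes the plan unless the result is cached first. A minimal pure-Python sketch of that behavior, where `LazyFrame`, `plan`, and `save` are illustrative stand-ins and not real pyspark APIs:

```python
# Illustrative stand-in for a lazily evaluated DataFrame: each save()
# re-runs the plan unless the result was cached beforehand.
class LazyFrame:
    def __init__(self, plan):
        self.plan = plan          # deferred computation (a callable)
        self._cached = None       # filled in by cache()
        self.compute_count = 0    # how many times the plan actually ran

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        return self.plan()

    def cache(self):
        # Analogous in spirit to pyspark's DataFrame.cache():
        # compute once, reuse for every downstream action.
        self._cached = self._materialize()
        return self

    def save(self, sink):
        # Stand-in for a write to a Delta table, SQL table, etc.
        sink.append(self._materialize())


delta_sink, sql_sink = [], []

# Without caching: two saves trigger the plan twice.
df = LazyFrame(lambda: [x * x for x in range(5)])
df.save(delta_sink)
df.save(sql_sink)
print(df.compute_count)  # 2

# With an intermediate cache step (the approach described above):
# the plan runs once and both saves reuse the cached result.
df2 = LazyFrame(lambda: [x * x for x in range(5)]).cache()
df2.save(delta_sink)
df2.save(sql_sink)
print(df2.compute_count)  # 1
```

In Kedro terms, this corresponds to an intermediate node whose function simply returns `df.cache()`, with the cached output then consumed by the nodes (or catalog entries) that write to each destination.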