Simon Wolf
11/08/2023, 10:44 AM
databricks.ManagedTableDataset. I have a PartitionedDataset in which about 30 files are stored. In a node I read this PartitionedDataset, process it, and now I want to create a new table in Databricks for each original file. So the output of the node should be a PartitionedDataset of the type databricks.ManagedTableDataset. Unfortunately, I can't do this because ManagedTableDataset has no filename etc. Does anyone have an idea how I can still achieve this?
datajoely
11/08/2023, 10:54 AM
Simon Wolf
11/08/2023, 11:33 AM
datajoely
11/08/2023, 11:33 AM
Simon Wolf
11/08/2023, 11:39 AM
marrrcin
11/08/2023, 11:39 AM
yield your tables (along with their names / other properties that you need to save them) and implementing a dataset inherited from the original databricks.ManagedTableDataset to handle that.
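Pseudo code:
# node:
def node(partitioned_dataset):
    # a loaded PartitionedDataset maps each partition name to a load function
    for partition, loader in partitioned_dataset.items():
        data = loader()
        # do your thing
        other_properties = {}  # anything else the dataset needs to save the table
        yield (partition, data, other_properties)

# dataset:
from kedro_datasets import databricks  # assumes the kedro-datasets package

class YourDataSet(databricks.ManagedTableDataset):
    def _save(self, data_to_save: tuple):
        # must unpack the same tuple shape that the node yields
        partition, data, other_properties = data_to_save
        # handle data saving
        ...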
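To try the yield-and-unpack idea without a Databricks workspace, here is a minimal sketch of the same pattern. Everything in it is an assumption for illustration: InMemoryTableDataset is a hypothetical stand-in and not the real databricks.ManagedTableDataset, _save_table is an invented helper, and the rule for turning a partition name into a table name is just one possible choice.

```python
from typing import Any, Callable, Dict, Iterator, Tuple


class InMemoryTableDataset:
    """Hypothetical stand-in for a table dataset (NOT the real Kedro class)."""

    def __init__(self) -> None:
        self.tables: Dict[str, Any] = {}

    def _save_table(self, table_name: str, data: Any) -> None:
        # A real dataset would write a DataFrame to a managed table here.
        self.tables[table_name] = data


class PerPartitionTableDataset(InMemoryTableDataset):
    """Saves each (partition, data) tuple to a table named after the partition."""

    def save(self, data_to_save: Tuple[str, Any]) -> None:
        partition, data = data_to_save
        # Assumed naming rule: make the partition id usable as a table name.
        table_name = partition.replace("/", "_").replace(".", "_")
        self._save_table(table_name, data)


def node(
    partitioned_dataset: Dict[str, Callable[[], Any]]
) -> Iterator[Tuple[str, Any]]:
    # Each value is a zero-argument load function, one per partition.
    for partition, loader in partitioned_dataset.items():
        data = loader()
        processed = [row.upper() for row in data]  # placeholder processing
        yield partition, processed


# Two fake partitions instead of the ~30 files from the question.
partitions = {
    "2023/a.csv": lambda: ["x", "y"],
    "2023/b.csv": lambda: ["z"],
}

dataset = PerPartitionTableDataset()
for chunk in node(partitions):  # mimics a runner saving each yielded chunk
    dataset.save(chunk)
```

The loop at the end plays the role of the runner: each yielded tuple is handed to save on its own, so the dataset decides the target table per chunk rather than once per node run.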
datajoely
11/08/2023, 11:41 AM
Simon Wolf
11/08/2023, 11:52 AM
marrrcin
11/08/2023, 11:54 AM