Hey I have a question about `databricks ManagedTableDataset` Kedro #questions

Simon Wolf

11/08/2023, 10:44 AM

Hey, I have a question about `databricks.ManagedTableDataset`. I have a `PartitionedDataset` in which about 30 files are stored. In a node I read this `PartitionedDataset`, process it, and now I want to create a new table in Databricks for each original file. So the output of the node should be a `PartitionedDataset` of the type `databricks.ManagedTableDataset`. Unfortunately, I can't do this because `ManagedTableDataset` has no filename etc. Does anyone have an idea how I can still realize this?

datajoely

11/08/2023, 10:54 AM

the PartitionedDataSet is designed around a filepath

datajoely

11/08/2023, 10:55 AM

so I think you would need to implement your own version/subclass if you needed this abstraction

Simon Wolf

11/08/2023, 11:33 AM

ah okay i was afraid of that :D But thank you anyway for the quick response

datajoely

11/08/2023, 11:33 AM

I think we may actually have a better answer for you - this has kicked off a discussion on the dev channel

Simon Wolf

11/08/2023, 11:39 AM

Oh exciting🤩, I can't access the dev channel, can I?

marrrcin

11/08/2023, 11:39 AM

Hi @Simon Wolf - we've discussed this internally and we recommend using a generator in the node to `yield` your tables (along with their names / other properties that you need to save them) and implementing a dataset that inherits from the original `databricks.ManagedTableDataset` to handle that. Pseudo code:

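# node:
def node(partitioned_dataset):
    # each value of a PartitionedDataset is a load function, so call it to get the data
    for partition, loader in partitioned_dataset.items():
        data = loader()
        # do your thing
        yield (partition, data)  # add more fields here if the save step needs them

# dataset:
from kedro_datasets.databricks import ManagedTableDataset

class YourDataSet(ManagedTableDataset):
    def _save(self, data_to_save: tuple):
        # must unpack the same fields, in the same order, that the node yields
        partition, data = data_to_save
        # handle data saving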
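
A minimal, self-contained sketch of the pattern above, for anyone who wants to see the moving parts without a Databricks workspace. It uses plain-Python stand-ins rather than real Kedro or Databricks classes: `InMemoryTableDataset`, `process_partitions`, and the sample partitions are all hypothetical names made up for illustration, and the final loop only imitates what Kedro is documented to do with a generator node (save each yielded chunk as it arrives).

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple


def process_partitions(
    partitioned_input: Dict[str, Callable[[], List[int]]],
) -> Iterator[Tuple[str, List[int]]]:
    """Generator node: yield one (table_name, data) pair per input partition."""
    for partition_id, load in partitioned_input.items():
        rows = load()  # partitions are loaded lazily via their load function
        processed = [value * 2 for value in rows]  # placeholder transformation
        yield partition_id, processed


class InMemoryTableDataset:
    """Hypothetical stand-in for a ManagedTableDataset subclass.

    A real subclass would write `data` to a Databricks table named after
    `table_name`; here the tables are kept in a dict so the sketch runs anywhere.
    """

    def __init__(self) -> None:
        self.tables: Dict[str, Any] = {}

    def _save(self, data_to_save: Tuple[str, Any]) -> None:
        # Unpack exactly what the node yields: the table name and its data.
        table_name, data = data_to_save
        self.tables[table_name] = data


# Imitate the runner: call save once for every chunk the generator yields.
partitions = {"file_a": lambda: [1, 2], "file_b": lambda: [3]}
dataset = InMemoryTableDataset()
for chunk in process_partitions(partitions):
    dataset._save(chunk)
```

The point of yielding the partition name together with the data is that a single output dataset can then decide, per chunk, which table to write to, which is what `PartitionedDataset` cannot do for a dataset that has no filepath.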

marrrcin

11/08/2023, 11:40 AM

More info here: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node

datajoely

11/08/2023, 11:41 AM

Also @Simon Wolf if you want to join the Technical Steering Committee you can be part of the dev channel 😛

Simon Wolf

11/08/2023, 11:52 AM

@datajoely Ah haha I see 😂 Thank you very much for discussing this. I hadn't seen the possibility of using nodes with a generator before but that looks super promising! At the moment the solution idea is not 100% clear to me but I will figure it out 👍

marrrcin

11/08/2023, 11:54 AM

Let me know if you need further assistance 😉
