Hello everyone I have a question about the use of pandas Del Kedro #questions

Júlio Resende

02/28/2024, 3:11 PM

Hello everyone! I have a question about the use of pandas.DeltaTableDataset. The link below says that it is possible to "overwrite a specific partition by using mode=overwrite together with partition_filters". Is it possible to define which partition to overwrite on each execution? E.g.: my pipeline can be applied to data from different places, but each execution processes the data from only one place at a time. I want to partition the data by place and overwrite only the data from the place being processed. https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0/api/kedro_datasets.pandas.DeltaTableDataset.html

Juan Luis

02/28/2024, 3:21 PM

hi Júlio, I think that sentence is copied from https://delta-io.github.io/delta-rs/python/usage.html#overwriting-a-partition

Juan Luis

02/28/2024, 3:21 PM

does that example solve your use case?

Júlio Resende

02/28/2024, 3:24 PM

Hi @Juan Luis. Thanks for your answer! It doesn't solve my case, because I need to go through the data catalog.

Júlio Resende

02/28/2024, 3:26 PM

Is it possible to use Kedro params within the data catalog to decide which partition to overwrite?

Juan Luis

02/28/2024, 4:04 PM

you should be able to use runtime params or globals, see https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-global-variables-with-the-omegaconfigloader. For example, in this case:

ds:
  type: pandas.DeltaTableDataset
  save_args:
    mode: overwrite
    partition_filters: "${globals:partition_filters}"

and then add your partition_filters to conf/base/globals.yml

Juan Luis

02/28/2024, 4:04 PM

(thanks @Nok Lam Chan for the solution!)

Júlio Resende

02/28/2024, 9:17 PM

Thanks @Juan Luis! Using globals is going to work for me. But I couldn't get this working through the Kedro data catalog: the Delta API says that partition_filters takes a list of tuples. How can I write that in a YAML file? I have the same question for the table schema, which expects a pyarrow.Schema.

Juan Luis

02/28/2024, 10:44 PM

hmmm getting closer. what about a custom resolver? https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-resolvers-in-the-omegaconfigloader

👍🏼 1

Nok Lam Chan

02/29/2024, 12:28 PM

A custom resolver is the way to go for non-primitive types
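
For readers landing on this thread, a minimal sketch of what such a resolver could look like. The thread does not show the code that was actually used, so the resolver name `partition_filter`, the `place` key and the catalog line in the comment are hypothetical. The only Kedro-specific part is the `custom_resolvers` entry of `CONFIG_LOADER_ARGS` in `settings.py`, described in the resolver docs linked above.

```python
# src/<package_name>/settings.py (sketch, names below are illustrative)
from typing import Any, List, Tuple


def partition_filter(column: str, value: Any) -> List[Tuple[str, str, str]]:
    """Build a single equality filter in the list-of-tuples shape
    that the Delta API expects for partition_filters."""
    # Partition values are compared as strings, so cast defensively.
    return [(str(column), "=", str(value))]


# Register the resolver so it can be referenced from catalog.yml.
# Hypothetical catalog usage, assuming `place` is defined in globals.yml:
#   partition_filters: "${partition_filter:place,${globals:place}}"
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"partition_filter": partition_filter},
}
```

OmegaConf may wrap or convert container values returned by a resolver, so it is worth checking in a debugger that tuples, and not plain lists, actually reach `save_args`. The same registration pattern could presumably be used for a resolver that builds a `pyarrow.Schema`.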

Júlio Resende

02/29/2024, 4:20 PM

Yay, that worked great! Thanks @Juan Luis and @Nok Lam Chan

🥳 2

🙌🏼 1

Júlio Resende

02/29/2024, 4:34 PM

I had some problems when my nodes were returning a dataframe. E.g.: "name 'column_name' present in the specified schema is not found in the columns or index", even though column_name was defined as nullable in the specified schema. I also had some problems related to the __index_level_0__ column when no schema was specified (see this issue). Returning pyarrow.Table.from_pandas(df) from the node fixed all these problems. Perhaps this function could be embedded into pandas.DeltaTableDataset in the next release of kedro-datasets?

Juan Luis

02/29/2024, 4:35 PM

hmm interesting, if you could describe those in https://github.com/kedro-org/kedro/issues/ we'd appreciate it!

Júlio Resende

02/29/2024, 4:45 PM

Would it be a feature request or a bug report?

Juan Luis

02/29/2024, 4:52 PM

hah, difficult to say 🙂 but choose any template if you're unsure, we can adapt later

👍 1

Júlio Resende

02/29/2024, 5:15 PM

https://github.com/kedro-org/kedro/issues/3666

👍🏼 1
