# questions
j
Hello everyone! I have a question about the use of pandas.DeltaTableDataset. The link below says that it is possible to "overwrite a specific partition by using mode=overwrite together with partition_filters". Is it possible to define which partition to overwrite on each execution? E.g.: my pipeline can be applied to data from different places, but in each execution only the data from one place is processed. I want to partition the data by place and only overwrite the data from the place being processed. https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0/api/kedro_datasets.pandas.DeltaTableDataset.html
j
does that example solve your use case?
j
Hi @Juan Luis. Thanks for your answer! It doesn't solve it, because I need to use the data catalog.
Is it possible to use Kedro params within the data catalog to decide which partition to overwrite?
j
you should be able to use runtime params or globals, see https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-global-variables-with-the-omegaconfigloader. For example, in this case:
ds:
  type: pandas.DeltaTableDataset
  save_args:
    mode: overwrite
    partition_filters: "${globals:partition_filters}"
and then add your partition_filters in conf/base/globals.yml
(thanks @Nok Lam Chan for the solution!)
j
Thanks @Juan Luis! Using globals is going to work for me. But I couldn't get this working using the Kedro data catalog. The Delta API says that partition_filters takes a list of tuples. How can I write that in a YAML file? I have the same question for the table schema, which expects a pyarrow.Schema.
j
n
A custom resolver is the way to go for non-primitive types
j
Yay, that worked great! Thanks @Juan Luis and @Nok Lam Chan
🥳 2
🙌🏼 1
I had some problems when my nodes were returning a dataframe. E.g.: "name 'column_name' present in the specified schema is not found in the columns or index", even though column_name was defined as nullable in the specified schema. I also had some problems related to the __index_level_0__ column when no schema was specified (see this issue). Using pyarrow.Table.from_pandas(df) as the node return value fixed all these problems. Perhaps this function could be embedded into pandas.DeltaTableDataset in the next release of kedro-datasets?
j
hmm interesting, if you could describe those in https://github.com/kedro-org/kedro/issues/ we'd appreciate it!
j
Would it be a feature request or a bug report?
j
hah, difficult to say 🙂 but choose any template if you're unsure, we can adapt later
👍 1