Hello everyone! I have a question about the use of pandas.DeltaTableDataset. The link below says that it is possible to "overwrite a specific partition by using mode=overwrite together with partition_filters". Is it possible to define which partition to overwrite on each execution? E.g.: my pipeline can be applied to data from different places, but in each execution only the data from one place is processed at a time. I want to partition the data by place and only overwrite the data from the place being processed. https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.1.0/api/kedro_datasets.pandas.DeltaTableDataset.html
does that example solve your use case?
Hi @Juan Luis. Thanks for your answer! It doesn't solve my use case, because I need to use the data catalog.
Is it possible to use Kedro params within the data catalog to decide which partition to overwrite?
you should be able to use runtime params or globals, see https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-global-variables-with-the-omegaconfigloader. For example, in this case:
my_dataset:  # placeholder dataset name
  type: pandas.DeltaTableDataset
  save_args:
    mode: overwrite
    partition_filters: "${globals:partition_filters}"
and then add your
(thanks @Nok Lam Chan for the solution!)
Thanks @Juan Luis! Using globals is going to work for me. But I couldn't get this working using the Kedro data catalog. The Delta API says that partition_filters takes a list of tuples. How can I write that in a YAML file? I have the same question for the table schema, which expects a pyarrow.Schema.
A custom resolver is the way to go for non-primitive types
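For anyone reading later, here is a minimal sketch of what such resolvers might look like. It is not the exact code used in this thread. The function names and the resolver names (`partition_filters`, `pa_schema`) are hypothetical. It assumes the project registers resolvers through the `custom_resolvers` key of `CONFIG_LOADER_ARGS` in `settings.py`, and that the schema is written in YAML as `[name, type_alias]` pairs.

```python
from typing import Any, Iterable, Sequence


def to_partition_filters(rows: Iterable[Sequence[Any]]) -> list[tuple]:
    """Turn YAML-style nested lists into (column, operator, value) tuples."""
    filters = []
    for row in rows:
        if len(row) != 3:
            raise ValueError(f"expected [column, op, value], got {list(row)!r}")
        filters.append(tuple(row))
    return filters


def to_pyarrow_schema(columns: Iterable[Sequence[str]]):
    """Build a pyarrow.Schema from [name, type_alias] pairs, e.g. ["place", "string"]."""
    import pyarrow as pa  # imported lazily so this module loads without pyarrow

    # pa.field defaults to nullable=True
    return pa.schema(
        [pa.field(name, pa.type_for_alias(alias)) for name, alias in columns]
    )


# In a Kedro project this dict would live in src/<package_name>/settings.py.
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "partition_filters": to_partition_filters,  # hypothetical resolver name
        "pa_schema": to_pyarrow_schema,  # hypothetical resolver name
    }
}
```

The resolvers could then be called from the catalog through OmegaConf's `${resolver_name:args}` syntax. Whether the tuples reach the dataset unchanged after config resolution is worth checking against your Kedro and OmegaConf versions.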
Yay, that worked great! Thanks @Juan Luis and @Nok Lam Chan
I had some problems when my nodes were returning a DataFrame. E.g.:
name 'column_name' present in the specified schema is not found in the columns or index
, but column_name was defined as nullable in the specified schema. I also had some problems related to the __index_level_0__ column when no schema was specified (see this issue). Using
as the node return value fixed all these problems. Perhaps this function could be embedded into pandas.DeltaTableDataset in the next release of kedro-datasets?
hmm interesting, if you could describe those in https://github.com/kedro-org/kedro/issues/ we'd appreciate it!
Would it be a feature request or a bug report?
hah, difficult to say 🙂 but choose any template if you're unsure, we can adapt later
