Hello everyone I am encountering some issues regarding the u Kedro #questions

Hugo Acosta

10/11/2024, 3:53 PM

Hello everyone, I am encountering some issues with placeholders in the data catalog, and I was hoping you could shed some light on this. I have the following pipeline:

from kedro.pipeline import Pipeline, node, pipeline

# `settings` and `compare_id` come from the project itself (imports not shown)
load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")


def create_pipeline(**kwargs) -> Pipeline:
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                # The dataset name joins both dates with an underscore
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality

With the corresponding catalog entry for the output:

compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json

The issue is that whenever the value of load_date is something like 2024_07_01, it generates a path like reports/2024/id_comparison/id_comparison_2024_07_01_2024_05_01.json. Note that the first placeholder (the directory under reports/) is not substituted with the intended value, while the others are. This only happens when the value of load_date contains underscores; it does not happen with dots or hyphens. Why does this happen?

Rashida Kanchwala

10/11/2024, 4:06 PM

Thank you for raising this issue. This will require further investigation from the team. Could you kindly raise this as a bug on GitHub?

Hugo Acosta

10/11/2024, 4:21 PM

Sure, done!

Nok Lam Chan

10/11/2024, 4:31 PM

Can you run `kedro catalog resolve` to understand this better? Is it using the pattern that you intend to use?

Nok Lam Chan

10/11/2024, 4:32 PM

So the problem is that `2024_07_07` somehow becomes `2024`?

Hugo Acosta

10/11/2024, 4:49 PM

Yes, 2024_07_07 becomes 2024

👀 1

Nok Lam Chan

10/11/2024, 6:25 PM

Can you make a minimal example that reproduces this issue?

Vishal Pandey

10/12/2024, 9:23 AM

@Hugo Acosta Just for learning purposes, I wanted to know more about `settings.LOAD_DATE_COMPARISON.get("current")`. What kind of object is `LOAD_DATE_COMPARISON`, and how is it defined in settings.py?

Ankita Katiyar

10/12/2024, 10:15 AM

If you’re using dataset factories, it’s because the `parse` library (https://pypi.org/project/parse/) that we use under the hood for matching dataset names to patterns works this way. It’ll resolve the brackets for `compare_id_{load_date}_{previous_load_date}` at the first underscore. It’s expected behaviour, and I’d recommend using a different separator between the dates for this output dataset.

Hugo Acosta

10/14/2024, 10:53 AM

@Ankita Katiyar In that case, all of the placeholders should suffer from this issue, but it only happens on the one I'm highlighting: reports/*{load_date}*/id_comparison/id_comparison_{load_date}_{previous_load_date}.json

Hugo Acosta

10/14/2024, 10:56 AM

@Vishal Pandey This is the content of settings.py

LOAD_DATE_COMPARISON = globals_config["load_dates_comparison"]

Which refers to the globals.yml file where:

load_dates_comparison:
  previous: "2024_07_01"
  current: "2024_10_07"

👍 1

Hugo Acosta

10/14/2024, 1:44 PM

So it turns out the problem comes from the naming of the catalog.yml entry. When the entry name joins the placeholders with an underscore, as in something_{placeholder1}_{placeholder2}, and the values themselves contain underscores, the path placeholders take unwanted values. This does not happen if we name the entry something_{placeholder1}__vs__{placeholder2}.
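
A minimal sketch of why the `__vs__` separator helps, using only the standard library. This is not Kedro's or `parse`'s actual code: it assumes each placeholder matches as few characters as possible, and the regex is a hand-written stand-in for the pattern `compare_id_{load_date}__vs__{previous_load_date}`. The dates are the ones from globals.yml.

```python
import re

# Hypothetical stand-in for compare_id_{load_date}__vs__{previous_load_date},
# assuming every placeholder is matched non-greedily.
pattern = re.compile(
    r"compare_id_(?P<load_date>.+?)__vs__(?P<previous_load_date>.+?)"
)

# The separator "__vs__" never occurs inside a date, so only one split is possible.
fields = pattern.fullmatch("compare_id_2024_10_07__vs__2024_07_01").groupdict()

filepath = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}__vs__{previous_load_date}.json"
).format(**fields)

print(fields)
print(filepath)
```

The split stays unambiguous only as long as the separator cannot appear inside the placeholder values.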

Ankita Katiyar

10/14/2024, 1:50 PM

Yeah, if the placeholders themselves contain underscores and the separator between them is also an underscore, the string can be split in multiple ways that satisfy the pattern. The `parse` library returns the first match that satisfies the pattern. So `something_{2024}_{07_01_2024_10_07}`, `something_{2024_07}_{01_2024_10_07}`, and `something_{2024_07_01}_{2024_10_07}` all satisfy the pattern, but the `parse` library returns the first match.
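
The ambiguity can be reproduced outside Kedro with a small sketch. It is an approximation rather than the real implementation: the helper `resolve` is hypothetical and assumes each `{placeholder}` behaves like a non-greedy regex group, which is one way to model "the first match wins". It also suggests why only the directory part of the filepath looked wrong: rejoining the two mis-split values with an underscore reproduces the original string.

```python
import re


def resolve(pattern: str, name: str) -> dict:
    """Match `name` against a factory-style pattern (hypothetical helper).

    Assumption: every {field} matches as few characters as possible.
    """
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+?)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, name)
    return match.groupdict() if match else {}


fields = resolve(
    "compare_id_{load_date}_{previous_load_date}",
    "compare_id_2024_10_07_2024_07_01",
)
# load_date stops at the first underscore; the rest goes to previous_load_date.
print(fields)

filepath = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}_{previous_load_date}.json"
).format(**fields)
# The directory gets the truncated value, while the file name looks correct
# because "2024" + "_" + "10_07_2024_07_01" rebuilds the original string.
print(filepath)
```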
