# questions
h
Hello everyone, I am encountering some issues with placeholders in the data catalog and I was hoping you could shed some light on this. I have the following pipeline:
from kedro.pipeline import Pipeline, node, pipeline

# settings and compare_id come from the project's own modules (imports not shown).
load_date = settings.LOAD_DATE_COMPARISON.get("current")
previous_load_date = settings.LOAD_DATE_COMPARISON.get("previous")


def create_pipeline(**kwargs) -> Pipeline:
    # Dataset names are built from the two load dates.
    format_data_quality = pipeline(
        [
            node(
                func=compare_id,
                inputs=[
                    f"maestro_indicadores_{load_date}",
                    f"maestro_indicadores_{previous_load_date}",
                ],
                outputs=f"compare_id_{load_date}_{previous_load_date}",
                name="compare_id_node",
                tags="compare_id",
            ),
        ]
    )
    return format_data_quality
With the corresponding catalog entry for the output:
compare_id_{load_date}_{previous_load_date}:
  type: json.JSONDataset
  filepath: reports/{load_date}/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
The issue is that whenever the value of load_date is something like 2024_07_01, the generated path looks like reports/*2024*/id_comparison/id_comparison_2024_07_01_2024_05_01.json. Note that the first placeholder is not substituted with the intended value, while the others are. This only happens when the value of load_date contains underscores; it does not happen with dots or hyphens. Why does this happen?
r
Thank you for raising this issue. This will require further investigation from the team. Could you kindly raise this as a bug on GitHub?
h
Sure, done!
n
Can you run kedro catalog resolve to understand this better? Is it using the pattern that you intend to use?
So the problem is that 2024_07_07 somehow becomes 2024?
h
Yes, 2024_07_07 becomes 2024
n
Can you make a minimal example that reproduces this issue?
v
@Hugo Acosta Just for learning purposes, I wanted to know more about settings.LOAD_DATE_COMPARISON.get("current"). What kind of object is LOAD_DATE_COMPARISON, and how is it defined in settings.py?
a
If you’re using dataset factories, it’s because of the parse library (https://pypi.org/project/parse/), which we use under the hood for matching dataset names to patterns. It’ll resolve the brackets for compare_id_{load_date}_{previous_load_date} at the first underscore. It’s expected behaviour, and I’d recommend using a different separator between the dates for this output dataset.
h
@Ankita Katiyar In that case, all of the placeholders should suffer from this issue, but it only happens on the one I'm highlighting: reports/*{load_date}*/id_comparison/id_comparison_{load_date}_{previous_load_date}.json
@Vishal Pandey This is the content of settings.py
LOAD_DATE_COMPARISON = globals_config["load_dates_comparison"]
Which refers to the globals.yml file where:
load_dates_comparison:
  previous: "2024_07_01"
  current: "2024_10_07"
So it turns out the problem comes from the naming of the catalog.yml entry. When the name uses underscores and follows the schema something_{placeholder1}_{placeholder2}, the placeholders in the path take unwanted values. This does not happen if we name the entry like something_{placeholder1}*__vs__*{placeholder2}
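To illustrate why a longer separator helps, here is a minimal sketch that uses Python's standard re module rather than Kedro or parse themselves. It assumes that each untyped placeholder behaves like a non-greedy capture group matched against the whole name, and that the working entry name uses __vs__ as the separator. The helper split_name, the regexes, and the dataset names are hypothetical and only stand in for the catalog patterns.

```python
import re


def split_name(pattern: str, name: str) -> dict:
    """Match `name` against a hand-written regex standing in for a catalog pattern."""
    match = re.fullmatch(pattern, name)
    if match is None:
        raise ValueError(f"{name!r} does not match {pattern!r}")
    return match.groupdict()


# Assumption: each {placeholder} is modelled as a non-greedy group (.+?).
UNDERSCORE_PATTERN = r"something_(?P<placeholder1>.+?)_(?P<placeholder2>.+?)"
VS_PATTERN = r"something_(?P<placeholder1>.+?)__vs__(?P<placeholder2>.+?)"

# With a single underscore as separator, the first field stops at the first "_".
ambiguous = split_name(UNDERSCORE_PATTERN, "something_2024_10_07_2024_07_01")

# With "__vs__" there is only one place where the separator can match.
unambiguous = split_name(VS_PATTERN, "something_2024_10_07__vs__2024_07_01")

print(ambiguous)
print(unambiguous)
```

In this sketch the underscore-separated name yields 2024 for the first placeholder, while the __vs__ name yields the two full dates, because the separator never occurs inside a date.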
a
Yeah, if the placeholder values themselves contain underscores and the separator between them is also an underscore, the string can be split in multiple ways that all satisfy the pattern. The parse library returns the first match that satisfies the pattern. So
something_{2024}_{07_01_2024_10_07}
and
something_{2024_07}_{01_2024_10_07}
and
something_{2024_07_01}_{2024_10_07}
all satisfy the pattern, but the parse library returns the first one.
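The first-match behaviour described above can be sketched with Python's standard re module. This is not Kedro's or parse's actual code: it assumes that untyped placeholders act like non-greedy groups matched over the full dataset name, and the helpers candidate_splits and resolve_filepath are hypothetical. The dataset name uses the dates from the globals.yml shown earlier in the thread.

```python
import re

# Values taken from the thread's globals.yml: current and previous load dates.
DATASET_NAME = "compare_id_2024_10_07_2024_07_01"
PREFIX = "compare_id_"
FILEPATH = (
    "reports/{load_date}/id_comparison/"
    "id_comparison_{load_date}_{previous_load_date}.json"
)

# Assumption: untyped placeholders act like non-greedy groups over the full name.
NAME_REGEX = re.compile(
    r"\Acompare_id_(?P<load_date>.+?)_(?P<previous_load_date>.+?)\Z"
)


def candidate_splits(name: str) -> list[tuple[str, str]]:
    """List every split of the dated part at a single underscore."""
    body = name[len(PREFIX):]
    return [(body[:i], body[i + 1:]) for i, ch in enumerate(body) if ch == "_"]


def resolve_filepath(name: str) -> str:
    """Fill the filepath template using the first match of the name regex."""
    match = NAME_REGEX.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not match the pattern")
    return FILEPATH.format(**match.groupdict())


splits = candidate_splits(DATASET_NAME)
resolved = resolve_filepath(DATASET_NAME)
print(splits)    # every split that satisfies the pattern
print(resolved)  # the path produced by the first one
```

Under these assumptions the first split assigns 2024 to load_date and 10_07_2024_07_01 to previous_load_date. Joining the two with an underscore reproduces the full 2024_10_07_2024_07_01 in the file name, so only the directory segment, which uses load_date on its own, looks wrong.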