# questions
j
Hello! I am looking for advice on the best Kedro project (re)design for our problem. We have a rather long data preparation pipeline for training several spaCy models, and now we need to scale from one language to two or more, up to five. The center of the solution is a DVC-like spaCy project (https://spacy.io/usage/projects) with parts of the Kedro graph used as commands. Regular DVC is also present to help share computational results within the team. From previous experience, it's more convenient to keep one spaCy project per language, both for readability and because languages could require different pipelines at some stage. The data differ between languages, as do some parameters and possibly the catalogs. It looks like we have to change the data folder structure from data/01_raw/golden to data/01_raw/en/golden. What is the Kedro concept for achieving our goal with minimal changes? Namespaces? Custom/templated configs?
n
data/01_raw/golden to data/01_raw/en/golden.
From what you said, I think you can use a template value like this: data/01_raw/${globals:language}/golden, and define language in globals.yml.
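To make the substitution concrete, here is a minimal plain-Python sketch of what such a template resolves to. It is not Kedro's config loader, only an imitation of the placeholder lookup; the GLOBALS dictionary (standing in for globals.yml) and the resolve_globals helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical stand-in for the contents of globals.yml.
GLOBALS = {"language": "en"}

# Matches placeholders such as ${globals:language} or ${globals: language}.
_PLACEHOLDER = re.compile(r"\$\{globals:\s*([A-Za-z_][\w.]*)\s*\}")


def resolve_globals(template: str, global_values: dict) -> str:
    """Replace every ${globals:<key>} placeholder with its value."""

    def _lookup(match: re.Match) -> str:
        key = match.group(1)
        if key not in global_values:
            raise KeyError(f"'{key}' is not defined in globals")
        return str(global_values[key])

    return _PLACEHOLDER.sub(_lookup, template)


raw_path = resolve_globals("data/01_raw/${globals:language}/golden", GLOBALS)
print(raw_path)  # data/01_raw/en/golden
```

Switching the language then only means changing one value in globals.yml; the catalog entries themselves stay the same.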
j
Thank you for the answer! That solves the catalog.yml problem. But how should we handle parameters for pipelines? About 10% could be different for language pipelines.
n
Can you give me an example of the parameters for pipelines A and B?
j
It could be:
parameters_pipe1.yml: skip_labels: [list_item1, list_item2]
parameters_pipe2.yml: some_node_params: {dict of thresholds by label}
n
https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html I think a namespaced pipeline is a good fit here. You can use kedro pipeline create <name> to create the scaffold quickly. You can keep shared parameters in base/parameters.yml and optionally override the necessary bits in base/<pipeline>/parameters.yml when needed.
j
I see that config templates fit our approach. We can define data paths such as data/01_raw/${runtime_params:language}/golden and then modify the CLI commands in the parametrized spaCy project.yml, providing
kedro run --tags train_golden_model --params language:${vars.language}
But how do we link templates and namespaces? Is it possible to pass the parameter as the namespace name? We train models one language at a time. Is the code below OK?
ds_pipeline_1 = pipeline(
        pipe=pipeline_instance,
        inputs="model_input_table",
        namespace="params:language",
    )
I hope I've finally figured it out 🙂 It looks like we don't need namespaces; it would be enough to nest the parameter dictionaries one level deeper,
parameters_pipe2.yml: some_node_params:en:{dict of thresholds by label}
and add language as a parameter of the node. That's our starting point.
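A minimal sketch of that idea, assuming the thresholds are nested by language code and the node receives the language as an extra input. The function name filter_by_threshold, the labels, and the threshold values are all hypothetical; only the some_node_params key comes from the thread.

```python
from typing import Dict, List

# Hypothetical stand-in for parameters_pipe2.yml, with the thresholds
# nested one level deeper under each language code.
SOME_NODE_PARAMS: Dict[str, Dict[str, float]] = {
    "en": {"PERSON": 0.8, "ORG": 0.6},
    "de": {"PERSON": 0.7, "ORG": 0.5},
}


def filter_by_threshold(
    predictions: List[dict],
    some_node_params: Dict[str, Dict[str, float]],
    language: str,
) -> List[dict]:
    """Keep predictions whose score meets the threshold for their label."""
    if language not in some_node_params:
        raise KeyError(f"no thresholds configured for language '{language}'")
    thresholds = some_node_params[language]
    # Labels without a configured threshold are kept (threshold 0.0).
    return [
        p for p in predictions
        if p["score"] >= thresholds.get(p["label"], 0.0)
    ]


predictions = [
    {"label": "PERSON", "score": 0.75},
    {"label": "ORG", "score": 0.65},
]
print(filter_by_threshold(predictions, SOME_NODE_PARAMS, "en"))
print(filter_by_threshold(predictions, SOME_NODE_PARAMS, "de"))

# In a Kedro pipeline this could presumably be wired up as (assumption,
# dataset names are hypothetical):
# node(filter_by_threshold,
#      inputs=["predictions", "params:some_node_params", "params:language"],
#      outputs="filtered_predictions")
```

With this layout the node code is shared across languages, and only the nested block selected by language differs.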