Are there any special characters that should be avoided in tag names? Kedro #questions

Artur Dobrogowski

03/20/2024, 2:53 PM

Are there any special characters that should be avoided in tag names (nodes, etc)? I can't find relevant info in the docs

datajoely

03/20/2024, 2:54 PM

I think tag names are pretty generic, just needs to be valid YAML

datajoely

03/20/2024, 2:54 PM

are you running into issues?

Artur Dobrogowski

03/20/2024, 2:54 PM

no, but I got a PR claiming that colons cause problems; it changes the separator from colons to dots

Artur Dobrogowski

03/20/2024, 2:55 PM

and I wonder what the issue is

Artur Dobrogowski

03/20/2024, 2:55 PM

because I got no more comment

datajoely

03/20/2024, 2:55 PM

colons may break the YAML validity

datajoely

03/20/2024, 2:55 PM

so yes dots or dashes are better

Artur Dobrogowski

03/20/2024, 2:55 PM

huh... you just use quotes and it's okay

Artur Dobrogowski

03/20/2024, 2:55 PM

but I guess

datajoely

03/20/2024, 2:55 PM

dots in key names may conflict with namespaces

Artur Dobrogowski

03/20/2024, 2:56 PM

yes

Artur Dobrogowski

03/20/2024, 2:56 PM

that's my intuition that it feels wrong to use dots as they are reserved for namespaces

datajoely

03/20/2024, 2:56 PM

I think it’s safest to use dashes / underscores here

Artur Dobrogowski

03/20/2024, 2:56 PM

😕

Artur Dobrogowski

03/20/2024, 2:57 PM

I don't like it

Artur Dobrogowski

03/20/2024, 2:58 PM

I think I'll use slash

Artur Dobrogowski

03/20/2024, 2:58 PM

It's for tag grouping feature

datajoely

03/20/2024, 2:59 PM

I think I’m okay with it

👍 1

Juan Luis

03/20/2024, 3:00 PM

hmmm so we don't have documented anywhere what are the valid chars for names? 🤔 I recall that it's written somewhere

Artur Dobrogowski

03/20/2024, 3:00 PM

maybe, I was looking specifically for tags info and there's nothing about it

datajoely

03/20/2024, 3:01 PM

for tag in self._tags:
    if not re.match(r"[\w\.-]+$", tag):
        raise ValueError(
            f"'{tag}' is not a valid node tag. It must contain only "
            f"letters, digits, hyphens, underscores and/or fullstops."
        )
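To see what the check above accepts, here is a minimal standalone sketch that reuses the same regular expression outside Kedro. The helper name `is_valid_tag` and the sample tag names are made up for illustration; only the pattern itself comes from the snippet.

```python
import re

# Same pattern as in the snippet above: word characters, dots and hyphens only.
TAG_PATTERN = re.compile(r"[\w\.-]+$")


def is_valid_tag(tag: str) -> bool:
    """Return True if the tag would pass the regex check quoted above."""
    return TAG_PATTERN.match(tag) is not None


# Hypothetical tag names, each using a different separator.
candidates = ["group.train", "group-train", "group_train", "group:train", "group/train"]
results = {tag: is_valid_tag(tag) for tag in candidates}

for tag, ok in results.items():
    print(f"{tag!r}: {'accepted' if ok else 'rejected'}")
```

Under this pattern dots, hyphens and underscores pass, while colons and slashes are rejected.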

datajoely

03/20/2024, 3:01 PM

node.py

datajoely

03/20/2024, 3:01 PM

so actually I don’t think slashes are allowed

Artur Dobrogowski

03/20/2024, 3:01 PM

yeah neither are colons or dots

Artur Dobrogowski

03/20/2024, 3:02 PM

weird it must be a new ~~feature~~ limitation as it was working before

datajoely

03/20/2024, 3:02 PM

namespaces are the way we intended grouping to be noted, is there any reason that doesn’t work for your purposes?

datajoely

03/20/2024, 3:02 PM

it’s the same in 0.18.0

Artur Dobrogowski

03/20/2024, 3:03 PM

yes, it's for plugin for grouping nodes during node translation for execution environment like kedro->vertexai nodes

datajoely

03/20/2024, 3:04 PM

and do namespaces not work for you there?

datajoely

03/20/2024, 3:05 PM

It would be incredibly helpful to get your thoughts here https://github.com/kedro-org/kedro/issues/3094

datajoely

03/20/2024, 3:05 PM

this falls under the first point Deciding on granularity when translating to orchestrator DSL

Artur Dobrogowski

03/20/2024, 3:07 PM

I already commented there

Artur Dobrogowski

03/20/2024, 3:07 PM

I'm Lasica on github

❤️ 1

Artur Dobrogowski

03/20/2024, 3:07 PM

and namespaces are not enough imho

datajoely

03/20/2024, 3:08 PM

it would still be very helpful for you to set out why namespaces aren’t

Artur Dobrogowski

03/20/2024, 3:08 PM

yeah I need to gather my thoughts but that was my impression when I was dealing with it last time hence the feature to group nodes via tags

datajoely

03/20/2024, 3:10 PM

this is genuinely incredibly helpful

datajoely

03/20/2024, 3:12 PM

equally if we need to relax the tag validation this would help make the argument

Juan Luis

03/20/2024, 3:13 PM

to Artur's point, I also don't have an articulate opinion on namespaces yet but I perceive them as "heavy"

Juan Luis

03/20/2024, 3:13 PM

after 1.5 years of using Kedro I'm still not sure how to use them correctly

☝️ 1

datajoely

03/20/2024, 3:13 PM

I should rephrase - @Ivan Danov designed them for this purpose so it’s helpful to articulate where the friction is

Artur Dobrogowski

03/20/2024, 3:15 PM

well they are hierarchical and cumbersome a bit because of that, once you start using them you need to use them everywhere in the pipeline

Artur Dobrogowski

03/20/2024, 3:15 PM

say I got 5 nodes - 1, 2, 3, 4, 5 and I want to group nodes 1-2, and 4-5.

Artur Dobrogowski

03/20/2024, 3:16 PM

if I use namespaces then the best would be to namespace whole pipeline and then add subnamespaces for 1,2 and 4,5

Artur Dobrogowski

03/20/2024, 3:16 PM

and when I want to run the pipeline I need to provide extra parameters - the namespace, which gets longer because I need to add extra steps
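The five-node scenario can be sketched in plain Python, assuming a plugin that groups nodes by a shared tag prefix. Everything here (`FakeNode`, the `group.` prefix, `group_by_tag`) is a hypothetical stand-in for illustration, not Kedro or plugin API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for a pipeline node; not the Kedro Node class."""
    name: str
    tags: frozenset = field(default_factory=frozenset)


GROUP_PREFIX = "group."  # assumed convention, chosen only for this sketch


def group_by_tag(nodes):
    """Map each group name to its node names; untagged nodes stay on their own."""
    groups = {}
    for n in nodes:
        group_tags = sorted(t for t in n.tags if t.startswith(GROUP_PREFIX))
        # A node without a grouping tag forms a single-node group under its own name.
        key = group_tags[0][len(GROUP_PREFIX):] if group_tags else n.name
        groups.setdefault(key, []).append(n.name)
    return groups


nodes = [
    FakeNode("node1", frozenset({"group.prep"})),
    FakeNode("node2", frozenset({"group.prep"})),
    FakeNode("node3"),
    FakeNode("node4", frozenset({"group.train", "gpu"})),
    FakeNode("node5", frozenset({"group.train"})),
]

grouped = group_by_tag(nodes)
print(grouped)
```

In this sketch nodes 1-2 and 4-5 end up grouped while node 3 and all node names are left untouched, and node 4 can still carry an unrelated tag such as `gpu`.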

Artur Dobrogowski

03/20/2024, 3:17 PM

that's one point of friction

Artur Dobrogowski

03/20/2024, 3:17 PM

but maybe it's only in my head

Artur Dobrogowski

03/20/2024, 3:17 PM

I think I didn't properly consider using namespaces for that because they have some more restrictions and take more getting used to

Artur Dobrogowski

03/20/2024, 3:17 PM

I think you can't run nodes without namespace together with namespaced nodes

Artur Dobrogowski

03/20/2024, 3:24 PM

looks like this limitation is quite fresh, I implemented that feature like 6 months ago

Ivan Danov

03/20/2024, 3:26 PM

• Namespaces were designed to group nodes in an inclusive fashion, i.e. if you want to run a group of nodes as one task in VertexAI/Airflow/etc.
• Any solution for this will be by nature hierarchical. Tags on the other hand are inclusive, e.g. you might tag a node to be both for example gpu and largemem node.
• How you name your tags has no influence over yaml or namespaces.
• Only nodes have tags, but both nodes and datasets have namespaces.
• The namespace of a dataset doesn't decide anything in terms of scheduling, but is only needed in order to avoid duplicate dataset names if you are reusing the same pipeline twice in a bigger pipeline.

Ivan Danov

03/20/2024, 3:28 PM

It seems that for your use case, you'd be better served by namespaces. Not sure what restrictions are preventing you from using them - in fact a namespace is nothing more than just a prefix to your node name, and that's how they are implemented internally. In the future we are considering adding more restrictions to avoid creating loops between namespaces by accident, which will prevent you from scheduling your namespaces as separate tasks (currently people do that a lot, since it's not prohibited by Kedro).
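Read literally, "a prefix to your node name" can be pictured as below. This is a plain-Python illustration under that assumption; `namespaced`, `in_namespace` and the example names are hypothetical and do not reproduce Kedro's internal code.

```python
from typing import Optional


def namespaced(name: str, namespace: Optional[str] = None) -> str:
    """Prefix a name with a namespace, joined by a dot (hypothetical helper)."""
    return f"{namespace}.{name}" if namespace else name


def in_namespace(full_name: str, namespace: str) -> bool:
    """True if the name sits in the namespace or one of its sub-namespaces."""
    return full_name.startswith(namespace + ".")


# Five nodes: two in "prep", one without a namespace, two in "train".
node_names = [
    namespaced("node1", "prep"),
    namespaced("node2", "prep"),
    namespaced("node3"),
    namespaced("node4", "train"),
    namespaced("node5", "train"),
]

# Selecting one group is then just a prefix match on the node name.
prep_task = [n for n in node_names if in_namespace(n, "prep")]
print(node_names)
print(prep_task)
```

Because the dot acts as the separator in this picture, a dot inside a tag or key name could be mistaken for a namespace boundary, which is the conflict raised earlier in the thread.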

Artur Dobrogowski

03/20/2024, 4:05 PM

I don't disagree that namespaces could probably also handle my case, but I was searching for reasons why I didn't decide to use them, or what the friction was that made me decide to use tags. When I think about it, it can be summed up with the following points:
• I wasn't very familiar with namespaces yet and didn't know their full purpose
• starting to use namespaces has friction in the fact that you need to start using them widely, everywhere. The fact that I need to add a namespace to datasets in this case is more of a pain point than a benefit, especially as I was doing it before I was familiar with dataset factories, or the feature was not yet released
• we were already using tags to steer the behavior of certain nodes in the translation process (say, assign gpu), so it made sense to expand that functionality instead of introducing a new mechanism that requires learning
• namespaces felt cumbersome in that you can't start using them partially, as you can't run the default namespace + some other namespace (at least I don't know how).
Since the grouping feature was made to be modular/swappable, it makes total sense to me now to add a namespace grouping feature as an alternative.

Artur Dobrogowski

03/20/2024, 4:11 PM

anyways @datajoely I believe it would be helpful to have information about tag name limits/conventions in the following places in the docs:
https://docs.kedro.org/en/stable/api/kedro.pipeline.node.html - implies any string is fine
https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-tag-a-node - info box here

Juan Luis

03/20/2024, 4:12 PM

absolutely yes - @Artur Dobrogowski do you have a moment to open an issue in https://github.com/kedro-org/kedro/issues/ ?

Artur Dobrogowski

03/20/2024, 4:14 PM

can do

Ivan Danov

03/20/2024, 4:15 PM

Totally makes sense @Artur Dobrogowski. 1 and 3 seem like situational reasons not to use them, i.e. specific for your project. But it would be useful if you give a couple of examples for 2 and 4, which will help us address that, be it through documentation, a blog post or additional feature. Do you mind sharing more about those two points of yours?

Artur Dobrogowski

03/20/2024, 4:16 PM

sure I think I already made an issue about #4 with Marcin

Artur Dobrogowski

03/20/2024, 4:16 PM

and I can elaborate on #2 another time

Ivan Danov

03/20/2024, 4:17 PM

sounds great, thanks a lot 🙇

Artur Dobrogowski

03/20/2024, 4:22 PM

https://github.com/kedro-org/kedro/issues/3727