# questions
p
Few questions on features we're thinking about adding that I wanted to discuss, to hear if these are code smells or reasonable features:
• an "env fork" flag, i.e. `kedro run --from-nodes a,b,c --fork-from prod --env dev`, which would do all the first reads from env `prod` and then everything else happens in env `dev`
  ◦ allows testing a part of the pipeline in your own space based on prod data without having to copy stuff over
• alternatively a `kedro copy --datasets a,b,c --from prod --to dev`
  ◦ same as above, but a manual first copy and then a normal `kedro run` afterwards
• `kedro run --without-tags` -> filter stuff out based on tags, the inverse of `--tags`
d
On the first one, why not define the datasets for the dev env in `base`, and overwrite everything beyond the initial datasets in `prod`? Or do you need more flexibility (like dynamically choosing which part to test from)? I guess exactly what you're asking for isn't possible right now, but another possibility (if the datasets are not structured so differently in dev and prod) is to define some variables based on env, rather than defining separate sets of conf.
n
I have a similar opinion as @Deepyaman Datta for the first two points. For 3, maybe you can do this without a change to the CLI via the `pipeline.filter` API
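A minimal sketch of that tag-exclusion idea, assuming it lives in the project's `pipeline_registry.py`; the node names, the `identity` function, and the `skip_me` tag are all placeholders. Kedro pipelines support set-like subtraction, so "everything except nodes with a tag" can be expressed without any CLI change:

```python
from kedro.pipeline import node, pipeline


def identity(x):
    return x


# Placeholder pipeline: two trivial nodes, one carrying the tag to exclude.
full = pipeline(
    [
        node(identity, "raw", "clean", name="a", tags=["skip_me"]),
        node(identity, "clean", "features", name="b"),
    ]
)

# only_nodes_with_tags() selects the tagged nodes; subtracting them from
# the full pipeline leaves the "without tags" remainder.
without_tagged = full - full.only_nodes_with_tags("skip_me")
```

The same subtraction works with any selection the `Pipeline` API offers, so a `--without-tags` flag would mostly be sugar over this.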
p
not sure I get the comment from Deepyaman on 1)/2) tbh
n
Are you trying to read some data based on the prod config and the rest based on the dev config?
^Deepyaman's point is, you should be able to do this with Kedro environments. This is exactly what the `base` environment does. To do this:
1. `base` (what you called dev) defines the shared configuration
2. `prod` overrides the necessary datasets
3. `dev` defines the datasets that you need to read from prod
(see the layout sketch below)
If this is not what you are asking, can you give an example in a format like this

```
# dev
a

# prod
b
```

and tell us what the expected end result is?
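For reference, the three-environment layout described above would look roughly like this (a sketch; the filenames are the standard Kedro ones, the comments map to the numbered steps):

```
conf/
├── base/         # 1. shared configuration (what you called dev)
│   └── catalog.yml
├── prod/         # 2. overrides the datasets that differ in prod
│   └── catalog.yml
└── dev/          # 3. redefines only the entry datasets to point at prod
    └── catalog.yml
```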
p
OK so we have

```
base
local
prod
```

Almost everything is configured in `base`. Some of our data in `prod` sits in BigQuery instead of GCS, but that's beside the point. Anyways, the paths for `base` are all local filepaths but `prod` is all GCS buckets. We want to run the pipeline E2E, reading the entrypoints from `prod` but then executing the rest of the pipeline in the `base` environment (i.e. read entry data from prod, then keep running in our own env, just like a git fork)
All our catalog paths follow something like

```yaml
datasetname:
  path: ${globals.base_path}/...
```

The `base_path` in `prod` is GCS buckets; in `base` it's `./data/`
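That pattern maps onto per-environment `globals.yml` files, along these lines (a sketch; the bucket name is hypothetical):

```yaml
# conf/base/globals.yml
base_path: ./data

# conf/prod/globals.yml
base_path: gs://some-prod-bucket  # hypothetical bucket name
```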
d
Do you always want to have the option of where to get the data for a,b,c, or on a different test run might you want to fork f, or x,y,z? If you want that level of flexibility, I don't think it's possible as is.
n
So I'd translate this to: you have two sets of configuration (`prod` and `dev`). For any given run, you want the "input" data to get its configuration from `prod` while everything else goes to `dev`. For an intermediate dataset `dataset_b`, it could be read from `prod` or `dev` depending on which nodes you start from.
p
Yes, the flexibility doesn't exist today, hence the Q whether the feature is something that others have thought about or desired, or whether my need for it signals that we're doing something "wrong". We have our real data in prod. Sometimes we want to do a proper run with real data, but not in prod, in another environment instead, e.g. a developer's cloud instance. Then we want to quickly/easily be able to say

> Run the nodes X>Y>Z with real data but on my own machine, taking the first reads from prod but storing any intermediate data on my own machine

E.g. if no one has write rights to prod but everyone has read rights, this would be a way for everyone to keep working off the latest prod data without tripping over each other or overwriting each other's data.
If we all run in `prod` using versioned Kedro datasets, we may accidentally pick up each other's latest versions because we may be running in parallel. If we all have our own environments, we have to keep manually copying intermediate results over from prod to our respective machines to keep working off the latest data.
d
I can see this being useful, sure. I can also see value in there being a different, more powerful approach (e.g. something akin to https://tobikodata.com/virtual-data-environments.html)
p
I like it 🤓 But that's a "major change" for me; the above is more of a "helps us get where we need to go this quarter" small feature
d
Oh yeah, that's a massive feature for sure 😛 Just throwing my own interests in the pool lol
Re your request, maybe @Nok Lam Chan or somebody else has thoughts, but I would suggest 1. raising an issue, and 2. trying to implement it in the short term by extending the CLI / with a hook (I think this should be very doable, though I haven't actually tried looking at it)
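A rough sketch of what the hook route could look like, under stated assumptions: `OmegaConfigLoader` with the default `conf` source, a hard-coded set of entry datasets, and no credentials resolution (so it only covers read paths with ambient auth). Not a tested implementation, just the shape of it:

```python
from kedro.config import OmegaConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.io import AbstractDataset, DataCatalog

# Hypothetical: the entry datasets whose first reads should come from prod.
FORK_DATASETS = {"a", "b", "c"}


class ForkFromProdHooks:
    """Replace the catalog entries for the entry datasets with their prod
    definitions, so first reads hit prod while everything downstream stays
    in the active (dev) environment."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # Resolve the catalog config exactly as the `prod` env would see it.
        prod_conf = OmegaConfigLoader(conf_source="conf", env="prod")["catalog"]
        for name in FORK_DATASETS:
            # NB: AbstractDataset.from_config does not resolve credentials;
            # fine for read-only GCS entries with ambient auth, otherwise
            # credentials would need wiring in here.
            dataset = AbstractDataset.from_config(name, prod_conf[name])
            catalog.add(name, dataset, replace=True)


# Registered in settings.py: HOOKS = (ForkFromProdHooks(),)
```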
n
I actually like the direction of having a virtual data environment, but yes, please raise an issue
p
yeah the virtual data env sounds slick! gonna drop an issue