Pascal Brokmeier
08/06/2024, 12:41 PM
• kedro run --from-nodes a,b,c --fork-from prod --env dev, which would do all the initial reads from env prod while everything else happens in env dev
  ◦ allows testing a part of the pipeline in your own space based on prod data, without having to copy anything over
• alternatively, a kedro copy --datasets a,b,c --from prod --to dev
  ◦ same as above, but a manual copy first, then a normal kedro run afterwards
• kedro run --without-tags -> filter nodes out based on tags, the inverse of --tags
Deepyaman Datta
08/06/2024, 1:44 PM
…base, and overwrite everything beyond the initial datasets in prod? Or do you need more flexibility (like dynamically choosing which part to test from)?
I guess exactly what you're asking for isn't possible right now, but another possibility (if the datasets are not structured so differently in dev and prod) is to define some variables based on env, rather than defining separate sets of conf.
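A minimal sketch of that variables-based approach, assuming Kedro's OmegaConfigLoader (the default config loader since 0.19); the variable name and paths here are illustrative, not from the thread:

```yaml
# conf/base/globals.yml -- default value used for local runs
base_path: ./data
```

```yaml
# conf/prod/globals.yml -- overrides the same variable for prod runs
base_path: gs://prod-bucket/data
```

```yaml
# conf/base/catalog.yml -- a single catalog entry, resolved per environment
my_dataset:
  type: pandas.ParquetDataset
  filepath: ${globals:base_path}/my_dataset.parquet
```

With this, kedro run --env prod resolves the same entry against the bucket, while a plain kedro run resolves it against ./data.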
Nok Lam Chan
08/06/2024, 1:53 PM
pipeline.filter API
Pascal Brokmeier
08/06/2024, 2:41 PM
Nok Lam Chan
08/06/2024, 2:55 PM
Nok Lam Chan
08/06/2024, 2:58 PM
…base environment do.
To do this (a sketch follows after the list):
1. base (what you called dev) defines the shared configuration
2. prod overrides the necessary datasets
3. dev defines the datasets that you need to read from prod
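A hedged sketch of this layering (dataset names, types, and paths are illustrative, not from the thread). Kedro merges the chosen --env on top of base, so dev only needs to redefine the entry dataset it should read from prod:

```yaml
# conf/base/catalog.yml -- shared configuration; local paths by default
raw_events:
  type: pandas.ParquetDataset
  filepath: ./data/raw_events.parquet

features:
  type: pandas.ParquetDataset
  filepath: ./data/features.parquet
```

```yaml
# conf/prod/catalog.yml -- overrides both datasets with prod storage
raw_events:
  type: pandas.ParquetDataset
  filepath: gs://prod-bucket/raw_events.parquet

features:
  type: pandas.ParquetDataset
  filepath: gs://prod-bucket/features.parquet
```

```yaml
# conf/dev/catalog.yml -- only the entry dataset points at prod storage;
# features falls back to the local path defined in base
raw_events:
  type: pandas.ParquetDataset
  filepath: gs://prod-bucket/raw_events.parquet
```

With this layout, kedro run --env dev reads raw_events from the prod bucket and writes everything downstream locally.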
08/06/2024, 2:59 PM# dev
a
# prod
b
and tell us what is the expected end result?Pascal Brokmeier
Pascal Brokmeier
08/06/2024, 3:00 PM
```
base
local
prod
```
Almost everything is configured in base. Some of our data in prod sits in BigQuery instead of GCS, but that's beside the point. Anyway, the paths for base are all local file paths, but prod is all GCS buckets.
We want to run the pipeline E2E, reading the entrypoints from prod but then executing the rest of the pipeline in the base environment (i.e. read entry data from prod, then keep running in our own env, just like a git fork)
08/06/2024, 3:01 PMdatasetname:
path: ${globals.base_path}/...
the base_path in prod is GCS buckets, in base it's ./data/Deepyaman Datta
Deepyaman Datta
08/06/2024, 3:05 PM
Nok Lam Chan
08/06/2024, 3:11 PM
…prod and dev). For any given run, you want the "input" data to get its configuration from prod while the rest goes to dev.
For an intermediate dataset dataset_b, it could be read from prod or dev depending on which nodes you start from.
Pascal Brokmeier
08/06/2024, 4:59 PM
Run the nodes X>Y>Z with real data but on my own machine, taking the first reads from Prod but then storing any intermediate data on my own machine.
E.g. if no one has write rights to prod but everyone has read rights, this would be a way for everyone to keep working off the latest prod data without tripping over each other or overwriting each other's data.
Pascal Brokmeier
08/06/2024, 5:00 PM
…prod using versioned kedro datasets, we may accidentally read each other's latest versions because we may be running in parallel.
If we all have our own environments, we have to keep manually copying intermediate results over from prod to our respective machines to keep working off the latest data.
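For context on that collision, a sketch of a versioned catalog entry (dataset name and path are illustrative). With versioned: true, each save goes into a new timestamped subfolder, and a load that doesn't pin a version resolves to the most recent save, whichever run produced it:

```yaml
# conf/prod/catalog.yml -- illustrative versioned entry
features:
  type: pandas.ParquetDataset
  filepath: gs://prod-bucket/features.parquet
  # each save writes .../features.parquet/<timestamp>/features.parquet;
  # an unpinned load picks the latest timestamp, even from a parallel run
  versioned: true
```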
Deepyaman Datta
08/06/2024, 5:11 PM
Pascal Brokmeier
08/06/2024, 5:14 PM
Deepyaman Datta
08/06/2024, 5:20 PM
Deepyaman Datta
08/06/2024, 5:22 PM
Nok Lam Chan
08/07/2024, 1:15 PM
Pascal Brokmeier
08/07/2024, 7:53 PM