Pascal Brokmeier
08/06/2024, 12:41 PM
• kedro run --from-nodes a,b,c --fork-from prod --env dev
◦ would do all the first reads from env prod, and then everything else happens in env dev
◦ allows testing a part of the pipeline in your own space based on prod data, without having to copy stuff over
• alternatively, a kedro copy --datasets a,b,c --from prod --to dev
◦ same as above, but with a manual first copy and then a normal kedro run afterwards
• kedro run --without-tags
◦ filters stuff out based on tags, the inverse of --tags (a sketch of this is below)
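[Editor's note: nothing like --without-tags exists as a flag today, but a minimal sketch of that inverse filter is possible against Kedro's Pipeline API. without_tags is a hypothetical helper, not a Kedro function:]

from kedro.pipeline import Pipeline

def without_tags(pipeline: Pipeline, excluded: set[str]) -> Pipeline:
    # Keep only the nodes that carry none of the excluded tags --
    # the inverse of pipeline.only_nodes_with_tags(*excluded).
    return Pipeline([node for node in pipeline.nodes if not (node.tags & excluded)])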
Deepyaman Datta
08/06/2024, 1:44 PM
Can't you define the datasets in base, and overwrite everything beyond the initial datasets in prod? Or do you need more flexibility (like dynamically choosing which part to test from)?
I guess exactly what you're asking for isn't possible right now, but another possibility (if the datasets are not structured so differently in dev and prod) is to define some variables based on env, rather than defining separate sets of conf.
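[Editor's note: a minimal sketch of that variables-based idea, assuming Kedro's OmegaConfigLoader with a custom resolver; the resolver name env_path and the PIPELINE_ENV variable are illustrative, not Kedro conventions:]

# settings.py
import os
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # Referenced from the catalog as: filepath: ${env_path:}/datasetname.parquet
        "env_path": lambda: "gs://prod-bucket" if os.getenv("PIPELINE_ENV") == "prod" else "./data",
    },
}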
Nok Lam Chan
08/06/2024, 1:53 PM
There's also the pipeline.filter API.
Pascal Brokmeier
08/06/2024, 2:41 PM
Nok Lam Chan
08/06/2024, 2:55 PM
Nok Lam Chan
08/06/2024, 2:58 PM
This sounds like what the base environment does. To do this:
1. base (what you called dev) defines the shared configuration
2. prod overrides the necessary datasets
3. dev defines the datasets that you need to read from prod
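[Editor's note: a minimal sketch of how those three layers combine at load time, paths assumed. Kedro merges conf/<env> on top of conf/base, so dev only needs to re-declare the entry datasets with their prod (GCS) locations:]

from kedro.config import OmegaConfigLoader

# Everything comes from conf/base except the entries conf/dev re-declares.
loader = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="dev")
catalog_config = loader["catalog"]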
Nok Lam Chan
08/06/2024, 2:59 PM
Can you give a minimal example, e.g.
# dev
a
# prod
b
and tell us what the expected end result is?
08/06/2024, 3:00 PM
We have three environments: base, local, prod. Almost everything is configured in base. Some of our data in prod sits in BigQuery instead of GCS, but that's beside the point. Anyway, the paths for base are all local filepaths, while prod is all GCS buckets.
We want to run the pipeline E2E, reading the entrypoints from prod but then executing the rest of the pipeline in the base environment (i.e. read entry data from prod, then keep running in our own env, just like a git fork).
Pascal Brokmeier
08/06/2024, 3:01 PM
datasetname:
  path: ${globals.base_path}/...
The base_path in prod is a GCS bucket; in base it's ./data/.
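[Editor's note: a minimal sketch of that resolution with plain OmegaConf; Kedro's own globals handling differs in the details, and the values are illustrative:]

from omegaconf import OmegaConf

conf = OmegaConf.create({
    "globals": {"base_path": "./data"},  # prod's globals would say gs://...
    "datasetname": {"path": "${globals.base_path}/datasetname.parquet"},
})
print(conf.datasetname.path)  # -> ./data/datasetname.parquet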
Deepyaman Datta
08/06/2024, 3:05 PM
Nok Lam Chan
08/06/2024, 3:11 PM
So essentially you have two sets of configuration (prod and dev). For any given run, you want the "input" data to get its configuration from prod while the rest goes to dev.
For an intermediate dataset dataset_b, it could be read from prod or dev depending on which nodes you start from.
Pascal Brokmeier
08/06/2024, 4:59 PM
Run the nodes X>Y>Z with real data but on my own machine, and take the first reads from prod but then store any intermediate data on my own machine.
E.g. no one has write rights to prod but everyone has read rights: this would be a way for everyone to keep working off the latest prod data without tripping over each other or overwriting each other's data.
Pascal Brokmeier
08/06/2024, 5:00 PM
If we all run in prod using versioned kedro datasets, we may accidentally read each other's latest versions because we may be running in parallel.
If we all have our own environments, we have to keep manually copying intermediate results over from prod to our respective machines to keep working off the latest data.
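[Editor's note: a rough sketch of that manual copy step as a script, assuming it is run from inside the project directory; dataset names and envs are placeholders, and a real kedro copy would presumably stream data instead of holding it in memory:]

from pathlib import Path
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

def copy_datasets(names, src_env="prod", dst_env="dev"):
    bootstrap_project(Path.cwd())
    # Read each dataset via the prod catalog...
    with KedroSession.create(env=src_env) as session:
        src_catalog = session.load_context().catalog
        data = {name: src_catalog.load(name) for name in names}
    # ...then write it back out via the dev catalog.
    with KedroSession.create(env=dst_env) as session:
        dst_catalog = session.load_context().catalog
        for name, obj in data.items():
            dst_catalog.save(name, obj)

copy_datasets(["a", "b", "c"])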
Deepyaman Datta
08/06/2024, 5:11 PM
Pascal Brokmeier
08/06/2024, 5:14 PM
Deepyaman Datta
08/06/2024, 5:20 PM
Deepyaman Datta
08/06/2024, 5:22 PM
Nok Lam Chan
08/07/2024, 1:15 PM
Pascal Brokmeier
08/07/2024, 7:53 PM