# questions
p
Few questions on features we're thinking about adding that I wanted to discuss, to hear if these are code smells or reasonable features:
• an "env fork" flag, i.e. `kedro run --from-nodes a,b,c --fork-from prod --env dev`, which would do all the first reads from env `prod` and then everything else happens in env `dev`
  ◦ allows testing a part of the pipeline in your own space based on prod data without having to copy stuff over
• alternatively a `kedro copy --datasets a,b,c --from prod --to dev`
  ◦ same as above, but a manual first copy and then a normal `kedro run` afterwards
• `kedro run --without-tags` -> filter stuff out based on tags, the inverse of `--tags`
d
On the first one, why not define the datasets for the dev env in `base`, and overwrite everything beyond the initial datasets in `prod`? Or do you need more flexibility (like dynamically choosing which part to test from)? I guess exactly what you're asking for isn't possible right now, but another possibility (if the datasets are not structured so differently in dev and prod) is to define some variables based on env, rather than defining separate sets of conf.
n
I have a similar opinion as @Deepyaman Datta for the first two points. For 3, maybe you can do this without a change to the CLI via the `pipeline.filter` API
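A minimal sketch of that tag-exclusion idea, assuming it lives in the project's `pipeline_registry.py`; the node names, the `identity` function, and the `skip_me` tag are all placeholders. Kedro pipelines support set-like subtraction, so "everything except nodes with a tag" can be expressed without any CLI change:

```python
from kedro.pipeline import node, pipeline


def identity(x):
    return x


# Placeholder pipeline: two trivial nodes, one carrying the tag to exclude.
full = pipeline(
    [
        node(identity, "raw", "clean", name="a", tags=["skip_me"]),
        node(identity, "clean", "features", name="b"),
    ]
)

# only_nodes_with_tags() selects the tagged nodes; subtracting them from
# the full pipeline leaves the "without tags" remainder.
without_tagged = full - full.only_nodes_with_tags("skip_me")
```

The same subtraction works with any selection the `Pipeline` API offers, so a `--without-tags` flag would mostly be sugar over this.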
p
not sure I get the comment from Deepyaman on 1)/2) tbh
n
Are you trying to read some data based on the prod config and the rest based on the dev config?
^Deepyaman's point is, you should be able to do this with Kedro environments. This is exactly what the `base` environment does. To do this:
1. `base` (what you called dev) defines the shared configuration
2. `prod` overrides the necessary datasets
3. `dev` defines the datasets that you need to read from prod
(see the layout sketch below)
If this is not what you are asking, can you give an example in a format like this

```
# dev
a

# prod
b
```

and tell us what the expected end result is?
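For reference, the three-environment layout described above would look roughly like this (a sketch; the filenames are the standard Kedro ones, the comments map to the numbered steps):

```
conf/
├── base/         # 1. shared configuration (what you called dev)
│   └── catalog.yml
├── prod/         # 2. overrides the datasets that differ in prod
│   └── catalog.yml
└── dev/          # 3. redefines only the entry datasets to point at prod
    └── catalog.yml
```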
p
OK so we have

```
base
local
prod
```

Almost everything is configured in `base`. Some of our data in `prod` sits in BigQuery instead of GCS, but that's beside the point. Anyways, the paths for `base` are all local filepaths but `prod` is all GCS buckets. We want to run the pipeline E2E, reading the entrypoints from `prod` but then executing the rest of the pipeline in the `base` environment (i.e. read entry data from prod, then keep running in our own env, just like a git fork)
All our catalog paths follow something like

```yaml
datasetname:
  path: ${globals.base_path}/...
```

The `base_path` in `prod` is GCS buckets; in `base` it's `./data/`
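That pattern maps onto per-environment `globals.yml` files, along these lines (a sketch; the bucket name is hypothetical):

```yaml
# conf/base/globals.yml
base_path: ./data

# conf/prod/globals.yml
base_path: gs://some-prod-bucket  # hypothetical bucket name
```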
d
Do you always want to have the option of where to get the data for a,b,c, or on a different test run might you want to fork f, or x,y,z? If you want that level of flexibility, I don't think it's possible as is.
n
So I'd translate this to: you have two sets of configuration (`prod` and `dev`). For any given run, you want the "input" data to get its configuration from `prod` while everything else goes to `dev`. For an intermediate dataset `dataset_b`, it could be read from `prod` or `dev` depending on which nodes you start from.
p
Yes, the flexibility doesn't exist today, hence the Q whether the feature is something that others have thought about or desired, or whether my need for it signals that we're doing something "wrong". We have our real data in prod. Sometimes we want to do a proper run with real data, but not in prod, in another environment instead, e.g. a developer's cloud instance. Then we want to quickly/easily be able to say

> Run the nodes X>Y>Z with real data but on my own machine, taking the first reads from prod but storing any intermediate data on my own machine

E.g. if no one has write rights to prod but everyone has read rights, this would be a way for everyone to keep working off the latest prod data without tripping over each other or overwriting each other's data.
If we all run in `prod` using versioned Kedro datasets, we may accidentally pick up each other's latest versions because we may be running in parallel. If we all have our own environments, we have to keep manually copying intermediate results over from prod to our respective machines to keep working off the latest data.
d
I can see this being useful, sure. I can also see value in there being a different, more powerful approach (e.g. something akin to https://tobikodata.com/virtual-data-environments.html)
p
I like it 🤓 But that's a "major change" for me; the above is more of a "helps us get where we need to go this quarter" small feature
d
Oh yeah, that's a massive feature for sure 😛 Just throwing my own interests in the pool lol
Re your request, maybe @Nok Lam Chan or somebody else has thoughts, but I would suggest 1. raising an issue, and 2. trying to implement it in the short term by extending the CLI / with a hook (I think this should be very doable, though I haven't actually tried looking at it)
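A rough sketch of what the hook route could look like, under stated assumptions: `OmegaConfigLoader` with the default `conf` source, a hard-coded set of entry datasets, and no credentials resolution (so it only covers read paths with ambient auth). Not a tested implementation, just the shape of it:

```python
from kedro.config import OmegaConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.io import AbstractDataset, DataCatalog

# Hypothetical: the entry datasets whose first reads should come from prod.
FORK_DATASETS = {"a", "b", "c"}


class ForkFromProdHooks:
    """Replace the catalog entries for the entry datasets with their prod
    definitions, so first reads hit prod while everything downstream stays
    in the active (dev) environment."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # Resolve the catalog config exactly as the `prod` env would see it.
        prod_conf = OmegaConfigLoader(conf_source="conf", env="prod")["catalog"]
        for name in FORK_DATASETS:
            # NB: AbstractDataset.from_config does not resolve credentials;
            # fine for read-only GCS entries with ambient auth, otherwise
            # credentials would need wiring in here.
            dataset = AbstractDataset.from_config(name, prod_conf[name])
            catalog.add(name, dataset, replace=True)


# Registered in settings.py: HOOKS = (ForkFromProdHooks(),)
```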
n
I actually like the direction of having a virtual data environment, but yes, please raise an issue
p
yeah the virtual data env sounds slick! gonna drop an issue