# questions
v
Hi, are there any examples on using Kedro for inference, or is it mainly designed for training pipelines? The issue I have with inference is that the input to inference will vary, for instance a path to S3, even though it's one entry in the data catalog. I would've liked to solve this by setting the `filepath` of the catalog entry as a parameter to `kedro run`, but according to my other discussion with @datajoely this is not allowed in Kedro by design. Has anyone else used Kedro this way, or should I just skip Kedro for inference and similar types of pipelines with varying input?
i
@viveca In cases where the S3 path varies, the best approach would be to use some form of templating, so that the path can be set through an environment variable. Out of curiosity, why does the path to your data/model vary between runs?
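For instance, something along these lines (just a minimal sketch against Kedro 0.18's `TemplatedConfigLoader`; the `RUN_ID` env var, bucket and dataset names are made up, so adapt them to your project):

```python
# src/<your_project>/settings.py -- minimal sketch (Kedro 0.18.x, TemplatedConfigLoader)
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # Anything in globals_dict can be referenced in conf/**/catalog.yml as ${...}
    "globals_dict": {
        "run_id": os.environ.get("RUN_ID", "local-dev"),  # hypothetical env var
    },
}

# conf/base/catalog.yml would then contain something like:
#   inference_input:
#     type: pandas.CSVDataSet
#     filepath: s3://my-bucket/inference/${run_id}/input.csv
```

Then `RUN_ID=cust-001 kedro run` points the same catalog entry at a different path per run.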
v
Thanks for the reply. Yes, I know how I can achieve that with an environment variable. I'm just slightly bugged by the fact that it then isn't a recognized parameter of my pipeline, in the Kedro sense. And I'm slightly worried that if this is considered an anti-pattern, the `TemplatedConfigLoader` may change in the future and disallow using env vars to configure the catalog. In this particular case, I'm running the pipeline in parallel. Imagine you have many customer datasets that you want to run your model for.
d
How many inputs does your pipeline have? Also, are you going to save your intermediate datasets to different paths too, if you're running in parallel? Re passing it as a parameter / setting it in the `TemplatedConfigLoader`, @Antony Milne suggests an alternative that doesn't require env vars (linked from here, where you can see this has definitely come up before): https://github.com/kedro-org/kedro/issues/1647
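I haven't checked whether this is exactly what the issue proposes, but one possible no-env-var variant is to merge the `kedro run --params` values into the templating globals by subclassing `TemplatedConfigLoader`. Purely a sketch, assuming the 0.18.x constructor accepts `runtime_params` and `globals_dict` as keyword arguments:

```python
# src/<your_project>/settings.py -- sketch only; check your Kedro version's
# TemplatedConfigLoader signature before relying on this.
from kedro.config import TemplatedConfigLoader


class RuntimeTemplatedConfigLoader(TemplatedConfigLoader):
    """Expose `kedro run --params` values as ${key} references in config files."""

    def __init__(self, *args, runtime_params=None, globals_dict=None, **kwargs):
        runtime_params = runtime_params or {}
        # Runtime params take precedence over statically configured globals.
        merged = {**(globals_dict or {}), **runtime_params}
        super().__init__(
            *args, runtime_params=runtime_params, globals_dict=merged, **kwargs
        )


CONFIG_LOADER_CLASS = RuntimeTemplatedConfigLoader
```

With that in place, `kedro run --params run_id:cust-001` (or `run_id=cust-001`, depending on the Kedro version) would fill in `${run_id}` wherever the catalog references it.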
v
@Deepyaman Datta Thanks, that thread looks familiar. It does indeed look like I'm not the only one needing to do similar things 🙂 There are only a few variables. For instance, there could be three of them: an input, an output and an intermediate path. They could all have a fixed format, but with a variable index/id that makes each run unique, so different parallel pipelines don't conflict.
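For what it's worth, with the catalog templated on a run id as sketched above, the parallel runs themselves can just be one `kedro run` per customer with a different env var. A rough sketch (the `RUN_ID` name and customer ids are made up):

```python
# run_all.py -- rough sketch: one `kedro run` per customer, isolated by RUN_ID
import os
import subprocess

customer_ids = ["cust-001", "cust-002", "cust-003"]  # hypothetical ids

processes = []
for run_id in customer_ids:
    env = {**os.environ, "RUN_ID": run_id}
    # Each run resolves ${run_id} in its catalog paths, so inputs, outputs
    # and intermediates land under different prefixes and don't clash.
    processes.append(subprocess.Popen(["kedro", "run"], env=env))

for proc in processes:
    proc.wait()
```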