# questions
n
Hello Team, I have been following this for resolving S3 credentials at runtime: https://docs.kedro.org/en/1.0.0/extend/hooks/common_use_cases/#use-hooks-to-load-external-credentials However, I need to be able to connect to multiple S3 buckets (one for each dataset), and I need a few parameters at runtime to be able to assume an AWS role and get credentials: account_id, role_arn, etc. To do this with the above approach, my credential resolver hook would need to resolve based on the name of the credential, which could follow a special format (account_id/role_arn), and I cannot hardcode the names in the code; essentially I need lambda-like, dynamically computed values. Is this possible? Or would it be better to use a config resolver instead, as follows:
```yaml
weather:
  type: polars.EagerPolarsDataset
  filepath: s3a://your_bucket/data/01_raw/weather*
  file_format: csv
  credentials: ${s3_creds:123456789012,arn:role}
```
where `s3_creds` is a config resolver that returns a dictionary with access keys and secrets. One potential issue I see with this approach is that the credentials could expire if they are evaluated only at the beginning of the pipeline and not every time a load or save is performed. Is there any better way to achieve what I want?
• Dynamic credential resolution per dataset.
• Credential refresh at load/save time.
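For context, a rough sketch of what the resolver side of this could look like, assuming boto3 STS is used to assume the role; the function body, the credential-dict keys, and exactly where the `${s3_creds:...}` reference can live (catalog entry vs. credentials config) are assumptions, but registering custom resolvers via `CONFIG_LOADER_ARGS` in `settings.py` is Kedro's standard mechanism:

```python
# settings.py (sketch) -- register a custom OmegaConf resolver named "s3_creds".
import boto3


def s3_creds(account_id: str, role_name: str) -> dict:
    """Assume the given role via STS and return fsspec-style S3 credentials."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    sts = boto3.client("sts")
    response = sts.assume_role(RoleArn=role_arn, RoleSessionName="kedro-run")
    creds = response["Credentials"]
    return {
        "key": creds["AccessKeyId"],
        "secret": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
    }


CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "s3_creds": s3_creds,
    },
}
```

As written, the resolver is evaluated once when the config is loaded, which is exactly the expiry concern raised above.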
d
Hmm... I'm not aware of any built-in mechanism to refresh credentials before load/save; this might have to be done with custom `before_dataset_loaded`/`before_dataset_saved` hooks, if necessary. Credentials aren't even always handled in the same way; for most filesystem-based datasets, you'd basically need to reconstruct the `fsspec.filesystem` object? I can't find anything on wanting to refresh credentials during pipeline runs with a quick search; maybe somebody else has run into it.
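For reference, a very rough sketch of that hook-based approach, assuming the catalog is captured in `after_catalog_created`, that a `refresh_s3_creds` helper exists, and that replacing the dataset's private `_fs` attribute is acceptable (none of this is an established Kedro pattern):

```python
# hooks.py (sketch) -- rebuild a dataset's filesystem before each load/save.
import fsspec
from kedro.framework.hooks import hook_impl


def refresh_s3_creds() -> dict:
    """Hypothetical helper: assume the role again and return fresh key/secret/token."""
    raise NotImplementedError


class RefreshCredentialsHooks:
    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a reference to the catalog so the per-dataset hooks can reach datasets.
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name: str):
        self._refresh(dataset_name)

    @hook_impl
    def before_dataset_saved(self, dataset_name: str):
        self._refresh(dataset_name)

    def _refresh(self, dataset_name: str):
        dataset = self._catalog.get(dataset_name)  # accessor may vary by Kedro version
        if hasattr(dataset, "_fs"):
            # Replacing the private `_fs` attribute is an assumption about how
            # fsspec-based datasets are built, not a supported API.
            dataset._fs = fsspec.filesystem("s3", **refresh_s3_creds())
```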
d
it's a tricky use case; I think custom datasets or dataset hooks could be useful here. @Elena Khaustova, should we consider some catalog updates to make this credential resolution more dynamic? What do you think?
e
It looks like it’s related to https://github.com/kedro-org/kedro/discussions/4320, which totally makes sense, but not as a part of the catalog functionality
👍 1
d
@Dmitry Sorokin @Elena Khaustova I think the part of the problem that's still not addressed is: how do you refresh credentials? I suppose we're saying you essentially need to build "custom datasets or dataset hooks" for this? I think that makes sense, since I can't imagine we'd change the baseline behavior to refresh credentials on every save/load? Unless it was to make this configurable; it's just that I haven't seen this request before.
n
@Deepyaman Datta Could it be done in two different parts?
1. Ability to inject a lambda-like "credential_provider" into datasets.
2. (Specifically for EagerPolarsDataset) Initialize the FileSystem object in the load and save methods instead of the constructor.
This would mean that you could call the credential provider and get new credentials on every call of load and save.
💡 1
d
> (Specifically for EagerPolarsDataset) Initialize the FileSystem object in the load and save methods instead of the constructor. This would mean that you could call the credential provider and get new credentials on every call of load and save.
I think this could work, but I feel like it should eventually be done for all datasets that are `fsspec`-based. My initial concern was that recreating the filesystem each time would be a bad idea. However, most `fsspec.AbstractFileSystem` instances are cachable: instances are cached based on the arguments used to initialize the filesystem, plus anything in `_extra_tokenize_attributes`. As such, it seems like it shouldn't be a problem to repeatedly create the filesystem.
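A quick illustration of that caching behaviour, using the in-memory filesystem as a stand-in:

```python
import fsspec

# Constructing a filesystem twice with the same arguments returns the cached instance.
fs1 = fsspec.filesystem("memory")
fs2 = fsspec.filesystem("memory")
assert fs1 is fs2

# Clearing the instance cache forces a fresh object on the next construction,
# which is what you'd want after credentials change.
type(fs1).clear_instance_cache()
fs3 = fsspec.filesystem("memory")
assert fs3 is not fs1
```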
> Ability to inject a lambda-like "credential_provider" into datasets.
You can already use a resolver for credentials keys (`credentials` itself has to be a dict), but to inject a lambda as `credentials`, I guess we would need to support a resolver there. One possibility is to iterate through all the keys in `credentials` and, if something is a function object, call it. This is, in a way, how Kedro supports lazy partitions. The filesystem creation pieces would then go into some helper function, and it would need to be called from every method that requires the filesystem object. If there were some sort of signal that you need a new filesystem object (maybe the fact that a `credentials` key is a function is a sufficient signal for that? or it could be a more explicit dataset option), then you would need to do `type(self._fs).clear_instance_cache()` before constructing it.
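A rough sketch of such a helper, assuming callable `credentials` values are the signal to rebuild the filesystem (the helper name and the callable-credentials convention are hypothetical, not existing Kedro behaviour):

```python
import fsspec


def _get_filesystem(protocol, credentials, fs_args=None):
    """Resolve callable credential values and return a (fresh) fsspec filesystem."""
    fs_args = fs_args or {}
    resolved = {}
    needs_fresh_fs = False
    for key, value in credentials.items():
        if callable(value):
            # A callable credential value is the (assumed) signal that the
            # filesystem must be rebuilt with freshly fetched credentials.
            resolved[key] = value()
            needs_fresh_fs = True
        else:
            resolved[key] = value

    if needs_fresh_fs:
        # Drop cached instances so fsspec doesn't hand back a filesystem that
        # was built with the old, possibly expired, credentials.
        fsspec.get_filesystem_class(protocol).clear_instance_cache()
    return fsspec.filesystem(protocol, **resolved, **fs_args)
```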
So this looks feasible, but it's a fairly nontrivial change. Maybe it makes sense to try this in a custom version of `EagerPolarsDataset` (if that's the main one you're using right now) and see if it works well? If so, you could contribute it upstream and standardize the pattern across all the `fsspec`-based datasets. (This probably requires a bigger review, but if it works without issue and isn't constructing a filesystem instance each time in the current case, I think this could be fine.)
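For concreteness, a minimal sketch of what that custom dataset could look like, assuming a hypothetical `credential_provider` callable argument and delegating the actual I/O to the parent class (private attributes like `_fs`/`_protocol` and the exact load/save method names may differ across kedro-datasets versions, so treat this as an outline rather than a drop-in implementation):

```python
from typing import Any, Callable

import fsspec
from kedro_datasets.polars import EagerPolarsDataset


class RefreshingEagerPolarsDataset(EagerPolarsDataset):
    """EagerPolarsDataset variant that rebuilds its filesystem on every load/save."""

    def __init__(self, *, credential_provider: Callable[[], dict], **kwargs: Any):
        # `credential_provider` is a hypothetical extra argument: a zero-argument
        # callable returning fresh fsspec credentials (e.g. key/secret/token).
        self._credential_provider = credential_provider
        super().__init__(**kwargs)

    def _refresh_fs(self) -> None:
        creds = self._credential_provider()
        # Clear fsspec's instance cache so we don't get back a filesystem that
        # was created with expired credentials, then rebuild it.
        type(self._fs).clear_instance_cache()
        self._fs = fsspec.filesystem(self._protocol, **creds)

    def load(self):
        self._refresh_fs()
        return super().load()

    def save(self, data) -> None:
        self._refresh_fs()
        super().save(data)
```

Since plain YAML can't express a callable, the `credential_provider` would have to be injected programmatically (for example by a hook that registers the dataset) or come from something like the resolver discussed earlier.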