# questions
b
Hey everyone, how can I access the current KedroSession from a node? My use case: I want to pass an unspecified number of datasets into a pipeline. The current solution I've come up with is to pass the dataset names as a list and then use the session to get the DataCatalog and read the datasets from there.
w
Node functions are not supposed to know about data loading or saving implementation details, so I think you’re not supposed to access the session directly, but you can access the catalog from a hook and modify any node inputs like this: https://kedro.readthedocs.io/en/stable/hooks/examples.html#modify-node-inputs-using-before-node-run-hook
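Roughly following that docs example, a sketch of such a hook might look like this (the node and dataset names below are placeholders, and the hook would be registered in your project's `settings.py` via `HOOKS`):
```python
from typing import Any, Dict, Optional

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class NodeInputReplacementHook:
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog
    ) -> Optional[Dict[str, Any]]:
        # Returning a dict of {input_name: new_value} overrides those inputs
        # for this node only; inputs not in the dict are left untouched.
        if node.name == "my_node":
            return {"first_input": catalog.load("some_other_dataset")}
        return None
```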
👍 1
b
Thanks! Yeah I know it's considered unsafe, but there's just a lot of repetition when reusing this particular pipeline that I could eliminate with this method
d
You want to pass an unspecified number of datasets to a node, or you want to run a pipeline multiple times (i.e. for each of an unspecified number of input datasets)?
If you want to pass an unspecified number of datasets to a node, perhaps consider `PartitionedDataSet`?
If you want to run a pipeline multiple times, perhaps try dynamically composing your overall pipeline by using `pipeline` in a loop.
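For the second option, a minimal sketch of composing the pipeline dynamically (the `process` function and dataset names are made up):
```python
from kedro.pipeline import Pipeline, node, pipeline


def process(df):
    ...  # whatever each dataset needs


def create_pipeline(dataset_names) -> Pipeline:
    template = Pipeline([node(process, inputs="input_df", outputs="output_df")])
    # Stamp out one namespaced copy of the template per input dataset.
    return sum(
        (
            pipeline(
                template,
                inputs={"input_df": name},
                outputs={"output_df": f"{name}_processed"},
                namespace=name,
            )
            for name in dataset_names
        ),
        Pipeline([]),
    )
```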
b
@Deepyaman Datta on the node side I've been able to just use `*args` for my use case, but I still have to specify them on the pipeline side as inputs to the node. My understanding of `PartitionedDataSet` is that it's just a distributed dataset? Ultimately what I'd like to do is write a parameter like:
```yaml
Datasets:
  - Ds1
  - Ds2
  - DS3
  ...
```
And pass that to a pipeline, and it will use all three of those datasets; one example would be to merge all of the datasets provided on a specific key.
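For reference, here's roughly what that looks like today with `*args` (the node function, parameter, and dataset names are just illustrative):
```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def merge_on_key(key: str, *dfs: pd.DataFrame) -> pd.DataFrame:
    merged = dfs[0]
    for df in dfs[1:]:
        merged = merged.merge(df, on=key)
    return merged


# Every dataset still has to be listed explicitly as a node input.
merge_pipeline = Pipeline(
    [
        node(
            merge_on_key,
            inputs=["params:merge_key", "Ds1", "Ds2", "DS3"],
            outputs="merged",
        )
    ]
)
```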
d
`PartitionedDataSet` isn't really distributed. If DS1, DS2, DS3 are all under the same path:
```
path/to/pds/ds1.csv
path/to/pds/ds2.csv
path/to/pds/ds3.csv
```
And you define a catalog entry:
```yaml
my_pds:
  type: PartitionedDataSet
  dataset:
    type: pandas.CSVDataSet
  path: path/to/pds
  filename_suffix: .csv
```
You can use `my_pds` as a node input and iterate over the 3 dataframes in there.
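For example, a rough sketch of a node that merges every partition on a key (function and parameter names are just placeholders):
```python
from functools import reduce

import pandas as pd


def merge_partitions(partitions: dict, key: str) -> pd.DataFrame:
    # A PartitionedDataSet loads as a dict of {partition_id: load_callable};
    # calling each callable returns the underlying dataframe.
    dataframes = [load() for load in partitions.values()]
    return reduce(lambda left, right: left.merge(right, on=key), dataframes)
```
You'd wire it up with something like `node(merge_partitions, inputs=["my_pds", "params:merge_key"], outputs="merged")`.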
👍 1
i
Plus one to Deepyaman’s response. We’re using this now to collect flat files from several third party entities and load them into a pipeline.
b
So they don't have to have the same schema? Very interesting. Thanks!
i
Ah! I missed that part. It sure helps if the schemas are identical! But still, if any information is stored in the file names, you could handle different behaviors based on the file name, as long as all the partitions are readable by the same Kedro dataset class's load function (e.g. CSVDataSet).
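Something roughly like this, for instance (the column name is invented):
```python
import pandas as pd


def combine_files(partitions: dict) -> pd.DataFrame:
    frames = []
    for partition_id, load in partitions.items():
        df = load()
        # The partition id is the file name (minus the suffix), so it can
        # drive per-source behavior or simply be kept as a column.
        df["source"] = partition_id
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```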
👍 1