# questions
b
Hey everyone, how can I access the current KedroSession from a node? My use case: I want to pass an unspecified number of datasets into a pipeline. The current solution I've come up with is to pass the dataset names as a list and then use the session to get the DataCatalog and read the datasets from there.
w
Node functions are not supposed to know about data loading or saving implementation details, so I think you’re not supposed to access the session directly, but you can access the catalog from a hook and modify any node inputs like this: https://kedro.readthedocs.io/en/stable/hooks/examples.html#modify-node-inputs-using-before-node-run-hook
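Roughly following that docs example, a sketch of such a hook might look like this (the node and dataset names below are placeholders, and the hook would be registered in your project's `settings.py` via `HOOKS`):
```python
from typing import Any, Dict, Optional

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class NodeInputReplacementHook:
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog
    ) -> Optional[Dict[str, Any]]:
        # Returning a dict of {input_name: new_value} overrides those inputs
        # for this node only; inputs not in the dict are left untouched.
        if node.name == "my_node":
            return {"first_input": catalog.load("some_other_dataset")}
        return None
```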
👍 1
b
Thanks! Yeah I know it's considered unsafe, but there's just a lot of repetition when reusing this particular pipeline that I could eliminate with this method
d
You want to pass an unspecified number of datasets to a node, or you want to run a pipeline multiple times (i.e. for each of an unspecified number of input datasets)?
If you want to pass an unspecified number of datasets to a node, perhaps consider `PartitionedDataSet`?
If you want to run a pipeline multiple times, perhaps try dynamically composing your overall pipeline by using `pipeline` in a loop.
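For the second option, a minimal sketch of composing the pipeline dynamically (the `process` function and dataset names are made up):
```python
from kedro.pipeline import Pipeline, node, pipeline


def process(df):
    ...  # whatever each dataset needs


def create_pipeline(dataset_names) -> Pipeline:
    template = Pipeline([node(process, inputs="input_df", outputs="output_df")])
    # Stamp out one namespaced copy of the template per input dataset.
    return sum(
        (
            pipeline(
                template,
                inputs={"input_df": name},
                outputs={"output_df": f"{name}_processed"},
                namespace=name,
            )
            for name in dataset_names
        ),
        Pipeline([]),
    )
```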
b
@Deepyaman Datta on the node side I've been able to just use `*args` for my use case, but I still have to specify them on the pipeline side as inputs to the node. My understanding of `PartitionedDataSet` is that it's just a distributed dataset? Ultimately what I'd like to do is write a parameter like:
```yaml
Datasets:
  - Ds1
  - Ds2
  - DS3
  ...
```
And pass that to a pipeline, and it will use all three of those datasets; one example would be to merge all of the datasets provided on a specific key.
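For reference, here's roughly what that looks like today with `*args` (the node function, parameter, and dataset names are just illustrative):
```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def merge_on_key(key: str, *dfs: pd.DataFrame) -> pd.DataFrame:
    merged = dfs[0]
    for df in dfs[1:]:
        merged = merged.merge(df, on=key)
    return merged


# Every dataset still has to be listed explicitly as a node input.
merge_pipeline = Pipeline(
    [
        node(
            merge_on_key,
            inputs=["params:merge_key", "Ds1", "Ds2", "DS3"],
            outputs="merged",
        )
    ]
)
```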
d
`PartitionedDataSet` isn't really distributed. If DS1, DS2, DS3 are all under the same path:
```
path/to/pds/ds1.csv
path/to/pds/ds2.csv
path/to/pds/ds3.csv
```
And you define a catalog entry:
```yaml
my_pds:
  type: PartitionedDataSet
  dataset:
    type: pandas.CSVDataSet
  path: path/to/pds
  filename_suffix: .csv
```
You can use `my_pds` as a node input and iterate over the 3 dataframes in there.
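For example, a rough sketch of a node that merges every partition on a key (function and parameter names are just placeholders):
```python
from functools import reduce

import pandas as pd


def merge_partitions(partitions: dict, key: str) -> pd.DataFrame:
    # A PartitionedDataSet loads as a dict of {partition_id: load_callable};
    # calling each callable returns the underlying dataframe.
    dataframes = [load() for load in partitions.values()]
    return reduce(lambda left, right: left.merge(right, on=key), dataframes)
```
You'd wire it up with something like `node(merge_partitions, inputs=["my_pds", "params:merge_key"], outputs="merged")`.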
👍 1
i
Plus one to Deepyaman’s response. We’re using this now to collect flat files from several third party entities and load them into a pipeline.
b
So they don't have to have the same schema? Very interesting. Thanks!
i
Ah! I missed that part. It sure helps if the schemas are identical! But still, if any information is stored in the file names, you could handle different behaviors based on the file name, as long as all the partitions are readable by the same Kedro dataset class's load function (e.g. CSVDataSet).
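Something roughly like this, for instance (the column name is invented):
```python
import pandas as pd


def combine_files(partitions: dict) -> pd.DataFrame:
    frames = []
    for partition_id, load in partitions.items():
        df = load()
        # The partition id is the file name (minus the suffix), so it can
        # drive per-source behavior or simply be kept as a column.
        df["source"] = partition_id
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```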
👍 1