# questions
a
I’m inexperienced, so this is a basic question. I’m trying to add datasets programmatically. I’ve made a catalog.py file that contains:
```python
from kedro.io import DataCatalog
from kedro.io import PartitionedDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.config import ConfigLoader

conf_paths = ['conf/base', 'conf/local']
conf_loader = ConfigLoader(conf_paths)
atlas_regions = conf_loader.get('atlas_regions*')  # A .yml file consisting of regions with names

catalog_dictionary = {}
for region in atlas_regions['regions']:
    name = region['name']
    # catalog_dictionary[f'{name}_data_right'] = PartitionedDataSet(
    #     path='../ClinicalDTI/R_VIM/',
    #     dataset='programmatic_datasets.io.nifti.NIfTIDataSet',
    #     filename_suffix=f'seedmasks/{name}_R_T1.nii.gz')
    catalog_dictionary[f'{name}_data_right'] = CSVDataSet(filepath="../data/01_raw/iris.csv")
    # catalog_dictionary[f'{name}_data_right_output'] = CSVDataSet(filepath="../data/01_raw/iris.csv")

io = DataCatalog(catalog_dictionary)
print(io.list())
```
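For anyone reading along without a Kedro environment, the per-region naming pattern in the snippet above can be sketched with plain dicts standing in for Kedro dataset objects (the region names below are hypothetical examples, not from the original config):

```python
# Sketch of the per-region catalog pattern; plain dicts stand in for
# Kedro dataset objects so this runs without a Kedro install.
# The region names here are hypothetical examples.
atlas_regions = {"regions": [{"name": "VIM"}, {"name": "GPi"}]}

# Build one catalog entry per region, keyed by a name derived from the region.
catalog_dictionary = {
    f"{region['name']}_data_right": {
        "type": "pandas.CSVDataSet",
        "filepath": f"../data/01_raw/{region['name']}.csv",
    }
    for region in atlas_regions["regions"]
}

print(sorted(catalog_dictionary))  # prints ['GPi_data_right', 'VIM_data_right']
```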
(Kedro version 0.17.7.) Running catalog.py prints the expected list of datasets, but what do I need to do to be able to use these datasets in a pipeline?
d
Hi @Anjali Datta - since you say you’re new to Kedro, I’d highly recommend you follow the tutorials, as this isn’t the recommended approach: https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html
d
@datajoely I think this use case isn't covered by Spaceflights, because the data layout is complex (a configurable set of regions, a `PartitionedDataSet` for each region based on the region name, and presumably a pipeline run for each region to create output for that region).
@Anjali Datta I've created a quick-and-dirty example of how you can have a dynamic catalog + pipelines using Jinja (see https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#jinja2-support). This is the diff on top of just creating a new project named "Jinja Example": https://github.com/deepyaman/programmatic-pipelines/commit/cfe7b1afe5fb8e013d7b4568eaa601246538c7b5#diff-0eea4f4e49677291b[…]2cf4f017733483d62adbff5
Cons of this approach:
• I've defined `regions` in two places, because you can't use something like `globals.yml` inside Jinja.
• In Kedro 0.19, I think a new OmegaConfLoader will be the preferred way to go, and I don't think that supports Jinja. I'm not 100% sure how this use case would be best handled there.
• Too much Jinja makes pipelines confusing (I think this use case for reused modular pipelines is fair, though).
If you aren't familiar with namespacing/reuse of modular pipelines, see https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#using-the-modular-pipeline-wrapper-to-provide-overrides
I can try to add an example of keeping the pipeline definition in Python and using `with pipelines` as an alternative, even though I don't think it's well documented.
P.S. I used Kedro 0.18.6, which includes some features like pipeline autodiscovery (https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html#pipeline-autodiscovery); if you try to replicate this with 0.17.7, you will need to add "data_processing" explicitly.
a
Thank you so much, @datajoely and @Deepyaman Datta!!