# questions
I’m inexperienced, so this is a basic question. I’m trying to add datasets programmatically. I’ve made a catalog.py file that contains:
from kedro.io import DataCatalog
from kedro.io import PartitionedDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.config import ConfigLoader
conf_paths = ['conf/base', 'conf/local']
conf_loader = ConfigLoader(conf_paths)
atlas_regions = conf_loader.get('atlas_regions*') # A .yml file consisting of regions with names
catalog_dictionary = {}
for region in atlas_regions['regions']:
    name = region['name']
    # catalog_dictionary[f'{name}_data_right'] = PartitionedDataSet(path = '../ClinicalDTI/R_VIM/', \
    #     dataset = 'programmatic_datasets.io.nifti.NIfTIDataSet', filename_suffix = f'seedmasks/{name}_R_T1.nii.gz')
    catalog_dictionary[f'{name}_data_right'] = CSVDataSet(filepath = "../data/01_raw/iris.csv")
    # catalog_dictionary[f'{name}_data_right_output'] = CSVDataSet(filepath = "../data/01_raw/iris.csv")
io = DataCatalog(catalog_dictionary)
(Kedro version 0.17.7.) Running catalog.py prints the expected list of datasets, but what do I need to do to be able to use these datasets in a pipeline?
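For reference, the same loop can instead emit plain dictionary entries in the shape of a catalog.yml file, which is the form Kedro's config loader ultimately parses; this is a minimal sketch using hypothetical region names, not the project's real atlas config:

```python
# Sketch: build catalog.yml-style config entries per region.
# "thalamus"/"putamen" are stand-ins for atlas_regions['regions'] entries.
regions = [{"name": "thalamus"}, {"name": "putamen"}]

catalog_config = {}
for region in regions:
    name = region["name"]
    # Same shape Kedro would read from catalog.yml for a pandas CSVDataSet.
    catalog_config[f"{name}_data_right"] = {
        "type": "pandas.CSVDataSet",
        "filepath": "../data/01_raw/iris.csv",
    }

print(sorted(catalog_config))  # → ['putamen_data_right', 'thalamus_data_right']
```

Generating the config dict rather than dataset objects keeps the entries in a form that can also be dumped to YAML and picked up by a normal Kedro run.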
Hi @Anjali Datta - since you say you’re new to Kedro, I’d highly recommend you follow the tutorials, since this isn’t the recommended approach: https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html
@datajoely I think this use case isn't covered by Spaceflights, because the data layout is complex: there's a configurable set of regions, datasets are defined for each region based on the region name, and presumably you want to run a pipeline for each region to create output for that region.
@Anjali Datta I've created a quick-and-dirty example of how you can have a dynamic catalog + pipelines using Jinja (see https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#jinja2-support). This is the diff on top of just creating a new project named "Jinja Example": https://github.com/deepyaman/programmatic-pipelines/commit/cfe7b1afe5fb8e013d7b4568eaa601246538c7b5#diff-0eea4f4e49677291b[…]2cf4f017733483d62adbff5 Cons of this approach:
• I've defined in two places, because you can't use something like inside Jinja.
• In Kedro 0.19, I think a new OmegaConfLoader will be the preferred way to go, and I don't think that supports Jinja. I'm not 100% sure how this use case would be best handled there.
• Too much Jinja makes pipelines confusing (I think this use case of reused modular pipelines is fair, though).
If you aren't familiar with namespacing/reuse of modular pipelines, see https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#using-the-modular-pipeline-wrapper-to-provide-overrides I can try to add an example of keeping the pipeline definition in Python and using with pipelines as an alternative, even though I don't think it's well documented.
P.S. I used Kedro 0.18.6, which includes some features like pipeline autodiscovery (https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html#pipeline-autodiscovery); if you try to replicate this with 0.17.7, you will need to add "data_processing" explicitly.
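To make the Jinja idea concrete, a templated catalog.yml could look roughly like this (a sketch with hypothetical region names and file paths, not the linked example verbatim):

```yaml
{# conf/base/catalog.yml — rendered by Kedro's Jinja2 config support #}
{% for name in ["thalamus", "putamen"] %}
{{ name }}_data_right:
  type: pandas.CSVDataSet
  filepath: data/01_raw/{{ name }}_right.csv
{% endfor %}
```

The loop expands into one catalog entry per region at config-load time, which is how the region list ends up defined in both the catalog template and the pipeline code (the duplication noted in the first bullet above).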
Thank you so much, @datajoely and @Deepyaman Datta!!