# questions
a
I’m inexperienced, so this is a basic question. I’m trying to add datasets programmatically. I’ve made a catalog.py file that contains:
```python
from kedro.io import DataCatalog
from kedro.io import PartitionedDataSet
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.config import ConfigLoader

conf_paths = ['conf/base', 'conf/local']
conf_loader = ConfigLoader(conf_paths)
atlas_regions = conf_loader.get('atlas_regions*')  # A .yml file consisting of regions with names

catalog_dictionary = {}
for region in atlas_regions['regions']:
    name = region['name']
    # catalog_dictionary[f'{name}_data_right'] = PartitionedDataSet(
    #     path='../ClinicalDTI/R_VIM/',
    #     dataset='programmatic_datasets.io.nifti.NIfTIDataSet',
    #     filename_suffix=f'seedmasks/{name}_R_T1.nii.gz')
    catalog_dictionary[f'{name}_data_right'] = CSVDataSet(filepath="../data/01_raw/iris.csv")
    # catalog_dictionary[f'{name}_data_right_output'] = CSVDataSet(filepath="../data/01_raw/iris.csv")

io = DataCatalog(catalog_dictionary)
print(io.list())
```
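For anyone reading along without a Kedro environment, the per-region naming pattern in the snippet above can be sketched with plain dicts standing in for Kedro dataset objects (the region names below are hypothetical examples, not from the original config):

```python
# Sketch of the per-region catalog pattern; plain dicts stand in for
# Kedro dataset objects so this runs without a Kedro install.
# The region names here are hypothetical examples.
atlas_regions = {"regions": [{"name": "VIM"}, {"name": "GPi"}]}

# Build one catalog entry per region, keyed by a name derived from the region.
catalog_dictionary = {
    f"{region['name']}_data_right": {
        "type": "pandas.CSVDataSet",
        "filepath": f"../data/01_raw/{region['name']}.csv",
    }
    for region in atlas_regions["regions"]
}

print(sorted(catalog_dictionary))  # prints ['GPi_data_right', 'VIM_data_right']
```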
(Kedro version 0.17.7.) Running catalog.py prints the expected list of datasets, but what do I need to do to be able to use these datasets in a pipeline?
d
Hi @Anjali Datta - since you say you’re new to Kedro, I’d highly recommend you follow the tutorials, as this isn’t the recommended approach: https://docs.kedro.org/en/stable/tutorial/spaceflights_tutorial.html
d
@datajoely I think this use case isn't covered by Spaceflights, because the data layout is complex (a configurable set of regions, a `PartitionedDataSet` for each region based on the region name, and presumably a pipeline run for each region to create output for that region).
@Anjali Datta I've created a quick-and-dirty example of how you can have a dynamic catalog + pipelines using Jinja (see https://docs.kedro.org/en/stable/kedro_project_setup/configuration.html#jinja2-support). This is the diff on top of just creating a new project named "Jinja Example": https://github.com/deepyaman/programmatic-pipelines/commit/cfe7b1afe5fb8e013d7b4568eaa601246538c7b5#diff-0eea4f4e49677291b[…]2cf4f017733483d62adbff5
Cons of this approach:
• I've defined `regions` in two places, because you can't use something like `globals.yml` inside Jinja.
• In Kedro 0.19, I think a new OmegaConfLoader will be the preferred way to go, and I don't think that supports Jinja. I'm not 100% sure how this use case would be best handled there.
• Too much Jinja makes pipelines confusing (I think this use case for reused modular pipelines is fair, though).
If you aren't familiar with namespacing/reuse of modular pipelines, see https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#using-the-modular-pipeline-wrapper-to-provide-overrides
I can try to add an example of keeping the pipeline definition in Python and using `with pipelines` as an alternative, even though I don't think it's well documented.
P.S. I used Kedro 0.18.6, which includes some features like pipeline autodiscovery (https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_registry.html#pipeline-autodiscovery); if you try to replicate this with 0.17.7, you will need to add "data_processing" explicitly.
a
Thank you so much, @datajoely and @Deepyaman Datta!!