# questions
b
Hello Kedrorians 🖖! I'm having issues figuring out the proper way to work with an arbitrarily large number of similar datasets. Since the files are all similar, I would like not to have a verbose pipeline with every dataset name written out explicitly, and the same goes for the catalog. I've seen the dataset factories feature, and together with `PartitionedDataset` I made an abomination that just works, but I'm sure there are better ways! To make this message more useful I'll make the example concrete with my case: I'm trying to process in one go all the MapSpam rasters I have in a bucket, say s3://spam-data/production, which contains 54 files with the pattern `spam2010V2r0_global_P_{material}_A.tif`. In the pipeline I have a simple node called `to_h3` that only takes the dataset as a parameter. What I did is this (pasted in the thread to not spam the channel):
A data catalog with a `PartitionedDataset` and factory entries like:
spam_production.preprocessed:
  type: kedro_datasets.partitions.PartitionedDataset
  path: s3://spam-data/production/spam2010v2r0_global_prod/
  dataset:
    type: geo.datasets.xarray_dataset.RasterDataset
  filename_suffix: ".tif"

"spam2010V2r0_global_P_{material}_A#tif":
  type: geo.datasets.xarray_dataset.RasterDataset
  filepath: s3://spam-data/production/spam2010V2r0_global_P_{material}_A.tif
  metadata:
    kedro-viz:
      layer: intermediate

"spam2010V2r0_global_P_{material}_A#h3":
  type: pandas.ParquetDataset
  filepath: data/03_primary/spam2010V2r0_global_P_{material}_A.parquet
  metadata:
    kedro-viz:
      layer: intermediate
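(Editorial aside, not part of the original message: the factory entries above mean that any dataset name matching the pattern is resolved on demand, so no per-material catalog entry is needed. A minimal sketch of that resolution, assuming `catalog` is the project's `DataCatalog` and `WHEA` is a hypothetical material code:)

```python
# Sketch of how the dataset factory resolves a name; "WHEA" is an assumed
# material code and `catalog` is assumed to be the project's DataCatalog.
# The name below matches "spam2010V2r0_global_P_{material}_A#tif", so the
# catalog builds a RasterDataset whose filepath is
# s3://spam-data/production/spam2010V2r0_global_P_WHEA_A.tif.
raster = catalog.load("spam2010V2r0_global_P_WHEA_A#tif")
```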
Then a pipeline maker uses the `PartitionedDataset` to list all the needed datasets, so I can write the nodes as:
def make_spam_pipeline(**kwargs) -> Pipeline:
    # Instantiate an `OmegaConfigLoader` instance with the location of your project configuration.
    conf_path = str(settings.CONF_SOURCE)
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path, base_env="base", default_run_env="local"
    )

    # Fetch the catalog with resolved credentials from the configuration.
    catalog = DataCatalog.from_config(
        catalog=conf_loader["catalog"], credentials=conf_loader["credentials"]
    )
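    # A PartitionedDataset loads as {partition_id: load_callable}, where the id is
    # the file name relative to `path` without the ".tif" suffix; only the keys are
    # needed here to generate one node per raster.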
    spams = catalog.load("spam_production.preprocessed")
    nodes = [
        node(func=to_h3, inputs=f"{ds}#tif", outputs=f"{ds}#h3")
        for ds in spams.keys()
    ]
    return pipeline(nodes, tags=["production", "preprocess", "spam"])
The problem is that it works! But I don't think this is how I'm supposed to use either the factories or the `PartitionedDataset`. Do you have any clue how I could make something similar but more ergo[kedro]nomic?
d
Is there a reason `"spam2010V2r0_global_P_{material}_A#tif"` and `"spam2010V2r0_global_P_{material}_A#h3"` can't also be `PartitionedDataset` instances? Otherwise, yes, the partitioned-to-many-datasets experience requires non-ideal node generation, like what you've done. It would be good to be able to support something more idiomatic, especially as these cases crop up now and then.
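(To make that suggestion concrete, a minimal sketch, not from the thread, of what a fully partitioned version of the node could look like if the `#tif` and `#h3` entries were replaced by `PartitionedDataset`s; the output dataset name `spam_production.h3` is invented for illustration, and `to_h3` is the existing single-dataset function:)

```python
from typing import Any, Callable


def to_h3_partitions(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Apply the existing to_h3 transform to every partition.

    Kedro loads a PartitionedDataset input as {partition_id: load_callable};
    returning {partition_id: data} lets a PartitionedDataset output save each
    result under the same partition id.
    """
    return {pid: to_h3(load()) for pid, load in partitions.items()}


# node(
#     func=to_h3_partitions,
#     inputs="spam_production.preprocessed",
#     outputs="spam_production.h3",  # hypothetical PartitionedDataset of parquet files
# )
```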
b
Don't know for sure... I want to reuse the same node (`to_h3`) that I use for simple single datasets in other places. If I wanted to use only `PartitionedDataset` instances, I would have to write a special node to "unpack" each component, I guess? But then it's not clear how to connect these many datasets to the next node, which computes on a single dataset.
d
> I want to reuse the same node (`to_h3`) that I use for simple single datasets in other places.
Makes sense. There is always the option of a helper function that's used both in a single-dataset node and in a partition-processing node, but I see your point: nothing feels super clean. Would love to get some more eyes on this, as I think it fits into some existing themes (dynamic pipelines, something about `PartitionedDataset`). @Nok Lam Chan @Merel @Juan Luis you probably know better if there's some existing thinking here.
👍 1
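(For completeness, a rough sketch of the helper-function idea from the last message, under the assumption that the heavy lifting moves into a shared function; the names `_to_h3_frame` and `to_h3_partitioned` are invented for illustration:)

```python
from typing import Any, Callable


def _to_h3_frame(raster: Any) -> Any:
    """Shared core: convert one raster to an H3-indexed table (placeholder body)."""
    raise NotImplementedError


def to_h3(raster: Any) -> Any:
    """Node for the existing single-dataset pipelines."""
    return _to_h3_frame(raster)


def to_h3_partitioned(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Node for PartitionedDataset inputs/outputs, reusing the same core per partition."""
    return {pid: _to_h3_frame(load()) for pid, load in partitions.items()}
```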