# questions
b
Hello Kedrorians 🖖! I'm having issues figuring out the proper way to work with an arbitrarily large number of similar datasets. Since the files are all similar, I would like not to have a verbose pipeline with every dataset name written out explicitly, and the same goes for the catalog. I've seen the dataset factories feature, and together with `PartitionedDataset` I made an abomination that just works, but I'm sure there are better ways! To make this message more useful I'll make the example concrete with my case: I'm trying to process in one go all the MapSpam rasters I have in a bucket, say s3://spam-data/production, which contains 54 files with the pattern `spam2010V2r0_global_P_{material}_A.tif`. In the pipeline I have a simple node called `to_h3` that only takes the dataset as a parameter. What I did is this (pasted in the thread to not spam the channel):
A data catalog with a `PartitionedDataset` and factory entries like:
spam_production.preprocessed:
  type: kedro_datasets.partitions.PartitionedDataset
  path: s3://spam-data/production/spam2010v2r0_global_prod/
  dataset:
    type: geo.datasets.xarray_dataset.RasterDataset
  filename_suffix: ".tif"

"spam2010V2r0_global_P_{material}_A#tif":
  type: geo.datasets.xarray_dataset.RasterDataset
  filepath: s3://spam-data/production/spam2010V2r0_global_P_{material}_A.tif
  metadata:
    kedro-viz:
      layer: intermediate

"spam2010V2r0_global_P_{material}_A#h3":
  type: pandas.ParquetDataset
  filepath: data/03_primary/spam2010V2r0_global_P_{material}_A.parquet
  metadata:
    kedro-viz:
      layer: intermediate
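(Editorial aside, not part of the original message: the factory entries above mean that any dataset name matching the pattern is resolved on demand, so no per-material catalog entry is needed. A minimal sketch of that resolution, assuming `catalog` is the project's `DataCatalog` and `WHEA` is a hypothetical material code:)

```python
# Sketch of how the dataset factory resolves a name; "WHEA" is an assumed
# material code and `catalog` is assumed to be the project's DataCatalog.
# The name below matches "spam2010V2r0_global_P_{material}_A#tif", so the
# catalog builds a RasterDataset whose filepath is
# s3://spam-data/production/spam2010V2r0_global_P_WHEA_A.tif.
raster = catalog.load("spam2010V2r0_global_P_WHEA_A#tif")
```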
Then a pipeline maker uses the `PartitionedDataset` to list all the needed datasets, so I can write the nodes as:
def make_spam_pipeline(**kwargs) -> Pipeline:
    # Instantiate an `OmegaConfigLoader` instance with the location of your project configuration.
    conf_path = str(settings.CONF_SOURCE)
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path, base_env="base", default_run_env="local"
    )

    # Fetch the catalog with resolved credentials from the configuration.
    catalog = DataCatalog.from_config(
        catalog=conf_loader["catalog"], credentials=conf_loader["credentials"]
    )
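    # A PartitionedDataset loads as {partition_id: load_callable}, where the id is
    # the file name relative to `path` without the ".tif" suffix; only the keys are
    # needed here to generate one node per raster.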
    spams = catalog.load("spam_production.preprocessed")
    nodes = [
        node(func=to_h3, inputs=f"{ds}#tif", outputs=f"{ds}#h3")
        for ds in spams.keys()
    ]
    return pipeline(nodes, tags=["production", "preprocess", "spam"])
The problem is that it works! But I don't think this is how I'm supposed to use either the factories or the `PartitionedDataset`. Do you have any clue how I could make something similar but more ergo[kedro]nomic?
d
Is there a reason `"spam2010V2r0_global_P_{material}_A#tif"` and `"spam2010V2r0_global_P_{material}_A#h3"` can't also be `PartitionedDataset` instances? Otherwise, yes, the partitioned-to-many-datasets experience requires non-ideal node generation, like what you've done. It would be good to be able to support something more idiomatic, especially as these cases crop up now and then.
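(To make that suggestion concrete, a minimal sketch, not from the thread, of what a fully partitioned version of the node could look like if the `#tif` and `#h3` entries were replaced by `PartitionedDataset`s; the output dataset name `spam_production.h3` is invented for illustration, and `to_h3` is the existing single-dataset function:)

```python
from typing import Any, Callable


def to_h3_partitions(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Apply the existing to_h3 transform to every partition.

    Kedro loads a PartitionedDataset input as {partition_id: load_callable};
    returning {partition_id: data} lets a PartitionedDataset output save each
    result under the same partition id.
    """
    return {pid: to_h3(load()) for pid, load in partitions.items()}


# node(
#     func=to_h3_partitions,
#     inputs="spam_production.preprocessed",
#     outputs="spam_production.h3",  # hypothetical PartitionedDataset of parquet files
# )
```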
b
Don't know for sure... I want to reuse the same node (`to_h3`) that I use for simple single datasets in other places. If I wanted to use only `PartitionedDataset` instances, I would have to write a special node to "unpack" each component, I guess? But then it's not clear how to connect these many datasets to the next node, which computes on a single dataset.
d
> I want to reuse the same node (`to_h3`) that I use for simple single datasets in other places.
Makes sense. There is always the option of a helper function that's used both in a single-dataset node and in a partition-processing node, but I see your point: nothing feels super clean. Would love to get some more eyes on this, as I think it fits into some existing themes (dynamic pipelines, something about `PartitionedDataset`). @Nok Lam Chan @Merel @Juan Luis you probably know better if there's some existing thinking here.
👍 1
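(For completeness, a rough sketch of the helper-function idea from the last message, under the assumption that the heavy lifting moves into a shared function; the names `_to_h3_frame` and `to_h3_partitioned` are invented for illustration:)

```python
from typing import Any, Callable


def _to_h3_frame(raster: Any) -> Any:
    """Shared core: convert one raster to an H3-indexed table (placeholder body)."""
    raise NotImplementedError


def to_h3(raster: Any) -> Any:
    """Node for the existing single-dataset pipelines."""
    return _to_h3_frame(raster)


def to_h3_partitioned(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Node for PartitionedDataset inputs/outputs, reusing the same core per partition."""
    return {pid: _to_h3_frame(load()) for pid, load in partitions.items()}
```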