Biel Stela
01/31/2024, 3:51 PM
`PartitionedDataset`
I made an abomination that just works, but I'm sure there are better ways! For the sake of making this message more useful, I'll make the example explicit with my case.
I'm trying to process in one go all the rasters in MapSpam that I have in a bucket, say `s3://spam-data/production`, which contains 54 files with the pattern `spam2010V2r0_global_P_{material}_A.tif`. In the pipeline I have a simple node called `to_h3` that only takes the dataset as a parameter.
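For context, a minimal sketch of what such a node could look like, assuming the custom `RasterDataset` loads an `xarray.DataArray` and that the h3 v4 API is available; the body below is purely illustrative, not the actual implementation:

```python
import h3  # assumes h3-py v4 (v3 named this function geo_to_h3)
import pandas as pd
import xarray as xr


def to_h3(raster: xr.DataArray, resolution: int = 6) -> pd.DataFrame:
    # Flatten the raster to (y, x, value) rows, drop nodata cells, and
    # index each cell centre with an H3 id.
    df = (
        raster.squeeze()  # drop a singleton band dimension if present
        .to_dataframe(name="value")
        .reset_index()
        .dropna(subset=["value"])
    )
    df["h3"] = [
        h3.latlng_to_cell(lat, lng, resolution)
        for lat, lng in zip(df["y"], df["x"])
    ]
    return df[["h3", "value"]]
```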
What I did is this (pasted in the thread to not spam the channel):

Biel Stela
01/31/2024, 3:53 PM

```yaml
spam_production.preprocessed:
  type: kedro_datasets.partitions.PartitionedDataset
  path: s3://spam-data/production/spam2010v2r0_global_prod/
  dataset:
    type: geo.datasets.xarray_dataset.RasterDataset
  filename_suffix: ".tif"

"spam2010V2r0_global_P_{material}_A#tif":
  type: geo.datasets.xarray_dataset.RasterDataset
  filepath: s3://spam-data/production/spam2010V2r0_global_P_{material}_A.tif
  metadata:
    kedro-viz:
      layer: intermediate

"spam2010V2r0_global_P_{material}_A#h3":
  type: pandas.ParquetDataset
  filepath: data/03_primary/spam2010V2r0_global_P_{material}_A.parquet
  metadata:
    kedro-viz:
      layer: intermediate
```
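For context, the two quoted entries are Kedro dataset factories: a node input such as `spam2010V2r0_global_P_maize_A#tif` (taking `maize` as an example material) matches the `{material}` placeholder in the pattern, which is then substituted into the body, so the resolved `filepath` becomes `s3://spam-data/production/spam2010V2r0_global_P_maize_A.tif`. Any placeholder used in the body must be one of the names declared in the pattern.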
and then a pipeline maker that uses the `PartitionedDataset` to list all the needed datasets, so I can write the nodes as:
```python
from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node, pipeline


def make_spam_pipeline(**kwargs) -> Pipeline:
    # Instantiate an `OmegaConfigLoader` with the location of the project configuration.
    conf_path = str(settings.CONF_SOURCE)
    conf_loader = OmegaConfigLoader(
        conf_source=conf_path, base_env="base", default_run_env="local"
    )
    # Fetch the catalog with resolved credentials from the configuration.
    catalog = DataCatalog.from_config(
        catalog=conf_loader["catalog"], credentials=conf_loader["credentials"]
    )
    # List the partitions so one node per raster can be generated.
    spams = catalog.load("spam_production.preprocessed")
    nodes = [node(func=to_h3, inputs=f"{ds}#tif", outputs=f"{ds}#h3") for ds in spams]
    return pipeline(nodes, tags=["production", "preprocess", "spam"])
```
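Note that loading a `PartitionedDataset` returns a dictionary mapping partition ids (the file paths relative to `path`, with the `.tif` suffix stripped because of `filename_suffix`) to lazy load callables, so iterating over `spams` yields exactly the names that the `#tif` and `#h3` factory patterns above will match.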
Deepyaman Datta
01/31/2024, 4:16 PM"spam2010V2r0_global_P_{material}_A#tif"
and "spam2010V2r0_global_P_{material}_A#h3"
can't also be PartitionedDataset
instances?
Otherwise, yes, the partitioned-to-many-datasets experience requires non-ideal node generation, like what you've done. It would be a good idea to be able to support something more idiomatic, especially as these cases crop up now and then.
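A rough sketch of that suggestion, assuming both catalog entries were declared as `PartitionedDataset` and using a hypothetical wrapper that reuses the same `to_h3` per partition:

```python
from typing import Any, Callable

import pandas as pd


def to_h3_partitions(
    partitions: dict[str, Callable[[], Any]],
) -> dict[str, pd.DataFrame]:
    # A PartitionedDataset input loads as {partition_id: lazy load callable};
    # returning {partition_id: dataframe} makes a PartitionedDataset output
    # write one parquet file per partition.
    return {pid: to_h3(load()) for pid, load in partitions.items()}
```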
Biel Stela
01/31/2024, 4:29 PM
I want to reuse the same node (`to_h3`) I use for simple single datasets in other places. If I wanted to only use `PartitionedDataset` instances, I would have to write a special node to "unpack" each component, I guess? But then it is not clear how I would connect these many datasets to the next node that computes on a single dataset.
Deepyaman Datta
01/31/2024, 5:59 PM
> I want to reuse the same node (`to_h3`) I use for simple single datasets in other places.
Makes sense. There is always the possibility of having a helper function that's used in both a single-dataset node and a partition-processing node, but I see your point -- nothing feels super clean. Would love to get some more eyes on this, as I think it fits into some existing themes (dynamic pipelines, something about `PartitionedDataset`). @Nok Lam Chan @Merel @Juan Luis you probably know better if there's some existing thinking here.
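The helper-function idea could be wired up roughly as below, with hypothetical dataset names; `to_h3` stays a plain reusable function and each node is a thin adapter around it:

```python
from kedro.pipeline import node

# Single-raster case: the plain to_h3 function is the node.
single = node(func=to_h3, inputs="some_single_raster", outputs="some_single_table")

# Whole-bucket case: the to_h3_partitions wrapper sketched above maps the
# same helper over every partition; its output would need its own
# PartitionedDataset catalog entry.
partitioned = node(
    func=to_h3_partitions,
    inputs="spam_production.preprocessed",
    outputs="spam_production.h3",
)
```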