# questions
t
Hi everybody, I'm working with a collection of EML files and I'd like to create a Kedro pipeline to automate a series of transformations on each file. My initial thought was to use PartitionedDataSet and loop through the EML files. However, I'm unsure whether the transformation logic should be encapsulated in a separate Kedro pipeline, with that pipeline then used as a function inside a node of the main pipeline. I could write all the cleaning steps as a single function in the main pipeline, loop over the PartitionedDataSet, and save each cleaned file as a separate JSON (a rough sketch of that alternative follows the mock-up below), but I believe a sub-pipeline might be a better approach. Would this mock-up be a good starting point?
```python
from kedro.pipeline import Pipeline, node
from .nodes import clean_data_step1, clean_data_step2

def create_eml_processing_pipeline(**kwargs):
    return Pipeline(
        [
            node(clean_data_step1, "loaded_data", "cleaned_data_step1"),
            node(clean_data_step2, "cleaned_data_step1", "cleaned_data_step2"),
        ]
    )
```

```python
from kedro.pipeline import Pipeline, node, pipeline
from .pipeline_eml_processing import create_eml_processing_pipeline

def create_pipeline(**kwargs):
    eml_processing_pipeline = create_eml_processing_pipeline()

    return Pipeline(
        [
            node(
                func=lambda partition: partition,
                inputs="input_data@partition",
                outputs="partition_data",
                name="load_partition_node"
            ),
            pipeline(
                eml_processing_pipeline,
                # map the sub-pipeline's dataset names onto this pipeline's datasets
                inputs={"loaded_data": "partition_data"},
                outputs={"cleaned_data_step2": "processed_partition_data"},
                namespace="eml_processing",
            ),
            node(
                func=lambda partition: partition,
                inputs="processed_partition_data",
                outputs="processed_data@partition",
                name="save_partition_node"
            ),
        ]
    )
```
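For comparison, here is a rough sketch of the single-function alternative I mentioned, where one node receives the PartitionedDataSet as a dictionary of lazy load functions and returns a dictionary that gets saved as one JSON per EML file. The `clean_eml_file` helper and the `eml_files` / `cleaned_emails` dataset names are just placeholders, not my actual code:

```python
from typing import Any, Callable, Dict

from kedro.pipeline import Pipeline, node


def clean_eml_file(raw_eml: str) -> dict:
    # Placeholder for the real cleaning/transformation steps.
    return {"body": raw_eml}


def clean_all_emails(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, dict]:
    # A PartitionedDataSet input arrives as {partition_id: load_function};
    # returning a dict saves one output file (here a JSON) per key.
    cleaned = {}
    for partition_id, load_func in partitions.items():
        raw_eml = load_func()  # lazily load a single EML file
        cleaned[partition_id] = clean_eml_file(raw_eml)
    return cleaned


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=clean_all_emails,
                inputs="eml_files",        # PartitionedDataSet of raw EML files
                outputs="cleaned_emails",  # PartitionedDataSet of cleaned JSON files
                name="clean_all_emails_node",
            ),
        ]
    )
```

The sub-pipeline version above is my attempt to split those cleaning steps into separate, reusable nodes instead.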
j
Hi Thomas, your approach of using Kedro pipelines to process EML files seems logical and modular. Creating a sub-pipeline for the transformation steps encapsulates the logic well and allows for better separation of concerns. But what's your motivation for using transcoding datasets (the `@partition` part)? https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-two-different-datasets For more details, refer to the Kedro documentation on the data catalog, pipelines, and partitioned datasets: https://docs.kedro.org/en/stable/data/data_catalog.html#partitioneddataset https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_introduction.html
j
(also @Thomas Bury, if you replace '''python with a proper ``` fence I think your Markdown blocks will render correctly)
👍 1
j
wrong window
t
Thanks for your reply. I don't have a particular motivation for the transcoding. I was uncertain whether using a pipeline as a function within a node could actually work, so I will give it a try. Thanks!