# questions
t
Hi everybody, I'm working with a collection of EML files and I'd like to create a Kedro pipeline to automate a series of transformations on each file. My initial thought was to use PartitionedDataSet and loop through the EML files. However, I'm unsure whether the transformation logic should be encapsulated in a separate Kedro pipeline, with that pipeline then used as a function inside a node of the main pipeline. I could write all the cleaning steps as a single function in the main pipeline, loop over the PartitionedDataSet, and save each cleaned file as a separate JSON (a rough sketch of that alternative follows the mock-up below), but I believe a sub-pipeline might be a better approach. Would this mock-up be a good starting point?
```python
from kedro.pipeline import Pipeline, node
from .nodes import clean_data_step1, clean_data_step2

def create_eml_processing_pipeline(**kwargs):
    return Pipeline(
        [
            node(clean_data_step1, "loaded_data", "cleaned_data_step1"),
            node(clean_data_step2, "cleaned_data_step1", "cleaned_data_step2"),
        ]
    )
```

```python
from kedro.pipeline import Pipeline, node, pipeline
from .pipeline_eml_processing import create_eml_processing_pipeline

def create_pipeline(**kwargs):
    eml_processing_pipeline = create_eml_processing_pipeline()

    return Pipeline(
        [
            node(
                func=lambda partition: partition,
                inputs="input_data@partition",
                outputs="partition_data",
                name="load_partition_node"
            ),
            pipeline(
                eml_processing_pipeline,
                # map the sub-pipeline's dataset names onto this pipeline's datasets
                inputs={"loaded_data": "partition_data"},
                outputs={"cleaned_data_step2": "processed_partition_data"},
                namespace="eml_processing",
            ),
            node(
                func=lambda partition: partition,
                inputs="processed_partition_data",
                outputs="processed_data@partition",
                name="save_partition_node"
            ),
        ]
    )
```
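For comparison, here is a rough sketch of the single-function alternative I mentioned, where one node receives the PartitionedDataSet as a dictionary of lazy load functions and returns a dictionary that gets saved as one JSON per EML file. The `clean_eml_file` helper and the `eml_files` / `cleaned_emails` dataset names are just placeholders, not my actual code:

```python
from typing import Any, Callable, Dict

from kedro.pipeline import Pipeline, node


def clean_eml_file(raw_eml: str) -> dict:
    # Placeholder for the real cleaning/transformation steps.
    return {"body": raw_eml}


def clean_all_emails(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, dict]:
    # A PartitionedDataSet input arrives as {partition_id: load_function};
    # returning a dict saves one output file (here a JSON) per key.
    cleaned = {}
    for partition_id, load_func in partitions.items():
        raw_eml = load_func()  # lazily load a single EML file
        cleaned[partition_id] = clean_eml_file(raw_eml)
    return cleaned


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=clean_all_emails,
                inputs="eml_files",        # PartitionedDataSet of raw EML files
                outputs="cleaned_emails",  # PartitionedDataSet of cleaned JSON files
                name="clean_all_emails_node",
            ),
        ]
    )
```

The sub-pipeline version above is my attempt to split those cleaning steps into separate, reusable nodes instead.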
j
Hi Thomas, your approach of using Kedro pipelines to process EML files seems logical and modular. Creating a sub-pipeline for the transformation steps encapsulates the logic well and allows for better separation of concerns. But what's your motivation for using transcoding datasets (the `@partition` part)? https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-two-different-datasets For more details, refer to the Kedro documentation on the data catalog, pipelines, and partitioned datasets: https://docs.kedro.org/en/stable/data/data_catalog.html#partitioneddataset https://docs.kedro.org/en/stable/nodes_and_pipelines/pipeline_introduction.html
j
(also @Thomas Bury, if you replace '''python with a proper ``` fence I think your Markdown blocks will render correctly)
👍 1
j
wrong window
t
Thanks for your reply. I don't have a particular motivation for the transcoding. I was uncertain whether using a pipeline as a function within a node could actually work, so I will give it a try. Thanks!