# questions
m
I have configured a dynamic pipeline (catalog and nodes) with a hooks file. Locally it's running in a Docker container without problems, but when I push it to AzureML and run it there, even though I can see the whole pipeline (and all dynamically created node names), I receive "pipeline does not contain that .. node". How is this even possible? Does anyone have a clue?
s
Maybe the hooks aren't being registered before the pipeline is accessed in AzureML. Even if the nodes appear in the visualisation, they might not be properly registered for execution. Could we get a bit more information about your environment and versions?
m
@Sajid Alam I use the following locally (Python 3.10.16, in WSL with Ubuntu 24.04):
• kedro==0.19.10
• kedro-azureml==0.9.0
• kedro-datasets==5.1.0
• kedro-docker==0.6.2
• kedro-telemetry==0.6.1
• kedro-viz==10.1.0
and the following in the requirements.txt (for the Docker container pushed to AzureML):
• kedro==0.19.11
• kedro-azureml==0.9.0
• kedro-datasets==6.0.0
• kedro-telemetry==0.6.2
• kedro-viz==10.2.0
The hooks.py contains the following:
class DynamicCatalogHook:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog, **kwargs) -> None:
        for ....
            for n in [name, processed_name, valid_name, warp_name]:
                logger.info(f"Registering dataset: {n}")

            catalog.add(
                name,
                BinaryBlobDataSet(
                    filepath=raw_path,
                    connection_string=connection_string,
                    container=container_name,
                )
            )
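For context, the registration pattern above can be sketched without kedro at all. In this sketch a stub class stands in for kedro's DataCatalog, and the dataset names and BinaryBlobDataSet are replaced with hypothetical placeholders. The property that matters for the AzureML failure mode discussed later in the thread is that after_catalog_created must fire in every process that executes a node, since each remote node runs in its own container:

```python
# A minimal, self-contained sketch of the hook pattern, using a stub
# in place of kedro's DataCatalog so it runs without kedro installed.
class StubCatalog:
    def __init__(self):
        self._datasets = {}

    def add(self, name, dataset):
        self._datasets[name] = dataset

    def exists_in_catalog(self, name):
        return name in self._datasets


class DynamicCatalogHook:
    # In kedro this method would carry @hook_impl and receive the real
    # DataCatalog; the key point is that it must run in *every* process
    # that executes a node, not just where the pipeline is visualised.
    def after_catalog_created(self, catalog):
        for raw_name in ["blob_a", "blob_b"]:  # hypothetical dataset names
            # Stand-in for BinaryBlobDataSet(filepath=..., connection_string=...)
            catalog.add(raw_name, object())


catalog = StubCatalog()
DynamicCatalogHook().after_catalog_created(catalog)
print(sorted(catalog._datasets))  # → ['blob_a', 'blob_b']
```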
and in the pipeline.py I call the following:
def create_pipeline(**kwargs):
    pipeline_nodes = []  # collect the dynamically created nodes
    for input_name, output_name in VALID_FILES:
        pipeline_nodes.append(
            node(
                func=node1_check_and_process,
                inputs=_name_from_blob_processed(input_name),
                outputs=_name_from_blob_valid(output_name),
                name=f"check_validity_{_name_from_blob_valid(input_name)[-10:]}_node1",
                # name=f"valid_node1",
                tags=["multi_test_1"],
            )
        )
    return Pipeline(pipeline_nodes, tags="stl_multi_pip")
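Since the node names above are derived by slicing the dataset name ([-10:]), a quick self-contained check can confirm the generated names stay unique. The names below are hypothetical stand-ins for VALID_FILES entries; duplicate node names would make individual nodes unaddressable by name:

```python
# Hypothetical dataset names standing in for the VALID_FILES entries.
valid_files = ["blob_processed_part_01", "blob_processed_part_02"]

# Mirrors the naming scheme from create_pipeline: take the last 10
# characters of the derived dataset name and wrap them in a template.
node_names = [f"check_validity_{name[-10:]}_node1" for name in valid_files]

# Node names must be unique for `kedro run --node=<name>` to resolve them.
assert len(node_names) == len(set(node_names)), "duplicate node names!"
print(node_names)
```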
r
Hi @Mattis, sorry for the bumpy ride. Can you please confirm:
• Is create_pipeline() called in AzureML?
• Are your HOOKS registered in settings.py?
I am not very familiar with AzureML environments, but I will try to read through this. Thanks for your patience
m
@Ravi Kumar Pilla Yes, both: create_pipeline() is being called and the hooks from hooks.py are registered in settings.py. Very strangely, some nodes run through fine but some don't, even though I'm only looping through the pipeline setup..
I think it is a race condition.. even though I defined in the Dockerfile:
CMD ["kedro", "run", "-r", "SequentialRunner"]
and when executing it on AzureML:
kedro azureml run -p multi_pip -s 12341234ce-1234-123r-23ff-1234f231234--aml-env kedro_env
it is still running the pipeline nodes in parallel and not sequentially. Because of that, they don't have their references by the time they are called. Is there a way to force AzureML to run it sequentially with the azureml plugin? Because
kedro azureml run -r SequentialRunner
seems not to be supported.
r
Hi @Mattis, is multi_pip a sum of multiple pipelines, or is it the pipeline that errored out? Kedro does a topological sort on node dependencies and orders the execution flow accordingly, so I am not sure the error is related to ordering. However, since you said it is a race condition, I would like to know if there is an error log that says
DatasetNotFoundError
or
DatasetError
Also, since you said it is working fine locally, I don't think Kedro runners would help here. I am not well versed in the azureml plugin. cc: @marrrcin have you seen something like this? Thank you
m
Hi @Ravi Kumar Pilla multi_pip is the pipeline that errored out. No, there's no error like DatasetNotFoundError or DatasetError. The only error I receive is that the nodes (visible in Azure) cannot find themselves during execution. So e.g. node AA in AzureML states "pipeline does not contain that AA node", even though I can see it (as in the attached picture). Strangely, for one node it runs through (see image).
r
Hi @Mattis, sorry for the delay in response. I would like to know how the dynamic pipeline is constructed to understand the issue better. The error
node AA in AzureML states "pipeline does not contain that AA node"
is completely new to me. Can we get on a call tomorrow when you have some time? (I work in the CT timezone.) Thank you
m
Each "node" you see in Kedro AzureML is actually running something like (not directly, but functionally the same)
kedro run --pipeline=<pipeline name> --node=<name of the node>
underneath the hood. Check if running your docker image locally with
kedro run --pipeline=<pipeline name> --node=<name of the node that fails in Azure>
works for you. If not, then the approach you took for dynamic pipelines is not correct (Kedro in general does not support dynamic pipelines - there are some workarounds though). There is no way to set
SequentialRunner
in AzureML - again: each node pushed to AzureML is executed with a command similar to
kedro run --pipeline=... --node=...
and the ordering is determined by Kedro itself (toposort based on in/out of nodes as Ravi mentioned).
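The toposort behaviour mentioned above can be illustrated with plain Python's graphlib; the node names and dependency graph here are hypothetical, standing in for the inputs/outputs Kedro would infer from the pipeline definition:

```python
from graphlib import TopologicalSorter

# Hypothetical node -> dependencies mapping. In Kedro the edges come from
# each node's declared inputs/outputs, not from the order in which the
# nodes were appended when the pipeline was built.
deps = {
    "load_raw_node": set(),
    "process_node": {"load_raw_node"},
    "validate_node": {"process_node"},
}

# static_order() yields the nodes so that every node appears after
# everything it depends on.
order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['load_raw_node', 'process_node', 'validate_node']
```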