# plugins-integrations
a
Hi guys, a question regarding `kedro-mlflow`. I am trying to implement a distributed architecture where the artifacts are uploaded to S3 and the metrics are logged to a database. I have a problem with a PyTorch weights file: it is not being uploaded to S3, but during the run it tries to access it and I get an error. I guess this is because it wants to access the file before it is uploaded. The file is saved to a local directory when I change the mlflow server config.
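A distributed setup like the one described usually means launching the tracking server with a database backend store for metrics/params and an S3 destination for artifacts, with artifact uploads proxied through the server. A minimal sketch, assuming a Postgres backend store and placeholder bucket, host, and credentials:

```bash
# Hypothetical tracking-server launch: metrics and params go to the backend
# store (Postgres here), artifacts are proxied by the server and written to S3.
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@db-host:5432/mlflow \
  --artifacts-destination s3://my-artifact-bucket/mlflow \
  --host 0.0.0.0 \
  --port 5000
```

With proxied artifact access, clients upload through the server's `/api/2.0/mlflow-artifacts/...` endpoint (the same endpoint that appears in the error below) rather than writing to S3 directly.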
y
Hi, could you elaborate on the sentence "during the run it tries to access it and I get an error"? How does it try to access it? Do you use a `MlflowModelLoggerDataset` to load the model? Can you paste the stack trace?
a
Hi, thanks for the reply, here is the error:
Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ProtocolError('Connection aborted.',     connectionpool.py:812
RemoteDisconnected('Remote end closed connection without response'))':
/api/2.0/mlflow-artifacts/artifacts/2/fcb684db5bf043b8bcb08a112de0c47f/artifacts/model/data/model.pth
I am using `PickleDataset` to save the model.
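For readers unfamiliar with the plugin, a catalog entry for such a model might look roughly like the sketch below; this is only one possible configuration (the entry name and filepath are placeholders, and the exact dataset class names depend on the installed kedro-mlflow and kedro-datasets versions), where the kedro-mlflow wrapper logs whatever the inner pickle dataset saves as an MLflow artifact:

```yaml
# Hypothetical catalog entry: MlflowArtifactDataset wraps a PickleDataset so the
# pickled PyTorch weights are logged as an MLflow artifact after saving locally.
model_weights:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pickle.PickleDataset
    filepath: data/06_models/model.pth
```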
y
Could you elaborate on your setup? What does your `catalog` look like for this entry? Do you use the `pipeline_ml_factory` function? Do you have a custom hook, or just `kedro-mlflow` installed? How do you run kedro, through a notebook or the `kedro run` command? How is your `mlflow.yml` configured, especially your tracking server? It seems that you do not have the rights to log into mlflow. What happens if you use `mlflow.log_artifact(model_path)` in a notebook? Do you have the same error?
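The suggested notebook check could look something like the following sketch; the tracking URI and the weights path are placeholders and would need to match the actual server and file. If this plain upload also fails, the problem is between the client and the tracking server rather than in the kedro pipeline.

```python
# Minimal check that the tracking server accepts a large artifact upload,
# independently of kedro. The URI and file path below are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # replace with the real tracking server

with mlflow.start_run():
    mlflow.log_artifact("data/06_models/model.pth")  # path to the weights file
```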
a
@Yolan Honoré-Rougé Thanks for digging in. I figured out the issue. The problem was with the gunicorn workers that `mlflow server` runs under the hood. They have a response timeout, and since I was uploading large weight files to S3 the timeout was exceeded and the workers were killed by gunicorn. I fixed it by turning off the timeout.
👍 1
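For anyone hitting the same issue, one way to apply this fix is to forward a timeout option to the gunicorn processes when starting the server (the store and bucket URIs below are placeholders); gunicorn treats `--timeout 0` as disabling the worker timeout entirely, and raising it to a large value instead of 0 is a more conservative variant.

```bash
# Forward gunicorn options through mlflow server; --timeout 0 disables the worker
# timeout so long S3 uploads no longer get the worker killed mid-request.
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@db-host:5432/mlflow \
  --artifacts-destination s3://my-artifact-bucket/mlflow \
  --gunicorn-opts "--timeout 0"
```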