# plugins-integrations
w
Another `kedro-sagemaker` question: I managed to get the pipeline showing in Processing jobs, but then I get an `Error: No such command 'sagemaker'.` error. I have `kedro-sagemaker` in my `requirements.txt` file and I’m building and pushing the image myself, so I just do a `kedro sagemaker run`. Any ideas what I’m doing wrong?
m
You have to be in the folder with the Kedro project (`cd` to the folder that contains the `src` and `conf` folders).
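For reference, a sketch of a default `kedro new` project root (the name is made up and exact contents vary by starter and Kedro version); `kedro sagemaker run` should be invoked from this top-level folder:
```
my-project/        # hypothetical project name
├── conf/          # configuration, typically including sagemaker.yml
├── data/
├── notebooks/
├── src/           # the Python package with the pipeline code
└── pyproject.toml
```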
w
I think I’m in the right location. I should’ve also said that the error appears in the CloudWatch logs.
It’s looking like this:
m
If it happens in CloudWatch, then please verify the Docker image / Dockerfile, especially the entrypoint and workdir.
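A quick way to verify both from outside a running container (a sketch; `<your-image>` is a placeholder for the tag you build and push):
```sh
# Print the entrypoint and working directory baked into the image
docker inspect --format '{{json .Config.Entrypoint}} {{.Config.WorkingDir}}' <your-image>
# For kedro-sagemaker the entrypoint should be ["kedro","sagemaker","entrypoint"]
```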
w
```
# Do not change the default entrypoint, it will break the Kedro SageMaker integration!
ENTRYPOINT ["kedro", "sagemaker", "entrypoint"]
working_directory: /home/kedro
```
I’m running out of ideas here :sadcat: Everything seems correct inside the image: `kedro run` works, but not `kedro sagemaker`.
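One more probe that might help here: `kedro info` prints the plugins Kedro has actually discovered, so running it inside the container should show whether `kedro-sagemaker` is registered (a suggestion; a missing entry would explain the missing `sagemaker` subcommand):
```sh
# List the Kedro version and any discovered plugins;
# kedro-sagemaker should appear in the plugin list.
kedro info
```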
m
Have you added `kedro-sagemaker` to the `requirements.txt`?
w
Yes, but the error kept showing. This is my `requirements.txt`:
```
black~=22.0
flake8>=3.7.9, <4.0
ipython>=7.31.1, <8.0
isort~=5.0
jupyter~=1.0
jupyterlab~=3.0
kedro==0.18.8
kedro-mlflow
kedro-datasets[spark.SparkDataSet, tensorflow.TensorFlowModelDataSet, pickle.PickleDataSet]~=1.4.0
kedro-sagemaker
nbstripout~=0.4
pyarrow
pymc-marketing
pyspark==3.3.0
pytest-cov~=3.0
pytest-mock>=1.7.1, <2.0
pytest~=7.2
scikit-learn
tensorflow==2.15.0
tensorflow-probability==0.22.1
tensorflow_io
```
m
Ok, it will be difficult to debug. A few options:
• verify whether `pip freeze | grep sagemaker` in the Docker image shows the plugin installed (see the sketch below)
• change the workdir in the Docker image to something else, something that doesn’t have `kedro` or `kedro_sagemaker` / `kedro-sagemaker` in the name (make sure to also update the `sagemaker.yml` accordingly)
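A sketch of that first check, run against the built image without triggering its baked-in entrypoint (`<your-image>` is a placeholder):
```sh
# Override the entrypoint so we can run an arbitrary command in the image
docker run --rm --entrypoint "" <your-image> pip freeze | grep sagemaker
```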
w
```
root@bc309858e78f:/home/kedro# pip freeze | grep sagemaker
kedro-sagemaker==0.0.1
```
I think I know what’s going on: there is a dependency clash between `kedro-sagemaker` and `kedro-datasets`, with `s3fs` being the problem. If I pin the version of `kedro-sagemaker` to `0.3.0`, I get the following error:
```
#9 75.16     kedro-sagemaker 0.3.0 depends on s3fs<2023.0.0 and >=2022.11.0
#9 75.16     kedro-datasets[pickle-pickledataset,spark-sparkdataset,tensorflow-tensorflowmodeldataset] 1.4.0 depends on s3fs<0.5 and >=0.3.0; extra == "spark.sparkdataset"
```
So if the version is not pinned in the `requirements.txt` file, the only resolution pip can find is to install v0.0.1 of `kedro-sagemaker`, which causes my original error.
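The clash can be reproduced outside the Docker build (a sketch using the pins from the thread; pip’s exact wording may differ):
```sh
# pip's resolver should fail with ResolutionImpossible,
# because the two s3fs ranges (<0.5 vs >=2022.11.0) do not overlap.
pip install "kedro-sagemaker==0.3.0" \
  "kedro-datasets[spark.SparkDataSet,tensorflow.TensorFlowModelDataSet,pickle.PickleDataSet]==1.4.0"
```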
m
Great, that’s the root cause then. kedro-sagemaker has a requirement of `s3fs = "^2022.11.0"`. @Nok Lam Chan / @datajoely do you know why `kedro-datasets[spark-sparkdataset]` has a strict limit on an old version of `s3fs`? The range `>=0.3.0,<0.5` is from 2019-2020 😮
@William Caicedo what you could try is to drop `spark-sparkdataset` from the `kedro-datasets` extras and install those dependencies separately, so in the `requirements.txt` you would have:
```
kedro-sagemaker~=0.3.0
kedro-datasets[pickle-pickledataset,tensorflow-tensorflowmodeldataset]~=1.4.0
s3fs
hdfs>=2.5.8, <3.0
```
and see what happens there.
On our side, we will remove version `0.0.1` to avoid future problems like that.
w
@marrrcin Thanks for the help! I forked `kedro-datasets` and bumped the `s3fs` version to `2022.11.0` just to check, and the pipeline worked with no issues. I haven’t run any of the `kedro-datasets` tests yet though. Also, I got a JSON serialization error when I had some dates as parameters in my `parameters.yml`. The workaround was of course to put quotes around them and treat them as strings. Have you seen that error before?
🥳 1
m
Haven’t seen that one
w
```
/Users/williamc/miniconda3/envs/clv/lib/python3.10/site-packages/kedro_sagemaker/generator.py:98 │
│ in _prepare_sagemaker_params                                                                     │
│                                                                                                  │
│    95 │   │   │   │   sm_param_value = sm_param_types[t](value_name, default_value=v)            │
│    96 │   │   │   else:                                                                          │
│    97 │   │   │   │   sm_param_value = ParameterString(                                          │
│ ❱  98 │   │   │   │   │   value_name, default_value=json.dumps(v)                                │
│    99 │   │   │   │   )                                                                          │
│   100 │   │   │                                                                                  │
│   101 │   │   │   sm_kedro_params.append(sm_param_key)                                           │
│                                                                                                  │
│ /Users/williamc/miniconda3/envs/clv/lib/python3.10/json/__init__.py:231 in dumps                 │
│                                                                                                  │
│   228 │   │   check_circular and allow_nan and                                                   │
│   229 │   │   cls is None and indent is None and separators is None and                          │
│   230 │   │   default is None and not sort_keys and not kw):                                     │
│ ❱ 231 │   │   return _default_encoder.encode(obj)                                                │
│   232 │   if cls is None:                                                                        │
│   233 │   │   cls = JSONEncoder                                                                  │
│   234 │   return cls(                                                                            │
│                                                                                                  │
│ /Users/williamc/miniconda3/envs/clv/lib/python3.10/json/encoder.py:199 in encode                 │
│                                                                                                  │
│   196 │   │   # This doesn't pass the iterator directly to ''.join() because the                 │
│   197 │   │   # exceptions aren't as detailed.  The list call should be roughly                  │
│   198 │   │   # equivalent to the PySequence_Fast that ''.join() would do.                       │
│ ❱ 199 │   │   chunks = self.iterencode(o, _one_shot=True)                                        │
│   200 │   │   if not isinstance(chunks, (list, tuple)):                                          │
│   201 │   │   │   chunks = list(chunks)                                                          │
│   202 │   │   return ''.join(chunks)                                                             │
│                                                                                                  │
│ /Users/williamc/miniconda3/envs/clv/lib/python3.10/json/encoder.py:257 in iterencode             │
│                                                                                                  │
│   254 │   │   │   │   markers, self.default, _encoder, self.indent, floatstr,                    │
│   255 │   │   │   │   self.key_separator, self.item_separator, self.sort_keys,                   │
│   256 │   │   │   │   self.skipkeys, _one_shot)                                                  │
│ ❱ 257 │   │   return _iterencode(o, 0)                                                           │
│   258                                                                                            │
│   259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,                      │
│   260 │   │   _key_separator, _item_separator, _sort_keys, _skipkeys, _one_shot,                 │
│                                                                                                  │
│ /Users/williamc/miniconda3/envs/clv/lib/python3.10/json/encoder.py:179 in default                │
│                                                                                                  │
│   176 │   │   │   │   return JSONEncoder.default(self, o)                                        │
│   177 │   │                                                                                      │
│   178 │   │   """                                                                                │
│ ❱ 179 │   │   raise TypeError(f'Object of type {o.__class__.__name__} '                          │
│   180 │   │   │   │   │   │   f'is not JSON serializable')                                       │
│   181 │                                                                                          │
│   182 │   def encode(self, o):                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Object of type date is not JSON serializable
```
n
I think it’s mostly related to tests: we use `moto` and it has problems with newer versions of `s3fs`, which prevents us from bumping the version.
m
Thx!
@William Caicedo - parameters must be JSON serializable for the plugin
👍 1
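For context, a minimal sketch of the failure mode: the traceback above ends in `json.dumps(v)` inside `kedro_sagemaker/generator.py`, and YAML parses unquoted ISO dates into `datetime.date` objects, which the stdlib JSON encoder rejects. The parameter name below is made up for illustration:
```python
import datetime
import json

import yaml

# YAML 1.1 resolves a bare ISO date to datetime.date; quoting keeps it a str.
params = yaml.safe_load("""
start_date: 2023-01-01            # parsed as a date object
start_date_quoted: "2023-01-01"   # stays a plain string
""")

assert isinstance(params["start_date"], datetime.date)

json.dumps(params["start_date_quoted"])  # works: '"2023-01-01"'
json.dumps(params["start_date"])  # TypeError: Object of type date is not JSON serializable
```
Quoting the date in `parameters.yml` and parsing it inside the node (e.g. with `datetime.date.fromisoformat`) is exactly the workaround described above.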