# questions
a
Kedro 18->19 migration question: Did minimal supported version for pyspark change or something? The code is running fine with 0.18.14, but with 0.19.3 I got the configuration error with spark:
```
<class 'str'>: (<class 'pyspark.errors.exceptions.captured.IllegalArgumentException'>, IllegalArgumentException())
```
n
Can you check your config loader? Are you using OmegaConf or your own?
a
omegaconf
n
Could you share a more detailed stacktrace?
a
and my own spark hook for initializing spark session
n
Where did that error get thrown?
a
yes, it's not helpful though
n
Ah, a Java error... Did your spark/kedro-datasets version change when you upgraded to 0.19.3?
a
didn't update to the latest, using 2.0.0 I think
```
kedro==0.19.3
kedro-datasets==2.0.0
kedro-viz==7.1.0
```
yeah
I'm re-running on 2.1.0
n
I am still not sure whether it is a configuration problem or a pyspark/datasets problem.
```
kedro ipython
context.config_loader["spark"]
```
Can you check if the output of this looks normal?
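(Editor's note: "looks normal" here roughly means a flat mapping of `spark.*` options. A stdlib-only sketch of that sanity check; the keys and values below are made up for illustration, your project's `spark.yml` determines the real contents:)

```python
# Hypothetical output of context.config_loader["spark"] -- the keys and
# values are illustrative, not taken from the project in this thread.
spark_conf = {
    "spark.driver.memory": "4g",
    "spark.sql.session.timeZone": "UTC",
}

# A healthy spark config section is a flat mapping of spark.* option names
# to plain string values, ready to feed into SparkConf().setAll(...).
assert all(key.startswith("spark.") for key in spark_conf)
assert all(isinstance(value, str) for value in spark_conf.values())
print("spark config looks sane")
```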
a
yes and it's the same in my run on version 0.18 that works
same issue on 2.1.0
n
`pip list | grep pyspark` -> did the pyspark version change? Kedro doesn't concern itself with Spark, so it shouldn't do anything different.
a
it did not
I specifically only changed kedro version
I'll make double sure
n
That doesn't mean Spark won't be upgraded. It could be updated by 2nd/3rd-order dependencies.
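(Editor's note: one way to catch a transitive upgrade is to diff `pip freeze` snapshots taken before and after the change. The snapshot contents below are simulated stand-ins so the idea is visible; in practice the two files come from real `pip freeze` runs:)

```shell
# In practice: pip freeze > before.txt, run the upgrade, pip freeze > after.txt.
# Simulated snapshots standing in for real pip freeze output:
printf 'kedro==0.18.14\npyspark==3.4.2\n' > before.txt
printf 'kedro==0.19.3\npyspark==3.4.2\n' > after.txt

# Any differing line is a package that moved, including ones pulled in
# transitively. Here only kedro changed, so pyspark reports unchanged:
diff before.txt after.txt | grep pyspark || echo "pyspark unchanged"
```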
a
I hard-pin it to a specific version
3.4.2 for the record - yes, I've confirmed it didn't change
I might try to update the Spark version but I'd like to avoid it; it's a bigger endeavour and a headache
and this one is fairly recent, from November last year
n
and your Python version is?
a
my main question is how to get more feedback about the issue
3.10.14
hmm I might have a clue about what the issue is
let me test it out
n
I see this in our CI, not sure exactly why
a
Ok, I found the issue: it's probably a more restrictive policy on failing on config errors. We included some experimental logging config for log4j that was not properly loaded, and there were some errors about it, but the previous Kedro version was fine ignoring this failure. I've fixed the issues in this config and it started working again.
this info might be worth including in https://docs.kedro.org/en/stable/resources/migration.html
n
> this info might be worth including in https://docs.kedro.org/en/stable/resources/migration.html
What should be added there? Is this a spark issue or Kedro?
I am a bit confused since the error was coming from Spark not Kedro
a
Looks like I was wrong after all. I must have gotten confused yesterday with juggling the versions. The issue still persists and I don't have any more clues why. I've got the setup where just changing between kedro versions (and doing compatibility changes) causes this error to happen.
n
I don't have any better solution; this is how I usually go about these problems. The principles are:
• Make sure only one thing gets changed at a time, and run something to check that it still works.
• Start from something that works, in your case the old 0.18.x version.
• Then try to bump kedro-datasets -> 2.0 maybe? This will however automatically bump kedro -> 0.19 too, I think.
• Make sure the PySpark version didn't change.
• The source of the error is either Kedro or kedro-datasets.
• Copy the old implementation of SparkDataset from whatever version you are using and see if it works with kedro 0.19.
◦ If it works -> then the problem is datasets
◦ If not -> then the problem is something in Kedro
• Pip-compile, or just `pip freeze` before/after every time you update a library, would help.
◦ You may add additional constraints during the upgrade to make sure it doesn't change your Spark version:
```
pip install kedro==0.19 pyspark==3.4.2
```
(assuming your pyspark was 3.4.2)
My intuition here is that either the configuration is wrong, or it's something with the Spark version/datasets.
a
this is all problematic because of the name change DataSet -> Dataset
I'm working on kedro-datasets version 2.0 in both cases, the only version that changes is kedro 0.18.14 -> 0.19.3
n
Yeah I can see how this is troublesome because both Kedro and datasets introduce a breaking change.
Alright, can you double check that really nothing changed in `config_loader["spark"]`? Kedro doesn't do anything with Spark, so it's weird that upgrading Kedro breaks Spark.
a
It is weird indeed, and that's why I'm writing about this. I thought that maybe, since you included pyspark examples in the starters, you also took the liberty to inject some rudimentary Spark configuration somewhere when pyspark is detected, and that is causing the issue with my Spark version. I am sure the configuration is the same. Now I see that the error is thrown at the first encountered SparkDataset - did the syntax of catalog entries change for it, perhaps? Here's how I have it configured:
```yaml
raw_data_import:
  type: spark.SparkDataset
  file_format: parquet
  filepath: "${globals: raw_folder}/foo/data.parquet"
  load_args:
    header: True
    inferSchema: True
  save_args:
    mode: overwrite
```
I've seen that kedro and kedro-datasets were split in version 0.19 - is that correct?
hmm okay I won't bother you with this issue anymore, thanks a lot for your time and help
the last thing I've checked: since 0.18.14 also has aliases for the dataset, I tried swapping only the versions out, and the issue persists - so it's probably somewhere else, associated with the things I changed to make 0.19 work
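(Editor's note: for reference, the rename that makes this version juggling painful looks like this in the catalog. The entry name is reused from the config above; this is a sketch of the two spellings side by side, not a config to use as-is:)

```yaml
# kedro 0.18.x / kedro-datasets < 2.0 spelling:
raw_data_import:
  type: spark.SparkDataSet
---
# kedro 0.19.x / kedro-datasets >= 2.0 spelling (DataSet -> Dataset):
raw_data_import:
  type: spark.SparkDataset
```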
n
The Spark stuff that we inject is on the project side, not in the `kedro` library. So if you choose PySpark as a tool, there will be a default SparkHook created for you.
> I've seen that kedro and kedro-datasets were split in version 0.19 - is that correct?
It was split earlier, in 0.18.x. `kedro.extras.datasets` is in a frozen state, so you can use the datasets from either `kedro` or `kedro-datasets` (kedro-datasets takes priority if detected).
a
ok, good to confirm this 🙂
n
And kedro-datasets 2.0 onwards only supports Kedro 0.19+ due to the `Dataset` rename.