# questions
a
Kedro 18->19 migration question: Did minimal supported version for pyspark change or something? The code is running fine with 0.18.14, but with 0.19.3 I got the configuration error with spark:
```
<class 'str'>: (<class 'pyspark.errors.exceptions.captured.IllegalArgumentException'>, IllegalArgumentException())
```
n
Can you check your config loader? Are you using OmegaConf or your own?
a
omegaconf
n
Could you share a more detailed stacktrace?
a
and my own spark hook for initializing spark session
n
Where did that error get thrown?
a
yes, it's not helpful though
n
Ah, a Java error... Did your spark/kedro-datasets version change when you upgraded to 0.19.3?
a
didn't update to the latest, using 2.0.0 I think
```
kedro==0.19.3
kedro-datasets==2.0.0
kedro-viz==7.1.0
```
yeah
I'm re-running on 2.1.0
n
I am still not sure whether it is a configuration problem or a pyspark/datasets problem.
```
kedro ipython
context.config_loader["spark"]
```
Can you check if the output of this looks normal?
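(Editor's note: "looks normal" here roughly means a flat mapping of `spark.*` options. A stdlib-only sketch of that sanity check; the keys and values below are made up for illustration, your project's `spark.yml` determines the real contents:)

```python
# Hypothetical output of context.config_loader["spark"] -- the keys and
# values are illustrative, not taken from the project in this thread.
spark_conf = {
    "spark.driver.memory": "4g",
    "spark.sql.session.timeZone": "UTC",
}

# A healthy spark config section is a flat mapping of spark.* option names
# to plain string values, ready to feed into SparkConf().setAll(...).
assert all(key.startswith("spark.") for key in spark_conf)
assert all(isinstance(value, str) for value in spark_conf.values())
print("spark config looks sane")
```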
a
yes and it's the same in my run on version 0.18 that works
same issue on 2.1.0
n
`pip list | grep pyspark` -> did the pyspark version change? Kedro doesn't concern itself with Spark, so it shouldn't do anything different.
a
it did not
I specifically only changed kedro version
I'll make double sure
n
That doesn't mean Spark won't be upgraded. It could be updated by 2nd/3rd-order dependencies.
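(Editor's note: one way to catch a transitive upgrade is to diff `pip freeze` snapshots taken before and after the change. The snapshot contents below are simulated stand-ins so the idea is visible; in practice the two files come from real `pip freeze` runs:)

```shell
# In practice: pip freeze > before.txt, run the upgrade, pip freeze > after.txt.
# Simulated snapshots standing in for real pip freeze output:
printf 'kedro==0.18.14\npyspark==3.4.2\n' > before.txt
printf 'kedro==0.19.3\npyspark==3.4.2\n' > after.txt

# Any differing line is a package that moved, including ones pulled in
# transitively. Here only kedro changed, so pyspark reports unchanged:
diff before.txt after.txt | grep pyspark || echo "pyspark unchanged"
```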
a
I hard-pin it to a specific version
3.4.2 for the record - yes, I've confirmed it didn't change
I might try to update the Spark version but I'd like to avoid it; it's a bigger endeavour and a headache
and this one is fairly recent, from November last year
n
and your Python version is?
a
my main question is how to get more feedback about the issue
3.10.14
hmm I might have a clue about what the issue is
let me test it out
n
I see this in our CI, not sure exactly why
a
Ok, I found the issue: it's probably a more restrictive policy on failing on config errors. We included some experimental logging config for log4j that was not properly loaded, and there were some errors about it, but the previous Kedro version was fine ignoring this failure. I've fixed the issues in this config and it started working again.
this info might be worth including in https://docs.kedro.org/en/stable/resources/migration.html
n
> this info might be worth including in https://docs.kedro.org/en/stable/resources/migration.html
What should be added there? Is this a spark issue or Kedro?
I am a bit confused since the error was coming from Spark not Kedro
a
Looks like I was wrong after all. I must have gotten confused yesterday with juggling the versions. The issue still persists and I don't have any more clues why. I've got the setup where just changing between kedro versions (and doing compatibility changes) causes this error to happen.
n
I don't have any better solution; this is how I usually go about these problems. The principles are:
• Make sure only one thing gets changed at a time, and run something to check that it still works.
• Start from something that works, in your case the old 0.18.x version.
• Then try to bump kedro-datasets -> 2.0 maybe? This will however automatically bump kedro -> 0.19 too, I think.
• Make sure the PySpark version didn't change.
• The source of the error is either Kedro or kedro-datasets.
• Copy the old implementation of SparkDataset from whatever version you are using and see if it works with kedro 0.19.
◦ If it works -> then the problem is datasets
◦ If not -> then the problem is something in Kedro
• Pip-compile, or just `pip freeze` before/after every time you update a library, would help.
◦ You may add additional constraints during the upgrade to make sure it doesn't change your Spark version:
```
pip install kedro==0.19 pyspark==3.4.2
```
(assuming your pyspark was 3.4.2)
My intuition here is that either the configuration is wrong, or it's something with the Spark version/datasets.
a
this is all problematic because of the name change DataSet -> Dataset
I'm working on kedro-datasets version 2.0 in both cases, the only version that changes is kedro 0.18.14 -> 0.19.3
n
Yeah I can see how this is troublesome because both Kedro and datasets introduce a breaking change.
Alright, can you double check that really nothing changed in `config_loader["spark"]`? Kedro doesn't do anything with Spark, so it's weird that upgrading Kedro breaks Spark.
a
It is weird indeed, and that's why I'm writing about this. I thought that maybe, since you included pyspark examples in the starters, you also took the liberty to inject some rudimentary Spark configuration somewhere when pyspark is detected, and that is causing the issue with my Spark version. I am sure the configuration is the same. Now I see that the error is thrown at the first encountered SparkDataset - did the syntax of catalog entries change for it, perhaps? Here's how I have it configured:
```yaml
raw_data_import:
  type: spark.SparkDataset
  file_format: parquet
  filepath: "${globals: raw_folder}/foo/data.parquet"
  load_args:
    header: True
    inferSchema: True
  save_args:
    mode: overwrite
```
I've seen that kedro and kedro-datasets were split in version 0.19 - is that correct?
hmm okay I won't bother you with this issue anymore, thanks a lot for your time and help
the last thing I've checked: since 0.18.14 also has aliases for the dataset, I tried swapping only the versions out, and the issue persists - so it's probably somewhere else, associated with the things I changed to make 0.19 work
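(Editor's note: for reference, the rename that makes this version juggling painful looks like this in the catalog. The entry name is reused from the config above; this is a sketch of the two spellings side by side, not a config to use as-is:)

```yaml
# kedro 0.18.x / kedro-datasets < 2.0 spelling:
raw_data_import:
  type: spark.SparkDataSet
---
# kedro 0.19.x / kedro-datasets >= 2.0 spelling (DataSet -> Dataset):
raw_data_import:
  type: spark.SparkDataset
```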
n
The Spark stuff that we inject is on the project side, not in the `kedro` library. So if you choose PySpark as a tool, there will be a default SparkHook created for you.
> I've seen that kedro and kedro-datasets were split in version 0.19 - is that correct?
It was split earlier, in 0.18.x. `kedro.extras.datasets` is in a frozen state, so you can use the datasets from either `kedro` or `kedro-datasets` (kedro-datasets takes priority if detected).
a
ok, good to confirm this 🙂
n
And kedro-datasets 2.0 onwards only supports Kedro 0.19+ due to the `Dataset` rename.