Jacob Pieniazek
02/20/2025, 5:26 PM
SparkSession.getActiveSession() (see first image). Our pipeline consists entirely of Ibis.TableDataset datasets and I/O with the pyspark backend. What is throwing me is that other nodes use the pyspark connection and can perform operations properly across the Spark session, but this single node fails when it calls an imported module that is unable to find the Spark session. This issue is not present in Kedro 0.19.10. My best guess is that it has something to do with the updated code in kedro/runner/sequential_runner.py using ThreadPoolExecutor and possible scoping issues? Apologies for the somewhat scattered explanation; there is quite a bit I don't fully understand here, so I appreciate any help or guidance. Lmk if I can provide any additional info as well.
Hall
02/20/2025, 5:26 PM
Elena Khaustova
02/20/2025, 5:40 PM
ThreadPoolExecutor with one thread. But we’ve figured out that it affects non-thread-safe runs:
https://github.com/kedro-org/kedro/issues/4486
So currently we’re rolling back to the old approach.
The quick way to check whether that’s the case for you is to see whether you get the same error on Kedro 0.19.10 if you use ThreadRunner instead of the default SequentialRunner.
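As a rough illustration of why running a node inside a worker thread could hide the session: the PySpark docs describe SparkSession.getActiveSession() as returning the session for the current thread, so state set on the main thread may not be visible from a ThreadPoolExecutor worker. The sketch below uses only the standard library; `_state`, `set_active_session`, `get_active_session` and `node` are hypothetical stand-ins, not Kedro or PySpark APIs, and this is an assumption about the mechanism rather than a confirmed diagnosis.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread "active session" state.
# This is NOT PySpark; it only mimics thread-scoped lookup.
_state = threading.local()


def set_active_session(name):
    """Record a session name for the calling thread only."""
    _state.session = name


def get_active_session():
    """Return the calling thread's session name, or None if unset."""
    return getattr(_state, "session", None)


def node():
    """Hypothetical node body that looks up the active session."""
    return get_active_session()


# The session is registered on the main thread.
set_active_session("spark-main")

# Calling the node directly on the main thread finds the session.
direct_result = node()

# Running the same node in a single worker thread does not,
# because the worker has its own (empty) thread-local storage.
with ThreadPoolExecutor(max_workers=1) as pool:
    pooled_result = pool.submit(node).result()

print("direct:", direct_result)
print("pooled:", pooled_result)
```

If this is what is happening, one possible workaround inside the imported module would be to obtain the session with SparkSession.builder.getOrCreate() rather than relying on the thread's active session, though that is untested here.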
Jacob Pieniazek
02/20/2025, 5:59 PM
ThreadRunner in Kedro 0.19.10 did indeed throw the same error.
Elena Khaustova
02/20/2025, 6:43 PM