# questions
n
Hi guys, I have an issue running Kedro with `ThreadRunner` in an AzureML cluster (no scaling, 1-node compute). Kedro==0.19.3, sqlalchemy==2.0.29, oracledb==2.1.1 / cx-Oracle==8.3.0 (tried both, same results).

The Kedro pipeline is executed through a `kedro_script.py` (which is essentially a `KedroSession.create` + `session.run`):
1. It intakes 21 `SQLQueryDataset`s.
2. Performs transformations on each in different nodes.
3. Writes to Azure Blob Storage using a `ParquetDataset`.
4. Uses all the outputs and combines them. (I attached a viz of the pipeline.)

The problem: using `ThreadRunner` in the cluster, 20 of the transformation nodes (step 2) run and write their output to storage (step 3), except the last (random) one. Then it fails with a DB error (stdout attached). Using `ThreadRunner` in a compute instance with the same environment (docker image, compute type, etc.) works just fine. Using `SequentialRunner` in the cluster does not reproduce the error; it runs just fine (`is_async=True/False`).

Tried:
• Different Oracle (ugh, I know) drivers
• Different versions of `oracledb` and `cx-Oracle`; no luck
• Different numbers of workers
• Different engine parameters: `pool_size`, `max_overflow`, `thick_mode` (yay to the support of sqlalchemy engine params)

Any idea what might be happening here?
stdout
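(For context, a minimal sketch of what a `kedro_script.py` entry point like the one described might look like; the real script wasn't shared, so the project layout and `max_workers` value are assumptions.)

```python
# kedro_script.py -- hedged sketch of the entry point described above
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ThreadRunner

# Assumes the script sits in the Kedro project root
bootstrap_project(Path.cwd())

with KedroSession.create(project_path=Path.cwd()) as session:
    # ThreadRunner shares a single process, so all 21 SQLQueryDatasets
    # can hit the DB concurrently from different threads
    session.run(runner=ThreadRunner(max_workers=21))
```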
datajoely
So `ThreadRunner` was designed for Spark workloads; if you try `ParallelRunner`, does it work?
Sorry, sometimes we fight the limits of Python's concurrency system.
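(For reference, a hedged sketch of switching to `ParallelRunner`; the same can be done from the CLI with `kedro run --runner=ParallelRunner`.)

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    # ParallelRunner forks worker processes, so every dataset (and the
    # data passed between nodes) must survive pickling
    session.run(runner=ParallelRunner())
```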
n
@datajoely no, datasets in other pipelines in the project are not serialisable.
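(That matches `ParallelRunner`'s constraint: it moves datasets and node outputs across process boundaries by pickling. A quick, hedged way to check what would block it:)

```python
import pickle


def is_picklable(obj) -> bool:
    """ParallelRunner ships objects between processes via pickle,
    so anything that fails this round-trip rules it out."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


# Example: check a dataset instance pulled from the catalog, e.g. inside
# an after_catalog_created hook (the dataset name here is hypothetical):
# print(is_picklable(catalog._get_dataset("my_oracle_query")))
```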
Nok Lam Chan
From your error, `SQLScriptDataset` is used instead of `SQLQueryDataset`; is it some custom implementation?
n
@Nok Lam Chan `SQLScriptDataset` is a child of `SQLQueryDataset`. It formats the query in a special way using parameters in the catalog and then calls `super()`.
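(Roughly this shape, I'd guess; a minimal sketch of the subclass as described. The `query_params` name and the `.format()` call are assumptions, since the actual implementation wasn't shared.)

```python
from kedro_datasets.pandas import SQLQueryDataset


class SQLScriptDataset(SQLQueryDataset):
    """Formats the raw SQL with catalog-supplied parameters, then
    defers everything else (connection, load) to SQLQueryDataset."""

    def __init__(self, sql: str, query_params: dict = None, **kwargs):
        # query_params is an assumed name for the catalog-driven values
        formatted_sql = sql.format(**(query_params or {}))
        super().__init__(sql=formatted_sql, **kwargs)
```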
Deepyaman Datta
My guess is, with `ThreadRunner`, you're trying to create 21 sessions concurrently. And maybe that's problematic.
You can also try to check whether the session is being reused from cache at all (in the dataset). With threads, maybe not?
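(One hedged way to check that from a hook: kedro-datasets keeps its SQLAlchemy engines in a class-level `engines` dict on the pandas SQL datasets, but verify the attribute name against your installed version.)

```python
import logging

from kedro.framework.hooks import hook_impl
from kedro_datasets.pandas import SQLQueryDataset

logger = logging.getLogger(__name__)


class EngineCacheHooks:
    @hook_impl
    def after_node_run(self, node):
        # One cached engine per connection string would mean the 21
        # datasets share an engine (and its pool) across threads;
        # a growing dict would mean new engines keep being created.
        engines = getattr(SQLQueryDataset, "engines", {})
        logger.info("after %s: %d cached engine(s)", node.name, len(engines))
```

(Registered the usual way, via `HOOKS = (EngineCacheHooks(),)` in `settings.py`.)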
n
@Deepyaman Datta The DB does support the N connections; this is from previous tests in a compute instance in Azure with `ThreadRunner`. The weird part is that it fails when running in a cluster using a `CommandComponent` as part of a Pipeline Job in AzureML under the same conditions (the pipeline architecture and environment).
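(For anyone unfamiliar with that setup, a hedged sketch of the AzureML side using the v2 `azure-ai-ml` SDK; the names, environment, and compute target are assumptions.)

```python
from azure.ai.ml import command
from azure.ai.ml.dsl import pipeline

# The Kedro entry point wrapped as a command step (component)
kedro_step = command(
    code="./",                          # project root containing kedro_script.py
    command="python kedro_script.py",
    environment="my-kedro-env@latest",  # assumed docker-based environment
    compute="cpu-cluster",              # the 1-node, no-scaling cluster
)


@pipeline()
def kedro_pipeline_job():
    # Same script that works on a compute instance, now inside a pipeline job
    kedro_step()
```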
Not sure if it's relevant, but I noticed this warning mid-run. Nodes keep executing tho:
`WARNING There are 21 nodes that have not run. (runner.py:218)`
`You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "nodo_mt,check_requirements"`