# questions
n
Hi guys, I have an issue running Kedro with `ThreadRunner` in an AzureML cluster (no scaling, 1-node compute). Kedro==0.19.3, sqlalchemy==2.0.29, oracledb==2.1.1 / cx-Oracle==8.3.0 (tried both, same results).

The Kedro pipeline is executed through a `kedro_script.py` (which is essentially a `KedroSession.create` + `session.run`):
1. It intakes 21 `SQLQueryDataset`s.
2. Performs transformations on each in different nodes.
3. Writes to Azure Blob Storage using a `ParquetDataset`.
4. Uses all the outputs and combines them. (I attached a viz of the pipeline.)

The problem: using `ThreadRunner` in the cluster, 20 of the transformation nodes (step 2) run and write their output to storage (step 3), except the last (random) one. Then it fails with a DB error (stdout attached). Using `ThreadRunner` in a compute instance with the same environment (docker image, compute type, etc.) works just fine. Using `SequentialRunner` in the cluster does not reproduce the error; it runs just fine (`is_async=True/False`).

Tried:
• Different Oracle (ugh, I know) drivers
• Different versions of `oracledb` and `cx-Oracle`; no luck
• Different numbers of workers
• Different engine parameters: `pool_size`, `max_overflow`, `thick_mode` (yay to the support of sqlalchemy engine params)

Any idea what might be happening here?
stdout
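(For context, a minimal sketch of what a `kedro_script.py` entry point like the one described might look like; the real script wasn't shared, so the project layout and `max_workers` value are assumptions.)

```python
# kedro_script.py -- hedged sketch of the entry point described above
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ThreadRunner

# Assumes the script sits in the Kedro project root
bootstrap_project(Path.cwd())

with KedroSession.create(project_path=Path.cwd()) as session:
    # ThreadRunner shares a single process, so all 21 SQLQueryDatasets
    # can hit the DB concurrently from different threads
    session.run(runner=ThreadRunner(max_workers=21))
```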
datajoely
So `ThreadRunner` was designed for Spark workloads; if you try `ParallelRunner`, does it work?
Sorry, sometimes we fight the limits of Python's concurrency system.
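(For reference, a hedged sketch of switching to `ParallelRunner`; the same can be done from the CLI with `kedro run --runner=ParallelRunner`.)

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.runner import ParallelRunner

bootstrap_project(Path.cwd())
with KedroSession.create(project_path=Path.cwd()) as session:
    # ParallelRunner forks worker processes, so every dataset (and the
    # data passed between nodes) must survive pickling
    session.run(runner=ParallelRunner())
```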
n
@datajoely no, datasets in other pipelines in the project are not serialisable.
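(That matches `ParallelRunner`'s constraint: it moves datasets and node outputs across process boundaries by pickling. A quick, hedged way to check what would block it:)

```python
import pickle


def is_picklable(obj) -> bool:
    """ParallelRunner ships objects between processes via pickle,
    so anything that fails this round-trip rules it out."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


# Example: check a dataset instance pulled from the catalog, e.g. inside
# an after_catalog_created hook (the dataset name here is hypothetical):
# print(is_picklable(catalog._get_dataset("my_oracle_query")))
```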
Nok Lam Chan
From your error, `SQLScriptDataset` is used instead of `SQLQueryDataset`; is it some custom implementation?
n
@Nok Lam Chan `SQLScriptDataset` is a child of `SQLQueryDataset`. It formats the query in a special way using parameters in the catalog and then calls `super()`.
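(Roughly this shape, I'd guess; a minimal sketch of the subclass as described. The `query_params` name and the `.format()` call are assumptions, since the actual implementation wasn't shared.)

```python
from kedro_datasets.pandas import SQLQueryDataset


class SQLScriptDataset(SQLQueryDataset):
    """Formats the raw SQL with catalog-supplied parameters, then
    defers everything else (connection, load) to SQLQueryDataset."""

    def __init__(self, sql: str, query_params: dict = None, **kwargs):
        # query_params is an assumed name for the catalog-driven values
        formatted_sql = sql.format(**(query_params or {}))
        super().__init__(sql=formatted_sql, **kwargs)
```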
Deepyaman Datta
My guess is, with `ThreadRunner`, you're trying to create 21 sessions concurrently. And maybe that's problematic.
You can also try to check whether the session is being reused from cache at all (in the dataset). With threads, maybe not?
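(One hedged way to check that from a hook: kedro-datasets keeps its SQLAlchemy engines in a class-level `engines` dict on the pandas SQL datasets, but verify the attribute name against your installed version.)

```python
import logging

from kedro.framework.hooks import hook_impl
from kedro_datasets.pandas import SQLQueryDataset

logger = logging.getLogger(__name__)


class EngineCacheHooks:
    @hook_impl
    def after_node_run(self, node):
        # One cached engine per connection string would mean the 21
        # datasets share an engine (and its pool) across threads;
        # a growing dict would mean new engines keep being created.
        engines = getattr(SQLQueryDataset, "engines", {})
        logger.info("after %s: %d cached engine(s)", node.name, len(engines))
```

(Registered the usual way, via `HOOKS = (EngineCacheHooks(),)` in `settings.py`.)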
n
@Deepyaman Datta The DB does support the N connections; this is from previous tests in a compute instance in Azure with `ThreadRunner`. The weird part is that it fails when running in a cluster using a `CommandComponent` as part of a Pipeline Job in AzureML under the same conditions (the pipeline architecture and environment).
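(For anyone unfamiliar with that setup, a hedged sketch of the AzureML side using the v2 `azure-ai-ml` SDK; the names, environment, and compute target are assumptions.)

```python
from azure.ai.ml import command
from azure.ai.ml.dsl import pipeline

# The Kedro entry point wrapped as a command step (component)
kedro_step = command(
    code="./",                          # project root containing kedro_script.py
    command="python kedro_script.py",
    environment="my-kedro-env@latest",  # assumed docker-based environment
    compute="cpu-cluster",              # the 1-node, no-scaling cluster
)


@pipeline()
def kedro_pipeline_job():
    # Same script that works on a compute instance, now inside a pipeline job
    kedro_step()
```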
Not sure if it's relevant, but I noticed this warning mid-run. Nodes keep executing tho:
`WARNING There are 21 nodes that have not run. (runner.py:218)`
`You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "nodo_mt,check_requirements"`