# questions
j
Hi all, hope you are doing fine! I am currently having an issue with a cluster running on Databricks with Kedro 19. We upgraded from Kedro 18 to Kedro 19 and are now facing issues where the tasks in a job constantly fail and are retried. It looks like a memory leak or something similar, because the tasks are retried over and over until they finally succeed. Or the tasks hang because they are waiting for a thread to finish.

This is the code we are running:

    from kedro.runner import SequentialRunner, ParallelRunner, ThreadRunner

    with make_session() as session:
        session.run(pipeline_name="commercial_group_pipeline", node_names=['nd_product_commercial_group_processing'], runner=ThreadRunner())

I noticed there is an error in the logs:

    ERROR ThrottledLogger$: Background thread had non-allowed tags, this might indicate a leak: Set(TagDefinition(dbfsPath,dbfs path. E.g., '/' or '/mnt/data',DATA_LABEL_USER_DATASET_METADATA_PATH,false,false,List(),UsageLogRedactionConfig(List()))) [35 occurrences]
    java.lang.IllegalThreadStateException: BackgroundThread had non-allowed tags, possible leak
      at com.databricks.threading.InstrumentedScheduledBackgroundExecutor.validateExistingContext(InstrumentedScheduledBackgroundExecutor.scala:97)
      at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.$anonfun$run$2(InstrumentedThreadPoolExecutor.scala:139)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads(QueuedThreadPoolInstrumenter.scala:66)
      at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads$(QueuedThreadPoolInstrumenter.scala:63)
      at com.databricks.threading.InstrumentedThreadPoolExecutor.trackActiveThreads(InstrumentedThreadPoolExecutor.scala:27)
      at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.$anonfun$run$1(InstrumentedThreadPoolExecutor.scala:138)
      at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
      at com.databricks.context.integrity.IntegrityCheckContext$ThreadLocalStorage$.withValue(IntegrityCheckContext.scala:44)
      at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.run(InstrumentedThreadPoolExecutor.scala:138)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:750)

Cluster config:

    {
      "cluster_name": "spark_clue_new_kedro_19_upgraded_001",
      "spark_version": "14.3.x-cpu-ml-scala2.12",
      "spark_conf": {
        "spark.speculation.interval": "2000",
        "spark.databricks.delta.preview.enabled": "true",
        "spark.scheduler.listenerbus.eventqueue.capacity": "1000000",
        "spark.sql.shuffle.partitions": "auto",
        "spark.speculation": "true",
        "spark.sql.execution.arrow.enabled": "true",
        "spark.sql.adaptive.enabled": "true",
        "spark.speculation.quantile": "0.75",
        "spark.speculation.multiplier": "5"
      },
      "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1
      },
      "node_type_id": "Standard_E20s_v4",
      "driver_node_type_id": "Standard_E20s_v4",
      "spark_env_vars": { [I removed this part] },
      "autotermination_minutes": 60,
      "enable_elastic_disk": true,
      "init_scripts": [
        { "workspace": { "destination": "/Shared/init-script.sh" } }
      ],
      "policy_id": "3063D0A837001E49",
      "enable_local_disk_encryption": false,
      "data_security_mode": "LEGACY_PASSTHROUGH",
      "runtime_engine": "STANDARD",
      "effective_spark_version": "14.3.x-cpu-ml-scala2.12",
      "autoscale": { "min_workers": 1, "max_workers": 14 },
      "apply_policy_default_values": false
    }

Thanks in advance! Maybe this is not descriptive enough, but let me know what else I should share to help debug this 🙂
d
Hi Joao, have you tried ParallelRunner instead of ThreadRunner?
j
Thank you for your help. There was a bug in the code, and I finally fixed it.
👍 1