Hi all, hope you are doing fine! I am currently having an issue with a cluster running on Databricks with Kedro 0.19. We upgraded from Kedro 0.18 to 0.19 and are now seeing the tasks in a job constantly fail and get retried. It looks like a memory leak or something similar, because the tasks repeat over and over until they finally succeed. Or the tasks are hanging because they are waiting for a thread to finish.
This is the code we are running:
from kedro.runner import SequentialRunner, ParallelRunner, ThreadRunner

with make_session() as session:
    session.run(
        pipeline_name="commercial_group_pipeline",
        node_names=["nd_product_commercial_group_processing"],
        runner=ThreadRunner(),
    )
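For context, make_session() is our own small helper; it is roughly a thin wrapper around bootstrap_project + KedroSession.create, something like the simplified sketch below (the project path and env handling here are placeholders, not our real helper):

# Simplified sketch of our make_session helper, not the exact code we run.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def make_session(env=None):
    # Placeholder path; in reality this points at our project checkout on the cluster.
    project_path = Path("/dbfs/FileStore/our_kedro_project")
    # Register the project's settings and pipelines before creating the session.
    bootstrap_project(project_path)
    # The returned session is used as a context manager in the snippet above.
    return KedroSession.create(project_path=project_path, env=env)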
I noticed there is an error in the logs:
ERROR ThrottledLogger$: Background thread had non-allowed tags, this might indicate a leak: Set(TagDefinition(dbfsPath,dbfs path. E.g., '/' or '/mnt/data',DATA_LABEL_USER_DATASET_METADATA_PATH,false,false,List(),UsageLogRedactionConfig(List()))) [35 occurrences]
java.lang.IllegalThreadStateException: BackgroundThread had non-allowed tags, possible leak
at com.databricks.threading.InstrumentedScheduledBackgroundExecutor.validateExistingContext(InstrumentedScheduledBackgroundExecutor.scala:97)
at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.$anonfun$run$2(InstrumentedThreadPoolExecutor.scala:139)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads(QueuedThreadPoolInstrumenter.scala:66)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads$(QueuedThreadPoolInstrumenter.scala:63)
at com.databricks.threading.InstrumentedThreadPoolExecutor.trackActiveThreads(InstrumentedThreadPoolExecutor.scala:27)
at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.$anonfun$run$1(InstrumentedThreadPoolExecutor.scala:138)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.context.integrity.IntegrityCheckContext$ThreadLocalStorage$.withValue(IntegrityCheckContext.scala:44)
at com.databricks.threading.InstrumentedThreadPoolExecutor$$anon$1.run(InstrumentedThreadPoolExecutor.scala:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Cluster config:
{
  "cluster_name": "spark_clue_new_kedro_19_upgraded_001",
  "spark_version": "14.3.x-cpu-ml-scala2.12",
  "spark_conf": {
    "spark.speculation.interval": "2000",
    "spark.databricks.delta.preview.enabled": "true",
    "spark.scheduler.listenerbus.eventqueue.capacity": "1000000",
    "spark.sql.shuffle.partitions": "auto",
    "spark.speculation": "true",
    "spark.sql.execution.arrow.enabled": "true",
    "spark.sql.adaptive.enabled": "true",
    "spark.speculation.quantile": "0.75",
    "spark.speculation.multiplier": "5"
  },
  "azure_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": -1
  },
  "node_type_id": "Standard_E20s_v4",
  "driver_node_type_id": "Standard_E20s_v4",
  "spark_env_vars": {
    [I removed this part]
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": true,
  "init_scripts": [
    {
      "workspace": {
        "destination": "/Shared/init-script.sh"
      }
    }
  ],
  "policy_id": "3063D0A837001E49",
  "enable_local_disk_encryption": false,
  "data_security_mode": "LEGACY_PASSTHROUGH",
  "runtime_engine": "STANDARD",
  "effective_spark_version": "14.3.x-cpu-ml-scala2.12",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 14
  },
  "apply_policy_default_values": false
}
Thanks in advance! Maybe this is not descriptive enough, so let me know what else I should share to help debug this 🙂