Merel
03/27/2025, 8:31 AM
kedro-databricks works and I was wondering whether it makes sense to use any of the other runners (ThreadRunner or ParallelRunner)? As far as I understand, for every node we use these run parameters: --nodes name, --conf-source self.remote_conf_dir, --env self.env. Would it make sense to allow adding the runner type too? Or, if you want parallel running, should you use the Databricks cluster setup for that? I'm not very familiar with all the run options in Databricks, so I'm trying to figure out where to use Kedro features and where to use Databricks. (cc: @Rashida Kanchwala)
Merel
03/27/2025, 9:02 AM
ThreadRunner or ParallelRunner?
Deepyaman Datta
03/27/2025, 1:45 PM
kedro-databricks, but based on experience working with Spark, I would expect you can't use ParallelRunner.
Merel
03/27/2025, 2:00 PM
Jens Peder Meldgaard
03/28/2025, 2:34 PM
kedro-databricks is rather to generate the DAGs of Kedro pipelines as Databricks Workflows.
Any type of parallelisation should therefore be implemented at the node level if used with kedro-databricks.
If tasks can run in parallel, based on the DAG, they will run in parallel by default.
Merel
03/28/2025, 4:21 PM
"If tasks can run in parallel, based on the DAG, they will run in parallel by default."
Is that a default setting on Databricks? I haven't gone beyond the basic example with 3 nodes yet, but will do some more experimenting next week.
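To make the quoted point concrete, here is a minimal sketch in plain Python of what "can run in parallel, based on the DAG" means: tasks whose dependencies are all finished are grouped into the same level, and everything within one level could start together. This is only a model of the idea, not Databricks or kedro-databricks code; the `parallel_levels` helper and the node names are hypothetical, and it assumes every task appears as a key in the dependency mapping.

```python
def parallel_levels(dependencies):
    """Group tasks into levels; tasks within one level do not depend on each other.

    `dependencies` maps each task name to the task names it depends on.
    Assumption: every task that is depended on also appears as a key.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    levels = []
    while remaining:
        # Tasks with no unfinished dependencies are ready to start now.
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle or unknown task in dependencies")
        levels.append(ready)
        for task in ready:
            del remaining[task]
        # Mark the tasks that just ran as finished for everything still waiting.
        for deps in remaining.values():
            deps.difference_update(ready)
    return levels


# Hypothetical three-node pipeline: two independent nodes feed a third.
deps = {
    "preprocess_companies": [],
    "preprocess_shuttles": [],
    "create_model_input_table": ["preprocess_companies", "preprocess_shuttles"],
}
levels = parallel_levels(deps)
print(levels)
```

In this toy pipeline the two preprocessing nodes land in the first level, so a scheduler that follows the DAG could start them at the same time without any Kedro runner being involved.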
Jens Peder Meldgaard
03/28/2025, 4:22 PM
Merel
03/28/2025, 4:23 PM
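For reference, a rough sketch of the per-node run parameters mentioned at the top of the thread, with the runner type as an optional extra. The `node_task_parameters` helper, its arguments, and the example values are hypothetical and are not taken from the kedro-databricks source; the only thing assumed about Kedro itself is that `kedro run` accepts a `--runner` option.

```python
def node_task_parameters(node_name, remote_conf_dir, env, runner=None):
    """Build the `kedro run` arguments for one node (hypothetical helper)."""
    params = [
        "--nodes", node_name,
        "--conf-source", remote_conf_dir,
        "--env", env,
    ]
    # Assumption: the runner is only passed through when explicitly requested,
    # so the default behaviour stays as it is today.
    if runner is not None:
        params += ["--runner", runner]
    return params


# Example values below are placeholders, not real paths or node names.
default_params = node_task_parameters("split_data", "/path/to/conf", "dev")
threaded_params = node_task_parameters(
    "split_data", "/path/to/conf", "dev", runner="ThreadRunner"
)
print(default_params)
print(threaded_params)
```

Since each task runs a single node, a runner choice here would only matter within that node's own run, which lines up with the point above that parallelism across nodes comes from the workflow DAG.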