# plugins-integrations
m
Hi @Jens Peder Meldgaard, I'm learning more about how `kedro-databricks` works, and I was wondering whether it makes sense to use any of the other runners (`ThreadRunner` or `ParallelRunner`)? As far as I understand, for every node we use these run parameters: `--nodes name, --conf-source self.remote_conf_dir, --env self.env`. Would it make sense to allow adding a runner type too? Or, if you want parallel running, should you use the Databricks cluster setup for that? I'm not very familiar with all the run options in Databricks, so I'm trying to figure out where to use Kedro features and where to use Databricks ones. (cc: @Rashida Kanchwala)
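For reference, here's a minimal sketch (not the plugin's actual code) of the per-node task parameters described above; the node name, config path, and environment are illustrative:

```python
# Sketch only, not kedro-databricks' actual implementation: the plugin
# generates one Databricks Workflows task per Kedro node, and each task
# invokes `kedro run` with parameters like the ones quoted above.
# All names and paths below are made up for illustration.
def task_parameters(node_name: str, remote_conf_dir: str, env: str) -> list[str]:
    return [
        "--nodes", node_name,              # run exactly this one node
        "--conf-source", remote_conf_dir,  # config shipped to the workspace
        "--env", env,                      # Kedro environment to run with
    ]

params = task_parameters("preprocess_companies", "/dbfs/conf", "databricks")
print(params)
```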
I guess I partly answered my own question, since it doesn't make sense to provide the runner argument if each node is run individually per task. But you could of course group things differently and run a whole namespace or pipeline in a single task; would it then make sense to run that part with either the `ThreadRunner` or `ParallelRunner`?
d
Without looking into `kedro-databricks`, but based on experience working with Spark, I would expect you can't use `ParallelRunner`.
m
Hmm, yeah, good point about Spark and the `ParallelRunner`.
j
Hey @Merel, the idea of `kedro-databricks` is rather to generate the DAGs of Kedro pipelines as Databricks Workflows. Any type of parallelisation should therefore be implemented at the node level, if used with `kedro-databricks`. If tasks can run in parallel based on the DAG, they will run in parallel by default.
m
> If tasks can run in parallel, based on the DAG, they will run in parallel by default.
Is that a default setting on Databricks? I haven't gone beyond the basic example with 3 nodes yet, but will do some more experimenting next week.
j
Yes. The DAG is executed 100% based on the dependencies between tasks, so it should run in parallel where possible, by default.
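To make that behaviour concrete, here's a toy standard-library sketch (made-up task names, work stubbed out) of how a dependency-driven scheduler like Databricks Workflows behaves: each task starts as soon as all of its upstream tasks have finished, so independent branches run concurrently without extra configuration:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Toy illustration only: a three-task DAG where two independent
# preprocessing tasks can run concurrently, and a downstream task
# starts once both have finished.
deps = {  # task -> set of upstream tasks it waits for
    "preprocess_a": set(),
    "preprocess_b": set(),
    "combine": {"preprocess_a", "preprocess_b"},
}

def run_dag(deps):
    order, done, running = [], set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # submit every not-yet-run task whose dependencies are all satisfied
            for task, upstream in deps.items():
                if task not in done and task not in running and upstream <= done:
                    running[task] = pool.submit(lambda t=task: t)  # real work stubbed out
            # block until at least one running task finishes
            finished = wait(running.values(), return_when=FIRST_COMPLETED).done
            for task, fut in list(running.items()):
                if fut in finished:
                    done.add(task)
                    order.append(task)
                    del running[task]
    return order

order = run_dag(deps)
assert order[-1] == "combine"  # combine only starts after both branches complete
```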
m
Ah great to know! @Rashida Kanchwala ☝️