# questions
j
Hi all, I am running Kedro as a Job in Databricks and I am getting the error in the attached screenshot. It tries to find the configuration in databricks/driver/conf/base, even though the config path is passed (correctly, I hope) in the sys args, as shown in the second attachment. I'm using Databricks Asset Bundles to run it (it's like dbx, but officially supported by Databricks). Thanks in advance for your help! FYI, I have been asking related questions here as well: https://kedro-org.slack.com/archives/C03RKP2LW64/p1702289803851329
🧱 1
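For context, the wheel entry point in the linked deployment guide is what passes --conf-source through to the Kedro session, roughly like this (a sketch of the guide's pattern, not the poster's exact code; the databricks_run name comes from that guide):

```python
# databricks_run.py -- sketch of the wheel entry point pattern from
# the Kedro 0.18.x Databricks deployment guide (approximate)
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", dest="env", type=str, default="base")
    parser.add_argument("--conf-source", dest="conf_source", type=str, default=None)
    parser.add_argument("--package-name", dest="package_name", type=str, required=True)
    args = parser.parse_args()

    configure_project(args.package_name)
    # conf_source tells the ConfigLoader where conf/ lives; if it is
    # never applied (e.g. a hook builds its own loader), Kedro falls
    # back to <project_path>/conf, which on a Databricks job driver
    # resolves to /databricks/driver/conf.
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()
```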
d
So DABs are very new, but the very top of your screenshot says that the
filepath does not exist or not accessible
so we just need to work out what the job can actually see at that working directory.
Perhaps a little script that prints out the cwd, or even the file tree, will help diagnose what the filepath needs to be.
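A minimal sketch of such a script, run as a one-off job or notebook task:

```python
# Diagnostic sketch: print the working directory and a shallow file
# tree so you can see what the Databricks driver can actually access.
import os

cwd = os.getcwd()
print("cwd:", cwd)
for root, dirs, files in os.walk(cwd):
    depth = root[len(cwd):].count(os.sep)
    if depth > 2:  # keep output manageable
        dirs[:] = []
        continue
    indent = "  " * depth
    print(f"{indent}{os.path.basename(root) or root}/")
    for name in files:
        print(f"{indent}  {name}")
```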
j
This is probably where it is trying to look (see attachment). But I wonder what is happening in the background that makes it look for the config there, instead of the location I pointed to. I have tried to follow the steps here as closely as possible, so I would hope it doesn't fail: https://docs.kedro.org/en/0.18.14/deployment/databricks/databricks_deployment_workflow.html On the other hand, I believe DABs work pretty similarly to dbx in the background, so I would hope this is not the issue.
m
You're using an older runtime (13.0 or below); you may need to add "experimental: python_wheel_wrapper: true" to the top of your databricks.yml file
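That flag goes at the top level of the bundle config, something like this (a sketch; the bundle name is illustrative):

```yaml
# databricks.yml (top level) -- bundle name is illustrative
bundle:
  name: wine_model_kedro

# Needed on runtimes below 13.1, which lack native Python wheel support
experimental:
  python_wheel_wrapper: true
```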
j
Thanks @Michał Madej, I tried the "experimental: python_wheel_wrapper: true" proposed in the link, but it didn't work. I upgraded to
spark_version: 13.3.x-cpu-ml-scala2.12
and I get the (new) error shown in the attachment; I made screenshots of the full trace (even more cryptic). Any ideas? Thanks for your quick help so far, guys!
d
have you packaged your
spark.yaml
?
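For context, in a Kedro Spark setup the conf folder usually carries a spark.yml that the hooks read at startup; a minimal sketch (these keys are illustrative, not the poster's config):

```yaml
# conf/base/spark.yml -- minimal sketch; keys are illustrative
spark.driver.maxResultSize: 3g
spark.sql.execution.arrow.pyspark.enabled: true
```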
j
not really... and I wasn't aware from the instructions that I had to. I believe you mean
databricks.yaml
from the DABs? If so, I think that one needs to stay in the root of the repo.
d
value cannot be null for spark.app.name
so the Spark app name in Kedro is usually defined in the SparkHooks
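The usual pattern from the Kedro docs looks roughly like this (a sketch; the poster's hook may differ, and it assumes a "spark" config pattern is registered in settings.py):

```python
# hooks.py -- typical Kedro SparkHooks, per the Kedro docs (sketch)
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from conf/base/spark.yml."""
        # If spark.yml is missing from the packaged conf, these
        # parameters come back empty and Spark settings such as the
        # app name never get populated.
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```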
m
I don't know about spark, but DAB uploads your configuration directory to this location
/Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
so try using that path in
parameters: ["--conf-source", "here", ...]
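In the bundle config that would look something like the sketch below; the job name, task key, entry point, and the files/conf suffix are assumptions about a standard bundle layout, not confirmed by this thread:

```yaml
# databricks.yml -- job task sketch; names and the files/conf suffix
# are assumptions about a standard bundle layout
resources:
  jobs:
    wine_model_job:
      tasks:
        - task_key: run_pipeline
          python_wheel_task:
            package_name: wine_model_kedro
            entry_point: databricks_run
            parameters:
              - "--conf-source"
              - "/Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}/files/conf"
              - "--package-name"
              - "wine_model_kedro"
```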
j
Hi all, I just wanted to share the solution that I found in the end. The issue was that the ConfigLoader was trying to read the config folder from the default location at the project root; this happened in an mlflow hook in
hooks.py
. Because the project path resolves to the
databricks/driver
folder in Databricks jobs, it was failing there. The solution was to point the ConfigLoader to the config folder in DBFS, where I had copied it in order to run the pipeline. After that it works! I hope that, as Kedro develops and grows, deployment as Databricks jobs gets better and smoother!
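In code, the fix amounts to something like this (a sketch, not the poster's exact hook; the DBFS path is the one from this thread, the class name is hypothetical):

```python
# hooks.py -- sketch of the fix: build the loader against the conf
# copy in DBFS instead of the default <project_path>/conf
from kedro.config import OmegaConfigLoader
from kedro.framework.hooks import hook_impl


class MlflowHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # By default the loader resolves conf relative to the project
        # path, which on a Databricks job driver is /databricks/driver,
        # hence the /databricks/driver/conf/base error.
        config_loader = OmegaConfigLoader(
            conf_source="/dbfs/FileStore/wine_model_kedro/conf",
            # "mlflow" is not a default config pattern, so register it
            config_patterns={"mlflow": ["mlflow*", "mlflow*/**"]},
        )
        mlflow_config = config_loader["mlflow"]
        # ... use mlflow_config to set the tracking URI, experiment, etc.
```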
d
That’s super helpful, thanks for the update
I’d really like to think about how we could provide a better error message
j
I think I just had to look at the error message properly to see that it was jumping into the hook and failing there. The different layers made it a bit distracting to catch. In the end this was true the whole time:
ValueError: Given configuration path either does not exist or is not a valid directory: /databricks/driver/conf/base
It's just that seeing it fail in that directory "felt" complex, but in the trace you could see that it was going through the hooks. It's a bit of extra Databricks complexity that didn't help.
d
and for reference what was the correct directory?
j
I used this location in DBFS, copying the config folder there so it could be read:
/dbfs/FileStore/wine_model_kedro/conf/
👍 1
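For anyone following along, the copy itself can be done from a Databricks notebook with dbutils, along these lines (the source path is illustrative; the destination is the one above):

```python
# Run in a Databricks notebook (dbutils is a notebook built-in):
# copy the project's conf folder into DBFS so the job can read it.
dbutils.fs.cp(
    "file:/Workspace/Repos/<user>/wine_model_kedro/conf",  # illustrative source
    "dbfs:/FileStore/wine_model_kedro/conf",
    recurse=True,
)
```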