#questions

Nik Linnane

02/07/2024, 7:41 PM
hi team! i have packaged my kedro project using the databricks-cli to be deployed as a databricks job (my first time doing this). after running the task it failed with this error
DatasetError: An exception occurred when parsing config for dataset 'my_snowflake_table':
No module named 'snowflake'. Please see the documentation on how to install relevant dependencies for kedro_datasets.snowflake.SnowparkTableDataset:
<https://docs.kedro.org/en/stable/kedro_project_setup/dependencies.html#install-dependencies-related-to-the-data-catalog>
the pipeline runs fine locally as i have the following installed
kedro-datasets
snowflake-connector-python
snowflake-snowpark-python
any insight on this is appreciated!
prior to packaging the kedro project do i need to update the file paths in the catalog to reference where they're stored within dbfs?

Ahdra Merali

02/08/2024, 9:41 AM
It looks like the dependencies needed weren't installed correctly - did you include them in the project's requirements.txt file before packaging?

Nik Linnane

02/08/2024, 3:45 PM
no i didn't do that. is there a specific command i should use to generate that requirements.txt file, besides pip freeze?
@Ahdra Merali that seems to have worked. although the job started it ran for 45+ mins before i canceled it (should not take nearly that long) and the output showed this
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
<https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice> 
Do you opt into usage analytics?  [y/N]:
i think it got stuck here before actually running the pipeline. how can i avoid this?
looks like someone already asked this in the questions channel, just removed kedro-telemetry from requirements so will see if that works

Ahdra Merali

02/08/2024, 5:27 PM
Hi Nik, you can skip the telemetry prompt by adding a .telemetry file to your project root that contains the following line:
consent: <true/false>
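A minimal sketch of creating that file from a script before packaging. The helper name `write_telemetry_consent` and the temporary directory standing in for the project root are assumptions for illustration; only the file name and the `consent:` line come from the message above.

```python
import tempfile
from pathlib import Path


def write_telemetry_consent(project_root, consent):
    """Create <project_root>/.telemetry containing a single consent line."""
    telemetry_file = Path(project_root) / ".telemetry"
    # Write lowercase "true"/"false" rather than Python's "True"/"False".
    telemetry_file.write_text(f"consent: {str(consent).lower()}\n")
    return telemetry_file


# Demo against a throwaway directory; point this at the real project root instead.
with tempfile.TemporaryDirectory() as tmp:
    path = write_telemetry_consent(tmp, consent=False)
    print(path.read_text())
```

With the file in place, the opt-in prompt should no longer block a non-interactive run such as a Databricks job.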
As for the requirements, you may not need to include everything from your pip freeze output. As a rule of thumb, though, any dataset you use in the catalog should be added to the requirements. This is what it would look like for SnowparkTableDataset:
kedro-datasets[snowflake.SnowparkTableDataset]>=1.0
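One way to catch a missing dependency before the job runs is to check that the modules the catalog relies on can be found in the target environment. This is a rough sketch: the helper `missing_modules` and the contents of `REQUIRED_MODULES` are assumptions for illustration (only `snowflake` is taken from the error message earlier in the thread).

```python
import importlib.util

# Top-level modules the catalog relies on. "snowflake" comes from the error
# message in this thread; the rest of the list is an assumption for this example.
REQUIRED_MODULES = ["kedro_datasets", "snowflake"]


def missing_modules(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# An empty list means every module was found.
print(missing_modules(REQUIRED_MODULES))
```

Running this in the environment the job uses, rather than locally, is what makes the check meaningful, because the local environment already had the Snowflake packages installed.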

Nik Linnane

02/08/2024, 9:44 PM
got it running - appreciate the help 🙂
@Ahdra Merali do i need to add anything to the catalog datasets in order for them to be saved to DBFS after each run? i only use 1-2 memory datasets, everything else is a parquet or csv and i thought they'd be saved to DBFS after running but nothing is in the data folders

Ahdra Merali

02/09/2024, 3:23 PM
Could you show me your catalog entries? Is it that the datasets aren't being saved at all, or that they are but in an unexpected location?
This section on uploading data when deploying packaged projects as a Databricks job might also be useful

Nik Linnane

02/09/2024, 3:42 PM
here is an example of one. they just aren't being saved at all, so i'm sure i left something out
my_df:
  type: pandas.ParquetDataset
  filepath: data/03_preprocessed/my_df.parquet
  metadata:
    kedro-viz:
      layer: preprocessed

Ahdra Merali

02/09/2024, 3:47 PM
You'll need to set up your catalog to save the data on DBFS. Here is an example catalog from the databricks-iris starter project that shows you how to do this. Remember to point your conf-source to the right directory as well.

Nik Linnane

02/09/2024, 3:48 PM
got it - so something like this
filepath: /dbfs/FileStore/my_product/data/03_preprocessed/my_df.parquet
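To apply the same change to every file-based entry, the mapping from a project-relative filepath to its DBFS location can be sketched as below. The helper `to_dbfs_path` is hypothetical, and the root is simply the prefix from the filepath above.

```python
from pathlib import PurePosixPath

# Root taken from the filepath in the message above.
DBFS_ROOT = PurePosixPath("/dbfs/FileStore/my_product")


def to_dbfs_path(relative_filepath, root=DBFS_ROOT):
    """Prefix a project-relative catalog filepath with the DBFS root."""
    relative = PurePosixPath(relative_filepath)
    if relative.is_absolute():
        raise ValueError(f"expected a relative path, got {relative_filepath!r}")
    return str(root / relative)


print(to_dbfs_path("data/03_preprocessed/my_df.parquet"))
# -> /dbfs/FileStore/my_product/data/03_preprocessed/my_df.parquet
```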