Kedro #questions

Nik Linnane

02/07/2024, 7:41 PM

hi team! i have packaged my kedro project using the databricks-cli to be deployed as a databricks job (my first time doing this). after running the task it failed with this error

DatasetError: An exception occurred when parsing config for dataset 'my_snowflake_table':
No module named 'snowflake'. Please see the documentation on how to install relevant dependencies for kedro_datasets.snowflake.SnowparkTableDataset:
<https://docs.kedro.org/en/stable/kedro_project_setup/dependencies.html#install-dependencies-related-to-the-data-catalog>

the pipeline runs fine locally as i have the following installed

kedro-datasets
snowflake-connector-python
snowflake-snowpark-python

any insight on this is appreciated!

Nik Linnane

02/07/2024, 10:07 PM

prior to packaging the kedro project do i need to update the file paths in the catalog to reference where they're stored within dbfs?

Ahdra Merali

02/08/2024, 9:41 AM

It looks like the dependencies needed weren't installed correctly - did you include them in the project's requirements.txt file before packaging?

Nik Linnane

02/08/2024, 3:45 PM

no, i didn't do that. is there a specific command i should use to generate that requirements.txt file, besides pip freeze?

Nik Linnane

02/08/2024, 5:02 PM

@Ahdra Merali that seems to have worked. the job started, but it ran for 45+ mins before i canceled it (it should not take nearly that long), and the output showed this:

As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
<https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice> 
Do you opt into usage analytics?  [y/N]:

i think it got stuck here before actually running the pipeline. how can i avoid this?

Nik Linnane

02/08/2024, 5:21 PM

looks like someone already asked this in the questions channel. i just removed kedro-telemetry from the requirements, so will see if that works

Ahdra Merali

02/08/2024, 5:27 PM

Hi Nik, you can skip the telemetry prompt by including a .telemetry file in your project root with the following line:

consent: <true/false>
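
For reference, a minimal sketch of creating that file from Python before packaging. It assumes you pass the project root explicitly and want to opt out by default; the helper name `write_telemetry_consent` is hypothetical and not part of Kedro.

```python
from pathlib import Path


def write_telemetry_consent(project_root: str, consent: bool = False) -> Path:
    """Write a .telemetry file with a single consent line into project_root.

    Hypothetical helper for illustration; it only writes the line shown above.
    """
    telemetry_file = Path(project_root) / ".telemetry"
    # YAML booleans are lowercase, so this writes "consent: true" or "consent: false"
    telemetry_file.write_text(f"consent: {str(consent).lower()}\n")
    return telemetry_file
```

Calling `write_telemetry_consent(".")` from the project root would produce a `.telemetry` file containing `consent: false`.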

👍🏼 1

Ahdra Merali

02/08/2024, 5:37 PM

As for the requirements, it might not be necessary to include everything in your pip freeze, but as a rule of thumb, any datasets you use in the catalog should be added to the requirements. This is what it would look like for SnowparkTableDataset:

kedro-datasets[snowflake.SnowparkTableDataset]>=1.0
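
One way to catch a missing dependency like this before the job runs is to check that the modules behind your catalog entries can be imported in the target environment. A rough sketch, assuming you know which module names your datasets need; `find_missing_modules` is a hypothetical helper, and `snowflake.snowpark` in the usage comment is an assumption about what SnowparkTableDataset imports (it is the module that the snowflake-snowpark-python package provides).

```python
import importlib


def find_missing_modules(module_names):
    """Return the subset of module_names that cannot be imported here."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError, so a
            # "No module named 'snowflake'" failure is caught here too.
            missing.append(name)
    return missing


# Example usage (module names are assumptions, adjust to your catalog):
# print(find_missing_modules(["kedro_datasets", "snowflake.snowpark"]))
```

An empty list means every listed module imported successfully; anything returned points at a package that is probably missing from the requirements.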

Nik Linnane

02/08/2024, 9:44 PM

got it running - appreciate the help 🙂

np 1

Nik Linnane

02/09/2024, 3:07 PM

@Ahdra Merali do i need to add anything to the catalog datasets in order for them to be saved to DBFS after each run? i only use 1-2 memory datasets, everything else is a parquet or csv and i thought they'd be saved to DBFS after running but nothing is in the data folders

Ahdra Merali

02/09/2024, 3:23 PM

Could you show me your catalog entries? Is it that the datasets aren't being saved at all, or that they are but in an unexpected location?

Ahdra Merali

02/09/2024, 3:26 PM

This section on uploading data when deploying packaged projects as a Databricks job might also be useful

Nik Linnane

02/09/2024, 3:42 PM

here is an example of one. they just aren't being saved at all, so i'm sure i just left something out

my_df:
  type: pandas.ParquetDataset
  filepath: data/03_preprocessed/my_df.parquet
  metadata:
    kedro-viz:
      layer: preprocessed

Ahdra Merali

02/09/2024, 3:47 PM

You'll need to set up your catalog to save the data on DBFS. Here is an example catalog from the databricks-iris starter project that shows you how to do this. Remember to point your conf-source to the right directory as well.

Nik Linnane

02/09/2024, 3:48 PM

got it - so something like this

filepath: /dbfs/FileStore/my_product/data/03_preprocessed/my_df.parquet
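
If many catalog entries need the same change, the mapping from a project-relative path to a DBFS path can be sketched as below. This is only an illustration: `to_dbfs_path` is a hypothetical helper, and the default root `/dbfs/FileStore/my_product` is simply taken from the example above.

```python
import posixpath


def to_dbfs_path(relative_path: str, dbfs_root: str = "/dbfs/FileStore/my_product") -> str:
    """Prefix a project-relative data path with a DBFS root (hypothetical helper)."""
    # Strip any leading slash so the join does not discard dbfs_root
    return posixpath.join(dbfs_root, relative_path.lstrip("/"))


# Example usage with the catalog entry from this thread:
# to_dbfs_path("data/03_preprocessed/my_df.parquet")
```

The result for the `my_df` entry matches the filepath shown above.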

👍 1
