# questions
Sebastian:
Hi all. I'm trying to set up the version option for a SparkDataSet in the Catalog, but I get the following error when the node tries to save the dataset as a .parquet file in Google Cloud Storage:
```
VersionNotFoundError: Did not find any versions for SparkDataSet(file_format=parquet,
filepath=gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet,
load_args={'header': True, 'inferSchema': True}, save_args={}, version=Version(load=None,
save='2023-03-10T23.44.07.085Z'))
```
In the catalog.yml I have this:
```yaml
master_model_input:
    type: spark.SparkDataSet
    filepath: gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet  # Cloud Storage gs:// URI
    file_format: parquet
    layer: model_input
    versioned: True
    load_args:
        header: True
        inferSchema: True
```
However, the parquet file is generated correctly in GCS (see the image attached). Thanks for your help! 🙂
Jannic Holzer:
Hey Sebastian, unfortunately this is a known error, caused by the way that `SparkDataSet` resolves file paths. It does not use `fsspec` in the same way that other datasets do, which leads to these difficulties. We have it on our radar to fix this, and I'll make sure a fix gets prioritised.
You can track progress under this ticket: https://github.com/kedro-org/kedro-plugins/issues/117
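In the meantime, one temporary workaround is to turn off versioning for that entry so Kedro skips version resolution entirely. Here's a minimal sketch of the catalog entry, assuming you can live without versioned output until the fix lands (the path is the one from your message):

```yaml
master_model_input:
    type: spark.SparkDataSet
    filepath: gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet
    file_format: parquet
    layer: model_input
    versioned: False  # skip Kedro's version lookup until the path-resolution issue is fixed
    load_args:
        header: True
        inferSchema: True
```

With `versioned: False`, the dataset reads from and writes to the fixed path above rather than timestamped subfolders, so the VersionNotFoundError no longer applies.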
Sebastian:
Thanks @Jannic Holzer!!