# questions
Sebastian:
Hi all. I'm trying to set up the version option for a SparkDataSet in the Catalog, but I get the following error when the node tries to save the dataset as a .parquet file in Google Cloud Storage:
```
VersionNotFoundError: Did not find any versions for SparkDataSet(file_format=parquet,
filepath=gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet,
load_args={'header': True, 'inferSchema': True}, save_args={}, version=Version(load=None,
save='2023-03-10T23.44.07.085Z'))
```
In the catalog.yml I have this:
```yaml
master_model_input:
    type: spark.SparkDataSet
    filepath: gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet  # Cloud Storage gs:// URI
    file_format: parquet
    layer: model_input
    versioned: True
    load_args:
        header: True
        inferSchema: True
```
However, the parquet file is generated correctly in GCS (see the image attached). Thanks for your help! 🙂
Jannic Holzer:
Hey Sebastian, unfortunately this is a known error, caused by the way that `SparkDataSet` resolves file paths. It does not use `fsspec` in the same way that other datasets do, which leads to these difficulties. We have it on our radar to fix this, and I'll make sure a fix gets prioritised.
You can track progress under this ticket: https://github.com/kedro-org/kedro-plugins/issues/117
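In the meantime, one temporary workaround is to turn off versioning for that entry so Kedro skips version resolution entirely. Here's a minimal sketch of the catalog entry, assuming you can live without versioned output until the fix lands (the path is the one from your message):

```yaml
master_model_input:
    type: spark.SparkDataSet
    filepath: gs://bdb-gcp-cds-pr-ac-ba-analitica-avanzada/banca-masiva/599_profundizacion/data/05_model_input/master_model_input.parquet
    file_format: parquet
    layer: model_input
    versioned: False  # skip Kedro's version lookup until the path-resolution issue is fixed
    load_args:
        header: True
        inferSchema: True
```

With `versioned: False`, the dataset reads from and writes to the fixed path above rather than timestamped subfolders, so the VersionNotFoundError no longer applies.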
Sebastian:
Thanks @Jannic Holzer!!