Akshay02/14/2023, 5:03 AM
Details -- I am running Kedro pipelines on an Azure Databricks notebook. There are four pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformations, and write the data back to ADLS. The third pipeline, 'optimize', takes a Spark dataset as input and generates two outputs: a PartitionedDataset and a transformed Pandas DataFrame.
Note -- the pipeline works fine when run in the local environment. Kedro = 0.18.3, Python = 3.8.10, Cluster = Spark 3.2.1
```yaml
Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/partitions

model_input@pandas:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *pandas_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/model_data
```
Nok Lam Chan02/14/2023, 5:46 AM
Is there data already in that path, or does some pipeline generate the partitions there? So mounting ADLS is working fine for you. Could you change the SparkDataSet to something non-Spark? I am curious whether this is a dataset problem or a Spark problem.
Akshay02/14/2023, 6:01 AM
@Nok Lam Chan Thanks for your reply. This 'optimize' pipeline has two nodes: the first node writes the partitioned data to ADLS, so the partitions are present in the ADLS location. The second node takes this partitioned data and converts it into Pandas DataFrames. That second node is the one failing.
```yaml
_spark_parquet_partitioned: &spark_parquet_partitioned
  type: spark.SparkDataSet
  file_format: parquet
  save_args:
    mode: overwrite

_pandas_parquet_partitioned: &pandas_parquet_partitioned
  type: pandas.ParquetDataSet
  save_args:
    index: False
```
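For context on what the failing second node receives: for a `PartitionedDataSet` input, Kedro injects a dict mapping partition IDs to load callables, and the node is expected to call each callable itself. A minimal sketch of that pattern follows; the function name `combine_partitions` and the stand-in data are illustrative (not from this thread), and plain lists are used in place of DataFrames so the sketch runs without pandas.

```python
# Hedged sketch of a node consuming a Kedro PartitionedDataSet input.
# Kedro passes such a node a dict of {partition_id: load_callable};
# each callable reads one partition (here, one parquet file) when invoked.

def combine_partitions(partitions):
    """Load every partition and return the results in partition-id order.

    `partitions` mimics what Kedro injects for a PartitionedDataSet:
    a mapping from partition id to a zero-argument load function.
    """
    results = []
    for partition_id in sorted(partitions):
        load = partitions[partition_id]
        results.append(load())  # lazily loads this partition's data
    return results  # with pandas, a real node would pd.concat(results)

# Stand-in for Kedro's injected dict, using lists instead of DataFrames:
fake_partitions = {
    "part-0": lambda: [1, 2],
    "part-1": lambda: [3, 4],
}
print(combine_partitions(fake_partitions))  # [[1, 2], [3, 4]]
```

If the node instead assumes it receives a single already-loaded DataFrame, it will fail exactly as described when run against a `PartitionedDataSet`.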