# questions
a
Hello everyone, I am seeing an issue with a PartitionedDataSet in a Kedro pipeline when running on an Azure Databricks notebook. It throws the error `DataSetError: No partitions found in '/mnt/testmount/data/05_model_input/partitions'`. ADLS has been mounted at `/mnt/testmount/`, and partitions are getting created at
/mnt/testmount/data/05_model_input/partitions
Details: I am running Kedro pipelines on an Azure Databricks notebook. There are four pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformation, and write the data back to ADLS. The third pipeline, 'optimize', takes a Spark dataset as input and generates two outputs: a PartitionedDataSet and a transformed pandas DataFrame.
```yaml
Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/partitions

model_input@pandas:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *pandas_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/model_data
```
Note: the pipeline works fine when run in a local environment. Kedro = 0.18.3, Python = 3.8.10, cluster = Spark 3.2.1.
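For context, `PartitionedDataSet` raises this `DataSetError` whenever its file listing for `path` comes back empty, so the question is what the (non-Spark) local filesystem actually sees at that location on the cluster. Here is a minimal check run from the notebook, assuming the cluster exposes the DBFS FUSE mount under `/dbfs/...` (that prefix is an assumption about the Databricks setup, not something from my catalog):

```python
import glob
import os

# Path exactly as configured in the catalog (what fsspec's local
# filesystem implementation will be given).
catalog_path = "/mnt/testmount/data/05_model_input/partitions"

# On Databricks, non-Spark file APIs typically see DBFS mounts under
# /dbfs/... -- this prefix is an assumption about the cluster setup.
fuse_path = "/dbfs" + catalog_path

for path in (catalog_path, fuse_path):
    print(path, "exists:", os.path.exists(path))
    # First few entries a glob over that directory would find, if any.
    print(glob.glob(os.path.join(path, "*"))[:5])
```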
n
Could you also share what is defined in `spark_parquet_partitioned`? Is there data already in `/mnt/testmount/data/05_model_input/partitions`, or does some pipeline generate the partitions there? So mounting ADLS is working fine for you. Could you change the SparkDataSet to something non-Spark? I am curious whether this is a `PartitionedDataSet` problem or a Spark problem.
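If it helps, a standalone notebook cell like this (a rough sketch, reusing the same path and `load_args` as your catalog entry but with a non-Spark underlying dataset) should separate `PartitionedDataSet` behaviour from the rest of the pipeline:

```python
from kedro.io import PartitionedDataSet

# Same path and load_args as the catalog entry, but with a non-Spark
# underlying dataset, to test whether listing partitions works at all.
ds = PartitionedDataSet(
    path="/mnt/testmount/data/05_model_input/partitions",
    dataset="pandas.ParquetDataSet",
    load_args={"maxdepth": 1, "withdirs": True},
)

partitions = ds.load()  # dict of partition id -> load callable
print(sorted(partitions))
```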
a
```yaml
_spark_parquet_partitioned: &spark_parquet_partitioned
  type: spark.SparkDataSet
  file_format: parquet
  save_args:
    mode: overwrite

_pandas_parquet_partitioned: &pandas_parquet_partitioned
  type: pandas.ParquetDataSet
  save_args:
    index: False
```
@Nok Lam Chan Thanks for your reply. This pipeline 'optimize' has two nodes: the first node writes the partitioned data to ADLS, so the partitions are present in the ADLS location. The second node of the pipeline takes this partitioned data and converts it into pandas DataFrames. This node is the one failing.
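For anyone reading along, the failing second node would look roughly like this (a sketch; the function name is made up, but the signature reflects how Kedro hands a `PartitionedDataSet` input to a node: a dict mapping partition ids to load callables):

```python
from typing import Any, Callable, Dict

import pandas as pd


def combine_partitions(partitions: Dict[str, Callable[[], Any]]) -> pd.DataFrame:
    """Load each partition and concatenate into a single pandas DataFrame."""
    frames = [load() for _, load in sorted(partitions.items())]
    return pd.concat(frames, ignore_index=True)
```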