Akshay
02/14/2023, 5:03 AM
/mnt/testmount/data/05_model_input/partitions
Details:
I am running Kedro pipelines in an Azure Databricks notebook. There are 4 pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformation, and write the data back to ADLS.
The third pipeline, 'optimize', takes a Spark dataset as input and generates 2 outputs: a PartitionedDataSet and a transformed pandas DataFrame.
Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/partitions

model_input@pandas:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *pandas_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/model_data
Note: the pipeline works fine when run in a local environment.
Kedro = 0.18.3
Python = 3.8.10
Cluster = Spark 3.2.1

Nok Lam Chan
02/14/2023, 5:46 AM
spark_parquet_partitioned?
Is there data already in /mnt/testmount/data/05_model_input/partitions, or does some pipeline generate the partitions there?
So mounting ADLS is working fine for you. Could you change the SparkDataSet to something non-Spark? I am curious whether this is a PartitionedDataSet problem or a Spark problem.

Akshay
02/14/2023, 6:01 AM
_spark_parquet_partitioned: &spark_parquet_partitioned
  type: spark.SparkDataSet
  file_format: parquet
  save_args:
    mode: overwrite

_pandas_parquet_partitioned: &pandas_parquet_partitioned
  type: pandas.ParquetDataSet
  save_args:
    index: False
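(As a quick sanity check on the anchor syntax: the `<<:` merge key only resolves correctly when it sits on its own indented line beneath `dataset:`. A minimal sketch with PyYAML, reusing the catalog fragments from this thread purely to show how the merge resolves; it does not touch Kedro or Spark:)

```python
import yaml

# Minimal reproduction of the catalog fragments from this thread,
# showing how the &anchor / <<: merge-key combination resolves.
catalog_yaml = """
_spark_parquet_partitioned: &spark_parquet_partitioned
  type: spark.SparkDataSet
  file_format: parquet
  save_args:
    mode: overwrite

Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  path: /mnt/testmount/data/05_model_input/partitions
"""

catalog = yaml.safe_load(catalog_yaml)
entry = catalog["Optimize.partition@spark"]

# The nested dataset mapping inherits every key from the anchor.
print(entry["dataset"]["type"])         # spark.SparkDataSet
print(entry["dataset"]["file_format"])  # parquet
```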
@Nok Lam Chan Thanks for your reply.
This 'optimize' pipeline has two nodes: the first node writes the partitioned data to ADLS, so the partitions are present in the ADLS location. The second node takes this partitioned data and converts it into pandas DataFrames. This second node is failing.
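(For context on what that failing second node receives: when a node takes a PartitionedDataSet as input, Kedro passes a dict mapping partition ids to zero-argument load functions, and the node must call each one. A hedged sketch of that pattern; the toy loaders returning lists are stand-ins for the real loaders, which would return pandas DataFrames to be combined with pd.concat:)

```python
def combine_partitions(partitions):
    """Combine a PartitionedDataSet-style dict of {partition_id: load_func}.

    Kedro passes PartitionedDataSet inputs to a node in exactly this shape;
    here the loaders return plain lists so the sketch runs without Spark
    or pandas installed.
    """
    combined = []
    for partition_id in sorted(partitions):  # deterministic partition order
        data = partitions[partition_id]()    # each value is a callable
        combined.extend(data)
    return combined

# Stand-in loaders; in the real pipeline each would read one parquet partition.
fake_partitions = {
    "part-00000": lambda: [1, 2],
    "part-00001": lambda: [3, 4],
}
print(combine_partitions(fake_partitions))  # [1, 2, 3, 4]
```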