Akshay02/14/2023, 5:03 AM
Details -- I am running Kedro pipelines on an Azure Databricks notebook. There are four pipelines in the project. The first two, Parse and Clean, work fine: they read the raw data from ADLS, do the transformations, and write the data back to ADLS. The third pipeline, 'optimize', takes a Spark dataset as input and generates two outputs: a PartitionedDataset and a transformed Pandas DataFrame.
Note -- the pipeline works fine when run in the local environment. Kedro = 0.18.3, Python = 3.8.10, Cluster = Spark 3.2.1
```yaml
Optimize.partition@spark:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *spark_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/partitions

model_input@pandas:
  type: kedro.io.PartitionedDataSet
  dataset:
    <<: *pandas_parquet_partitioned
  load_args:
    maxdepth: 1
    withdirs: True
  layer: Data Transformation
  path: /mnt/testmount/data/05_model_input/model_data
```
Nok Lam Chan02/14/2023, 5:46 AM
Is there data already in that path, or does some pipeline generate the partitions there? So mounting ADLS is working fine for you. Could you change the SparkDataSet to something non-Spark? I am curious whether this is a dataset problem or a Spark problem.
Akshay02/14/2023, 6:01 AM
@Nok Lam Chan Thanks for your reply. This 'optimize' pipeline has two nodes: the first node writes the partitioned data to ADLS, so the partitions are present in the ADLS location. The second node takes this partitioned data and converts it into Pandas DataFrames. That second node is the one failing.
```yaml
_spark_parquet_partitioned: &spark_parquet_partitioned
  type: spark.SparkDataSet
  file_format: parquet
  save_args:
    mode: overwrite

_pandas_parquet_partitioned: &pandas_parquet_partitioned
  type: pandas.ParquetDataSet
  save_args:
    index: False
```
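For context on what the failing second node receives: for a `PartitionedDataSet` input, Kedro injects a dict mapping partition IDs to load callables, and the node is expected to call each callable itself. A minimal sketch of that pattern follows; the function name `combine_partitions` and the stand-in data are illustrative (not from this thread), and plain lists are used in place of DataFrames so the sketch runs without pandas.

```python
# Hedged sketch of a node consuming a Kedro PartitionedDataSet input.
# Kedro passes such a node a dict of {partition_id: load_callable};
# each callable reads one partition (here, one parquet file) when invoked.

def combine_partitions(partitions):
    """Load every partition and return the results in partition-id order.

    `partitions` mimics what Kedro injects for a PartitionedDataSet:
    a mapping from partition id to a zero-argument load function.
    """
    results = []
    for partition_id in sorted(partitions):
        load = partitions[partition_id]
        results.append(load())  # lazily loads this partition's data
    return results  # with pandas, a real node would pd.concat(results)

# Stand-in for Kedro's injected dict, using lists instead of DataFrames:
fake_partitions = {
    "part-0": lambda: [1, 2],
    "part-1": lambda: [3, 4],
}
print(combine_partitions(fake_partitions))  # [[1, 2], [3, 4]]
```

If the node instead assumes it receives a single already-loaded DataFrame, it will fail exactly as described when run against a `PartitionedDataSet`.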