Kasper Janehag
09/15/2022, 9:42 AM (Kedro 0.17.7)
Hi! I have some problems with running Kedro with a self-hosted Hadoop cluster. As part of a pipeline, I have a transcoded registered dataset table@pandas and a table@spark, with the following settings:
...table@pandas:
  type: "${datasets.parquet}"
  filepath: "${base_path_spark}/…/master_table"
..._table@spark:
  <<: *pq
  filepath: "${base_path_spark}/…/master_table"
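For context, this is roughly what the transcoding pair looks like with the templated values and the YAML anchor written out. The dataset name, the HDFS path, and the spark.SparkDataSet type are assumptions for illustration (the real @spark type is hidden behind the *pq anchor):

```yaml
# hypothetical expansion — names, path, and types are assumed, not from the real catalog
master_table@pandas:
  type: pandas.ParquetDataSet
  filepath: hdfs://namenode:8020/data/master_table

master_table@spark:
  type: spark.SparkDataSet
  file_format: parquet
  filepath: hdfs://namenode:8020/data/master_table
```

With transcoding, both entries must point at the same physical file; Kedro treats everything before the @ as one logical dataset, so the two nodes below are wired together through it.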
The base_path_spark is an HDFS location. These are then used in a pipeline in the following manner.
spark_to_pandas = pipeline(
    pipe=Pipeline(
        [
            node(
                func=spark_utils.to_pandas,
                …
                outputs=f"..._table@spark",
            )
        ]
    )
)
data_cleaning = pipeline(
    pipe=Pipeline(
        [
            node(
                func=enforce_schema_using_dict,
                inputs={
                    "data": f"..._table@pandas",
                },
                …
            )
        ]
    )
)
The data_cleaning node is supposed to pick up the output from the spark_to_pandas node via the transcoded dataset. However, a DataSetError is raised with the following message:
Exception has occurred: DataSetError
[Errno 2] No such file or directory: 'hadoop': 'hadoop'
Failed to instantiate Dataset 'telco_churn.master_table@pandas' of type 'kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet'.
If we remove the transcoding in the DataCatalog and register the datasets as individual entries, the error disappears.
Does anyone know how to proceed from this kind of error? Could it be related to the client-specific Hadoop environment? How can we go about troubleshooting this?