# questions
(`0.17.7`). Hi! I have some problems with running Kedro on a self-hosted Hadoop cluster. As part of a pipeline, I have a transcoded dataset registered as `table@pandas` and `table@spark`, with the following settings.
```yaml
..._table@pandas:
  type: "${datasets.parquet}"
  filepath: "${base_path_spark}/…/master_table"

..._table@spark:
  <<: *pq
  filepath: "${base_path_spark}/…/master_table"
```
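For completeness, the `<<: *pq` merge key refers to a YAML anchor defined elsewhere in our catalog; the entry itself is elided above, but it is along these lines (a sketch only; the `_pq` name and the exact `SparkDataSet` options are illustrative):

```yaml
# Sketch of the anchor that "<<: *pq" merges in; the real entry is
# elided above. Illustrative only: a SparkDataSet reading/writing
# Parquet, i.e. the usual @spark half of a transcoded dataset.
_pq: &pq
  type: spark.SparkDataSet
  file_format: parquet
```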
The `base_path_spark` is an HDFS location. These are then used in a pipeline in the following manner.
```python
from kedro.pipeline import Pipeline, node, pipeline

spark_to_pandas = pipeline(
    pipe=Pipeline(
        [
            node(
                func=spark_utils.to_pandas,
                …
                outputs=f"..._table@spark",
            )
        ]
    )
)

data_cleaning = pipeline(
    pipe=Pipeline(
        [
            node(
                func=enforce_schema_using_dict,
                inputs={
                    "data": f"..._table@pandas",
                },
                …
            )
        ]
    )
)
```
The `data_cleaning` node is supposed to pick up the output of the `spark_to_pandas` node via the transcoded dataset. However, a `DataSetError` is raised with the following message:
```
Exception has occurred: DataSetError
[Errno 2] No such file or directory: 'hadoop': 'hadoop'
Failed to instantiate Dataset 'telco_churn.master_table@pandas' of type 'kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet'.
```
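The `[Errno 2]` part looks like something is shelling out to a `hadoop` executable that is not on the `PATH` of the process. If that reading is correct, the failure should be reproducible outside Kedro entirely; a minimal sketch, assuming `ParquetDataSet` resolves the HDFS filepath through fsspec and pyarrow:

```python
# Minimal reproduction sketch, outside Kedro. Assumption: for an
# HDFS filepath, ParquetDataSet builds its filesystem via fsspec,
# whose "hdfs" implementation is backed by pyarrow; pyarrow locates
# its CLASSPATH by running the `hadoop` CLI. If that binary is not
# on PATH, this raises the same
# FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
import fsspec

fs = fsspec.filesystem("hdfs")
print(fs.ls("/"))
```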
If we remove the transcoding in the DataCatalog and register the datasets as two individual entries, the error disappears. Does anyone know how to proceed from this kind of error? Could it be related to the client-specific Hadoop environment? How can we proceed with troubleshooting?
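For reference, the non-transcoded workaround looks roughly like this (entry names are illustrative; ours are elided above):

```yaml
# Workaround: two independent catalog entries instead of one
# transcoded dataset. With this layout the DataSetError no longer
# occurs. The names master_table_spark / master_table_pandas are
# illustrative only.
master_table_spark:
  <<: *pq
  filepath: "${base_path_spark}/…/master_table"

master_table_pandas:
  type: "${datasets.parquet}"
  filepath: "${base_path_spark}/…/master_table"
```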