Kasper Janehag
09/15/2022, 9:42 AM (Kedro 0.17.7)
Hi! I have some problems with running Kedro with a self-hosted Hadoop cluster. As part of a pipeline, I have a transcoded registered dataset table@pandas and a table@spark, with the following settings:
...table@pandas:
  type: "${datasets.parquet}"
  filepath: "${base_path_spark}/…/master_table"
..._table@spark:
  <<: *pq
  filepath: "${base_path_spark}/…/master_table"
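For context, this is roughly what the transcoding pair looks like with the templated values and the YAML anchor written out. The dataset name, the HDFS path, and the spark.SparkDataSet type are assumptions for illustration (the real @spark type is hidden behind the *pq anchor):

```yaml
# hypothetical expansion — names, path, and types are assumed, not from the real catalog
master_table@pandas:
  type: pandas.ParquetDataSet
  filepath: hdfs://namenode:8020/data/master_table

master_table@spark:
  type: spark.SparkDataSet
  file_format: parquet
  filepath: hdfs://namenode:8020/data/master_table
```

With transcoding, both entries must point at the same physical file; Kedro treats everything before the @ as one logical dataset, so the two nodes below are wired together through it.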
The base_path_spark is an HDFS location. These are then used in a pipeline in the following manner.
spark_to_pandas = pipeline(
    pipe=Pipeline(
        [
            node(
                func=spark_utils.to_pandas,
                …
                outputs=f"..._table@spark",
            )
        ]
    )
)
data_cleaning = pipeline(
    pipe=Pipeline(
        [
            node(
                func=enforce_schema_using_dict,
                inputs={
                    "data": f"..._table@pandas",
                },
                …
            )
        ]
    )
)
The data_cleaning node is supposed to pick up the output from the spark_to_pandas node via the transcoded dataset. However, a DataSetError is raised with the following message:
Exception has occurred: DataSetError
[Errno 2] No such file or directory: 'hadoop': 'hadoop'
Failed to instantiate Dataset 'telco_churn.master_table@pandas' of type 'kedro.extras.datasets.pandas.parquet_dataset.ParquetDataSet'.
If we remove the transcoding in the DataCatalog and register the datasets as individual entries, the error disappears.
Does anyone know how to proceed from this kind of error? Could it be related to the client-specific Hadoop environment? How can we go about troubleshooting this?