Danhua Yan
10/25/2022, 2:00 PM
How can I read delta datasets created by Databricks in pandas? The current configs look like below:
_pandas_parquet: &_pandas_parquet
  type: pandas.ParquetDataSet

_spark_parquet: &_delta_parquet
  type: spark.SparkDataSet
  file_format: delta
What I want to achieve:
node1:
  outputs: dataset@spark
node2:
  inputs: dataset@pandas
Unfortunately pandas doesn't support reading delta as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/
How should I create a dataset that can do something like this internally when being loaded?
from deltalake import DeltaTable

# Open the Delta table and materialise it as a pandas DataFrame
dt = DeltaTable("resources/delta/1")
df = dt.to_pandas()
I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction but nothing is mentioned about using pandas to interact with delta. Thank you!

Nok Lam Chan
10/25/2022, 2:15 PM
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas_delta:
  type: custom.PandasDeltaDataSet
  filepath: data/02_intermediate/data.parquet
In this case, if pandas doesn't support this, you will need to make a CustomDataSet and implement the _load method, which is quite similar to the 3 lines of code you posted.
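[Editor's note: a minimal sketch of what that CustomDataSet could look like, assuming Kedro's AbstractDataSet base class and the deltalake package from the snippet above. The class name PandasDeltaDataSet matches the custom.PandasDeltaDataSet catalog entry in the reply, but the module layout is otherwise hypothetical:]

from typing import Any, Dict

import pandas as pd
from deltalake import DeltaTable
from kedro.io import AbstractDataSet


class PandasDeltaDataSet(AbstractDataSet):
    """Reads a Delta table into a pandas DataFrame via deltalake."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        # The same three lines as the workaround posted above
        dt = DeltaTable(self._filepath)
        return dt.to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        # Read-only in this sketch; writing delta from pandas would
        # need deltalake's write support or Spark
        raise NotImplementedError("PandasDeltaDataSet is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath}

[With this class registered as the custom.PandasDeltaDataSet type, loading dataset@pandas in node2 would hand the node a pandas DataFrame while node1 keeps writing through Spark.]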
Danhua Yan
10/25/2022, 2:23 PM