#questions

Danhua Yan

10/25/2022, 2:00 PM
Hi team, a question that might be related to how to create a custom dataset with certain read/write behavior. Details below: I am trying to use kedro to read a `delta` dataset created by Databricks into pandas. The current configs look like this:
```yaml
_pandas_parquet: &_pandas_parquet
  type: pandas.ParquetDataSet

_spark_parquet: &_delta_parquet
  type: spark.SparkDataSet
  file_format: delta
```
What I want to achieve:
```yaml
node1:
  outputs: dataset@spark

node2:
  inputs: dataset@pandas
```
Unfortunately, `pandas` doesn’t support reading `delta` as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/ How should I create a dataset that does something like this internally when it is loaded?
```python
from deltalake import DeltaTable
dt = DeltaTable("resources/delta/1")
df = dt.to_pandas()
```
I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction but it doesn’t mention using pandas to interact with `delta`. Thank you!

Nok Lam Chan

10/25/2022, 2:15 PM
```yaml
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas_delta:
  type: custom.PandasDeltaDataSet
  filepath: data/02_intermediate/data.parquet
```
In this case, if `pandas` doesn’t support this, you will need to make a `CustomDataSet` and implement the `_load` method, which will be quite similar to the 3 lines of code you posted.
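A minimal sketch of what such a custom dataset might look like, assuming the `deltalake` package from the linked workaround is installed. The class name `PandasDeltaDataSet` follows the catalog entry above, and the read-only `_save` is an assumption made for illustration, not something the thread specifies. The base class is `AbstractDataSet` in Kedro 0.18.x and `AbstractDataset` in later releases, so the sketch looks up whichever exists.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    import pandas as pd

try:
    import kedro.io as _kedro_io

    # Kedro 0.19+ names the base class AbstractDataset; 0.18.x uses AbstractDataSet.
    _Base = getattr(_kedro_io, "AbstractDataset", None) or getattr(
        _kedro_io, "AbstractDataSet"
    )
except ImportError:
    # Fallback only so this sketch can be imported without Kedro installed;
    # a real project needs the Kedro base class for load()/save() to work.
    _Base = object


class PandasDeltaDataSet(_Base):
    """Loads a Delta table from disk as a pandas DataFrame (read-only sketch)."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        # Imported lazily so deltalake is only required when data is loaded.
        from deltalake import DeltaTable

        return DeltaTable(self._filepath).to_pandas()

    def _save(self, data: Any) -> None:
        # Assumption: writing stays with the Spark side of the transcoded pair.
        raise NotImplementedError(
            "PandasDeltaDataSet is read-only; write via the Spark dataset."
        )

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}
```

The `type:` key in the catalog would then need to be the full import path of wherever this class lives in the project, so `custom.PandasDeltaDataSet` only resolves if a matching module exists.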

Danhua Yan

10/25/2022, 2:23 PM
Thank you @Nok Lam Chan! Will give it a try.