Danhua Yan
10/25/2022, 2:00 PM
How can I read delta datasets created by Databricks in pandas? The current configs look like below:
_pandas_parquet: &_pandas_parquet
  type: pandas.ParquetDataSet

_spark_parquet: &_delta_parquet
  type: spark.SparkDataSet
  file_format: delta
What I want to achieve:
node1:
  outputs: dataset@spark
node2:
  inputs: dataset@pandas
Unfortunately pandas doesn't support reading delta as is. I found the workaround below, which requires additional steps: https://mungingdata.com/pandas/read-delta-lake-dataframe/
How should I create a dataset that can do something like this internally when being loaded?
from deltalake import DeltaTable

# Open the Delta table and materialise it as a pandas DataFrame
dt = DeltaTable("resources/delta/1")
df = dt.to_pandas()
I tried looking into https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction but nothing is mentioned about using pandas to interact with delta. Thank you!

Nok Lam Chan
10/25/2022, 2:15 PM
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas_delta:
  type: custom.PandasDeltaDataSet
  filepath: data/02_intermediate/data.parquet
In this case, if pandas doesn't support this, you will need to make a CustomDataSet and implement the _load method, which is quite similar to the 3 lines of code you posted.
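[Editor's note: a minimal sketch of what that CustomDataSet could look like, assuming Kedro's AbstractDataSet base class and the deltalake package from the snippet above. The class name PandasDeltaDataSet matches the custom.PandasDeltaDataSet catalog entry in the reply, but the module layout is otherwise hypothetical:]

from typing import Any, Dict

import pandas as pd
from deltalake import DeltaTable
from kedro.io import AbstractDataSet


class PandasDeltaDataSet(AbstractDataSet):
    """Reads a Delta table into a pandas DataFrame via deltalake."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        # The same three lines as the workaround posted above
        dt = DeltaTable(self._filepath)
        return dt.to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        # Read-only in this sketch; writing delta from pandas would
        # need deltalake's write support or Spark
        raise NotImplementedError("PandasDeltaDataSet is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath}

[With this class registered as the custom.PandasDeltaDataSet type, loading dataset@pandas in node2 would hand the node a pandas DataFrame while node1 keeps writing through Spark.]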
Danhua Yan
10/25/2022, 2:23 PM