Filip Wójcik
02/14/2023, 9:40 AM
pandas.CSVDataSet with save_args: mode: "a" and PartitionedDataSet, but every time the dataset gets overwritten.
I cannot find any such case in the docs. Should I create my own implementation, deriving from AbstractDataSet?
I've heard from many fellow DS Kedro users that a similar use case comes up from time to time, so I'm probably not alone.
Thanks in advance, and best regards. Kedro is an absolute blast!

marrrcin
02/14/2023, 9:51 AM

Filip Wójcik
02/14/2023, 9:56 AM

marrrcin
02/14/2023, 10:01 AMwith self._fs.open(save_path, mode="wb") as fs_file:
fs_file.write(buf.getvalue())
(Most likely, because GCS/S3/Azure do not support append operations)
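The file-mode distinction at play here can be sketched in plain Python (a toy illustration, no Kedro involved): opening in "w"/"wb" truncates the file on every save, which is why re-running the pipeline overwrites the data, while "a" appends — and append is exactly the operation object stores generally lack.

```python
import csv
import os
import tempfile

# Toy illustration (not Kedro code): mode "w" truncates on every save,
# so only the last run survives; mode "a" appends rows instead.
path = os.path.join(tempfile.mkdtemp(), "records.csv")

for run in range(3):
    with open(path, "w", newline="") as f:  # overwrite, like mode="wb"
        csv.writer(f).writerow([run])

with open(path) as f:
    overwritten = f.read().splitlines()  # only the last run remains

for run in range(3):
    with open(path, "a", newline="") as f:  # append, mode="a"
        csv.writer(f).writerow([run])

with open(path) as f:
    appended = f.read().splitlines()

print(overwritten)  # ['2']
print(appended)     # ['2', '0', '1', '2']
```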
You can use a combination of PartitionedDataSet with CSVDataSet like this:
import datetime as dt

import pandas as pd
from kedro.pipeline import node

node(
    func=lambda: {
        dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"): pd.DataFrame(
            {"data": [1, 2, 3]}
        )
    },
    inputs=None,
    outputs="records",
    name="scrape",
)
In the catalog:
records:
  type: PartitionedDataSet
  path: data/records
  filename_suffix: .csv
  dataset:
    type: pandas.CSVDataSet
You work with PartitionedDataSet by returning a dictionary of partitions from your node. Keys are partition names, values are the data to be saved. In the example above, every time you run the pipeline it will return a "new" partition, making the PartitionedDataSet save it in a separate file.

Filip Wójcik
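The net effect is append-like behavior at the folder level: each run adds a file, and loading the folder yields all runs. A minimal stdlib sketch of that idea (no Kedro; note that the real PartitionedDataSet.load returns a dictionary of partition id to *load function*, not the data itself):

```python
import csv
import os
import tempfile

# Stdlib sketch (no Kedro) of the append-like behaviour PartitionedDataSet
# gives you: each "run" saves a new file keyed by partition name, and
# reading the folder back yields all runs.
records_dir = tempfile.mkdtemp()

def save_partitions(partitions: dict) -> None:
    # Mimics PartitionedDataSet.save: one file per dictionary key.
    for partition_id, rows in partitions.items():
        out = os.path.join(records_dir, partition_id + ".csv")
        with open(out, "w", newline="") as f:
            csv.writer(f).writerows(rows)

def load_partitions() -> dict:
    # Mimics PartitionedDataSet.load, except it returns the data directly
    # (the real dataset returns callables that load lazily).
    result = {}
    for name in sorted(os.listdir(records_dir)):
        with open(os.path.join(records_dir, name), newline="") as f:
            result[name.removesuffix(".csv")] = list(csv.reader(f))
    return result

# Three "pipeline runs", each returning a fresh partition key:
for run in range(3):
    save_partitions({f"run-{run}": [[str(run)]]})

all_runs = load_partitions()
print(sorted(all_runs))  # ['run-0', 'run-1', 'run-2']
```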
02/14/2023, 10:02 AM

marrrcin
02/14/2023, 10:05 AM
import pandas as pd
from kedro.pipeline import node

node(
    func=lambda: pd.DataFrame({"data": [1, 2, 3]}),
    inputs=None,
    outputs="records",
    name="scrape",
)
catalog:
records:
  type: pandas.CSVDataSet
  filepath: data/records/records.csv
  versioned: true
Which will create a folder under data/records/records.csv like this:

Filip Wójcik
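A stdlib sketch of the versioned layout being described (not Kedro internals, and the timestamp strings below are illustrative, not Kedro's exact version format): the catalog filepath becomes a directory, and every save lands in a fresh timestamp-named subfolder containing the actual file.

```python
import os
import tempfile

# Sketch (not Kedro internals) of a versioned dataset layout:
# `filepath` from the catalog becomes a directory, and each save goes
# into a new timestamp-named subfolder holding the real file.
base = os.path.join(tempfile.mkdtemp(), "records.csv")  # a directory, despite the name

def versioned_save(text: str, version: str) -> str:
    # `version` is a hypothetical timestamp string for illustration.
    save_dir = os.path.join(base, version)
    os.makedirs(save_dir)
    path = os.path.join(save_dir, "records.csv")
    with open(path, "w") as f:
        f.write(text)
    return path

versioned_save("data,1\n", "2023-02-14T10.05.00.000Z")
versioned_save("data,2\n", "2023-02-14T11.00.00.000Z")

versions = sorted(os.listdir(base))
print(versions)  # ['2023-02-14T10.05.00.000Z', '2023-02-14T11.00.00.000Z']
```

Loading the dataset without pinning a version gives the latest save, so each run adds a new version folder instead of overwriting the previous one.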
02/14/2023, 2:07 PM