Filip Wójcik
02/14/2023, 9:40 AM
I'm trying to use pandas.CSVDataSet with save_args: mode: "a" and PartitionedDataSet, but every time the dataset gets overwritten. I cannot find any such case in the docs. Should I create my own implementation, deriving from AbstractDataSet? I've heard from many fellow DS-Kedro users that a similar use case happens from time to time, so I'm probably not alone.
Thanks in advance, and best regards. Kedro is an absolute blast!

marrrcin
02/14/2023, 9:51 AM

Filip Wójcik
02/14/2023, 9:56 AM

marrrcin
02/14/2023, 10:01 AM
Internally, the dataset always opens the file in "wb" mode:

with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(buf.getvalue())

(most likely because GCS/S3/Azure do not support append operations).
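On a local filesystem, appending does work; a minimal pure-pandas sketch of what save_args: mode: "a" achieves when the target supports it (append_csv is a hypothetical helper, not part of Kedro):

```python
import os

import pandas as pd


def append_csv(df: pd.DataFrame, path: str) -> None:
    # Write the header only on the first save, so repeated saves
    # accumulate rows in one file instead of overwriting it.
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)
```

On object stores, fsspec opens the target for a full rewrite, which is why the append mode silently behaves like overwrite there.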
You can use a combination of PartitionedDataSet with CSVDataSet like this:
import datetime as dt

import pandas as pd
from kedro.pipeline import node

node(
    # return a dict: partition name (UTC timestamp) -> data to save
    func=lambda: {
        dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"): pd.DataFrame(
            {"data": [1, 2, 3]}
        )
    },
    inputs=None,
    outputs="records",
    name="scrape",
)
In the catalog:
records:
type: PartitionedDataSet
path: data/records
filename_suffix: .csv
dataset:
type: pandas.CSVDataSet
You work with PartitionedDataSet by returning a dictionary of partitions from your node. Keys are partition names, values are the data to be saved. In the example above, every time you run the pipeline it will return a "new" partition, making the PartitionedDataSet save it in a separate file.

Filip Wójcik
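A rough sketch of what that save amounts to, assuming the default behaviour (each dict key becomes a file name plus the configured suffix, each value is written by the underlying dataset; save_partitions is a hypothetical stand-in, not the real Kedro class):

```python
import os

import pandas as pd


def save_partitions(
    partitions: dict[str, pd.DataFrame], path: str, suffix: str = ".csv"
) -> None:
    # One file per partition: existing partitions with other names
    # are left untouched, which is what gives the append-like effect.
    os.makedirs(path, exist_ok=True)
    for name, df in partitions.items():
        df.to_csv(os.path.join(path, name + suffix), index=False)
```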
02/14/2023, 10:02 AM

marrrcin
02/14/2023, 10:05 AM
Alternatively, you can use a versioned dataset:

import pandas as pd
from kedro.pipeline import node

node(
    func=lambda: pd.DataFrame({"data": [1, 2, 3]}),
    inputs=None,
    outputs="records",
    name="scrape",
)
In the catalog:
records:
type: pandas.CSVDataSet
filepath: data/records/records.csv
versioned: true
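With versioned: true, each save lands in its own timestamped subfolder under a directory named after the file. A minimal sketch of that convention (save_versioned is hypothetical, not Kedro's implementation, and the timestamp format here is an assumption):

```python
import datetime as dt
import os

import pandas as pd


def save_versioned(df: pd.DataFrame, filepath: str) -> str:
    # One subfolder per save, named with the save timestamp,
    # e.g. data/records/records.csv/<timestamp>/records.csv
    version = dt.datetime.utcnow().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    version_dir = os.path.join(filepath, version)
    os.makedirs(version_dir, exist_ok=True)
    full_path = os.path.join(version_dir, os.path.basename(filepath))
    df.to_csv(full_path, index=False)
    return full_path
```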
This will create a folder at data/records/records.csv with a new timestamped subfolder on every run.

Filip Wójcik
02/14/2023, 2:07 PM