#questions

Filip Wójcik

02/14/2023, 9:40 AM
Hello all! I'm wrapping my head around the following problem/use case, so far with no luck. Imagine a data pipeline where you run, e.g., a web scraper every day, and it saves a small amount of data (a couple of hundred records, so not a big-data case) each run. Can a dataset be configured so that new data is appended to it? I tried `pandas.CSVDataSet` with `save_args: mode: "a"`, and also `PartitionedDataSet`, but each time the dataset is overwritten. I cannot find such a case in the docs. Should I create my own implementation, deriving from `AbstractDataSet`? I've heard from many fellow DS/Kedro users that a similar use case comes up from time to time, so I'm probably not alone. Thanks in advance, and best regards; Kedro is an absolute blast!

marrrcin

02/14/2023, 9:51 AM
Do you want to save those records in a separate file every day?

Filip Wójcik

02/14/2023, 9:56 AM
Hi, that is one of the options, possibly achievable via "versioning" (although versioning is probably intended for other purposes). However, other options would work too: appending, or saving to a separate file with a unique name 🙂 Whichever is doable with fewer hacks 😄

marrrcin

02/14/2023, 10:01 AM
Append mode will not work, because Kedro uses fsspec under the hood, which always opens the save path in overwrite mode:

```python
with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(buf.getvalue())
```
(Most likely because GCS/S3/Azure do not support append operations.) You can use a combination of `PartitionedDataSet` with `pandas.CSVDataSet` like this:
```python
node(
    func=lambda: {
        dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"): pd.DataFrame(
            {"data": [1, 2, 3]}
        )
    },
    inputs=None,
    outputs="records",
    name="scrape",
)
```
In the catalog:

```yaml
records:
  type: PartitionedDataSet
  path: data/records
  filename_suffix: .csv
  dataset:
    type: pandas.CSVDataSet
```
You work with `PartitionedDataSet` by returning a dictionary of partitions from your node: keys are partition names, values are the data to be saved. In the example above, every run of the pipeline returns a "new" partition, making the `PartitionedDataSet` save it in a separate file.
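To make that contract concrete, here is a minimal sketch in plain Python (the node functions `scrape` and `combine_partitions` are hypothetical names, and plain lists of dicts stand in for the DataFrames you would use in a real pipeline):

```python
from datetime import datetime, timezone


def scrape():
    """Producer node: return {partition_name: data}.
    PartitionedDataSet writes each key as its own file
    (here, data/records/<key>.csv)."""
    key = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return {key: [{"data": 1}, {"data": 2}, {"data": 3}]}


def combine_partitions(partitions):
    """Consumer node: PartitionedDataSet loads as a dict of
    {partition_id: load_function}; call each function to get the data."""
    rows = []
    for partition_id in sorted(partitions):
        rows.extend(partitions[partition_id]())  # lazy load per partition
    return rows
```

Note that on the load side Kedro passes callables rather than the data itself, so a downstream node can load large partition sets lazily, one file at a time.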

Filip Wójcik

02/14/2023, 10:02 AM
That makes perfect sense, thank you! I think this approach resolves the problem.

marrrcin

02/14/2023, 10:05 AM
If you want "less" control, you could also use:

```python
node(
    func=lambda: pd.DataFrame({"data": [1, 2, 3]}),
    inputs=None,
    outputs="records",
    name="scrape",
)
```
In the catalog:

```yaml
records:
  type: pandas.CSVDataSet
  filepath: data/records/records.csv
  versioned: true
```
This will create a folder at `data/records/records.csv`, with each pipeline run saved in its own timestamped subdirectory.
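As a rough illustration of that layout (a sketch only; the exact timestamp format is an assumption, since Kedro generates its own version strings), each save lands in a timestamped subdirectory that keeps the original filename:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def versioned_path(filepath, version=None):
    """Sketch of how a versioned dataset nests its saves:
    data/records/records.csv/<version>/records.csv"""
    if version is None:
        # assumed format; Kedro derives the version from the UTC save time
        version = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    p = PurePosixPath(filepath)
    return p / version / p.name
```

So loading the "latest" version simply means picking the newest timestamped subdirectory, which Kedro does by default.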

Filip Wójcik

02/14/2023, 2:07 PM
Thanks a lot. Indeed, I had spotted that versioning can produce this result, although from what I understood after reading the documentation, versioning is intended for a different purpose: keeping track of file versions and loading the latest one by default. Anyway, it still works 🙂 Thanks!