#questions

Filip Wójcik

02/14/2023, 9:40 AM
Hello all! I'm wrapping my head around the following problem/use case, so far with no luck. Imagine a data pipeline where you run, e.g., a web scraper every day, and it saves a small amount of data (a couple of hundred records, so not a big-data case) each run. Can a dataset be configured so that new data is appended to it? I tried `pandas.CSVDataSet` with `save_args: mode: "a"`, and also `PartitionedDataSet`, but each time the dataset is overwritten. I cannot find such a case in the docs. Should I create my own implementation, deriving from `AbstractDataSet`? I've heard from many fellow DS/Kedro users that a similar use case comes up from time to time, so I'm probably not alone. Thanks in advance, and best regards; Kedro is an absolute blast!

marrrcin

02/14/2023, 9:51 AM
Do you want to save those records in a separate file every day?

Filip Wójcik

02/14/2023, 9:56 AM
Hi, that is one of the options, possibly achievable via "versioning" (although versioning is probably intended for other purposes). However, other options would work too: appending, or saving to a separate file with a unique name 🙂 Whichever is doable with fewer hacks 😄

marrrcin

02/14/2023, 10:01 AM
Append mode will not work, because Kedro uses fsspec under the hood, which always opens the save path in overwrite mode:

```python
with self._fs.open(save_path, mode="wb") as fs_file:
    fs_file.write(buf.getvalue())
```
(Most likely because GCS/S3/Azure do not support append operations.) You can use a combination of `PartitionedDataSet` with `pandas.CSVDataSet` like this:
```python
node(
    func=lambda: {
        dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"): pd.DataFrame(
            {"data": [1, 2, 3]}
        )
    },
    inputs=None,
    outputs="records",
    name="scrape",
)
```
In the catalog:

```yaml
records:
  type: PartitionedDataSet
  path: data/records
  filename_suffix: .csv
  dataset:
    type: pandas.CSVDataSet
```
You work with `PartitionedDataSet` by returning a dictionary of partitions from your node: keys are partition names, values are the data to be saved. In the example above, every run of the pipeline returns a "new" partition, making the `PartitionedDataSet` save it in a separate file.
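To make that contract concrete, here is a minimal sketch in plain Python (the node functions `scrape` and `combine_partitions` are hypothetical names, and plain lists of dicts stand in for the DataFrames you would use in a real pipeline):

```python
from datetime import datetime, timezone


def scrape():
    """Producer node: return {partition_name: data}.
    PartitionedDataSet writes each key as its own file
    (here, data/records/<key>.csv)."""
    key = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return {key: [{"data": 1}, {"data": 2}, {"data": 3}]}


def combine_partitions(partitions):
    """Consumer node: PartitionedDataSet loads as a dict of
    {partition_id: load_function}; call each function to get the data."""
    rows = []
    for partition_id in sorted(partitions):
        rows.extend(partitions[partition_id]())  # lazy load per partition
    return rows
```

Note that on the load side Kedro passes callables rather than the data itself, so a downstream node can load large partition sets lazily, one file at a time.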

Filip Wójcik

02/14/2023, 10:02 AM
That makes perfect sense, thank you! I think this approach resolves the problem.

marrrcin

02/14/2023, 10:05 AM
If you want "less" control, you could also use:

```python
node(
    func=lambda: pd.DataFrame({"data": [1, 2, 3]}),
    inputs=None,
    outputs="records",
    name="scrape",
)
```
In the catalog:

```yaml
records:
  type: pandas.CSVDataSet
  filepath: data/records/records.csv
  versioned: true
```
This will create a folder at `data/records/records.csv`, with each pipeline run saved in its own timestamped subdirectory.
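As a rough illustration of that layout (a sketch only; the exact timestamp format is an assumption, since Kedro generates its own version strings), each save lands in a timestamped subdirectory that keeps the original filename:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def versioned_path(filepath, version=None):
    """Sketch of how a versioned dataset nests its saves:
    data/records/records.csv/<version>/records.csv"""
    if version is None:
        # assumed format; Kedro derives the version from the UTC save time
        version = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    p = PurePosixPath(filepath)
    return p / version / p.name
```

So loading the "latest" version simply means picking the newest timestamped subdirectory, which Kedro does by default.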

Filip Wójcik

02/14/2023, 2:07 PM
Thanks a lot. Indeed, I had spotted that versioning can produce this result, although from what I understood after reading the documentation, versioning is intended for a different purpose: keeping track of file versions and loading the latest one by default. Anyway, it still works 🙂 Thanks!