# questions
j
hi folks, I created a custom dataset to see if I could understand the documentation and how it works, but I feel I'm doing some unconventional things and I'd need some advice: I created a `KaggleDataSet` with a `_load` method that basically returns a `KaggleBundle` containing the list of files I will download, and a `_save` method that performs the download. if I load and save it from Python using the `catalog`, it works beautifully:
Copy code
In [2]: catalog.list()
Out[2]: ['spaceship_titanic_kaggle', 'head_titanic_train', 'parameters']

In [3]: bundle = catalog.load("spaceship_titanic_kaggle")
[03/30/23 10:21:01] INFO     Loading data from 'spaceship_titanic_kaggle' (KaggleDataSet)...                                                      data_catalog.py:343
[03/30/23 10:21:02] WARNING  /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/rest.py:62: DeprecationWarning:      warnings.py:109
                             HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access HTTPResponse.headers                      
                             directly.                                                                                                                               
                               return self.urllib3_response.getheaders()                                                                                             
                                                                                                                                                                     

In [4]: bundle
Out[4]: KaggleBundle(dataset_or_competition='spaceship-titanic', members=['sample_submission.csv', 'train.csv', 'test.csv'], is_competition=True, single_file=False)

In [5]: catalog.save("spaceship_titanic_kaggle", data=bundle)
[03/30/23 10:21:51] INFO     Saving data to 'spaceship_titanic_kaggle' (KaggleDataSet)...                                                         data_catalog.py:382
[03/30/23 10:21:52] WARNING  /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/api_client.py:181:                   warnings.py:109
                             DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access                       
                             HTTPResponse.headers directly.                                                                                                          
                               response_data.getheaders())                                                                                                           
                                                                                                                                                                     
Downloading spaceship-titanic.zip to data/01_raw
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 988kB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 986kB/s]

In [6]: !ls data/01_raw
sample_submission.csv  test.csv               train.csv
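(For reference, a minimal sketch of the `_load`/`_save` split described above; the helper methods are hypothetical, not the actual repo code:)
```python
# Sketch only: _load builds the bundle metadata, _save performs the actual download.
def _load(self) -> KaggleBundle:
    # List the remote files without downloading them yet
    members = self._list_remote_files()  # hypothetical helper
    return KaggleBundle(self._dataset, members, self._is_competition, single_file=False)

def _save(self, data: KaggleBundle) -> None:
    # Download every member of the bundle into the target directory
    for member in data.members:
        self._download_member(member)  # hypothetical helper
```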
but then, when I want to use it in a Kedro pipeline, because a node cannot have the same dataset as input and output, the only way I could make it work is by duplicating a `KaggleDataSet` in the catalog and using a dummy node that just passes the bundle without doing anything:
Copy code
# catalog.yaml
spaceship_titanic_kaggle_orig:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

spaceship_titanic_kaggle:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

# nodes.py
from kedro_kaggle_dataset.kaggle_dataset import KaggleBundle

def download_titanic(data: KaggleBundle) -> KaggleBundle:
    return data


# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=download_titanic,
            inputs="spaceship_titanic_kaggle_orig",
            outputs="spaceship_titanic_kaggle",
            name="download_titanic_node",
        ),
        ...
this is the code of the dataset in case you want to try it https://github.com/astrojuanlu/kedro-kaggle-dataset
is there a better way of doing this?
d
so I think the trick here is to be explicit with `load_args` and `save_args`, but let me DM you and think through this properly
in truth this is a higher level abstraction than we normally do in the standard bundle
so not all of the concepts map neatly
a
Yeah, this is a very interesting idea but I think it might be tricky to do within the Kedro catalog paradigm as it stands. I think the simplest way would be to have a dataset a bit like `APIDataSet` that just exposes `_load` (no `_save`). This would take as some sort of argument (maybe in `load_args`, maybe top level) the specification of which files you want from Kaggle. `_load` then opens these up in memory (or saves them somewhere temporary if you have to specify a filepath according to the Kaggle API) as pandas dataframes. Then it's the responsibility of the node that consumes those datasets to manipulate them as required and output to a new, persisted dataset with its own catalog entry (e.g. `pandas.CSVDataSet`).
💡 1
Curious if @Nok Lam Chan has any ideas here
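(A rough, untested sketch of what such a load-only dataset could look like; the class name, the `files` load arg and the exact Kaggle API calls here are assumptions, not an actual implementation:)
```python
# Rough, untested sketch of a load-only dataset in the spirit of APIDataSet.
# The class name, the "files" load arg and the Kaggle API calls are illustrative only.
from pathlib import Path
from typing import Any, Dict
from zipfile import ZipFile

import pandas as pd
from kedro.io import AbstractDataSet, DataSetError


class KaggleCompetitionDataSet(AbstractDataSet):
    def __init__(self, dataset: str, directory: str, load_args: Dict[str, Any] = None):
        self._dataset = dataset
        self._directory = Path(directory)
        self._load_args = load_args or {}

    def _load(self) -> Dict[str, pd.DataFrame]:
        # Lazy import so instantiating the catalog doesn't require Kaggle credentials
        from kaggle.api.kaggle_api_extended import KaggleApi

        api = KaggleApi()
        api.authenticate()
        # Competition files arrive as a single <competition>.zip archive
        api.competition_download_files(self._dataset, path=str(self._directory))
        archive = self._directory / f"{self._dataset}.zip"
        # Open only the requested members, in memory, as pandas dataframes
        with ZipFile(archive) as zf:
            return {
                name: pd.read_csv(zf.open(name))
                for name in self._load_args.get("files", [])
            }

    def _save(self, data: Any) -> None:
        raise DataSetError("This dataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"dataset": self._dataset, "directory": str(self._directory)}
```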
d
I’ll forward my invite to you two
✔️ 1
n
I’ll need some time to think about it more, the APIDataSet approach seems fine to me. A bit off topic, but I think we need some kind of `PyTorchDataLoaderDataSet` or something similar. The `_save`/`_load` interface works quite well for pandas, but sometimes it's awkward because these DL libraries take a path as argument.
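(One way to picture that friction: a dataset whose `_load` simply hands back a path for the downstream node to feed to the DL library. Purely a sketch; all names below are made up:)
```python
# Illustrative sketch: _load returns a directory path instead of in-memory data,
# so a downstream node can pass it to a library that wants paths (e.g. torchvision).
from pathlib import Path
from typing import Any, Dict

from kedro.io import AbstractDataSet, DataSetError


class PathDataSet(AbstractDataSet):
    def __init__(self, path: str):
        self._path = Path(path)

    def _load(self) -> Path:
        # The consuming node decides how to read the files under this path
        return self._path

    def _save(self, data: Any) -> None:
        raise DataSetError("PathDataSet is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"path": str(self._path)}
```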
d
yeah seen lots of people talk about that
I’ve considered a special type of node that processes each partition 1:1
cos image processing today is a bit weird in one node
j
or maybe a dataset that needs no inputs, because what I'm doing here is essentially downloading a thing from the internet and putting it in `01_raw`
but the "dummy node" problem would still be there
(I also think the "custom datasets" page needs better docs but I'll tackle that when I fully understand them 😅 )
n
I can’t run the code for some reason - keep getting 403 error, running on CLI is fine tho
j
authentication problems? I was using environment variables for that
n
Downloading seems to be a `_load` rather than a `_save` to me - `_save` is something that will be called on a function's output
j
yeah I was struggling with that. the problem in that particular case is that the Kaggle download could be anything, so I don't know up front how I should handle it
👍🏼 1
n
I don't have a much better idea than the one Antony has; it does feel repetitive though to open up these files in a node and convert them to dataframes
j
that works as long as you assume that everything is a dataframe 🤔 but maybe we could return a dictionary of `io.BytesIO` objects and then downstream nodes decide what to do
by the way this approach worked 😄 and now I do understand what `_load` and `_save` actually do 🔥
K 1
Copy code
def _load(self) -> KaggleBundle:
    ...
    members = {}
    for member_filename in members_list:
        with open(self._directory / member_filename, "rb") as fh:
            members[member_filename] = BytesIO(fh.read())

    return KaggleBundle(..., members)

def _save(self, data: KaggleBundle) -> None:
    raise NotImplementedError("Cannot save back to Kaggle")
and now I only need 1 `KaggleDataSet` in the catalog, 1 destination `pandas.CSVDataSet`, and a function like this:
Copy code
def head_titanic(bundle: KaggleBundle) -> pd.DataFrame:
    df_test = pd.read_csv(bundle["test.csv"])
    return ...
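(For reference, the end-to-end wiring could then look roughly like this; the `head_titanic_train` filepath and the node name are illustrative:)
```
# catalog.yaml
spaceship_titanic_kaggle:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

head_titanic_train:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/head_titanic_train.csv

# pipeline.py
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=head_titanic,
            inputs="spaceship_titanic_kaggle",
            outputs="head_titanic_train",
            name="head_titanic_node",
        ),
    ])
```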
happy to keep the meeting on Monday if you want folks, but I'm quite happy with this
notes for Monday: what to do if I don't want to download this data every time I run a pipeline
d
I made an expiring cached dataset a long time ago
I should contribute it
it essentially pickled something with the date it was last downloaded and then only triggered a new request after some threshold passed
💡 1
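(A minimal sketch of that idea, assuming the payload is pickled together with the time it was fetched; the class name and arguments are made up, this is not the actual Untitled.py:)
```python
# Minimal sketch of an expiring cached HTTP dataset: pickle the payload together with
# the time it was fetched, and only hit the URL again once a threshold has passed.
# Class name, arguments and file layout are assumptions, not the original code.
import pickle
import time
from pathlib import Path
from typing import Any, Dict

import requests
from kedro.io import AbstractDataSet


class ExpiringHTTPDataSet(AbstractDataSet):
    def __init__(self, url: str, cache_path: str, max_age_seconds: int = 24 * 3600):
        self._url = url
        self._cache_path = Path(cache_path)
        self._max_age = max_age_seconds

    def _load(self) -> bytes:
        if self._cache_path.exists():
            fetched_at, payload = pickle.loads(self._cache_path.read_bytes())
            if time.time() - fetched_at < self._max_age:
                return payload  # cache is still fresh, skip the request
        payload = requests.get(self._url, timeout=30).content
        self._cache_path.write_bytes(pickle.dumps((time.time(), payload)))
        return payload

    def _save(self, data: Any) -> None:
        raise NotImplementedError("Load-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url, "cache_path": str(self._cache_path)}
```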
j
waiting folks!
@Nok Lam Chan @datajoely
d
sorry
n
Sorry, just finished an ad hoc meeting
d
Untitled.py
This is my super old ExpiringHTTPDataSet
Untitled.yaml
👀 1
and it could be used like this
j
Copy code
birdclef_kaggle:
  type: BagOfFilesDataSet  # .zip/heterogeneous/generic/whatever
  path: <kaggle://birdclef-2023/>
  credentials: kaggle_credentials
  patterns: "**/*.ogg"

# birdclef_kaggle ----> Bundle ----> audio_files

audio_files:
  type: PartitionedDataSet
  path: data/02_intermediate/audio
  dataset: audio.OGGDataSet
👍🏼 1
n
Totally unrelated to the KaggleDataSet - I like the `patterns` attribute, it would be a nice addition to `PartitionedDataSet`. Currently it's possible to customize the path, but it's tricky to define the structure of your partitions. Thank you for the meeting, I don't have any answers, only more questions now 😂
☝🏼 1
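(For context, this is roughly how a `PartitionedDataSet` is declared today - you can set the path, the underlying dataset type and a `filename_suffix`, but nothing like the `patterns` glob above; the entry names reuse the earlier example:)
```yaml
audio_files:
  type: PartitionedDataSet
  path: data/02_intermediate/audio
  dataset: audio.OGGDataSet   # hypothetical dataset type from the example above
  filename_suffix: ".ogg"
```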
j
the best kind of meeting! 😂
the last thing we discussed @datajoely (not sure if you were still there) is that datasets seem to involve formats, whereas protocols feel more appropriate for transports or locations
👍 1
n
Also, I like the wrapper DataSet concept - which I haven't seen many people doing? Essentially it's a MixIn approach for datasets. (Not sure if the format is the best though, since it's weird to have more than 2 MixIns as your catalog starts to look arbitrarily nested.) I used to have a wrapper `CacheArtifactDataSet` which does all the metadata, upload/download and cache MD5 checking, i.e. if the file doesn't exist it will download it from remote storage and store it in your local `path`
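(An illustrative sketch of that wrapper pattern - a dataset that syncs a remote file to a local path before delegating to an inner dataset. Names, arguments and the size-based staleness check are assumptions; the original used MD5 checking:)
```python
# Sketch of a wrapper dataset: sync the file from remote storage to a local path
# when it's missing or stale, then delegate load/save to the wrapped dataset.
# Everything here is illustrative, not the original CacheArtifactDataSet.
from pathlib import Path
from typing import Any, Dict

import fsspec
from kedro.io import AbstractDataSet


class CachedRemoteDataSet(AbstractDataSet):
    def __init__(self, remote_path: str, local_path: str, dataset: Dict[str, Any]):
        self._remote_path = remote_path
        self._local_path = Path(local_path)
        # The wrapped dataset (e.g. pandas.CSVDataSet) reads from the local copy
        self._wrapped = AbstractDataSet.from_config("wrapped", dataset)

    def _sync(self) -> None:
        fs, _, _ = fsspec.get_fs_token_paths(self._remote_path)
        if self._local_path.exists():
            # Cheap staleness check; the original idea compared MD5 checksums instead
            if self._local_path.stat().st_size == fs.size(self._remote_path):
                return
        self._local_path.parent.mkdir(parents=True, exist_ok=True)
        fs.get(self._remote_path, str(self._local_path))

    def _load(self) -> Any:
        self._sync()
        return self._wrapped.load()

    def _save(self, data: Any) -> None:
        self._wrapped.save(data)
        fs, _, _ = fsspec.get_fs_token_paths(self._remote_path)
        fs.put(str(self._local_path), self._remote_path)

    def _describe(self) -> Dict[str, Any]:
        return {"remote_path": self._remote_path, "local_path": str(self._local_path)}
```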
j
FWIW I've been trying to implement a Kaggle filesystem - still unsure how it will allow me to download the whole `.zip` bundle easily (looks like downloading individual files will be easy), but at least `ls`, `info` and `size` seem to work nicely https://github.com/astrojuanlu/kedro-kaggle-dataset/pull/3
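(For anyone curious, a custom fsspec filesystem boils down to subclassing `AbstractFileSystem` and implementing `ls`, `info` and `_open`; this skeleton is illustrative only, not the code in that PR:)
```python
# Rough skeleton of a custom fsspec filesystem; illustrative only.
from typing import Any, Dict, List

import fsspec


class KaggleFileSystem(fsspec.AbstractFileSystem):
    protocol = "kaggle"

    def __init__(self, competition: str, **kwargs: Any):
        super().__init__(**kwargs)
        self._competition = competition

    def ls(self, path: str, detail: bool = True, **kwargs: Any) -> List:
        # List the files attached to the competition/dataset via the Kaggle API,
        # returning one dict per entry ("name", "size", "type") when detail=True.
        entries: List[Dict[str, Any]] = []
        ...  # call the Kaggle API here
        return entries if detail else [e["name"] for e in entries]

    def info(self, path: str, **kwargs: Any) -> Dict[str, Any]:
        # Single-entry metadata; fsspec derives size() from info()["size"]
        return {"name": path, "size": 0, "type": "file"}

    def _open(self, path: str, mode: str = "rb", **kwargs: Any):
        # Return a readable file-like object for an individual member file
        raise NotImplementedError


# Registering it makes kaggle:// URLs resolvable by fsspec-aware datasets
fsspec.register_implementation("kaggle", KaggleFileSystem, clobber=True)
```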