# questions
j
hi folks, I created a custom dataset to see if I could understand the documentation and how it works, but I feel I'm doing some unconventional things and I'd need some advice: I created a `KaggleDataSet` with a `_load` method that basically returns a `KaggleBundle` containing the list of files I will download, and a `_save` method that performs the download. if I load and save it from Python using the `catalog`, it works beautifully:
Copy code
In [2]: catalog.list()
Out[2]: ['spaceship_titanic_kaggle', 'head_titanic_train', 'parameters']

In [3]: bundle = catalog.load("spaceship_titanic_kaggle")
[03/30/23 10:21:01] INFO     Loading data from 'spaceship_titanic_kaggle' (KaggleDataSet)...                                                      data_catalog.py:343
[03/30/23 10:21:02] WARNING  /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/rest.py:62: DeprecationWarning:      warnings.py:109
                             HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access HTTPResponse.headers                      
                             directly.                                                                                                                               
                               return self.urllib3_response.getheaders()                                                                                             
                                                                                                                                                                     

In [4]: bundle
Out[4]: KaggleBundle(dataset_or_competition='spaceship-titanic', members=['sample_submission.csv', 'train.csv', 'test.csv'], is_competition=True, single_file=False)

In [5]: catalog.save("spaceship_titanic_kaggle", data=bundle)
[03/30/23 10:21:51] INFO     Saving data to 'spaceship_titanic_kaggle' (KaggleDataSet)...                                                         data_catalog.py:382
[03/30/23 10:21:52] WARNING  /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/api_client.py:181:                   warnings.py:109
                             DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access                       
                             HTTPResponse.headers directly.                                                                                                          
                               response_data.getheaders())                                                                                                           
                                                                                                                                                                     
Downloading spaceship-titanic.zip to data/01_raw
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 988kB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 986kB/s]

In [6]: !ls data/01_raw
sample_submission.csv  test.csv               train.csv
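(For reference, a minimal sketch of the `_load`/`_save` split described above; the helper methods are hypothetical, not the actual repo code:)
```python
# Sketch only: _load builds the bundle metadata, _save performs the actual download.
def _load(self) -> KaggleBundle:
    # List the remote files without downloading them yet
    members = self._list_remote_files()  # hypothetical helper
    return KaggleBundle(self._dataset, members, self._is_competition, single_file=False)

def _save(self, data: KaggleBundle) -> None:
    # Download every member of the bundle into the target directory
    for member in data.members:
        self._download_member(member)  # hypothetical helper
```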
but then, when I want to use it in a Kedro pipeline, because a node cannot have the same dataset as input and output, the only way I could make it work is by duplicating a `KaggleDataSet` in the catalog and using a dummy node that just passes the bundle without doing anything:
Copy code
# catalog.yaml
spaceship_titanic_kaggle_orig:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

spaceship_titanic_kaggle:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

# nodes.py
from kedro_kaggle_dataset.kaggle_dataset import KaggleBundle

def download_titanic(data: KaggleBundle) -> KaggleBundle:
    return data


# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=download_titanic,
            inputs="spaceship_titanic_kaggle_orig",
            outputs="spaceship_titanic_kaggle",
            name="download_titanic_node",
        ),
        ...
this is the code of the dataset in case you want to try it https://github.com/astrojuanlu/kedro-kaggle-dataset
is there a better way of doing this?
d
so I think the trick here is to be explicit with `load_args` and `save_args`, but let me DM you and think through this properly
in truth this is a higher level abstraction than we normally do in the standard bundle
so not all of the concepts map neatly
a
Yeah, this is a very interesting idea but I think it might be tricky to do within the Kedro catalog paradigm as it stands. I think the simplest way would be to have a dataset a bit like `APIDataSet` that just exposes `_load` (no `_save`). This would take as some sort of argument (maybe in `load_args`, maybe top level) the specification of which files you want from Kaggle. `_load` then opens these up in memory (or saves them somewhere temporary if you have to specify a filepath according to the Kaggle API) as pandas dataframes. Then it's the responsibility of the node that consumes those datasets to manipulate them as required and output to a new, persisted dataset with its own catalog entry (e.g. `pandas.CSVDataSet`).
💡 1
Curious if @Nok Lam Chan has any ideas here
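(A rough, untested sketch of what such a load-only dataset could look like; the class name, the `files` load arg and the exact Kaggle API calls here are assumptions, not an actual implementation:)
```python
# Rough, untested sketch of a load-only dataset in the spirit of APIDataSet.
# The class name, the "files" load arg and the Kaggle API calls are illustrative only.
from pathlib import Path
from typing import Any, Dict
from zipfile import ZipFile

import pandas as pd
from kedro.io import AbstractDataSet, DataSetError


class KaggleCompetitionDataSet(AbstractDataSet):
    def __init__(self, dataset: str, directory: str, load_args: Dict[str, Any] = None):
        self._dataset = dataset
        self._directory = Path(directory)
        self._load_args = load_args or {}

    def _load(self) -> Dict[str, pd.DataFrame]:
        # Lazy import so instantiating the catalog doesn't require Kaggle credentials
        from kaggle.api.kaggle_api_extended import KaggleApi

        api = KaggleApi()
        api.authenticate()
        # Competition files arrive as a single <competition>.zip archive
        api.competition_download_files(self._dataset, path=str(self._directory))
        archive = self._directory / f"{self._dataset}.zip"
        # Open only the requested members, in memory, as pandas dataframes
        with ZipFile(archive) as zf:
            return {
                name: pd.read_csv(zf.open(name))
                for name in self._load_args.get("files", [])
            }

    def _save(self, data: Any) -> None:
        raise DataSetError("This dataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"dataset": self._dataset, "directory": str(self._directory)}
```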
d
I’ll forward my invite to you two
✔️ 1
n
I’ll need some time to think about it more, the APIDataSet approach seems fine to me. A bit off topic, but I think we need some kind of `PyTorchDataLoaderDataSet` or something similar. The `_save`/`_load` interface works quite well for pandas, but sometimes it's awkward because these DL libraries take a path as argument.
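(One way to picture that friction: a dataset whose `_load` simply hands back a path for the downstream node to feed to the DL library. Purely a sketch; all names below are made up:)
```python
# Illustrative sketch: _load returns a directory path instead of in-memory data,
# so a downstream node can pass it to a library that wants paths (e.g. torchvision).
from pathlib import Path
from typing import Any, Dict

from kedro.io import AbstractDataSet, DataSetError


class PathDataSet(AbstractDataSet):
    def __init__(self, path: str):
        self._path = Path(path)

    def _load(self) -> Path:
        # The consuming node decides how to read the files under this path
        return self._path

    def _save(self, data: Any) -> None:
        raise DataSetError("PathDataSet is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"path": str(self._path)}
```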
d
yeah seen lots of people talk about that
I’ve considered a special type of node that processes each partition 1:1
cos image processing today is a bit weird in one node
j
or maybe a dataset that needs no inputs, because what I'm doing here is essentially downloading a thing from the internet and putting it in `01_raw`
but the "dummy node" problem would still be there
(I also think the "custom datasets" page needs better docs but I'll tackle that when I fully understand them 😅 )
n
I can’t run the code for some reason - keep getting 403 error, running on CLI is fine tho
j
authentication problems? I was using environment variables for that
n
Downloading seems to be a `_load` rather than a `_save` to me - `_save` is something that will be called on a function's output
j
yeah I was struggling with that. the problem in that particular case is that the Kaggle download could be anything, so I don't know up front how I should handle it
👍🏼 1
n
I don't have a much better idea than the one Antony has; it does feel repetitive though to open up these files in a node and convert them to dataframes
j
that works as long as you assume that everything is a dataframe 🤔 but maybe we could return a dictionary of `io.BytesIO` objects and then downstream nodes decide what to do
by the way this approach worked 😄 and now I do understand what `_load` and `_save` actually do 🔥
K 1
Copy code
def _load(self) -> KaggleBundle:
    ...
    members = {}
    for member_filename in members_list:
        with open(self._directory / member_filename, "rb") as fh:
            members[member_filename] = BytesIO(fh.read())

    return KaggleBundle(..., members)

def _save(self, data: KaggleBundle) -> None:
    raise NotImplementedError("Cannot save back to Kaggle")
and now I only need 1 `KaggleDataSet` in the catalog, 1 destination `pandas.CSVDataSet`, and a function like this:
Copy code
def head_titanic(bundle: KaggleBundle) -> pd.DataFrame:
    df_test = pd.read_csv(bundle["test.csv"])
    return ...
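(For reference, the end-to-end wiring could then look roughly like this; the `head_titanic_train` filepath and the node name are illustrative:)
```
# catalog.yaml
spaceship_titanic_kaggle:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

head_titanic_train:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/head_titanic_train.csv

# pipeline.py
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=head_titanic,
            inputs="spaceship_titanic_kaggle",
            outputs="head_titanic_train",
            name="head_titanic_node",
        ),
    ])
```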
happy to keep the meeting on Monday if you want folks, but I'm quite happy with this
notes for Monday: what to do if I don't want to download this data every time I run a pipeline
d
I made an expiring cached dataset a long time ago
I should contribute it
it essentially pickled something with the date it was last downloaded and then only triggered a new request after some threshold passed
💡 1
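(A minimal sketch of that idea, assuming the payload is pickled together with the time it was fetched; the class name and arguments are made up, this is not the actual Untitled.py:)
```python
# Minimal sketch of an expiring cached HTTP dataset: pickle the payload together with
# the time it was fetched, and only hit the URL again once a threshold has passed.
# Class name, arguments and file layout are assumptions, not the original code.
import pickle
import time
from pathlib import Path
from typing import Any, Dict

import requests
from kedro.io import AbstractDataSet


class ExpiringHTTPDataSet(AbstractDataSet):
    def __init__(self, url: str, cache_path: str, max_age_seconds: int = 24 * 3600):
        self._url = url
        self._cache_path = Path(cache_path)
        self._max_age = max_age_seconds

    def _load(self) -> bytes:
        if self._cache_path.exists():
            fetched_at, payload = pickle.loads(self._cache_path.read_bytes())
            if time.time() - fetched_at < self._max_age:
                return payload  # cache is still fresh, skip the request
        payload = requests.get(self._url, timeout=30).content
        self._cache_path.write_bytes(pickle.dumps((time.time(), payload)))
        return payload

    def _save(self, data: Any) -> None:
        raise NotImplementedError("Load-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url, "cache_path": str(self._cache_path)}
```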
j
waiting folks!
@Nok Lam Chan @datajoely
d
sorry
n
Sorry, just finished an ad hoc meeting
d
Untitled.py
This is my super old ExpiringHTTPDataSet
Untitled.yaml
👀 1
and it could be used like this
j
Copy code
birdclef_kaggle:
  type: BagOfFilesDataSet  # .zip/heterogeneous/generic/whatever
  path: <kaggle://birdclef-2023/>
  credentials: kaggle_credentials
  patterns: "**/*.ogg"

# birdclef_kaggle ----> Bundle ----> audio_files

audio_files:
  type: PartitionedDataSet
  path: data/02_intermediate/audio
  dataset: audio.OGGDataSet
👍🏼 1
n
Totally unrelated to the KaggleDataSet - I like the `patterns` attribute, it would be a nice addition to `PartitionedDataSet`. Currently it's possible to customize the path, but it's tricky to define the structure of your partitions. Thank you for the meeting, I don't have any answers, only more questions now 😂
☝🏼 1
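(For context, this is roughly how a `PartitionedDataSet` is declared today - you can set the path, the underlying dataset type and a `filename_suffix`, but nothing like the `patterns` glob above; the entry names reuse the earlier example:)
```yaml
audio_files:
  type: PartitionedDataSet
  path: data/02_intermediate/audio
  dataset: audio.OGGDataSet   # hypothetical dataset type from the example above
  filename_suffix: ".ogg"
```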
j
the best kind of meeting! 😂
the last thing we discussed @datajoely (not sure if you were still there) is that datasets seem to involve formats, whereas protocols feel more appropriate for transports or locations
👍 1
n
Also, I like the wrapper DataSet concept - which I haven't seen many people doing? Essentially it's a MixIn approach for datasets. (Not sure if the format is the best though, since it's weird to have more than 2 MixIns as your catalog starts to look arbitrarily nested.) I used to have a wrapper `CacheArtifactDataSet` which does all the metadata, upload/download and cache MD5 checking, i.e. if the file doesn't exist it will download it from remote storage and store it in your local `path`
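(An illustrative sketch of that wrapper pattern - a dataset that syncs a remote file to a local path before delegating to an inner dataset. Names, arguments and the size-based staleness check are assumptions; the original used MD5 checking:)
```python
# Sketch of a wrapper dataset: sync the file from remote storage to a local path
# when it's missing or stale, then delegate load/save to the wrapped dataset.
# Everything here is illustrative, not the original CacheArtifactDataSet.
from pathlib import Path
from typing import Any, Dict

import fsspec
from kedro.io import AbstractDataSet


class CachedRemoteDataSet(AbstractDataSet):
    def __init__(self, remote_path: str, local_path: str, dataset: Dict[str, Any]):
        self._remote_path = remote_path
        self._local_path = Path(local_path)
        # The wrapped dataset (e.g. pandas.CSVDataSet) reads from the local copy
        self._wrapped = AbstractDataSet.from_config("wrapped", dataset)

    def _sync(self) -> None:
        fs, _, _ = fsspec.get_fs_token_paths(self._remote_path)
        if self._local_path.exists():
            # Cheap staleness check; the original idea compared MD5 checksums instead
            if self._local_path.stat().st_size == fs.size(self._remote_path):
                return
        self._local_path.parent.mkdir(parents=True, exist_ok=True)
        fs.get(self._remote_path, str(self._local_path))

    def _load(self) -> Any:
        self._sync()
        return self._wrapped.load()

    def _save(self, data: Any) -> None:
        self._wrapped.save(data)
        fs, _, _ = fsspec.get_fs_token_paths(self._remote_path)
        fs.put(str(self._local_path), self._remote_path)

    def _describe(self) -> Dict[str, Any]:
        return {"remote_path": self._remote_path, "local_path": str(self._local_path)}
```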
j
FWIW I've been trying to implement a Kaggle filesystem - still unsure how it will allow me to download the whole `.zip` bundle easily (looks like downloading individual files will be easy), but at least `ls`, `info` and `size` seem to work nicely https://github.com/astrojuanlu/kedro-kaggle-dataset/pull/3
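(For anyone curious, a custom fsspec filesystem boils down to subclassing `AbstractFileSystem` and implementing `ls`, `info` and `_open`; this skeleton is illustrative only, not the code in that PR:)
```python
# Rough skeleton of a custom fsspec filesystem; illustrative only.
from typing import Any, Dict, List

import fsspec


class KaggleFileSystem(fsspec.AbstractFileSystem):
    protocol = "kaggle"

    def __init__(self, competition: str, **kwargs: Any):
        super().__init__(**kwargs)
        self._competition = competition

    def ls(self, path: str, detail: bool = True, **kwargs: Any) -> List:
        # List the files attached to the competition/dataset via the Kaggle API,
        # returning one dict per entry ("name", "size", "type") when detail=True.
        entries: List[Dict[str, Any]] = []
        ...  # call the Kaggle API here
        return entries if detail else [e["name"] for e in entries]

    def info(self, path: str, **kwargs: Any) -> Dict[str, Any]:
        # Single-entry metadata; fsspec derives size() from info()["size"]
        return {"name": path, "size": 0, "type": "file"}

    def _open(self, path: str, mode: str = "rb", **kwargs: Any):
        # Return a readable file-like object for an individual member file
        raise NotImplementedError


# Registering it makes kaggle:// URLs resolvable by fsspec-aware datasets
fsspec.register_implementation("kaggle", KaggleFileSystem, clobber=True)
```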