Juan Luis
03/30/2023, 8:40 AM
KaggleDataSet with a _load method that basically returns a KaggleBundle containing the list of files I will download, and a _save method that performs the download. If I load and save it from Python using the catalog, it works beautifully:
In [2]: catalog.list()
Out[2]: ['spaceship_titanic_kaggle', 'head_titanic_train', 'parameters']
In [3]: bundle = catalog.load("spaceship_titanic_kaggle")
[03/30/23 10:21:01] INFO Loading data from 'spaceship_titanic_kaggle' (KaggleDataSet)... data_catalog.py:343
[03/30/23 10:21:02] WARNING /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/rest.py:62: DeprecationWarning: warnings.py:109
HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access HTTPResponse.headers
directly.
return self.urllib3_response.getheaders()
In [4]: bundle
Out[4]: KaggleBundle(dataset_or_competition='spaceship-titanic', members=['sample_submission.csv', 'train.csv', 'test.csv'], is_competition=True, single_file=False)
In [5]: catalog.save("spaceship_titanic_kaggle", data=bundle)
[03/30/23 10:21:51] INFO Saving data to 'spaceship_titanic_kaggle' (KaggleDataSet)... data_catalog.py:382
[03/30/23 10:21:52] WARNING /Users/juan_cano/.micromamba/envs/kaggle310-dev/lib/python3.10/site-packages/kaggle/api_client.py:181: warnings.py:109
DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access
HTTPResponse.headers directly.
response_data.getheaders())
Downloading spaceship-titanic.zip to data/01_raw
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 988kB/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299k/299k [00:00<00:00, 986kB/s]
In [6]: !ls data/01_raw
sample_submission.csv test.csv train.csv
But then, when I want to use it in a Kedro pipeline, because a node cannot have the same dataset as input and output, the only way I could make it work is by duplicating the KaggleDataSet in the catalog and using a dummy node that just passes the bundle through without doing anything:
# catalog.yaml
spaceship_titanic_kaggle_orig:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True

spaceship_titanic_kaggle:
  type: kedro_kaggle_dataset.KaggleDataSet
  dataset: spaceship-titanic
  directory: data/01_raw/
  is_competition: True
# nodes.py
from kedro_kaggle_dataset.kaggle_dataset import KaggleBundle


def download_titanic(data: KaggleBundle) -> KaggleBundle:
    # dummy pass-through node; the download itself happens in KaggleDataSet._save
    return data
# pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import download_titanic


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=download_titanic,
            inputs="spaceship_titanic_kaggle_orig",
            outputs="spaceship_titanic_kaggle",
            name="download_titanic_node",
        ),
        ...
datajoely
03/30/2023, 8:56 AM
load_args and save_args, but let me DM you and think through this properly

Antony Milne
03/30/2023, 9:53 AM
APIDataSet that just exposes _load (no _save). This would take, as some sort of argument (maybe in load_args, maybe top level), the specification of which files you want from Kaggle. _load then opens these up in memory (or saves them somewhere temporary, if you have to specify a filepath according to the kaggle API) as pandas dataframes. Then it’s the responsibility of the node that consumes those datasets to manipulate them as required and output to a new, persisted dataset with its own catalog entry (e.g. pandas.CSVDataSet).
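A minimal sketch of the shape Antony describes; the class name, constructor arguments and the elided download step are all illustrative assumptions, not an actual implementation:

import pandas as pd
from kedro.io import AbstractDataSet


class KaggleAPIDataSet(AbstractDataSet):
    """Load-only dataset: fetches the requested files and returns dataframes."""

    def __init__(self, dataset: str, files: list[str], directory: str):
        self._dataset = dataset
        self._files = files  # the specification of which files you want from Kaggle
        self._directory = directory  # temporary location, since the kaggle API wants a filepath

    def _load(self) -> dict[str, pd.DataFrame]:
        ...  # hypothetical: call the kaggle API to download self._files into self._directory
        return {name: pd.read_csv(f"{self._directory}/{name}") for name in self._files}

    def _save(self, data) -> None:
        raise NotImplementedError("This dataset only exposes _load")

    def _describe(self) -> dict:
        return {"dataset": self._dataset, "files": self._files}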
datajoely
03/30/2023, 9:56 AM

Nok Lam Chan
03/30/2023, 12:32 PM
PyTorchDataLoaderDataSet or something similar. The _save, _load interface works quite well for pandas, but sometimes it’s awkward for these DL libraries, which will take a path as argument.
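To illustrate the awkwardness: a DL loader typically wants the path itself, not an in-memory object. A rough sketch, where only the class name comes from the message and everything else is an assumption:

from kedro.io import AbstractDataSet
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder


class PyTorchDataLoaderDataSet(AbstractDataSet):
    def __init__(self, path: str, batch_size: int = 32):
        self._path = path
        self._batch_size = batch_size

    def _load(self) -> DataLoader:
        # the library consumes the *path* directly, so there is no natural
        # in-memory object for _load to hand over
        dataset = ImageFolder(root=self._path, transform=transforms.ToTensor())
        return DataLoader(dataset, batch_size=self._batch_size)

    def _save(self, data: DataLoader) -> None:
        raise NotImplementedError

    def _describe(self) -> dict:
        return {"path": self._path}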
datajoely
03/30/2023, 12:38 PM

Juan Luis
03/30/2023, 2:37 PM
01_raw
Nok Lam Chan
03/30/2023, 2:39 PM

Juan Luis
03/30/2023, 2:40 PM

Nok Lam Chan
03/30/2023, 2:42 PM
_load instead of _save to me - save is something that will be called for function output

Juan Luis
03/30/2023, 2:51 PM

Nok Lam Chan
03/30/2023, 2:51 PM

Juan Luis
03/30/2023, 2:53 PM
io.BytesIO objects
_load and _save actually do 🔥

def _load(self) -> KaggleBundle:
    ...
    members = {}
    for member_filename in members_list:
        # read every downloaded member fully into memory
        with open(self._directory / member_filename, "rb") as fh:
            members[member_filename] = BytesIO(fh.read())
    return KaggleBundle(..., members)

def _save(self, data: KaggleBundle) -> None:
    raise NotImplementedError("Cannot save back to Kaggle")
and now I only need 1 KaggleDataSet in the catalog, 1 destination pandas.CSVDataSet, and a function like this:
def head_titanic(bundle: KaggleBundle) -> pd.DataFrame:
    # bundle["test.csv"] returns a BytesIO that pandas can read directly
    df_test = pd.read_csv(bundle["test.csv"])
    return ...
✨ happy to keep the meeting on Monday if you want folks, but I'm quite happy with this
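Going by the Out[4] repr earlier and the bundle["test.csv"] indexing, KaggleBundle is presumably something like the following dataclass; a hypothetical sketch, not the actual source:

from dataclasses import dataclass, field
from io import BytesIO


@dataclass
class KaggleBundle:
    # field names taken from the Out[4] repr above; members as a dict of
    # filename -> contents matches the BytesIO _load (the first version
    # stored just a list of filenames). __getitem__ is an assumption to
    # support bundle["test.csv"] as used in head_titanic.
    dataset_or_competition: str
    members: dict[str, BytesIO] = field(default_factory=dict)
    is_competition: bool = False
    single_file: bool = False

    def __getitem__(self, member_filename: str) -> BytesIO:
        return self.members[member_filename]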
datajoely
04/01/2023, 8:21 AM

Juan Luis
04/03/2023, 3:05 PM

datajoely
04/03/2023, 3:06 PM

Nok Lam Chan
04/03/2023, 3:10 PM

datajoely
04/03/2023, 3:13 PM

Juan Luis
04/03/2023, 3:59 PM
birdclef_kaggle:
  type: BagOfFilesDataSet  # .zip/heterogeneous/generic/whatever
  path: kaggle://birdclef-2023/
  credentials: kaggle_credentials
  patterns: "**/*.ogg"

# birdclef_kaggle ----> Bundle ----> audio_files
audio_files:
  type: PartitionedDataSet
  path: data/02_intermediate/audio
  dataset: audio.OGGDataSet
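For the middle arrow: Kedro's PartitionedDataSet saves a dictionary mapping partition ids to data, so the connecting node could be as small as this sketch (the bundle's items() interface is an assumption):

def extract_audio(bundle) -> dict:
    # PartitionedDataSet persists each key as one partition under `path`,
    # writing it with the configured `dataset` (audio.OGGDataSet here)
    return {member_name: data for member_name, data in bundle.items()}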
Nok Lam Chan
04/03/2023, 4:03 PM
The patterns attribute will be a nice addition to the PartitionedDataSet. Currently it’s possible to customize the path, but it’s tricky to define the structures for your partitions.
Thank you for the meeting, I don’t have any answer but only more questions now 😂

Juan Luis
04/03/2023, 4:04 PM

Nok Lam Chan
04/03/2023, 4:07 PM
CacheArtifactDataSet which does all the metadata, upload/download and cached MD5 checking, i.e. if the file doesn’t exist it will download it from a remote storage and store it in your local path
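A rough sketch of that caching behaviour, with the function name and the download helper both hypothetical:

import hashlib
from pathlib import Path


def fetch_cached(remote_url: str, local_path: Path, expected_md5: str) -> Path:
    # re-download only when the local copy is missing or its checksum differs
    if local_path.exists():
        if hashlib.md5(local_path.read_bytes()).hexdigest() == expected_md5:
            return local_path  # cache hit, skip the download
    download(remote_url, local_path)  # hypothetical helper for the remote storage
    return local_path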
Juan Luis
04/15/2023, 9:44 AM
.zip bundle easily (looks like downloading individual files will be easy), but at least ls, info and size seem to work nicely: https://github.com/astrojuanlu/kedro-kaggle-dataset/pull/3
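For context, ls, info and size are standard methods on fsspec filesystems, so the interface in that PR presumably behaves along these lines; the protocol registration is an assumption:

import fsspec

# assumes the PR registers a "kaggle://" protocol with fsspec
fs = fsspec.filesystem("kaggle")
fs.ls("spaceship-titanic")              # list the files in a competition
fs.info("spaceship-titanic/train.csv")  # metadata for a single member
fs.size("spaceship-titanic/train.csv")  # size in bytes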