# questions
g
I'm migrating to Kedro for a bunch of my projects! For one of them I have been using wget in shell scripts to grab the data. Here is an example that grabs the Canada national road network, with flags to recurse two levels deep, add random waits between requests, and drop everything into a landing directory:
wget --random-wait -r -l 2 -p -np -nH -e robots=off -P ../../../data/external/ -U mozilla https://ftp.maps.canada.ca/pub/nrcan_rncan/vector/geobase_nrn_rrn/
There are a couple of aspects that I am hoping can be improved about this approach. The first is that it requires wget, which is no problem on my Linux system but, as far as I know, needs additional configuration to be available on Windows. The second is that I'm not sure how to migrate something like this to Kedro. I see that there are a variety of classes available to put in the data catalog, which is great, but I am not sure whether any of them provide something functionally equivalent to this call. I see there is kedro_datasets.api.APIDataSet, but I don't know if it supports what I need. If not, I am guessing I could learn how the class structure works for Kedro datasets and develop my own data source. Any suggestions for this noob?
🔥 3
d
IMO Kedro is not a data download tool. You can always find some dataset, or write a custom one, that could do this, but... if you manually run your pipeline, do you really want to go fetch the same data again unnecessarily? Or do you want a separate process that fetches the data (which can be automated pulls, if in production), with the pipeline operating on the downloaded data? In my view, the latter almost always makes more sense. Perhaps not a great example, but in a project I started (and abandoned) a few years back, I wrote a script (probably not that different from what you've been doing) for the data download, and the Kedro catalog entries are configured to pull from that landing space: https://github.com/deepyaman/inauditus/blob/develop/refresh-data
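Something like this, stripped down to Python (just a sketch: requests and the landing paths are assumptions, and the recursive crawl that wget -r does is left out):
```python
# Sketch of a standalone download step kept outside the pipeline.
# Assumes the `requests` library; BASE_URL comes from the wget example above,
# LANDING_DIR mirrors the -P landing directory. Recursive crawling is omitted.
from pathlib import Path

import requests

BASE_URL = "https://ftp.maps.canada.ca/pub/nrcan_rncan/vector/geobase_nrn_rrn/"
LANDING_DIR = Path("data/external")


def fetch(relative_path: str) -> Path:
    """Download one file into the landing directory, skipping existing files."""
    target = LANDING_DIR / relative_path
    if target.exists():  # don't re-download on every run
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    response = requests.get(BASE_URL + relative_path, timeout=60)
    response.raise_for_status()
    target.write_bytes(response.content)
    return target
```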
❤️ 1
j
thanks for sharing this @Galen Seilis! this is an interesting example of non-conventional data access. as @Deepyaman Datta says, it's not something Kedro excels at, although if you absolutely want this process to be taken care of by Kedro (which I completely sympathise with), you could potentially create your own custom dataset for it. it's what I did with my first attempt at a KaggleDataset: https://github.com/astrojuanlu/kedro-kaggle-dataset/tree/kaggle-fs. APIDataSet will probably not offer the degree of flexibility that you need. I'm adding this example to our long issue https://github.com/kedro-org/kedro/issues/1936, feel free to chime in there
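for orientation, the general shape of a custom dataset is just three hooks on AbstractDataSet; everything below besides the method names is placeholder:
```python
# Minimal shape of a custom dataset. The _load/_save/_describe hooks are
# the real AbstractDataSet interface; the class itself is a placeholder.
from typing import Any, Dict

from kedro.io import AbstractDataSet


class DownloadDataSet(AbstractDataSet):
    def __init__(self, url: str):
        self._url = url

    def _load(self) -> Any:
        ...  # fetch from self._url and return the data

    def _save(self, data: Any) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return {"url": self._url}
```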
❤️ 1
I'm thinking that it would be cool to have a generic dataset that takes the subprocess you want to execute and the folder you expect your data to be in, and then the node receives a list of files.
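A rough, untested sketch of that idea, reusing the same AbstractDataSet shape (SubprocessDataSet is an invented name, not an existing Kedro class):
```python
# Untested sketch of a generic "run a command, then hand back the files"
# dataset. The command runs on every load, so caching would be up to it.
import subprocess
from pathlib import Path
from typing import Any, Dict, List

from kedro.io import AbstractDataSet


class SubprocessDataSet(AbstractDataSet):
    def __init__(self, command: List[str], output_dir: str):
        self._command = command
        self._output_dir = Path(output_dir)

    def _load(self) -> List[Path]:
        subprocess.run(self._command, check=True)  # e.g. the wget call above
        return sorted(p for p in self._output_dir.rglob("*") if p.is_file())

    def _save(self, data: Any) -> None:
        raise NotImplementedError("read-only dataset")

    def _describe(self) -> Dict[str, Any]:
        return {"command": self._command, "output_dir": str(self._output_dir)}
```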
💯 1
g
Thank you both for sharing your perspectives. One point which I think is important here is that the dataset is quite large, and I don't want to refresh it every time, as that takes a substantial amount of time. For now I will opt to grab the data using a separate script that runs the wget calls; from there, the Kedro project can have those files listed in the data catalog. It adds a single step to the setup of the overall project, but that will be fine in my case.
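Roughly what I have in mind for the catalog side (paths and file format are placeholders, and the same entry could live in catalog.yml as a PartitionedDataSet):
```python
# Sketch of exposing the pre-downloaded landing folder to the pipeline.
# Import path matches Kedro 0.18; the path and dataset type are assumptions.
from kedro.io import PartitionedDataSet

road_network = PartitionedDataSet(
    path="data/external/geobase_nrn_rrn",  # wherever the wget script lands files
    dataset="pandas.CSVDataSet",  # placeholder: match the real file format
)

partitions = road_network.load()  # {partition_id: load_function} per file
for partition_id, load_func in partitions.items():
    df = load_func()  # lazily load each file as needed
```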
👍🏼 1
👍 1
d
The IncrementalDataSet may be helpful here too
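Roughly: it's a partitioned dataset with a checkpoint, so repeat runs only see new files. A sketch with assumed paths (import per Kedro 0.18):
```python
# Sketch of IncrementalDataSet usage; it stores a checkpoint so partitions
# that were already confirmed are skipped on the next load. Paths are assumed.
from kedro.io import IncrementalDataSet

road_network = IncrementalDataSet(
    path="data/external/geobase_nrn_rrn",  # same landing folder as above
    dataset="pandas.CSVDataSet",  # placeholder format
)

new_partitions = road_network.load()  # only partitions past the checkpoint
# ... process new_partitions ...
road_network.confirm()  # advance the checkpoint after successful processing
```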
💯 1