# questions
s
Let's say I have the following:
- source: a CSV REST API with time series data and a 'duration to pull' parameter
- task: weekly preparation of a dataset (historic and recent data) to be used by a BI tool for visualization

What would be the Kedronic way to implement this? My guess: define a 'first run/update run' parameter in conf/parameters.yml. On a first run, pull all the data there is (duration to pull in last weeks = NaN) and save it as a partitioned dataset into 01_raw, with yearweek as the partition key. On an update run, determine the number of weeks to pull by checking what has already been downloaded (the difference between begin = the most recent yearweek folder name in the partitioned dataset, and end = the current yearweek) and save into the same partitioned dataset. In fact, I guess this would happen inside the same node as the first run, the only difference being the computed 'duration to pull' parameter. In another node, the report dataset would be prepared (concatenate all data, save as a multi-sheet xlsx) and saved into 08_reporting. Any advice is appreciated!
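Roughly, I imagine the 'how many weeks to pull' step looking like this (just a sketch, all names made up; it assumes the PartitionedDataset hands the node a dict of partition id -> load function, and that partition ids are zero-padded yearweek strings):

```python
from datetime import date
from typing import Callable, Dict, Optional


def weeks_to_pull(partitions: Dict[str, Callable], first_run: bool) -> Optional[int]:
    """Decide how many weeks of history to request from the API.

    `partitions` is what a Kedro PartitionedDataset passes into a node:
    a dict of partition id -> load function. Partition ids are assumed to
    be zero-padded yearweek strings such as "2024W07", so string max()
    gives the most recent one.
    """
    if first_run or not partitions:
        return None  # None -> pull the full available history

    latest = max(partitions)
    year, week = int(latest[:4]), int(latest[5:])
    last_monday = date.fromisocalendar(year, week, 1)
    this_year, this_week, _ = date.today().isocalendar()
    current_monday = date.fromisocalendar(this_year, this_week, 1)
    return max((current_monday - last_monday).days // 7, 1)
```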
b
I think the Kedronic way would be to determine the duration on each run. What I would do is one of the following:
1) If you're executing from some type of scheduler, look up the last available date there and pass the duration on as a parameter when calling `run` (you could also do this manually, but you risk someone messing it up).
2) Think about what costs would be incurred from pulling the entire history each run and compare that to the cost of someone messing it up by pulling the incorrect amount of history (in other words, just pull the whole history each time).
If you really wanted the pipeline to look up the last run itself, then you would probably have to use hooks to do so.
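For option 1, the scheduler-side lookup could be a tiny wrapper that checks the newest partition folder and forwards the result via `--params` (paths and parameter names are made up, and the exact `--params` syntax depends on your Kedro version):

```python
# run_weekly.py -- hypothetical wrapper that the scheduler would call
import subprocess
from datetime import date
from pathlib import Path

RAW_DIR = Path("data/01_raw/timeseries")  # assumed location of the partitioned dataset


def weeks_since_last_partition() -> int:
    if not RAW_DIR.exists():
        return 0  # convention: 0 means "pull the full history"
    partitions = sorted(p.name for p in RAW_DIR.iterdir() if p.is_dir())
    if not partitions:
        return 0
    year, week = int(partitions[-1][:4]), int(partitions[-1][5:])
    last_monday = date.fromisocalendar(year, week, 1)
    this_year, this_week, _ = date.today().isocalendar()
    current_monday = date.fromisocalendar(this_year, this_week, 1)
    return max((current_monday - last_monday).days // 7, 1)


subprocess.run(
    ["kedro", "run", "--params", f"duration_weeks={weeks_since_last_partition()}"],
    check=True,
)
```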
👍 1
s
This would be designed as a cron job running weekly. Always pulling the whole history isn't advisable because the history isn't stable over time (data retention, aggregation, deletion); at any point in time I can only trust the most recent data.
b
Could you pull the whole history, save the data to an IncrementalDataset, and have a node that loads the incremental dataset and chops off the older data? You could also look into using the Code API.
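Something like this on the node side (a sketch; it assumes the incremental dataset hands the node already-loaded DataFrames keyed by partition id, that there is a 'timestamp' column to filter on, and that `keep_weeks` comes from your parameters):

```python
from typing import Dict

import pandas as pd


def trim_history(partitions: Dict[str, pd.DataFrame], keep_weeks: int) -> pd.DataFrame:
    """Concatenate the newly arrived partitions and drop rows older than
    `keep_weeks` weeks before the newest timestamp seen."""
    df = pd.concat(partitions.values(), ignore_index=True)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    cutoff = df["timestamp"].max() - pd.Timedelta(weeks=keep_weeks)
    return df[df["timestamp"] >= cutoff]
```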
It's worth pointing out, too, that the reason this feels complicated is that Kedro is trying to push you towards writing pure data pipelines, i.e. pipelines that don't depend on external state when they run. If you build it out the way you're describing, each run will be dependent on the state the stored CSVs are in, so if one run fails it will be really difficult to recreate that state to debug later.
s
This is interesting, I wouldn't have thought that my case was that 'special'. I can see your point about problems with crashed runs. From my perspective, if something crashes (and I detect it through the logs), I would manually delete all data belonging to the crashed run and rerun the whole thing, right? Just to be clear: the report dataset preparation could happen on any run, but pulling the raw data of the partitioned dataset, meaning gigabytes of data over the web on every run even though I only need a few MB, seems a bit odd to me.
b
I don't disagree, and I think there used to be language around that in the documentation for the code API, but I do think that's probably your best option
s
Thanks, I will look into it!
NB: would the logical problem disappear if I just always pulled a fixed duration (e.g. 1 week) after the big initial historical pull, i.e. ditching the idea of having the pipeline determine the duration itself?
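E.g. something like this as the weekly pull node, with `duration_weeks: 1` in conf/base/parameters.yml (endpoint and query parameter names are placeholders for the real API):

```python
import io
from datetime import date

import pandas as pd
import requests


def pull_recent_weeks(duration_weeks: int) -> dict:
    """Pull a fixed window from the CSV REST API and return it as a single
    new partition for the PartitionedDataset, keyed by the current yearweek."""
    response = requests.get(
        "https://example.com/api/timeseries.csv",  # hypothetical endpoint
        params={"duration_weeks": duration_weeks},
        timeout=60,
    )
    response.raise_for_status()
    df = pd.read_csv(io.StringIO(response.text))

    year, week, _ = date.today().isocalendar()
    return {f"{year}W{week:02d}": df}  # e.g. {"2024W07": <DataFrame>}
```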
b
Yes, that would also work
s
ditching it is ;)
😁 1