# questions
d
hi team, just a quick question. Let's say I have output O1 from node1, with the catalog configured so the content of O1 is saved to CSV. node2 uses O1 as input. The current behaviour is that node2 reloads the data from the O1 file instead of from memory (this is expected, I assume, due to the catalog configuration). Is there any way I could still have O1 saved as CSV (easier for business people to check data quality) while having O1 loaded into node2 through memory (faster, and no need to deal with CSV save/load tricks)? Thanks
e
You could define two output datasets from node1. One of the two datasets would be saved as a CSV (i.e. there is a catalog entry for CSV), while the other is simply referenced by node2 (i.e. no catalog entry, and therefore it's an "in-memory" dataset). Let me know if that makes sense.
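A minimal sketch of this two-output approach (all dataset and function names here are illustrative, not from the thread): node1 returns the same DataFrame twice, only one of the two outputs has a catalog entry, and the entry-less one stays in memory.

```yaml
# catalog.yml — only the CSV copy is declared.
# "model_input_mem" has no entry, so Kedro treats it as an
# in-memory dataset and node2 reads it without touching disk.
model_input_csv:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/model_input.csv

# Corresponding pipeline wiring (shown here as comments for context;
# the node function would simply `return df, df`):
#   node(prepare, inputs="raw_data",
#        outputs=["model_input_csv", "model_input_mem"])
#   node(train, inputs="model_input_mem", outputs="model")
```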
d
Yes!! I haven't maintained it due to (perceived) lack of interest, but https://github.com/deepyaman/kedro-accelerator solves exactly this problem. 🙂
(I'm guessing it would require minimal changes to work with 0.18.x, but nobody's asked; if this solves your problem and you'd use it, I'm happy to try and find some time to update it, or of course happy to accept PRs)
d
Cool, thanks all, I will start with option 1 (logically, it should work without issue) and move to a more programmatic solution (accelerator) if I get extra time.
Good to know the accelerator was built so I know I was not asking for a meaningless use case 🙂
m
Hello, could this solution (https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcode-datasets) be interesting? With df@csv using pandas.CSVDataSet and df@memory using MemoryDataSet?
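For reference, a transcoding entry along the lines of the linked docs might look like the sketch below (names are hypothetical; whether an @memory variant behaves as intended here is worth verifying against the docs, since transcoding is usually shown with two persisted formats):

```yaml
# catalog.yml — transcoding sketch.
# df@csv persists to disk so business users can inspect it;
# df@memory is the variant node2 would consume from memory.
df@csv:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/df.csv

df@memory:
  type: MemoryDataSet
```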
d
Thanks Massinissa, good point. I think this is similar to what johnson mentioned above: fundamentally, you need to have two outputs.