#questions

Ben Phillips

02/25/2024, 6:00 PM
Hey! I have a use case where node X creates many CSV files in subdirectories (each subdirectory representing a class; FYI, this is a time series classification problem). Each of these CSV files could be massive, too big to fit into memory, so they'll need to be created one at a time (with data written in chunks to each file) within node X, i.e., holding them all in memory and then outputting them into a Kedro dataset all at once is not feasible. Node Y will then act on these CSV files after node X is done, and will further refine them to output some multi-dimensional NumPy arrays. I am wondering what node X should output and node Y should input for this use case? I am really looking for some way to enforce structure/an interface between nodes X and Y. Do partitioned datasets make sense here? Any advice greatly appreciated 🙂

datajoely

02/26/2024, 1:58 AM
Have you used PartitionedDataset for this?
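[For context, a `PartitionedDataset` catalog entry for this kind of layout might look like the sketch below. The dataset name and path are illustrative, and the exact `type` string varies across `kedro-datasets` versions, so treat this as an assumption rather than a drop-in config:]

```yaml
# conf/base/catalog.yml -- name and paths are illustrative
flow_csvs:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/flows   # subdirectories become part of partition ids
  dataset:
    type: pandas.CSVDataset
  filename_suffix: ".csv"
```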

Nok Lam Chan

02/26/2024, 10:30 AM
Is it possible to avoid CSV files to start with? Massive CSV files are generally not a good idea.
👍 1

Ben Phillips

02/26/2024, 1:49 PM
@datajoely no, not yet; I may have a play around soon.

@Nok Lam Chan the CSV files are converted from pcap files, where each pcap file represents a TCP session (or "flow"). Some of these sessions are huge in the source datasets, and unfortunately that can't be helped. I will sample from these CSV files later in the process, so the data samples I'll use when training my models will only draw on a fraction of each large file (in reference to my original problem statement, this will happen in node Y). Reasons for not wanting to move away from CSV files:
1. this is what the current code does, and I would prefer to stick closely to that for now
2. some encrypted network traffic source datasets come preprocessed as CSV files (i.e. no pcap files are provided), where each CSV file represents a session
👍🏼 1

Nok Lam Chan

02/26/2024, 1:52 PM
In this case, I think you need to explore PartitionedDataset with lazy loading/saving. You can search the docs for examples.
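[The lazy loading/saving pattern suggested above can be sketched in plain Python. With `PartitionedDataset`, node X returns a dict mapping partition ids to callables (each materialised only at save time, so only one partition is in memory at once), and node Y receives a dict mapping partition ids to load callables. The partition names and the chunk-producing function below are hypothetical stand-ins, and the dataset's disk I/O is omitted so the node logic can run standalone:]

```python
from typing import Callable, Dict
import numpy as np
import pandas as pd


def create_partitions() -> Dict[str, Callable[[], pd.DataFrame]]:
    """Node X: return partition-id -> callable for lazy saving.

    Kedro invokes each callable one at a time when saving the
    PartitionedDataset, so the full set of (potentially massive)
    CSVs never has to fit in memory together. Partition ids with
    slashes become subdirectory paths under the dataset's base path.
    """
    def make_partition(seed: int) -> Callable[[], pd.DataFrame]:
        # Hypothetical stand-in for converting one pcap session to rows.
        return lambda: pd.DataFrame({"t": range(3), "value": [seed] * 3})

    return {f"class_a/session_{i:03d}": make_partition(i) for i in range(2)}


def refine_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> np.ndarray:
    """Node Y: receive partition-id -> load callable, load one at a time."""
    arrays = []
    for partition_id in sorted(partitions):
        df = partitions[partition_id]()  # lazy load of a single partition
        arrays.append(df["value"].to_numpy())
    return np.stack(arrays)
```

[In a real pipeline, node X's output and node Y's input would both be wired to the same `PartitionedDataset` catalog entry, which is what enforces the structure/interface between the two nodes.]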
👍 1

Ben Phillips

02/26/2024, 1:56 PM
thanks, I'll take a look 🙂