# questions
a
Hello kedro team, I'm working on a data pipeline and I have a problem with a lazy PartitionedDataset. I have:
• a partitioned dataset as input
• a partitioned dataset as output
• a processing function
My problem is: my processing function loads a huge NN and I want to avoid loading it for every file when I do lazy saving. Does anyone know a clean solution? I was thinking of Hooks, but that would put part of the code logic in the Hook, not in the node.
m
What have you tried so far? Loading the NN as a separate dataset once and then passing it to the same node as the partitioned dataset should do the job.
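For example, something along these lines in the pipeline definition (the dataset, parameter and function names below are just placeholders, adjust them to your project):
```python
from kedro.pipeline import node, pipeline

from .nodes import process_partitions  # placeholder: your processing function


def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=process_partitions,
                # "partitioned_wav", "neural_network" and "bnf_features" are
                # placeholder catalog entries: the NN is loaded once by the
                # catalog and handed to the node alongside the partitions.
                inputs=["partitioned_wav", "neural_network", "params:options"],
                outputs="bnf_features",
                name="process_partitions_node",
            )
        ]
    )
```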
a
Thanks for the answer, that was my first implementation, but I want to use lazy saving to avoid loading all the data in memory. So I return a dict of Callables that wrap the loading & processing function. The processing happens after the node runs, when the dict is iterated over, so my NN is no longer loaded at that point...
m
I don’t quite understand. PartitionedDataset and the NN are two separate things here. You can still use lazy saving with a dict of callables; it’s actually a good approach here to avoid loading everything into memory.
a
```python
from typing import Any, Callable, Dict


def calc_func(audio, neural_network):
    # DO SOME PROCESSING
    return audio


def generate_bnf(partitioned_wav: Dict[str, Callable[[], Any]], neural_network, parameters: Dict[str, Any]):
    """Generate bottleneck features lazily.

    Args:
        partitioned_wav: Partitioned dataset of wav files (lazy load functions).
        neural_network: Pre-loaded neural network, passed in as a separate dataset.
        parameters: Parameters defined in parameters.yml.
    Returns:
        Bottleneck features as a dict of callables (lazy saving).
    """
    def wrapper():
        return calc_func(partition_load_func(), neural_network)

    bnf_features = {}

    for partition_key, partition_load_func in sorted(partitioned_wav.items()):
        bnf_features[partition_key] = lambda: wrapper()

    return bnf_features
```
I don't know if the wrapper function can still access the neural_network param when bnf_features is resolved.
m
Yes, it can, but note that your implementation is incorrect: you'll hit a closure issue, because every lambda captures the same partition_load_func variable, so they will all end up loading the last partition.
```python
def generate_bnf(partitioned_wav: Dict[str, Callable[[], Any]], neural_network, parameters: Dict[str, Any]):
    def get_wrapper(partition_load_func, nn):
        def wrapper():
            return calc_func(partition_load_func(), nn)
        return wrapper

    bnf_features = {}

    for partition_key, partition_load_func in sorted(partitioned_wav.items()):
        bnf_features[partition_key] = get_wrapper(partition_load_func, neural_network)

    return bnf_features
```
Something like this should work.
a
Thank you! You saved my day!
One last question: if I want to process the data in batches and load 10 files at a time, is that possible?
m
Sure, you can write logic in the node that creates batches from the dict.
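A rough sketch of what that could look like, reusing the wrapper idea from above (the batch size of 10 and the output key format are arbitrary, and it assumes calc_func can accept a list of audio arrays):
```python
from typing import Any, Callable, Dict


def generate_bnf_batched(
    partitioned_wav: Dict[str, Callable[[], Any]],
    neural_network,
    batch_size: int = 10,
):
    # Group the lazy load functions into chunks of `batch_size`.
    items = sorted(partitioned_wav.items())
    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

    def get_batch_wrapper(batch, nn):
        def wrapper():
            # Files of this batch are only loaded when the partition is saved,
            # then processed together with the already-loaded NN.
            audios = [load_func() for _, load_func in batch]
            return calc_func(audios, nn)
        return wrapper

    bnf_features = {}
    for batch_index, batch in enumerate(batches):
        # One output partition per batch of input files.
        bnf_features[f"batch_{batch_index:04d}"] = get_batch_wrapper(batch, neural_network)

    return bnf_features
```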
a
Thanks for your answer! How am I supposed to implement it?
Is it possible to define a list in the dict? Or do I have to write a custom "after_node_execution" hook?