# questions
c
hello there, I hope that this finds you well.
Potential bug with partitioned dataset lazy saving
I'm working with a partitioned dataset, specifically using lazy saving. I have a list of items (dicts). Each item has both a refname and a bbox. To implement lazy saving, following the docs, my node returns the following:
return {item["refname"]: lambda: get_image(item["bbox"], parameters) for item in items}
However, when I do that, the refname and the bbox get mixed up: images (bbox) are saved under the wrong refnames (the refname of another image-bbox).
Quick fix
If I don't implement lazy loading, everything works as expected (each image-bbox is saved under its own refname):
return {item["refname"]: get_image(item["bbox"], parameters) for item in items}
That said, I still need lazy loading.
Set up
kedro version: 0.18.3
OS: mac
Questions
Can you confirm that my implementation should be correct? If yes, do we have any experience with this bug? Is there any known fix? Should I raise an issue?
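Here is a minimal sketch of what I observe, without Kedro. The get_image below is a hypothetical stand-in that only echoes its bbox, and the items are made up for illustration; the comprehension is the same as in my node.

```python
def get_image(bbox, parameters):
    # Hypothetical stand-in for the real image loader: it only echoes the bbox.
    return f"image for bbox {bbox}"


# Made-up items, only for illustration.
items = [
    {"refname": "a", "bbox": (0, 0, 1, 1)},
    {"refname": "b", "bbox": (2, 2, 3, 3)},
    {"refname": "c", "bbox": (4, 4, 5, 5)},
]
parameters = {}

# Same pattern as in the node: one zero-argument lambda per refname.
lazy = {item["refname"]: lambda: get_image(item["bbox"], parameters) for item in items}

# Calling the lambdas afterwards (as lazy saving would) gives every
# refname the bbox of the last item.
results = {refname: load() for refname, load in lazy.items()}
print(results)
```

Every refname ends up with the image of the last item, which matches the mix-up I see on disk.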
👀 1
cc @Roberto P. Palomares
n
Note: waiting for a simplified example
y
I think I had the same very strange issue a while ago. I don't remember the details of Python's internals, but if I remember well this is due to lambda not being properly redefined. I'll try to reproduce tomorrow and get back to you.
thankyou 1
z
Hi, you probably have to give your lambda a parameter list with default values, so that it captures the actual value you pass to get_image. Not tested, but could an example like this help you?
return {item["refname"]: lambda bbox=item["bbox"]: get_image(bbox, parameters) for item in items}
The template I usually use is the following (with an explicit parameter list):
return {
    partition_key: (
        lambda partition_load_func=partition_load_func, partition_key=partition_key: _my_function(
            partition_load_func(),
            partition_key,
            parameters,
        )
    )
    for partition_key, partition_load_func in loaded.items()
}
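To make the template concrete, here is a small self-contained sketch. The names _my_function, loaded and the partition keys are hypothetical placeholders; loaded only mimics the shape a partitioned dataset gives a node (partition key mapped to a zero-argument load function), no Kedro import is involved.

```python
def _my_function(data, partition_key, parameters):
    # Hypothetical transformation: tags the loaded data with its partition key.
    return f"{partition_key}:{data}"


# Stand-in for the node input: partition key -> zero-argument load function.
loaded = {
    "part_a": lambda: "data_a",
    "part_b": lambda: "data_b",
}
parameters = {}

# Default arguments are evaluated when each lambda is created, so every
# lambda keeps its own partition_load_func and partition_key.
lazy_outputs = {
    partition_key: (
        lambda partition_load_func=partition_load_func, partition_key=partition_key: _my_function(
            partition_load_func(),
            partition_key,
            parameters,
        )
    )
    for partition_key, partition_load_func in loaded.items()
}

# Calling the lambdas later still pairs each key with its own data.
saved = {key: save_func() for key, save_func in lazy_outputs.items()}
print(saved)
```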
y
Hi @Cyril Verluise, sorry for the delay! I think I remember what happened. There is a scoping issue with lambdas: a variable used in the lambda body is only looked up when the lambda is called, so lambdas created in a loop all end up seeing the last value. I think the fix went like this (I know it looks stupid, but I am pretty sure it was my fix back then):
def _create_lambda(bbox, parameters):
    return lambda: get_image(bbox, parameters)
and then :
return {item["refname"]: _create_lambda(item["bbox"], parameters) for item in items}
I couldn't find the blog post about Python's variable scope resolution again, which was an interesting read, but I did not search very carefully. If someone finds the reference, please tell me!
By the way @Nok Lam Chan I'll try to create a reproducible example if Cyril confirms this is the right solution because this is very hard to debug and may be worth documenting.
thankyou 1
c
thanks a lot! let me try that tomorrow
Great to have such a wonderful community. Let's check that and create a proper issue/request when it's done!
This approach works fine!
return {
    partition_key: (
        lambda partition_load_func=partition_load_func, partition_key=partition_key: _my_function(
            partition_load_func(),
            partition_key,
            parameters,
        )
    )
    for partition_key, partition_load_func in loaded.items()
}
Awesome!
Same for Yolan's proposal!
btw, aren't the two approaches equivalent?
👍 1
@Nok Lam Chan, what's the next step? Raising an issue directly on GH summarizing the above and asking for a doc clarification? Lmk, happy to do it!
đŸ‘đŸŒ 1
y
Yes, actually both approaches force the resolution by binding the variables' current values when the lambda is created (either through default args or through the scope of a "proper" function), so they are pretty much equivalent.
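A quick sketch to check the equivalence, assuming a dummy get_image and made-up items. The functools.partial variant is an extra option I am adding here, it was not proposed above; it also binds the arguments when the callable is created.

```python
from functools import partial


def get_image(bbox, parameters):
    # Dummy stand-in for the real loader: returns the bbox it was given.
    return ("image", bbox)


def _create_lambda(bbox, parameters):
    # bbox lives in this function's own scope, one per call.
    return lambda: get_image(bbox, parameters)


items = [{"refname": "a", "bbox": 1}, {"refname": "b", "bbox": 2}]
parameters = {}

# 1. Default argument: evaluated when the lambda is defined.
with_defaults = {
    item["refname"]: lambda bbox=item["bbox"]: get_image(bbox, parameters)
    for item in items
}
# 2. Factory function: each call creates a fresh scope holding bbox.
with_factory = {
    item["refname"]: _create_lambda(item["bbox"], parameters) for item in items
}
# 3. functools.partial (my addition): arguments are stored at creation time.
with_partial = {
    item["refname"]: partial(get_image, item["bbox"], parameters) for item in items
}


def resolve(lazy):
    # Call every lazy entry, as a lazy save would do later on.
    return {key: func() for key, func in lazy.items()}


expected = {"a": ("image", 1), "b": ("image", 2)}
print(resolve(with_defaults) == resolve(with_factory) == resolve(with_partial) == expected)
```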
👍 1
c
Issue raised here: https://github.com/kedro-org/kedro/issues/3052#issue-1904569322 Feel free to subscribe to receive related news and/or to comment! Thanks a lot for your help!
thankyou 1
n
thank you both
So I finally had time to play around with it. From my understanding, this is not a Kedro problem. It's how variable scoping works with lambdas in Python. See this example:
In [7]: iterable = [lambda: print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: print("Assign the variable to lambda scope")
   ...: 
   ...: iterable = [lambda x=x : print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: 
3
3
3
3
Assign the variable to lambda scope
0
1
2
3
This Stack Overflow thread explains it better: <https://stackoverflow.com/questions/938429/scope-of-lambda-functions-and-their-parameters>
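The same effect can be seen without any loop. This is only a small sketch with made-up variable names: a name used in the lambda body is looked up when the lambda is called, while a default argument is evaluated once, when the lambda is defined.

```python
x = 1

late = lambda: x        # x is looked up each time late() is called
early = lambda x=x: x   # the default value is captured right now (x == 1)

x = 2

late_result = late()    # sees the current value of x
early_result = early()  # still uses the value captured at definition time
print(late_result, early_result)
```

In a loop or comprehension the loop variable plays the role of x, which is why all the lambdas end up with the last value.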
y
Yes exactly, this is a Python problem, not a Kedro one. That said, I think this is a common error with PartitionedDataSet (and a silent one), so it may be useful to warn about it.
👍 1
n
That’s fair, I think we can add a Note section to warn about this. I just wanted to confirm this is not a bug that Kedro introduced. Actually, shouldn't there be a lint tool that can pick this up? My guess is that one exists already.