# questions
a
I believe I have found a bug when running the same pipeline with different parameters. For instance, I have the following pipeline: function X -> versioned dataset -> function Y. If I start this pipeline twice and the 2nd pipeline's X node finishes earlier, I don't get the expected dataset
👀 1
A bit more context so as to not spam this channel
I have the following timeline:
1. Start pipeline execution A, function X
2. Start pipeline execution B, function X
3. Pipeline execution B, function X finishes
4. Pipeline execution B, function Y starts with the dataset generated at step 3 (as expected)
5. Pipeline execution A, function X finishes
6. Pipeline execution A, function Y starts with the dataset generated at step *3* (not 5, unexpected)
I think the issue comes from the fact that there doesn't seem to be a "version ID" for the whole run, only for the node, and it is assumed that when you run a node, whatever latest version of the dataset was created is the proper one
If I force a non-existing version with `--load-versions`, I get the expected behaviour
The easiest way to reproduce is to create a very small function X that generates a small random CSV, debug/break the function, then start the exact same pipeline from the command line
Once you debug function Y, you'll notice that X's output doesn't match Y's input
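For anyone who wants to try this locally, a minimal sketch of such a project might look like the following. The function names, the `sleep_seconds` parameter, and the catalog entry are illustrative assumptions rather than the reporter's actual code, and the sleep stands in for the breakpoint used in the original repro:

```python
# Minimal repro sketch -- illustrative names, not taken from the original report.
#
# Assumed catalog.yml entry:
# versioned_dataset:
#   type: pandas.CSVDataSet
#   filepath: data/01_raw/versioned_dataset.csv
#   versioned: true

import time

import numpy as np
import pandas as pd

from kedro.pipeline import Pipeline, node


def generate_random_csv(sleep_seconds: int) -> pd.DataFrame:
    """Function X: produce a small random dataframe after an artificial delay
    (the sleep replaces the breakpoint used to slow one run down)."""
    time.sleep(sleep_seconds)
    return pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])


def consume_csv(df: pd.DataFrame) -> None:
    """Function Y: print the input so the two runs can be compared."""
    print(df)


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(generate_random_csv, inputs="params:sleep_seconds",
                 outputs="versioned_dataset", name="function_x"),
            node(consume_csv, inputs="versioned_dataset",
                 outputs=None, name="function_y"),
        ]
    )
```

Running this pipeline twice, first with a large `sleep_seconds` and then immediately with a small one, should reproduce the mismatch described above.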
n
Any chance you could share a repository to demonstrate this? This is quite complicated; I don't think I am following
a
I don't, but I could probably do something quick tomorrow
i
Let me see if I understood:
```
pipeline = node_a -> versioned_dataset -> node_b

pipeline_1 => node_a_1 (slow) -> versioned_dataset_1 !!! versioned_dataset_2 -> node_b_1
pipeline_2 => node_a_2 (fast) -> versioned_dataset_2 -> node_b_2
```
pipeline_2 starts slightly after pipeline_1, but node_a_2 finishes before node_a_1 which leads to node_b_1 getting versioned_dataset_2 as an input?
a
yeah, pretty much it
i
that’s surprising, as far as I know if you don’t supply a version it should be pulling from `latest/`, which in both cases should be the correct version, as `node_b_2` would start as soon as `versioned_dataset_2` was generated by `node_a_2`, and `node_b_1` would start right after `node_a_1` generates `versioned_dataset_1`. If instead of forcing the pause through debugging you make `node_a` sleep according to a certain parameter, and make that parameter bigger in one of the two runs, does the behavior still occur?
a
Maybe this will help
image.png
Start this pipeline with a parameter that makes it run, say, 1000 epochs
Start this pipeline again right after, with a parameter that makes it run 1 epoch
The debugging was just an easy way to sleep() or otherwise slow down an execution
and that is the issue actually
the dataset version ID (the timestamp) is determined when a node is starting, not when it is saving
Since node_b_1 only starts after node_a_1 finishes, by which point versioned_dataset_2 (with the newer timestamp) already exists, versioned_dataset_2 is the most up to date version (the latest), and it is that version that gets loaded by node_b_1
In an ideal world, the version ID (the timestamp) would be established at the beginning of the pipeline run, not at the beginning of the execution of the node
The second best option (though it would still create issues) is to have the version ID (the timestamp) established at the moment of saving the dataset, but that would still create a timing issue if you have two runs saving at almost the same time
And currently, the version ID (the timestamp) is established when the node starts executing, which means that you can have multiple version IDs if you have multiple nodes generating versioned datasets
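To make the timing argument concrete, here is a minimal sketch of the race as described above. This is not Kedro's code, only an illustration of the claimed behaviour, with made-up timestamps:

```python
# Illustrative sketch only -- not Kedro source code. It assumes:
#   * the save version is a timestamp fixed when the producing node starts, and
#   * loading without an explicit version picks the newest timestamp on disk.

versions_on_disk = []


def choose_save_version(node_start_time: str) -> str:
    """The version ID is fixed at node start, not at save time."""
    return node_start_time


def save(version: str) -> None:
    """The dataset lands on disk under the version chosen at start time."""
    versions_on_disk.append(version)


def load_latest() -> str:
    """'latest' is simply the greatest timestamp present on disk."""
    return max(versions_on_disk)


# Timeline from the report (timestamps are illustrative):
v_a = choose_save_version("2023-01-01T10.00.00.000Z")  # run A's node_a starts first (slow)
v_b = choose_save_version("2023-01-01T10.00.05.000Z")  # run B's node_a starts later (fast)

save(v_b)                    # run B finishes first
assert load_latest() == v_b  # run B's node_b loads its own output -- fine

save(v_a)                    # run A finally saves, but under the *older* timestamp
assert load_latest() == v_b  # run A's node_b still loads run B's output -- the reported bug
```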
n
The ID is set at the beginning of a pipeline, so this shouldn’t matter with node ordering
And it’s cached, so even if another process creates a new dataset, it wouldn’t matter.
a
then the latest is wrong for node_b_1
but I just reproduced it
reliably
n
It would be really helpful if you could create an example and steps for me to reproduce the issue on my end.
a
give me ~30 minutes
n
Any chance you can open up a Github Issue? https://github.com/kedro-org/kedro/issues/new/choose
a
Will do
n
Thank you, this will help a lot!
a
writing a new issue
Let me know if you need more details
n
I don’t have enough time to test it today. I had a quick look at the issue; it’s quite clear and I think I have enough to reproduce it.
I’ll try to find some time tomorrow
a
No problem. I fear it might require a bit of a rewrite (not at the fundamental level, but in the way versions are passed to nodes/runs)
👍🏼 1
n
I can reproduce the issue. It’s a bit strange and I need more time to check. I can see this WARNING log from the 100s sleeping run
```
WARNING  /Users/Nok_Lam_Chan/GitHub/kedro/kedro/io/core.py:632 warnings.py:109
         UserWarning: Save version '2023-06-16T11.33.37.506Z' did not match
         load version '2023-06-16T11.33.52.306Z' for
         CSVDataSet(filepath=/Users/Nok_Lam_Chan/dev/test/kedro-versionned-dataset-bug/data/01_raw/versionned_dataset.csv,
         load_args={}, protocol=file, save_args={'index': False},
         version=Version(load=None, save='2023-06-16T11.33.37.506Z')).
         This is strongly discouraged due to inconsistencies it may cause
         between 'save' and 'load' operations. Please refrain from setting
         exact load version for intermediate data sets where possible to
         avoid this warning.
```