# questions
a
I believe I have found a bug when running the same pipeline with different parameters. For instance, I have the following pipeline: function X -> versioned dataset -> function Y. If I start this pipeline twice and the 2nd pipeline's X node finishes earlier, I don't get the expected dataset
👀 1
A bit more context so as to not spam this channel
I have the following timeline:
1. Start pipeline execution A, function X
2. Start pipeline execution B, function X
3. Pipeline execution B, function X finishes
4. Pipeline execution B, function Y starts with the dataset generated at step 3 (as expected)
5. Pipeline execution A, function X finishes
6. Pipeline execution A, function Y starts with the dataset generated at step *3* (not 5, unexpected)
I think the issue comes from the fact that there doesn't seem to be a "version ID" for the whole run, only for the node, and it is assumed that when you run a node, whatever latest version of the dataset was created is the proper one
If I force a non-existing version with `--load-versions`, I get the expected behaviour
The easiest way to reproduce is to create a very small function X that generates a small random CSV, debug/break the function, then start the exact same pipeline from the command line
Once you debug function Y, you'll notice that X's output doesn't match Y's input
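For anyone who wants to try this locally, a minimal sketch of such a project might look like the following. The function names, the `sleep_seconds` parameter, and the catalog entry are illustrative assumptions rather than the reporter's actual code, and the sleep stands in for the breakpoint used in the original repro:

```python
# Minimal repro sketch -- illustrative names, not taken from the original report.
#
# Assumed catalog.yml entry:
# versioned_dataset:
#   type: pandas.CSVDataSet
#   filepath: data/01_raw/versioned_dataset.csv
#   versioned: true

import time

import numpy as np
import pandas as pd

from kedro.pipeline import Pipeline, node


def generate_random_csv(sleep_seconds: int) -> pd.DataFrame:
    """Function X: produce a small random dataframe after an artificial delay
    (the sleep replaces the breakpoint used to slow one run down)."""
    time.sleep(sleep_seconds)
    return pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])


def consume_csv(df: pd.DataFrame) -> None:
    """Function Y: print the input so the two runs can be compared."""
    print(df)


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(generate_random_csv, inputs="params:sleep_seconds",
                 outputs="versioned_dataset", name="function_x"),
            node(consume_csv, inputs="versioned_dataset",
                 outputs=None, name="function_y"),
        ]
    )
```

Running this pipeline twice, first with a large `sleep_seconds` and then immediately with a small one, should reproduce the mismatch described above.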
n
Any chance you could share a repository to demonstrate this? This is quite complicated; I don't think I am following
a
I don't, but I could probably do something quick tomorrow
i
Let me see if I understood:
```
pipeline = node_a -> versioned_dataset -> node_b

pipeline_1 => node_a_1 (slow) -> versioned_dataset_1 !!! versioned_dataset_2 -> node_b_1
pipeline_2 => node_a_2 (fast) -> versioned_dataset_2 -> node_b_2
```
pipeline_2 starts slightly after pipeline_1, but node_a_2 finishes before node_a_1 which leads to node_b_1 getting versioned_dataset_2 as an input?
a
yeah, pretty much it
i
that’s surprising, as far as I know if you don’t supply a version it should be pulling from `latest/`, which in both cases should be the correct version, as `node_b_2` would start as soon as `versioned_dataset_2` was generated by `node_a_2`, and `node_b_1` would start right after `node_a_1` generates `versioned_dataset_1`. If instead of forcing the pause through debugging you make `node_a` sleep according to a certain parameter, and make that parameter bigger in one of the two runs, does the behavior still occur?
a
Maybe this will help
image.png
Start this pipeline with a parameter that makes it run, say, 1000 epochs
Start this pipeline again right after, with a parameter that makes it run 1 epoch
The debugging was just an easy way to sleep() or otherwise slow down an execution
and that is the issue actually
the dataset version ID (the timestamp) is determined when a node is starting, not when it is saving
Since node_b_1 only starts after node_a_1 finishes, by which point versioned_dataset_2 (with the newer timestamp) already exists, versioned_dataset_2 is the most up to date version (the latest), and it is that version that gets loaded by node_b_1
In an ideal world, the version ID (the timestamp) would be established at the beginning of the pipeline run, not at the beginning of the execution of the node
The second best option (though it would still create issues) is to have the version ID (the timestamp) established at the moment of saving the dataset, but that would still create a timing issue if you have two runs saving at almost the same time
And currently, the version ID (the timestamp) is established when the node starts executing, which means that you can have multiple version IDs if you have multiple nodes generating versioned datasets
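To make the timing argument concrete, here is a minimal sketch of the race as described above. This is not Kedro's code, only an illustration of the claimed behaviour, with made-up timestamps:

```python
# Illustrative sketch only -- not Kedro source code. It assumes:
#   * the save version is a timestamp fixed when the producing node starts, and
#   * loading without an explicit version picks the newest timestamp on disk.

versions_on_disk = []


def choose_save_version(node_start_time: str) -> str:
    """The version ID is fixed at node start, not at save time."""
    return node_start_time


def save(version: str) -> None:
    """The dataset lands on disk under the version chosen at start time."""
    versions_on_disk.append(version)


def load_latest() -> str:
    """'latest' is simply the greatest timestamp present on disk."""
    return max(versions_on_disk)


# Timeline from the report (timestamps are illustrative):
v_a = choose_save_version("2023-01-01T10.00.00.000Z")  # run A's node_a starts first (slow)
v_b = choose_save_version("2023-01-01T10.00.05.000Z")  # run B's node_a starts later (fast)

save(v_b)                    # run B finishes first
assert load_latest() == v_b  # run B's node_b loads its own output -- fine

save(v_a)                    # run A finally saves, but under the *older* timestamp
assert load_latest() == v_b  # run A's node_b still loads run B's output -- the reported bug
```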
n
The ID is set at the beginning of a pipeline, so this shouldn’t matter with node ordering
And it’s cached, so even if another process creates a new dataset, it wouldn’t matter.
a
then the latest is wrong for node_b_1
but I just reproduced it
reliably
n
It would be really helpful if you could create an example and steps for me to reproduce the issue on my end.
a
give me ~30 minutes
n
Any chance you can open up a Github Issue? https://github.com/kedro-org/kedro/issues/new/choose
a
Will do
n
Thank you, this will help a lot!
a
writing a new issue
Let me know if you need more details
n
I don’t have enough time to test it today. I had a quick look at the issue; it’s quite clear and I think I have enough to reproduce it.
I’ll try to find some time tomorrow
a
No problem. I fear it might require a bit of a rewrite (not at the fundamental level, but in the way versions are passed to nodes/runs)
👍🏼 1
n
I can reproduce the issue. It’s a bit strange and I need more time to check. I can see this WARNING log from the 100s sleeping run
```
WARNING  /Users/Nok_Lam_Chan/GitHub/kedro/kedro/io/core.py:632 warnings.py:109
         UserWarning: Save version '2023-06-16T11.33.37.506Z' did not match
         load version '2023-06-16T11.33.52.306Z' for
         CSVDataSet(filepath=/Users/Nok_Lam_Chan/dev/test/kedro-versionned-dataset-bug/data/01_raw/versionned_dataset.csv,
         load_args={}, protocol=file, save_args={'index': False},
         version=Version(load=None, save='2023-06-16T11.33.37.506Z')).
         This is strongly discouraged due to inconsistencies it may cause
         between 'save' and 'load' operations. Please refrain from setting
         exact load version for intermediate data sets where possible to
         avoid this warning.
```