# questions
j
Hi Team, one of my team members claimed that he gets different final outputs depending on whether he saves all intermediate files or uses MemoryDataSet. Before digging in to find out what's going on, has anyone else seen this "saving intermediate files or not affects the final output" behavior?
d
It's possible, depending on whether your dataset round-trips properly. Kedro's defaults try to do the right thing, but it's not always possible. Using a memory dataset is like:
import pandas as pd

my_data = pd.DataFrame(...)
reloaded = my_data  # same in-memory object; nothing is serialized
Of course, this is "correct". Using an intermediate file is like:
import pandas as pd

my_data = pd.DataFrame(...)
# save_args / load_args come from the dataset's catalog configuration
my_data.to_csv("path/to/file.csv", **save_args)
reloaded = pd.read_csv("path/to/file.csv", **load_args)
Kedro tries to set things like whether to read/write the index column reasonably, but there can still be inconsistency. For example, nulls may be difficult to distinguish from empty strings unless you configure these explicitly. Formats like Parquet definitely help in this regard, but still may not be perfect.
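To make it concrete, here's a minimal round trip in plain pandas (hypothetical values, no Kedro involved):

import io

import pandas as pd

df = pd.DataFrame({"id": ["007", "42"], "val": [None, ""]})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(df["val"].tolist())        # [None, '']
print(reloaded["val"].tolist())  # [nan, nan] -- null and empty string collapsed
print(df["id"].tolist())         # ['007', '42']
print(reloaded["id"].tolist())   # [7, 42] -- leading zero lost, dtype now int64

If a downstream node branches on dtype or on "empty vs. null", the file-backed run diverges from the MemoryDataSet run.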
๐Ÿ‘ 3
j
Thanks @Deepyaman Datta, it definitely makes sense to me.
y
To avoid problems like in @Deepyaman Datta's example, what I like to do is avoid file formats with inconsistent save/load round trips for intermediary datasets. For inputs and free outputs that's fine. For intermediary datasets I almost always use only pickle and parquet.
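For example, the same frame from above survives a Parquet round trip (hypothetical values; assumes pyarrow or fastparquet is installed):

import pandas as pd

df = pd.DataFrame({"id": ["007", "42"], "val": [None, ""]})

df.to_parquet("tmp.parquet")
reloaded = pd.read_parquet("tmp.parquet")

print(reloaded["id"].tolist())   # ['007', '42'] -- string dtype survives
print(reloaded["val"].tolist())  # [None, ''] -- null vs. empty string preserved

The same frame written through CSV would lose both distinctions.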
๐Ÿ‘ 1
n
Yup, the assumption is that save/load yields an identical result. This is usually true, but if you start saving with untyped CSV you run into trouble. (Please don't use CSV as intermediate storage; it's very inefficient.)
๐Ÿ‘ 2
Another potential root cause involves random seeds. For example, if you have two nodes that do random sampling, you get different results running one node at a time compared to running both in the same session, because the first node's draws advance the shared RNG state that the second node samples from.
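A minimal sketch of that effect, outside Kedro (hypothetical node bodies; pandas sample falls back to numpy's global RNG when no random_state is given):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Both "nodes" in one session: node_a's draw advances the shared
# global RNG before node_b samples.
np.random.seed(0)
node_a_out = df.sample(3)
node_b_after_a = df.sample(3)

# node_b alone in a fresh session starts from the initial RNG state.
np.random.seed(0)
node_b_alone = df.sample(3)

print(node_b_after_a.index.tolist())  # differs from...
print(node_b_alone.index.tolist())    # ...this

# One fix: give each node its own explicit seed so execution order
# doesn't matter.
node_b_stable = df.sample(3, random_state=42)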