# questions
j
Hi Team, one of my team members claimed that he gets different final outputs depending on whether he saves all intermediate files or uses MemoryDataSet. Before digging in to find out what's going on, has anyone else seen this "saving intermediate files or not affects the final output" behavior?
d
It's possible, depending on whether your dataset round-trips properly. Kedro's defaults try to do the right thing, but it's not always possible. Using a memory dataset is like:
import pandas as pd

my_data = pd.DataFrame(...)
reloaded = my_data  # same in-memory object; nothing is serialized
Of course, this is "correct". Using an intermediate file is like:
import pandas as pd

my_data = pd.DataFrame(...)
# save_args / load_args come from the dataset's catalog configuration
my_data.to_csv("path/to/file.csv", **save_args)
reloaded = pd.read_csv("path/to/file.csv", **load_args)
Kedro tries to set things like whether to read/write the index column reasonably, but there can still be inconsistency. For example, nulls may be difficult to distinguish from empty strings unless you configure these explicitly. Formats like Parquet definitely help in this regard, but still may not be perfect.
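To make it concrete, here's a minimal round trip in plain pandas (hypothetical values, no Kedro involved):

import io

import pandas as pd

df = pd.DataFrame({"id": ["007", "42"], "val": [None, ""]})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(df["val"].tolist())        # [None, '']
print(reloaded["val"].tolist())  # [nan, nan] -- null and empty string collapsed
print(df["id"].tolist())         # ['007', '42']
print(reloaded["id"].tolist())   # [7, 42] -- leading zero lost, dtype now int64

If a downstream node branches on dtype or on "empty vs. null", the file-backed run diverges from the MemoryDataSet run.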
๐Ÿ‘ 3
j
Thanks @Deepyaman Datta, it definitely makes sense to me.
y
To avoid problems like in @Deepyaman Datta's example, what I like to do is avoid file formats with inconsistent save/load round trips for intermediary datasets. For inputs and free outputs that's fine. For intermediary datasets I almost always use only pickle and parquet.
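For example, the same frame from above survives a Parquet round trip (hypothetical values; assumes pyarrow or fastparquet is installed):

import pandas as pd

df = pd.DataFrame({"id": ["007", "42"], "val": [None, ""]})

df.to_parquet("tmp.parquet")
reloaded = pd.read_parquet("tmp.parquet")

print(reloaded["id"].tolist())   # ['007', '42'] -- string dtype survives
print(reloaded["val"].tolist())  # [None, ''] -- null vs. empty string preserved

The same frame written through CSV would lose both distinctions.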
๐Ÿ‘ 1
n
Yup, the assumption is that save/load yields an identical result. This is usually true, but if you start saving with untyped CSV you run into trouble. (Please don't use CSV as intermediate storage; it's very inefficient.)
๐Ÿ‘ 2
Another potential root cause involves random seeds. For example, if you have two nodes that do random sampling, you get different results running one node at a time compared to running both in the same session, because the first node's draws advance the shared RNG state that the second node samples from.
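A minimal sketch of that effect, outside Kedro (hypothetical node bodies; pandas sample falls back to numpy's global RNG when no random_state is given):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Both "nodes" in one session: node_a's draw advances the shared
# global RNG before node_b samples.
np.random.seed(0)
node_a_out = df.sample(3)
node_b_after_a = df.sample(3)

# node_b alone in a fresh session starts from the initial RNG state.
np.random.seed(0)
node_b_alone = df.sample(3)

print(node_b_after_a.index.tolist())  # differs from...
print(node_b_alone.index.tolist())    # ...this

# One fix: give each node its own explicit seed so execution order
# doesn't matter.
node_b_stable = df.sample(3, random_state=42)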