# questions
p
Hey team, is there a standard way to deal with possibly non-existent files in Kedro's catalog (taking a very generic use case: a node that accepts an optional argument of type DataFrame)? Thank you!
m
Why do you need such functionality?
p
To handle the following behavior: either the file exists and we append its content to an existing dataframe, or it doesn't and we return the existing dataframe unaltered.
@marrrcin any thoughts? Meanwhile I found this, which indicates there's no built-in solution, but I wonder if there are recommended workarounds.
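A minimal sketch of that behavior as a node function, assuming pandas DataFrames (`maybe_append` is a made-up name):

```python
from typing import Optional

import pandas as pd


def maybe_append(existing: pd.DataFrame, extra: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """Append ``extra`` to ``existing`` if it was loaded; otherwise return ``existing`` unaltered."""
    if extra is None:
        return existing
    return pd.concat([existing, extra], ignore_index=True)
```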
m
This should work
```python
import numpy as np
import pandas as pd
from kedro.pipeline import node, pipeline


def pandas_appender_pipeline():
    return pipeline(
        [
            # Generate a small random dataframe and save it through the catalog
            node(
                func=lambda: pd.DataFrame(
                    {
                        "A": np.random.randint(0, 512, 5),
                        "B": np.random.randint(512, 1024, 5),
                    }
                ),
                inputs=None,
                outputs="csv_appender",
                name="generate_dataframe",
            ),
            # Load the same dataset back and print it
            node(
                func=lambda df: print(df),
                inputs="csv_appender",
                outputs=None,
                name="print_dataframe",
            ),
        ]
    )
```
Catalog:
```yaml
csv_appender:
  type: pandas.GenericDataset
  file_format: csv
  filepath: data/03_primary/csv_appender.csv
  fs_args:
    open_args_save:
      mode: a  # open the file in append mode on every save
  save_args:
    mode: a
    index: false
    header: false  # skip the header row so repeated appends don't duplicate it
```
p
Alright, thanks!
m
Please let me know if it works for you
p
It serves the use case of appending indeed! My question was more generic though: how to load from the catalog only if the file exists in the defined location, and otherwise return None (for instance). I think I can just use something like this in a preceding node, though:

```python
import pandas as pd
from kedro.io import DatasetError
from kedro_datasets.pandas import ParquetDataset

try:
    # check whether the file exists by attempting to load it
    catalog.load('dataset_2')  # assumes `catalog` is available in scope
except DatasetError:
    # otherwise create an empty dataset and save it
    df2_loc = ParquetDataset(filepath="dataset_2.parquet")
    df2_loc.save(pd.DataFrame(columns=['col1', 'col2']))
```
m
Out of the box, there’s no such solution AFAIK (maybe @Juan Luis / @datajoely can prove me wrong 🤞 ). You can try going in this direction: https://kedro-org.slack.com/archives/C03RKP2LW64/p1695291572735829?thread_ts=1695291188.286309&cid=C03RKP2LW64
p
Ohhh of course, I can just create a subclass of ParquetDataset and handle that behavior in _load. Thanks for pointing that out @marrrcin! Super helpful
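A minimal sketch of that subclass, assuming a kedro-datasets version where `_load` is the override point (as referenced above) and that returning None for a missing file is acceptable; `OptionalParquetDataset` is a made-up name:

```python
from typing import Optional

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class OptionalParquetDataset(ParquetDataset):
    """A ParquetDataset that returns None instead of raising when the file is missing."""

    def _load(self) -> Optional[pd.DataFrame]:
        # _exists() checks the resolved filepath on the underlying filesystem
        if not self._exists():
            return None
        return super()._load()
```

It can then be referenced in the catalog by its import path (e.g. `type: <your_package>.datasets.OptionalParquetDataset`, path hypothetical), and downstream nodes simply check for None.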