#questions

Pedro Sousa Silva

12/05/2023, 11:32 AM
Hey team, is there a standard way to deal with possibly non-existent files in Kedro's catalog (taking the very generic use case of a node that accepts an optional dataframe argument)? Thank you!

marrrcin

12/05/2023, 12:03 PM
Why do you need such functionality?

Pedro Sousa Silva

12/05/2023, 12:08 PM
To handle the following behavior: either the file exists and we append its content to an existing dataframe, or it doesn't and we return the existing dataframe unaltered
@marrrcin any thoughts? Meanwhile I found this, which suggests there's no built-in solution, but I wonder whether there are recommended workarounds.
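Roughly, the node behavior I have in mind is something like this (hypothetical names, just a sketch):

from typing import Optional

import pandas as pd


def append_if_present(
    existing: pd.DataFrame, extra: Optional[pd.DataFrame] = None
) -> pd.DataFrame:
    # If the optional input could be loaded, append it; otherwise pass the
    # existing dataframe through unchanged.
    if extra is None:
        return existing
    return pd.concat([existing, extra], ignore_index=True)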

marrrcin

12/05/2023, 2:17 PM
This should work:

import numpy as np
import pandas as pd

from kedro.pipeline import node, pipeline


def pandas_appender_pipeline():
    return pipeline(
        [
            # Generate a small random dataframe and save it through the
            # append-mode "csv_appender" catalog entry.
            node(
                func=lambda: pd.DataFrame(
                    {
                        "A": np.random.randint(0, 512, 5),
                        "B": np.random.randint(512, 1024, 5),
                    }
                ),
                inputs=None,
                outputs="csv_appender",
                name="generate_dataframe",
            ),
            # Load the appended file back and print it.
            node(
                func=lambda df: print(df),
                inputs="csv_appender",
                outputs=None,
                name="print_dataframe",
            ),
        ]
    )
Catalog:
# Opened and saved in append mode, so each save adds rows to the existing
# CSV instead of overwriting it.
csv_appender:
  type: pandas.GenericDataset
  file_format: csv
  filepath: data/03_primary/csv_appender.csv
  fs_args:
    open_args_save:
      mode: a  # open the underlying file for appending
  save_args:
    mode: a
    index: false
    header: false  # don't repeat the header row on every append
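With this entry, every run of the pipeline above should append five new rows to csv_appender.csv rather than overwriting it (assuming the append-mode settings behave as expected on your storage backend).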

Pedro Sousa Silva

12/05/2023, 2:34 PM
Alright, thanks!

marrrcin

12/05/2023, 2:47 PM
Please let me know if it works for you

Pedro Sousa Silva

12/05/2023, 3:20 PM
It does serve the appending use case indeed! My question was more generic, though: how to load from the catalog only if the file exists at the defined location, and otherwise return None (for instance). I think I can just use something like this in a preceding node, though:
import pandas as pd
from kedro.io import DatasetError
from kedro_datasets.pandas import ParquetDataset

try:
    # check whether the file exists by trying to load it
    catalog.load("dataset_2")
except DatasetError:
    # otherwise create an empty dataset with the expected schema and save it
    df2_loc = ParquetDataset(filepath="dataset_2.parquet")
    df2_loc.save(pd.DataFrame(columns=["col1", "col2"]))

marrrcin

12/05/2023, 3:30 PM
Out of the box, there’s no such solution AFAIK (maybe @Juan Luis / @datajoely can prove me wrong 🤞 ). You can try going in this direction: https://kedro-org.slack.com/archives/C03RKP2LW64/p1695291572735829?thread_ts=1695291188.286309&cid=C03RKP2LW64

Pedro Sousa Silva

12/05/2023, 4:11 PM
Ohhh of course, I can just create a subclass of ParquetDataset and handle that behavior in _load. Thanks for pointing that out @marrrcin! Super helpful
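For reference, a minimal sketch of such a subclass (hypothetical class and module names, assuming a recent kedro-datasets where ParquetDataset implements _exists() and _load()):

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class OptionalParquetDataset(ParquetDataset):
    """ParquetDataset that returns an empty DataFrame when the file doesn't exist yet."""

    def _load(self) -> pd.DataFrame:
        if not self._exists():
            # Nothing on disk yet: return an empty frame instead of raising
            return pd.DataFrame()
        return super()._load()

The catalog entry would then point at the subclass, e.g.:

dataset_2:
  type: my_project.datasets.OptionalParquetDataset  # hypothetical module path
  filepath: data/01_raw/dataset_2.parquet

Returning None instead of an empty DataFrame works too; an empty frame just keeps a downstream pd.concat straightforward.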