# questions
p
Hey team, is there a standard way to deal with possibly non-existent files in Kedro's catalog (taking a very generic use case: a node that accepts an optional argument of type DataFrame)? Thank you!
m
Why do you need such functionality?
p
To handle the following behavior: either the file exists and we append its content to an existing dataframe, or it doesn't and we return the existing dataframe unaltered.
@marrrcin any thoughts? Meanwhile I found this, which indicates there's no built-in solution, but I wonder if there are recommended workarounds.
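A minimal sketch of that behavior as a node function, assuming pandas DataFrames (`maybe_append` is a made-up name):

```python
from typing import Optional

import pandas as pd


def maybe_append(existing: pd.DataFrame, extra: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """Append ``extra`` to ``existing`` if it was loaded; otherwise return ``existing`` unaltered."""
    if extra is None:
        return existing
    return pd.concat([existing, extra], ignore_index=True)
```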
m
This should work
```python
import numpy as np
import pandas as pd
from kedro.pipeline import node, pipeline


def pandas_appender_pipeline():
    return pipeline(
        [
            # Generate a small random dataframe and save it through the catalog
            node(
                func=lambda: pd.DataFrame(
                    {
                        "A": np.random.randint(0, 512, 5),
                        "B": np.random.randint(512, 1024, 5),
                    }
                ),
                inputs=None,
                outputs="csv_appender",
                name="generate_dataframe",
            ),
            # Load the same dataset back and print it
            node(
                func=lambda df: print(df),
                inputs="csv_appender",
                outputs=None,
                name="print_dataframe",
            ),
        ]
    )
```
Catalog:
```yaml
csv_appender:
  type: pandas.GenericDataset
  file_format: csv
  filepath: data/03_primary/csv_appender.csv
  fs_args:
    open_args_save:
      mode: a  # open the file in append mode on every save
  save_args:
    mode: a
    index: false
    header: false  # skip the header row so repeated appends don't duplicate it
```
p
Alright, thanks!
m
Please let me know if it works for you
p
It serves the use case of appending indeed! My question was more generic though: how to load from the catalog only if the file exists in the defined location, and otherwise return None (for instance). I think I can just use something like this in a preceding node, though:

```python
import pandas as pd
from kedro.io import DatasetError
from kedro_datasets.pandas import ParquetDataset

try:
    # check whether the file exists by attempting to load it
    catalog.load('dataset_2')  # assumes `catalog` is available in scope
except DatasetError:
    # otherwise create an empty dataset and save it
    df2_loc = ParquetDataset(filepath="dataset_2.parquet")
    df2_loc.save(pd.DataFrame(columns=['col1', 'col2']))
```
m
Out of the box, there’s no such solution AFAIK (maybe @Juan Luis / @datajoely can prove me wrong 🤞 ). You can try going in this direction: https://kedro-org.slack.com/archives/C03RKP2LW64/p1695291572735829?thread_ts=1695291188.286309&cid=C03RKP2LW64
p
Ohhh of course, I can just create a subclass of ParquetDataset and handle that behavior in _load. Thanks for pointing that out @marrrcin! Super helpful
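A minimal sketch of that subclass, assuming a kedro-datasets version where `_load` is the override point (as referenced above) and that returning None for a missing file is acceptable; `OptionalParquetDataset` is a made-up name:

```python
from typing import Optional

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class OptionalParquetDataset(ParquetDataset):
    """A ParquetDataset that returns None instead of raising when the file is missing."""

    def _load(self) -> Optional[pd.DataFrame]:
        # _exists() checks the resolved filepath on the underlying filesystem
        if not self._exists():
            return None
        return super()._load()
```

It can then be referenced in the catalog by its import path (e.g. `type: <your_package>.datasets.OptionalParquetDataset`, path hypothetical), and downstream nodes simply check for None.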