# questions
f
Hi fellows, I would like to use example datasets defined in `data/` through the catalog in some "unit tests" (does it make sense?). To do so, I thought about using the `test_run` example from the documentation (created from `kedro new`) to load the `catalog` through the context. Is it the correct way to do it? Thanks in advance.
j
Hi Flavien! Can you give an example of exactly what you want to test in your data? Maybe you can use Hooks + Great Expectations instead?
f
Let's say I want to unit test a node/function like this one:
from pyspark.sql import DataFrame
from pyspark.sql.functions import array, arrays_overlap, col, lit


def filter_out_flagged_data(
    hourly_measures: DataFrame, excluded_flags: list[str]
) -> DataFrame:
    # Keep rows whose flags_array shares no element with excluded_flags,
    # then drop the flags_array column.
    return hourly_measures.filter(
        ~arrays_overlap(
            col("flags_array"), array([lit(flag) for flag in excluded_flags])
        )
    ).drop(col("flags_array"))
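In plain Python, the logic is roughly this (hypothetical row data, just to illustrate what the `arrays_overlap` filter does here; this is not the Spark code itself):

```python
# Plain-Python sketch of the node's logic (illustrative only):
# keep rows whose "flags_array" shares no element with excluded_flags,
# then drop the "flags_array" key.
def filter_out_flagged_rows(rows: list[dict], excluded_flags: list[str]) -> list[dict]:
    excluded = set(excluded_flags)
    return [
        {k: v for k, v in row.items() if k != "flags_array"}
        for row in rows
        if not excluded.intersection(row.get("flags_array", []))
    ]

# Hypothetical hourly measures:
rows = [
    {"hour": 0, "value": 10.0, "flags_array": ["ok"]},
    {"hour": 1, "value": 20.0, "flags_array": ["suspect"]},
    {"hour": 2, "value": 30.0, "flags_array": []},
]
print(filter_out_flagged_rows(rows, ["suspect"]))
# rows for hours 0 and 2 survive, without the flags_array key
```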
To do so, I would like to use a test dataset from a JSON file which is defined in a catalog through
hourly_measures:
  type: spark.SparkDataSet
  filepath: data/01_raw/hourly_measures_test.json
  file_format: json
  load_args:
    header: True
    multiline: True
It would be a pytest fixture using the catalog to load the data instead of including the JSON/dict content inside the fixture itself.
j
@Nok Lam Chan how would you approach this?
n
So what I would do is have a catalog fixture (the syntax may not be exactly correct, I am just typing it out here):
import pytest

from kedro.framework.session import KedroSession


@pytest.fixture
def context():
    session = KedroSession.create(..., env="test")
    return session.load_context()


@pytest.fixture
def catalog(context):
    return context.catalog


def test_my_dataset(catalog):
    dataset = catalog.load("some_json_dataset")
    ...  # do your work here
@Jose Nuñez @Flavien
👍 1
f
Cool, that's what I had in mind. Thanks for the confirmation!
n
Meanwhile, there is some discussion about creating a Testing client to provide a Python API for manipulating a Kedro project in an easier way. It's a very early discussion right now; not sure if the community needs this.