# questions
o
Hi all, in my Kedro data pipeline I need to capture the row count and the fill rate of certain columns in a PySpark dataframe after every filter or join operation. Currently I collect these metrics with `count()`, but it has significantly slowed down my pipeline. Is there a better way to capture this waterfall?
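For context, the slow pattern being described presumably looks something like the snippet below (an editor's illustration with made-up column names, not the asker's actual code); every `.count()` is a Spark action that forces a full pass over the data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("active", "a@x.com"), ("active", None), ("inactive", "b@x.com")],
    ["status", "email"],  # hypothetical columns for illustration
)

filtered = df.filter(F.col("status") == "active")  # lazy: no work happens yet
row_count = filtered.count()                       # action: triggers a full Spark job
# a second action, i.e. a second full pass, just for one column's fill rate
email_fill = filtered.filter(F.col("email").isNotNull()).count() / row_count
```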
n
Hi, this is a good question. It sounds more like a general Spark question, and I'm no data engineering expert, but is it possible to read the metadata directly if you are saving the data as Parquet or a similar format? https://duckdb.org/docs/data/parquet/metadata.html
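A minimal sketch of that idea in Python using pyarrow (the DuckDB `parquet_metadata()` function in the link above is an equivalent route). The path and column name are placeholders, and it assumes a flat schema whose writer recorded null counts in the footer:

```python
import pyarrow.parquet as pq


def row_count_and_fill_rate(path: str, column: str):
    """Read the row count and one column's fill rate from the Parquet footer."""
    meta = pq.ParquetFile(path).metadata
    total = meta.num_rows
    col_idx = meta.schema.names.index(column)  # assumes a flat schema
    nulls = 0
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_null_count:
            return total, None  # this writer did not record null counts
        nulls += stats.null_count
    fill_rate = 1 - nulls / total if total else None
    return total, fill_rate
```

Because everything comes from the file footer, no data pages are scanned.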
d
So it's a bit difficult to do this efficiently.
If you want to use Spark for this, it will count things by materialising the dataframe, even when that materialisation isn't needed by your query plan (a single-pass workaround is sketched after this message).
💡 1
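One way to at least amortise that cost (an editor's sketch, not from the thread): fold the row count and all fill-rate counts into a single `agg`, so Spark materialises the dataframe once per checkpoint rather than once per metric. `F.count(col)` counts only non-null values, which is exactly a fill-rate numerator:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("active", "a@x.com"), ("active", None)], ["status", "email"]  # toy data
)

cols = ["status", "email"]  # hypothetical columns whose fill rate you track
# One action computes every metric in a single pass over the data.
stats = df.agg(
    F.count(F.lit(1)).alias("row_count"),
    *[F.count(c).alias(f"{c}_non_null") for c in cols],
).first()
fill_rates = {c: stats[f"{c}_non_null"] / stats["row_count"] for c in cols}
```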
So I would consider having a separate metrics pipeline, or, as nok suggests, using a hook or something to read the metadata on disk directly (a rough hook sketch follows).
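A rough sketch of that hook idea (editor's addition; the `data/<dataset_name>` path convention is made up, so in practice resolve the real filepath from your catalog). Because Spark writes a folder of Parquet part files, the hook sums the footer row counts across them, which triggers no Spark job:

```python
from pathlib import Path

import pyarrow.parquet as pq
from kedro.framework.hooks import hook_impl


class WaterfallMetricsHooks:
    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data):
        folder = Path("data") / dataset_name  # hypothetical path convention
        files = list(folder.glob("*.parquet"))
        if not files:
            return  # not a Parquet dataset; nothing cheap to read here
        # Row counts live in the file footers, so no data is scanned.
        total_rows = sum(pq.ParquetFile(f).metadata.num_rows for f in files)
        print(f"{dataset_name}: {total_rows} rows")
```

Register it in your project's `settings.py` with `HOOKS = (WaterfallMetricsHooks(),)`.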