# questions
Hi all, in my Kedro data pipeline I need to capture the row count and the fill rate of certain columns in a PySpark dataframe after every filter or join operation. Currently I have implemented this metric collection using , but it has significantly slowed down my pipeline. Is there a better way to capture this waterfall?
Hi, this is a good question. It sounds more like a general Spark question, and I'm no data engineering expert, but is it possible to read the metadata directly if you are saving it as Parquet or a similar format? https://duckdb.org/docs/data/parquet/metadata.html
So it's a bit difficult to do this efficiently: if you want to use Spark for the counting, it will materialise the dataframe in memory to count it, even if that materialisation isn't needed as part of your query plan.
so I would consider having a separate metrics pipeline, or, as nok suggests, using a hook or something to read the metadata on disk directly