# questions
o
Hi all, in my Kedro data pipeline I need to capture the row count and the fill rate of certain columns in a PySpark dataframe after every filter or join operation. Currently I collect these metrics with `count()`, but it has significantly slowed down my pipeline. Is there a better way to capture this waterfall?
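For context, the slow pattern being described presumably looks something like the snippet below (an editor's illustration with made-up column names, not the asker's actual code); every `.count()` is a Spark action that forces a full pass over the data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("active", "a@x.com"), ("active", None), ("inactive", "b@x.com")],
    ["status", "email"],  # hypothetical columns for illustration
)

filtered = df.filter(F.col("status") == "active")  # lazy: no work happens yet
row_count = filtered.count()                       # action: triggers a full Spark job
# a second action, i.e. a second full pass, just for one column's fill rate
email_fill = filtered.filter(F.col("email").isNotNull()).count() / row_count
```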
n
Hi, this is a good question. It sounds more like a general Spark question, and I'm no data engineering expert, but is it possible to read the metadata directly if you are saving the data as Parquet or a similar format? https://duckdb.org/docs/data/parquet/metadata.html
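A minimal sketch of that idea in Python using pyarrow (the DuckDB `parquet_metadata()` function in the link above is an equivalent route). The path and column name are placeholders, and it assumes a flat schema whose writer recorded null counts in the footer:

```python
import pyarrow.parquet as pq


def row_count_and_fill_rate(path: str, column: str):
    """Read the row count and one column's fill rate from the Parquet footer."""
    meta = pq.ParquetFile(path).metadata
    total = meta.num_rows
    col_idx = meta.schema.names.index(column)  # assumes a flat schema
    nulls = 0
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_null_count:
            return total, None  # this writer did not record null counts
        nulls += stats.null_count
    fill_rate = 1 - nulls / total if total else None
    return total, fill_rate
```

Because everything comes from the file footer, no data pages are scanned.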
d
So it's a bit difficult to do this efficiently.
If you want to use Spark for this, it will count things by materialising the dataframe, even when that materialisation isn't needed by your query plan (a single-pass workaround is sketched after this message).
💡 1
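One way to at least amortise that cost (an editor's sketch, not from the thread): fold the row count and all fill-rate counts into a single `agg`, so Spark materialises the dataframe once per checkpoint rather than once per metric. `F.count(col)` counts only non-null values, which is exactly a fill-rate numerator:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("active", "a@x.com"), ("active", None)], ["status", "email"]  # toy data
)

cols = ["status", "email"]  # hypothetical columns whose fill rate you track
# One action computes every metric in a single pass over the data.
stats = df.agg(
    F.count(F.lit(1)).alias("row_count"),
    *[F.count(c).alias(f"{c}_non_null") for c in cols],
).first()
fill_rates = {c: stats[f"{c}_non_null"] / stats["row_count"] for c in cols}
```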
So I would consider having a separate metrics pipeline, or, as nok suggests, using a hook or something to read the metadata on disk directly (a rough hook sketch follows).
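A rough sketch of that hook idea (editor's addition; the `data/<dataset_name>` path convention is made up, so in practice resolve the real filepath from your catalog). Because Spark writes a folder of Parquet part files, the hook sums the footer row counts across them, which triggers no Spark job:

```python
from pathlib import Path

import pyarrow.parquet as pq
from kedro.framework.hooks import hook_impl


class WaterfallMetricsHooks:
    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data):
        folder = Path("data") / dataset_name  # hypothetical path convention
        files = list(folder.glob("*.parquet"))
        if not files:
            return  # not a Parquet dataset; nothing cheap to read here
        # Row counts live in the file footers, so no data is scanned.
        total_rows = sum(pq.ParquetFile(f).metadata.num_rows for f in files)
        print(f"{dataset_name}: {total_rows} rows")
```

Register it in your project's `settings.py` with `HOOKS = (WaterfallMetricsHooks(),)`.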