[Not a Kedro Question] - What’s the most efficient...
# questions
n
[Not a Kedro Question] - What’s the most efficient way to find out some basic statistics of a CSV (file size / number of rows and columns)? Requirements:
• Memory efficient (cannot load the full CSV into a dataframe and do `df.describe()`)
• Need to work on Windows and Linux, so `wc` is not an option
• Need to be fast
• Bonus: is it possible to generalise it to the Excel filetype?
👍 1
m
You may enjoy this option: https://github.com/BurntSushi/xsv 🙂
ferris 1
👀 2
rust 2
🦀 1
d
• You can check columns cheaply because you only need to read 1 line
• You can check filesize from the filesystem cheaply (see the sketch below)
• You could estimate the rows from a sample filesize, but also report the filesize
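A minimal stdlib sketch of the first two checks, assuming a local, comma-delimited file (`companies.csv` is just the example file name used later in the thread):

```python
import csv
import os

path = "companies.csv"

# File size: filesystem metadata only, nothing from the file body is read
size_bytes = os.path.getsize(path)

# Column count: read only the header line
with open(path, newline="") as f:
    n_columns = len(next(csv.reader(f)))

print(size_bytes, n_columns)
```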
j
VisiData is quite famous too https://www.visidata.org/
🔥 2
n
xsv looks good! Maybe too much overhead, and it makes the dependency management more complicated though.
Btw, I think if we are adding dependencies, it shouldn’t go into `kedro-datasets`. It may just stay in `kedro-viz` and be monkeypatched.
> You could estimate the rows from a sample filesize, but also report the filesize
This is nice! Should be very fast to just read a few rows and divide the filesize by the average row size. Trickier if the data are strings and heterogeneous.
👍 1
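A hedged sketch of that row estimate (hypothetical `estimate_csv_rows` helper; as noted above, accuracy drops when rows are heterogeneous):

```python
import os

def estimate_csv_rows(path, sample_size=1000):
    """Estimate rows by averaging the byte length of the first few lines
    and dividing the remaining file size by that average."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline()
        sample = [line for line in (f.readline() for _ in range(sample_size)) if line]
    if not sample:
        return 0
    avg_row_bytes = sum(len(line) for line in sample) / len(sample)
    return round((file_size - len(header)) / avg_row_bytes)

print(estimate_csv_rows("companies.csv"))
```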
j
you can also try Polars 🐻‍❄️
python -c 'import polars as pl; print(pl.scan_csv("companies.csv").select(pl.count()).collect())' 
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 77096 │
└───────┘
(number of columns and file size left as an exercise to the reader)
😄 2
👍 2
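For the “exercise to the reader” part, a quick sketch of columns and file size with the same lazy approach (assuming the Polars version used above; newer releases rename `pl.count()` to `pl.len()`):

```python
import os
import polars as pl

path = "companies.csv"
lf = pl.scan_csv(path)                   # lazy: only the schema is inferred from the start of the file

print(len(lf.columns))                   # number of columns
print(os.path.getsize(path))             # file size in bytes, straight from the filesystem
print(lf.select(pl.count()).collect())   # row count, as in the snippet above
```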
r
Is `scan_csv` similar to using pandas `read_csv` with `chunksize`? I am not sure. We want to get the file metadata (local file / remote file) without loading the file into memory.
n
> you can also try Polars 🐻‍❄️
@Juan Luis I was waiting for your Polars solution
😊 2
j
`scan_csv` does not load the file, only creates a lazy representation. It's more similar to how `dask.dataframe` works.
👍 1
`pl.scan_csv(...).select(pl.count())` creates the query plan, and `collect()` is almost instantaneous in this case
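To make the query-plan point concrete, a small sketch (assuming a Polars version that has `LazyFrame.explain()`):

```python
import polars as pl

lf = pl.scan_csv("companies.csv").select(pl.count())
print(lf.explain())   # prints the optimised query plan; nothing has been scanned yet
print(lf.collect())   # executes the plan, scanning only what it needs
```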
n
Any idea how it works under the hood to achieve this? Not familiar with the `polars` codebase; I think these are all expressions and they get evaluated in a smart way.
j
I have no idea, to be honest! Maybe using Arrow memory mapping: https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/#arrow-awesomeness
👍🏼 1
r
Yeah, I saw this, which is similar to what you suggested about counting lines.
d
Whilst this is an interesting problem, I wouldn’t overindex on CSV - any real pipeline is using an adult fileformat
😂 1
j
yep, there's `scan_parquet`. No `scan_excel` though 😄
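A corresponding sketch with a hypothetical Parquet copy of the same data (Parquet footers carry per-row-group row counts, so the count stays cheap):

```python
import polars as pl

# companies.parquet is a hypothetical Parquet version of the example file
print(pl.scan_parquet("companies.parquet").select(pl.count()).collect())
```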
n
> adult fileformat
lol…
Do you mean Excel? lolsob
d
I think I mean Parquet, Delta and SQL Tables these days
m
I would say, go with Polars or DuckDB. Both will perform just fine and are extremely memory efficient. Also, they are lightweight when it comes to dependencies… Edit: oh, and I forgot to mention, they are fast too 😅. Been using Polars for a couple of weeks now and I will never switch back to pandas unless I really really really have to!
❤️ 2
🐻‍❄️ 1
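For the DuckDB route, a minimal sketch (assuming the `duckdb` Python package; the file name is the same example as above):

```python
import duckdb

# DuckDB streams the CSV rather than materialising it as a pandas DataFrame
rel = duckdb.sql("SELECT * FROM read_csv_auto('companies.csv')")
print(len(rel.columns))   # number of columns (the relation is lazy, only the schema is sniffed)
print(duckdb.sql("SELECT COUNT(*) FROM read_csv_auto('companies.csv')").fetchone()[0])  # row count
```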
i
> adult fileformat
> Do you mean Excel?
I think Excel counts as a senior citizen by now
🤣 3
d
Time to send it to the farm