# questions

Nok Lam Chan

07/21/2023, 3:32 PM
[Not a Kedro Question] What’s the most efficient way to find out some basic statistics of a CSV (file size, number of rows and columns)? Requirements:
• Memory efficient (cannot load the full CSV into a dataframe and do `df.describe()`)
• Needs to work on Windows and Linux, so `wc` is not an option
• Needs to be fast
• Bonus: is it possible to generalise this to Excel files?
👍 1

Marc Gris

07/21/2023, 3:40 PM
You may enjoy this option: https://github.com/BurntSushi/xsv 🙂
ferris 1
👀 2
rust 2
🦀 1

datajoely

07/21/2023, 3:45 PM
• You can check the columns cheaply because you only need to read 1 line
• You can check the file size cheaply from the filesystem
• You could estimate the row count from a sample and the file size, and also report the file size itself
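A minimal standard-library sketch of the first two points (the `data.csv` path is just a placeholder, not a file from the thread):

```python
import csv
import os

def quick_csv_stats(path: str) -> dict:
    """Cheap CSV metadata: file size from the filesystem, columns from the header only."""
    size_bytes = os.path.getsize(path)            # no need to open the file for this
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))              # read a single line to get the columns
    return {"size_bytes": size_bytes, "n_columns": len(header), "columns": header}

print(quick_csv_stats("data.csv"))
```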

Juan Luis

07/21/2023, 3:47 PM
VisiData is quite famous too https://www.visidata.org/
🔥 2

Nok Lam Chan

07/21/2023, 3:52 PM
xsv looks good! Maybe too much overhead, though, and it makes the dependency management more complicated.
Btw, I think if we are adding dependencies, they shouldn’t go into `kedro-datasets`. They could just stay in `kedro-viz` and be monkeypatched.
> You could estimate the row count from a sample and the file size, and also report the file size itself
This is nice! It should be very fast to just read a few rows, take the average row size, and divide the file size by it. Trickier if the data are strings and heterogeneous.
👍 1
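A rough sketch of that estimate, assuming a local file and the same hypothetical `data.csv`: sample the first few rows, take their average length in bytes, and divide the remaining file size by it. As noted above, the estimate degrades when rows vary a lot in length.

```python
import os

def estimate_row_count(path: str, sample_lines: int = 1000) -> int:
    """Estimate rows by dividing file size by the average byte length of sampled rows."""
    size_bytes = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline()                     # skip the header row
        sample = [line for line in (f.readline() for _ in range(sample_lines)) if line]
    if not sample:
        return 0
    avg_row_bytes = sum(len(line) for line in sample) / len(sample)
    return round((size_bytes - len(header)) / avg_row_bytes)

print(estimate_row_count("data.csv"))
```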

Juan Luis

07/21/2023, 4:02 PM
you can also try Polars 🐻‍❄️
```
python -c 'import polars as pl; print(pl.scan_csv("companies.csv").select(pl.count()).collect())'
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 77096 │
└───────┘
```
(number of columns and file size left as an exercise to the reader)
😄 2
👍 2
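One possible way to fill in the "exercise", combining Polars with the standard library (reusing the `companies.csv` name from the snippet above; `LazyFrame.columns` resolves only the schema, not the data, though newer Polars versions prefer `collect_schema()`):

```python
import os
import polars as pl

lf = pl.scan_csv("companies.csv")             # lazy: nothing is loaded yet
print("columns:", len(lf.columns))            # schema resolution reads only the header
print("bytes:", os.path.getsize("companies.csv"))
```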

Ravi Kumar Pilla

07/21/2023, 4:05 PM
Is `scan_csv` similar to using pandas `read_csv` with `chunksize`? I am not sure. We want to get the file metadata (local file/remote file) without loading the file into memory.

Nok Lam Chan

07/21/2023, 4:07 PM
> you can also try Polars 🐻‍❄️
@Juan Luis I was waiting for your Polars solution
😊 2

Juan Luis

07/21/2023, 4:08 PM
`scan_csv` does not load the file, only creates a lazy representation. it's more similar to how `dask.dataframe` works
👍 1
`pl.scan_csv(...).select(pl.count())` creates the query plan, and `collect()` is almost instantaneous in this case
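A small sketch of inspecting that lazy query plan, assuming a reasonably recent Polars (`LazyFrame.explain()`; older releases exposed `describe_optimized_plan()` instead, and newer ones prefer `pl.len()` over `pl.count()`):

```python
import polars as pl

query = pl.scan_csv("companies.csv").select(pl.count())
print(query.explain())    # prints the optimized plan without touching the data
print(query.collect())    # only now does the (minimal) scan actually run
```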

Nok Lam Chan

07/21/2023, 4:11 PM
Any idea how it works under the hood to achieve this? I'm not familiar with the `polars` codebase; I think these are all expressions and they get evaluated in a smart way.

Juan Luis

07/21/2023, 4:13 PM
I have no idea to be honest! maybe using Arrow memory mapping https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/#arrow-awesomeness
👍🏼 1

Ravi Kumar Pilla

07/21/2023, 4:20 PM
Yeah, I saw this; it's similar to what you suggested about counting lines.

datajoely

07/21/2023, 4:20 PM
Whilst this is an interesting problem, I wouldn’t over-index on CSV - any real pipeline is using an adult file format
😂 1

Juan Luis

07/21/2023, 4:22 PM
yep, there's `scan_parquet`. no `scan_excel` though 😄
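For Parquet this kind of metadata is even cheaper, since row and column counts live in the file footer. A hedged sketch with pyarrow (the `companies.parquet` name is hypothetical, not from the thread):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("companies.parquet").metadata
print(meta.num_rows, meta.num_columns)    # read from the footer, no data pages touched
```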

Nok Lam Chan

07/21/2023, 4:34 PM
> adult file format
lol…
Do you mean Excel? lolsob

datajoely

07/21/2023, 4:37 PM
I think I mean Parquet, Delta and SQL Tables these days

Matthias Roels

07/21/2023, 6:24 PM
I would say, go with Polars or DuckDB. Both will perform just fine and are extremely memory efficient. Also, they are lightweight when it comes to dependencies… Edit: oh, and I forgot to mention, they are fast too 😅. I've been using Polars for a couple of weeks now and I will never switch back to pandas unless I really really really have to!
❤️ 2
🐻‍❄️ 1
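For completeness, a DuckDB sketch of the same row count (reusing the hypothetical `companies.csv`; the CSV scan is vectorized, so a plain `COUNT(*)` does not need the whole file in memory):

```python
import duckdb

print(duckdb.sql("SELECT COUNT(*) FROM read_csv_auto('companies.csv')"))
```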

Iñigo Hidalgo

07/31/2023, 4:39 PM
> adult file format
> Do you mean Excel?
I think Excel counts as a senior citizen by now
🤣 3

datajoely

07/31/2023, 4:45 PM
Time to send it to the farm