[Not a Kedro Question] - What’s the most efficient...
# questions
n
[Not a Kedro Question] - What’s the most efficient way to find out some basic statistics of a CSV (file size / number of rows and columns)? Requirements:
• Memory efficient (cannot load the full CSV into a dataframe and do `df.describe()`)
• Need to work on Windows and Linux, so `wc` is not an option
• Need to be fast
• Bonus: is it possible to generalise it to the Excel filetype?
👍 1
m
You may enjoy this option: https://github.com/BurntSushi/xsv 🙂
ferris 1
👀 2
rust 2
🦀 1
d
• You can check columns cheaply because you only need to read 1 line
• You can check filesize from the filesystem cheaply (see the sketch below)
• You could estimate the rows from a sample filesize, but also report the filesize
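A minimal stdlib sketch of the first two checks, assuming a local, comma-delimited file (`companies.csv` is just the example file name used later in the thread):

```python
import csv
import os

path = "companies.csv"

# File size: filesystem metadata only, nothing from the file body is read
size_bytes = os.path.getsize(path)

# Column count: read only the header line
with open(path, newline="") as f:
    n_columns = len(next(csv.reader(f)))

print(size_bytes, n_columns)
```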
j
VisiData is quite famous too https://www.visidata.org/
🔥 2
n
xsv looks good! Maybe too much overhead, and it makes the dependency management more complicated though.
Btw, I think if we are adding dependencies, it shouldn’t go into `kedro-datasets`. It may just stay in `kedro-viz` and be monkeypatched.
> You could estimate the rows from a sample filesize, but also report the filesize
This is nice! Should be very fast to just read a few rows and divide the filesize by the average row size. Trickier if the data are strings and heterogeneous.
👍 1
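A hedged sketch of that row estimate (hypothetical `estimate_csv_rows` helper; as noted above, accuracy drops when rows are heterogeneous):

```python
import os

def estimate_csv_rows(path, sample_size=1000):
    """Estimate rows by averaging the byte length of the first few lines
    and dividing the remaining file size by that average."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.readline()
        sample = [line for line in (f.readline() for _ in range(sample_size)) if line]
    if not sample:
        return 0
    avg_row_bytes = sum(len(line) for line in sample) / len(sample)
    return round((file_size - len(header)) / avg_row_bytes)

print(estimate_csv_rows("companies.csv"))
```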
j
you can also try Polars 🐻‍❄️
python -c 'import polars as pl; print(pl.scan_csv("companies.csv").select(pl.count()).collect())' 
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 77096 │
└───────┘
(number of columns and file size left as an exercise to the reader)
😄 2
👍 2
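For the “exercise to the reader” part, a quick sketch of columns and file size with the same lazy approach (assuming the Polars version used above; newer releases rename `pl.count()` to `pl.len()`):

```python
import os
import polars as pl

path = "companies.csv"
lf = pl.scan_csv(path)                   # lazy: only the schema is inferred from the start of the file

print(len(lf.columns))                   # number of columns
print(os.path.getsize(path))             # file size in bytes, straight from the filesystem
print(lf.select(pl.count()).collect())   # row count, as in the snippet above
```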
r
Is `scan_csv` similar to using pandas `read_csv` with `chunksize`? I am not sure. We want to get the file metadata (local file / remote file) without loading the file into memory.
n
> you can also try Polars 🐻‍❄️
@Juan Luis I was waiting for your Polars solution
😊 2
j
`scan_csv` does not load the file, only creates a lazy representation. It's more similar to how `dask.dataframe` works.
👍 1
`pl.scan_csv(...).select(pl.count())` creates the query plan, and `collect()` is almost instantaneous in this case
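To make the query-plan point concrete, a small sketch (assuming a Polars version that has `LazyFrame.explain()`):

```python
import polars as pl

lf = pl.scan_csv("companies.csv").select(pl.count())
print(lf.explain())   # prints the optimised query plan; nothing has been scanned yet
print(lf.collect())   # executes the plan, scanning only what it needs
```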
n
Any idea how it works under the hood to achieve this? Not familiar with the `polars` codebase; I think these are all expressions and they get evaluated in a smart way.
j
I have no idea, to be honest! Maybe using Arrow memory mapping: https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/#arrow-awesomeness
👍🏼 1
r
Yeah, I saw this, which is similar to what you suggested about counting lines.
d
Whilst this is an interesting problem, I wouldn’t overindex on CSV - any real pipeline is using an adult fileformat
😂 1
j
yep, there's `scan_parquet`. No `scan_excel` though 😄
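A corresponding sketch with a hypothetical Parquet copy of the same data (Parquet footers carry per-row-group row counts, so the count stays cheap):

```python
import polars as pl

# companies.parquet is a hypothetical Parquet version of the example file
print(pl.scan_parquet("companies.parquet").select(pl.count()).collect())
```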
n
> adult fileformat
lol…
Do you mean Excel? lolsob
d
I think I mean Parquet, Delta and SQL Tables these days
m
I would say, go with Polars or DuckDB. Both will perform just fine and are extremely memory efficient. Also, they are lightweight when it comes to dependencies… Edit: oh, and I forgot to mention, they are fast too 😅. Been using Polars for a couple of weeks now and I will never switch back to pandas unless I really really really have to!
❤️ 2
🐻‍❄️ 1
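For the DuckDB route, a minimal sketch (assuming the `duckdb` Python package; the file name is the same example as above):

```python
import duckdb

# DuckDB streams the CSV rather than materialising it as a pandas DataFrame
rel = duckdb.sql("SELECT * FROM read_csv_auto('companies.csv')")
print(len(rel.columns))   # number of columns (the relation is lazy, only the schema is sniffed)
print(duckdb.sql("SELECT COUNT(*) FROM read_csv_auto('companies.csv')").fetchone()[0])  # row count
```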
i
> adult fileformat
> Do you mean Excel?
I think Excel counts as a senior citizen by now
🤣 3
d
Time to send it to the farm