Nok Lam Chan
07/21/2023, 3:32 PM
df.describe()
• Need to work in Windows and Linux so wc is not an option
• Need to be fast
• Bonus: is it possible to generalise this to Excel file types?
Marc Gris
07/21/2023, 3:40 PM
datajoely
07/21/2023, 3:45 PM
Juan Luis
07/21/2023, 3:47 PM
Nok Lam Chan
07/21/2023, 3:52 PM
kedro-datasets. It may just stay in kedro-viz and be monkeypatched
• You could estimate the rows from a sample filesize, but also report the filesize
This is nice! It should be very fast to read just a few rows and divide the file size by their average size. Trickier if the data are strings and heterogeneous.
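A minimal sketch of the estimate discussed above, using only the Python standard library so it behaves the same on Windows and Linux. The function names `estimate_rows` and `count_lines_exact` and the `sample_bytes` / `chunk_size` parameters are hypothetical, not part of kedro-viz or kedro-datasets. It assumes one row per line, so quoted fields containing newlines would throw both counts off.

```python
import os


def count_lines_exact(path, chunk_size=1024 * 1024):
    """Count lines exactly by scanning the file in binary chunks."""
    count = 0
    last_byte = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")  # b"\r\n" endings also contain one b"\n"
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        count += 1
    return count


def estimate_rows(path, sample_bytes=64 * 1024, has_header=True):
    """Return (estimated data rows, file size in bytes).

    Hypothetical helper: reads only the first `sample_bytes` of the file.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)

    if len(sample) >= file_size:
        # The sample covers the whole file, so the count is exact.
        lines = sample.count(b"\n")
        if sample and not sample.endswith(b"\n"):
            lines += 1
    else:
        # Use only the complete lines in the sample to get an average line size.
        end = sample.rfind(b"\n") + 1
        if end == 0:
            raise ValueError("sample_bytes is too small to contain a full line")
        avg_line_bytes = end / sample.count(b"\n")
        lines = round(file_size / avg_line_bytes)

    # Assumption: the first line is a header and is not a data row.
    rows = max(lines - 1, 0) if has_header else lines
    return rows, file_size
```

The estimate is only as good as the sample: with heterogeneous string columns the first rows may not be representative, which is why the exact chunked count is included as a fallback. Reporting the file size alongside the estimate costs nothing, since `os.path.getsize` is needed anyway.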
Juan Luis
07/21/2023, 4:02 PM
python -c 'import polars as pl; print(pl.scan_csv("companies.csv").select(pl.count()).collect())'
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 77096 │
└───────┘
(number of columns and file size left as an exercise to the reader)
Ravi Kumar Pilla
07/21/2023, 4:05 PM
Nok Lam Chan
07/21/2023, 4:07 PM
you can also try Polars 🐻❄️
@Juan Luis I was waiting for your Polars solution
Juan Luis
07/21/2023, 4:08 PM
scan_csv does not load the file, it only creates a lazy representation. It's more similar to how dask.dataframe works.
pl.scan_csv(...).select(pl.count()) creates the query plan, and collect() is almost instantaneous in this case.
Nok Lam Chan
07/21/2023, 4:11 PM
Juan Luis
07/21/2023, 4:13 PM
Nok Lam Chan
07/21/2023, 4:18 PM
np.memmap
https://stackoverflow.com/questions/64744161/best-way-to-find-out-number-of-rows-in-csv-without-loading-the-full-thing
Ravi Kumar Pilla
07/21/2023, 4:20 PM
datajoely
07/21/2023, 4:20 PM
Juan Luis
07/21/2023, 4:22 PM
scan_parquet. no scan_excel though 😄
Nok Lam Chan
07/21/2023, 4:34 PM
adult fileformat
lol…
datajoely
07/21/2023, 4:37 PM
Matthias Roels
07/21/2023, 6:24 PM
Iñigo Hidalgo
07/31/2023, 4:39 PM
> adult fileformat
Do you mean Excel?
I think Excel counts as a senior citizen by now
datajoely
07/31/2023, 4:45 PM