# plugins-integrations
j
Hi, currently kedro_datasets.geopandas only handles GeoJSON. But geopandas has methods for many other very useful data formats as well (e.g. GeoPackage, spatial databases, Apache Parquet and Feather files, and more). I'd like to start by working with GeoPackages. What would be the easiest way to get started? Should I start here: https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html?
j
hi @Joost Gevaert ! indeed, you’d need to create a custom dataset. or extend the geopandas one
👍 1
(btw, hope it’s compatible with GeoPandas 1.0! let us know otherwise)
👍 1
j
Yeah, GeoPackage is compatible with geopandas 1.0. GeoPackage is an awesome data format for GIS data: https://en.wikipedia.org/wiki/GeoPackage GeoPackage (GPKG) is an open, non-proprietary, platform-independent and standards-based data format for (3D) Vector and Raster GIS data
j
oh sorry I meant kedro_datasets being compatible with GeoPandas 1.0 😄
😄 1
any opinions on GeoPackage vs GeoArrow btw?
(or shapefiles 😬 )
j
Well, that http://switchfromshapefile.org/ website says all there is to say about shapefiles, right? haha I really don't like them
The output of my data pipelines often goes to people who use those results in ArcGIS Pro, so I'm trying to make sure it's as easy for them to work with my results as possible. Getting .geojson files into ArcGIS Pro is a little more complicated than getting .gpkg's in there. GeoArrow, I don't really know. So far I've not had the crazy big amounts of data yet for which it's necessary to use GeoArrow. To be honest, in general I'm not very familiar with all those columnar memory formats. What's the difference between arrow, feather, parquet? How would I get those results into ArcGIS Pro? Let me ask ChatGPT 🙂
Ah, .feather = .arrow? So I guess that gpd.GeoDataFrame.to_feather would work, but then I wouldn't (yet) know how I'd get that .feather file into an ArcGIS Pro project.
ChatGPT tells me that .feather files are easier to work with than .parquet files, and that .feather files are quicker, but less compatible. Would you agree?
j
"What's the difference between arrow, feather, parquet?"
let me know if this helps! https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a
❤️ 1
The Feather format was created alongside Arrow, and nowadays it provides decent compression (although Parquet files are usually smaller) and excellent read and write speeds (even better than Parquet). On the other hand, the Parquet format has much wider adoption and is more interoperable. If you are not sure which one is best and you're not concerned about squeezing the speed as much as possible, you can safely pick Parquet.
❤️ 1
(glad I could copy paste that from my past self 😄 )
😄 1
I think I actually meant GeoParquet originally, GeoArrow is probably too low level https://geoparquet.org/
j
The Demystifying Apache Arrow article definitely helped! Thanks
Regarding GeoParquet, it might come in handy once my data starts becoming too big to handle, but for now it's all doable with geopandas and GeoPackage. Hopefully I'll have a chance to also play with GeoParquet at some point in the near future, and will definitely let you know about my experience in case I do :)
💪🏼 1
y
Definitely something I am interested in, working with a bunch of spatial data these days. I think these are good candidates for "experimental" datasets given our recent policy change ;)
👍 1
j
anecdote: today I was reading a 300 MB GeoPackage with GeoPandas and it took 3 minutes to load on my computer (~800k rows). I saved it to GeoParquet and now it takes 30 MB, loads in 3 seconds, and contains exactly the same information 🤯
😯 1