Storing data in a cross-language form #71

Open
martinfleis opened this issue Sep 19, 2024 · 19 comments

Comments

@martinfleis

Hi,

would you be keen on storing the data in some open formats alongside rda so we could link to it from Python? We have a geodatasets package that holds metadata and some tooling to cache the data locally, so if you included the data here as GeoJSON, CSV, GPKG, or whatever is needed, we could include it in geodatasets. That would allow easier access to the same data from R and Python and avoid the need to run R first to save the data in a form Python can read.

@Robinlovelace
Collaborator

Would be really useful to have cross-language datasets. Maybe a spDatapy or spDatax repo could be worthwhile, to avoid issues with CRAN.

@Nowosad
Owner

Nowosad commented Sep 19, 2024

@martinfleis what do you have in mind? Do you want to store the files in some Python package? Many of the datasets from spData are available in inst/shapes -- https://github.com/Nowosad/spData/tree/master/inst/shapes (although we plan to remove the shapefiles from there soon -- #62). Do you need any other dataset from spData as a file?

@martinfleis
Author

> Many of the datasets from spData are available in inst/shapes

Missed that! That is what I was looking for. If these links are considered stable, I would just include them in geodatasets for easy access from Python.
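
As a rough illustration of linking to such files from Python, the sketch below builds a raw-download URL for a file under inst/shapes and caches it locally on first use. The helper names (`raw_url`, `fetch_cached`) and the file name `example.gpkg` are hypothetical, and the `master` branch is taken from the repository link quoted above. This is not how geodatasets itself is implemented.

```python
from pathlib import Path
from urllib.request import urlretrieve


def raw_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build the raw-content URL GitHub serves for a file in a repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path.lstrip('/')}"


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Download url into cache_dir unless a file with that name is already there."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        urlretrieve(url, target)  # network access happens only on a cache miss
    return target


# Hypothetical file name; check inst/shapes for the files that actually exist.
url = raw_url("Nowosad", "spData", "master", "inst/shapes/example.gpkg")
```

The path returned by `fetch_cached` could then be passed to a GDAL-based reader such as `geopandas.read_file`.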

@Nowosad
Owner

Nowosad commented Sep 19, 2024

Yes, they are very stable (except the .shp files, which will be removed in about two months).

@Nowosad
Owner

Nowosad commented Sep 19, 2024

@martinfleis
Author

I have exposed the datasets that live in inst/shapes in geodatasets in geopandas/geodatasets#27. It is far from the complete list, but I believe the rest are not available as files and are instead generated in some form?

@Nowosad
Owner

Nowosad commented Sep 20, 2024

The rest of them are .rda objects -- do you want all of the datasets from the README available (except the one we discussed yesterday)? If so, I could just create another GH repo for that.

@martinfleis
Author

It would be nice for R and Python examples that depend on the same data to be independent of each other. The tiny snippet @Robinlovelace used during SDSL required running R before Python to load the file and dump it to disk before geopandas could read it. Having the data available directly would allow more freedom in what runs first and in what runs at all.

@Robinlovelace
Collaborator

+1 to increasing modularity and x-language compat (without having to depend on either for shared examples).

@Nowosad
Owner

Nowosad commented Sep 27, 2024

@martinfleis I took a look at the data available in R files -- they consist of spatial vector data, some raster data, a few tables, and also some graph data. Do you have any suggestions on the data formats you would prefer for each of the data types (e.g., vector -- gpkg, raster -- geotiff, etc)?

@martinfleis
Author

As long as GDAL can read it I don't really care.

@Nowosad
Owner

Nowosad commented Oct 2, 2024

@martinfleis take a look at https://github.com/Nowosad/spData_files and let me know what you think.

@rsbivand Roger, do you maybe also have any suggestions on how to store and share these datasets?

@martinfleis
Author

I think it will work for me if we don't touch it (so the shasum won't change).
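
The concern is that a downstream registry can pin a checksum per file, so any change to a file's bytes makes the pinned value fail. A minimal sketch of that kind of check, using only the standard library, is shown here. The function names are hypothetical, and the choice of SHA-256 is an assumption rather than a description of what geodatasets does internally.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_hash(path: Path, expected_hex: str) -> bool:
    """True if the file's digest equals the pinned value (case-insensitive)."""
    return sha256_of(path) == expected_hex.lower()
```

If a file in the repository were regenerated, even with identical contents from the user's point of view, a byte-level difference would make `matches_pinned_hash` return `False` until the pinned value was updated.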

@rsbivand
Contributor

rsbivand commented Oct 3, 2024

I agree that using file formats that have active GDAL drivers is sensible. For larger data sets, maybe SOzip, in GDAL versions that provide it? Otherwise, for vector data, zipped files may reduce bandwidth: r-spatial/sf#2433 and #62 (comment).

@Nowosad
Owner

Nowosad commented Oct 4, 2024

Thanks, Roger.

@martinfleis -- what do you think? Should I compress the gpkg files or keep them as they are?

@martinfleis
Author

@Nowosad given that the largest file in that repo is 905 KB, I don't think compression is worth it.
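
One way to sanity-check this kind of decision is to measure how much a file actually shrinks when zipped. The sketch below does that in memory with the standard library. The helper names and the default archive member name are hypothetical, and it uses plain DEFLATE zip rather than SOzip.

```python
import io
import zipfile
from pathlib import Path


def zipped_size(data: bytes, name: str = "file.gpkg") -> int:
    """Return the size in bytes of a DEFLATE zip archive holding data."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)
    return len(buffer.getvalue())


def compression_ratio(path: Path) -> float:
    """Zipped size divided by original size; values near 1.0 mean little gain."""
    data = path.read_bytes()
    if not data:
        return 1.0  # an empty file cannot get smaller
    return zipped_size(data, path.name) / len(data)
```

For files under a megabyte, even a favourable ratio saves little bandwidth in absolute terms.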

@rsbivand
Contributor

rsbivand commented Oct 4, 2024

Maybe also https://github.com/Nowosad/spDataLarge/tree/master/inst and the RDA files in https://github.com/Nowosad/spDataLarge/tree/master/data?

@Nowosad
Owner

Nowosad commented Oct 9, 2024

Good idea, @rsbivand -- I just added some additional files from spData and all of the files from spDataLarge to https://github.com/Nowosad/spData_files

Comments/suggestions are welcome.

@martinfleis
Author

This all looks fine to me. I'll wait a bit to see if there are any further comments and then expose these in geodatasets.


4 participants