Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should the vectorized geo_to_h3 stay as unstable? #205

Open
thomasaarholt opened this issue Nov 9, 2021 · 7 comments
Open

Should the vectorized geo_to_h3 stay as unstable? #205

thomasaarholt opened this issue Nov 9, 2021 · 7 comments

Comments

@thomasaarholt
Copy link

#147 introduced vectorized support for geo_to_h3, which sped things up by about 50% over a python for loop. It's currently under unstable, which raises a warning. It seems that geo_to_h3 gives accurate results, and that its API won't retroactively break in the future. @ajfriend do you agree?

Can I suggest that the functionality gets moved to api instead of unstable? To me, the warning from unstable implied that it may not give correct results, which does not seem to be the case.

@ajfriend
Copy link
Contributor

ajfriend commented Nov 9, 2021

Maybe we should update the warning, as unstable was meant to imply that the API will probably change in the future. (Happy to accept a PR on this, btw!)

Also, we're planning on a 4.0 release when the core library also moves to 4.0, so many of the function names will change anyway and break backwards compatibility. For 4.0, I would like to settle on how exactly we'd like to implement the vectorized functions. For example, it'd be nice if we could have them work like numpy functions and dispatch correctly to both vector and scalar arguments. (Also happy to accept PRs to play around in the unstable namespace to try out different approaches for this.)

@kylebarron
Copy link
Contributor

For 4.0, I would like to settle on how exactly we'd like to implement the vectorized functions.

I think 4.0 is coming up pretty soon 😄 . I'd love to get more of the H3 API to have a vectorized counterpart. My current thoughts on this are:

  • Add more .pxd files to prevent duplicating Cython wrapper code for vectorized functions. The existing approach for the few vectorized functions calls into h3lib directly.

    c = deg2coord(lat[i], lng[i])
    out[i] = h3lib.geoToH3(&c, res)
    This isn't horrible for simple functions like geo_to_h3, but it's not ideal for other functions that require more Cython wrapper code around h3lib. Adding .pxd definition files (that define headers for our "exported" cpdef functions) for our .pyx implementation files will allow us to import those .pyx files from other Cython files. (This would also be relevant for Example using cimport h3 in external code for speed #152).

    So instead of calling h3lib directly, we'd call our own internal wrapper functions:

     from .geo import geo_to_h3
     
     ...
     with nogil:
         for i in range(len(lat)):
             out[i] = geo_to_h3(lat[i], lng[i])
  • Manually defining vectorized functions is unavoidable. Unlike the API functions file in Python, the vectorized functions must be defined in Cython. I assume function reuse won't be possible and these functions will just have to be copied to the vectorized API.

  • Focus on supporting scalar and 1-D numpy data. Pure numpy ufuncs can be used on data of any dimension, but I think the extra complexity of trying to support more than 1D input might not be worth it. I.e. h3_to_geo with 2D input would require 3D output.

  • Add vectorized functions where easiest, first. Functions with uniform input and output dimension are easiest to support. A vectorized h3_to_geo_boundary would be harder because I assume the number of tuples returned per h3 cell is not constant.

  • Don't shy away from depending on Numpy headers if necessary. If the vectorized Cython code would be a lot simpler to implement by importing Numpy Cython code, we should embrace that. (I'm not currently sure whether depending on Numpy would make things easier). There's surely a way to modify CMakeLists.txt to compile the vectorized code only if Numpy headers are available, no? And Numpy will easily be available on CI

@dfellis
Copy link
Collaborator

dfellis commented Apr 11, 2022

* Don't shy away from depending on Numpy headers if necessary. If the vectorized Cython code would be a lot simpler to implement by importing Numpy Cython code, we should embrace that. (I'm not currently sure whether depending on Numpy would make things easier). There's surely a way to modify `CMakeLists.txt` to compile the vectorized code only if Numpy headers are available, no? And Numpy will easily be available on CI

This last point has the potential issue of requiring anyone installing h3-py the "normal" way to have numpy installed (unless there's an "optional dependency" concept in Python, too?) It feels like it might break things for users in memory-constrained environments or that have some requirement to continue using an older numpy, etc.

Not sure how big of a problem that is, but one way to avoid it would be to have this repo publish two packages, one that's "pure" h3-py and another that includes the vectorization/numpy/etc stuff?

@kylebarron
Copy link
Contributor

This last point has the potential issue of requiring anyone installing h3-py the "normal" way to have numpy installed (unless there's an "optional dependency" concept in Python, too?)

I don't think so. Not sure if by "normal" you mean via wheels or from source.

  • From wheels: I'm pretty sure that a wheel compiled against Numpy headers would not require Numpy at runtime if that file isn't imported. It would be essentially the same as our current numpy_int API where Numpy is an optional dependency, and so Numpy is only required if you import h3.api.numpy_int.
  • From source: I'm less familiar with scikit-build, but in a "normal" Cython setup it would be possible to only compile the Numpy Cython files if Numpy headers can be found at compile time. I.e. in my previous experience, in the setup.py file, you would import numpy and find the path to numpy's headers, then pass a list of Cython modules to ext_modules. It wouldn't be hard to simply remove a file from what's passed to ext_modules if numpy couldn't be imported. I assume scikit-build has similar functionality.

@kylebarron
Copy link
Contributor

  • I'm less familiar with scikit-build

From a cursory glance at the scikit-build docs it looks like it would be simple to update this CMakeLists.txt to conditionally add a Cython module only if Numpy was found at build time:

find_package(NumPy MODULE)

...

if (NumPy_FOUND)
    add_cython_file(unstable_vect)
endif()

@dfellis
Copy link
Collaborator

dfellis commented Apr 11, 2022

Got it, though Python being even less strict on optional dependencies than Node is surprising.

@rfc-hv
Copy link

rfc-hv commented Nov 29, 2022

it looks like the v4.0.0b2 pre release has removed unstable.vect.geo_to_h3. the scalar version of the function has been renamed from geo_to_h3 to latlng_to_cell. unclear if there are vectorised versions anywhere, or if this is outside v4.0 scope

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants