jellyfish 1.0 & The Transition to Rust

rust
jellyfish
python
Published June 27, 2023

Back in 2010, my colleague Michael created a Python library with C implementations of a bunch of common string distance algorithms. I remember wanting to call it strfry at one point but jellyfish won in the end. A year or so later I took over maintenance of the library and didn’t give it much thought for a while. I certainly didn’t expect it to become the most widely used library I currently maintain.

(It actually took a while to realize how popular it had become, since I don’t make a habit of checking download stats or anything similar. One sign was when I inherited the materials for one of the courses I’m now teaching and found that they used jellyfish.)

As more and more people use Python for record linkage tasks, the library has found lots of use. It helps that each of these algorithms has a C implementation with a Python fallback, which is great for speed but makes maintenance & packaging more of a pain.
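For readers who haven’t seen the "compiled implementation with a Python fallback" pattern, here is a minimal sketch of the general idea. The module name `_fast_strings` is hypothetical and is not jellyfish’s real layout; the Hamming function is just a small stand-in for any of the algorithms.

```python
def _py_hamming_distance(s1: str, s2: str) -> int:
    """Pure-Python fallback: differing positions plus the length difference."""
    shorter, longer = sorted((s1, s2), key=len)
    mismatches = sum(a != b for a, b in zip(shorter, longer))
    return (len(longer) - len(shorter)) + mismatches


try:
    # Prefer the compiled extension when it is available on this platform.
    # `_fast_strings` is a hypothetical module name used only for illustration.
    from _fast_strings import hamming_distance
    BACKEND = "c"
except ImportError:
    # No compiled build available: fall back to the slower pure-Python version.
    hamming_distance = _py_hamming_distance
    BACKEND = "python"
```

Callers only ever import `hamming_distance`, so they get the fast path when it exists and a working (if slower) function when it doesn’t. The cost is that two implementations have to be kept in sync.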

Over the years I’d written Go and Rust ports of the library. It was mainly an exercise to help me get familiar with those languages. I’d also had in the back of my mind that it’d be nice to stop maintaining the C code at some point, and maybe one of those would provide a path forward someday.

Enter PyO3 and maturin

I’d been aware of PyO3 for a while, and a few months ago took the time to read up on it enough to realize just how easy wrapping the existing Rust implementation of jellyfish would be.

I already knew it passed all the same tests, since in doing this work I’d parametrized the tests to run against all four implementations of the algorithms.
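The idea of running one set of test cases against several implementations can be sketched as below. This is not the actual jellyfish test suite: the two functions and the case table are made up for illustration, and a plain loop stands in for a test framework’s parametrize feature (such as `pytest.mark.parametrize`) to keep the example dependency-free.

```python
from typing import Callable, Dict, List, Tuple


def hamming_loop(s1: str, s2: str) -> int:
    """Hypothetical implementation A: explicit index loop."""
    distance = abs(len(s1) - len(s2))
    for i in range(min(len(s1), len(s2))):
        if s1[i] != s2[i]:
            distance += 1
    return distance


def hamming_zip(s1: str, s2: str) -> int:
    """Hypothetical implementation B: zip-based."""
    return abs(len(s1) - len(s2)) + sum(a != b for a, b in zip(s1, s2))


# Every implementation is registered once and checked against the same cases.
IMPLEMENTATIONS: Dict[str, Callable[[str, str], int]] = {
    "loop": hamming_loop,
    "zip": hamming_zip,
}

CASES: List[Tuple[str, str, int]] = [
    ("", "", 0),
    ("abc", "abc", 0),
    ("abc", "abd", 1),
    ("abc", "abcde", 2),
]


def run_shared_cases() -> List[str]:
    """Return a description of every (implementation, case) pair that fails."""
    failures = []
    for name, func in IMPLEMENTATIONS.items():
        for s1, s2, expected in CASES:
            got = func(s1, s2)
            if got != expected:
                failures.append(f"{name}: {s1!r} vs {s2!r} -> {got}, expected {expected}")
    return failures
```

An empty list from `run_shared_cases()` means every implementation agrees with the shared expectations, which is the property that makes swapping one backend for another low-risk.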

It was no more than a day or two’s worth of work to figure out the maturin build system, thanks to some helpful people on their forum who helped me understand best practices.

If you’re curious how it works, this file creates the Python bindings for the Rust functions. That, plus following maturin’s instructions for building packages, is all it took.

Performance

I knew that I’d likely be trading some speed for safety. I wanted to understand the tradeoff though, so I added benchmarks to jellyfish (you can check them out here: https://github.com/jamesturk/jellyfish/blob/main/benchmarks/compare.ipynb).

After a bit of optimization, the Rust versions of the algorithms mostly take between 0.8x-2.1x as much time as the C versions. Damerau-Levenshtein is the outlier at 2.7x, but is still ~25x faster than the Python version. Given that we’re talking about microseconds in most cases, I feel the change is well worth it.
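To make the "Nx as much time" figures concrete: each is the run time of one implementation divided by the run time of a baseline, so a value above 1 means slower than the baseline. The sketch below is not the code from the benchmark notebook; it compares two made-up pure-Python Levenshtein variants with the standard library’s `timeit`, and `relative_time` is a hypothetical helper.

```python
import timeit


def levenshtein_matrix(s1: str, s2: str) -> int:
    """Levenshtein distance using a full (len(s1)+1) x (len(s2)+1) table."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]


def levenshtein_two_rows(s1: str, s2: str) -> int:
    """Same distance, keeping only the previous and current table rows."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def relative_time(candidate, baseline, args, number=2000, repeat=5):
    """Return candidate_time / baseline_time, using the best of several runs."""
    t_candidate = min(timeit.repeat(lambda: candidate(*args), number=number, repeat=repeat))
    t_baseline = min(timeit.repeat(lambda: baseline(*args), number=number, repeat=repeat))
    return t_candidate / t_baseline


if __name__ == "__main__":
    ratio = relative_time(levenshtein_matrix, levenshtein_two_rows, ("kitten", "sitting"))
    print(f"matrix version takes {ratio:.2f}x as much time as the two-row version")
```

Taking the minimum over several repeats is a common way to reduce noise from whatever else the machine is doing. The exact ratio will vary by machine and input, so no expected number is shown.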

The switch also gave me the opportunity to fix a few inconsistencies between the C and Python versions, particularly in how they handled (or failed to handle) Unicode. This proper Unicode handling is a big win, making the performance difference feel even more justified.
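As a generic illustration of how Unicode handling can make two implementations disagree (not a reproduction of the specific inconsistencies fixed in jellyfish), consider comparing strings by UTF-8 byte versus by code point. The `hamming` helper below is a stand-in written for this example.

```python
def hamming(a, b) -> int:
    """Differing positions plus the length difference; works on str or bytes."""
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))


s1 = "caf\u00e9"  # "café" with a single precomposed é (U+00E9)
s2 = "cafe"

# Comparing code points: only the last character differs.
by_code_point = hamming(s1, s2)

# Comparing UTF-8 bytes: é is two bytes, so the byte strings also differ in length.
by_utf8_byte = hamming(s1.encode("utf-8"), s2.encode("utf-8"))

print(by_code_point, by_utf8_byte)  # prints "1 2"
```

A byte-oriented implementation and a character-oriented one can both be internally consistent and still return different answers for the same pair of strings, which is why settling on one well-defined behavior matters.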

Rust extensions are easier to maintain and distribute than C extensions. Many volunteers had helped improve the jellyfish build process over the years to provide binaries on PyPI, but it was still a pain to maintain. maturin is a joy to use by comparison. If you’ve maintained a C extension, or are interested in extending Python with Rust, I highly recommend it.

A neat side effect of the way the package works is that the Rust version of the library and the Python version of the library can share a codebase.

Version 1.0!

I put out Version 0.11 with the Rust backend a couple of months ago and the sky hasn’t fallen. I decided to wait a bit longer in case somehow this was a disaster, but it’s been smooth sailing. It also feels like it’ll be much easier to maintain going forward, which is a big win.

And with those nagging Unicode inconsistencies out of the way, it felt like this 13-year-old library could finally hit 1.0. Last week I pushed out jellyfish 1.0.0. I’m glad that it has been useful to so many people over the years, and I know this new update will help ensure consistency & stability going forward.

I’d also like to take the opportunity to say thanks to everyone who’s ever contributed to jellyfish or written about it. I’m glad it continues to be useful to so many people.