back in the saddle again

A tale of two APIs — Rust and Kotlin with MongoDB Async

Ilan Toren

--

I’ve written in the past about both Rust and Kotlin, but due to a change in the API for mongodb.rs (going from 1.2 to 2.1) I needed to review the code I had written, since the API used in the older version was no longer valid. Along the way, I decided to devote a bit of time to a particular use case that, while common, has some peculiarities due to the data source. I’ve been looking at the OpenFoodFacts dataset, which is both rich and very large (at least in terms of a local MongoDB instance). So I set a simple goal: look through the dataset and extract the records that originated from USDA data, as evidenced by ‘usda’ in the creator field and a URL back to the USDA.

The query is simply:
{ $match: { $and: [ { creator: { $regex: 'usda', $options: 'i' } }, { 'sources.url': { $exists: 1 } } ] } }

To make this query perform well on a collection of 2.1M records, the queried fields need to be in an index. Creating an index is fairly simple, especially from Compass (yes, I do recommend Compass).

Index creation window in Compass
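For reference, the same indexes can be created from mongosh; the collection name used here (products) is just a stand-in for the OFF collection:

db.products.createIndex({ "sources.url": 1 })
db.products.createIndex({ creator: 1 })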

Compass also makes it easier to visualize how an index can improve query performance. You can get the same type of output within mongosh:
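For example, with the same stand-in collection name:

db.products.explain("executionStats").aggregate([
  { $match: { $and: [
      { creator: { $regex: 'usda', $options: 'i' } },
      { 'sources.url': { $exists: 1 } }
  ] } }
])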

Normally you would only need the “winning plan”, but sometimes you can get critical information from the rejected plans, especially if the created indices are not quite right. In this case, the explain output shows that the index on sources.url is the critical one for the query and that the index on creator is unnecessary.

Other optimizations

Building an aggregation on a large local dataset with Compass can be frustrating. There are memory and time constraints in a local installation that might not be a factor when working from Atlas. On the other side of the coin, setting up a large cloud-based collection is an investment in resources that doesn’t have to be made for development. The answer, in this case, was to take a small slice of the data for query development. The final query is a set of stages: selecting records using the index, a projection to remove unwanted fields and lower the memory requirements, $unwind to flatten an array, and a final filter.
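As a sketch, the pipeline ends up looking roughly like this; the projected fields and the final null-URL filter are illustrative choices rather than the exact ones in the project:

[
  { $match: { $and: [
      { creator: { $regex: 'usda', $options: 'i' } },
      { 'sources.url': { $exists: 1 } }
  ] } },
  { $project: { code: 1, product_name: 1, creator: 1, sources: 1 } },
  { $unwind: '$sources' },
  { $match: { 'sources.url': { $ne: null } } }
]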

My goal with this query is to extract the records where there is a link back to the USDA for further analysis. There are of course many ways to approach this, but I decided to tack onto the task an evaluation of the async drivers for Kotlin and Rust.

Query done, what next?

A Rust example:

First, the Rust API has changed in two ways:

Instead of specifying the type to deserialize documents into in the form of
let collection: Collection<Usda> = db.collection_with_type("xxx");
it is now:
let collection: Collection<Usda> = db.collection::<Usda>("xxx");

This is a small syntax change, but in addition you can no longer run an aggregation that returns your typed documents, only a find, which is a more troubling change that will hopefully be fixed soon. Below is an example:
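Here is a minimal sketch of that shape, assuming a handful of illustrative fields on the Usda struct and a collection named usda (the full version is in the project linked at the bottom):

use futures::stream::TryStreamExt;
use mongodb::{bson::doc, Collection, Database};
use serde::{Deserialize, Serialize};
use std::fmt;

// Illustrative fields; the real struct mirrors the OFF documents.
#[derive(Debug, Serialize, Deserialize)]
struct Usda {
    code: Option<String>,
    product_name: Option<String>,
    creator: Option<String>,
}

// The struct doesn't know how to display itself, so implement Display.
impl fmt::Display for Usda {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{} | {}",
            self.code.as_deref().unwrap_or(""),
            self.product_name.as_deref().unwrap_or("")
        )
    }
}

async fn print_usda(db: &Database) -> mongodb::error::Result<()> {
    // The 2.x form: the type parameter goes on collection() itself.
    let collection: Collection<Usda> = db.collection::<Usda>("usda");

    let filter = doc! { "creator": { "$regex": "usda", "$options": "i" } };
    let mut cursor = collection.find(filter, None).await?;
    while let Some(record) = cursor.try_next().await? {
        println!("{}", record); // uses the Display impl above
    }
    Ok(())
}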

The creation of the connection isn’t shown, but I’ll provide a link to the entire project at the bottom. Here are the salient points:
I created the collection with Usda records from the entire OFF data dump (8% of the 2M+ records in the entire collection). The Usda object is built from a struct and uses serde behind the scenes to serialize/deserialize from the collection. Since the Usda struct doesn’t know how to display itself I added the Display trait: impl Display for Usda {}. This is the Rust way and is very similar to the Kotlin approach of using a data class. Another point of interest is the code section:
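In sketch form, it is a Result alias built around the driver’s error type:

// Shorthand so driver calls can use ? and bubble up mongodb errors
// instead of unwrap()-ing them away.
type Result<T> = std::result::Result<T, mongodb::error::Error>;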

The type Result<T> lets the code work with mongodb::error::Error directly; since it is an external struct you can’t merely tack on an impl Display of your own. Ignoring an error from a Result<T, E> with an unwrap can be a gotcha, since you won’t know an error occurred until it panics at runtime.

Creating a new collection from a subset of another collection looks as follows:
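The listing boils down to something like the sketch below; it reuses the Result<T> alias from above, and the collection names (off, usda) and projected fields are illustrative:

use futures::stream::TryStreamExt;
use mongodb::{
    bson::{doc, Document},
    options::FindOptions,
    Collection, Database,
};

async fn create_usda_set(db: &Database) -> Result<()> {
    let source: Collection<Document> = db.collection::<Document>("off");
    let target: Collection<Document> = db.collection::<Document>("usda");

    let filter = doc! {
        "creator": { "$regex": "usda", "$options": "i" },
        "sources.url": { "$exists": 1 }
    };
    // find_options could be None, but the projection trims the documents and
    // batch_size cuts down on round trips, so it is worth building.
    let find_options = FindOptions::builder()
        .projection(doc! { "code": 1, "product_name": 1, "creator": 1, "sources": 1 })
        .batch_size(1000)
        .build();

    let mut cursor = source.find(filter, find_options).await?;
    // persist is an async fn, so each call needs .await to resolve its future.
    while let Some(document) = cursor.try_next().await? {
        persist(target.clone(), document).await?;
    }
    Ok(())
}

async fn persist(target: Collection<Document>, document: Document) -> Result<()> {
    // The query guarantees sources exists, so unwrap() is sufficient here.
    let sources = document.get_array("sources").unwrap();
    // Take the first URL element; the second, when present, tends to be null.
    let url = sources
        .first()
        .and_then(|s| s.as_document())
        .map(|d| d.get_str("url").unwrap_or_default().to_string())
        .unwrap_or_default();

    // Illustrative choice: flatten the URL into a top-level field before inserting.
    let mut record = document.clone();
    record.insert("usda_url", url);

    // Wrap the document insert in tokio::spawn, analogous to launching a coroutine.
    tokio::spawn(async move { target.insert_one(record, None).await })
        .await
        .expect("insert task panicked")?;
    Ok(())
}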

Walking through the code, the following can be seen:
* The function create_usda_set takes the already established connection as a parameter.
* The query consists of a filter and find_options. The find_options is optional and could be replaced with None, but since the projection and batch_size improve performance it is worth building the find_options as shown.
* Note that persist takes the next document from the cursor, and since it is an async fn it needs .await to resolve its future.
* Since the URL is within the sources array, I first retrieve the array and then the first URL element (sometimes there are two, and the second one always seems to be null).
* unwrap_or_default is a safe way of getting a value from a Document, but where the query itself guarantees a valid result (e.g. sources), unwrap() is sufficient.

Finally, I wrapped the document insert with tokio::spawn. This is analogous to using a coroutine in Kotlin.

Kotlin — Born to perform

(a bit of hype never hurts)

The choice of Kotlin + the reactive MongoDB driver is pretty natural. The goal of the driver is to provide an asynchronous stream: without coroutines a Publisher, and with coroutines that output can be consumed as a Flow.

The set-up is simple:
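Something along these lines, with a local instance and illustrative database/collection names:

import com.mongodb.reactivestreams.client.MongoClients
import com.mongodb.reactivestreams.client.MongoCollection
import org.bson.Document

// Reactive-streams client; the database and collection names are illustrative.
val client = MongoClients.create("mongodb://localhost:27017")
val database = client.getDatabase("off")
val collection: MongoCollection<Document> = database.getCollection("products")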

Note that fun run() uses runBlocking, which keeps the main thread occupied until its coroutines complete. The launch {} block calls the functions within it as asynchronous functions, as in getOff:
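A minimal sketch of that structure (the real getOff is in the linked project; here it simply collects the matching documents as a Flow):

import com.mongodb.client.model.Filters
import com.mongodb.reactivestreams.client.MongoCollection
import kotlinx.coroutines.launch
import kotlinx.coroutines.reactive.asFlow
import kotlinx.coroutines.runBlocking
import org.bson.Document

fun run(collection: MongoCollection<Document>) = runBlocking {
    // runBlocking keeps the calling thread alive until the launched coroutines finish.
    launch {
        getOff(collection)
    }
}

suspend fun getOff(collection: MongoCollection<Document>) {
    val filter = Filters.and(
        Filters.regex("creator", "usda", "i"),
        Filters.exists("sources.url")
    )
    // The reactive driver returns a Publisher; asFlow() turns it into a coroutine Flow.
    collection.find(filter)
        .asFlow()
        .collect { doc -> println(doc.getString("product_name")) }
}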

The pattern with MongoDB apps is to put as much of the code as possible into the query. There is a cost to having more stages in an aggregate query, but it is far offset by other factors: maintainability and performance.
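For instance, the selection, projection, and flattening can all live in the pipeline, built with the driver’s helpers (field names illustrative, as before):

import com.mongodb.client.model.Aggregates
import com.mongodb.client.model.Filters
import com.mongodb.client.model.Projections
import com.mongodb.reactivestreams.client.MongoCollection
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.reactive.asFlow
import org.bson.BsonNull
import org.bson.Document

fun usdaRecords(collection: MongoCollection<Document>): Flow<Document> {
    val pipeline = listOf(
        Aggregates.match(
            Filters.and(
                Filters.regex("creator", "usda", "i"),
                Filters.exists("sources.url")
            )
        ),
        Aggregates.project(Projections.include("code", "product_name", "creator", "sources")),
        Aggregates.unwind("\$sources"),
        Aggregates.match(Filters.ne("sources.url", BsonNull.VALUE))
    )
    // The server does the heavy lifting; the client just consumes the resulting Flow.
    return collection.aggregate(pipeline).asFlow()
}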

Look at the entire project at kotlin-github and mongo-rust-async
