Implementing Scalable Search on a NoSQL Backend

Search is easy to implement for a quick-and-dirty prototype, but hard to implement in a scalable way for production use. The patterns that are most convenient for prototyping are often the direct cause of scalability problems later in an application’s life cycle.

Simplistic search algorithms scan through all the documents and execute the query on each one. If that sounds like it can take a long time, that’s because it does. The key to making searches run efficiently is to use an index, which minimizes the number of documents that have to be examined for each query. To do that, you need to keep in mind which kinds of queries you want to support when you decide how to organize your data. The more structured and limited these queries are, the easier this will be.
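To make the difference concrete, here is a minimal sketch in plain JavaScript that is not tied to any particular database. The sample documents and the names `scanSearch`, `buildIndex`, and `indexSearch` are made up for illustration; the index is just a map from each word to the ids of the documents that contain it.

```javascript
// Hypothetical in-memory documents; a real app would keep these in a database.
const docs = [
  { id: 1, text: "parse makes search easy" },
  { id: 2, text: "indexes make queries fast" },
  { id: 3, text: "search needs indexes" }
];

// Naive approach: examine every document for every query.
function scanSearch(documents, word) {
  let examined = 0;
  const ids = [];
  for (const doc of documents) {
    examined++;
    if (doc.text.split(" ").includes(word)) {
      ids.push(doc.id);
    }
  }
  return { ids: ids, examined: examined };
}

// Indexed approach: build a word -> [ids] map once, up front.
function buildIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    // A Set avoids listing the same document twice for a repeated word.
    for (const word of new Set(doc.text.split(" "))) {
      if (!index.has(word)) {
        index.set(word, []);
      }
      index.get(word).push(doc.id);
    }
  }
  return index;
}

// Each query is then a single lookup that touches only matching documents.
function indexSearch(index, word) {
  const ids = index.get(word) || [];
  return { ids: ids, examined: ids.length };
}
```

With the three sample documents, `scanSearch(docs, "search")` examines all three, while `indexSearch(buildIndex(docs), "search")` touches only the two that match. The gap widens as the collection grows.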

One of the big advantages of using a service like Parse is that you don’t have to worry about managing your own database and maintaining indexes. We’ve built an abstraction that manages all of that complexity for you, so you can focus on your application’s unique features.

To organize your data model to support efficient searching, you’ll need to know a bit about how our systems operate behind the abstraction. You’ll need to build your data model in a way that makes it easy for us to build an index for the data you want to be searchable. For example, string matching queries that don’t match an exact prefix of the string won’t be able to use an index, so every document has to be scanned. This makes these types of queries very likely to fail with timeout errors as your app grows.
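The reason prefix matches are special comes down to how a sorted index works. The sketch below is a simplified illustration in plain JavaScript, not a description of how our backend is actually implemented; `sortedNames`, `lowerBound`, `prefixSearch`, and `substringSearch` are all made-up names.

```javascript
// Hypothetical index: the indexed string values, kept in sorted order.
const sortedNames = ["alice", "bob", "bobby", "carol", "robert"];

// Binary search for the first position whose value is >= target.
function lowerBound(sorted, target) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < target) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// Prefix match: jump to the start of the range, then read entries
// until the prefix stops matching. Only matching entries are visited.
function prefixSearch(sorted, prefix) {
  const matches = [];
  for (let i = lowerBound(sorted, prefix); i < sorted.length; i++) {
    if (!sorted[i].startsWith(prefix)) {
      break;
    }
    matches.push(sorted[i]);
  }
  return matches;
}

// Substring match: sort order does not help, so every entry is checked.
function substringSearch(sorted, fragment) {
  return sorted.filter(function (value) {
    return value.includes(fragment);
  });
}
```

All values that share a prefix sit next to each other in sorted order, which is why `prefixSearch` can stop early. Values that merely contain a fragment can be anywhere, so `substringSearch` has no choice but to look at everything.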

We’ve recently been adding features that make it easier to search your data efficiently, and we want to keep making that easier for our users. Let’s look at an example: say your app has users making posts, and you want to be able to search those posts for hashtags or particular keywords. You’ll want to pre-process your posts and save the list of hashtags and words into array fields. You can do this processing in your app before saving the posts, or you can add a Cloud Code hook to do it on the fly, leaving your app code unchanged.

Here’s an example Cloud Code hook to do this for posts:

var _ = require("underscore");

// Before each Post is saved, extract searchable words and hashtags
// from its text into array fields.
Parse.Cloud.beforeSave("Post", function(request, response) {
    var post = request.object;
    // Guard against posts saved without a text field.
    var text = (post.get("text") || "").toLowerCase();

    // Split on word boundaries, then keep only real words that are not stop words.
    var stopWords = ["the", "in", "and"];
    var words = _.filter(text.split(/\b/), function(w) {
        return /^\w+$/.test(w) && !_.contains(stopWords, w);
    });

    // match() returns null when there are no hashtags, so fall back to [].
    var hashtags = text.match(/#\w+/g) || [];

    post.set("words", words);
    post.set("hashtags", hashtags);
    response.success();
});

This saves your words and hashtags in array fields, which MongoDB will store with a multi-key index, so you can efficiently look them up using All queries. There are two important things to notice about the hook. First, it converts all words to lower case, so that lower-case queries give you case-insensitive matching. Second, it filters out common words like ‘the’, ‘in’, and ‘and’, which occur in a lot of posts, to reduce useless scanning of the index when executing queries.

For example, in iOS or OS X:

PFQuery *query = [PFQuery queryWithClassName:@"Post"];
[query whereKey:@"hashtags" containsAllObjectsInArray:@[@"#parse", @"#ftw"]];
NSArray *parseFTWPosts = [query findObjects];

Or using our REST API:

curl -v -X GET  \
    -H "X-Parse-Application-Id: ${APPLICATION_ID}" \
    -H "X-Parse-REST-API-Key: ${REST_API_KEY}" \
    -G \
    --data-urlencode 'where={"hashtags":{"$all":["#parse", "#ftw"]}}' \
    "https://api.parse.com/1/classes/Post"
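One thing to watch: the hook stores lower-cased words with stop words removed, so search terms should be normalized the same way before they are sent, or they will never match. Here is a minimal sketch, assuming the same three stop words as the hook above; `normalizeSearchTerms` is a hypothetical helper, not part of the Parse SDK.

```javascript
// Stop words assumed to mirror the list in the beforeSave hook.
const STOP_WORDS = ["the", "in", "and"];

// Hypothetical helper: turn raw user input into terms that can match the
// lower-cased "words" and "hashtags" arrays stored on each post.
function normalizeSearchTerms(input) {
  const lower = String(input || "").toLowerCase();
  const hashtags = lower.match(/#\w+/g) || [];
  // Design choice for this sketch: strip hashtags first so that "#parse"
  // is searched as a hashtag only, not also as the plain word "parse".
  const words = (lower.replace(/#\w+/g, " ").match(/\w+/g) || [])
    .filter(function (w) {
      return STOP_WORDS.indexOf(w) === -1;
    });
  return { words: words, hashtags: hashtags };
}
```

The resulting arrays can then be passed to an All query on the matching field, such as `containsAllObjectsInArray:` on iOS or `$all` over REST.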

Give it a go. Implementing search using these patterns will help make your apps run faster.

Brad Kittenbrink
March 19, 2013