Riak Search

May 21, 2010 at 11:30 AM | categories: Search, Riak

This post is going to start by explaining how in-the-trenches experience with key/value stores, like Riak, led to the creation of Riak Search. Then it will tell you why you care, what you'll get out of Riak Search, and why it's worth waiting for.

A bit of history

Few people know that Basho used to develop applications for deployment on Salesforce.com. We had big goals, and were thinking big to fill them, and part of that was choosing a data storage system that would give us what we needed not only to succeed and grow, but to survive -- a confluence of pragmatism and ideal that embodied a bulletproof operations story, a path upward -- resilience, reliability, and scalability, through the use of proven science.

So, that's what we did: we developed and used what has grown to be, and what you know today, as Riak.

Idealism can't get you everywhere, though. While we answered hard questions with link-walking and map/reduce, there was still the desire in the back of all of our heads: sometimes you just want to ask, “What emails were sent on May 21 that included the word ‘strategy’?” without having to figure out how to walk links from an organizational chart to mailboxes to mails, and then filter over the data there. It was a pragmatic desire: we just wanted a quick answer in order to decide whether or not to spend more time chasing a path. “Less yak-shaving, please.”

The Operations Story

Then we stopped making Salesforce.com apps, and started selling Riak. We quickly found the same set of desires. Operationally, Riak is a huge win. Pragmatically, something that does indexing and search in a similar operational manner is even bigger. Thus, Riak Search was born.

The operational story is, in a nutshell, this: when you add another node to your cluster, you add capacity and compute power. That's it, you just add another box and “it just works.” Purposefully or not, eventually a node leaves the cluster, hardware fails, whatever: Riak deals with it. If the node comes back, it's absorbed like it never left.

We insisted on these qualities for Riak, and have continued that insistence in Riak Search. We did it with all the familiar bits: consistent hashing, hinted handoff, replication, etc.

Why Riak Search?

Now, we'll be the first to tell you that with Riak you can get pretty far using link-walking and map/reduce, with the understanding that you know what you are going to want ahead of time, and/or are willing to wait for it.

Riak Search answers questions that pop into your head; “find me all the blue dresses that are between $20 and $30 dollars,” “find me the document Bob referred to last week at the TPS procedures meeting,” “how can I delete all these emails from my aunt that have those stupid attachments,?” “find me that comic strip with Bob,” etc.

It's about making members of the sea of data in your key-value store findable. At a higher level, it's about agility. The ability to answer questions you have about your business and your customers without having to consult a developer or dig through reference manuals and without your application developers having to reinvent the wheel with a very real possibility of doing it just right enough to assure you nothing will go wrong. It's about a common indexing language.

Okay, now you know — thanks for bearing with us — let's get to the technical bits.

Riak Search …

The system we have built …

  1. is an easily-scalable, fault-tolerant search and indexing system, adhering to the operational story you just read
  2. supports full-text indexing and search
  3. allows querying via the Lucene query syntax
  4. has Solr-compatible /select and /update web-services
  5. supports date and numeric indexing
  6. supports faceting
  7. automatically distributes indexes
  8. has an intermediate query language and integrated query planner
  9. supports scoring
  10. has integrated tokenizing, filtering and analysis (yes, you can use StandardAnalyzer!)

… and much more. Sounds pretty great, right?

If you want to know more about the internals and technical nitty gritty, check out the Riak Search presentation one of our own, Riak Search engineer John Muellerleile, gave at the San Francisco Erlang Factory this year.

So, why don't you have it yet? The easy part.

There are still some stubs and hard-coded things in what we have. For instance, the full-text analyzer in use is just whitespace, case-normalization, and stop-word filtering. We intend to fully support the ability to specify other Lucene analyzers, including custom modules, but the code isn't there yet.

There is also very little documentation. Without a little bit of handholding, even the brightest and most ambitious user could be forgiven for staring blankly, lost for even the first question to ask. We're spreading the knowledge among our own team right now; that process will generate the artifacts needed for the next round of users to step in.

There are also many fiddly, finicky bits. These are largely relics of early iterations. Rather than having the interwebs be flooded with, “How do you stop this thing?” (as it was with Riak), we're going to make things friendlier.

So, why don't you have it yet? The not-so-easy part.

You've probably asked yourself, “What of integration of Riak and Riak Search?” We have many notes from discussions about how it could or should be done, as well as code showing how it can be done. But, we're not completely satisfied with any of our implementations so far.

There are certainly no shortage of designs and ideas on how this could or should work, so we're going to make a final pass at refining all of our ideas, given our current Riak Search system to play with, so that we can provide a solid, extensible system, instead of one that with many rough edges that would almost certainly be replaced immediately.

Furthering this sentiment is that we think that our existing map/reduce framework and the functionality and features provided by Riak Search are a true power combo when used together intelligently, than simply as alternatives, or at worse, at odds. As a result, we're defining exactly how Riak Search indexing and querying should be threaded into Riak map/reduce processing to bring you a combination that is undoubtedly more than the sum of its parts.

We could tease you with specifics, like generating the set of bucket/key inputs to a map phase by performing a Riak Search query, or parameterizing Search phases with map results; though, for now, amidst protest both internally -- we're chomping at the bit to get this out into the world and into your hands -- and externally, as our favorite people continually request this exact set of technology and features, we're going to implement the few extra details from our refined notes before forcing it on you all.

Hold on just a little longer. :)

-the Riak Search Team



Great Customers Make Great Products

December 15, 2009 at 09:15 PM | categories: Startups, Riak, Search

At Basho, we're a very lucky bunch of people.

The first paying customer for the Enterprise edition of Riak was Mochi Media. They first rolled it out on a low-profile system. I won't lie to you by saying that everything was perfect; any interesting piece of software will show its warts the first time it is placed into an unforeseen environment. The people at Mochi were open, communicative, and interested in learning more about how Riak worked. As they became comfortable with its behavior operationally and under load, they put it into bigger and more business-critical applications. Those first experiences with Mochi helped us to reprioritize our early to-do list to best serve others like them, and we also gained a valuable advisor.

We've signed a few great customers since then, and the latest such is Collecta. Long before they started using our product inside their search service we were having great conversations with them about their use of Webmachine and other topics. Jack already told the story about the genesis of Riak Search better than I can, so you should read his account of it. Both the ideas and the very concrete problem set posed by Collecta gave John exactly what was needed to focus the project and deliver something really cool.

An important part of building a business around open source software is that it isn't just customers that care about the quality of your work. In addition to the valuable experiences we have had with customers, direct open source contributions have made a big difference as well. Since shortly after we first released Riak under the Apache 2 license, we have received valuable improvements in code, documentation, and ideas from people around the world. A few of those have become customers since then, but we are also deeply grateful to all of the others. We hope for Riak to thrive in various open source contexts, and it is because of those people that we are hopeful.

We're working this week with a potential new customer that wants to store and process terabytes of data in a single Riak cluster, and in a very visible public-facing system. Just like with Mochi and Collecta, the fact that we are talking to some very smart, capable people makes our job a pleasure even in the face of interesting new challenges -- perhaps especially then.

We really are lucky to work with such excellent people.

Thank you to the customers of Basho and everyone else who has helped us so far on the road of continuous improvement.

Justin



Collecta Chooses Riak Search

December 15, 2009 at 08:15 AM | categories: Search, Riak

Another big announcement for the team here at Basho: Collecta , which makes a truly cool real-time streaming search engine, has chosen to use Riak Search. They are longtime Webmachine users and when they learned about Riak, they partnered with us to define Riak Search and validate the prototype.

Look for a blog post later in the day from Justin Sheehy on what it was like to work with Collecta.(Hint: it was awesome!)

Here is a link to the press release: http://basho.com/news.html

Stay tuned.

Mark


Basho Podcast Number 1 - Justin Sheehy and Tony Falco on Scaling out with Riak and Riak Search.

December 11, 2009 at 09:10 AM | categories: , Search, Riak, Map/Reduce, Database, Scaling, Podcast, Distributed Systems

Just out: Basho’s first podcast discussing Riak. Justin Sheehy and Tony Falco revisit the definition of scalability Justin first discussed at NoSQL East 2009, discuss EC2, Riak, and Riak’s map/reduce and soon-to-be-released distributed search and indexing. As a special bonus, at 3:24 in the podcast, listen for the sound of Kevin Smith’s SMS accepting the job at Basho. The mic did not pick up Justin’s grimace. Of course, he didn’t miss a beat. "I just did, Bob...."

Enjoy,

Mark Phillips



Right click here to download the Podcast