My Year of Riak

Chad DePue wrote this on August 25, 2011 under databases, dev, nosql, riak.

Startups often ask my opinion on databases for their new application. In the past year we've launched a big CouchDB-based application, and we've helped build stylesclub.com, a Riak-based Facebook app (not yet launched). We've also helped launch ming.ly, an Amazon SimpleDB-based application. Each of these has been an opportunity to further develop my philosophy about what makes a good database for a website or mobile app, and when to choose each one. I have a separate series coming on database tradeoffs, but since there isn't much information on Riak out in the wild yet, I'll start by profiling my thoughts on the database. These are my opinions after a bit more than a year of using it at Inaka.

Riak

Riak is a key/value store; an afternoon spent reading about the behind-the-scenes architecture of Amazon S3 is a helpful primer on how the database works. Basho's website (Basho is the company that built Riak) is remarkably opaque about explaining what's so great about it. However, if you've ever had to deal with the hassles of scaling MySQL or other data stores across multiple servers, you'll want to familiarize yourself with Riak. Those who know they need Riak have suffered the pain of scaling other databases, so they are a self-selecting group. Riak hasn't exactly gone mainstream yet, but it's the database all the cool kids are talking about, so it's good to know at least what makes it special.

Storing data

  • Data is stored in buckets of keys, just like Amazon S3.
  • Keys hold values, and values can be any type of data.
  • Keys have a content type, set when the data is PUT or POSTed into Riak.
  • JSON is the most common type of data stored, as it's easy to query with JavaScript, but there's nothing magical about JSON to Riak - it's all data!
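
To make that concrete, here's a minimal sketch of storing and fetching a JSON value over the HTTP interface. It assumes a local node on the default HTTP port (8098) and the /riak/<bucket>/<key> URL scheme; the bucket and key names are made up, and Python's third-party requests library stands in for whichever client you use.

    import json

    import requests  # third-party HTTP client

    BASE = "http://localhost:8098/riak"  # assumed local node, default port

    # PUT a JSON value under bucket "users", key "chad"; the Content-Type
    # header is stored with the object, as described above.
    user = {"name": "Chad", "city": "Buenos Aires"}
    requests.put(f"{BASE}/users/chad",
                 data=json.dumps(user),
                 headers={"Content-Type": "application/json"})

    # GET it back by key - key lookup is the one query Riak gives you for free.
    print(requests.get(f"{BASE}/users/chad").json())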

Servers crash, hard drives fail

Servers will fail; recovering from failure should be automatic, and data should not be lost during a failure. Riak achieves this by making copies of the data across different nodes. Replicas are created automatically when items are stored. When a node goes down, the cluster detects the failure and rebalances the data across the remaining nodes. The brilliance of Riak is that all the hassle of recovering from failure, and of adding new nodes as you need more storage, is absolutely painless. It Just Works. There are tools for adding and removing nodes, and they couldn't be simpler to use.
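
To illustrate, replication is controlled per bucket by the n_val property (it defaults to 3); here's a sketch of setting and reading it over HTTP, again assuming a local node and a made-up bucket name:

    import requests  # third-party HTTP client

    # Ask Riak to keep 3 copies of every object in the "sessions" bucket.
    requests.put("http://localhost:8098/riak/sessions",
                 json={"props": {"n_val": 3}})

    # Read the bucket properties back to confirm the replica count.
    props = requests.get("http://localhost:8098/riak/sessions").json()
    print(props["props"]["n_val"])  # 3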


A weed-eater vs a V-8

The minimum recommended configuration is three nodes, and you can add as many as you like. I've heard of clusters of up to 60 nodes, and I'm sure there are larger ones by now. Running a single Riak "node" is possible, but it's like running a one-cylinder four-stroke engine: it would run rough, if at all. The whole design of the cylinder and its valves is to work in concert with 3 or 5 or 7 others, each at a different point in the power cycle, together producing the syncopated, low, rhythmic rumble that signals the power under the hood. Riak is like a V-8: it's really designed to run as a cluster of nodes. Riak throughput goes up with additional nodes, and we have anecdotal evidence that response time also improves, up to a point, as you add nodes.

Accessing data and client libraries

Riak provides two protocols for accessing data: HTTP and Protocol Buffers, a Google-created format for structuring data. The Protocol Buffers interface is more efficient than the HTTP one.

Because each node communicates with all of the others, you can ask any one node for data. It's trivial to place an HTTP load balancer in front of the nodes to spare your clients the hassle of making round-robin requests themselves (some clients don't yet support round-robin requests to multiple nodes), but depending on your load balancer, it adds a potential single point of failure.
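
For clients that don't do it themselves, client-side round-robin is only a few lines of code and keeps the load balancer (and its single point of failure) out of the picture. A minimal sketch; the node addresses are made up, and any node can answer for any key:

    from itertools import cycle

    import requests  # third-party HTTP client

    # Rotate requests across the cluster; every node can serve every key.
    NODES = cycle([
        "http://riak1.example.com:8098",
        "http://riak2.example.com:8098",
        "http://riak3.example.com:8098",
    ])

    def get_object(bucket, key):
        """Fetch one object, spreading load across the nodes."""
        # A production version would also retry against the next node
        # when a connection fails.
        return requests.get(f"{next(NODES)}/riak/{bucket}/{key}")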

Clients are available in Ruby, Python, JavaScript, Erlang, and other languages. If you store your data as JSON-encoded documents, it's easy to interact with Riak from different languages.

Querying data

When you want to get data out, you need to query for it. If you know the key, it's easy: just make an HTTP request for the data and you're done. But what about real queries - aggregating data or selecting a set of records? There is currently only one way to do that in Riak, and that's Map/Reduce.

The easiest way to explain Map/Reduce is to say that it's like writing a simple piece of code to query a database, then running that query on all the rows of data, on all the servers where that data lives, and then collating the results. Think of Google's giant search index. It wouldn't be possible to build that index by bringing all the data of the billions of web pages to the server that builds it - the actual work of building the index must be completed close to where the data actually exists.

The classic example of Map/Reduce - found all over the web, including in Basho's own demo - is the word count; there's a CouchDB version, and Ilya Grigorik has written a good one as well.
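
Here's roughly what a word count looks like against Riak's /mapred endpoint: the map and reduce functions are JavaScript shipped as strings inside a JSON job description. A sketch assuming a local node, documents with a "text" field, and explicitly listed keys (bucket and key names are made up):

    import requests  # third-party HTTP client

    # Map phase: runs next to the data; emits one {word: count} dict per doc.
    MAP_JS = """
    function(value) {
      var words = Riak.mapValuesJson(value)[0].text.split(/\\s+/);
      var counts = {};
      for (var i = 0; i < words.length; i++)
        counts[words[i]] = (counts[words[i]] || 0) + 1;
      return [counts];
    }
    """

    # Reduce phase: merges the per-document tallies into one.
    REDUCE_JS = """
    function(values) {
      var totals = {};
      for (var i = 0; i < values.length; i++)
        for (var w in values[i])
          totals[w] = (totals[w] || 0) + values[i][w];
      return [totals];
    }
    """

    job = {
        # Explicit keys avoid the expensive whole-bucket key listing.
        "inputs": [["posts", "post1"], ["posts", "post2"]],
        "query": [
            {"map": {"language": "javascript", "source": MAP_JS}},
            {"reduce": {"language": "javascript", "source": REDUCE_JS}},
        ],
    }

    result = requests.post("http://localhost:8098/mapred", json=job)
    print(result.json())  # e.g. [{"riak": 2, "rocks": 1, ...}]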

So what do people actually use Riak for?

See Who is Using Riak for specific companies, but here's my short list:

  • For storing web session data that could grow indefinitely. Shopping carts that are always available.
  • Storing log data that could grow very large.
  • Write-heavy projects.
  • Documents where the schema between documents could be different.
  • When you absolutely can't have a database with a single point of failure.
  • Example: streaming video data to disk for later processing.
  • Example: storing streams of sensor data (see the key-design sketch below).
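
To make the sensor-stream example concrete: since Riak fetches by key only, a common trick is to encode the sensor id and a timestamp into the key itself, so later jobs know which keys to ask for. A sketch with made-up names:

    import json
    import time

    import requests  # third-party HTTP client

    def store_reading(sensor_id, value):
        """Write one reading under a key like "sensor42-20110825T120000"."""
        ts = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
        key = f"{sensor_id}-{ts}"
        requests.put(f"http://localhost:8098/riak/readings/{key}",
                     data=json.dumps({"sensor": sensor_id,
                                      "ts": ts,
                                      "value": value}),
                     headers={"Content-Type": "application/json"})
        return key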

When do I want to use something other than Riak?

  • If you will be performing SQL-style set operations, or your data is relational.
  • If you have budget constraints: to ensure redundancy across the nodes, the amount of data stored (and therefore the disk required) will be high.
  • If you don't like running your own servers, as there are currently no hosted Riak services that I would recommend.
  • If minimizing the latency of individual requests is a priority.
  • If you need to guarantee that any read of a key immediately sees the latest value; it can take a while for a write to reach every replica, and there are no transactions in Riak.

If I use Riak, I need to be comfortable with...

  • Complex queries in JavaScript that are significantly more difficult to write and debug than their SQL equivalents.
  • A potential tradeoff: possibly increased development complexity in exchange for massively decreased deployment complexity.
  • No practical ability to list keys, and therefore no equivalent of "select * from customers". (You can request the keys in a bucket, but it's currently an expensive operation that can block all other activity on the nodes; meaning, don't do it.)

How do I deal with things that need to be atomic, like queues and counters?

For every application Inaka has built to date, we use Riak alongside Redis. For caching, counters, quick set operations, and anything we would otherwise use Memcache for, we use Redis. For all the actual permanent data storage, we use Riak. This often creates a single point of failure at the database level, but we're almost always dealing with other single points of failure anyway, and you can use read slaves with Redis to mitigate it to some extent. Particularly if you're not using Redis for permanent storage, you can go a long way with two Redis servers and a Riak cluster. They're a great complement to each other.
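
A sketch of that split, with made-up names: Redis supplies the atomic counter, Riak keeps the durable record. Assumes local Redis and Riak instances and the redis-py and requests libraries.

    import json

    import redis     # redis-py client
    import requests  # third-party HTTP client

    r = redis.Redis(host="localhost", port=6379)

    def record_page_view(page_id, payload):
        # Atomic increment - exactly the operation Riak doesn't offer.
        views = r.incr(f"views:{page_id}")
        # The permanent copy of the data still goes to Riak.
        requests.put(f"http://localhost:8098/riak/pages/{page_id}",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
        return views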

Roadmap Features

Basho has demonstrated secondary indices, which allow querying across the database without writing a Map/Reduce job. I believe this will be a significant improvement to the product, though I'm not entirely convinced they have the right format for the query language yet - it feels a bit clumsy, with type definitions embedded in the HTTP query syntax.
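
To show what I mean by the type definitions, here's a sketch based on the syntax Basho has demonstrated for 1.0: indexes ride along as x-riak-index-* headers at write time, and the _bin suffix declares a binary (string) index. The details may change before release, and the names here are made up.

    import requests  # third-party HTTP client

    # Write an object and tag it with a secondary index entry.
    requests.put("http://localhost:8098/riak/users/chad",
                 data='{"email": "chad@example.com"}',
                 headers={"Content-Type": "application/json",
                          "x-riak-index-email_bin": "chad@example.com"})

    # Exact-match query on the index - note the type suffix in the URL.
    resp = requests.get("http://localhost:8098/buckets/users"
                        "/index/email_bin/chad@example.com")
    print(resp.json())  # {"keys": ["chad"]}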

Additionally, Basho has promised, down the road, a SQL-like syntax which would make interacting with the database much more powerful for the average developer. The roadmap looks bright and Basho is very responsive to community feature requests.

I talked with Mark Phillips of Basho, and he gave me the short list of 1.0 features coming this fall:

  • Secondary Indexing - as referenced above.
  • Lager - more traditional, Unix-friendly logging.
  • LevelDB Integration - a Google-created storage backend with different performance characteristics than the default backend, Bitcask. One thing I intentionally didn't discuss is Riak's pluggable datastores, since they're not that important for understanding the basics of the database; but since a new one is a roadmap item, I'll just say it's a great feature - you can use the default, the MySQL backend, or any of a number of stores, even Redis. LevelDB appears to have some important characteristics such as built-in compression, instant snapshotting, and more.
  • Riak Pipe - the ability to set up phases of Map/Reduce jobs in a 'pipeline'. It's in beta, and I've played around with it. The easiest way to explain it is that the Basho guys are thinking through how to make Map/Reduce less complex and more powerful, and this is the first step.
  • Search Integration - search used to be a separate install with a Java dependency. That dependency has been removed, and search is now a 'first-class feature' of Riak.

Final Thoughts

  • I only recommend Riak to clients that really understand their needs and can confidently walk through the list above. I have mixed feelings about recommending Riak today, because most people don't have the context to make the right decision. It's easy to pick Riak or another NoSQL store because you're worried about scale, but it's way, way more important to worry about your first 100 customers than your first million. Riak is catnip to sufferers of "Premature Database Optimization Syndrome" because it works - it really does allow near-linear scaling - but that's not usually the problem.
  • Map/Reduce databases are a brilliantly powerful way of turning database queries on their head, but currently Riak's can be overloaded with complex JavaScript. With a big Map/Reduce query over a lot of keys, you either need to get the keys into the engine, which means POSTing them (and thereby sending them over the wire), or perform some sort of "bucket filter", which triggers a "list-keys" operation (see above) and will not scale well in production. This narrows the window of acceptable Map/Reduce scenarios and forces painful workarounds, such as batching operations outside of Riak using job systems like Resque.
  • Hosting can be expensive. We've seen poor performance running Amazon small nodes, so we generally recommend running on AWS large boxes. Running three m1.large AWS boxes will cost around $730 USD/month, which means starting out with Riak is a more expensive proposition than with some other databases in a cloud environment; it is often worth considering dedicated hosting.
  • There are some commercial features available as well that I didn't mention. Probably the most important is site-to-site replication. This is not available in the open source version, but I assume it's the major draw for the big enterprise customers paying so far.
  • Net/net: I REALLY like Riak when I'm lying in bed at night, not worried about a few node failures.
  • Overall, if you know you need Riak, it's a joy to use, easy to scale, and a powerful tool in your database arsenal.