inaka

Latest blog entries

/
SpellingCI: No more spelling mistakes in your markdown flies!

Feb 14 2017 : Felipe Ripoll

/
Fast reverse geocoding with offline-geocoder

Do you need a blazing fast reverse geocoder? Enter offline-geocoder!

Jan 18 2017 : Roberto Romero

/
Using Jayme to connect to the new MongooseIM REST services

MongooseIM has RESTful services!! Here I show how you can use them in an iOS application.

Dec 13 2016 : Sergio Abraham

/
20 Questions, or Maybe a Few More

20 Questions, or Maybe a Few More

Nov 16 2016 : Stephanie Goldner

/
The Power of Meeting People

Because conferences and meetups are not just about the technical stuff.

Nov 01 2016 : Pablo Villar

/
Finding the right partner for your app build

Sharing some light on how it is to partner with us.

Oct 27 2016 : Inaka

/
Just Play my Sound

How to easily play a sound in Android

Oct 25 2016 : Giaquinta Emiliano

/
Opening our Guidelines to the World

We're publishing our work guidelines for the world to see.

Oct 13 2016 : Brujo Benavides

/
Using NIFs: the easy way

Using niffy to simplify working with NIFs on Erlang

Oct 05 2016 : Hernan Rivas Acosta

/
Function Naming In Swift 3

How to write clear function signatures, yet expressive, while following Swift 3 API design guidelines.

Sep 16 2016 : Pablo Villar

/
Jenkins automated tests for Rails

How to automatically trigger rails tests with a Jenkins job

Sep 14 2016 : Demian Sciessere

/
Erlang REST Server Stack

A description of our usual stack for building REST servers in Erlang

Sep 06 2016 : Brujo Benavides

/
Replacing JSON when talking to Erlang

Using Erlang's External Term Format

Aug 17 2016 : Hernan Rivas Acosta

/
Gadget + Lewis = Android Lint CI

Integrating our Android linter with Github's pull requests

Aug 04 2016 : Fernando Ramirez and Euen Lopez

/
Passwordless login with phoenix

Introducing how to implement passwordless login with phoenix framework

Jul 27 2016 : Thiago Borges

/
Beam Olympics

Our newest game to test your Beam Skills

Jul 14 2016 : Brujo Benavides

/
Otec

Three Open Source Projects, one App

Jun 28 2016 : Andrés Gerace

/
CredoCI

Running credo checks for elixir code on your github pull requests

Jun 16 2016 : Alejandro Mataloni

/
Thoughts on rebar3

Thoughts on rebar3

Jun 08 2016 : Hernán Rivas Acosta

/
7 Heuristics for Development

What we've learned from Hernán Wilkinson at our Tech Day

May 31 2016 : Brujo Benavides

/
See all Inaka's blog posts >>

/
Sorting by popularity like Reddit with Ruby, PostgreSQL and Elastic, part 1

A photo of Flavio Granero wrote this on March 25, 2015 under dev, elastic, postgres, ruby .

As example, let's consider a simple application where users can post pictures, and these pictures can receive both positive votes (upvotes) and negative votes (downvotes). You can think of it as a smaller version of Imgur, but keeping in mind that it still needs to deal with a large number of new posts in short time periods. One of the main features of this application would be a main page displaying the pictures ranking ordered by popularity.

Simple sorting by Score

An easy form to accomplish this would be calculating the score of each post by the upvotes and downvotes difference, and order the items with this value.

But as we know, that simplist kind of score based only on positive and negative votes presents distortions in the ranking, because downvotes have a lot of weight in the classification.

Back to our example, even if we apply the suggested Wilson Score formula to obtain a more reliable sort, we won't solve the issue completely.

That happens because our application has a very well known problem handled by social aggregators as Hacker News, Reddit ,Imgur, Stumbleupon, Delicious among others receiving a lot of posts every day. Applying the Wilson Formula in our example would create the following situation: old but well voted postings would always be in the top positions, taking a long time for new images to bypass their score. Besides that, with more and more images been published, it would make it even more difficult for users to find the recent posts, because all the initial positions would be filled by old images.

To have a constantly updated list of popular images is important to keep users interested in our content, returning to check what are the new "hot" items.

Introducing the Hot Score

How can we keep a list of best-rated items and at the same time allow new posts to rank better? The solution relies on considering a value other than score when sorting: the posting age. What we want is to give more weight for votes in recent posts, and then keep reducing this weight while the posting is getting older. For this, we need an algorithm derived from an Exponential Decay formula, where a constant makes the score fall over time. With different constant values we can find variations, as shown in this score versus time chart:

Exponential Decay Chart Source: wikipedia

An Amir Salihefendic article titled How Reddit Ranking works shows through python code and mathematical formulas how to build a Hot Score algorithm. This technology is not unique to Reddit, as mentioned above all the modern social aggregators need to rank high volumes of new items using some hot ranking variation, so they adapt it to their own needs, considering other factors when calculating the score. Our goal with this text is not to show these variations, but to show how we can build an upvotes, dowvotes and age based ranking for a Ruby (on Rails) application. In the chart below we see how is score expected to change over time:

reddit_score_time Source: amix.dk/blog

Hot Score with PostgreSQL

We recently found a [clux user repository on github] (https://github.com/clux/decay) with some famous ranking algorithms implemented in javascript, including the "redditHot", that we're trying to reproduce here. As an option, we could just convert this function to Ruby, but better than that, Reddit offers its own ranking functions as SQL functions in their repository on Github. We're choosing the database function to use the performance gain we get from using PostgreSQL.

To introduce this function to our database, in our Rails application we'll create the migration:

class AddHotScoreFunction < ActiveRecord::Migration
  def up
    execute <<-SQL
      create or replace function
        hot_score(ups integer, downs integer, date timestamp with time zone)
        returns numeric as $$
        select round(cast(log(greatest(abs($1 - $2), 1)) * sign($1 - $2) +
          (date_part('epoch', $3) - 1134028003) / 45000.0 as numeric), 7)
      $$ language sql immutable;
    SQL
  end

  def down
    execute "DROP FUNCTION IF EXISTS hot_score(integer, integer, timestamp);"
  end
end

Then we have a function that applies a logarithmic scale to the score (upvotes - downvotes) summing a value proportional to age of the item, which is calculated from the creation date in the last parameter. A more detailed description of this mathematical formula is seen in "Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed!" published on the Moz blog.

As an advantage of having a SQL function, we are able to calculate a Post Hot Score with database "select" query. In a rails model, it's possible to write an [ActiveRecord scope] (http://guides.rubyonrails.org/active_record_querying.html#scopes):

# app/models/post.rb
class Post < ActiveRecord::Base
  belongs_to :user
  #...
  scope :ranking, -> { select("id, image_url, user_id, created_at,
   hot_score(up_score, down_score, created_at) as hot_score").
   order_by("hot_score desc") }
  #...
end

# getting the first 10 items in the ranking
Post.ranking.limit(10)

Of course this is a quick solution, but it is also what we call a "not scalable" solution. Once the number of records in "posts" table begins to grow, we will suffer a significant performance drop. This is expected because the database needs to calculate the hot_score() for every item before returning a sorted list. Let's take a look at the query analysis using EXPLAIN and ANALYSE commands:

EXPLAIN ANALYSE SELECT  id, hot_score(up_score, down_score, created_at) 
  as hot_score FROM "posts"   ORDER BY hot_score desc LIMIT 10;

                        QUERY PLAN
----------------------------------------------------------------------
Limit  (cost=2.37..2.38 rows=5 width=32) (actual time=0.874..0.876 rows=5 loops=1)
  ->  Sort  (cost=2.37..2.38 rows=5 width=32) (actual time=0.872..0.872 rows=5 loops=1)
     Sort Key: (hot_score(up_score, down_score, (created_at)::timestamp with time zone))
     Sort Method: quicksort  Memory: 25kB
     ->  Seq Scan on cards  (cost=0.00..2.31 rows=5 width=32) (actual time=0.722..0.820 rows=5 loops=1)
Total runtime: 0.933 ms

As our test database is still small (only 6 records), we have a satisfactory response time. But we expect that Seq Scan will become increasingly expensive as the table grows. So in the next part of this blog post we show how to create a cache to hot_score() and also solve the ranking pagination issue introducing Elastic (past known as ElasticSearch) to the app.

Conclusion

Applying ranking algorithms to create lists sorted by popularity is crucial for websites working as social aggregators. This text has presented an example of how to build this sort of ranking using a SQL function published as "open source" by Reddit. In the next post you'll learn how to make this solution scalable and how to use Elastic features to paginate the ranked items.

A photo of

Flavio Granero

Full Stack Developer