Every-day Erlang: Handling Crashes in Erlang

Marcelo Gornstein, Nov 29 2012

Introduction

Hi :)

This post is about a nifty trick we use when we need to start a gen_server process with a start_link call while handling any errors gracefully (i.e., without propagating a crash to the supervisors). Here's the exact situation:

  • You need to start_link a gen_server from your own non-supervisor process.

  • The gen_server in question does not offer an alternative start function in its API that you can use to start and then link your process to, so you can only use start_link.

  • For your own requirements, it doesn't matter if the gen_server does not start (i.e., it crashes while starting). You don't want to propagate the crash; you want to keep trying to start the server every couple of seconds.

Feel free to skip the sections you don't need (or skip the post completely and jump right into the example source code).

The Problem

As you may already know, when a process terminates, the Erlang virtual machine will propagate an exit signal to all linked processes. By default, if the process terminated abnormally (i.e., with a reason other than normal), these linked processes will terminate as well. There's an in-depth explanation of this procedure in Appendix B.

This is one of the cornerstones of a "supervisor tree", and what start_link is all about. It is great, and it works like a charm.

But sometimes, you are in a kind of strange situation where you need to cheat a little bit, maybe due to strange requirements or APIs. In this case, we need to call start_link, but we don't want to propagate the crash up in the supervisor hierarchy. We want to catch the error (or get the proper error result) instead.

For instance, let's say we have the following architecture:

top_sup -> worker_sup -> worker -> failing_server

When failing_server fails to start, worker receives an exit signal and dies, because the start_link call linked the two processes. This would eventually bring down the whole supervisor tree.

The Solution

What Doesn't Work

A call to start_link implies that the calling process (the one that's actually calling gen_server:start_link) is linked to the newly created process, and this means (quoting the Processes section of the Erlang reference manual):

When a process terminates, it will terminate with an exit reason as
explained in Process Termination above. This exit reason is emitted in an
exit signal to all linked processes.

And that's why things like these won't do the trick:

try
  {ok, Pid} = gen_server:start_link(...)
catch
  _:Error ->
    ...
end

case won't work either:

case gen_server:start_link(...) of
  {ok, Pid} ->
    ...
  _ ->
    ...
end

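Since start_link is synchronous and links the two processes before it returns, the exit signal reaches your process no matter how you inspect the return value. An exit signal is not an exception, so neither try nor case can intercept it. You have to use erlang:process_flag/2 to trap it.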
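To make the difference concrete, here is a minimal sketch (not part of the original example code) that links to a crashing process from a throwaway process, once with trap_exit set and once without. The module name trap_exit_demo and the function run/1 are made up for illustration.

```erlang
-module(trap_exit_demo).
-export([run/1]).

%% Runs the experiment in a throwaway, monitored process so the caller is
%% never linked to the crashing child. Trap is the trap_exit flag value.
run(Trap) ->
    Caller = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() ->
        process_flag(trap_exit, Trap),
        Child = spawn_link(fun() -> exit(bad_hair_day) end),
        %% The try/catch is here on purpose: it never sees the exit signal.
        Outcome = try
                      receive
                          {'EXIT', Child, Reason} -> {trapped, Reason}
                      after 500 -> nothing_received
                      end
                  catch
                      _:Error -> {caught, Error}
                  end,
        Caller ! {Ref, Outcome}
    end),
    receive
        {Ref, Result} ->
            %% The experiment survived and reported back.
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Pid, DownReason} ->
            %% The experiment was killed by the exit signal.
            {died, DownReason}
    end.
```

Calling trap_exit_demo:run(true) should return {trapped, bad_hair_day}, while trap_exit_demo:run(false) should return {died, bad_hair_day}: without the flag, the linked process is terminated and the catch clause never runs.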

What Does Work

The solution is a mix of things:

  • Set worker as a transient child of your supervisor (worker_sup). Transient children are only restarted when they terminate with an exit reason other than normal, shutdown or {shutdown,Term}, so a worker that bows out cleanly will not make worker_sup try to restart it, and you don't hit any restart limits.

  • Have worker set the trap_exit process flag before trying to start the other gen_server (failing_server in the source code example). So when the offending gen_server fails on init, your worker can handle that by receiving a message, without propagating the exit signal.

  • When worker detects that failing_server can't be started, return ignore from init/1. This will tell the supervisor to not restart the child, and also keep the child definition.

  • Before returning from worker:init/1 use timer:apply_after/4 to call a function in your worker_sup module (say worker_sup:restart_child/0) to retry the operation from scratch. This can be done an infinite number of times.
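The retry in the last step relies only on timer:apply_after/4, which calls Module:Function(Args...) once after the given number of milliseconds. As a small, self-contained sketch of that call in isolation (the module retry_demo and its notify/2 function are hypothetical stand-ins for worker_sup:restart_child/0):

```erlang
-module(retry_demo).
-export([run/0, notify/2]).

%% Stand-in for the function that would retry the operation.
notify(Pid, Ref) ->
    Pid ! {retry, Ref},
    ok.

%% Schedules notify/2 to run 100 ms from now and waits for its message.
run() ->
    Ref = make_ref(),
    {ok, _TRef} = timer:apply_after(100, ?MODULE, notify, [self(), Ref]),
    receive
        {retry, Ref} -> retried
    after 1000 ->
        timeout
    end.
```

Because the timer server makes the call from its own process, the scheduled function still runs after the process that scheduled it has returned ignore and gone away.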

The ignore atom is described in the supervisor manual page:

The start function can also return ignore if the child process for some
reason cannot be started, in which case the child specification will be kept
by the supervisor (unless it is a temporary child) but the non-existing
child process will be ignored.

Making the supervisor keep the child definition is useful, so we can later call supervisor:restart_child/2:

Tells the supervisor SupRef to restart a child process corresponding to the
child specification identified by Id. The child specification must exist and
the corresponding child process must not be running.
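The two quotes above can be checked with a tiny experiment. The following sketch is an assumption-laden illustration, not the article's code: ignore_demo is a made-up supervisor whose only child has a start function that always returns ignore.

```erlang
-module(ignore_demo).
-behaviour(supervisor).
-export([run/0, start_child/0, init/1]).

%% Hypothetical start function: the child "cannot be started".
start_child() ->
    ignore.

init([]) ->
    {ok, {{one_for_one, 5, 10}, [
        {child, {?MODULE, start_child, []}, transient, 2000, worker, [?MODULE]}
    ]}}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    %% The child specification is kept, with no pid attached to it.
    Children = supervisor:which_children(Sup),
    %% Restarting calls start_child/0 again, which ignores once more.
    Restart = supervisor:restart_child(Sup, child),
    %% Clean up the demo supervisor without affecting the caller.
    unlink(Sup),
    exit(Sup, kill),
    {Children, Restart}.
```

ignore_demo:run() should return {[{child,undefined,worker,[ignore_demo]}], {ok,undefined}}: the specification survives the ignore, and restart_child/2 can be called on it as many times as needed.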

Source Code

The source code for the complete solution is provided, with the following files:

  • worker.erl: The worker gen_server process that uses start_link to spawn a gen_server, in this case, failing_server.

  • failing_server.erl: A gen_server process started by the worker that will fail on init/1 with {stop, Reason}.

  • worker_sup.erl: The worker supervisor, that uses restart_child to restart the worker process on a crash.

  • top_sup.erl: Not that interesting, the top level supervisor, just for the sake of completeness.

Code Overview

The failing_server

Here's the failing_server, which will always fail with a bad_hair_day error. The irrelevant parts have been stripped:

-module(failing_server).
-behaviour(gen_server).
...
start_link() ->
   gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
   {stop, bad_hair_day}.
   ...

The worker

Here's the worker. Note how we set the trap_exit flag before calling failing_server:start_link/0, and clear it after it returns:

-module(worker).
-behaviour(gen_server).
...
init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% Wait briefly for the 'EXIT' message that arrives when failing_server
  %% crashes on init, and take it out of the process queue.
  receive
    {'EXIT', _From, _Reason} -> ok
  after 25 -> ok
  end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.
    ...

The Worker Supervisor

-module(worker_sup).
-behaviour(supervisor).
...
start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

Trial Run

If worker only called failing_server:start_link(), here is what would happen (note how the shell pid changes):

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:09:45 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
** exception exit: shutdown
3>
=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,worker_sup}
     Context:    start_error
     Reason:     bad_hair_day
     ...

=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,top_sup}
     Context:    start_error
     Reason:     shutdown
     ...

3> self().
<0.47.0>

As you can see, the process that started the top_sup (the shell) got linked to the new process(es). When one crashed (failing_server), it took the shell with it, which was restarted, and that's why our shell now has a different pid().

Now, with the real code included in worker, after handling the trap_exit flag and retrying after a couple of seconds:

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:20:52 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,kernel_safe_sup}
             started: [{pid,<0.47.0>},
                       {name,timer_server},
                       {mfargs,{timer,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,1000},
                       {child_type,worker}]

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,top_sup}
             started: [{pid,<0.44.0>},
                       {name,worker_sup},
                       {mfargs,{worker_sup,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,infinity},
                       {child_type,supervisor}]
{ok,<0.43.0>}
3>
=CRASH REPORT==== 24-Nov-2012::16:20:57 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.51.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

3> self().
<0.40.0>
4>
=CRASH REPORT==== 24-Nov-2012::16:21:02 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.55.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}
4> self().
<0.40.0>

The shell has the same pid(), the supervisors are running, and the system is still trying to start the failing_server process from time to time.

Thanks

This article would not have been possible without the help (and patience) of my pal and personal Erlang guru Fernando "El brujo" Benavides. Thanks, man! :)

Author

Marcelo Gornstein marcelo@inakanetworks.com

Github: marcelog

Homepage: http://marcelog.github.com

Appendix A: Source Code

failing_server.erl

This is the gen_server that won't start: it returns {stop, Reason} from init/1:

-module(failing_server).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, code_change/3, terminate/2]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  {stop, bad_hair_day}.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker.erl

This module tries to start failing_server by directly invoking start_link. It also handles any errors by trapping the exit signal, and schedules a restart for itself:

-module(worker).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([
  init/1, code_change/3, terminate/2,
  handle_call/3, handle_cast/2, handle_info/2
]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% Wait briefly for the 'EXIT' message that arrives when failing_server
  %% crashes on init, and take it out of the process queue.
  receive
    {'EXIT', _From, _Reason} -> ok
  after 25 -> ok
  end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker_sup.erl

The worker supervisor. The only special things to look at here are that the worker process is defined as transient, and the restart_child/0 function that ends up calling supervisor:restart_child/2:

-module(worker_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).
-export([restart_child/0]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

top_sup.erl

The top level supervisor, nothing special here.

-module(top_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {
      worker_sup, {worker_sup, start_link, []}, permanent,
      infinity, supervisor, [worker_sup]
    }
  ]}}.

Appendix B: Start and Link Procedure in Detail

When you call gen_server:start_link, a new process will be created, and init/1 will be invoked. Let's first see how, by peeking at lib/stdlib/gen_server.erl:

start_link(Name, Mod, Args, Options) ->
    gen:start(?MODULE, link, Name, Mod, Args, Options).

Next, let's look into gen:start, in lib/stdlib/gen.erl:

start(GenMod, LinkP, Mod, Args, Options) ->
    do_spawn(GenMod, LinkP, Mod, Args, Options).
...
%%-----------------------------------------------------------------
%% Spawn the process (and link) maybe at another node.
%% If spawn without link, set parent to ourselves 'self'!!!
%%-----------------------------------------------------------------
do_spawn(GenMod, link, Mod, Args, Options) ->
    Time = timeout(Options),
    proc_lib:start_link(?MODULE, init_it,
      [GenMod, self(), self(), Mod, Args, Options],
      Time,
      spawn_opts(Options));

Note the call to proc_lib:start_link. Note how it will (for sure) spawn a new process, link self() to it, and call the gen:init_it function:

init_it(GenMod, Starter, Parent, Mod, Args, Options) ->
    init_it2(GenMod, Starter, Parent, self(), Mod, Args, Options).

init_it(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    case name_register(Name) of
        true ->
            init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options);
        {false, Pid} ->
            proc_lib:init_ack(Starter, {error, {already_started, Pid}})
    end.

init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    GenMod:init_it(Starter, Parent, Name, Mod, Args, Options).

So in the end (for a gen_server), gen_server:init_it is called because of the content of the GenMod variable (as we saw above, it is ?MODULE when called from a gen_server):

case catch Mod:init(Args) of
  ...
  {stop, Reason} ->
    %% For consistency, we must make sure that the
    %% registered name (if any) is unregistered before
    %% the parent process is notified about the failure.
    %% (Otherwise, the parent process could get
    %% an 'already_started' error if it immediately
    %% tried starting the process again.)
    unregister_name(Name0),
    proc_lib:init_ack(Starter, {error, Reason}),
    exit(Reason);
  ...

proc_lib:init_ack will actually notify "the parent" of the error. The parent is the one that called start_link (as we can see in lib/stdlib/proc_lib.erl):

init_ack(Parent, Return) ->
    Parent ! {ack, self(), Return},
    ok.

If init finishes with something like {stop, _}, exit is called right after the ack is sent. This will definitely propagate a crash: the caller is not a supervisor and is not trapping exits, it is simply a linked process, so the exit signal that follows the ack message terminates it no matter what it does with the value returned by the synchronous start_link call.
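The same sequence can be reproduced without gen_server, using proc_lib directly. This is only a sketch that mimics the {stop, Reason} branch quoted above; the module ack_demo and its init/1 function are invented for the purpose, and the caller traps exits so that it survives to report what it saw.

```erlang
-module(ack_demo).
-export([run/0, init/1]).

%% Hypothetical init function that mimics the {stop, Reason} branch:
%% acknowledge the failure to the starter, then exit with the same reason.
init(Starter) ->
    proc_lib:init_ack(Starter, {error, bad_hair_day}),
    exit(bad_hair_day).

run() ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() ->
        %% Trap exits so the signal shows up as a message instead of
        %% terminating this process.
        process_flag(trap_exit, true),
        Started = proc_lib:start_link(?MODULE, init, [self()]),
        Signal = receive
                     {'EXIT', _Pid, Reason} -> {exit_signal, Reason}
                 after 1000 -> no_exit_signal
                 end,
        Caller ! {Ref, Started, Signal}
    end),
    receive
        {Ref, StartResult, ExitInfo} -> {StartResult, ExitInfo}
    after 3000 ->
        timeout
    end.
```

ack_demo:run() should return {{error,bad_hair_day}, {exit_signal,bad_hair_day}}, along with a crash report in the log: the starter gets the error tuple from the ack and then the exit signal from the link. A process that does not trap exits would be terminated by that second step.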