Every-day Erlang: Handling Crashes in Erlang

Marcelo Gornstein wrote this on November 29, 2012 under dev, erlang.

Introduction

Hi :)

This post is about a nifty trick we use when we need to start a gen_server process with a start_link call while handling any errors gracefully (i.e. avoiding the propagation of a crash to the supervisors). Here's the exact situation:

  • You need to start_link a gen_server from a process of your own that is not a supervisor.

  • The gen_server in question does not offer an alternative start function in its API that you could use to start it and then link to it yourself, so you can only use start_link.

  • For your own requirements, it doesn't matter if the gen_server does not start (i.e. it crashes while starting). You don't want to propagate the crash; you want to keep trying to start the server every couple of seconds.

Feel free to skip the sections you don't need (or skip the post completely and jump right into the example source code).

The Problem

As you may already know, when a process terminates, the Erlang virtual machine propagates an exit signal to all linked processes. By default, if the process terminated abnormally (i.e. with a reason other than normal), those linked processes will terminate as well. There's an in-depth explanation of this procedure in Appendix B.

This is one of the cornerstones of a "supervisor tree", and what start_link is all about. It is great, and it works like a charm.

But sometimes you are in a strange situation where you need to cheat a little bit, maybe due to odd requirements or APIs. In this case, we need to call start_link, but we don't want to propagate the crash up the supervisor hierarchy. We want to catch the error (or get the proper error result) instead.

For instance, let's say we have the following architecture:

top_sup -> worker_sup -> worker -> failing_server

So when failing_server fails to start, worker receives an exit signal and dies (since it started failing_server with a start_link call and is therefore linked to it), and this eventually brings down the whole supervisor tree.
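
You can see this propagation in a plain shell, with no supervisors involved (a minimal illustration; depending on timing, the pid returned by spawn_link may also be printed before the crash): linking to a process that exits abnormally takes the caller down with it.

1> spawn_link(fun() -> exit(bad_hair_day) end).
** exception exit: bad_hair_day

The shell's evaluator process was linked to the short-lived process, got its abnormal exit signal, and died; the shell then spawned a fresh evaluator and carried on.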

The Solution

What Doesn't Work

A call to start_link implies that the calling process (the one actually calling gen_server:start_link) is linked to the newly created process, and this means (quoting the Processes section of the Erlang Reference Manual):

When a process terminates, it will terminate with an exit reason as
explained in Process Termination above. This exit reason is emitted in an
exit signal to all linked processes.

And that's why things like these won't do the trick:

try
  {ok, Pid} = gen_server:start_link(...)
catch
  _:Error ->
    ...
end

A case expression won't work either:

case gen_server:start_link(...) of
  {ok, Pid} ->
    ...;
  _ ->
    ...
end

Since start_link is a synchronous procedure, the exit signal will arrive at your process before it returns a value, and an exit signal is not an exception, so neither construct gets a chance to handle it. You have to use erlang:process_flag/2 to trap it.
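
To make that concrete, here is a minimal sketch of what trapping exits buys us (the function name try_to_start/0 is only for illustration; it is not part of the modules shown later). With trap_exit set, the abnormal exit coming from the linked child arrives as an {'EXIT', Pid, Reason} message in our mailbox instead of killing us, so we can deal with the {error, Reason} return value in peace:

try_to_start() ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% Since we're trapping exits, the exit signal (if any) is now a plain
  %% message we can receive, instead of a signal that kills us.
  receive
    {'EXIT', _Pid, Reason} ->
      io:format("trapped exit: ~p~n", [Reason])
  after 100 ->
      ok
  end,
  process_flag(trap_exit, false),
  Result.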

What Does Work

The solution is a mix of things:

  • Set worker as a transient child of your supervisor (worker_sup). In this way, crashes will not make worker_sup try to restart worker, so you don't hit any restart limits. Transient children are only restarted when they terminate with an exit reason other than normal, shutdown or {shutdown,Term}.

  • Have worker set the trap_exit process flag before trying to start the other gen_server (failing_server in the source code example). That way, when the offending gen_server fails on init, your worker can handle that by receiving a message, without propagating the exit signal.

  • When worker detects that failing_server can't be started, return ignore from init/1. This tells the supervisor not to restart the child while still keeping the child definition.

  • Before returning from worker:init/1, use timer:apply_after/4 to call a function in your worker_sup module (say worker_sup:restart_child/0) that retries the whole operation from scratch. This can be repeated indefinitely.

The ignore atom is described in the supervisor manual page:

The start function can also return ignore if the child process for some
reason cannot be started, in which case the child specification will be kept
by the supervisor (unless it is a temporary child) but the non-existing
child process will be ignored.

Making the supervisor keep the child definition is useful, so we can later call supervisor:restart_child/2:

Tells the supervisor SupRef to restart a child process corresponding to the
child specification identified by Id. The child specification must exist and
the corresponding child process must not be running.
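
In our setup that means that, once worker has returned ignore, restarting it boils down to a single call; this is exactly what worker_sup:restart_child/0 wraps in the code below (worker is the child id used in the child specification):

supervisor:restart_child(worker_sup, worker).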

Source Code

The source code for the complete solution is provided, with the following files:

  • worker.erl: The worker gen_server process that uses start_link to spawn a gen_server, in this case, failing_server.

  • failing_server.erl: A gen_server process started by the worker that will fail on init/1 with {stop, Reason}.

  • worker_sup.erl: The worker supervisor, which uses restart_child to restart the worker process after a crash.

  • top_sup.erl: The top-level supervisor; not that interesting, included just for the sake of completeness.

Code Overview

The failing_server

Here's the failing_server, which will always fail with a bad_hair_day error. The irrelevant parts have been stripped:

-module(failing_server).
-behaviour(gen_server).
...
start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  {stop, bad_hair_day}.
...

The worker

Here's the worker. Note how we set the trap_exit flag before calling failing_server:start_link/0, and clear it after it returns:

-module(worker).
-behaviour(gen_server).
...
init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% This receive gives the exit signal from failing_server (converted into
  %% an 'EXIT' message, since we are trapping exits) time to arrive while we
  %% are still trapping exits.
  receive after 25 -> ok end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.
...

The Worker Supervisor

-module(worker_sup).
-behaviour(supervisor).
...
start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

Trial Run

If worker simply did a failing_server:start_link(), note what would happen (i.e. how the shell pid changes):

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:09:45 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
** exception exit: shutdown
3>
=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,worker_sup}
     Context:    start_error
     Reason:     bad_hair_day
     ...

=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,top_sup}
     Context:    start_error
     Reason:     shutdown
     ...

3> self().
<0.47.0>

As you can see, the process that started top_sup (the shell) got linked to the new processes. When one of them crashed (failing_server), it took the shell down with it; the shell was restarted, and that's why it now has a different pid().

Now, with the real code in worker, setting the trap_exit flag and retrying after a couple of seconds:

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:20:52 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,kernel_safe_sup}
             started: [{pid,<0.47.0>},
                       {name,timer_server},
                       {mfargs,{timer,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,1000},
                       {child_type,worker}]

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,top_sup}
             started: [{pid,<0.44.0>},
                       {name,worker_sup},
                       {mfargs,{worker_sup,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,infinity},
                       {child_type,supervisor}]
{ok,<0.43.0>}
3>
=CRASH REPORT==== 24-Nov-2012::16:20:57 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.51.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

3> self().
<0.40.0>
4>
=CRASH REPORT==== 24-Nov-2012::16:21:02 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.55.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}
4> self().
<0.40.0>

The shell has the same pid(), the supervisors are running, and the system is still trying to start the failing_server process from time to time.

Thanks

This article would not have been possible without the help (and patience) of my pal and personal Erlang guru Fernando "El brujo" Benavides. Thanks, man! :)

Author

Marcelo Gornstein marcelo@inakanetworks.com

Github: marcelog

Homepage: http://marcelog.github.com

Appendix A: Source Code

failing_server.erl

This is the gen_server that won't start; it returns {stop, Reason} from init/1:

-module(failing_server).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, code_change/3, terminate/2]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  {stop, bad_hair_day}.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker.erl

This module tries to start failing_server by directly invoking start_link. It also handles any errors by trapping the exit signal and scheduling a restart for itself:

-module(worker).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([
  init/1, code_change/3, terminate/2,
  handle_call/3, handle_cast/2, handle_info/2
]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% This receive gives the exit signal from failing_server (converted into
  %% an 'EXIT' message, since we are trapping exits) time to arrive while we
  %% are still trapping exits.
  receive after 25 -> ok end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker_sup.erl

The worker supervisor. The only special things to look at here are that the worker process is defined as transient and that the restart_child/0 function ends up calling supervisor:restart_child/2:

-module(worker_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).
-export([restart_child/0]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

top_sup.erl

The top level supervisor, nothing special here.

-module(top_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
  {ok, {{one_for_one, 5, 10}, [
    {
      worker_sup, {worker_sup, start_link, []}, permanent,
      infinity, supervisor, [worker_sup]
    }
  ]}}.

Appendix B: Start and Link Procedure in Detail

When you call gen_server:start_link, a new process will be created, and init/1 will be invoked. Let's first see how, by peeking at lib/stdlib/gen_server.erl:

start_link(Name, Mod, Args, Options) ->
    gen:start(?MODULE, link, Name, Mod, Args, Options).

Going forward, let's now look into gen:start, in lib/stdlib/gen.erl:

start(GenMod, LinkP, Mod, Args, Options) ->
    do_spawn(GenMod, LinkP, Mod, Args, Options).
...
%%-----------------------------------------------------------------
%% Spawn the process (and link) maybe at another node.
%% If spawn without link, set parent to ourselves 'self'!!!
%%-----------------------------------------------------------------
do_spawn(GenMod, link, Mod, Args, Options) ->
    Time = timeout(Options),
    proc_lib:start_link(?MODULE, init_it,
      [GenMod, self(), self(), Mod, Args, Options],
      Time,
      spawn_opts(Options));

Note the call to proc_lib:start_link: it will spawn a new process, link self() to it, and call the gen:init_it function:

init_it(GenMod, Starter, Parent, Mod, Args, Options) ->
    init_it2(GenMod, Starter, Parent, self(), Mod, Args, Options).

init_it(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    case name_register(Name) of
        true ->
            init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options);
        {false, Pid} ->
            proc_lib:init_ack(Starter, {error, {already_started, Pid}})
    end.

init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    GenMod:init_it(Starter, Parent, Name, Mod, Args, Options).

So in the end (for a gen_server), gen_server:init_it is called, because of the content of the GenMod variable (as we saw above, it is ?MODULE, i.e. gen_server, when called from gen_server:start_link):

case catch Mod:init(Args) of
  ...
  {stop, Reason} ->
    %% For consistency, we must make sure that the
    %% registered name (if any) is unregistered before
    %% the parent process is notified about the failure.
    %% (Otherwise, the parent process could get
    %% an 'already_started' error if it immediately
    %% tried starting the process again.)
    unregister_name(Name0),
    proc_lib:init_ack(Starter, {error, Reason}),
    exit(Reason);
  ...

proc_lib:init_ack will actually notify "the parent" of the error. The parent is the one that called start_link (as we can see in lib/stdlib/proc_lib.erl):

init_ack(Parent, Return) ->
    Parent ! {ack, self(), Return},
    ok.

If init finishes with something like {stop, _}, exit is called, and this will definitely propagate a crash: the parent is not a supervisor, it is a linked process, and since the start_link operation is synchronous, the exit signal reaches it right away; unless it traps exits, it dies as well.