inaka

Latest blog entries

/
Otec

Three Open Source Projects, one App

Jun 28 2016 : Andrés Gerace

/
CredoCI

Running credo checks for elixir code on your github pull requests

Jun 16 2016 : Alejandro Mataloni

/
Thoughts on rebar3

Thoughts on rebar3

Jun 08 2016 : Hernán Rivas Acosta

/
7 Heuristics for Development

What we've learned from Hernán Wilkinson at our Tech Day

May 31 2016 : Brujo Benavides

/
Introducing Dayron - The Elixir REST client you've always wanted

Meet our library to interact with RESTful APIs in your Elixir Applications.

May 24 2016 : Flavio Granero

/
Inaka's First Tech Day

Inaka's First Tech Day

May 20 2016 : Brujo Benavides

/
PictureViewMaster

The best image projector ever. Period.

May 18 2016 : The Gera

/
Meet Jayme

Your best friend when it comes to abstract server interconnections in Swift

May 09 2016 : Pablo Villar

/
What's in a Hackathon?

The benefits of Hackathons, and a summary of ours

May 01 2016 : Stephanie Goldner

/
How do we host our internal tools ...

Inaka internal hosting

Apr 14 2016 : Ignacio Mendizabal

/
Last Year Meetups at Inaka

Last Year Meetups at Inaka

Mar 01 2016 : David Cao

/
5 Critical Lessons of App Building

While most may think that the meat of building an app is in the development, we’ve found that a large part of a successful project is the work done up front.

Feb 29 2016 : Stephanie Goldner

/
Presenting Lewis, our own Android Lint extension

Rock your Android

Feb 15 2016 : Fernando Ramirez

/
KillerTask, the solution to AsyncTask implementation

Meet our new Android library written in Kotlin

Jan 25 2016 : Fernando Ramirez

/
Android development with Kotlin

How to set up and use Kotlin language in Android apps

Jan 15 2016 : Fernando Ramirez

/
Canillita (Your First Erlang Server) - V2

Learn Erlang by example creating a simple RESTful server

Jan 04 2016 : Harenson Henao

/
Dealing with Optionals in ViewControllers

The right way to declare your variables

Dec 18 2015 : Pablo Villar

/
Testing Distributed Apps with Common Test

Testing Distributed Apps with Common Test

Dec 11 2015 : Carlos Andres Bolaños

/
Erlang Serpents

We've built a serpents game in Erlang and we played with it

Nov 13 2015 : Brujo Benavides

/
Erlang Meta Testing Revisited

Meta-Testing in Erlang. Now it's much easier.

Nov 13 2015 : Brujo Benavides

/
See all Inaka's blog posts >>

/
Every-day Erlang: Handling Crashes in Erlang

Marcelo Gornstein wrote this on November 29, 2012 under dev, erlang .

Introduction

Hi :)

This post is about a nifty trick we use when we need to start a gen_server process with a start_link call, while simultaneously handling any errors gracefully (i.e: avoid propagating a crash to the supervisors). Here's the exact situation:

  • You need to start_link a gen_server from your own not-supervisor-process.

  • The gen_server in question does not offer an alternative start function in its API that you can use to start and then link your process to, so you can only use start_link.

  • For your own requirements, it doesn't matter if the gen_server does not start (i.e: it crashes while starting). You don't want to propagate the crash, but keep trying to start the server every couple of seconds.

Feel free to skip the sections you don't need (or skip the post completely and jump right into the example source code).

The Problem

As you may already know, when a process terminates, the Erlang virtual machine will propagate an exit signal to all linked processes. By default, if the process terminated abnormally (i.e. with a reason different than 'normal'), these linked processes will terminate as well. There's an in-depth explanation about this procedure in the Appendix B.

This is one of the cornerstones of a "supervisor tree", and what start_link is all about. It is great, and it works like a charm.

But sometimes, you are in a kind of strange situation where you need to cheat a little bit, maybe due to strange requirements or APIs. In this case, we need to call start_link, but we don't want to propagate the crash up in the supervisor hierarchy. We want to catch the error (or get the proper error result) instead.

For instance, let's say we have the following architecture:

top_sup -> worker_sup -> worker -> failing_server

So when failing_server fails to start, worker will receive an exit signal and die (since it started failing_server with a start_link call) because it's linked, and this would eventually bring down the whole supervisor tree.

The Solution

What Doesn't Work

A call to start_link implies that the calling process (the one that's actually calling gen_server:start) is linked to the newly created process, and this means (quoting process section of the Erlang reference manual):

When a process terminates, it will terminate with an exit reason as
explained in Process Termination above. This exit reason is emitted in an
exit signal to all linked processes.

And that's why things like these won't do the trick:

try
   {ok, Pid} = gen_server:start_link(...)
    catch
      _:Error ->
      ...
   end

case won't work either:

case gen_server:start_link(...) of
   {ok, Pid} ->
      ...
   _ ->
      ...
   end

Since start_link is a sync procedure, the exit signal will arrive to your process before returning a value. You have to use erlang:process_flag/2 to catch it.

What Does Work

The solution is a mix of things:

  • Set worker as a transient child of your supervisor (worker_sup). In this way, crashes will not make worker_sup try to restart worker, so you don't hit any restart limits. Transient children are only restarted when they terminate with an exit reason other than normal, shutdown or {shutdown,Term}.

  • Have worker control the trap_exit process flag before trying to start the other gen_server (failing_server in the source code example). So when the offending gen_server fails on init, your worker can handle that by receiveing a message without propagating the exit signal.

  • When worker detects that failing_server can't be started, return ignore from init/1. This will tell the supervisor to not restart the child, and also keep the child definition.

  • Before returning from worker:init/1 use timer:apply_after/4 to call a function in your worker_sup module (say worker_sup:restart_child/0) to retry the operation from scratch. This can be done an infinite number of times.

The ignore atom is described in the supervisor manual page:

The start function can also return ignore if the child process for some
reason cannot be started, in which case the child specification will be kept
by the supervisor (unless it is a temporary child) but the non-existing
child process will be ignored.

Making the supervisor keep the child definition is useful, so we can later call supervisor:restart_child/2:

Tells the supervisor SupRef to restart a child process corresponding to the
child specification identified by Id. The child specification must exist and
the corresponding child process must not be running.

Source Code

The source code for the complete solution is provided, with the following files:

  • worker.erl: The worker gen_server process that uses start_link to spawn a gen_server, in this case, failing_server.

  • failing_server.erl: A gen_server process started by the worker that will fail on init/1 with {stop, Reason}.

  • worker_sup.erl: The worker supervisor, that uses restart_child to restart the worker process on a crash.

  • top_sup.erl: Not that interesting, the top level supervisor, just for the sake of completeness.

Code Overview

The failing_server

Here's the failing_server, which will always fail with a bad_hair_day error. The irrelevant parts have been stripped:

-module(failing_server).
-behaviour(gen_server).
...
start_link() ->
   gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
   {stop, bad_hair_day}.
   ...

The worker

Here's the worker. Note how we handle the trap_exit flag before calling failing_server:start_link/0, and stop handling it after it returns:

-module(worker).
-behaviour(gen_server).
...
init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% This receive here will consume the message in the process queue when
  %% failing_server crashes on init.
  receive after 25 -> ok end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.
    ...

The Worker Supervisor

-module(worker_sup).
-behaviour(supervisor).
...
start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

Trial Run

If worker would only do a failing_server:start_link(), note what would happen (i.e: how the shell pid changes):

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:09:45 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
** exception exit: shutdown
3>
=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,worker_sup}
     Context:    start_error
     Reason:     bad_hair_day
     ...

=SUPERVISOR REPORT==== 24-Nov-2012::16:09:45 ===
     Supervisor: {local,top_sup}
     Context:    start_error
     Reason:     shutdown
     ...

3> self().
<0.47.0>

As you can see, the process that started the top_sup (the shell) got linked to the new process(es). When one crashed (failing_server), it took the shell with it, which was restarted, and that's why our shell now has a different pid().

Now, with the real code included in worker, after handling the trap_exit flag and retrying after a couple of seconds:

Eshell V5.9.2  (abort with ^G)
1> self().
<0.40.0>
2> top_sup:start_link().

=CRASH REPORT==== 24-Nov-2012::16:20:52 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.46.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,kernel_safe_sup}
             started: [{pid,<0.47.0>},
                       {name,timer_server},
                       {mfargs,{timer,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,1000},
                       {child_type,worker}]

=PROGRESS REPORT==== 24-Nov-2012::16:20:52 ===
          supervisor: {local,top_sup}
             started: [{pid,<0.44.0>},
                       {name,worker_sup},
                       {mfargs,{worker_sup,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,infinity},
                       {child_type,supervisor}]
{ok,<0.43.0>}
3>
=CRASH REPORT==== 24-Nov-2012::16:20:57 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.51.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}

3> self().
<0.40.0>
4>
=CRASH REPORT==== 24-Nov-2012::16:21:02 ===
  crasher:
    initial call: failing_server:init/1
    pid: <0.55.0>
    registered_name: []
    exception exit: bad_hair_day
      in function  gen_server:init_it/6 (gen_server.erl, line 320)
    ancestors: [worker,worker_sup,top_sup,<0.40.0>]
    ...
Error: {error,bad_hair_day}
4> self().
<0.40.0>

The shell has the same pid(), the supervisors are running, and the system is still trying to start the failing_server process from time to time.

Thanks

This article would have not been possible without the help (and patience) of my pal and personal Erlang guru Fernando "El brujo" Benavides. Thanks, man! :)

Author

Marcelo Gornstein marcelo@inakanetworks.com

Github: marcelog

Homepage: http://marcelog.github.com

Appendix A: Source Code

failing_server.erl

This one is the gen_server that won't start, will return {stop, Reason} from init/1:

-module(failing_server).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, code_change/3, terminate/2]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  {stop, bad_hair_day}.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker.erl

This module will be trying to start failing_server by directly invoking start_link. Will also handle any errors by capturing the trap_exit signal and reschedule a restart for itself:

-module(worker).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(gen_server).

-export([start_link/0]).
-export([
  init/1, code_change/3, terminate/2,
  handle_call/3, handle_cast/2, handle_info/2
]).

start_link() ->
  gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
  process_flag(trap_exit, true),
  Result = failing_server:start_link(),
  %% This receive here will consume the message in the process queue when
  %% failing_server crashes on init.
  receive after 25 -> ok end,
  process_flag(trap_exit, false),
  case Result of
    {ok, _Pid} ->
      io:format("Success~n"),
      {ok, []};
    Error ->
      %% On error, gracefully return an ignore to the supervisor and
      %% schedule a call to restart this worker again some time in the
      %% future.
      io:format("Error: ~p~n", [Error]),
      {ok, _} = timer:apply_after(5000, worker_sup, restart_child, []),
      ignore
  end.

handle_call(_Request, _From, State) ->
  {reply, ok, State}.

handle_cast(_Request, State) ->
  {noreply, State}.

handle_info(_Info, State) ->
  {noreply, State}.

code_change(_OldVsn, State, _Extra) ->
  {ok, State}.

terminate(_Reason, _State) ->
  ok.

worker_sup.erl

The worker supervisor. The only special thing to look at here is that the worker process is defined as transient, and the restart/0 function that ends up calling supervisor:restart_child/2:

-module(worker_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).
-export([restart_child/0]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

restart_child() ->
  supervisor:restart_child(?MODULE, worker).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {worker, {worker, start_link, []}, transient, 2000, worker, [worker]}
  ]}}.

top_sup.erl

The top level supervisor, nothing special here.

-module(top_sup).
-author('marcelo@inakanetworks.com').
-license("Apache2").

-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

start_link() ->
  supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
  {ok, {{ one_for_one, 5, 10}, [
    {
      worker_sup, {worker_sup, start_link, []}, permanent,
      infinity, supervisor, [worker_sup]
    }
  ]}}.

Appendix B: Start and Link Procedure in Detail

When you call gen_server:start_link, a new process will be created, and init/1 will be invoked. Let's first see how, by peeking at lib/stdlib/gen_server.erl:

start_link(Name, Mod, Args, Options) ->
    gen:start(?MODULE, link, Name, Mod, Args, Options).

Going forward, let's now look into gen:start, in (lib/stdlib/gen.erl):

start(GenMod, LinkP, Mod, Args, Options) ->
    do_spawn(GenMod, LinkP, Mod, Args, Options).
...
%%-----------------------------------------------------------------
%% Spawn the process (and link) maybe at another node.
%% If spawn without link, set parent to ourselves 'self'!!!
%%-----------------------------------------------------------------
do_spawn(GenMod, link, Mod, Args, Options) ->
    Time = timeout(Options),
    proc_lib:start_link(?MODULE, init_it,
      [GenMod, self(), self(), Mod, Args, Options],
      Time,
      spawn_opts(Options));

Note the call to proc_lib:start_link. Note how it will (for sure) spawn a new process, link self() to it, and call the gen:init_it function:

init_it(GenMod, Starter, Parent, Mod, Args, Options) ->
    init_it2(GenMod, Starter, Parent, self(), Mod, Args, Options).

init_it(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    case name_register(Name) of
        true ->
            init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options);
        {false, Pid} ->
            proc_lib:init_ack(Starter, {error, {already_started, Pid}})
    end.

init_it2(GenMod, Starter, Parent, Name, Mod, Args, Options) ->
    GenMod:init_it(Starter, Parent, Name, Mod, Args, Options).

So in the end (for a gen_server), gen_server:init_it is called because of the content of the GenMod variable (as we saw above, it is ?MODULE when called from a gen_server):

case catch Mod:init(Args) of
  ...
  {stop, Reason} ->
  %% For consistency, we must make sure that the
  %% registered name (if any) is unregistered before
  %% the parent process is notified about the failure.
  %% (Otherwise, the parent process could get
  %% an 'already_started' error if it immediately
  %% tried starting the process again.)
  unregister_name(Name0),
  proc_lib:init_ack(Starter, {error, Reason}),
     exit(Reason);
  ...

proc_lib:init_ack will actually notify "the parent" of the error. The parent is the one that called start_link (as we can see in lib/stdlib/proc_lib.erl)

init_ack(Parent, Return) ->
    Parent ! {ack, self(), Return},
    ok.

If init finishes with something like {stop, _}, exit is called, and this will definitely propagate a crash, because the parent is not a supervisor, it is a monitored process and the start_link operation is sync, so the exit signal will be sent before any messages.