Elixir Concurrent Testing Architecture

Sean Lewis
11 min read · Dec 7, 2020


This article gets pretty deep into Elixir and is relatively advanced. If you’re new to this awesome programming language then you should peruse the intro guide to Elixir.

This also uses Elixir’s guide to OTP and processes as a baseline code base for testing concurrently.

This article comes with a companion GitHub repo so you can follow along, run the tests, and view the code yourself.

About me

I’ve been in software engineering for almost a decade. I got my start as the first automated QA engineer (SDET) at Instructure so testing is very near and dear to my heart. I’m currently the very hands-on backend architect at GenesisBlock. If you’re interested in React Native, Elixir, fintech, a very diverse team, and/or cryptocurrency you should browse our career page.

At first glance, it seems very straightforward to run Elixir tests that use Ecto in parallel. Ecto has a guide on this here. However, enabling async: true only works if the concurrent processes that are running Ecto queries are properly associated with your primary test process.

What does this mean?

Think of Elixir processes as branches of a tree:

Source: https://stackoverflow.com/questions/32033835/proportions-tree-graph-in-r

Now imagine that our application, AsyncTesting.Application, is running as process 1 and that our test, mix test test/async_testing_test.exs:20, is running as process (node) 4 in that tree. Processes 4, 8, and 9 will work just fine when they access the database, provided they are configured properly. However, if process 8 asks process 5 to look something up in the database, then we’ll get this lovely error:

Ecto describes the DBConnection.OwnershipError here.

Why does this happen?

This happens because Ecto cannot figure out which database connection process 0.235.0 (analogous to process 5 in the graphic) should be using. Without being able to coordinate queries through Ecto.Adapters.SQL.Sandbox, Ecto throws this error and tells you how to fix it. If Ecto didn’t throw this error, we could easily introduce database race conditions into our tests. And race conditions are not lovely. The Ecto sandbox is not magic, and you can still end up with database deadlocks if you’re not careful.
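To make "associated with your primary test process" a little more concrete, here’s a minimal sketch that uses only the standard library. It relies on the fact that Task records the spawning process under the :"$callers" key of the process dictionary, which is one of the things the ownership lookup consults. A GenServer that was started on its own has no such trail. Sketch.Lineage is a hypothetical module made up for this illustration; no Ecto is involved.

```elixir
defmodule Sketch.Lineage do
  # Hypothetical GenServer that only reports its own :"$callers" entry.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)
  def callers(server), do: GenServer.call(server, :callers)

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call(:callers, _from, state) do
    # Runs inside the GenServer process, so this reads *its* dictionary.
    {:reply, Process.get(:"$callers", []), state}
  end
end

# Plays the role of "process 5": started on its own, not by the test.
{:ok, outsider} = Sketch.Lineage.start_link([])

# Plays the role of the test process ("process 4").
test_pid = self()

# A Task spawned by the test records the test pid under :"$callers".
task_callers =
  Task.async(fn -> Process.get(:"$callers", []) end)
  |> Task.await()

IO.inspect(test_pid in task_callers, label: "task traces back to test")
IO.inspect(test_pid in Sketch.Lineage.callers(outsider), label: "outsider traces back to test")
```

The outsider has nothing tying it to the test, which is why it needs an explicit allowance before it can touch the test’s connection.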

How does it happen?

On the branch async-failing-tests, our implementation of KV.Bucket in lib/async_testing/bucket.ex calls KV.Registry.get_value(key). This then asks our KV.Registry GenServer to make an Ecto query in lib/async_testing/registry.ex on line 52: new_bucket = AsyncTesting.Bucket.get(key). KV.Registry is started completely separately from, and before, our testing process tree in lib/async_testing/application.ex and is analogous to process 5 in the tree graphic above.

How to prevent it?

  1. Toggle async: true to false in your tests. Not exactly desirable considering concurrency is a core tenet of Elixir! This can also have a serious impact on your test suite run time later on in your project. With async: true your test suite will usually take only as long as your longest-running test file (this can change when you start exceeding 1k tests). Setting this to false forces these synchronous tests to run one after another, which can easily bloat your test times. Running the tests in the branch async-false produces a test time of 20.1 seconds, whereas the async-compatible implementation runs in just 10.9 seconds! Now imagine you had 500 tests instead of 48 and you can see just how much time we can accidentally waste on sequential testing.
  2. Avoid reaching into a process outside your test’s tree. If you take a look at the branch no-parallel-process-reaching and run the test mix test test/async_testing_test.exs:20 you can see that our tests run just fine. This is because our implementation in lib/async_testing/bucket.ex does not ask a process outside of our tree to make an Ecto query; it makes the query itself on this line: new_bucket = AsyncTesting.Bucket.get(key).

Often such an easy solution does not exist. You will run into cases where you have several GenServers that own their data/queries. You should not reach into their tables and make queries on their behalf. This generally violates separation of concerns and domain boundaries. Service A should own queries on Service A’s data. Service B should not be allowed to reach into Service A’s data model and make custom queries itself. Good fences around responsibilities make for good software service neighbors. If you instead respect the public API that Service A has exposed, you can avoid any conflicting ownership and distributed monolith headaches, but you have to get clever about your testing.

Clever Testing Rules

  1. Every test file should be configured as async: true
  2. Use Mox or Hammox (preferred) to mock external or extraneous services.
  3. Use start_supervised to start unique GenServers or other required async processes in the setup block of that test.
  4. GenServers/Supervisors need to be made configurable on startup.
  5. All primary/unique keys used in tests should be made unique via randomization.

Next, we’ll go over each of these rules in detail.

1. Every test file should be configured as async: true

When you make a new test file make sure to set it as async: true. Starting every test file this way is a great way to ensure your code is performant and consistent in a parallel environment as your codebase grows. Enforce this as soon as possible and start paying off some of this tech debt.

If you say “we’ll make them concurrent later” you will be in for a lot of compounding sadness in the form of tech debt in the future. I’ve seen this firsthand in a very large Elixir repo. Testing was not prioritized and two years later the test suite took 30 minutes. Very few tests were async: true and many tests would contaminate the state of the testing application because side effects were not encapsulated. We had a litany of intermittent test failures that depended purely on the order in which the tests were run.

Do not let this become your app.

Not only is it miserable for engineers, but to make it bearable you’ll have to spend thousands to tens of thousands of dollars a month mitigating it by splitting your test suite into chunks and running those in parallel on lots of cloud compute. When your test suite takes 11 seconds to run, engineers would rather run it locally than wait for CI to tell them the results. This speeds up delivery time and efficiency. The more quickly you can get feedback to your engineers, the less time is wasted in context switching and waiting.

Start disciplined, stay disciplined.

2. Use Mox or Hammox (preferred) to mock external or extraneous services.

Most applications will reach out to several different APIs or services. You may integrate with Firebase and Amazon SQS. It may be tempting to somehow integrate these services into your tests but that is something to be avoided. If either service’s testing environment goes down then your test suite will behave unexpectedly or fail outright. Many of these API calls will significantly slow your tests down as well. Instead, you should use Mox to mock the calls to these services. This way you can ensure the service is being called properly without slowdowns or unexpected responses.

I’ve merged a bare-bones Mox example in the Mox library here and here’s a good tutorial on how to set up and use Mox.

It might be tempting to reach for Mock as it’s a bit easier to set up and understand. However, the fine print in the Mock readme tells you explicitly:

Also, note that Mock has a global effect so if you are using Mocks in multiple tests set async: false so that only one test runs at a time.

This is the largest issue with Mock. You cannot Mock the same function across different parallel tests. This violates our first rule: “Every test file should be configured as async: true.” Having replaced Mock in several apps I can confidently say Mock is considerably slower than Mox as well. I was able to speed up test runtime by 40+ percent by simply replacing Mock with Mox but keeping the test file as async: false.

As an aside, you should try to use Hammox. It is a wrapper around Mox that will test and ensure the Elixir typespecs you define are enforced in the mocks you define. For example, you have a function like this:

@spec get_user(String.t()) :: %User{}

But then you mock it with:

UserMock
|> expect(:get_user, fn "12345" -> nil end)

Hammox will throw an error when you run your test because your spec says that function must return %User{} but you returned nil in your mock. Hammox is super helpful in maintaining typespecs in Elixir and ensuring you have good mocks that simulate what the function would return in a production context.

3. Use start_supervised to start unique GenServers or other required async processes in the setup block of that test.

When you start to have a larger Elixir application, your application.ex file will probably have several different applications running as children. Testing an application that needs to talk to other children in the application poses a challenge for parallel testing. Any two tests calling these other child applications at the same time can return different results depending on test order. There are several ways to solve this issue. I have provided two different solutions.

The Elixir/default solution

This solution is provided in the branch async-true.

I’m led to believe this is the more Elixir-y solution. I’ve talked to a handful of members of the community about this issue. Combine those chats with the default tutorials (like the one used for this article) and we can see the preferred method of doing this is passing the name or PID into the function, like the server argument in the aforementioned tutorial:

@doc """
Looks up the bucket pid for `name` stored in `server`.

Returns `{:ok, pid}` if the bucket exists, `:error` otherwise.
"""
def lookup(server, name) do
  GenServer.call(server, {:lookup, name})
end

So all we need to do is start our GenServer in our tests with a unique name, set up Ecto allowances, and then change our tests to call our unique GenServer.

Our test setup becomes this:

setup do
  registry = start_supervised!({KV.Registry, name: __MODULE__, test_pid: self()})
  %{registry: registry}
end

This starts a unique GenServer of KV.Registry with the name __MODULE__ which will be the module name of our test. This ensures the GenServer name does not collide with any other test that’s running this same GenServer. If we named it “Foo” and another test also named it “Foo” then the tests could have race conditions as they’d be accessing the same GenServer. The last line passes registry in a map to each test.

In our tests, it’s now as simple as passing our custom test KV.Registry GenServer PID in as the first argument:
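As a minimal, self-contained sketch of that pattern: Sketch.Registry below is a hypothetical in-memory stand-in, not the repo’s KV.Registry, and it makes no Ecto queries, so the test_pid option is left out. The shape of the setup block and of the calls is the part that carries over.

```elixir
ExUnit.start()

defmodule Sketch.Registry do
  # Hypothetical stand-in: a key/value GenServer with no database behind it.
  use GenServer

  # `opts` may carry `name:`, which lets every test start its own instance.
  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  # The server to talk to is always the first argument.
  def put(server, key, value), do: GenServer.call(server, {:put, key, value})
  def lookup(server, key), do: GenServer.call(server, {:lookup, key})

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:put, key, value}, _from, state), do: {:reply, :ok, Map.put(state, key, value)}
  def handle_call({:lookup, key}, _from, state), do: {:reply, Map.fetch(state, key), state}
end

defmodule Sketch.RegistryTest do
  use ExUnit.Case, async: true

  setup do
    # Named after the test module, so it cannot collide with another test file.
    registry = start_supervised!({Sketch.Registry, name: __MODULE__})
    %{registry: registry}
  end

  test "lookup only sees what this test stored", %{registry: registry} do
    assert Sketch.Registry.lookup(registry, "shopping") == :error
    assert :ok = Sketch.Registry.put(registry, "shopping", 1)
    assert Sketch.Registry.lookup(registry, "shopping") == {:ok, 1}
  end
end
```

Every call in the test names the registry explicitly, which is exactly the verbosity discussed next.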

This solution is relatively straightforward and simple to implement. The downside lies in complexity on the caller’s side. I’m not a massive fan of this solution because passing the GenServer you want to call seems like more work than we should have to do and more information than I, as the function caller, should have to know. It also ends up being rather verbose any time you want to make a function call to this module. This situation is compounded if you have a GenServer that calls another GenServer. Do you pass both modules in every time you make a call to this GenServer? What if you have a GenServer that calls two other GenServers? It can become cumbersome quickly with many different GenServers. With this solution, our tests run in the expected ~10 seconds in parallel.

The Manager layer solution

This solution is provided in the branch async-true-with-manager.

My preferred solution is the introduction of a layer between caller and GenServer. The rather uninspired name I have for it is “Manager”.
The Manager is responsible for determining which GenServer you want to call in a given module. Putting this layer here means we can dynamically swap which GenServer is being called in our tests without having to tediously pass it into every call we make to our module. This method also works much better than the previous method when you encounter a module that needs to use multiple mocks of other GenServers. Otherwise, you’d be forced to mock any calls to further GenServers or you’d have to pass a map of all the GenServers you want to use, and that gets nasty very quickly.

Where you would typically have:

def lookup(server, name) do
  GenServer.call(server, {:lookup, name})
end

You would instead have:

@registry_manager Application.compile_env(:mox, :registry_manager, KV.Registry.Manager)

def lookup(name) do
  GenServer.call(@registry_manager.get_server(), {:lookup, name})
end

Full file here

The module attribute @registry_manager will either be what we set it to in our configuration under the keys :mox -> :registry_manager or it will default to KV.Registry.Manager, which is our preferred default implementation for production and development environments.

This makes it so you can use Mox/Hammox to change which GenServer lookup/1 calls during tests like so:

Full file here

Hammox.stub will return our test registry PID registry every time get_server/0 is called.

This ensures that each test has a unique instance of KV.Registry, which prevents state contamination and thus enables parallel testing. start_supervised will also ensure our spawned KV.Registry gets shut down when our tests are finished.

You can see that this method is less verbose in our tests and still results in the ~10-second runtime of our test suite. We don’t have to know which GenServer to call because we’ve stubbed that function to always return the test GenServer.
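Here’s a rough sketch of the Manager idea using only the standard library, so it runs as a plain script. It is not the repo’s implementation. The per-test swap is done with the process dictionary instead of Hammox.stub, the manager is resolved at runtime with Application.get_env instead of at compile time, and every Sketch.* name and the :sketch config key are hypothetical.

```elixir
defmodule Sketch.Manager do
  # Hypothetical behaviour: answers "which GenServer should callers talk to?"
  @callback get_server() :: GenServer.server()
end

defmodule Sketch.DefaultManager do
  @behaviour Sketch.Manager

  # Production-style default: a single, globally named server.
  @impl true
  def get_server, do: Sketch.Store
end

defmodule Sketch.ProcessDictManager do
  @behaviour Sketch.Manager

  # Test-style manager: each calling process registers its own server.
  def use_server(server), do: Process.put(__MODULE__, server)

  @impl true
  def get_server do
    Process.get(__MODULE__) || raise "no server registered for #{inspect(self())}"
  end
end

defmodule Sketch.Store do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Callers never pass a server; the configured manager picks one.
  def put(key, value), do: GenServer.call(manager().get_server(), {:put, key, value})
  def lookup(key), do: GenServer.call(manager().get_server(), {:lookup, key})

  # Assumption: runtime lookup, chosen only to keep this sketch script-friendly.
  defp manager, do: Application.get_env(:sketch, :store_manager, Sketch.DefaultManager)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:put, key, value}, _from, state), do: {:reply, :ok, Map.put(state, key, value)}
  def handle_call({:lookup, key}, _from, state), do: {:reply, Map.fetch(state, key), state}
end

# "Test configuration": swap in the per-process manager.
Application.put_env(:sketch, :store_manager, Sketch.ProcessDictManager)

{:ok, store_a} = Sketch.Store.start_link([])
{:ok, store_b} = Sketch.Store.start_link([])

# Each task plays the role of one async test with its own store.
run_isolated = fn store, value ->
  Task.async(fn ->
    Sketch.ProcessDictManager.use_server(store)
    :ok = Sketch.Store.put("key", value)
    Sketch.Store.lookup("key")
  end)
end

[result_a, result_b] =
  [run_isolated.(store_a, :a), run_isolated.(store_b, :b)]
  |> Task.await_many()

IO.inspect({result_a, result_b}, label: "each caller sees only its own store")
```

Both callers use the same "key" at the same time, yet neither sees the other’s value, and neither had to pass a server into put/2 or lookup/1.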

4. GenServers need to be configurable on startup.

We need to design our GenServers so they can take a configuration on startup. This will allow us to specifically choose the name for the GenServer. This is important because named GenServers have to be unique. If we don’t configure this name then we will not be able to start a unique instance of the GenServer in each test.
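To see why the name matters, here’s a small sketch with a hypothetical Sketch.Counter GenServer (standard library only). Starting a second instance under a name that is already taken fails, while a different name starts a fresh, independent process.

```elixir
defmodule Sketch.Counter do
  # Hypothetical GenServer whose name comes from the caller, not the module.
  use GenServer

  # `opts` is forwarded to GenServer.start_link/3, so callers may pass `name:`.
  def start_link(opts), do: GenServer.start_link(__MODULE__, 0, opts)

  @impl true
  def init(count), do: {:ok, count}
end

{:ok, first} = Sketch.Counter.start_link(name: :shared_name)

# Same name again: the registration is rejected and no new process is started.
collision = Sketch.Counter.start_link(name: :shared_name)

# A different name gives us a separate instance.
{:ok, other} = Sketch.Counter.start_link(name: :unique_name)

IO.inspect(collision, label: "second start with the same name")
IO.inspect(first != other, label: "unique name gives a separate process")
```

If the name were hard-coded inside start_link/1, every test after the first would hit that collision.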

You can see how I accomplish this here:

alias Ecto.Adapters.SQL.Sandbox
...
def start_link(opts) do
  # opts carries the GenServer name and, in tests, the owning test's pid.
  GenServer.start_link(__MODULE__, {:ok, Keyword.get(opts, :test_pid, nil)}, opts)
end

@impl true
def init({:ok, parent_pid}) do
  # Only tests pass :test_pid; share that test's sandbox connection with this process.
  if parent_pid != nil do
    :ok = Sandbox.allow(AsyncTesting.Repo, parent_pid, self())
  end

  names = %{}
  refs = %{}
  {:ok, {names, refs}}
end

This is also necessary to ensure we pass our Mox and Ecto allowances on to that new process. If you had some other Ecto or Mox allowances to configure, you’d put them in the same place as Sandbox.allow in the above example. Sandbox.allow tells Ecto that any calls this process makes to the database belong to the parent_pid’s Ecto Sandbox. This prevents that nasty ownership error mentioned earlier.

Without this configuration, the new process will not know which Ecto sandbox to use and it won’t know which Mox/Hammox mocking instance to use. This will cause the test to fail with ownership errors because Ecto and Mox have no idea where to direct queries and mocked calls.

5. All primary keys used in tests should be unique via randomization.

This is subtle but very important. Ecto Sandbox is not magic, even though it feels that way. You can run into deadlocks and other issues in your tests if you use duplicate primary keys. The easiest way to solve this is to ensure any unique columns on your model are randomly generated in your tests.

Instead of this:

test "get user" do
  user = User.create("somebody@example.com")
  fetched_user = User.get(user.id)
  assert user.email == fetched_user.email
end

Prefer this:

test "get user" do
  email = "#{Ecto.UUID.generate()}@example.com"
  user = User.create(email)
  fetched_user = User.get(user.id)
  assert user.email == fetched_user.email
end

The issues this can cause are rare, but they are annoying to find and fix. Ecto discusses this in more detail here. If you ensure your tests use unique values like the example, then you’ll never have to worry about any Ecto race conditions that can stem from non-unique data.
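If you’d rather not depend on Ecto.UUID in a helper, here’s a minimal sketch of the same idea with the standard library. Sketch.Unique is a hypothetical helper name. It leans on System.unique_integer/1, which does not repeat within a running VM, so the values are unique for the duration of one test run, not across runs.

```elixir
defmodule Sketch.Unique do
  # Hypothetical test helper: builds values for columns with unique constraints.

  # Unique within the current VM instance, which covers a single `mix test` run.
  def email(domain \\ "example.com") do
    "user-#{System.unique_integer([:positive])}@#{domain}"
  end
end

# Generate a batch the way many concurrent tests would, then count duplicates.
emails = for _ <- 1..1_000, do: Sketch.Unique.email()

IO.inspect(length(Enum.uniq(emails)), label: "distinct emails out of 1000")
```

A helper like this also keeps the randomization in one place, so tests read as User.create(Sketch.Unique.email()) instead of repeating the interpolation everywhere.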

Conclusion

Elixir is a great language but it is very easy to poorly architect your tests. This will cause loads of headaches, pain, and tech debt later in the lifecycle of your application. If you follow the clever testing rules and embrace the idea of thoughtful testing architecture in your Elixir application then your future self and engineering peers will thank you.

Frequent testing, excellent test coverage, and considerate test architecture have been, in my experience, a keystone of engineering productivity, product stability, and engineer happiness.

In closing, if there’s anything you feel could be more clear or isn’t working how I described please let me know in a comment and I’ll make an edit to address the issue.
