Building nvidia-docker containers with nix


How to convert the output of a derivation into a self-contained package is one of the most important questions when developing production software with nix, even for prototype applications. If the system you need to deploy to is not under your control, nix is probably not installed on it and will not be in the immediate future.

To avoid having to beg an admin I don’t even know personally to install nix on some remote system I have no root access to, I usually just build a docker container. Docker is, as of today, a tool most sysadmins know and trust, and if it isn’t installed already, it shouldn’t be a problem to convince them to do so. That holds for server applications, at least; if you need to deploy user-facing applications, AppImages are a viable replacement.

Background: Deploying CUDA based apps

So when I was tasked with deploying a machine learning system to a cluster of Nvidia GPUs, building a docker container was again my go-to solution. I already knew that I would need to convince the admin to install nvidia-docker (packaged in nixpkgs), a docker runtime which exposes the host’s GPUs to the containers running on it, but this was the lesser problem.

CUDA itself, a widely used proprietary library for general-purpose computation on Nvidia GPUs, is already available in nixpkgs and is used by derivations with GPU support enabled, such as the machine learning library pytorch. You’ll have to bring some patience if you want to use libraries with cuda support, though: they are unfree packages and thus not available in the public nix binary cache, meaning they have to be built locally.
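
Before nix will even evaluate cudatoolkit and the packages depending on it, you have to allow unfree packages; the usual way is a line in your nixpkgs config:

# ~/.config/nixpkgs/config.nix
{
  allowUnfree = true;
}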

At first, I thought it would be an easy task: sit through the build times on my development machine once to compile all necessary libs with cuda support (they are available in your local nix store afterwards), write a 20-line derivation to put everything into a vanilla container, and that’s it.

As you may have guessed by the length of the article (or the query that most likely brought you here), this was not “it”. What follows is a summary of the things I needed to do to get nvidia-cuda containers made with nix to work.

Prerequisites

I assume that you have a working derivation (as in: it compiles and does what you want it to do) with a dependency on cuda, which needs to be deployed to a non-NixOS Linux operating system.

If you struggle to get the derivation to work on your own system (without docker involved), I’d point you over to this derivation which builds fast.ai (a machine learning lib built on top of the aforementioned pytorch) to get some inspiration on how to write a working wrapper for cuda applications.
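
For orientation, a minimal sketch of such a derivation could look like the following; the cudaSupport override is the nixpkgs mechanism, while the file name and the choice of a python environment are purely illustrative:

# shell.nix - a minimal sketch of a CUDA-dependent environment
{ pkgs ? import <nixpkgs> { config.allowUnfree = true; } }:

pkgs.python3.withPackages (ps: [
  # pytorch with cuda support is not in the public binary cache,
  # so expect a long local build the first time
  (ps.pytorch.override { cudaSupport = true; })
])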

As a first step you should make sure that all subderivations of your final derivation depend on the same version of cuda.

In theory, multiple versions of cuda should not be a problem, but I got some very consistent crashes of the nvidia containers when they were present. I suspect this is an issue on the driver side. At least it is easy to override any versions differing from the one you want to use, as sketched below.
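
One way to do this is a nixpkgs overlay that pins the default cudatoolkit to a single version; versioned attributes like cudatoolkit_10_0 exist in nixpkgs, but check which ones your channel actually provides:

# overlay.nix - force a single cuda version across all packages
self: super: {
  # every derivation referring to the default cudatoolkit (and cudnn)
  # now gets the pinned version
  cudatoolkit = super.cudatoolkit_10_0;
  cudnn = super.cudnn_cudatoolkit_10_0;
}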

Also, make sure that the driver on your target host system supports the cuda version you want to use, as this driver will later process the calls from your docker container (recent drivers print the maximum supported cuda version in the header of the nvidia-smi output).

A few technical details

First, it is important to note that, in theory, any container can be used with GPU support, as the nvidia-docker wiki states.

This implies that no modifications to the layers of the container have to be made in order to make it run on the GPU. This is confirmed by the README of the underlying nvidia-container-runtime, which states that the container runtime only adds a pre-start hook when a container is run.

What we have to do, though, is set the correct environment variables. This is the reason why simply extending the cuda base image from docker hub does not work with the nix docker-tools: they clear the environment of the base image before adding any additional layers.

You can find a list of the most relevant environment variables over at the container-runtime repository, but this list is not complete.

As the runtime’s pre-start hook mounts the driver and library files of the host system into your container, you’ll have to specify where the program loader can find the shared libraries required by your cuda application.

One way to do this is to set the LD_LIBRARY_PATH environment variable to the path where the nvidia runtime will mount them (which seems to be /usr/lib64/, but I am certain there are more). You could also specify the files to load directly using the LD_PRELOAD variable, and there are most certainly other ways.
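
Expressed as container environment entries, the two approaches would look roughly like this; the path is the one I observed the runtime using, and the preloaded library name is purely illustrative:

Env = [
  # let the loader search the mount point of the host driver libraries ...
  "LD_LIBRARY_PATH=/usr/lib64/"
  # ... or preload specific libraries directly
  # "LD_PRELOAD=/usr/lib64/libcuda.so.1"
];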

Solution

A generic nix builder for nvidia-docker containers is available over at Github and below. The repository also includes a small demo application which is built with three versions of CUDA to create three different docker images, so you can see it in action.

{
    # see https://github.com/NixOS/nixpkgs/blob/0a351c3f65136c00d3512dd77f48e12a571cf495/pkgs/build-support/docker/default.nix#L656
    cudatoolkit,
    buildImage,
    lib,
    name,
    tag ? null,
    fromImage ? null,
    fromImageName ? null,
    fromImageTag ? null,
    contents ? null,
    keepContentsDirlinks ? false,
    config ? {Env = [];},
    extraCommands ? "", 
    uid ? 0, 
    gid ? 0,
    runAsRoot ? null,
    diskSize ? 1024,
    created ? "1970-01-01T00:00:01Z"
}:

let

  # keep at most the first three components of the version
  # string, e.g. "10.0.130"
  cutVersion = with lib; versionString:
    builtins.concatStringsSep "."
      (take 3 (builtins.splitVersion versionString));
    
  cudaVersionString = 
    "CUDA_VERSION=" + (cutVersion cudatoolkit.version);

  cudaEnv = [
    "${cudaVersionString}"
    "NVIDIA_VISIBLE_DEVICES=all"
    "NVIDIA_DRIVER_CAPABILITIES=all"

    "LD_LIBRARY_PATH=/usr/lib64/"
  ];

  # append the caller-supplied environment so it is extended
  # rather than discarded
  cudaConfig = config // {Env = cudaEnv ++ (config.Env or []);};

in buildImage {
  inherit name tag fromImage
    fromImageName fromImageTag contents
    keepContentsDirlinks extraCommands
    runAsRoot diskSize created;

  config = cudaConfig;
}
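
For reference, invoking the builder could look like this; the file names and the application derivation are illustrative, while cudatoolkit and lib are filled in automatically by callPackage:

let
  pkgs = import <nixpkgs> { config.allowUnfree = true; };

  # your CUDA application derivation (illustrative)
  myCudaApp = pkgs.callPackage ./my-cuda-app.nix { };

in pkgs.callPackage ./nvidia-docker-image.nix {
  inherit (pkgs.dockerTools) buildImage;
  name = "my-cuda-app";
  tag = "latest";
  contents = [ myCudaApp ];
}

Building this with nix-build produces an image tarball which you can import with docker load < result and start with docker run --runtime=nvidia on a host with nvidia-docker set up.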

As you can see, the builder is a thin wrapper around the docker image builder which adds the required environment variables. The builder isn’t perfect though: as the driver version depends on the host system, it will not work on all machines. The example I uploaded to github assumes your driver supports at least CUDA 10.0, but you can lower this anytime. A perfect solution would include fetching and using the correct driver for a given CUDA version, but we hit the end of the road here, at least for now.

A simple way to at least give an immediate and readable error message if the host CUDA version is too low is to set the NVIDIA_REQUIRE_* environment variables, as sketched below. I may include this in the builder if I find the time, and update the git accordingly.
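
A sketch of what that could look like in the builder’s environment list; NVIDIA_REQUIRE_CUDA is documented in the nvidia-container-runtime README, but double-check the exact constraint syntax there:

  cudaEnv = [
    "${cudaVersionString}"
    "NVIDIA_VISIBLE_DEVICES=all"
    "NVIDIA_DRIVER_CAPABILITIES=all"
    "LD_LIBRARY_PATH=/usr/lib64/"
    # fail early with a readable error if the host driver is too old
    "NVIDIA_REQUIRE_CUDA=cuda>=${lib.versions.majorMinor cudatoolkit.version}"
  ];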

Credits

  • Dennis Gossnel’s Nix Derivation for fast.ai, which I used as a primer to get nix to build something with cuda support
  • tfc’s Nix Cmake Example, which I use as a kind of reference work whenever I forget how to do something with Nix.
  • Cover Photo by Tri Eptaroka Mardiana on Unsplash; The NixOS logo is licensed under the Creative Commons Attribution 4.0 International License