Introduction

This template is intended to be used as a base when building a new artifact (e.g., tools, proofs, data) to accompany a published paper. It is heavily geared towards Computer Science (as that is my own field), but may be repurposed for other scientific domains.

The documentation is divided into the following sections:

  • Motivation: the motivation behind this template;
  • Portability: how to ensure that the artifact can work on different machines;
  • Architecture: a high-level description of the architecture of the artifact template;
  • A simple example: walk-through of a concrete example of an artifact using the template;
  • The artifact template in detail: a detailed technical guide on using the template for your own artifacts.

Motivation

This template exists mainly for two reasons:

  1. There is a real problem of missing, incomplete or otherwise unsatisfactory research artifacts in Computer Science. In my personal experience, this seems to be caused mostly by inertia (or laziness), as proper preparation of an artifact for publication is (or should be) in some ways similar to the practice of software release, which of course demands some organizational effort. A template could ease some of that pain by providing a base to construct upon.
  2. Since the work of constructing an artifact should be a core part of research (and not just an afterthought), and in the interest of reproducibility, the artifact itself could (and should) be used for any experiments presented in the paper.¹ This ensures that the very environment which produced the data shown in the paper is available to anyone, such as artifact evaluators, essentially guaranteeing a smooth reproduction of the experiments.

¹ Of course, this should be done during the latest phase of experimentation, when there is a clear understanding of which experimental results are needed for the paper. It is not worthwhile to develop a stable, publishable artifact during the prototyping phase, when experimental results are not yet clear and the fate of the paper is still largely undecided.

Portability

To ensure portability and ease the setup of the evaluation environment, this template makes heavy use of Docker. Of course, this can be swapped out for any other such tool, so long as the artifact remains reasonably portable between different platforms and configurations. Common alternatives include VirtualBox and Nix.

In any case, it is good practice to thoroughly document software and hardware dependencies, and to provide solutions or alternatives when possible.

Architecture

The goals of this artifact template are to produce an artifact which can:

  1. Be used to both produce (by yourself) and reproduce (by others) experimental results for/from the associated paper;
  2. Be used by others just like any piece of software (if applicable), in various contexts (such as research—e.g., other papers, or industry).

As such, this template proposes to split the work into two repositories:

  • The tool repository, containing the software itself, to be independently maintained and distributed as needed;
  • The artifact repository, building on the tool repository, but also adding any additional material needed to (re)produce experimental results for/from the associated paper.

This split allows both parts of the work to be maintained properly and independently. On the one hand, the tool repository should be both published in a live repository (e.g., on GitHub), so that the authors can keep maintaining and distributing the tool and others can fork it and contribute to it, and archived (e.g., on Zenodo), to allow for long-term preservation and enable reproduction in the future. On the other hand, the artifact repository should also be archived (for the same reasons), but it should probably not be published in a live repository, as it is not intended to be distributed or to accept contributions from others; after all, its only purpose is reproducing the results of the associated paper.

This way, the two repositories can stay clean, containing only relevant content, and avoiding confusion regarding the two main "target groups": developers or researchers modifying the tool are probably only interested in the tool repository, while artifact evaluators are probably only interested in the artifact repository.

A simple example

In order to understand how the template works, we are going to apply it to a concrete example. For the sake of this example, we will imagine a made-up paper, titled Spellcheck: Checking New Spells, which introduces a new spellchecking approach and tool called Spellcheck, and was published in the prestigious CONF'99 conference.

Following the template, this section will walk you through the two repositories: the spellcheck tool repository and the spellcheck-conf99-artifact artifact repository.

For the sake of clarity and brevity, this example is obviously simplified; if you are interested in a real use case, see https://github.com/binsec/rosa and https://zenodo.org/records/14724251.

The spellcheck tool repository

The repository for the spellcheck tool can be found under example/spellcheck-repo.

The Spellcheck approach is implemented here in the form of a Python script, spellcheck.py. It has an external dependency on another tool, called similar-word-finder (found under the directory with the same name). Such a dependency can be handled in many ways; for example, if similar-word-finder has its own public repository, it can be "linked" to this repository via a Git submodule, conveniently pinning it down to an exact commit ID.
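For illustration, assuming similar-word-finder is hosted at a public URL (the URL below is hypothetical), adding and pinning it as a submodule might look like this:

$ git submodule add https://github.com/example/similar-word-finder.git similar-word-finder
$ git -C similar-word-finder checkout <commit-or-tag>
$ git commit -am "Pin similar-word-finder"

Anyone cloning the tool repository can then fetch the pinned version with git clone --recurse-submodules (or git submodule update --init).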

General "must-have"s

The following essential files are present in the repository:

  • README.md. It gives a short explanation of the approach and tool, listing the dependencies needed in order to install and use it. It also links to the contributing guide (see CONTRIBUTING.md below) and provides a citation for other papers to use (see CITATION.cff below).
  • AUTHORS. It contains a simple list of authors and their emails. In this case, there is a single author: Jane Doe.
  • LICENSE. It contains the license of the tool. In this case, it is the LGPL-2.1.
  • CITATION.cff. It contains metadata that can help cite this repository. See https://citation-file-format.github.io/.
  • CONTRIBUTING.md. It contains a guide to help others (researchers or industry practitioners) to contribute new features or bug fixes to the tool.

Docker image

In order to facilitate the use of the tool on different machines, as well as to make the construction of the artifact easier, this repository is also set up to generate a Docker image of the tool, which can then be used to run a Docker container containing the tool. Portability aside, this may also be convenient in some cases where the tool must run in an isolated environment (e.g., in the cases of malware detection or fuzzing).

The image can be built locally with the build.sh utility script, provided that Docker (either Desktop or Engine) is installed on the machine. A container using the previously built image can then be started via the run.sh script.

These two scripts make use of the IMAGE and VERSION files. The IMAGE file defines the name of the generated image (in this case, spellcheck), while VERSION defines its version number. See Versioning below for more details.
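Although the actual template script may differ in the details, a minimal sketch of what build.sh boils down to, given those two files, is:

#!/bin/sh
# Read the image name and version, then build the Docker image (sketch).
IMAGE=$(cat IMAGE)
VERSION=$(cat VERSION)

docker build --tag "$IMAGE:$VERSION" .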

Finally, the Dockerfile and .dockerignore determine how the image gets built. See https://docs.docker.com/reference/dockerfile/ for a detailed description of these files.

Documentation

Most tools need thorough documentation, which should not only explain how to use them in their intended context, but also how to extend them and use them in new contexts (something that is very common in research). In this case, since the example is very simple, there is a single file in the doc directory, but in a real use case the documentation would be much more detailed.

Versioning

Semantic Versioning is used (in tandem with Git) for the versioning of Spellcheck. Concretely, each Spellcheck version corresponds to a Git tag and an update to the VERSION file. This makes it easier to reproduce results with specific versions of Spellcheck (such as the one used in the fictional paper), and citations of the repository/tool can specify the version to avoid ambiguities as Spellcheck evolves.

The spellcheck-conf99-artifact artifact repository

The repository for the spellcheck artifact can be found under example/spellcheck-conf99-artifact.

As a reminder, this repository is intended for artifact evaluators at the Artifact Evaluation track of the fictional CONF'99 conference. As such, its contents are in practice heavily dictated by the guidelines of that track. For example, the README.md file is loosely based on the guidelines for ICSE'25.

Docker image

Apart from the files and general structure imposed by the guidelines, this repository shares the same Docker infrastructure as the tool repository. One subtlety is that, as you can see in the Dockerfile, the image of the artifact is based on the image of the tool from the tool repository. This helps to both (1) pin the version of Spellcheck and (2) reuse the Spellcheck Docker image, but it is not strictly necessary—we could also use Git submodules, but we would probably have to partially re-implement the dependency installation and setup done in the Spellcheck tool repository. Also see A note on stacking Docker images for more details.
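As a sketch only (not the template's actual Dockerfile), and assuming the spellcheck:1.0.0 image has been built locally with the tool repository's build.sh (or pushed to a registry), the artifact Dockerfile might start like this:

# Base the artifact image on the Spellcheck tool image, pinning its version.
FROM spellcheck:1.0.0

# Add the material needed to (re)produce the experiments.
COPY benchmarks/ /root/benchmarks/
COPY wordlists/ /root/wordlists/
COPY run-benchmark.py run-full-evaluation.sh run-reduced-evaluation.sh /root/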

Data needed for the (re)production of experiments

The following files and directories are new:

  • benchmarks/: this directory contains a list of benchmarks on which Spellcheck is evaluated. In the case of Spellcheck, a benchmark is a text file potentially containing misspelled words. In order to evaluate the precision of Spellcheck, the "ground truth" (i.e., correct spelling fixes) version of each text file is provided, under benchmarks/ground-truth. In reality, this benchmarks/ directory may be organized very differently based on the "target" of the tool, or it might even be a Git submodule pointing to a different repository which contains the benchmark. Finally, in the case where the benchmark is part of the contributions of the paper (and thus an artifact), it might have its own tool repository which will then get referenced here.
  • wordlists/: since Spellcheck also takes a list of correctly spelled words as input, such lists need to be provided. The same considerations as for the benchmarks/ directory apply here: this could also be an external Git repository or a Docker image we build upon.

Utility scripts

The following utility scripts can be used (e.g., by the artifact evaluators) to simplify the reproduction of the results of the paper:

  • run-full-evaluation.sh: this script runs all of the experiments needed to reproduce the results from the paper. In this case, it is very simple, but in reality it most often translates to months or even years of CPU time.
  • run-reduced-evaluation.sh: this script runs only a selected benchmark, in the interest of time. In practice, this is often necessary because the full evaluation is infeasible given the reviewers' deadlines, so a reduced evaluation that still demonstrates, e.g., the main trends and conclusions of the paper is preferred.
  • run-benchmark.py: this script runs a single benchmark. As such, it can be used both to perform the experiments in the first place (by the authors) and to reproduce selected experiments or even try the tool on new benchmarks (by the reviewers).

Versioning

Again, Semantic Versioning is used (in tandem with Git) for the artifact repository. The only difference is that, in this case, we also have to choose the version of the Spellcheck tool used in the artifact. This is easy to do via the base Docker image selected in the Dockerfile. Again, the VERSION file defines the version of the entire repository and generated image, so it is possible that the version of the Spellcheck tool and the version of the Spellcheck artifact will not be the same.

The artifact template in detail

This section explains how to set up an artifact using this template in detail, including the changes that need to be made to the individual template files.

It also explains how to use the artifact to both run the experiments for the paper and to reproduce the experiments (e.g., for an Artifact Evaluation track).

Building the artifact

Following what was established in the Architecture section, the artifact should be built before the main paper experiments begin, but after the prototyping phase. A good rule of thumb is to start working on the artifact when the list of experiments needed for the final paper has more or less been established.

Creating the Git repositories

The tool itself should be in its own Git repository. If you already have a Git repository, either because you had one from the prototyping phase you'd like to keep or because you're forking an existing tool, you are done with this step. Otherwise, you should create one now with git init. This repository should remain private for now, as making it public may hurt the double-blind review process; consult your target conference's guidelines first.

Similarly, you should create a new, separate Git repository for the artifact. This repository will transparently pull in the tool (from the repository created above), but it will also contain other data relevant specifically to the experiments, which has no place in the tool's "primary" repository.

Versioning

This template encourages the use of Semantic Versioning in tandem with Git tags to aid in reproducibility and debugging.

In simple terms, once you are happy with the state of the artifact, you should:

  • Edit VERSION (as well as any other version-tracking files depending on the programming language(s) used) in the tool repository, commit and tag;
  • Edit VERSION in the artifact repository, commit and tag.
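For example, cutting a hypothetical version 1.2.0 of the tool could look like this (the artifact repository follows the same pattern, with its own version number):

$ echo "1.2.0" > VERSION
$ git add VERSION           # plus pyproject.toml, Cargo.toml, etc., if applicable
$ git commit -m "Bump version to 1.2.0"
$ git tag -a v1.2.0 -m "Version 1.2.0"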

The tool repository

In the Git repository of the tool, copy all files from the template/tool-repo/ directory of the template. If you are not starting from a fresh repository, make sure to not overwrite any existing versions of these files. For example, you can run:

$ rsync -a -v --ignore-existing \
    /path-to-template/template/tool-repo/ /path-to-tool-repo/

Once that is done, you can now go through each file and edit it as described below.

AUTHORS

You should delete the contents of this file and replace them with the names (and emails, if they so wish) of the authors. A good format to follow is Firstname Lastname <email_address>, with one author per line. For example, Jane Doe <jane.doe@example.com>.

build.sh

You should replace TODO with the name of your tool.

CITATION.cff

You should delete the contents of this file and replace them with valid metadata. You can find more about the format of this file on https://citation-file-format.github.io/, and you can use a tool such as https://citation-file-format.github.io/cff-initializer-javascript/ to generate a valid CITATION.cff file.
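As a rough sketch (using the fictional Spellcheck example; replace everything with your own metadata), a minimal CITATION.cff could look like this:

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Spellcheck"
version: "1.0.0"
authors:
  - family-names: "Doe"
    given-names: "Jane"
    email: "jane.doe@example.com"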

This file is not a priority. In fact, you will most likely obtain all of the necessary information some time after the acceptance of the associated paper, so you can come back to it at a later time.

CONTRIBUTING.md

You should replace TODO with the name of your tool in the heading. You should also write a thorough contributing guide, including developer dependencies, a guide on how to set up a development environment and so on.

This file is not a priority. You should take care of this before making the repository public, but you obviously do not need it during the experimental phase.

Dockerfile

You should delete the contents of this file and replace them with the appropriate code to generate a usable Docker image for your tool. You can find the full reference at https://docs.docker.com/reference/dockerfile/.
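For illustration only (the details depend entirely on your tool), a Dockerfile for a small Python-based tool like the fictional Spellcheck might look like this:

FROM ubuntu:24.04

# Install the system dependencies of the tool (hypothetical list).
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 && \
    rm -rf /var/lib/apt/lists/*

# Copy the tool and its pinned dependency into the image.
COPY spellcheck.py /root/spellcheck/
COPY similar-word-finder/ /root/spellcheck/similar-word-finder/

WORKDIR /root/spellcheck
ENTRYPOINT ["python3", "spellcheck.py"]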

.dockerignore

This file contains some reasonable defaults that you will most likely want to omit from the generated Docker image. However, you should adapt it to the needs of the tool. For example, if your tool's build process generates build artifacts in a build/ directory, that directory should probably also be added to the .dockerignore (as well as the .gitignore).

IMAGE

You should delete the contents of this file and replace them with the name of the Docker image of your tool. This should most likely just be the name of the tool in lowercase.

LICENSE

You should delete the contents of this file and replace them with an actual license, preferably an open source license.

README.md

You should replace the TODOs with appropriate text throughout the file. You can of course edit, remove or add text as you see fit, and depending on the context of your tool.

This file is not a priority, but it is often very useful to add some information such as dependencies as soon as you have it. If multiple people are working on this repository, it will also make collaboration easier if it is kept up to date with information about the dependencies or the build process of the tool.

run.sh

You should replace TODO with the name of your tool.

VERSION

This file starts at a reasonable first version, so you do not need to edit it at first. You are, however, expected to edit it as you keep track of the version of the tool. It is a good idea to make sure this version number stays aligned with other version-tracking mechanisms (e.g., in pyproject.toml for Python-based tools or in Cargo.toml for Rust-based tools).

The artifact repository

In the Git repository of the artifact, copy all files from the template/artifact-repo/ directory of the template. Again, if you are not starting from a fresh repository, make sure not to overwrite any existing versions of these files.
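For example, mirroring the command used for the tool repository:

$ rsync -a -v --ignore-existing \
    /path-to-template/template/artifact-repo/ /path-to-artifact-repo/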

Once that is done, you can now go through each file and edit it as described below.

AUTHORS

You should most likely copy the AUTHORS file from the tool repository.

build.sh

You should replace <TODO name & conf> with the name of your tool and the target conference. For example, Spellcheck CONF'99.

Dockerfile

You should delete the contents of this file and replace them with the appropriate code to generate a usable Docker image for the artifact. You can find the full reference at https://docs.docker.com/reference/dockerfile/.

This template encourages basing the Docker image on the Docker image of the tool repository. For simple cases, this is as easy as adding FROM <TOOL>:<VERSION> at the top of the Dockerfile. For more complex cases, see A note on stacking Docker images.

.dockerignore

This file contains some reasonable defaults that you will most likely want to omit from the generated Docker image. However, you should adapt it to the needs of the artifact. For example, if your artifact's build process generates build artifacts in a build/ directory, that directory should probably also be added to the .dockerignore (as well as the .gitignore).

IMAGE

You should delete the contents of this file and replace them with the name of the Docker image of your artifact. This should most likely just be <tool_name>-<conf_name>-artifact. For example, spellcheck-conf99-artifact.

LICENSE

You should delete the contents of this file and replace them with an actual license, preferably an open source license. If the artifact includes material under a different license (e.g., a dataset), make sure to state that it does not fall under the same license as the rest of the repository (and check for potential license conflicts).

README.md

In the first heading, on the first line, you should replace TODOs with the name of the associated paper and the name of the conference.

The rest of the file is laid out loosely following the Artifact Evaluation track guidelines for ICSE'25, but of course this will depend on your target conference. Consult its guidelines and adapt the file accordingly.

run-benchmark.py

The scope of this script is to run your tool on a single experimental unit and evaluate the result. For example, if you are building a static analysis tool, it should probably:

  1. Run your tool on a single program;
  2. Evaluate the findings of your tool (e.g., using some ground-truth report).

You can either choose to use Python for the easy argument parsing and command running, or, if this is not appropriate, you can delete this script entirely and replace it with a different one. In that case, remember to update the other scripts accordingly; you can verify with grep "run-benchmark.py" *. If you do wish to use this script, you can modify the existing argument parsing and add code to launch your tool with the given benchmark/input, as well as add code to evaluate the results.
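To make the scope concrete, here is a rough sketch of what such a script might look like for the fictional Spellcheck example; the argument names, the spellcheck.py command line and the evaluation logic are all hypothetical:

#!/usr/bin/env python3
"""Run the tool on a single benchmark and evaluate the result (sketch)."""

import argparse
import subprocess


def main():
    parser = argparse.ArgumentParser(description="Run a single benchmark.")
    parser.add_argument("benchmark", help="Path to the benchmark text file.")
    parser.add_argument("ground_truth", help="Path to the ground-truth (corrected) file.")
    parser.add_argument("--wordlist", default="wordlists/english.txt",
                        help="Wordlist to use (hypothetical default).")
    args = parser.parse_args()

    # 1. Run the tool on a single input.
    result = subprocess.run(
        ["python3", "spellcheck.py", "--wordlist", args.wordlist, args.benchmark],
        capture_output=True, text=True, check=True,
    )

    # 2. Evaluate the findings against the ground truth (naive line-by-line comparison).
    with open(args.ground_truth) as ground_truth_file:
        expected = ground_truth_file.read().splitlines()
    actual = result.stdout.splitlines()
    matching = sum(1 for a, e in zip(actual, expected) if a == e)
    print(f"{matching}/{len(expected)} lines match the ground truth")


if __name__ == "__main__":
    main()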

run-full-evaluation.sh

The scope of this script is to run the full set of experiments needed to reproduce the results from the paper.

This template encourages basing this script on the "unit-level" run-benchmark.py script; that way, how the experiments are run can be modified in a single place, although this is not always feasible. Ideally, this script itself should need little to no modification.
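A minimal sketch of such a wrapper, assuming the benchmark layout of the fictional Spellcheck example and the run-benchmark.py interface sketched above, might be:

#!/bin/sh
# Run every benchmark through the unit-level script (sketch; paths are hypothetical).
set -e

for benchmark in benchmarks/*.txt; do
    ./run-benchmark.py "$benchmark" "benchmarks/ground-truth/$(basename "$benchmark")"
done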

run-reduced-evaluation.sh

The scope of this script is to run a reduced set of experiments with which the main trends and conclusions from the paper can be verified. This script is to be used during artifact evaluation if the run-full-evaluation.sh script cannot finish in a reasonable amount of time (which is often the case).

This template encourages basing this script on the "unit-level" run-benchmark.py script; that way, how the experiments are run can be modified in a single place, although this is not always feasible. Ideally, this script itself should need little to no modification.

run.sh

You should replace <TODO name & conf> with the name of your tool and the target conference. For example, Spellcheck CONF'99. You should also replace <TODO fixed name> with a good fixed name for the Docker container. This fixed name can be used when instructing reviewers to interact with the container, e.g., to copy a file out to their host machine. Since containers are given random names by default, you can avoid confusion by providing a fixed name.
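For example, assuming the fixed name spellcheck-conf99-artifact and a (hypothetical) result file inside the container, a reviewer could run:

$ docker cp spellcheck-conf99-artifact:/root/results.csv .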

VERSION

This file starts at a reasonable first version, so you do not need to edit it at first. You are, however, expected to update it as you keep track of the version of the artifact. As in the tool repository, it is a good idea to keep this version number aligned with any other version-tracking mechanisms in use.

Be aware that this version will not necessarily match the version of your tool from the tool repository, or the version of the base Docker image of your tool.

A note on stacking Docker images

This template encourages the use of the tool Docker image as a base for the artifact Docker image. This avoids duplication of code and dependency information, and offers a more stable and reproducible setup.

However, in some cases, multiple Docker images may be used. This can occur when the tool makes use of external dependencies which must be used through their own Docker image. It can also occur if the work's contributions contain both a tool and a new benchmark/dataset, in which case both must be published, potentially in different Docker images.

In any case, if you have to "stack" Docker images, you might be tempted to do the following in the artifact Dockerfile:

# Contains dependency at `/root/tool`.
FROM external-dependency/tool:0.1.0 AS dependency
# Contains our tool at `/root/tool`.
FROM my-tool:0.1.0 

COPY --from=dependency /root/tool /root/dependency

# Now, we should have both `/root/tool` and `/root/dependency`.
# Setup code for the rest of the artifact...

While this might result in all of the important files of external-dependency/tool:0.1.0 ending up in the right place in the final image (provided there are no conflicts between external-dependency/tool:0.1.0 and my-tool:0.1.0), the dependency might not work as expected. For example, if it needs system packages, they will not be present in the final image: the image is built only from the last FROM stage, so anything installed system-wide in the dependency image is lost unless it is explicitly copied over.

One way to get around that is to copy everything from the previous image(s) like so:

FROM external-dependency/tool:0.1.0 AS dependency
FROM my-tool:0.1.0

COPY --from=dependency / /

# Setup code for the rest of the artifact...

This might seem inelegant, but it works. COPY does not replace the entirety of / in the final image; it merges the dependency image's filesystem into it, adding the files that were missing (i.e., whatever was installed via the system package manager).

However, at least for images ultimately based on ubuntu:22.04 and ubuntu:24.04, this results in a broken apt, meaning that other packages needed by the artifact cannot easily be installed after that COPY command.

A known fix to this is the following:

# See https://askubuntu.com/a/1272402.
RUN rm /var/lib/dpkg/statoverride && \
    rm /var/lib/dpkg/lock && \
    dpkg --configure -a && \
    apt-get --fix-broken install -y

Using the artifact

As established in the previous sections, a big advantage of setting up your artifact in this way is that you can both use the artifact to produce the results of your paper and give it to artifact evaluators to reproduce the results.

Producing the results for a paper

Storing the results and producing data for the paper

When producing results for a paper, it is a good idea to store the raw results themselves, rather than only metrics computed from them. For example, you might be tempted to store just the true/false positive/negative rates of an experiment and throw away everything else to save space; however, if you later want to compute some other metric from the same data, that will be impossible.

In some cases, the raw results are simply too large (even in compressed form). In such cases, you might be able to filter some of it out, keeping only a core of "most useful" data, to reduce the size down to an acceptable amount. Barring exceptionally incompressible data, however, you should probably simply compress the results and keep everything.

It is a good practice to have the raw results be as read-only as possible. One such way is to immediately compress them after the experiment (e.g., in a .tar.xz tarball), and then create scripts to interpret the results from the tarball itself.

To keep this efficient, you can use a fixed directory within the tarball for the result files. A script then only has to extract those files (e.g., CSV files), parse them, perform any necessary calculations, and output the results (e.g., as TikZ code, or simply as text on stdout).

While there is some slowness associated with this approach (as you have to de-compress the raw results), it is arguably not really noticeable, as the results often only need to be extracted once (and then stored somewhere if further inspection is necessary). At the same time, the raw results can now be easily packaged with the artifact on an archival repository (e.g., on Zenodo) to provide even better means to other researchers of reproducing your results.
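As an illustration of this pattern (the tarball name, file layout and metric below are all hypothetical), a small script can read the result files straight out of the tarball:

#!/usr/bin/env python3
"""Compute a summary metric directly from the archived raw results (sketch)."""

import csv
import io
import tarfile

# Hypothetical layout: results/<benchmark>/summary.csv inside the tarball.
with tarfile.open("raw-results.tar.xz", "r:xz") as tar:
    total_words = correctly_spelled = 0
    for member in tar.getmembers():
        if not member.name.endswith("summary.csv"):
            continue
        data = tar.extractfile(member).read().decode()
        for row in csv.DictReader(io.StringIO(data)):
            total_words += int(row["words"])
            correctly_spelled += int(row["correctly_spelled"])

print(f"Correctly spelled words: {correctly_spelled}/{total_words}")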

Running experiments in Docker containers

When using Docker containers, it is a good idea to have them automatically removed when done by running them with the --rm option. However, a crash would then cause all data inside the container to be lost (in some cases, multiple hours' or even days' worth of data). For that reason, it is good practice to bind mount a host directory into the container, essentially mapping a directory on the host machine to a directory in the container. This can be achieved through the --volume option, like so:

$ docker run -ti --rm --volume /path/to/host-dir:/path/to/container-dir <image>

For instance, we can map $HOME/experiment_target_YYYY-MM-DD on the host to /root/output in the container, and direct all output of the experiment inside the container to /root/output. That way, even in the event of a crash, we will be able to keep whatever data has been produced up until the crash.
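Concretely, reusing the artifact image from the example (the image name and paths are just for illustration), this might look like:

$ docker run -ti --rm \
    --volume "$HOME/experiment_target_$(date +%Y-%m-%d)":/root/output \
    spellcheck-conf99-artifact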

Another benefit of this use of bind mounts is that containers are less likely to run out of space. Since containers' "default" storage is commonly placed on a shared volume on the host, machines that see heavy Docker usage across multiple users run the risk of filling up that shared volume (which might ruin an experiment). By using a bind mount as described above, you ensure that the directory containing the largest and most important data from your experiment is not also shared with other running containers.

Providing a reproduction environment for other researchers

The same artifact can be used by other researchers (e.g., in the context of artifact evaluation) to reproduce the results of the paper easily and in some cases even accurately down to the bit level. Since the very same environment was used to produce the results in the first place, we can be very confident of its robustness and its ability to reproduce the same results.