Introduction
This template is intended to be used as a base when building a new artifact (e.g., tools, proofs, data) to accompany a published paper. It is heavily geared towards Computer Science (as that is my own field), but may be repurposed for other scientific domains.
The documentation is divided into the following sections:
- Motivation: the motivation behind this template;
- Portability: how to ensure that the artifact can work on different machines;
- Architecture: a high-level description of the architecture of the artifact template;
- A simple example: walk-through of a concrete example of an artifact using the template;
- The artifact template in detail: a detailed technical guide on using the template for your own artifacts.
Motivation
This template exists mainly for two reasons:
- There is a real problem of missing, incomplete or otherwise unsatisfactory research artifacts in Computer Science. In my personal experience, this seems to be caused mostly by inertia (or laziness), as proper preparation of an artifact for publication is (or should be) in some ways similar to the practice of software release, which of course demands some organizational effort. A template could ease some of that pain by providing a base to construct upon.
- Since the work of constructing an artifact should be a core part of research (and not just an afterthought), and in the interest of reproducibility, the artifact itself could (and should) be used for any experiments presented in the paper. This ensures that the very environment which produced the data shown in the paper is available to anyone, such as artifact evaluators, essentially guaranteeing a smooth reproduction of the experiments.
Of course, this should be done during the final phase of experimentation, when there is a clear understanding of which experimental results are needed for the paper. It is not worthwhile to develop a stable, publishable artifact during the prototyping phase, when experimental results are not yet clear and the fate of the paper is still largely undecided.
Portability
To ensure portability and ease the setup of the evaluation environment, this template makes heavy use of Docker. Of course, this can be swapped out for any other such tool, so long as the artifact remains reasonably portable between different platforms and configurations. Common alternatives include VirtualBox and Nix.
In any case, it is good practice to thoroughly document software and hardware dependencies, and to provide solutions or alternatives when possible.
Architecture
The goals of this artifact template are to produce an artifact which can:
- Be used to both produce (by yourself) and reproduce (by others) experimental results for/from the associated paper;
- Be used by others just like any piece of software (if applicable), in various contexts (such as research—e.g., other papers, or industry).
As such, this template proposes to split the work into two repositories:
- The tool repository, containing the software itself, to be independently maintained and distributed as needed;
- The artifact repository, building on the tool repository, but also adding any additional material needed to (re)produce experimental results for/from the associated paper.
This split allows each part of the work to be maintained appropriately. On the one hand, the tool repository should be both published in a live repository (e.g., on GitHub), so that the authors can continue to maintain and distribute the tool and others can fork and contribute to it, and archived, to allow for long-term preservation and enable reproduction in the future. On the other hand, the artifact repository should also be archived (for the same reasons), but it should probably not be published in a live repository, as it is not intended to be distributed or to accept contributions from others; after all, its only purpose is reproducing the results of the associated paper.
This way, the two repositories can stay clean, containing only relevant content, and avoiding confusion regarding the two main "target groups": developers or researchers modifying the tool are probably only interested in the tool repository, while artifact evaluators are probably only interested in the artifact repository.
A simple example
In order to understand how the template works, we are going to apply it to a concrete example. For the sake of this example, we will imagine a made-up paper, titled Spellcheck: Checking New Spells, which introduces a new spellchecking approach and tool called Spellcheck, and was published in the prestigious CONF'99 conference.
Following the template, this section will walk you through the two repositories: the spellcheck tool repository and the spellcheck-conf99-artifact artifact repository.
For the sake of clarity and brevity, this example is obviously simplified; if you are interested in a real use case, see https://github.com/binsec/rosa and https://zenodo.org/records/14724251.
The spellcheck tool repository
The repository for the spellcheck tool can be found under example/spellcheck-repo.
The Spellcheck approach is implemented here in the form of a Python script, spellcheck.py. It has an external dependency on another tool, called similar-word-finder (found under the directory with the same name). Such a dependency can be materialized in many ways; for example, if similar-word-finder has its own public repository, it can be "linked" to this repository via a Git submodule, conveniently pinning its version down to the exact commit ID.
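For instance, a sketch of how such a pin might be set up (the repository URL and commit ID here are hypothetical):

$ git submodule add https://github.com/example/similar-word-finder.git similar-word-finder
$ cd similar-word-finder && git checkout <commit-id> && cd ..
$ git commit -am "Pin similar-word-finder to <commit-id>"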
General "must-have"s
The following essential files are present in the repository:
- README.md. It gives a short explanation of the approach and tool, listing the dependencies needed in order to install and use it. It also links to the contributing guide (see CONTRIBUTING.md below) and provides a citation for other papers to use (see CITATION.cff below).
- AUTHORS. It contains a simple list of authors and their emails. In this case, there is a single author: Jane Doe.
- LICENSE. It contains the license of the tool. In this case, it is the LGPL-2.1.
- CITATION.cff. It contains metadata that can help cite this repository. See https://citation-file-format.github.io/.
- CONTRIBUTING.md. It contains a guide to help others (researchers or industry practitioners) contribute new features or bug fixes to the tool.
Docker image
In order to facilitate the use of the tool on different machines, as well as to make the construction of the artifact easier, this repository is also set up to generate a Docker image of the tool, which can then be used to run a Docker container containing the tool. Portability aside, this may also be convenient in some cases where the tool must run in an isolated environment (e.g., in the cases of malware detection or fuzzing).
The image can be built locally with the build.sh utility script, provided that Docker (either Desktop or Engine) is installed on the machine. A container using the previously built image can then be started via the run.sh script.
These two scripts make use of the IMAGE and VERSION files. The IMAGE file defines the name of the generated image (in this case, spellcheck), while VERSION defines its version number. See Versioning below for more details.
Finally, the Dockerfile and .dockerignore determine how the image gets built. See https://docs.docker.com/reference/dockerfile/ for a detailed description of these files.
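For reference, a minimal sketch of what such a build script might do (the template's actual build.sh may differ):

#!/bin/sh
# Read the image name and version from the IMAGE and VERSION files,
# then build and tag the Docker image accordingly (sketch only).
IMAGE=$(cat IMAGE)
VERSION=$(cat VERSION)
docker build -t "$IMAGE:$VERSION" .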
Documentation
Most tools need thorough documentation, which should not only explain how to use them in their intended context, but also how to extend them and use them in new contexts (something that is very common in research). In this case, since the example is very simple, there is a single file in the doc directory, but in a real use case the documentation would be much more detailed.
Versioning
Semantic Versioning is used (in tandem with Git) for the versioning of Spellcheck. Concretely, Spellcheck versions materialize through Git tags and updates to the VERSION file.
This makes it easier to reproduce results obtained with specific versions of Spellcheck (such as the one used in the fictional paper), and citations to the repository/tool can specify the version to avoid ambiguities as Spellcheck evolves.
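For example, a reader wanting to rebuild the exact version used in the paper could check out the corresponding tag and build the matching image (the URL and tag below are hypothetical):

$ git clone https://github.com/example/spellcheck.git && cd spellcheck
$ git checkout v1.2.0
$ ./build.sh    # builds the spellcheck image at that pinned version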
The spellcheck-conf99-artifact artifact repository
The repository for the spellcheck artifact can be found under example/spellcheck-conf99-artifact.
As a reminder, this repository is intended for artifact evaluators at the Artifact Evaluation track of the fictional CONF'99 conference. As such, its contents are in reality heavily dictated by the guidelines of the Artifact Evaluation track. For example, the README.md file is loosely based on the guidelines for ICSE'25.
Docker image
Apart from the files and general structure imposed by the guidelines, this repository shares the same Docker infrastructure as the tool repository. One subtlety is that, as you can see in the Dockerfile, the image of the artifact is based on the image of the tool from the tool repository. This helps to both (1) pin the version of Spellcheck and (2) reuse the Spellcheck Docker image, but it is not strictly necessary; we could also use Git submodules, but we would probably have to partially re-implement the dependency installation and setup done in the Spellcheck tool repository. Also see A note on stacking Docker images for more details.
Data needed for the (re)production of experiments
The following files and directories are new:
- benchmarks/: this directory contains a list of benchmarks on which Spellcheck is evaluated. In the case of Spellcheck, a benchmark is a text file potentially containing misspelled words. In order to evaluate the precision of Spellcheck, the "ground truth" (i.e., correct spelling fixes) version of each text file is provided under benchmarks/ground-truth. In reality, this benchmarks/ directory may be organized very differently based on the "target" of the tool, or it might even be a Git submodule pointing to a different repository which contains the benchmark. Finally, in the case where the benchmark is part of the contributions of the paper (and thus an artifact), it might have its own tool repository which will then get referenced here.
- wordlists/: since Spellcheck also takes a list of correctly spelled words as input, such lists need to be provided. Again, the same considerations apply here as for the benchmarks/ directory; it could also be an external Git repository or a Docker image we build upon (see the layout sketch after this list).
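As an illustration only, the layout of these directories in the running example might look something like this (the individual file names are hypothetical):

benchmarks/
    chapter-1.txt
    chapter-2.txt
    ground-truth/
        chapter-1.txt
        chapter-2.txt
wordlists/
    english.txt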
Utility scripts
The following utility scripts can be used (e.g., by the artifact evaluators) to simplify the reproduction of the results of the paper:
run-full-evaluation.sh
: this script runs all of the experiments needed to reproduce the results from the paper. In this case, it is very simple, but in reality it most often translates to months or even years of CPU time.run-reduced-evaluation.sh
: this script runs a selected benchmark only, in the interest of time. In reality, this is often done as the full evaluation is infeasible given the deadlines of the reviewers, so a reduced evaluation that is still capable of showing e.g., the trends and conclusions from the paper is preferred.run-benchmark.py
: this script runs a single benchmark. As such, it can both be used to perform the experiments in the first place (by the authors) and reproduce selected experiments or even to be tried out on new benchmarks (by the reviewers).
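For instance, an evaluator might invoke these scripts along the following lines (the exact arguments of run-benchmark.py are hypothetical and depend on how the script ends up being implemented):

$ ./run-full-evaluation.sh          # all experiments from the paper
$ ./run-reduced-evaluation.sh       # a smaller, time-friendly subset
$ ./run-benchmark.py benchmarks/chapter-1.txt    # a single (hypothetical) benchmark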
Versioning
Again, Semantic Versioning is used (in tandem with Git) for the artifact
repository. The only difference is that, in this case, we also have to choose the version of the
Spellcheck tool used in the artifact. This is easy to do via the base Docker image selected in the
Dockerfile
. Again, the VERSION
file defines the version of the entire repository and generated
image, so it is possible that the version of the Spellcheck tool and the version of the Spellcheck
artifact will not be the same.
The artifact template in detail
This section explains how to set up an artifact using this template in detail, including the changes that need to be made to the individual template files.
It also explains how to use the artifact to both run the experiments for the paper and to reproduce the experiments (e.g., for an Artifact Evaluation track).
Building the artifact
Following what was established in the Architecture section, the artifact should be built before the main paper experiments begin, but after the prototyping phase. A good rule of thumb is to start working on the artifact when the list of experiments needed for the final paper has more or less been established.
Creating the Git repositories
The tool itself should be in its own Git repository. If you already have a Git repository, either because you had one from the prototyping phase you'd like to keep or because you're forking an existing tool, you are done with this step. Otherwise, you should create one now with git init.
This repository should remain private for now, as making it public may hurt the double-blind review process; consult your target conference's guidelines first.
Similarly, you should create a new, separate Git repository for the artifact. This repository will transparently pull in the tool (defined in the previous repository), but it will also contain other data relevant specifically to the experiments, which has no place in the tool's "primary" repository.
Versioning
This template encourages the use of Semantic Versioning in tandem with Git tags to aid in reproducibility and debugging.
In simple terms, once you are happy with the state of the artifact, you should:
- Edit VERSION (as well as any other version-tracking files, depending on the programming language(s) used) in the tool repository, commit and tag (a command sketch is given after this list);
- Edit VERSION in the artifact repository, commit and tag.
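Sketched as shell commands for the tool repository (the version number here is hypothetical; the artifact repository follows the same pattern):

$ echo "1.2.0" > VERSION
$ git add VERSION
$ git commit -m "Bump version to 1.2.0"
$ git tag -a v1.2.0 -m "Release 1.2.0"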
The tool repository
In the Git repository of the tool, copy all files from the template/tool-repo/ directory of the template. If you are not starting from a fresh repository, make sure not to overwrite any existing versions of these files. For example, you can run:
$ rsync -a -v --ignore-existing \
/path-to-template/template/tool-repo/* /path-to-tool-repo/
Once that is done, you can now go through each file and edit it as described below.
AUTHORS
You should delete the contents of this file and replace them with the names (and emails, if they so wish) of the authors. A good format to follow is Firstname Lastname <email_address>, with one author per line. For example, Jane Doe <jane.doe@example.com>.
build.sh
You should replace TODO with the name of your tool.
CITATION.cff
You should delete the contents of this file and replace them with valid metadata. You can find more about the format of this file on https://citation-file-format.github.io/, and you can use a tool such as https://citation-file-format.github.io/cff-initializer-javascript/ to generate a valid CITATION.cff file.
This file is not a priority. In fact, you will most likely obtain all of the necessary information some time after the acceptance of the associated paper, so you can come back to it at a later time.
CONTRIBUTING.md
You should replace TODO with the name of your tool in the heading. You should also write a thorough contributing guide, including developer dependencies, a guide on how to set up a development environment and so on.
This file is not a priority. You should take care of this before making the repository public, but you obviously do not need it during the experimental phase.
Dockerfile
You should delete the contents of this file and replace them with the appropriate code to generate a usable Docker image for your tool. You can find the full reference at https://docs.docker.com/reference/dockerfile/.
.dockerignore
This file contains some reasonable defaults, i.e., files that you will most likely want to omit from the generated Docker image. However, you should adapt it to the needs of the tool. For example, if your tool's build process generates build artifacts in a build/ directory, that directory should probably also be added to the .dockerignore (as well as the .gitignore).
IMAGE
You should delete the contents of this file and replace them with the name of the Docker image of your tool. This should most likely just be the name of the tool in lowercase.
LICENSE
You should delete the contents of this file and replace them with an actual license, preferably an open source license.
README.md
You should replace the TODOs with appropriate text throughout the file. You can of course edit, remove or add text as you see fit, depending on the context of your tool.
This file is not a priority, but it is often very useful to add some information such as dependencies as soon as you have it. If multiple people are working on this repository, it will also make collaboration easier if it is kept up to date with information about the dependencies or the build process of the tool.
run.sh
You should replace TODO with the name of your tool.
VERSION
This file starts at a reasonable first version, so you do not need to edit it at first. You are, however, expected to edit it as you keep track of the version of the tool. It is a good idea to make sure this version number stays aligned with other version-tracking mechanisms (e.g., in pyproject.toml for Python-based tools or in Cargo.toml for Rust-based tools).
The artifact repository
In the Git repository of the artifact, copy all files from the template/artifact-repo/ directory of the template.
Once that is done, you can now go through each file and edit it as described below.
AUTHORS
You should most likely copy the AUTHORS file from the tool repository.
build.sh
You should replace <TODO name & conf> with the name of your tool and the target conference. For example, Spellcheck CONF'99.
Dockerfile
You should delete the contents of this file and replace them with the appropriate code to generate a usable Docker image for the artifact. You can find the full reference at https://docs.docker.com/reference/dockerfile/.
This template encourages basing the Docker image on the Docker image of the tool repository.
For simple cases, this is as easy as adding FROM <TOOL>:<VERSION> at the top of the Dockerfile. For more complex cases, see A note on stacking Docker images.
.dockerignore
This file contains some reasonable defaults, i.e., files that you will most likely want to omit from the generated Docker image. However, you should adapt it to the needs of the artifact. For example, if your artifact's build process generates build artifacts in a build/ directory, that directory should probably also be added to the .dockerignore (as well as the .gitignore).
IMAGE
You should delete the contents of this file and replace them with the name of the Docker image of your artifact. This should most likely just be <tool_name>-<conf_name>-artifact. For example, spellcheck-conf99-artifact.
LICENSE
You should delete the contents of this file and replace them with an actual license, preferably an open source license. If you are including material that is under a different license (e.g., a dataset), make sure to explain that it does not fall under the same license as the rest of the repository (and check for potential license conflicts).
README.md
In the first heading, on the first line, you should replace the TODOs with the name of the associated paper and the name of the conference.
The rest of the file is laid out loosely following the Artifact Evaluation track guidelines for ICSE'25, but of course this will depend on your target conference. Consult their guidelines and adapt the file accordingly.
run-benchmark.py
The scope of this script is to run your tool on a single experimental unit and evaluate the result. For example, if you are building a static analysis tool, it should probably:
- Run your tool on a single program;
- Evaluate the findings of your tool (e.g., using some ground-truth report).
You can either choose to use Python for the easy argument parsing and command running, or, if this is not appropriate, you can delete this script entirely and replace it with a different one. In that case, remember to update the other scripts accordingly; you can verify with grep "run-benchmark.py" *. If you do wish to use this script, you can modify the existing argument parsing and add code to launch your tool with the given benchmark/input, as well as add code to evaluate the results.
run-full-evaluation.sh
The scope of this script is to run the full set of experiments needed to reproduce the results from the paper.
This template encourages basing this script on the "unit-level" run-benchmark.py script. This makes it easier to modify how the experiments are run in a single place, but is not always feasible. If possible, this script should remain essentially unchanged.
run-reduced-evaluation.sh
The scope of this script is to run a reduced set of experiments with which the main trends and conclusions from the paper can be verified. This script is to be used during artifact evaluation if the run-full-evaluation.sh script cannot finish in a reasonable amount of time (which is often the case).
This template encourages basing this script on the "unit-level" run-benchmark.py script. This makes it easier to modify how the experiments are run in a single place, but is not always feasible. If possible, this script should remain essentially unchanged.
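For illustration, a reduced-evaluation script built on top of run-benchmark.py might look roughly like this (the benchmark names and argument convention are hypothetical):

#!/bin/sh
# Run only a hand-picked subset of the benchmarks, reusing the
# unit-level script (sketch; names and arguments are hypothetical).
for benchmark in chapter-1 chapter-3; do
    ./run-benchmark.py "benchmarks/$benchmark.txt"
done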
run.sh
You should replace <TODO name & conf> with the name of your tool and the target conference. For example, Spellcheck CONF'99. You should also replace <TODO fixed name> with a good fixed name for the Docker container. This fixed name can be used when instructing reviewers to interact with the container, e.g., to copy a file out to their host machine. Since containers are given random names by default, you can avoid confusion by providing a fixed name.
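For example, if the container is given the fixed name spellcheck-conf99-artifact (the name and path below are purely illustrative), the instructions for reviewers can simply say:

$ ./run.sh
$ docker cp spellcheck-conf99-artifact:/root/results ./results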
VERSION
This file starts at a reasonable first version, so you do not need to edit it at first. You are, however, expected to edit it as you keep track of the version of the artifact. It is a good idea to make sure this version number stays aligned with other version-tracking mechanisms (e.g., in pyproject.toml for Python-based tools or in Cargo.toml for Rust-based tools).
Be aware that this version will not necessarily match the version of your tool from the tool repository, or the version of the base Docker image of your tool.
A note on stacking Docker images
This template encourages the use of the tool Docker image as a base for the artifact Docker image. This avoids duplication of code and dependency information, and offers a more stable and reproducible setup.
However, in some cases, multiple Docker images may be used. This can occur when the tool makes use of external dependencies which must be used through their own Docker image. It can also occur if the work's contributions contain both a tool and a new benchmark/dataset, in which case both must be published, potentially in different Docker images.
In any case, if you have to "stack" Docker images, you might be tempted to do the following in the artifact Dockerfile:
# Contains dependency at `/root/tool`.
FROM external-dependency/tool:0.1.0 AS dependency
# Contains our tool at `/root/tool`.
FROM my-tool:0.1.0
COPY --from=dependency /root/tool /root/dependency
# Now, we should have both `/root/tool` and `/root/dependency`.
# Setup code for the rest of the artifact...
While this might result in all of the important files of external-dependency/tool:0.1.0 being put in the right place in the final image (provided there are no conflicts between external-dependency/tool:0.1.0 and my-tool:0.1.0), the dependency might not work as expected. For example, if it needs to install system packages, they will not be installed in the final image, as the last FROM command essentially overwrites the dependency's installation with its own set of system packages.
One way to get around that is to copy everything from the previous image(s) like so:
FROM external-dependency/tool:0.1.0 AS dependency
FROM my-tool:0.1.0
COPY --from=dependency / /
# Setup code for the rest of the artifact...
This might seem inelegant, but it will work. COPY will not overwrite the entirety of /, but rather copy over the missing files (i.e., whatever was installed via the system package manager). However, at least for images ultimately based on ubuntu:22.04 and ubuntu:24.04, this will result in a broken apt, meaning that other packages needed by the artifact cannot easily be installed after that COPY command.
A known fix to this is the following:
# See https://askubuntu.com/a/1272402.
RUN rm /var/lib/dpkg/statoverride && \
rm /var/lib/dpkg/lock && \
dpkg --configure -a && \
apt-get --fix-broken install
Using the artifact
As established in the previous sections, a big advantage of setting up your artifact in this way is that you can both use the artifact to produce the results of your paper and give it to artifact evaluators to reproduce the results.
Producing the results for a paper
Storing the results and producing data for the paper
When producing results for a paper, it is a good idea to store the raw results instead of only storing metrics computed from them. For example, you could keep just the true/false positive/negative rates of an experiment and throw away everything else to save space; however, if you later want to compute some other metric from the same data, it will be impossible.
In some cases, the raw results are simply too large (even in compressed form). In such cases, you might be able to filter some of it out, keeping only a core of "most useful" data, to reduce the size down to an acceptable amount. Barring exceptionally incompressible data, however, you should probably simply compress the results and keep everything.
It is good practice to keep the raw results as read-only as possible. One such way is to immediately compress them after the experiment (e.g., into a .tar.xz tarball), and then create scripts to interpret the results from the tarball itself.
For better performance, you can set a fixed directory within the tarball where the results get stored. Then, a script would only have to extract these result files (e.g., CSV files), parse them, perform any necessary calculations on them and output the results (e.g., in TikZ form, or even simply as text on stdout).
While there is some slowness associated with this approach (as you have to decompress the raw results), it is arguably not really noticeable, as the results often only need to be extracted once (and then stored somewhere if further inspection is necessary). At the same time, the raw results can now be easily packaged with the artifact on an archival repository (e.g., on Zenodo), giving other researchers even better means of reproducing your results.
Running experiments in Docker containers
When using Docker containers, it is a good idea to auto-remove them when done by running them with the --rm option. However, a crash might then cause all data in the container to be lost (in some cases, multiple hours' or even days' worth of data). For that reason, it is good practice to bind mount a host directory into the container, essentially mapping a directory on the host machine to a directory in the container. This can be achieved through the --volume option, like so:
$ docker run -ti --rm --volume /path/to/host-dir:/path/to/container-dir <image>
For instance, we can map $HOME/experiment_target_YYYY-MM-DD on the host to /root/output in the container, and direct all output of the experiment in the container to /root/output. That way, even in the event of a crash, we will be able to save whatever data has been produced up until the crash.
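Putting it all together, the full command might look something like this (the image name, tag and entry script are taken from the running example and are illustrative only):

$ docker run -ti --rm \
    --volume "$HOME/experiment_target_YYYY-MM-DD":/root/output \
    spellcheck-conf99-artifact:1.0.0 ./run-full-evaluation.sh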
Another benefit of this use of containers is that they are less likely to run out of space. Container storage is commonly placed in a shared location on the host, so on machines that see heavy Docker usage across multiple users, you run the risk of running out of space on that shared volume (which might ruin an experiment). By using a bind mount as described above, you ensure that the directory containing the largest and most important data from your experiment is not also being filled up by other running containers.
Providing a reproduction environment for other researchers
The same artifact can be used by other researchers (e.g., in the context of artifact evaluation) to reproduce the results of the paper easily and in some cases even accurately down to the bit level. Since the very same environment was used to produce the results in the first place, we can be very confident of its robustness and its ability to reproduce the same results.