How Are Award-winning Systems Research Artifacts Prepared (Episode 1)

Editor’s notes: As more and more systems research conferences adopt artifact evaluation (AE), we have invited Manuel Rigger and Arpan Gujarati to share their practices and experience in preparing award-winning systems research artifacts; both of their artifacts received Distinguished Artifact Awards at OSDI’20.

In this first episode, Manuel Rigger from ETH Zurich tells us how he prepared his award-winning artifacts for OSDI’20 and OOPSLA’20. Manuel also generously shared his OSDI’20 AE review.

Manuel Rigger is a postdoctoral researcher in the Advanced Software Technologies (AST) Lab at ETH Zurich, mentored by Zhendong Su. He works on improving the reliability of software, drawing on and contributing to the fields of systems, software engineering, and programming languages. In his most recent work, he proposed techniques for automatically testing Database Management Systems, with which he found over 450 unique, previously unknown bugs in widely-used systems such as SQLite, MySQL, PostgreSQL, TiDB, and CockroachDB. Manuel is on the faculty job market this year.

T: Could you introduce your artifact?

M: Our OSDI‘20 paper describes a new approach for automatically finding logic bugs in Database Management Systems (DBMSs), called Pivoted Query Synthesis (PQS). PQS tackles the test oracle problem, that is, it can infer whether a result set returned for a given query and database is correct. It also tackles the test case generation problem by generating meaningful test cases. We implemented PQS as a tool called SQLancer and used it to uncover close to 100 unique, previously unknown bugs in the widely-used DBMSs SQLite, MySQL, and PostgreSQL. Most of the bugs have been fixed by the developers of these DBMSs.
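To make the oracle idea more concrete, here is a minimal, hypothetical Python sketch against SQLite (SQLancer itself is written in Java, and the real PQS implementation randomly generates expressions and evaluates them on the pivot row rather than hard-coding a trivially true predicate as done here):

```python
import random
import sqlite3

# Minimal, hypothetical sketch of the PQS oracle idea (not SQLancer's actual
# Java implementation): pick a random "pivot" row, synthesize a query whose
# WHERE predicate must hold for that row, and report a potential logic bug if
# the pivot row is missing from the result set.

def check_pivot_containment(seed: int = 0) -> None:
    random.seed(seed)
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t(a INTEGER, b INTEGER)")

    # Phase 1: generate a small random database.
    rows = [(random.randint(-10, 10), random.randint(-10, 10)) for _ in range(20)]
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)

    # Phase 2: pick a pivot row and build a predicate that is true for it.
    pivot = random.choice(rows)
    predicate = f"a = {pivot[0]} AND b >= {pivot[1]}"

    # Phase 3: run the query and verify that the pivot row is contained in
    # the result set; if not, the DBMS likely computed an incorrect result.
    result = con.execute(f"SELECT a, b FROM t WHERE {predicate}").fetchall()
    assert pivot in result, f"pivot row {pivot} missing for WHERE {predicate}"

if __name__ == "__main__":
    check_pivot_containment()
    print("pivot row was contained in the result, as expected")
```

The containment check at the end is the essence of the oracle: if the DBMS omits a row for which the predicate is known to hold, we have likely found a logic bug.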

In our OSDI’20 artifact, we included both a snapshot of SQLancer and metadata for all the bugs that we found (e.g., links to the bug reports, the bug statuses, and the test cases to reproduce the bugs). SQLancer’s external dependencies are minimal: it requires only a Java installation and Maven, a build automation tool. Both are available on most platforms, which makes SQLancer easy to install and use. Thus, we did not include a Docker or VM image, in which we could have preinstalled the tool and its dependencies.

Besides comprehensive documentation, we also recorded a tutorial-style video, included in the artifact, in which we explain how to use and evaluate it.

SQLancer is highly reusable, which its architecture facilitates. SQLancer operates in roughly three phases: it generates a database, then a query, and finally uses a test oracle to validate whether the query’s result is correct. These phases correspond to three mostly-independent components. After developing PQS, we designed and implemented two additional novel test oracles in SQLancer, reusing the database and query generation components. We published this follow-up work at ESEC/FSE‘20 and OOPSLA‘20. Overall, we have reported over 450 unique bugs in nine DBMSs, over 410 of which have been fixed. The latest version of SQLancer, implementing all three approaches, is available on GitHub, where it has been starred over 600 times. It has been adopted by companies, some of which also contributed to SQLancer to support their DBMSs, further demonstrating its reusability.
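To illustrate how such a component split enables reuse, here is a rough, hypothetical sketch (all names are invented; SQLancer’s actual interfaces are written in Java and differ in their details) in which a new test oracle only needs to provide a check step, while the database and query generation components are reused unchanged:

```python
from typing import Callable, Protocol

# Hypothetical sketch of the three mostly-independent components described
# above (all names are invented for illustration; SQLancer's real interfaces
# are written in Java and look different).

class DatabaseGenerator(Protocol):
    def generate(self, con) -> None: ...   # phase 1: create a random database

class QueryGenerator(Protocol):
    def generate(self, con) -> str: ...    # phase 2: produce a random query

class TestOracle(Protocol):
    def check(self, con, query: str) -> None: ...  # phase 3: validate the result

def testing_loop(new_connection: Callable[[], object],
                 db_gen: DatabaseGenerator,
                 query_gen: QueryGenerator,
                 oracle: TestOracle,
                 iterations: int = 1000) -> None:
    """Reusable driver: swapping in a different TestOracle (e.g., PQS or a
    follow-up oracle) reuses the database and query generation components."""
    for _ in range(iterations):
        con = new_connection()
        db_gen.generate(con)
        query = query_gen.generate(con)
        oracle.check(con, query)
```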

T: What did the reviewers like about your research artifact?

M: The reviewers all appreciated that our work on SQLancer went beyond what is typically expected from an artifact. We actively maintain SQLancer on GitHub, where it has received significant interest. Many organizations and companies have adopted it and have contributed support for their own DBMSs, which we highlighted in the artifact documentation. PQS has even been implemented outside of SQLancer at least twice; for example, PingCAP implemented a Go version of SQLancer, including PQS.

Thanks to feedback from users and contributors, we were able to enhance SQLancer and make it easier to use before packaging it as part of the artifact, and the reviewers did not seem to have trouble installing and executing SQLancer.

For example, we lowered the Java version requirement so that versions down to Java 8 are supported, and made the Maven configuration work with development IDEs other than Eclipse. SQLancer and PQS are also extensively tested, which contributed to fewer problems during reviewing.

Finally, the reviewers appreciated the extensive documentation. We provided detailed instructions on how the artifact could be evaluated, how to use and extend SQLancer, and how to reproduce and validate the bugs that SQLancer had found. The reviewers also liked the video tutorial in which we showed how certain aspects of the artifact could be evaluated.

Despite these positive points, the reviewers also suggested improvements. For example, one reviewer noted that we should add more comments in the source code, and provide a better description of the many options that SQLancer provides to control its behavior. This reviewer also found that some links in the bug database no longer worked, because the SQLite developers suspended their mailing list and migrated to a forum. Another reviewer would have liked to see instructions on how to automatically reduce bugs. The third reviewer had trouble downloading the artifact from Zenodo, which is why we provided two additional hosting options during the interactive reviewing phase.

T: What is your experience with building artifacts?

M: Overall, I have created four artifacts, all of which successfully went through the artifact evaluation processes at OSDI, ESEC/FSE, and OOPSLA. Two of them received a Distinguished Artifact Award, at OSDI‘20 and OOPSLA‘20. I also became familiar with the artifact evaluation process as a reviewer, and as a co-chair of the artifact evaluation at ECOOP‘19, ECOOP‘20, and ISSTA‘21. Serving as a co-chair has been a privilege, and I have learned a lot from my co-chairs and the excellent review committees, for example, by observing what reviewers appreciate when evaluating an artifact and what constitutes a stumbling block.

My PhD, which I obtained at the Johannes Kepler University Linz, had a major influence on how I build artifacts. Our lab closely collaborated with industry, and my PhD was partly funded by Oracle Labs, whose research engineers had offices in the same building.

This meant that I quickly adopted many software engineering best practices, which I still apply today. For example, continuous integration checks consisting of unit and integration tests run for every SQLancer pull request, using different compilers and Java versions, to ensure that the project stays compilable and compatible with ten different DBMSs, and to prevent bugs from being introduced.

In addition, we apply a number of code quality tools, like PMD, SpotBugs, Checkstyle, and an Eclipse formatter check, to ensure a consistent coding style and to avoid error-prone coding patterns. As another example, to facilitate the adoption of the tool, we provide updates and support on various platforms — SQLancer has both a Twitter account and a Slack workspace.

T: How much time did you spend on your artifact?

M: I probably spent too much time on developing SQLancer and evaluating PQS.

I am very enthusiastic about trying to have real-world impact, and the obvious way to do so in this line of work was to build a usable tool and report as many bugs as possible, which I did over a period of three months for PQS.

Finding and reporting bugs was very addictive for me, especially because for SQLite, which was the DBMS that we tested first, bugs were typically fixed within a day, providing a great sense of accomplishment. I was so excited that I woke up early every morning to check for new bugs and continue developing the tool. I stumbled on the second approach while on vacation — before the pandemic started — and reduced and reported bugs in the early morning and in the late evening. Typically, the bugs were fixed after I came back from exploring in the evening, so the cycle could start again. For our latest approach, which we implemented during the pandemic, we even tested seven DBMSs, in which we found and reported close to 200 bugs, while testing two would likely have been sufficient for a solid evaluation.

For the PQS artifact specifically, we basically had two choices: we could submit either an old version of SQLancer with a working PQS implementation or the latest version of SQLancer with a broken one. After finishing the PQS evaluation, I had put my energy into implementing and evaluating the two follow-up approaches rather than into maintaining the existing PQS implementation, for which I had removed the tests. I eventually decided to update the PQS implementation, which took almost two weeks of continuous effort, because I had to generalize the architecture of SQLancer to accommodate all three testing approaches. I was rewarded by finding a regression bug in SQLite, which was again quickly fixed. For the two follow-up approaches that we implemented in SQLancer, preparing the artifacts took one or two days.

T: Do you have suggestions for folks who are preparing artifacts, for example, during the system-building phase (before paper submission)?

M: I recommend thinking about reproducibility as a primary concern from the start. This is not only a service to the reviewers and other researchers, but also to oneself.

At the beginning of my PhD, I used the following workflow: I implemented an approach, ran some experiments, and saved the results in an Excel sheet or a CSV file, from which I computed statistics that I then manually added to the paper. When the paper was rejected, I found that I no longer remembered which exact commands and software versions I had used for the experiments, how I had extracted and computed the relevant statistics, and which data in the paper was still up-to-date after changing the tool. I had essentially accumulated technical debt by not implementing a structured and clean way of executing the experiments and summarizing the results.

Nowadays, I start my paper with a Makefile, which automates running the scripts that execute the experiments when necessary, creates the plots, outputs the statistics, and builds the paper, which then refers to the generated plots and statistics. As an intermediate step, the Makefile typically stores all the raw data (e.g., execution times) in an SQLite database, based on which a script generates a LaTeX file containing commands for computed constants that I can include in the paper (e.g., a command like \minExecutionTime for the fastest execution time). Any changes to the artifact or experiments are then immediately reflected in the paper, the computation of every result is traceable, and, ideally, the effort required to submit the artifact to the artifact evaluation process is low.
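As a small, hypothetical sketch of the last step of this pipeline (the file names, table schema, and command name below are made up for illustration), a script can read the raw measurements from the SQLite database and emit a LaTeX file defining one command per computed constant, which the paper then includes via \input:

```python
import sqlite3

# Hypothetical sketch: read raw measurements from the results database and
# emit a LaTeX file with one command per computed constant. Assumes a table
# experiments(execution_time REAL) exists in results.db; all names are
# illustrative, not taken from the actual artifact.

def emit_latex_constants(db_path: str = "results.db",
                         out_path: str = "constants.tex") -> None:
    con = sqlite3.connect(db_path)
    (min_time,) = con.execute(
        "SELECT MIN(execution_time) FROM experiments"
    ).fetchone()
    with open(out_path, "w") as f:
        # In the paper: "the fastest run took \minExecutionTime{} seconds"
        f.write(f"\\newcommand{{\\minExecutionTime}}{{{min_time:.2f}}}\n")

if __name__ == "__main__":
    emit_latex_constants()
```

A Makefile rule can regenerate constants.tex whenever the results database changes, so the numbers in the paper cannot silently go stale.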

T: What other suggestions do you have for a successful artifact?

M: In my opinion, the most important point is to make it easy for reviewers to evaluate the artifact.

The documentation should ideally contain detailed step-by-step instructions on how to evaluate the artifact (e.g., which files to inspect or which commands to execute) and should explicitly connect these steps with the contributions and claims of the paper. If the documentation provides only general instructions and a hello-world-sized example, reviewers might focus on less important or less tested parts of the artifact, which is more likely to result in issues and unexpected behavior. Providing a video tutorial seems to be an intuitive way to present many artifacts, and it was appreciated by the reviewers of our OSDI‘20 and OOPSLA‘20 artifacts.

I think it is often useful to package source code and tools in different formats. For example, rather than including only source code or only a Docker or VM image, both the source code and an image can be included to serve different types of reviewers. A reviewer who wants to understand the tool might prefer to read the source code directly, while another reviewer might prefer to mainly execute the tool and see it in action. While reviewers might have problems compiling and running source code due to dependency and installation issues, I also remember reviewers having problems with Docker or VM images due to platform or version issues, or because they perceived such images as not transparent enough.

T: What could we do to improve the artifact evaluation process and recognition of artifacts?

M: I believe that academic institutions could put a greater emphasis on non-paper research output for hiring decisions and tenure evaluation.

I am on the faculty job market and noticed that many institutions ask for the best N papers (where N is typically between 1 and 5), which is a great strategy to promote quality over quantity in research. It would be valuable to have a similar question that asks for N non-paper outputs, such as artifacts, research infrastructure, or start-ups, to incentivize creating them.

I think that we should also incentivize and formally recognize the maintenance of research artifacts within the artifact evaluation process. During my PhD, I built a system for safely and efficiently executing C/C++ code on top of the latest available version of the LLVM compiler framework. I wanted to compare its performance with that of other tools, many of which were also built on LLVM. This was painful, because the outdated versions were more difficult to build on a recent OS and did not support the latest language features. Furthermore, the tool I developed profited from improved optimizations implemented in the latest LLVM version, making it difficult to compare its performance fairly with tools that did not profit from these optimizations. I am confident that many of us have had similar experiences, where an artifact has become outdated and difficult to reuse. To incentivize not only creating, but also maintaining important research artifacts, we could add a category in the artifact evaluation process for re-submitting previously accepted artifacts that have been updated, and award a separate badge for that.

It would be useful if the artifact evaluation reviews could influence the outcome of the paper reviewing process, an idea that I believe has been around since artifact evaluation was initiated. A challenge is to design a process that is feasible and efficient for both authors and reviewers. Submitting a polished artifact together with the paper draft might be challenging for authors, and it would increase the reviewing burden, since artifacts for papers that are eventually rejected would also be reviewed. Similarly, it would be frustrating for authors if the artifact evaluation reviews could retrospectively influence the acceptance outcome. A trade-off could be to encourage authors to submit their artifacts after the rebuttal period and to give the evaluation committee enough time to evaluate the artifact, so that its quality can be incorporated into the decision on whether to accept the paper.

Note that creating and maintaining an artifact incurs a lot of work. I still spend about one day every week maintaining SQLancer, responding to requests, giving support, and implementing new features. Doing so is personally rewarding, but it is unclear whether it will provide additional value to my research career.

Cover image: Manuel holding an OSDI‘20 Distinguished Artifact Award while tobogganing at Wattener Lizum in the Austrian Alps.

Disclaimer: This post was created by Tianyin Xu for the SIGOPS blog. Any views or opinions represented in this blog are personal and belong solely to the blog author and the person interviewed; they do not represent those of ACM SIGOPS or its parent organization, ACM.