Experiments in Computer Science

Research in systems-oriented computer science involves the implementation of prototypes and their experimental evaluation. It is not uncommon, in particular for young researchers (that is, students), to spend most of their time on the implementation itself and little time on the evaluation of the system. By the time they discover that a proper experimental evaluation of a system takes lots of time and effort, it is often too late. I am writing this in order to help people avoid falling into this trap.

  1. Experiments need to be repeatable.
    This simple statement has lots of consequences. It means that it is not sufficient to run a program (or system) in a random environment with random inputs producing perhaps random outputs. Instead, everything impacting the experiment must be clearly documented. This usually concerns the execution environment, the input data set, and the output produced. It is good if the input data set is openly available. Ideally, you use data sets that have already been used in related work, since this enables comparisons. Making the implementation available is a great idea as well, since this allows others to follow up on your work. Finally, care must be taken that the output produced by running the program or system is verified to be correct.
  2. Experiments require proper data analysis.
    As we all know, it is not sufficient to toss a coin once to derive any conclusions about its behavior. The same applies to many experiments in computer science. It is often necessary to repeat experiments, and it is insufficient to state that an experiment has been repeated N times and to show average values. You need to explain why N is a reasonable number. Furthermore, an average might hide a large variation, leading to wrong conclusions. Hence, it is necessary to do some basic data analysis (like calculating confidence intervals) to show whether a sufficient number of experiments have been performed and whether the numbers or graphs showing mean values carry any meaning.
  3. Experiments produce data, not graphs.
    While plots are often a nice way to visualize results, it is often more useful to provide numeric results in tables. Have you ever tried to read numbers out of a 3D plot, e.g., to compare them with your own results? Once you try to do this, you will notice that many impressive colorful plots have close to zero value. Think of numbers as the main result of your experiment and of graphs as just an additional representation to visualize certain interesting aspects.
  4. Experimental results need an interpretation.
    It is not sufficient to produce a number of tables and plots. It is crucial to interpret them. In particular, any unexpected results need an explanation. Yes, this can often be difficult and usually requires further experimentation in order to understand what is going on. But gaining this further insight into the program or system is crucial for understanding it. In fact, substantial (and often fun) research often starts after the initial data has been collected: you observe something unexpected and you start trying to understand it. (Of course, if you are running late, you will unfortunately often miss the fun of doing this part of research.)
  5. Experiments need to be designed.
    It is not sufficient to run a program (or system) in a random environment with random inputs producing (perhaps random) outputs. Instead, you need to design the experiments you perform. You need to think up front about the research question you want to answer with the experiment. It is often the case that you start from a rather simple question. But once the results show surprising (unexpected) behavior, which is often the case, it is crucial to iterate the process by designing new experiments that explain the surprising behavior observed. Of course, all this requires that you have thought about the expected results while designing the experiment. Having simple models of the system and a sound analysis of the complexity of the algorithms involved will help you determine what reasonable expected behavior is.
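
To make the remark about confidence intervals in item 2 a bit more concrete, here is a minimal sketch in Python of how a 95% confidence interval for a mean could be computed from N repeated measurements. It is only an illustration under stated assumptions: the measurements are assumed to be independent and roughly normally distributed, the function name `confidence_interval_95` and the sample run times are made up for this example, and the critical values are the usual tabulated two-sided 95% values of Student's t distribution. Beyond 30 degrees of freedom the sketch falls back to the normal approximation, which gives a slightly too narrow interval.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution for
# 1..30 degrees of freedom, as found in standard statistical tables.
T_95 = [
    12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
    2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
    2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
]


def confidence_interval_95(samples):
    """Return (mean, half_width) of a 95% confidence interval for the mean.

    Assumes independent, roughly normally distributed measurements.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = statistics.mean(samples)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    df = n - 1
    if df <= len(T_95):
        t = T_95[df - 1]
    else:
        # Assumption: normal approximation for larger sample sizes.
        t = statistics.NormalDist().inv_cdf(0.975)
    return mean, t * sem


# Hypothetical run times in seconds from five repetitions.
runtimes = [10.2, 9.8, 10.5, 10.1, 9.9]
mean, half_width = confidence_interval_95(runtimes)
print(f"{mean:.2f} s +/- {half_width:.2f} s (95% CI, N={len(runtimes)})")
```

If the reported interval is wide compared to the effect you want to show, that is a hint that N is too small or that something in the setup introduces noise you do not yet understand. Reporting the interval next to the mean also makes the choice of N something the reader can judge.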