Introduction and the Controversy
Suppose we have a DNA profile found at the scene of a crime and also suppose that there are good reasons to believe that the profile comes from (one of) the criminal(s). If we select an individual with this DNA profile, then this individual may become a suspect in the case. There are, however, different ways to select an individual with the DNA profile found at the scene of the crime, and in this article, we are concerned with the evidential value of the match for various selection procedures. More precisely, we will distinguish between two different situations.
1. There is a person already identified as a suspect before the DNA profile came up. It so happens that the profile matches this suspect.
2. We run the profile through a database of DNA profiles, and it so happens that there is a unique hit. The person corresponding to this hit becomes a suspect in the case.
In which of the two situations is the DNA evidence against the suspect stronger? Is there a number that correctly reflects the “evidential value” of a unique hit in a database search? These simple questions have inspired a significant number of papers in the statistical forensic literature [1–11], and in this article, we try to explain why these questions have been so controversial and how the controversy can be resolved.
Let us refer to the first of the two situations as the cold case and to the second as the database case. There are at least two conflicting intuitive ideas concerning the strength of the evidence in the database case versus the cold case. These conflicting ideas can be summarized in a few lines, and they provide a good starting point for the discussion.
View 1: One can argue that the evidential value of a database hit is much larger than in the cold case. Indeed, having a unique hit in a database simply means that all the other persons in the database are innocent. Therefore, we have the same information as in the cold case, plus the extra information that many other people are innocent. Clearly, this increases the weight of the evidence against the suspect. For support of View 1, see e.g., [1, 5] and [4].
View 2: On the other hand, one can argue that, in the database case, the evidential value of the hit is much smaller than in the cold case. If the frequency of the DNA profile is, say, 1 in a million and the database consists of, say, 1 million profiles, then it is quite reasonable to expect a match just by pure chance. Even if we assume that all persons in the database are innocent, there is still a probability of about 0.63 that there will be at least one match. A match in the database therefore does not say very much, and its evidential value is probably small. Here are the words of a proponent of this view (quote from [6] as an answer to criticism of an earlier article [10]):
What to recommend in a case in which, after finding exactly one match in a database of size 630.000 […]. The random match probability with 5 analyzed STR loci is in the magnitude of 1 in 600.000. It is in our view inconceivable to assume, in such a situation, that the evidential value due to the database is higher than without a database search. Rather, it seems that the contrary is of significance.
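The figure of about 0.63 under View 2 can be checked directly. The sketch below assumes, as View 2 does, that every profile in the database belongs to an innocent person and that profiles match independently with probability p; the function name `prob_at_least_one_match` is ours and purely illustrative. The second call plugs in the rounded numbers from the quote above, so its result is only indicative.

```python
def prob_at_least_one_match(p, n):
    """Chance that n independent, innocent profiles produce one or more matches.

    p: probability that a given person has the crime-scene profile
    n: number of profiles in the database
    """
    # The complement of "no match at all" among n independent profiles.
    return 1 - (1 - p) ** n


# Profile frequency 1 in a million, database of one million profiles.
print(round(prob_at_least_one_match(1e-6, 1_000_000), 2))

# Rounded figures from the quoted case: 1 in 600,000 and 630,000 profiles.
print(round(prob_at_least_one_match(1 / 600_000, 630_000), 2))
```

Both values are well above one half, which is the intuition behind View 2. Note that this is the probability of at least one match, not of exactly one.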
Which of the two views is correct? In order to address this question we have to be more precise about what we mean by “evidential value”. A reasonable and standard way to quantify the evidential value is via a so-called likelihood ratio. The procedure leading to such a likelihood ratio is as follows. We set up two competing hypotheses; in our case, the first hypothesis is “The suspect left the DNA stain” and the competing hypothesis is “The suspect did not leave the DNA stain”. Often, the first hypothesis is called Hp where the “p” stands for “prosecutor”, and the second hypothesis is called Hd where the “d” stands for “defense”. We next compare the probability of seeing the evidence E under both hypotheses by computing the ratio

$$\mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)}.$$
This ratio is called the likelihood ratio. If this number is high, it implies strong evidence against the suspect because the probability of seeing the evidence is much higher when he/she is guilty than when he/she is innocent.
It seems, therefore, that we would be able to make a choice between the two views above by simply computing the likelihood ratios corresponding to the above two cases. Let us now sketch how this can be done, starting with the cold case.
The basic assumption is that DNA profiles of different people are independent and that for any profile there is a probability p that a given person has this profile. A smaller p simply means that the profile occurs less frequently. Clearly, the independence assumption is not met in reality because of, for instance, family relations, but it is well worth the effort to see what conclusion we can draw from a somewhat simplified situation.
Suppose the DNA profile found at the scene of the crime has a frequency p and suppose the suspect is guilty. The probability of observing the evidence in that case is simply 1 (or 100%, but we prefer probabilities to be numbers between 0 and 1). On the other hand, if the suspect is not guilty, the probability of seeing the evidence is p; indeed if the suspect is not guilty, then some unknown person has committed the crime, and the probability that he/she has this profile is just p. This means that in the cold case, the likelihood ratio LRc (the “c” stands for “cold”) is equal to

$$\mathrm{LR}_c = \frac{1}{p},$$
and this makes intuitive sense: when the profile is very rare this implies a very small p and therefore a very high likelihood ratio.
Next we would like to perform a similar computation, with the same set of hypotheses, in the database case, but this is slightly more complicated. It turns out that the likelihood ratio LRd (the “d” stands for “database”) in this situation can only be computed if we have more information, namely, the size N of the total population in question and the size n of the database. Given these two numbers, the likelihood ratio turns out to be equal to

$$\mathrm{LR}_d = \frac{1}{p} \cdot \frac{N-1}{N-n};$$

we explain this number in the Appendix. Since N − 1 > N − n whenever n > 1, this number is larger than LRc, giving support to View 1.
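The comparison between the two likelihood ratios can be sketched numerically. The code below is an illustration under stated assumptions, not the Appendix derivation: profiles are independent, the database holds the suspect plus n − 1 others, and, under Hd, the true donor is taken to be equally likely to be any of the other N − 1 people in the population (this uniform assumption is ours). Under these assumptions the database likelihood ratio works out to (N − 1) / (p (N − n)). The function names and the numbers in the usage lines are hypothetical.

```python
def lr_cold(p):
    """Cold case: P(E | Hp) = 1 and P(E | Hd) = p."""
    return 1 / p


def lr_database(p, N, n):
    """Database case, built from the two conditional probabilities.

    Evidence E: the suspect matches and the other n - 1 database
    members do not. Assumes independent profiles and, under Hd, a donor
    who is equally likely to be any of the other N - 1 people.
    """
    if not 1 <= n < N:
        raise ValueError("need 1 <= n < N")
    no_other_hits = (1 - p) ** (n - 1)
    # Under Hp the suspect matches for certain; the others must not match.
    prob_e_given_hp = no_other_hits
    # Under Hd the donor must lie outside the database, since a donor
    # inside it would have produced a second hit.
    donor_outside = (N - n) / (N - 1)
    prob_e_given_hd = donor_outside * p * no_other_hits
    return prob_e_given_hp / prob_e_given_hd


# Hypothetical numbers: profile frequency 1 in a million, a population
# of ten million, and a database of one million profiles.
p, N, n = 1e-6, 10_000_000, 1_000_000
print(lr_cold(p))
print(lr_database(p, N, n))
print((N - 1) / (p * (N - n)))  # closed form, for comparison
```

The factor for "no other hits" appears in both numerator and denominator and cancels, which is why the result depends only on p, N and n. With n = 1 the database consists of the suspect alone and the two likelihood ratios coincide.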
One may wonder what else there is to discuss. However, the situation is not that simple, for the following reason. It is a well-known principle in statistics that it is “not allowed” to use a dataset first to find a reasonable hypothesis and then use the same dataset again to test this hypothesis. Indeed, one can tailor a hypothesis around a dataset, but then the very same dataset should not be used for testing this hypothesis. And yet, this is precisely what happens in the database problem. After looking at the data, that is, after we have found that there is a unique match in the database, we set up the hypothesis that the unique matching person is the criminal. Clearly, we have tailored the hypothesis around the data. It is, therefore, reasonable to say that one is simply not allowed to do this, and therefore, the above-mentioned hypotheses Hp and Hd leading to equation (4) should not be used when a suspect has been found through a database search.
Considerations of this type led the authors of the second ...