Psychol. Inj. and Law (2010) 3:133–147
DOI 10.1007/s12207-010-9073-0

Error Rates in Forensic Child Sexual Abuse Evaluations

Steve Herman & Tiffany R. Freitas

Received: 14 January 2010 / Accepted: 29 March 2010 / Published online: 27 May 2010
© Springer Science+Business Media, LLC 2010

Abstract  When mental health, medical, and social work professionals and paraprofessionals make false positive or false negative errors in their judgments about the validity of allegations of child sexual abuse, the consequences can be catastrophic for the affected children and adults. Because of the high stakes, practitioners, legal decision makers, and policy makers should have some idea of the magnitude and variability of error rates in this domain. A novel approach was used to estimate individual error rates for 110 professionals (psychologists, physicians, social workers, and others) who conduct or participate in forensic child sexual abuse evaluations. The median estimated false positive and false negative error rates were 0.18 and 0.36, respectively. Estimated error rates varied markedly from one participant to the next. For example, the false positive error rate estimates ranged from 0.00 to 0.83. These estimates are based on participants’ self-reported substantiation rates and on their subjective frequency distributions for the probability of truth for the abuse allegations they evaluate.

Keywords  Child sexual abuse · Forensic evaluation · Judgment · Overconfidence · Accuracy

This research was supported in part by grant #P20MD001125 from the National Institutes of Health.

S. Herman (*) · T. R. Freitas
Department of Psychology, University of Hawaii at Hilo,
200 West Kawili Street, Hilo, HI 96720, USA
e-mail: drsteveherman@gmail.com
URL: http://steveherman.com

Each year in the USA, caseworkers for Child Protective Services (CPS) agencies conduct approximately 150,000 forensic evaluations in cases of alleged or suspected child sexual abuse (US Department of Health and Human Services 2009). Tens of thousands of additional child sexual abuse (CSA) evaluations by mental health, social work, and medical[1] professionals and paraprofessionals (henceforth, collectively referred to as mental health professionals or MHPs) occur in other settings, for example, in Child Advocacy Centers (National Children's Alliance 2007), in specialized sexual assault clinics (Davies et al. 1996), and in disputed child custody cases (Bow et al. 2002). Forensic CSA evaluations by MHPs are also common in other countries (Hershkowitz et al. 2007b; Lamb 1994; Sternberg et al. 2001; Wilson 2007).

[1] Psychiatrists, other physicians, and nurses collect and evaluate both medical evidence (the physical signs of abuse) and psychosocial evidence (verbal reports during interviews with the alleged child victim and others, documentation of past verbal reports, opinions of other professionals, and behavioral observations). This article focuses on psychosocial evaluations. To the extent that medical personnel evaluate the validity of abuse allegations on the basis of psychosocial data, they are performing psychosocial evaluations.

Some have argued that MHPs who perform forensic CSA evaluations should not explicitly state their opinions about the “ultimate legal issue”—whether or not abuse allegations are true—because of the lack of a sound scientific basis for such opinions (e.g., Kuehnle and Sparta 2006). However, in most forensic CSA evaluations, MHPs are expected to provide explicit or implicit opinions about the likelihood that the abuse allegations are true (Berliner and Conte 1993). For example, CPS caseworkers in the US are generally required to classify abuse allegations as
either substantiated[2] or unsubstantiated following their investigations.

[2] The verb to substantiate and the adjective substantiated are used throughout this article in the sense that is familiar in this context: to describe a data collection and evaluation process which culminates in an MHP’s judgment that there is enough evidence that an allegation of sexual abuse is true to warrant State intervention. This usage may be misleading and confusing to those not familiar with child abuse jargon, because it is not consistent with the general meaning of the verb to substantiate. The Oxford English Dictionary defines substantiate as to “provide evidence to support or prove the truth of” (Soanes and Stevenson 2005). However, there are many cases in which MHPs “substantiate” abuse allegations that are not supported by any evidence other than the child’s report. In other words, in many cases that they “substantiate,” MHPs do not actually find or provide any new evidence that “supports or proves the truth of” an abuse allegation. In essence, “substantiation” often boils down to an MHP expressing an opinion that a child is telling the truth. Despite this fundamental definitional problem, for the sake of convenience and readability, the terms to substantiate and substantiated are used throughout this article in the familiar way.

When MHPs make false positive errors—judging false or erroneous CSA allegations to be true—the consequences can be catastrophic for the adults and children involved (e.g., Bikel 1997; Boyer and Kirk 1998; Bruck et al. 1998; Ceci and Bruck 1995; Fukurai and Butler 1994; Garven et al. 1998; Humphrey 1985; Johnson 2004; Nathan and Snedeker 2001; Rabinowitz 2003; Robinson 2005, 2007; Rosenthal 1995; San Diego County Grand Jury 1992, 1994; Schreiber et al. 2006; Seattle Post Intelligencer 1998; Wexler 1990). Although there are fewer anecdotal reports of false negative errors by MHPs, the reports that do exist indicate that the consequences of these errors can also be dire (e.g., Pawloski 2005; Swarns and Cooper 1998; Waldman and Jones 2008).

With so much riding on MHPs’ judgments about the validity of allegations of CSA, one might think that legal decision makers would be keenly interested in the “known or potential error rate”[3] for these judgments. In fact, relatively little legal attention has been focused directly on the accuracy of either professional or lay judgments about the validity of allegations of CSA. For example, the Oregon Supreme Court recently ruled that an expert opinion that an allegation of sexual abuse was true was admissible as scientific evidence, even though the opinion was based primarily on an expert’s assessment of the credibility of the verbal reports of the alleged child victim and others (State v. Southard 2009).

[3] This phrase comes from the well-known US Supreme Court decision, Daubert v. Merrell Dow Pharmaceuticals, Inc. (1993).

In their analysis of the indicia of scientific validity for the proffered diagnosis of sexual abuse, the Court relied primarily on the general acceptance criterion (State v. Brown 1984; Frye v. United States 1923), summing up their analysis as follows: “Because there was no physical evidence of sexual abuse in this case, the [experts] based [their] diagnosis on: (1) the boy's
reported behaviors and (2) [their] determination that the boy's reports of sexual abuse were credible…. The experts were all qualified, the techniques used are generally accepted, the procedures rely on specialized literature in the field, and the procedures used are not novel” (State v. Southard 2009, p. 137). The Court did not address the issue of the known or potential error rate for expert judgments about the validity of CSA allegations. The question of whether or not a diagnostic technique produces accurate results is arguably more important than whether or not the technique is generally accepted by a scientific or clinical community, supported by “specialized literature,” novel, or used by qualified experts.

Interestingly, the Court went on to rule that the experts’ opinion that the abuse allegations were true should, nevertheless, have been excluded at trial because the potential for prejudice—the likelihood that jurors would ascribe too much weight to the experts’ opinion—outweighed the probative value of that opinion. The Court explained that, in their opinion, the jury was just as capable of judging the credibility of the child and other witnesses as were the experts and that, therefore, the incremental probative value of the experts’ opinion that the allegations were true was “minimal” (State v. Southard 2009, p. 141).

A notable exception to the US courts’ general failure to clearly focus on the accuracy of expert and lay opinions about the validity of allegations of CSA comes from the US Supreme Court, which recently overturned a death penalty sentence in a child sexual abuse case. Among other reasons for their decision, the Court, in effect, acknowledged that false positive errors were more likely in professional and lay judgments about the validity of CSA allegations than in judgments about other types of allegations of criminal conduct:

  The problem of unreliable, induced, and even imagined child testimony means there is a “special risk of wrongful execution” in some child rape cases.… Studies conclude that children are highly susceptible to suggestive questioning techniques like repetition, guided imagery, and selective reinforcement…. Similar criticisms pertain to other cases involving child witnesses; but child rape cases present heightened concerns because the central narrative and account of the crime often comes from the child herself. She and the accused are, in most instances, the only ones present when the crime was committed. (Kennedy v. Louisiana 2009, p. 2663)

Like the legal community, the scientific community has shown a relative lack of interest in attempting to estimate error rates for professional and lay judgments about the validity of CSA allegations. Although there is a widespread (but not universal) consensus among prominent researchers in this field that MHPs’ judgments about the validity of uncorroborated allegations of sexual abuse lack a firm scientific foundation (Fisher 1995; Fisher and Whiting 1998; Goodman et al. 1998; Horner et al. 1993; Melton and Limber 1989; Poole and Lindsay 1998), there are very few empirical studies or reviews that have focused directly on this issue.

Why have both the legal and scientific communities paid relatively little attention to the accuracy question and, by contrast, so much attention to other related issues such as the influence of questioning methods on the reliability of children’s reports of past events?
After all, the main reason that researchers are interested in studying questioning methods is because certain questioning methods can reduce the reliability of children’s reports, and unreliable reports from children can lead to errors in expert and lay judgments about the validity of CSA allegations. Yet there have been hundreds of research studies and books that have focused on interview techniques and protocols for forensic child interviews (e.g., Lamb et al. 2008), but only a handful of studies and reviews that have focused directly on the accuracy of expert or lay judgments about the validity of CSA allegations.

One obvious reason for this discrepancy is that most legal decision makers and researchers apparently believe that there is no reliable scientific way to study judgment accuracy in this domain (cf. Horowitz et al. 1995). It is true that we cannot directly evaluate judgment accuracy in this domain because we cannot conduct experiments in which some children are randomly assigned to be sexually abused in order to determine whether or not professionals can distinguish between abused and nonabused children. Nor can we subject children to the kinds of intensive suggestive questioning techniques that sometimes produce false reports and false memories of sexual abuse in order to see if professionals can distinguish between true and false reports, because such experiments would pose a severe risk of psychological harm to the children involved. Field studies of real forensic evaluations are also hampered by the absence, in most cases, of a reliable gold standard that could be used to independently assess the accuracy of evaluator judgments.

Despite these methodological constraints, there have been a handful of empirical studies that have provided data that are relevant to estimating error rates in professional judgments about the validity of CSA allegations (Finlayson and Koocher 1991; Horner and Guyer 1991a, b; Horner et al. 1992; McGraw and Smith 1992; Realmuto et al. 1990; Realmuto and Wescoe 1992; Shumaker 2000). These studies have focused on the subset of CSA cases in which, from the evaluator’s point of view, there is no strong corroboration for a child’s verbal report of sexual abuse. Of course, these are precisely the cases in which MHPs’ opinions are most likely to play critical roles. Herman (2005) analyzed data from the aforementioned studies, and other empirical studies of clinical judgment, and concluded that the overall error rate for professionals’ judgments about the validity of uncorroborated allegations of CSA was greater than 0.24, but he did not provide separate estimates of false positive and false negative error rates.

A ground-breaking study by Hershkowitz et al. (2007a) provides the best empirical data so far on MHPs’ ability to distinguish between uncorroborated true and false reports of sexual abuse made by children during forensic interviews. In that study, 42 experienced forensic evaluators read transcripts of 24 actual investigative interviews that were conducted with alleged child victims of sexual abuse. The researchers combed through a large historical database to select transcripts from 12 cases in which there was strong independent evidence that the child’s interview report was true and from 12 cases in which there was strong independent evidence that the child’s report was false. Study participants were asked to judge the validity of the children’s verbal reports of sexual abuse without access to the independent evidence.
The false positive and false negative error rates were 0.44 and 0.33, respectively; the overall error rate was 0.39. See the original report by Hershkowitz et al. and an analysis by Herman (2009) for more on this important study.

Another body of psychological research that is relevant to assessing the accuracy of evaluator judgments about CSA allegations is the research on the ability of laypersons and experts to determine when people are lying. Although studies in this area are usually referred to as “deception detection” studies, they might be better described as “honesty and deception detection” studies because they usually evaluate participants’ ability to correctly detect false and true statements. The reason this research is important to assessing the accuracy of MHPs’ judgments about CSA allegations is because, in many cases, MHPs’ judgments are based primarily on their determinations that a child or adult is either telling the truth or lying.

The evidence from many experimental studies is remarkably consistent: the majority of laypersons and professionals have little or no ability to discriminate between true and false statements about past events made by either children or adults. In a meta-analysis of 206 deception detection studies, Bond and DePaulo (2006) found that the average rate of correct classification was 54%, only slightly higher than the expected chance accuracy rate of 50% (equivalent to a correlation of r ≈ 0.08 between judgments and reality). It is disturbing to note that in the 19 studies in the Bond and DePaulo meta-analysis that directly compared the abilities of professionals (MHPs, police officers, detectives, judges, customs officials, polygraph examiners, job interviewers, federal agents, and auditors) and laypersons to correctly classify true and false statements, the experts, the people we rely on to ferret out the truth, were just as inaccurate as the laypersons: laypersons were correct 55.7% of the time in these 19 studies; the experts, 54.1% of the time. One experimental study that specifically examined the abilities of professionals and laypersons to discriminate between true and false statements made by young children (ages 5–6), adolescents, and adults found that adults are no better at detecting truth and deception in young children than they are at detecting them in adolescents and adults (Vrij et al. 2006).

In the Sam Stone experiment, Leichtman and Ceci (1995) used suggestive interviewing techniques to induce children to make false reports about classroom events. They showed videotapes of true and false statements made by children who participated in the experiment to approximately 1,500 experienced child professionals (judges, child development researchers, and MHPs). These professionals “generally failed to detect which of the children’s claims were accurate and which were inaccurate, despite being confident in their judgments… The very children who were least accurate were rated as most accurate” (Ceci and Bruck 1995, p. 281). Other experimental studies have also consistently found that most adults have little or no ability to discriminate between true and false statements made by children (Connolly et al. 2008; Edelstein et al. 2006; Leach et al. 2004; Talwar et al. 2006; Vrij 2002; Vrij and van Wijngaarden 1994).
To make matters worse, professionals tend to be overconfident in their abilities to detect honesty and deception in both children and adults, and there is little or no correlation between confidence or experience and deception detection ability (Aamodt and Custer 2006; Colwell et al. 2006; DePaulo et al. 1997; DePaulo and Pfeifer 1986; Elaad 2003; Garrido et al. 2004; Kassin 2002; Mann et al. 2004; Vrij and Mann 2001). The absence of significant confidence-accuracy and experience-accuracy correlations means that confident, experienced experts, the ones who are most likely to influence legal decisions, are likely to be just as inaccurate in their judgments about honesty and deception as are less experienced or less confident experts.

A New Approach to Estimating the Accuracy of MHPs’ Judgments

The current study was designed to implement a new approach to estimating false positive and false negative error rates for MHPs’ judgments about the validity of CSA allegations. In this retrospective survey study, MHPs who frequently conduct or participate in forensic CSA evaluations were asked to estimate (a) the percent of their evaluations in which they substantiated sexual abuse allegations and (b) the percent of their cases that fell into each of five mutually exclusive probability-of-truth bins. The five probability-of-truth bins were 0–20%, 20–40%, 40–60%, 60–80%, and 80–100%. Using these data, it was possible to calculate for each participant (a) the implied probability-of-truth threshold required for substantiation and (b) the implied false positive, false negative, and overall error rates.

Methods

Data were collected over the Internet using a custom-designed, branching, Web-based survey. The Web application that was used to deliver the survey was written in the Perl programming language. The survey was lengthy, about 40 Web pages (because this was a branching survey, the exact number of pages varied); it took about 1 h to complete. The survey included questions about participants' backgrounds, education, and training. There were numerous items that focused on characteristics of participants’ caseloads, substantiation rates, knowledge of relevant research, and other topics related to forensic CSA evaluations.

Participants were recruited through personal contacts, a targeted postal mailing to psychologists who conduct child custody evaluations, and announcements that were sent to email lists for forensic MHPs (e.g., the PSYLAW-L, child custody, and ISPCAN email lists). The study announcements stated that English-literate professionals or paraprofessionals who had conducted or participated in at least ten forensic CSA evaluations were eligible to participate. Respondents who asserted via email or telephone that they met the eligibility criteria were emailed a unique ID and password that they used to access the survey Website. Participants who completed the survey were paid $50. As in many other online and paper-based survey studies, it is impossible to estimate the response rate, since there is no way of knowing how many eligible MHPs read a request to participate in the study. The survey was completed by 123 individuals over a 5-month period.
Thirteen surveys (11%) were dropped because the professional subgroup was too small (i.e., only five law enforcement investigators completed the survey), important data was missing or entered incorrectly and could not be corrected (four surveys), or the responses appeared to be random or inconsistent (four surveys), leaving a total of 110 complete, consistent protocols. Because data was checked for validity as it was entered by participants, and participants were prompted to correct likely data entry errors before being allowed to proceed, the rates of missing data and entry errors were low, with no single variable missing or likely incorrect for more than 5% of all respondents.

Survey Items

Because no available survey instruments adequately addressed the topics covered in the current study, the survey items were written for this study. The wording of questions was refined using think-aloud protocol sessions with forensic MHPs. Written feedback was also obtained from a number of MHPs who completed early versions of the survey.

Results

Sample characteristics are described in Table 1. The study included a broad sampling of child abuse professionals and paraprofessionals. About half of the sample held advanced degrees (Ph.D., M.D., or equivalent). The primary child abuse professions (psychology, medicine, and social work) were well represented. About half the sample worked primarily for Child Advocacy Centers or child protection agencies (including both employees and contractors). The median participant devoted about 30% of his or her time to forensic CSA evaluations; 25% of participants devoted 80% or more of their time to CSA evaluations. The median number of hours devoted to each CSA evaluation was six, but about 20% of participants reported spending 3 or fewer hours per evaluation. Participants who devoted 3 or fewer hours to each evaluation were, in general, participants who served primarily as forensic child interviewers on multidisciplinary teams. A substantial subset of participants, about 45%, worked on a fee-for-service basis; Table 1 data on hourly and total fees is limited to this subset.

Table 1  Sample characteristics (N = 110)

  Variable                                          %    M (SD)           Median   Range
  Female gender                                     70
  Age                                                    45 (12)          47       26–75
  US resident                                       92
  Ethnicity/race
    African American                                3
    Asian/Pacific Islander                          8
    Hispanic                                        4
    White                                           85
  Highest degree
    AA or some college                              4
    BA or equivalent                                8
    MA or equivalent                                37
    PhD, MD, or equivalent                          51
  Profession
    Caseworker                                      21
    Social worker                                   19
    Counselor                                       2
    Nurse                                           7
    Physician                                       23
    Forensic psychologist                           16
    Other psychologist                              12
  Employer(s)
    Child advocacy center                           29
    Child protection agency                         26
    Medical or hospital setting                     14
    Court-appointed                                 14
    Law enforcement agency                          3
    Defense in civil case                           3
    Defense in criminal case                        4
    Plaintiff in civil case                         1
    Prosecution in criminal case                    3
    Other                                           5
  Annual work income (US dollars)
    $10,000–29,999                                  4
    $30,000–49,999                                  28
    $50,000–69,999                                  24
    $70,000–99,999                                  16
    $100,000 or more                                30
  Years performing CSA evaluations                       11 (8)           10       1–30
  Proportion of time devoted to CSA evaluations          0.43 (0.31)      0.30     0.00–1.00
  Total CSA evaluations performed                        731 (1,309)      162      10–8,000
  Number of hours per evaluation                         13 (13)          6        1–60
  Member of multidisciplinary team                  40
  Hourly rate in US dollars (n = 58)                     $122 ($87)       $100     $13–$360
  Total fee in US dollars per evaluation (n = 51)        $1,565 ($1,498)  $1,000   $60–$6,000

The eligibility criteria specified a minimum of ten evaluations, so ten was the smallest number of evaluations performed. The median participant had conducted or participated in 162 forensic CSA evaluations. About 10% of participants had participated in 2,000 or more evaluations. There was one participant who reported 8,000 evaluations. This participant and others with very high numbers of evaluations were contacted via email to ensure that these were not data entry errors. These were not data entry errors, but came from participants who worked in specialized sexual assault clinics in urban areas where they had performed several child examinations each day for many years. In all, study participants had conducted or participated in a cumulative total of approximately 90,000 forensic CSA evaluations.

Table 2 shows basic data on participants’ self-reported substantiation rates, substantiation thresholds, and probability-of-truth distributions. In addition to the means and standard deviations, Table 2 shows means weighted by number of cases, medians, and ranges for each variable.

Table 2  Self-reported substantiation rate, substantiation threshold, and probability of truth (N = 110)

                               M (SD)       Weighted M (SD)  Median  Range
  Substantiation rate          0.47 (0.24)  0.57 (0.24)      0.50    0.00–0.95
  Substantiation threshold     0.68 (0.15)  0.66 (0.16)      0.68    0.40–1.00
  Probability-of-truth of allegations evaluated
    80–100%                    0.41 (0.28)  0.47 (0.25)      0.30    0.00–1.00
    60–80%                     0.17 (0.16)  0.17 (0.11)      0.14    0.00–1.00
    40–60%                     0.14 (0.14)  0.15 (0.11)      0.10    0.00–0.80
    20–40%                     0.10 (0.12)  0.09 (0.10)      0.10    0.00–1.00
    0–20%                      0.17 (0.20)  0.12 (0.15)      0.10    0.00–0.90

  Note: The second column shows the mean and standard deviation weighted by the number of cases that each participant reported having evaluated.

The wording of the questions that were used to collect the data shown in Table 2 was as follows:

[Substantiation rate:] In what percent of your CSA evaluations do you (or your team) classify the abuse allegations as substantiated?
[Substantiation threshold:] What is the minimum level of probability that the allegations are true that you would require in order to classify an abuse allegation as substantiated? Please estimate this probability using a scale from 0% probability (you are absolutely certain that the abuse allegations are false) to 50% probability (you believe that there is a 50% chance the allegations are true and a 50% chance that the allegations are false) to 100% probability (you are absolutely certain that the abuse allegations are true).

[Probability-of-truth:] To answer the questions on this page, please consider your own personal beliefs about the probability that the CSA allegations are true in the cases you (or your team) evaluate…. Please try to estimate the percent of cases that fall into each of these categories (the total should be 100%):

1. The probability that the allegations are true is 80–100% (you are certain or fairly certain that the allegations are true).
2. The probability that the allegations are true is 60–80%.
3. The probability that the allegations are true is 40–60%.
4. The probability that the allegations are true is 20–40%.
5. The probability that the allegations are true is 0–20% (you are certain or fairly certain that the allegations are false).

As can be seen from the data in Table 2, the range of responses to each of these questions was extreme; for example, the substantiation rates ranged from 0.00 to 0.95. The patterns of subjective frequency distributions in response to the probability-of-truth question also varied widely; four representative distributions are shown in Fig. 1.

Fig. 1  Examples of subjective distributions for the probability-of-truth for allegations evaluated (panels show the five-bin frequency distributions reported by Participants 10, 29, 63, and 94; x-axis: probability allegation is true, from .00–.20 to .80–1.0; y-axis: frequency)

The first line in Table 3 shows participants’ self-reported estimates of their own error rates. Specifically, the rates shown in the columns headed “1–PPV” and “1–NPV” show the median response and range of responses to these questions:

When people make decisions about the validity of allegations of child sexual abuse, they sometimes make incorrect classifications. You may discover that a mistake has been made if new evidence emerges after
an evaluation is complete. For example, an allegation is classified as unsubstantiated, and then, perhaps months later, the alleged perpetrator confesses. Even the most accurate evaluators are bound to make some errors if they evaluate enough cases. One type of mistake is when a true allegation of abuse is classified as unsubstantiated (or inconclusive or unfounded). Another type of mistake is when a false or erroneous allegation of abuse is classified as substantiated.

[1–NPV:] Please try to estimate what percent of all of your (or your team's) decisions to classify an abuse allegation as unsubstantiated (or inconclusive or unfounded) are incorrect. In other words, what percent of all of the CSA allegations that you (or your team) classify as unsubstantiated (or inconclusive or unfounded) are actually true allegations?

[1–PPV:] Now, please try to estimate what percent of all of your (or your team’s) decisions to classify an abuse allegation as substantiated are incorrect. In other words, what percent of all of the CSA allegations that you (or your team) classify as substantiated are actually false or erroneous allegations?

The other figures shown in the first row of Table 3 were either reported directly by participants (i.e., substantiation threshold) or calculated from the self-reported substantiation rates and the self-reported values for 1–PPV and 1–NPV.

Table 3  Error rate estimates: median and range (N = 110)

  Estimate source                Substantiation threshold  Error             FPR               FNR               1–PPV             1–NPV
  Participant self-report        0.68 (0.40–1.00)          0.08 (0.00–0.44)  0.05 (0.00–0.90)  0.11 (0.00–0.80)  0.05 (0.00–0.60)  0.10 (0.00–0.97)
  Calculated, 0% overextremity   0.70 (0.19–1.00)          0.28 (0.11–0.75)  0.18 (0.00–0.83)  0.36 (0.02–1.00)  0.12 (0.01–0.62)  0.39 (0.09–0.84)
  Calculated, 10% overextremity  0.66 (0.25–0.90)          0.33 (0.19–0.70)  0.27 (0.00–0.88)  0.40 (0.03–1.00)  0.20 (0.11–0.60)  0.41 (0.18–0.78)
  Calculated, 20% overextremity  0.62 (0.31–0.80)          0.37 (0.26–0.65)  0.34 (0.00–0.91)  0.43 (0.03–1.00)  0.27 (0.20–0.58)  0.43 (0.26–0.71)
  Calculated, 30% overextremity  0.58 (0.38–0.70)          0.41 (0.34–0.60)  0.40 (0.00–0.93)  0.46 (0.04–1.00)  0.35 (0.30–0.55)  0.45 (0.34–0.64)

  Note: Each cell shows the median and range. Substantiation threshold = the minimum probability of truth required for substantiation; Error = total error rate (false positives + false negatives); FPR = false positive rate; FNR = false negative rate; 1–PPV = 1 − positive predictive value, the proportion of all substantiations that are erroneous; 1–NPV = 1 − negative predictive value, the proportion of all non-substantiations that are erroneous.
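To make the derivation concrete, the sketch below shows one way to recover the calculated quantities in the first row of Table 3 from a participant's self-reported substantiation rate and 1–PPV/1–NPV estimates, using the definitions in the note to Table 3. The code is our illustration, not the study's: the function name is invented, and because medians of per-participant values need not equal values computed from medians, the printed figures only approximate the tabled medians.

    # Illustrative sketch (not the study's code): derive the calculated
    # quantities in the first row of Table 3 from one participant's
    # self-reports. All quantities are fractions of the total caseload.

    def rates_from_self_report(sub_rate, one_minus_ppv, one_minus_npv):
        fp = sub_rate * one_minus_ppv          # substantiated but actually false
        tp = sub_rate - fp                     # substantiated and actually true
        fn = (1 - sub_rate) * one_minus_npv    # unsubstantiated but actually true
        tn = (1 - sub_rate) - fn               # unsubstantiated and actually false
        return {
            "overall_error": fp + fn,          # total error rate
            "FPR": fp / (fp + tn),             # errors among all false allegations
            "FNR": fn / (fn + tp),             # errors among all true allegations
        }

    # Median self-reports: substantiation rate 0.50 (Table 2),
    # 1-PPV 0.05 and 1-NPV 0.10 (Table 3, first row).
    print(rates_from_self_report(0.50, 0.05, 0.10))
    # -> overall_error ~0.075, FPR ~0.053, FNR ~0.095; i.e., roughly the
    #    0.08, 0.05, and 0.11 medians in the first row of Table 3.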
Analysis

The data shown in the first row of Table 3 indicates that the median study participant believed that he or she erred in 5% of his or her decisions to classify allegations as substantiated and in 10% of his or her decisions to classify allegations as unsubstantiated. The estimated overall error rate based on these self-reported figures is 0.08. This estimate is significantly lower than the lower bound estimate of 0.24 that was calculated by Herman (2005) for uncorroborated allegations.

A potential problem with these self-reported estimates is that, when people are asked to directly estimate their own error rates, a number of cognitive heuristics and situational factors are likely to result in a bias towards underestimation. In this case, these factors include overconfidence in judgments made under conditions of uncertainty (Baron 2008; Lichtenstein et al. 1982); a lack of useful corrective feedback, which is generally necessary for learning from mistakes (Dawes 1994); confirmation bias (Poole and Lamb 1998); and other biases, heuristics, and cognitive illusions (Baron 2008; Poole and Lamb 1998).

In order to provide error rate estimates that may be less subject to underestimation biases, the following approach was taken. First, continuous probability density functions were fitted to the five-bin subjective probability-of-truth distributions; an example, for Participant 32, is shown in Fig. 2. (Participant 32 was chosen to illustrate the error rate estimation process because her error rates were close to the median rates.) Curves were fitted using weighted beta distributions: one distribution for unimodal distributions and two for bimodal distributions. The curve fits were good for most participants, although there were a few distributions that did not result in good fits (e.g., the distribution for Participant 94 in Fig. 1). For participants with good curve fits, there were virtually no differences between error rate estimates based on empirical integration using the fitted curves and a simpler approach in which all of the cases in each bin (or part of a bin) were assigned the average probability for that bin (or that part of the bin). Because the simpler approach produced virtually the same results as integration when there were good curve fits, the simpler approach was used for all estimates.

Fig. 2  Self-reported probability distribution for Participant 32 (five-bin frequencies with the fitted density; x-axis: probability that an allegation is true)
Second, the self-reported substantiation rate for each participant was combined with the probability-of-truth distribution to calculate the implicit minimum probability-of-truth required to substantiate an allegation. This procedure is illustrated in Fig. 3 for Participant 32.

Fig. 3  Calculation of the substantiation threshold (0.67) from the substantiation rate (0.50) for Participant 32 (the substantiated area to the right of the threshold and the unsubstantiated area to its left each equal .50)

Finally, the area under the curve for each participant was divided into four subareas representing true positives, false positives, true negatives, and false negatives. To do this, the participants’ own subjective probabilities were used, as illustrated in Fig. 4, so that, for example, it was assumed that 90% of the cases at the 90% probability-of-truth level represented true allegations. Implied error rates (overall error, the false positive error rate, the false negative error rate, 1 − the positive predictive value, and 1 − the negative predictive value) were calculated. The median calculated error rates and ranges are shown in the second row of Table 3.

Fig. 4  Estimated error rates for Participant 32 (true positives = .42, false positives = .08, true negatives = .31, false negatives = .19; overall error rate = .28, false positive rate = .21, false negative rate = .32, 1 − positive predictive value = .17, 1 − negative predictive value = .39)
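The arithmetic of the simpler approach can be sketched in a few lines of code. The sketch below is our illustration rather than the study's code: cases are assumed to be uniformly distributed within each 20-point bin, the top fraction of cases equal to the self-reported substantiation rate is treated as substantiated, and each slice of cases is scored against its own average probability-of-truth, as described above. The function name and the example distribution are invented.

    # Illustrative sketch (not the study's code) of the simpler binned
    # approach: each slice of cases is assigned the average probability-
    # of-truth for its bin (or part of a bin), and the top `sub_rate`
    # fraction of cases is treated as substantiated (0 < sub_rate < 1).

    BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

    def implied_rates(bin_freqs, sub_rate):
        """bin_freqs: fractions of cases in the 0-20% ... 80-100% bins."""
        remaining = sub_rate                 # substantiated mass to allocate
        tp = fp = tn = fn = 0.0
        threshold = 1.0
        for i in reversed(range(len(bin_freqs))):   # highest bin first
            lo, hi, f = BIN_EDGES[i], BIN_EDGES[i + 1], bin_freqs[i]
            if f == 0:
                continue
            take = min(f, remaining)         # substantiated mass in this bin
            remaining -= take
            cut = hi - (hi - lo) * take / f  # implied threshold so far
            if take > 0:
                threshold = cut
                p_sub = (hi + cut) / 2       # mean prob of substantiated slice
                tp += take * p_sub
                fp += take * (1 - p_sub)
            if f - take > 0:
                p_unsub = (lo + cut) / 2     # mean prob of unsubstantiated slice
                fn += (f - take) * p_unsub
                tn += (f - take) * (1 - p_unsub)
        return {"threshold": threshold,
                "overall_error": fp + fn,
                "FPR": fp / (fp + tn), "FNR": fn / (fn + tp),
                "1-PPV": fp / (fp + tp), "1-NPV": fn / (fn + tn)}

    # Invented five-bin distribution (low to high) and a 0.50 substantiation
    # rate, loosely in the spirit of Participant 32; not her actual data.
    print(implied_rates([0.10, 0.10, 0.15, 0.25, 0.40], 0.50))
    # -> threshold 0.72, overall_error ~0.28, FPR ~0.18, FNR ~0.33,
    #    1-PPV ~0.13, 1-NPV ~0.43

With this invented input, the implied threshold and error rates happen to come out near the medians in the second row of Table 3, which is convenient for illustration but carries no evidential weight.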
Although the median calculated error rates shown in the second row of Table 3 are much higher than the median self-reported estimates shown in the first row, they are, nevertheless, likely to represent underestimates of the true median error rates. Judgment and decision researchers have found that when experts and laypersons make judgments about the probability that their own judgments about uncertain quantities or events are correct, they often overestimate the probability that their judgments are correct (Baron 2008). When experts make difficult judgments about whether or not an event occurred or will occur, overconfidence is often manifested as overextremity (Griffin and Brenner 2004). The overextremity effect has been demonstrated in empirical studies of judgments about the truth or falsity of various kinds of factual assertions, and it is especially likely to occur when the judgment task is difficult, error rates are high, corrective feedback is lacking, and judges believe that the probability that their judgments are correct is high (Baron 2008). For example, when college students estimated that the probability that judgments they made were correct was 100%, they were actually correct about 70% of the time, a 30% overextremity effect (Wright and Phillips 1980). In another study, when physicians estimated that the probability that patients had pneumonia was 90%, the actual probability was about 20% (Christensen-Szalanski and Bushyhead 1981). In one oft-cited study in which psychologists made postdictions about the behavior of a real person after reading clinical case study information about him, when the participants believed that 53% of their judgments were correct, they were only correct 28% of the time; furthermore, although participants’ confidence in their judgments increased as they received more and more clinical data about the target person, their judgment accuracy did not improve (Oskamp 1965). Other experimental studies have also found overconfidence effects in MHPs’ clinical judgments (Faust et al. 1988; Moxley 1973).

In the current context, overextremity means, for example, that the true average probability-of-truth for all of the allegations that are judged to fall into the 80–100% bin is likely to be lower than 90% and, similarly, the true average probability-of-truth for all of the cases that are judged to fall into the 0–20% bin is likely to be higher than 10%. In order to provide a rough illustration of the impact of three possible levels (10%, 20%, and 30%) of overextremity in probability-of-truth estimates on error rates, the estimated probability-of-truth distributions for each participant were compressed and error rates were recalculated. For example, to calculate error rates for 10% overextremity (which means that only 90% of all allegations judged to be 100% likely to be true are actually true), the maximum probability of truth was set to 90% and the minimum to 10%. All five of the bins were compressed and adjusted so that the total area contained in the five bins was still 100%: for an overextremity of 10%, bin 1 was adjusted to range from a probability-of-truth of 10% to 26%, bin 2 from 26% to 42%, bin 3 from 42% to 58%, bin 4 from 58% to 74%, and bin 5 from 74% to 90%. Estimated error rates, adjusted for 10%, 20%, and 30% overextremity, are shown in rows three through five of Table 3.
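In code, the compression amounts to a linear rescaling of each participant's probability-of-truth scale before the error rates are recomputed by the same procedure. The helper below is our illustration, with an invented name: for overextremity x, a judged probability p is mapped to x + p(1 − 2x).

    # Illustrative sketch: compress the probability-of-truth scale to
    # reflect an assumed level of overextremity. With x = 0.10, the edges
    # 0, .2, .4, .6, .8, 1 become .10, .26, .42, .58, .74, .90 -- the
    # adjusted bins described above.

    def compressed_edges(x, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
        return [x + p * (1 - 2 * x) for p in edges]

    print(compressed_edges(0.10))   # ~[0.10, 0.26, 0.42, 0.58, 0.74, 0.90]

In our sketch, rerunning the implied-rates calculation with these compressed edges in place of BIN_EDGES reproduces the kind of adjustment reported in rows three through five of Table 3.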
Discussion

This study used self-report data from professionals who conduct or participate in forensic CSA evaluations to estimate error rates for each participant’s judgments about the validity of the CSA allegations they evaluate. The two most important conclusions of this study are (a) error rates are high, higher than most practitioners and legal decision makers realize, and (b) error rates vary markedly from one practitioner to the next.

If the results of the current study are correct and generalizable, then 0.18, 0.36, and 0.28 represent optimistic estimates of the median false positive, false negative, and overall error rates. These are optimistic estimates, likely to be lower than the true medians, because they do not take overextremity bias into account. Variability in these estimated error rates, as shown in Table 3, is extreme. For example, the estimated overall error rates (with overextremity = 0%) ranged from 0.11 to 0.75 in the study sample.

There are a number of limitations to this study. First, the study used a self-selected sample of convenience, and caution is required in generalizing to the population of MHPs who conduct or participate in forensic CSA evaluations. Second, the study is based on retrospective self-reports. Self-reports are liable to multiple types of biases. An attempt has been made to estimate the effects of different possible levels of one type of expected bias—overconfidence, in the specific form of overextremity—but other systematic biases may also have influenced participants’ reports. Third, the survey was delivered over the Internet. Internet delivery may have biased responses in unknown ways. No attempt was made to verify the identities and professional status of participants, so it is possible that some participants were imposters, although participants did need to provide a valid email address and a valid name and postal mailing address in order to receive their $50 check for participation. Completing this long survey involved a considerable investment of time, about 1 h, and there was evidence of the kind of random or inconsistent responding that might be expected from imposters in only a small number of protocols (four out of 123); these four protocols were excluded from the analysis. Fourth, there was no attempt to verify the accuracy of information supplied by participants. For example, there was no review of documentation of participants’ evaluations in order to assess the accuracy of their self-reported substantiation rates. Fifth, the approach taken here to the estimation of error rates is novel. Surprisingly, clear examples of past applications of this seemingly obvious approach to error rate estimation could not be located. The general validity of this approach can and should be experimentally tested by judgment and decision researchers in contexts in which independent criteria for assessing the correctness of judgments are available. Pending further validation of this approach, results and conclusions should be viewed as tentative.

Arguments that indirectly support the external validity of the results of the current study include the following. First, key data collected in this retrospective survey study were consistent with data collected in field studies. For example, the mean self-reported substantiation rate in the current study was 0.47. The weighted mean substantiation rate across 3,841 forensic CSA evaluations that were examined in eight other studies was 0.52 (Bowen and Aldous 1999; Drach et al. 2001; Elliott and Briere 1994; Everson and Boat 1989; Haskett et al. 1995; Jones and McGraw 1987; Keary and Fitzpatrick 1994; Oates et al. 2000). The mean substantiation rates in these eight studies ranged from 0.25 to 0.63. Second, the high error rates found in this study are consistent with the widespread consensus among leading researchers that MHPs’ opinions about the validity of uncorroborated abuse allegations do not have a firm foundation in empirical science and with empirical findings from a handful of studies that have directly or indirectly addressed the issue of evaluator accuracy. In a review and analysis of relevant empirical studies, Herman (2005) concluded that the overall error rate for professionals’ judgments about the validity of uncorroborated CSA allegations exceeded 0.24.

The findings of this study—if correct and generalizable—indicate that judgment errors by MHPs are common. It is natural to ask what can be done to improve diagnostic performance in this judgment domain. As Swets et al. (2000) note, there are two general approaches to improving diagnostic performance: the first is to improve the accuracy of diagnostic procedures; the second is to adjust decision thresholds in order to maximize overall utility. The potential application of each approach to improving diagnostic performance in forensic CSA evaluations is considered in turn.

Improving Judgment Accuracy

There are several promising, feasible approaches to improving the accuracy of judgments about the validity of CSA allegations.
The most obvious reform would be to require the use of the National Institute of Child Health and Human Development (NICHD) forensic child interview protocol (Lamb et al. 2008, pp. 283–315) in all forensic child interviews. There is now a considerable body of empirical evidence indicating that use of this interview protocol leads to more accurate and detailed reports of past events by child interviewees (Lamb et al. 2008; Lamb et al. 2007). In Hershkowitz et al. (2007a), the use of the NICHD interview protocol reduced the false negative rate from 0.62 to 0.05. Unfortunately, the use of the NICHD protocol did not impact the false positive rates, which were 0.40 for nonprotocol and 0.48 for NICHD protocol interviews. There is no other forensic child interview protocol that comes close to the NICHD protocol in terms of empirical validation. Mandating the use of the NICHD interview protocol in all (or almost all[4]) forensic child interviews in cases of alleged or suspected CSA would be likely to have an immediate and dramatic positive impact on the false negative error rate. Thanks in large part to research conducted by Irit Hershkowitz, Michael Lamb, and their colleagues, the use of the NICHD interview protocol has already been mandatory in all CSA investigations in Israel for several years (Hershkowitz et al. 2007a).

[4] See Lamb et al. (2008) for a discussion of situations in which modifications to the protocol may be necessary.

There are other reforms that might improve the accuracy of judgments about the validity of CSA allegations. An allegation of CSA is an allegation that a serious crime has been committed. MHPs are not trained to investigate crimes or to collect the corroborative evidence (e.g., confessions, DNA evidence) needed to prove that a crime has or has not occurred. It may be time for primary responsibility for CSA investigations to be turned over to those who are trained to investigate crimes, the police (cf. Cross et al. 2005; Walsh 1993). In Sweden, the UK, and other countries, police already often take the lead role in investigating allegations of CSA and performing investigative interviews (Cederborg et al. 2000; Lamb et al. 2009). In the US, CPS caseworkers perform most initial investigations of allegations of sexual abuse by parents or caretakers. A few US jurisdictions are experimenting with having law enforcement personnel conduct all initial investigations of CSA allegations, including allegations of parental or caretaker abuse (Cross et al. 2005).

Other steps that may lead to improved accuracy include the following:

1. Develop, test, and refine empirically based decision aids for evaluating CSA allegations (cf. Baird and Wagner 2000; Baird et al. 1999; Herman 2005; Milner 1994).

2. Update published practice guidelines for forensic CSA evaluations (e.g., American Academy of Child and Adolescent Psychiatry 1997; American Professional Society on the Abuse of Children 1997) in order to incorporate evidence about the high risks of error when evaluators make judgments about uncorroborated allegations. As an adjunct to these guidelines, it would be helpful to create written checklists that could be used in the field to document compliance with best investigation and evaluation practices. The use of simple checklists has been shown to be a remarkably effective way of reducing morbidity and mortality in medical practice (e.g., Haynes et al. 2009).

3. Video- or audio-tape all child and adult interviews from start to finish (Berliner and Lieb 2001).

4. Given the abysmal performance of most human lie detectors (Bond and DePaulo 2006), it is time to take a closer look at both new (brain imaging) and old (polygraph) technological approaches to lie detection. In fact, numerous prosecutors and police investigators continue to use polygraph examinations to make decisions about which CSA allegations to pursue.

Other promising approaches to improving practice in this area are described in a recent volume edited by Kuehnle and Connell (2009).

Maximizing Utility

As Swets et al. (2000) explain, a second approach to improving diagnostic performance is to adjust decision thresholds. In some cases, adjusting decision thresholds can reduce overall error rates. In most cases, decision thresholds are adjusted in order to control the balance of false positive and false negative errors. In the current context, policy makers and legal decision makers could adjust decision thresholds by either increasing or decreasing the strength of evidence required for substantiation. There is a tradeoff between false positive and false negative error rates. As the evidentiary threshold for substantiation is raised, the false positive rate will decrease and the false negative error rate will increase.
If the threshold is set so high that no allegations are substantiated, then there would be no false positives and all of the true allegation cases would be false negatives. Conversely, if all allegations were substantiated, then there would be no false negatives and all of the false allegation cases would represent false positives.

The effect of raising or lowering the substantiation threshold for a typical study participant, Participant 32, is shown in Fig. 5. Figure 5 reflects an assumption of 10% overextremity. If Participant 32 substantiates in 50% of cases (as she reported doing), then her estimated false positive and false negative error rates are 0.28 and 0.35, respectively; her estimated overall error rate is 0.32. The ratio of false negatives to false positives at 50% substantiation is 1.8:1. The lowest total error rate, 0.29, would occur if Participant 32 lowered her evidentiary threshold so that she substantiated in 70% of cases; at 70% substantiation, the ratio of false negatives to false positives would be 1:2.2.

Fig. 5  Error rate tradeoffs for Participant 32 (overextremity = 10%; curves show the overall error, false positive, and false negative rates as a function of the substantiation rate, with the self-reported substantiation rate marked)

To achieve a false negative to false positive ratio of 10:1 (cf. Blackstone 1769, “it is better that ten guilty persons escape than that one innocent suffer”) would require lowering the substantiation rate to 0.24. To achieve a ratio of 100:1 (cf. Franklin 1785/1907, “it is better that a hundred guilty Persons escape than one innocent Person should suffer”) would require lowering the substantiation rate to 0.05.
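Curves like those in Fig. 5 can be traced by sweeping candidate substantiation rates through the implied-rates calculation sketched in the Analysis section. The loop below is our illustration, reusing the hypothetical implied_rates function and the invented five-bin distribution from that sketch; a target balance such as Blackstone's 10:1 can then be read off the sweep.

    # Illustrative sketch: trace the error tradeoff behind a Fig. 5-style
    # curve by sweeping substantiation rates for one hypothetical
    # distribution, reusing implied_rates() from the earlier sketch.

    bins = [0.10, 0.10, 0.15, 0.25, 0.40]     # invented, not study data
    for rate in [i / 10 for i in range(1, 10)]:
        r = implied_rates(bins, rate)
        fp = r["1-PPV"] * rate                # false positives, share of all cases
        fn = r["1-NPV"] * (1 - rate)          # false negatives, share of all cases
        print(f"rate={rate:.1f}  overall error={fp + fn:.2f}  FN:FP={fn / fp:6.1f}")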
Conclusion

This study contributes to our understanding of the disturbing magnitude and variability of error rates in professionals’ judgments about the validity of allegations of CSA. Findings of this study are consistent with the conclusions of the few past studies that provide data that are relevant to the empirical assessment of error rates in this domain. This study casts severe doubt on legal decision makers’ current reliance on MHPs’ decisions to substantiate uncorroborated sexual abuse allegations. Policy makers and legal decision makers should be informed that the median false positive rate for experts’ opinions about CSA allegations is likely to exceed 0.18, that false positive rates are highly variable, and that there are no reliable ways to identify accurate experts. Policy makers, with the assistance of scientific researchers, should carefully examine the tradeoff between false positive and false negative errors and determine whether or not current practices are consistent with core sociolegal values. If policy makers determine, for example, that current false positive rates are too high relative to false negative rates, then laws and policies should be enacted to modify the balance of error by adjusting evidentiary requirements for substantiation.

References

Aamodt, M. G., & Custer, H. (2006). Who can best catch a liar?: A meta-analysis of individual differences in detecting deception. The Forensic Examiner, 15, 6–11.

American Academy of Child and Adolescent Psychiatry. (1997). Practice parameters for the forensic evaluation of children and adolescents who may have been physically or sexually abused. Journal of the American Academy of Child and Adolescent Psychiatry, 36, 423–442. doi:10.1097/00004583-199703000-00026.

American Professional Society on the Abuse of Children. (1997). Psychosocial evaluation of suspected sexual abuse in children (2nd ed.). Chicago: Author.

Baird, C., & Wagner, D. (2000). The relative validity of actuarial- and consensus-based risk assessment systems. Children and Youth Services Review, 22, 839–871. doi:10.1016/S0190-7409(00)00122-5.

Baird, C., Wagner, D., Healy, T., & Johnson, K. (1999). Risk assessment in child protective services: Consensus and actuarial model reliability. Child Welfare Journal, 78, 723–748.

Baron, J. (2008). Thinking and deciding (4th ed.). New York: Cambridge University Press.

Berliner, L., & Conte, J. R. (1993). Sexual abuse evaluations: Conceptual and empirical obstacles. Child Abuse & Neglect, 17, 111–125. doi:10.1016/0145-2134(93)90012-t.

Berliner, L., & Lieb, R. (2001). Child sexual abuse investigations: Testing documentation methods. Olympia, WA: Washington State Institute for Public Policy.

Bikel, O. (Writer). (1997). Innocence lost: The plea [Television broadcast]. Boston and Washington, DC: Frontline/Public Broadcasting Service. Information about this television broadcast retrieved on June 11, 2008 from: http://www.pbs.org/wgbh/pages/frontline/shows/innocence/.

Blackstone, W. (1769). Commentaries on the laws of England (Vol. 4). Retrieved May 30, 2007, from Yale University, The Avalon Project Website: http://www.yale.edu/lawweb/avalon/blackstone/blacksto.htm.

Bond, C. F., Jr., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10, 214–234. doi:10.1207/s15327957pspr1003_2.

Bow, J. N., Quinnell, F. A., Zaroff, M., & Assemany, A. (2002). Assessment of sexual abuse allegations in child custody cases. Professional Psychology: Research and Practice, 33, 566–575. doi:10.1037/0735-7028.33.6.566.

Bowen, K., & Aldous, M. B. (1999). Medical evaluation of sexual abuse in children without disclosed or witnessed abuse. Archives of Pediatric and Adolescent Medicine, 153, 1160–1164.

Boyer, P. J., & Kirk, M. (Writers). (1998). The child terror [Television broadcast].
Boston and Washington, DC: Frontline/Public Broadcasting Service. Information about this television broadcast retrieved on June 11, 2008 from: http://www.pbs.org/wgbh/pages/frontline/shows/terror/.

Bruck, M., Ceci, S. J., & Hembrooke, H. (1998). Reliability and credibility of young children’s reports: From research to policy and practice. American Psychologist, 53, 136–151. doi:10.1037/0003-066X.53.2.136.

Ceci, S. J., & Bruck, M. (1995). Jeopardy in the courtroom: A scientific analysis of children's testimony. Washington, DC: American Psychological Association.

Cederborg, A. C., Orbach, Y., Sternberg, K. J., & Lamb, M. E. (2000). Investigative interviews of child witnesses in Sweden. Child Abuse & Neglect, 24, 1355–1361. doi:10.1016/S0145-2134(00)00183-6.

Christensen-Szalanski, J. J., & Bushyhead, J. B. (1981). Physicians’ use of probabilistic information in a real clinical setting. Journal of Experimental Psychology: Human Perception and Performance, 7, 928–935. doi:10.1037/0096-1523.7.4.928.

Colwell, L. H., Miller, H. A., Miller, R. S., & Lyons, P. M., Jr. (2006). US police officers’ knowledge regarding behaviors indicative of deception: Implications for eradicating erroneous beliefs through training. Psychology, Crime & Law, 12, 489–503. doi:10.1080/10683160500254839.

Connolly, D. A., Price, H. L., Lavoie, J. A. A., & Gordon, H. M. (2008). Perceptions and predictors of children’s credibility of a unique event and an instance of a repeated event. Law and Human Behavior, 32, 92–112. doi:10.1007/s10979-006-9083-3.

Cross, T. P., Finkelhor, D., & Ormrod, R. (2005). Police involvement in child protective services investigations: Literature review and secondary data analysis. Child Maltreatment, 10, 224–244. doi:10.1177/1077559505274506.

Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993).

Davies, D., Cole, J., Albertella, G., McCulloch, L., Allen, K., & Kekevian, H. (1996). A model for conducting forensic interviews with child victims of abuse. Child Maltreatment, 1, 189–199. doi:10.1177/1077559596001003002.

Dawes, R. M. (1994). House of cards: Psychiatry and psychotherapy built on myth. New York: The Free Press.

DePaulo, B. M., & Pfeifer, R. L. (1986). On-the-job experience and skill at detecting deception. Journal of Applied Social Psychology, 16, 249–267. doi:10.1111/j.1559-1816.1986.tb01138.x.

DePaulo, B. M., Charlton, K., Cooper, H., Lindsay, J. J., & Muhlenbruck, L. (1997). The accuracy-confidence correlation in the detection of deception. Personality and Social Psychology Review, 1, 346–357. doi:10.1207/s15327957pspr0104_5.

Drach, K. M., Wientzen, J., & Ricci, L. R. (2001). The diagnostic utility of sexual behavior problems in diagnosing sexual abuse in a forensic child abuse evaluation clinic. Child Abuse & Neglect, 25, 489–503. doi:10.1016/S0145-2134(01)00222-8.

Edelstein, R. S., Luten, T. L., Ekman, P., & Goodman, G. S. (2006). Detecting lies in children and adults. Law and Human Behavior, 30, 1–10. doi:10.1007/s10979-006-9031-2.

Elaad, E. (2003). Effects of feedback on the overestimated capacity to detect lies and the underestimated ability to tell lies. Applied Cognitive Psychology, 17, 349–363. doi:10.1002/acp.871.

Elliott, D. M., & Briere, J. (1994). Forensic sexual abuse evaluations of older children: Disclosures and symptomatology. Behavioral Sciences and the Law, 12, 261–277. doi:10.1002/bsl.2370120306.

Everson, M. D., & Boat, B. W. (1989). False allegations of sexual abuse by children and adolescents.
Faust, D., Hart, K., & Guilmette, T. J. (1988). Pediatric malingering: The capacity of children to fake believable deficits on neuropsychological testing. Journal of Consulting and Clinical Psychology, 56, 578–582. doi:10.1037/0022-006X.56.4.578.
Finlayson, L. M., & Koocher, G. P. (1991). Professional judgment and child abuse reporting in sexual abuse cases. Professional Psychology: Research and Practice, 22, 464–472. doi:10.1037/0735-7028.22.6.464.
Fisher, C. B. (1995). American Psychological Association's (1992) Ethics Code and the validation of sexual abuse in day-care settings. Psychology, Public Policy, and Law, 1, 461–478. doi:10.1037/1076-8971.1.2.461.
Fisher, C. B., & Whiting, K. A. (1998). How valid are child sexual abuse validations? In S. J. Ceci & H. Hembrooke (Eds.), Expert witnesses in child abuse cases: What can and should be said in court (pp. 159–184). Washington, DC: American Psychological Association.
Franklin, B. (1785/1907). The writings of Benjamin Franklin (Vol. 9). New York: Macmillan.
Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
Fukurai, H., & Butler, E. W. (1994). Sociologists in action: The McMartin sexual abuse case, litigation, justice, and mass hysteria. American Sociologist, 25, 44–71. doi:10.1007/BF02691989.
Garrido, E., Masip, J., & Herrero, C. (2004). Police officers' credibility judgments: Accuracy and estimated ability. International Journal of Psychology, 39, 254–275. doi:10.1080/00207590344000411.
Garven, S., Wood, J. M., Malpass, R. S., & Shaw, J. S., III. (1998). More than suggestion: The effect of interviewing techniques from the McMartin Preschool case. Journal of Applied Psychology, 83, 347–359. doi:10.1037/0021-9010.83.3.347.
Goodman, G. S., Emery, R. E., & Haugaard, J. J. (1998). Developmental psychology and law: Divorce, child maltreatment, foster care and adoption. In W. Damon, I. E. Sigel, & K. A. Renninger (Eds.), Handbook of child psychology (5th ed., Vol. 4, pp. 775–874). Hoboken, NJ: John Wiley & Sons.
Griffin, D., & Brenner, L. (2004). Perspectives on probability judgment calibration. In D. J. Koehler & N. Harvey (Eds.), Blackwell handbook of judgment and decision making (pp. 177–198). Malden, MA: Blackwell Publishing.
Haskett, M. E., Wayland, K., Hutcheson, J. S., & Tavana, T. (1995). Substantiation of sexual abuse allegations: Factors involved in the decision-making process. Journal of Child Sexual Abuse, 4, 19–47. doi:10.1300/J070v04n02_02.
Haynes, A., Weiser, T., Berry, W., Lipsitz, S., Breizat, A., Dellinger, E., et al. (2009). A surgical safety checklist to reduce morbidity and mortality in a global population. New England Journal of Medicine, 360, 491–499. doi:10.1056/NEJMsa0810119.
Herman, S. (2005). Improving decision making in forensic child sexual abuse evaluations. Law and Human Behavior, 29, 87–120. doi:10.1007/s10979-005-1400-8.
Herman, S. (2009). Forensic child sexual abuse evaluations: Accuracy, ethics, and admissibility. In K. Kuehnle & M. Connell (Eds.), The evaluation of child sexual abuse allegations: A comprehensive guide to assessment and testimony. Hoboken, NJ: John Wiley and Sons.
Hershkowitz, I., Fisher, S., Lamb, M. E., & Horowitz, D. (2007a). Improving credibility assessment in child sexual abuse allegations: The role of the NICHD investigative interview protocol. Child Abuse & Neglect, 31, 99–110. doi:10.1016/j.chiabu.2006.09.005.
Hershkowitz, I., Horowitz, D., & Lamb, M. E. (2007b). Individual and family variables associated with disclosure and nondisclosure of child abuse in Israel. In M. E. Pipe, M. E. Lamb, Y. Orbach, & A. C. Cederborg (Eds.), Child sexual abuse: Disclosure, delay, and denial (pp. 65–75). Mahwah, NJ: Lawrence Erlbaum Associates.
Horner, T. M., & Guyer, M. J. (1991a). Prediction, prevention, and clinical expertise in child custody cases in which allegations of child sexual abuse have been made: I. Predictable rates of diagnostic error in relation to various clinical decision making strategies. Family Law Quarterly, 25, 217–252.
Horner, T. M., & Guyer, M. J. (1991b). Prediction, prevention, and clinical expertise in child custody cases in which allegations of child sexual abuse have been made: II. Prevalence rates of child sexual abuse and the precision of "tests" constructed to diagnose it. Family Law Quarterly, 25, 381–409.
Horner, T. M., Guyer, M. J., & Kalter, N. M. (1992). Prediction, prevention, and clinical expertise in child custody cases in which allegations of child sexual abuse have been made: III. Studies of expert opinion formation. Family Law Quarterly, 26, 141–170.
Horner, T. M., Guyer, M. J., & Kalter, N. M. (1993). Clinical expertise and the assessment of child sexual abuse. Journal of the American Academy of Child and Adolescent Psychiatry, 32, 925–931. doi:10.1097/00004583-199309000-00006.
Horowitz, S. W., Lamb, M. E., Esplin, P. W., Boychuk, T., & Reiter-Lavery, L. (1995). Establishing the ground truth in studies of child sexual abuse. Expert Evidence, 4, 42–51.
Humphrey, H. H. (1985). Report on Scott County investigations. Minneapolis, MN: Minnesota Attorney General.
Johnson, J. (2004). Conviction tossed after 19 years: A man's molestation trial is nullified after several witnesses retract testimony they gave as children. Los Angeles Times, p. B1, May 1.
Jones, D. P., & McGraw, J. M. (1987). Reliable and fictitious accounts of sexual abuse to children. Journal of Interpersonal Violence, 2, 27–45. doi:10.1177/088626087002001002.
Kassin, S. M. (2002). Human judges of truth, deception, and credibility: Confident but erroneous. Cardozo Law Review, 23, 809–816.
Keary, K., & Fitzpatrick, C. (1994). Children's disclosure of sexual abuse during formal investigation. Child Abuse & Neglect, 18, 543–548. doi:10.1016/0145-2134(94)90080-9.
Kennedy v. Louisiana, 128 S. Ct. 2641 (2008).
Kuehnle, K., & Connell, M. (Eds.). (2009). The evaluation of child sexual abuse allegations: A comprehensive guide to assessment and testimony. Hoboken, NJ: John Wiley & Sons.
Kuehnle, K., & Sparta, S. N. (2006). Assessing child sexual abuse allegations in a legal context. In S. N. Sparta & G. P. Koocher (Eds.), Forensic mental health assessment of children and adolescents (pp. 129–148). New York, NY: Oxford University Press.
Lamb, M. E. (1994). The investigation of child sexual abuse: An interdisciplinary consensus statement. Child Abuse & Neglect, 18, 1021–1028. doi:10.1016/0145-2134(94)90127-9.
Lamb, M. E., Orbach, Y., Hershkowitz, I., Esplin, P. W., & Horowitz, D. (2007). A structured forensic interview protocol improves the quality and informativeness of investigative interviews with children: A review of research using the NICHD Investigative Interview Protocol. Child Abuse & Neglect, 31, 1201–1231. doi:10.1016/j.chiabu.2007.03.021.
Lamb, M. E., Hershkowitz, I., Orbach, Y., & Esplin, P. W. (2008). Tell me what happened: Structured investigative interviews of child victims and witnesses. Hoboken, NJ: John Wiley and Sons.
Lamb, M. E., Orbach, Y., Sternberg, K. L., Aldridge, J., Pearson, S., Stewart, H. L., et al. (2009). Use of a structured investigative protocol enhances the quality of investigative interviews with alleged victims of child sexual abuse in Britain. Applied Cognitive Psychology, 23, 449–467. doi:10.1002/acp.1489.
Leach, A., Talwar, V., Lee, K., Bala, N., & Lindsay, R. C. L. (2004). "Intuitive" lie detection of children's deception by law enforcement officials and university students. Law and Human Behavior, 28, 661–685. doi:10.1007/s10979-004-0793-0.
Leichtman, M. D., & Ceci, S. J. (1995). The effects of stereotypes and suggestions on preschoolers' reports. Developmental Psychology, 31, 568–578. doi:10.1037/0012-1649.31.4.568.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities: The state of the art to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases. New York: Cambridge University Press.
Mann, S., Vrij, A., & Bull, R. (2004). Detecting true lies: Police officers' ability to detect suspects' lies. Journal of Applied Psychology, 89, 137–149. doi:10.1037/0021-9010.89.1.137.
McGraw, J. M., & Smith, H. A. (1992). Child sexual abuse allegations amidst divorce and custody proceedings: Refining the validation process. Journal of Child Sexual Abuse, 1, 49–62. doi:10.1300/J070v01n01_04.
Melton, G. B., & Limber, S. (1989). Psychologists' involvement in cases of child maltreatment: Limits of role and expertise. American Psychologist, 44, 1225–1233. doi:10.1037/0003-066X.44.9.1225.
Milner, J. S. (1994). Assessing physical child abuse risk: The Child Abuse Potential Inventory. Clinical Psychology Review, 14, 547–583. doi:10.1016/0272-7358(94)90017-5.
Moxley, A. W. (1973). Clinical judgment: The effects of statistical information. Journal of Personality Assessment, 37, 86–91.
Nathan, D., & Snedeker, M. (2001). Satan's silence: Ritual abuse and the making of a modern American witch hunt. Lincoln, NE: Authors Choice Press.
National Children's Alliance. (2007). Press release. Retrieved June 1, 2008 from the National Children's Alliance Website: http://www.nca-online.org/pages/page.asp?page_id=6835.
Oates, R. K., Jones, D. P., Denson, D., Sirotnak, A., Gary, N., & Krugman, R. D. (2000). Erroneous concerns about child sexual abuse. Child Abuse & Neglect, 24, 149–157. doi:10.1016/S0145-2134(99)00108-8.
Oskamp, S. (1965). Overconfidence in case-study judgments. Journal of Consulting Psychology, 29, 261–265. doi:10.1037/h0022125.
Pawloski, J. (2005). Abuse warning ignored. Albuquerque Journal. Article citation retrieved on June 11, 2008 from the Albuquerque Journal archive Website: http://www.abqjournal.com/archives/search_newslib.htm.
Poole, D. A., & Lamb, M. E. (1998). Investigative interviews of children: A guide for helping professionals. Washington, DC: American Psychological Association.
Poole, D. A., & Lindsay, D. S. (1998). Assessing the accuracy of young children's reports: Lessons from the investigation of child sexual abuse. Applied & Preventive Psychology, 7, 1–26. doi:10.1016/S0962-1849(98)80019-X.
Rabinowitz, D. (2003). No crueler tyrannies: Accusation, false witness, and other terrors of our times. New York: Free Press.
Realmuto, G. M., & Wescoe, S. (1992). Agreement among professionals about a child's sexual abuse status: Interviews with sexually anatomically correct dolls as indicators of abuse. Child Abuse & Neglect, 16, 719–725. doi:10.1016/0145-2134(92)90108-4.
Realmuto, G. M., Jensen, J. B., & Wescoe, S. (1990). Specificity and sensitivity of sexually anatomically correct dolls in substantiating abuse: A pilot study. Journal of the American Academy of Child and Adolescent Psychiatry, 29, 743–746. doi:10.1097/00004583-199009000-00011.
Robinson, B. A. (2005). "McMartin" ritual abuse cases in Manhattan Beach, CA. Retrieved June 11, 2008 from the Ontario Consultants on Religious Tolerance Website: http://www.religioustolerance.org/ra_mcmar.htm.
Robinson, B. A. (2007). 42 multi-victim/multi-offender court cases with allegations of sexual and physical abuse of multiple children. Retrieved June 11, 2008 from the Ontario Consultants on Religious Tolerance Website: http://www.religioustolerance.org/ra_case.htm.
Rosenthal, R. (1995). State of New Jersey v. Margaret Kelly Michaels: An overview. Psychology, Public Policy, and Law, 1, 246–271. doi:10.1037/1076-8971.1.2.246.
San Diego County Grand Jury. (1992). Child sexual abuse, assault, and molest issues. San Diego, CA: Author.
San Diego County Grand Jury. (1994). Analysis of child molestation issues. San Diego, CA: Author.
Schreiber, N., Bellah, L. D., Martinez, Y., McLaurin, K. A., Strok, R., Garven, S., et al. (2006). Suggestive interviewing in the McMartin Preschool and Kelly Michaels daycare abuse cases: A case study. Social Influence, 1, 16–47. doi:10.1080/15534510500361739.
Seattle Post Intelligencer. (1998). Special report: A record of abuses in Wenatchee. Retrieved July 10, 2004 from http://seattlepi.nwsource.com/powertoharm/.
Shumaker, K. R. (2000). Measured professional competence between and among different mental health disciplines when evaluating and making recommendations in cases of suspected child sexual abuse. Dissertation Abstracts International, 60, 5791B.
Soanes, C., & Stevenson, A. (Eds.). (2005). The Oxford dictionary of English (2nd ed.). Oxford, UK: Oxford University Press.
State v. Brown, 297 Or 404 (1984).
State v. Southard, 347 Or 127 (2009).
Sternberg, K. J., Lamb, M. E., Davies, G. M., & Westcott, H. L. (2001). The memorandum of good practice: Theory versus application. Child Abuse & Neglect, 25, 669–681. doi:10.1016/S0145-2134(01)00232-0.
Swarns, R. L., & Cooper, M. (1998). 5 held in sex abuse of several children despite monitoring. New York Times. Retrieved June 11, 2008 from the New York Times Website: http://nytimes.com.
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1–26. doi:10.1111/1529-1006.001.
Talwar, V., Lee, K., Bala, N., & Lindsay, R. C. L. (2006). Adults' judgments of children's coached reports. Law and Human Behavior, 30, 561–570. doi:10.1007/s10979-006-9038-8.
U.S. Department of Health and Human Services. (2009). Child maltreatment 2007. Washington, DC: U.S. Government Printing Office.
Vrij, A. (2002). Deception in children: A literature review and implications for children's testimony. In H. L. Westcott, G. M. Davies, & R. H. C. Bull (Eds.), Children's testimony: A handbook of psychological research and forensic practice (pp. 175–194). Chichester: Wiley and Sons.
Vrij, A., & Mann, S. (2001). Who killed my relative? Police officers' ability to detect real-life high-stake lies. Psychology, Crime & Law, 7, 119–132. doi:10.1080/10683160108401791.
Vrij, A., & van Wijngaarden, J. J. (1994). Will the truth come out? Two studies about the detection of false statements expressed by children. Expert Evidence, 3, 78–83.
Vrij, A., Akehurst, L., Brown, L., & Mann, S. (2006). Detecting lies in young children, adolescents and adults. Applied Cognitive Psychology, 20, 1225–1237. doi:10.1002/acp.1278.
Waldman, H., & Jones, D. P. (2008). Why wasn't he stopped? Hartford Courant, p. A1, February 24.
Walsh, B. (1993). The law enforcement response to child sexual abuse cases. Journal of Child Sexual Abuse, 2, 117–121. doi:10.1300/J070v02n03_11.
Wexler, R. (1990). Wounded innocents. Amherst, NY: Prometheus Books.
Wilson, K. (2007). Forensic interviewing in New Zealand. In M. E. Pipe, M. E. Lamb, Y. Orbach, & A. C. Cederborg (Eds.), Child sexual abuse: Disclosure, delay, and denial (pp. 265–280). Mahwah, NJ: Lawrence Erlbaum Associates.
Wright, G. N., & Phillips, L. D. (1980). Cultural variation in probabilistic thinking: Alternative ways of dealing with uncertainty. International Journal of Psychology, 15, 239–257. doi:10.1080/00207598008246995.