As I unintentionally walked into a debate on this issue, I thought I’d take the time to look at it by itself.

I must concede to Ampersand that Slate’s criticism seems mostly to play on the assumption that such a large range (8,000 to 194,000) is akin to a dartboard that Lancet is using. Yet Roberts, the author of the article, rebutted this point well–discussing the nature of confidence intervals and how probabilities within the range can be calculated assuming a normal distribution. As he puts it:

1. There is a 2.5% chance that the number is lower than 8,000, and a 2.5% chance it’s higher than 194,000 (2.5% + 2.5% = 5%, thus the 95% chance the number is between 8,000 and 194,000).
2. There is a 10% chance that the number is lower than 45,000, and a 10% chance it’s higher than 167,000 (thus an 80% chance the number is between 45,000 and 167,000).
3. There is a 20% chance that the number is lower than 65,000, and a 20% chance it’s higher than 147,000 (thus a 60% chance the number is between 65,000 and 147,000).
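Roberts’ arithmetic can be checked with a minimal sketch, assuming the estimate is normally distributed. I centre the distribution on the midpoint of the 95% interval (101,000) rather than the published point estimate (98,000) so that the 95% interval reproduces itself; the study’s actual estimate distribution isn’t exactly normal, so the narrower intervals printed here only approximate the figures Roberts quotes.

```python
from statistics import NormalDist

# Back out the standard deviation from the reported 95% CI half-width,
# assuming (for illustration only) a normal distribution.
ci_low, ci_high = 8_000, 194_000
mean = (ci_low + ci_high) / 2          # 101,000, midpoint of the CI
sd = (ci_high - ci_low) / (2 * 1.96)   # roughly 47,400

dist = NormalDist(mean, sd)
for level in (0.95, 0.80, 0.60):
    tail = (1 - level) / 2
    lo, hi = dist.inv_cdf(tail), dist.inv_cdf(1 - tail)
    print(f"{level:.0%} chance the toll is between {lo:,.0f} and {hi:,.0f}")
```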

So that point is effectively put to rest. Yet much about the conclusions the article draws does not necessarily follow from the information gathered by the studies.

To begin, the article itself can be read here. The first point that should be made is that one bias is immediately introduced into the sample: families that were killed entirely during the pre-invasion period–in the slaughters that usually involved Kurdish families, though not uncommonly Shia families as well–obviously would not be there to be interviewed. This is of course but one of the ways in which a comparison with pre-invasion Iraq is troubling, but it’s one that isn’t mentioned in the article.

It can also be seen on page 3 that six of Iraq’s provinces weren’t surveyed at all. These are Al-Basrah, Al-Muthanna, An-Najaf, Dahuk, Arbil, and Kirkuk. The estimated populations in these areas as of 2003 are as follows:

- Al-Basrah: 2,600,000
- Al-Muthanna: Fewer than 1,000,000 (from an earlier, 1997 study)
- An-Najaf: 931,600
- Dahuk: 497,230
- Arbil: 1,134,300 (from 2001 estimate)
- Kirkuk: 949,000

These population figures are of course estimates, with Al-Muthanna’s and Arbil’s being older and perhaps less reliable–but for our purposes, I think it can be established that there are more than a couple million people who essentially were removed from the list of households that could be chosen at random.

In a response in the Lancet itself, Stephen Apfelroth makes a similar point, focusing on the fact that Roberts’ approach would be statistically accurate only if local variance were assumed away:

Although sampling of 988 households randomly selected from a list of all households in a country would be routinely acceptable for a survey, this was far from the method actually used—a point basically lost in the news releases such a report inevitably engenders. The survey actually only included 33 randomised selections, with 30 households interviewed surrounding each selected cluster point. Again, this technique would be adequate for rough estimates of variables expected to be fairly homogeneous within a geographic region, such as political opinion or even natural mortality, but it is wholly inadequate for variables (such as violent death) that can be expected to show extreme local variation within each geographic region. In such a situation, multiple random sample points are required within each geographic region, not one per 739 000 individuals.

Emphasis added by me.
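Apfelroth’s worry about extreme local variation can be illustrated with a toy simulation. Every number below is invented purely for illustration (nothing is taken from the study): when violent deaths are concentrated in a few localities, an estimate built from 33 cluster points bounces around far more from survey to survey than a simple random sample of the same number of households would.

```python
import random

random.seed(1)

# Build a made-up population of 1,000 localities of 30 households each.
# 5% of localities are "hot spots" where 20% of households report a
# violent death; everywhere else the rate is 1%.
N_CLUSTERS, HOUSEHOLDS = 1_000, 30
rates = [0.20 if random.random() < 0.05 else 0.01 for _ in range(N_CLUSTERS)]
population = [[random.random() < r for _ in range(HOUSEHOLDS)] for r in rates]
flat = [h for cluster in population for h in cluster]

def cluster_estimate(k=33):
    """Estimate the death rate from k whole clusters (the survey's design)."""
    picked = random.sample(population, k)
    return sum(map(sum, picked)) / (k * HOUSEHOLDS)

def srs_estimate(n=33 * HOUSEHOLDS):
    """Estimate the death rate from n individually sampled households."""
    return sum(random.sample(flat, n)) / n

def sd(estimator, trials=2_000):
    """Standard deviation of an estimator across repeated surveys."""
    ests = [estimator() for _ in range(trials)]
    mean = sum(ests) / trials
    return (sum((e - mean) ** 2 for e in ests) / trials) ** 0.5

print("SD of 33-cluster estimates:    ", round(sd(cluster_estimate), 4))
print("SD of simple-random estimates: ", round(sd(srs_estimate), 4))
```

Both estimators are unbiased; the cluster design just pays for its practicality with a wider spread, which is exactly the imprecision Roberts concedes below.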

Apfelroth makes a number of other valid criticisms as well, and I recommend the article.

I also recommend this article, which goes into a detailed statistical analysis of ways in which Roberts’ method could be improved, as well as how the information from the studies could be more accurately categorized (combatants, collateral damage, etc.).

Finally, there is Roberts’ response.

He begins by defending the cluster sampling method, as it has similarly been applied to cases measuring starvation and other problems in countries around the world.

But he immediately acknowledges Apfelroth’s criticism:

Unfortunately, as Stephen Apfelroth rightly points out, our study and a similar one in Kosovo,3 suggest that in settings where most deaths are from bombing-type events, the standard 30-cluster approach might not produce a high level of precision in the death toll. But the key public-health findings of this study are robust despite this imprecision. These findings include: a higher death rate after the invasion; a 58-fold increase in death from violence, making it the main cause of death; and most violent deaths being caused by air-strikes from Coalition Forces. Whether the true death toll is 90 000 or 150 000, these three findings give ample guidance towards understanding what must happen to reduce civilian deaths.

In the end, he argues that the study isn’t without merit–while welcoming people such as Apfelroth to help improve the methodology.

However, I myself still believe that the margin of error in this projection is too great.

I will summarise my reasons:

- Six provinces were left out of the random sampling entirely
- The data on pre-invasion Iraq is insufficient to use as a basis for comparison
- Without putting more resources into identifying local trends, taking roughly thirty households from each of just 33 localities seems like a poor basis for projecting trends on the national level

That’s where I stand now–though I am but an interested amateur in these matters, and I welcome any criticism.

UPDATE: Ampersand provided a link to this article in The Chronicle of Higher Education, on this very subject. Though it disagrees with my own assessment, it is a very even-handed, well-written take on the subject, and I recommend it.

Thanks for conceding the “dartboard” point was mistaken. However, I think your current critiques still show that you don’t understand this survey’s methodology well enough to critique it accurately.

I don’t understand how you could have read the article and not have noticed the discussion of exactly this point. (“At the end of interviewing every 30 household cluster, one or two households were asked if in the area of the cluster there were any entire families that had died or most of a family had died and survivors were now living elsewhere. We did this to explore the likelihood that families with many deaths were now unlikely to be found and interviewed, creating a survivor bias among those interviewed.”)

Furthermore, although you only mention the possibility that entire households could be killed before invasion, of course it is also possible for entire households to be killed post-invasion (based on the evidence of this survey, such wiped-out households are far more common post-invasion).

You’re simply wrong about this. Every known household in Iraq had an equal chance of being chosen by their methodology, regardless of what province they were in. A truly random sample, by definition, does not guarantee coverage of every province.

This is simply ignorant. As Roberts pointed out in his response, 30 clusters is widely accepted within the field as the professional standard. The method Apfelroth suggests is incredibly impractical in a war zone – in essence, he’s suggesting a standard of measurement that would make studying war zone mortality impossible until after violence had ceased.

Furthermore, the main effect of using 33 clusters rather than 988 individually sampled households is a wide confidence interval. But you’ve already agreed that a wide confidence interval is not, in and of itself, reason to dismiss a study.

Yes, but the most likely inaccuracy that could result from using cluster sampling of this sort is to underestimate the death toll. I’ll quote Daniel Davies on this:

You don’t say why you think the data on preinvasion Iraq is insufficient. Could you expand on this?

The entire field of statistics is based on projecting large-scale trends from samples that are much smaller than the overall population. What you’re saying here, in essence, is that you don’t believe in statistics.

The low number of samples is reflected in the wide confidence interval, using appropriate statistical methods. Saying “you can’t generalize from low numbers to high numbers” isn’t a valid argument; it’s an expression of ignorance about how statistical methods work.

What you could say with accuracy is that you can’t generalize from such a small sample size without having the problem of a wide confidence interval. But we already know that.
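The point about sample size and interval width can be made concrete with a minimal sketch. The event rate p below is invented for illustration; the half-width of a 95% interval for a proportion shrinks like 1/sqrt(n), so a small sample doesn’t make an estimate invalid–it just makes the interval wide.

```python
from math import sqrt

# Hypothetical event rate, chosen only to illustrate the scaling.
p = 0.02

widths = {}
for n in (100, 1_000, 10_000):
    # Normal-approximation 95% CI half-width for a proportion.
    widths[n] = 1.96 * sqrt(p * (1 - p) / n)
    print(f"n = {n:>6}: 95% CI half-width = ±{widths[n]:.4f}")
```

A hundredfold increase in sample size only narrows the interval tenfold, which is why wide intervals are the price of feasible surveys rather than a sign of bad method.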

Comment by Ampersand — July 22, 2006 @ 4:12 pm

Comment by Daran — July 22, 2006 @ 4:25 pm

Comment by Daran — July 22, 2006 @ 4:33 pm

That’s why I do this in public rather than private–so that I can learn from your criticism of my criticism! 🙂

Sorry, that was sloppy of me. Though I don’t quite understand why they didn’t ask all of the surveyed houses this question.

You are right–once again, pointing out where I need to look a little harder.

I thought that the random samples were chosen from within several different areas–did I misunderstand?

Would it be inaccurate to say that the 30-cluster approach is accepted as a standard in the same way that GDP measurements are an accepted standard?

Ouch–if that’s what I’m saying, that’s not what I meant.

Understand, I’ve an increasing interest in statistics–but this is a very recent development. It’s why it was so easy to point out to me the fallacy of the “dartboard” argument–which seemed persuasive mere months ago.

But I’m a new student to it–and I’m “learning out loud”, if you will.

I think my main problem has been attempting to look at this Lancet article in a vacuum–I need to look at this subject more broadly, so that I can have some sort of basis of comparison.

I’ll be making a second go at this soon. I hope you aren’t finding this tedious–but I do appreciate the quality of feedback I’ve been getting.

This back-and-forth is exactly why I wanted a group blog to begin with 🙂

Comment by Adam Gurri — July 22, 2006 @ 5:04 pm

[…] (EDIT: I originally oversimplified this, but for the purpose of this post, I’d just like to say that there’s some controversy over a Lancet article dealing with casualties in Iraq–and that’s one example of why reasonable disagreement can occur on the subject) […]

Pingback by Creative Destruction » The Era of Passion — July 22, 2006 @ 5:20 pm

I wrote:

Rereading this, I’m embarrassed at how harsh-sounding it is. Sorry about that, Adam.

I was wondering the exact same thing when I read that bit. I suspect there was some reason that was evident to the interviewers on the ground, but I can’t imagine what it might have been.

It’s inaccurate to say that, I think, because GDP by definition is calculated using a specific formula (there are actually a couple of different formulas, strictly speaking, but really they’re all algebraic variations on the same thing). If you don’t calculate it that way, what you wind up with isn’t GDP. And if you stepped in a time machine and got out in 1970 or 2050, they’d still use the same formula to calculate GDP.
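Ampersand’s GDP point can be made concrete. The sketch below uses the standard expenditure identity, GDP = C + I + G + (X − M); the figures themselves are invented purely for illustration.

```python
# Expenditure approach: consumption + investment + government spending
# + net exports. The inputs below are made-up illustrative numbers.
def gdp_expenditure(c, i, g, exports, imports):
    """Gross domestic product via the expenditure identity."""
    return c + i + g + (exports - imports)

print(gdp_expenditure(c=11_000, i=3_000, g=3_500,
                      exports=2_000, imports=2_500))  # 17,000
```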

In this case, there’s no one definition of “wartime mortality,” or anything like that. The methodology used by the Lancet is, as I understand it, current best practices for doing the ultra-difficult task of measuring wartime mortality, but that doesn’t mean it won’t change or be improved upon in the future.

Totally understood. Sorry I sounded so harsh. (And, by the way, I’m certainly no expert on stats myself!)

Comment by Ampersand — July 22, 2006 @ 6:30 pm

I’ve had the exact same problem here. It’s irritating.

Comment by Ampersand — July 22, 2006 @ 6:31 pm

Hahaha, that’s quite alright! 🙂

We have a story that we tell in my family. These are all Cubans, understand–the hottest heads around.

My grandfather was introduced to a friend of his niece. This friend and my grandfather got into an all-out shouting match, with name-calling abound, over some political subject or other. His niece, meanwhile, is sitting there tensely gripping the sides of the chair, thinking “jesus christ, they hate each other.”

When all was said and done, the friend got up and shook my grandfather’s hand, and said “that’s the most fun I’ve had in a while.”

So a sadomasochistic love for debate is in my blood.

Comment by Adam Gurri — July 22, 2006 @ 7:51 pm