Creative Destruction

July 22, 2006

The Lancet Article

Filed under: International Politics,Iraq,Statistical Method — Adam Gurri @ 1:44 pm

As I unintentionally walked into a debate on this issue, I thought I’d take the time to look at it by itself.


I must concede to Ampersand that Slate’s criticism rests mostly on the assumption that such a large range (8,000 to 194,000) makes the Lancet estimate no better than a dart thrown at a dartboard. Yet Roberts, the author of the article, rebutted this point well–explaining the nature of confidence intervals and how the probabilities can be calculated from a normal distribution. As he puts it:

1. There is a 2.5 % chance that the number is lower than 8000, and a 2.5 % chance it’s higher than 194,000 (2.5 % + 2.5 % = 5 %, thus the 95 % chance the number is between 8000 and 194,000).
2. There is a 10 % chance that the number is lower than 45,000, and a 10 % chance it’s higher than 167,000 (thus an 80 % chance the number is between 45,000 and 167,000).
3. There is a 20 % chance that the number is lower than 65,000, and a 20 % chance it’s higher than 147,000 (thus a 60 % chance the number is between 65,000 and 147,000).
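Roberts’ nested intervals can be sanity-checked by fitting a normal distribution to the reported 95% range. This is a simplifying assumption on my part (the study’s actual estimate was not perfectly symmetric), so the narrower intervals come out close to, but not exactly at, his figures:

```python
from statistics import NormalDist

# Fit a normal distribution to the reported 95% interval of
# 8,000 to 194,000 (a simplifying assumption; the study's real
# estimate was not perfectly symmetric).
lo95, hi95 = 8_000, 194_000
mean = (lo95 + hi95) / 2                          # midpoint: 101,000
sd = (hi95 - mean) / NormalDist().inv_cdf(0.975)  # roughly 47,450

dist = NormalDist(mean, sd)
for level in (0.95, 0.80, 0.60):
    tail = (1 - level) / 2
    lo, hi = dist.inv_cdf(tail), dist.inv_cdf(1 - tail)
    print(f"{level:.0%} interval: {lo:,.0f} to {hi:,.0f}")
```

The 80% and 60% intervals land near, though not exactly on, Roberts’ 45,000 to 167,000 and 65,000 to 147,000, which is expected given the symmetry assumption.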

So that point is effectively put to rest. Yet there is still much in the conclusions the article draws that does not necessarily follow from the information the studies gathered.

To begin, the article itself can be read here. The first point that should be made is that one bias is built into the sample: families that were wiped out entirely during the pre-invasion period, in the slaughters that usually involved Kurdish families but not uncommonly Shia families as well, obviously would not be there to be interviewed. This is of course only one of the ways in which a comparison with pre-invasion Iraq is troubling, but it’s one that isn’t mentioned in the article.

It can also be seen on page 3 that six of Iraq’s provinces weren’t surveyed at all. These are Al-Basrah, Al-Muthanna, An-Najaf, Dahuk, Arbil, and Kirkuk. Their estimated populations as of 2003 are as follows:

  1. Al-Basrah: 2,600,000
  2. Al-Muthanna: Fewer than 1,000,000 (from an earlier, 1997 study)
  3. An-Najaf: 931,600
  4. Dahuk: 497,230
  5. Arbil: 1,134,300 (from 2001 estimate)
  6. Kirkuk: 949,000

These population figures are of course estimates, with Al-Muthanna’s and Arbil’s being older and perhaps less reliable–but for our purposes, I think it can be established that several million people were essentially removed from the list of households that could be chosen at random.
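Summing the figures above (with Al-Muthanna entered at a round 1,000,000, an upper bound on the “fewer than 1,000,000” given) puts the six unsampled provinces at roughly seven million people:

```python
# Estimated 2003 populations of the six unsampled provinces, from the
# figures listed above; the Al-Muthanna entry is an assumed upper bound.
populations = {
    "Al-Basrah": 2_600_000,
    "Al-Muthanna": 1_000_000,
    "An-Najaf": 931_600,
    "Dahuk": 497_230,
    "Arbil": 1_134_300,
    "Kirkuk": 949_000,
}
total = sum(populations.values())
print(f"{total:,}")  # prints 7,112,130
```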

In a response in the Lancet itself, Stephen Apfelroth makes a similar point–focusing on the fact that Roberts’ approach was statistically sound only if you assumed away local variance.

Although sampling of 988 households randomly selected from a list of all households in a country would be routinely acceptable for a survey, this was far from the method actually used—a point basically lost in the news releases such a report inevitably engenders. The survey actually only included 33 randomised selections, with 30 households interviewed surrounding each selected cluster point. Again, this technique would be adequate for rough estimates of variables expected to be fairly homogeneous within a geographic region, such as political opinion or even natural mortality, but it is wholly inadequate for variables (such as violent death) that can be expected to show extreme local variation within each geographic region. In such a situation, multiple random sample points are required within each geographic region, not one per 739 000 individuals.

Emphasis added by me.
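Apfelroth’s point about local variation can be sketched with a toy simulation. All the numbers here are invented for illustration, not drawn from the study: when violence is concentrated in a few hot spots, 33 clusters of 30 households give a much noisier estimate than the same ~990 households sampled independently.

```python
import random

random.seed(0)

# Toy country of 1,000 neighbourhoods where violent death is extremely
# localised: 5% of neighbourhoods are hot spots with a 10% death rate,
# the rest see only 0.1%. (Invented numbers, for illustration only.)
N_HOODS = 1_000
rates = [0.10 if i < 50 else 0.001 for i in range(N_HOODS)]
true_rate = sum(rates) / N_HOODS

def estimate(clustered):
    """One survey of ~990 households, clustered or simple-random."""
    if clustered:
        # 33 cluster points, 30 households each -- all 30 households
        # share the local death rate of their neighbourhood.
        hoods = [random.randrange(N_HOODS) for _ in range(33)]
        draws = [h for h in hoods for _ in range(30)]
    else:
        # 990 households drawn independently across the whole country.
        draws = [random.randrange(N_HOODS) for _ in range(990)]
    deaths = sum(random.random() < rates[h] for h in draws)
    return deaths / 990

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

cluster_sd = sd([estimate(True) for _ in range(1_000)])
srs_sd = sd([estimate(False) for _ in range(1_000)])
print(f"spread of cluster estimates: {cluster_sd:.4f}")
print(f"spread of simple-random estimates: {srs_sd:.4f}")
```

Both surveys are centred on the true rate, but the clustered one spreads considerably wider–which is Apfelroth’s complaint about precision, not a claim of bias.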

Apfelroth makes a number of other valid criticisms as well, and I recommend the article.

I also recommend this article, which offers a thorough statistical analysis of ways in which Roberts’ method could be improved, as well as of how the information from the studies could be more accurately categorized (combatants, collateral damage, and so on).

Finally, there is Roberts’ response.

He begins by defending the cluster sampling method, noting that it has similarly been applied to measuring starvation and other crises in countries around the world.

But he immediately acknowledges Apfelroth’s criticism:

Unfortunately, as Stephen Apfelroth rightly points out, our study and a similar one in Kosovo,3 suggest that in settings where most deaths are from bombing-type events, the standard 30-cluster approach might not produce a high level of precision in the death toll. But the key public-health findings of this study are robust despite this imprecision. These findings include: a higher death rate after the invasion; a 58-fold increase in death from violence, making it the main cause of death; and most violent deaths being caused by air-strikes from Coalition Forces. Whether the true death toll is 90 000 or 150 000, these three findings give ample guidance towards understanding what must happen to reduce civilian deaths.

In the end, he argues that the study isn’t without merit–while welcoming people such as Apfelroth to help improve the methodology.

However, I myself still believe that the margin of error in this projection is too great.

I will summarise my reasons:

  1. Six provinces were left out of the random sampling entirely
  2. The data on preinvasion Iraq is insufficient to use as a basis for comparison
  3. Without putting more resources into identifying local trends, interviewing thirty households at each of only 33 cluster points seems like a poor basis for projecting trends at the national level

That’s where I stand now–though I am but an interested amateur in these matters, and I welcome any criticism.

UPDATE: Ampersand provided a link to this article in The Chronicle of Higher Education, on this very subject.  Though it disagrees with my own assessment, it is a very even-handed, well-written take on the subject, and I recommend it.

8 Comments »

  1. Thanks for conceding the “dartboard” point was mistaken. However, I think your current critiques still show that you don’t understand this survey’s methodology well enough to critique it accurately.

    The first point that should be made is that one bias is immediately introduced into their sample–for those families that were killed entirely during the pre-invasion period, during the slaughters that usually involved Kurdish families but not uncommonly Shia families as well, they obviously would not be there to be interviewed.

    I don’t understand how you could have read the article and not have noticed the discussion of exactly this point. (“At the end of interviewing every 30 household cluster, one or two households were asked if in the area of the cluster there were any entire families that had died or most of a family had died and survivors were now living elsewhere. We did this to explore the likelihood that families with many deaths were now unlikely to be found and interviewed, creating a survivor bias among those interviewed.”)

    Furthermore, although you only mention the possibility that entire households could be killed before invasion, of course it is also possible for entire households to be killed post-invasion (based on the evidence of this survey, such wiped-out households are far more common post-invasion).

    I think it can be established that there are more than a couple million people who essentially were removed from the list of households that could be chosen at random.

    You’re simply wrong about this. Every known household in Iraq had an equal chance of being chosen by their methodology, regardless of what province they were in. A truly random sample, by definition, does not guarantee coverage of every province.

    Although sampling of 988 households randomly selected from a list of all households in a country would be routinely acceptable for a survey, this was far from the method actually used…

    This is simply ignorant. As Roberts pointed out in his response, 30 clusters is widely accepted within the field as the professional standard. The method Apfelroth suggests is incredibly impractical in a war zone – in essence, he’s suggesting a standard of measurement that would make studying war zone mortality impossible until after violence had ceased.

    Furthermore, the main effect of using 33 clusters rather than 988 individually randomized households is to lead to a wide confidence interval. But you’ve already agreed that a wide confidence interval is not, in and of itself, reason to dismiss a study.

    Again, this technique would be adequate for rough estimates of variables expected to be fairly homogeneous within a geographic region, such as political opinion or even natural mortality, but it is wholly inadequate for variables (such as violent death) that can be expected to show extreme local variation within each geographic region.

    Yes, but the most likely inaccuracy that could result from using cluster sampling of this sort is to underestimate the death toll. I’ll quote Daniel Davies on this:

    Although sampling textbooks warn against the cluster methodology in cases like this, they are very clear about the fact that the reason why it is risky is that it carries a very significant danger of underestimating the rare effects, not overestimating them. This can be seen with a simple intuitive illustration; imagine that you have been given the job of checking out a suspected minefield by throwing rocks into it.

    This is roughly equivalent to cluster sampling a heterogeneous population; the dangerous bits are a fairly small proportion of the total field, and they’re clumped together (the mines). Furthermore, the stones that you’re throwing (your “clusters”) only sample a small bit of the field at a time. The larger each individual stone, the better, obviously, but equally obviously it’s the number of stones that you have that is really going to drive the precision of your estimate, not their size. So, let’s say that you chuck 33 stones into the field. There are three things that could happen:

    a) By bad luck, all of your stones could land in the spaces between mines. This would cause you to conclude that the field was safer than it actually was.

    b) By good luck, you could get a situation where most of your stones fell in the spaces between mines, but some of them hit mines. This would give you an estimate that was about right regarding the danger of the field.

    c) By extraordinary chance, every single one of your stones (or a large proportion of them) might chance to hit mines, causing you to conclude that the field was much more dangerous than it actually was.

    How likely is the third of these possibilities (analogous to an overestimate of the excess deaths) relative to the other two? Not very likely at all. Cluster sampling tends to underestimate rare effects, not overestimate them[2].

    And 2), this problem, and other issues with cluster sampling (basically, it reduces your effective sample size to something closer to the number of clusters than the number of individuals sampled) are dealt with at length in the sampling literature. Cluster sampling ain’t ideal, but needs must, and it is frequently used in bog-standard epidemiological surveys outside war zones. The effects of clustering on standard results of sampling theory are known, and there are standard pieces of software that can be used to adjust (widen) one’s confidence interval to take account of these design effects. The Lancet team used one of these procedures, which is why their confidence intervals are so wide (although, to repeat, not wide enough to include zero). I have not seen anybody making the clustering critique who has any argument at all, from theory or data, which might give a reason to believe that the normal procedures are wrong for use in this case.
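Davies’ minefield intuition can be sketched numerically. The field size and mine fraction below are invented for illustration; this simple version captures the rarity of the hazard (and hence the skew of the estimate), and clumping only strengthens the effect when each cluster samples a whole area:

```python
import random

random.seed(0)

# Toy minefield: 1% of a 10,000-cell field is mined (cells 0-99, all in
# one clump). Throw 33 "stones" at random and estimate the mined
# fraction from the hits. (Invented numbers, for illustration only.)
FIELD = 10_000
mined = set(range(100))
true_frac = len(mined) / FIELD   # 0.01

under = over = 0
for _ in range(10_000):
    hits = sum(random.randrange(FIELD) in mined for _ in range(33))
    est = hits / 33
    if est < true_frac:
        under += 1   # every stone missed: the field looks safe
    elif est > true_frac:
        over += 1

print(f"underestimates {under / 10_000:.0%}, overestimates {over / 10_000:.0%}")
```

Most trials find no mines at all and so understate the danger, matching Davies’ point that cluster sampling of a rare, clumped effect is far more likely to underestimate than overestimate it.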

    You don’t say why you think the data on preinvasion Iraq is insufficient. Could you expand on this?

    Without putting more resources into identifying local trends, taking a little more than two dozen samplings from 33 localities seems like a poor basis for projecting trends on the national level

    The entire field of statistics is based on projecting large-scale trends from samples that are much smaller than the overall population. What you’re saying here, in essence, is that you don’t believe in statistics.

    The low number of samples is reflected in the wide confidence interval, using appropriate statistical methods. Saying “you can’t generalize from low numbers to high numbers” isn’t a valid argument; it’s an expression of ignorance about how statistical methods work.

    What you could say with accuracy is that you can’t generalize from such a small sample size without having the problem of a wide confidence interval. But we already know that.

    Comment by Ampersand — July 22, 2006 @ 4:12 pm | Reply

  2. As [Roberts] puts it:

    1. There is a 2.5 % chance that the number is lower than 8000, and a 2.5 % chance it’s higher than 194,000 (2.5 % + 2.5 % = 5 %, thus the 95 % chance the number is between 8000 and 194,000).
    2. There is a 10 % chance that the number is lower than 45,000, and a 10 % chance it’s higher than 167,000 (thus an 80 % chance the number is between 45,000 and 167,000).
    3. There is a 20 % chance that the number is lower than 65,000, and a 20 % chance it’s higher than 147,000 (thus a 60 % chance the number is between 65,000 and 147,000).

    So that point is effectively put to rest.

    Um, no it isn’t. What Roberts describes there is a credible interval, not a confidence interval; equating the two concepts is the Prosecutor’s fallacy.

    To illustrate the difference, imagine a school which is known to have exactly 500 boys and 500 girls. We could nevertheless take a random sample and formally calculate the 95% confidence interval. Suppose the result we get is that the interval is [470, 510]. Does that mean that there is a 95% chance that the number of boys lies between these two figures? Not at all: 500 boys lies with 100% probability between those two figures. Suppose we took another sample and calculated the confidence interval to be [501, 530]. The probability that the actual population statistic lies within the confidence interval is now precisely 0%.

    The correct statement of a 95% confidence interval is that 95% of the time we take a random sample, the population statistic lies within the interval. It doesn’t follow that the probability is 95% for any particular sample, because in general we have a priori information (or beliefs) about the statistic concerned that affects the probability.
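Daran’s school example can be simulated directly. This is a toy sketch with invented sample sizes, using the normal approximation without a finite-population correction (which makes the intervals slightly conservative): across repeated samples the true count of 500 boys falls inside roughly 95% or more of the computed intervals, yet any single interval either contains 500 or it does not.

```python
import random
from statistics import NormalDist

random.seed(0)

# Daran's school: exactly 500 boys among 1,000 pupils (1 = boy, 0 = girl).
pupils = [1] * 500 + [0] * 500
z = NormalDist().inv_cdf(0.975)   # ~1.96, for a 95% interval
TRIALS, N = 2_000, 100            # invented: 2,000 samples of 100 pupils

covered = 0
for _ in range(TRIALS):
    sample = random.sample(pupils, N)
    p = sum(sample) / N
    se = (p * (1 - p) / N) ** 0.5         # normal-approximation std error
    lo, hi = 1000 * (p - z * se), 1000 * (p + z * se)
    covered += lo <= 500 <= hi            # did this interval catch 500?

print(f"coverage over {TRIALS} samples: {covered / TRIALS:.1%}")
```

The 95% is a property of the interval-generating procedure over many samples, not a probability statement about any one computed interval–which is exactly Daran’s distinction.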

    Comment by Daran — July 22, 2006 @ 4:25 pm | Reply

  3. That’s twice now I’ve tried to

    nest blockquotes

    without it working. Did I balls up on both occasions, or do nested blockquotes no longer work?

    Comment by Daran — July 22, 2006 @ 4:33 pm | Reply

  4. However, I think your current critiques still show that you don’t understand this survey’s methodology well enough to critique it accurately.

    That’s why I do this in public rather than private–so that I can learn from your criticism of my criticism!🙂

    I don’t understand how you could have read the article and not have noticed the discussion of exactly this point.

    Sorry, that was sloppy of me. Though I don’t quite understand why they didn’t ask all of the surveyed houses this question.

    Furthermore, although you only mention the possibility that entire households could be killed before invasion, of course it is also possible for entire households to be killed post-invasion (based on the evidence of this survey, such wiped-out households are far more common post-invasion).

    You are right–once again, pointing out where I need to look a little harder.

    You’re simply wrong about this. Every known household in Iraq had an equal chance of being chosen by their methodology, regardless of what province they were in. A truly random sample, by definition, does not guarantee coverage of every province.

    I thought that the random samples were chosen from within several different areas–did I misunderstand?

    This is simply ignorant. As Roberts pointed out in his response, 30 clusters is widely accepted within the field as the professional standard. The method Apfelroth suggests is incredibly impractical in a war zone – in essence, he’s suggesting a standard of measurement that would make studying war zone mortality impossible until after violence had ceased.

    Would it be inaccurate to say that the 30-cluster approach is accepted as a standard in the same way that GDP measurements are an accepted standard?

    The entire field of statistics is based on projecting large-scale trends from samples that are much smaller than the overall population. What you’re saying here, in essence, is that you don’t believe in statistics.

    Ouch–if that’s what I’m saying, that’s not what I meant.

    Understand, I’ve an increasing interest in statistics–but this is a very recent development. It’s why it was so easy to point out to me the fallacy of the “dartboard” argument–which seemed persuasive mere months ago.

    But I’m a new student to it–and I’m “learning out loud”, if you will.

    I think my main problem has been attempting to look at this Lancet article in a vacuum–I need to look at this subject more broadly, so that I can have some sort of basis of comparison.

    I’ll be making a second go at this soon. I hope you aren’t finding this tedious–but I do appreciate the quality of feedback I’ve been getting.

    This back-and-forth is exactly why I wanted a group blog to begin with🙂

    Comment by Adam Gurri — July 22, 2006 @ 5:04 pm | Reply

    […] (EDIT: I originally oversimplified this, but for the purpose of this post, I’d just like to say that there’s some controversy over a Lancet article dealing with casualties in Iraq–and that’s one example of why reasonable disagreement can occur on the subject) […]

    Pingback by Creative Destruction » The Era of Passion — July 22, 2006 @ 5:20 pm | Reply

  6. I wrote:

    I don’t understand how you could have read the article and not have noticed the discussion of exactly this point.

    Rereading this, I’m embarrassed at how harsh-sounding it is. Sorry about that, Adam.

    Though I don’t quite understand why they didn’t ask all of the surveyed houses this question.

    I was wondering the exact same thing when I read that bit. I suspect there was some reason that was evident to the interviewers on the ground, but I can’t imagine what it might have been.

    Would it be inaccurate to say that the 30 clusters approached is accepted as a standard in the same way that GDP measurements are an accepted standard?

    It’s inaccurate to say that, I think, because GDP by definition is calculated using a specific formula (there are actually a couple of different formulas, strictly speaking, but really they’re all algebraic variations on the same thing). If you don’t calculate it that way, what you wind up with isn’t GDP. And if you stepped in a time machine and got out in 1970 or 2050, they’d still use the same formula to calculate GDP.
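    For reference, the expenditure form of the formula being described (C: consumption, I: investment, G: government spending, X - M: net exports):

    GDP = C + I + G + (X - M)

    The income and production approaches are rearrangements of this same accounting identity, which is why they are all “algebraic variations on the same thing.”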

    In this case, there’s no one definition of “wartime mortality,” or anything like that. The methodology used by the Lancet is, as I understand it, current best practices for doing the ultra-difficult task of measuring wartime mortality, but that doesn’t mean it won’t change or be improved upon in the future.

    But I’m a new student to it–and I’m “learning out loud”, if you will.

    Totally understood. Sorry I sounded so harsh. (And, by the way, I’m certainly no expert on stats myself!)

    Comment by Ampersand — July 22, 2006 @ 6:30 pm | Reply

  7. Did I balls up on both occasions, or do nested blockquotes no longer work?

    I’ve had the exact same problem here. It’s irritating.

    Comment by Ampersand — July 22, 2006 @ 6:31 pm | Reply

  8. Rereading this, I’m embarrassed at how harsh-sounding it is. Sorry about that, Adam.

    Hahaha, that’s quite alright!🙂

    We have a story that we tell in my family. These are all Cubans, understand–the hottest heads around.

    My grandfather was introduced to a friend of his niece. This friend and my grandfather got into an all-out shouting match, name-calling and all, over some political subject or other. His niece, meanwhile, sat there tensely gripping the sides of her chair, thinking “Jesus Christ, they hate each other.”

    When all was said and done, the friend got up and shook my grandfather’s hand, and said “that’s the most fun I’ve had in a while.”

    So a sadomasochistic love for debate is in my blood.

    Comment by Adam Gurri — July 22, 2006 @ 7:51 pm | Reply

