I used to begin my lectures on probability in Intro Stats with the following slide:
Probability is a normalized denumerably additive measure defined over a sigma algebra of subsets of an abstract space.
If I remember correctly, that's a direct quotation from Kolmogorov, but I can't find the chapter and verse right now.
Following it was a flurry of note-taking activity despite the fact that my slides were available on the course web site (and apparently widely disseminated through a bunch of sites in complete violation of my copyrights). Why do people start writing stuff down if they don't understand it? Every time I put that slide up, I hoped someone would yell "Well, what on earth does that mean?" instead of writing it down, but I was regularly disappointed.
In a nutshell, it means that the concept of probability is a figment of our imagination.
Normalized means the probability of the whole space is 1, so every probability takes on a value between 0 and 1. Denumerably additive measure means you can add up the probabilities of a bunch of events in a finite or countably infinite set if those events are mutually exclusive. I am not even going to begin to try to explain in an intuitive manner what "a σ-algebra of subsets of an abstract space" means, but you can look it up on Wikipedia, which, if you didn't know what they were to begin with, may not help at all.
The point is, yes, all that sounds like gobbledygook, but when you come down to it, what you need to know to understand what a poll is and how to interpret its results is actually rather straightforward. It is therefore very disappointing that practically no one does it properly.
Let's take the simplest type of poll question: "If the election were held today, would you vote for President Obama?" We'll assume there are only two possible responses to this question: Yes and No.
Note that we are only interested in the answers of people who would indeed vote. We're not interested in the opinions of those who cannot or will not vote. Keep that in mind for later.
We want to say something about the true proportion of voters who would indeed vote for President Obama if the election were held today. Let's call that value p. The only way to know that is to hold an election today. But, the election is in November. So, instead, we do the next best thing, like a cook tasting a small amount of a well-mixed soup to figure out how salty it is, and ask a small number of people.
The key to the validity of any poll is the random selection of its respondents. If you just ask a bunch of people in the neighborhood, or look at the responses to an online poll taken during your favorite talk show host's program, you will bias your poll to reflect the preferences of people whose preferences do not reflect the true proportion of voters who'd vote for President Obama.
If you randomly and independently pick respondents for your poll, each response is what is called a random variable which takes the value Yes with probability p and the value No with probability (1 - p). That's because we assumed those are the only two responses.
You can't do arithmetic with Yes and No, so let's represent a Yes with the number 1, and represent a No with the number 0.
If you ask just one person at random, there are only two possible samples: {1} or {0}, whose probabilities are, respectively, p and (1 - p). If you ask two people picked randomly, there are four possible samples: S1={0,0} (neither person would vote for the president), S2={0,1} and S3={1,0} (one out of the two would vote for the president), and S4={1,1} (both people would vote for the president).
If these two respondents were picked independently, then we can multiply the probabilities of respondents declaring a preference for the president to get the probabilities of obtaining each sample.
If you get S1, the poll result is 0% support for the president, if you get S2 or S3, the poll result is 50% support for the president, and if you get S4, the poll result is 100% support for the president.
If respondents are picked independently, then the probability of getting two people who would not vote for the president is simply P(S1) = (1-p)×(1-p). The probabilities of getting S2 and S3 are the same: P(S2) = P(S3) = p×(1-p), and the probability that both people would vote for the president is P(S4) = p×p.
The key point to understand here is that there are two mutually exclusive ways our poll with only two people can indicate 50% support for the president, and therefore the probability that the poll indicates 50% support for the president is P(S2) + P(S3) = 2p(1-p).
Just for illustration, let's suppose the president has 80% support. That is, p = 0.8. Then, we have:
- P(poll shows 0% support) = 0.2×0.2 = 0.04 = 4%
- P(poll shows 50% support) = 2×0.8×0.2 = 0.32 = 32%
- P(poll shows 100% support) = 0.8×0.8 = 0.64 = 64%
Note that the probability that the poll will indicate at least 50% support for the president, assuming 80% would vote for him, is 0.32 + 0.64 = 0.96, i.e. 96%.
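The two-person arithmetic above can be checked by brute force. The following is only a minimal sketch: the function name `poll_distribution` is made up for illustration, and it assumes exactly what the text assumes (independent respondents, each answering 1 with probability p and 0 otherwise).

```python
from itertools import product

def poll_distribution(p, n):
    """Return {poll result: probability} for n independent respondents.

    Each respondent answers 1 (Yes) with probability p and 0 (No) otherwise.
    """
    dist = {}
    for sample in product((0, 1), repeat=n):       # every possible sample
        prob = 1.0
        for answer in sample:                       # independence: multiply
            prob *= p if answer == 1 else 1 - p
        share = sum(sample) / n                     # poll result for this sample
        dist[share] = dist.get(share, 0.0) + prob   # mutually exclusive: add
    return dist

dist = poll_distribution(p=0.8, n=2)
for share in sorted(dist):
    print(f"P(poll shows {share:.0%} support) = {dist[share]:.2f}")
```

Enumerating every sample is only practical for a handful of respondents, since the number of samples doubles with each additional person.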
Well, OK, so what? How do we figure out the true probability if we don't already know it?
The short answer is, we don't.
Now, most polls involve about 1,000 respondents (again, keep that in mind for later, there is a reason for it). There is, as before, only one sample with 1,000 respondents that will give us 0% for the president, and there is only one sample that will give us 100% for the president.
But, there are 1,000 ways such a sample can indicate 0.1% (1 out of 1,000) supporting the president, a whopping 499,500 ways such a sample can indicate 0.2% support, and gargantuan numbers of ways it can indicate 50% support. You can find the formula on Wikipedia and calculate them using the COMBIN function in Excel, but don't try to count them by hand.
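If you would rather not use Excel, the same counts can be reproduced in Python. This is just a sketch using the standard library's `math.comb`, which counts the ways to choose which k of the n respondents said Yes.

```python
from math import comb

n = 1000
print(comb(n, 0))   # ways to get 0% support: 1
print(comb(n, 1))   # ways to get 0.1% support: 1000
print(comb(n, 2))   # ways to get 0.2% support: 499500

ways_half = comb(n, 500)    # ways to get exactly 50% support
print(len(str(ways_half)))  # number of digits in that count
```

The count for 50% support runs to 300 digits, which is why counting by hand is not recommended.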
What is nice is we can approximate the distribution of samples using the Normal distribution:
Suppose one randomly samples n people out of a large population and asks them whether they agree with a certain statement. The proportion of people who agree will of course depend on the sample. If groups of n people were sampled repeatedly and truly randomly, the proportions would follow an approximate normal distribution with mean equal to the true proportion p of agreement in the population and with standard deviation σ = sqrt(p(1 − p)/n).
We refer to the distribution of all possible percentages that can be obtained from taking random samples of a given size n as the sampling distribution. Right smack in the middle of that distribution is the true proportion of the population supporting the president. Again, we will never know this number until and unless there is an election. What we do know is that about 95% of all sample proportions will lie within ±1.96σ of the true population proportion.
Therefore, we can say that the true population proportion p will be within ±1.96σ of the sample proportion about 95% of the time.
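Where the 1.96 comes from can be checked with the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; it assumes nothing beyond the normal approximation already described.

```python
from statistics import NormalDist

z = 1.96
std_normal = NormalDist()  # mean 0, standard deviation 1

# Share of a normal distribution lying within z standard deviations of its mean
coverage = std_normal.cdf(z) - std_normal.cdf(-z)
print(f"Share within ±{z} standard deviations: {coverage:.4f}")

# Going the other way: the multiplier that leaves 2.5% in each tail
multiplier = std_normal.inv_cdf(0.975)
print(f"Multiplier for 95% coverage: {multiplier:.4f}")
```

Both lines agree to the precision shown: ±1.96 standard deviations covers 0.9500 of the distribution, and the multiplier for 95% coverage rounds to 1.9600.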
We know that σ is determined by the sample size n and the true population proportion p. But we don't know p! What do we do?!
The best information we have on the true population proportion p is the sample proportion, which we'll denote using p′.
p′ is the number of people in our sample who'd vote for the president divided by the total number of people. For example, if 532 people declared they'd vote for him, p′ is 532/1,000 = 0.532 = 53.2%
Our estimate of the standard deviation of the sampling distribution, which is called the standard error, is then sqrt(p′(1 − p′)/n) = sqrt(0.532×0.468/1,000) ≈ 0.016. Multiplying that by 1.96 gives us the so-called 95% margin of error as approximately 0.031.
Thus, the 95% confidence interval for the population proportion of support for the president on the basis of this poll is 53.2% ± 3.1 percentage points.
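The steps from sample count to confidence interval can be sketched in a few lines of Python. The function name `confidence_interval` is hypothetical, and the code assumes the normal approximation with the 1.96 multiplier used above.

```python
from math import sqrt

def confidence_interval(yes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion."""
    p_hat = yes / n                       # sample proportion p′
    se = sqrt(p_hat * (1 - p_hat) / n)    # standard error
    moe = z * se                          # margin of error
    return p_hat, se, moe, (p_hat - moe, p_hat + moe)

p_hat, se, moe, (low, high) = confidence_interval(yes=532, n=1000)
print(f"p' = {p_hat:.3f}, standard error = {se:.4f}, margin = {moe:.4f}")
print(f"95% CI: {low:.1%} to {high:.1%}")
```

For 532 Yes answers out of 1,000 this reproduces the interval from 50.1% to 56.3%.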
This does not mean that there is a 95% chance that the true population proportion is between 50.1% and 56.3%.
The population proportion is either in this interval or it is not. That is, the probability that the population proportion is in this interval is either zero or one. We do not know. We cannot know without holding an actual election today.
What this means is that the confidence interval was constructed using a method such that 95% of confidence intervals so constructed would include the true population proportion.
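One way to see what "95% of confidence intervals so constructed" means is to simulate many polls from a population whose true proportion we get to choose. The sketch below is an illustration only: the true proportion of 0.53, the 2,000 repetitions, and the random seed are all arbitrary assumptions, not figures from any real poll.

```python
import random
from math import sqrt

def coverage(true_p, n, trials, z=1.96, seed=0):
    """Fraction of simulated polls whose interval contains true_p."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # One poll: n independent respondents, each Yes with probability true_p
        yes = sum(rng.random() < true_p for _ in range(n))
        p_hat = yes / n
        moe = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - moe <= true_p <= p_hat + moe:
            hits += 1
    return hits / trials

result = coverage(true_p=0.53, n=1000, trials=2000)
print(f"Share of intervals containing the true proportion: {result:.1%}")
```

The share should land close to 95%, though it will wobble a bit from run to run if you change the seed. Any individual interval either contains 0.53 or it does not; the 95% describes the method, not a single interval.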
That's why you do not take the results of a single poll, no matter how properly done, as gospel. If many polls, independently taken, all yield confidence intervals above 50%, your confidence that the true proportion of support is greater than 50% grows. But, a single poll is just that, a single poll.
Now, we can ask interesting questions. For example, what is the probability of getting a poll result of at least 53.2% support for the president if the true population proportion of support were 49%?
In that case, the standard deviation of the sampling distribution is sqrt(0.49×0.51/1,000) ≈ 0.0158, so the sampling distribution is approximated using N(0.49, 0.0158). From this distribution, we want the area under the normal curve to the right of 0.532. You can look it up in a table or just use Excel: 1 - NORMDIST(0.532, 0.49, 0.015808, 1) which gives you approximately 0.4%.
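The same area under the curve can be computed outside Excel. Here is a sketch using Python's `statistics.NormalDist`, with the figures that appear in the Excel formula above (mean 0.49, n = 1,000); the function name `upper_tail` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail(observed, assumed_p, n):
    """P(sample proportion >= observed) under the normal approximation."""
    sigma = sqrt(assumed_p * (1 - assumed_p) / n)   # sd of sampling distribution
    return 1 - NormalDist(mu=assumed_p, sigma=sigma).cdf(observed)

tail = upper_tail(observed=0.532, assumed_p=0.49, n=1000)
print(f"P(poll shows >= 53.2%) = {tail:.2%}")
```

`NormalDist(...).cdf(x)` plays the role of NORMDIST with its last argument set to 1, i.e. the cumulative distribution.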
What if you had sampled only 250 people, as I see on some TV shows, and 133 people (53.2%) had said they'd vote for the president? What is the probability of obtaining such a sample proportion assuming the true population support is 49%? The standard deviation of the sampling distribution in this case is sqrt(0.49×0.51/250) ≈ 0.032. Therefore, the probability of obtaining a sample proportion greater than or equal to 53.2% is 1 - NORMDIST(0.532, 0.49, 0.031616451, 1) which gives you approximately 9.2%.
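Since the normal curve is only an approximation, it can be worth cross-checking against the exact binomial count when the sample is small. The sketch below does that for 133 or more Yes answers out of 250, using the 0.49 that appears in the Excel formula; the helper name `exact_upper_tail` is hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_upper_tail(k_min, n, p):
    """Exact binomial P(at least k_min Yes answers out of n)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

n, p = 250, 0.49
sigma = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(mu=p, sigma=sigma).cdf(0.532)   # normal approximation
exact = exact_upper_tail(k_min=133, n=n, p=p)           # exact binomial

print(f"normal approximation: {approx:.1%}")
print(f"exact binomial:       {exact:.1%}")
```

The exact figure comes out somewhat higher than the approximation, because the normal curve treats the count as continuous while the binomial includes the whole probability of exactly 133. Either way, the result is far from rare with a sample this small.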
What is the standard error (i.e. our estimate of the standard deviation of the sampling distribution) if the sample proportion is 53.2% with a sample size of 250? That's simply sqrt(0.532×0.468/250) ≈ 0.032. That gives us a 95% margin of error of approximately 6.2 percentage points. Therefore, the 95% confidence interval for this poll would be approximately [0.47, 0.59].
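To see how the margin of error responds to sample size, here is a small sketch that holds the sample proportion at 53.2% and varies n. The function name `margin_of_error` and the extra sample size of 4,000 are assumptions added for illustration.

```python
from math import sqrt

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error under the normal approximation."""
    return z * sqrt(p_hat * (1 - p_hat) / n)

for n in (250, 1000, 4000):
    print(f"n = {n:>5}: ±{margin_of_error(0.532, n):.1%}")
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.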
What sample size would have given us a margin of error of only 1 percentage point? That's easy: solve 1.96×sqrt(0.532×0.468/n) = 0.01 for n, which gives n = 1.96²×0.532×0.468/0.01². You'd need about 9,565 observations to give you a margin of error of one percentage point.
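The sample size calculation amounts to turning the margin of error formula around and rounding up. A minimal sketch follows; `required_sample_size` is a made-up name, and the 1.96 multiplier is the one used throughout this post.

```python
from math import ceil

def required_sample_size(p_hat, margin, z=1.96):
    """Smallest n with z * sqrt(p_hat * (1 - p_hat) / n) <= margin."""
    return ceil(z**2 * p_hat * (1 - p_hat) / margin**2)

n_needed = required_sample_size(p_hat=0.532, margin=0.01)
print(f"{n_needed:,} respondents")   # 9,565 respondents
```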
Now, you usually do not know what sample proportion you'll get before you take a sample, so the usual rule of thumb is to calculate the required sample size using the assumption that p = 0.5. For example, this sample size calculator does it that way. This is conservative because p×(1-p) is maximized at p = 0.5 for p in [0, 1].
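To see why p = 0.5 is the conservative choice, one can tabulate the required sample size for a one percentage point margin across several assumed values of p. This is only a sketch; the grid of p values is arbitrary and `required_sample_size` is a hypothetical helper.

```python
from math import ceil

def required_sample_size(p, margin, z=1.96):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin."""
    return ceil(z**2 * p * (1 - p) / margin**2)

sizes = {p: required_sample_size(p, margin=0.01) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
for p, n in sizes.items():
    print(f"assumed p = {p:.1f}: n = {n:,}")
```

The requirement peaks at p = 0.5, at roughly 9,600 respondents, so planning for that case covers every other possible value of p.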
Conclusions
None of this means anything if the sample was not random and respondents were not independently selected.
Also, none of this means much if the correct population was not sampled. The key here is that 1) some people in the U.S. are not allowed to vote (for example, children, non-citizens etc), and 2) some people choose not to vote. Using their responses to judge what voters would do is inappropriate.
A 95% confidence interval does not mean there is a 95% chance the true population proportion lies within the interval. It means 95% of all samples will give confidence intervals that contain the population proportion.
A national poll does not say much about the outcome of the presidential election given the electoral college system. Individual state polls are much more informative.
The poll-taker can do everything right, but if there is an unknown reason that systematically leads some people not to respond to a poll and that factor is related to their political preferences, the poll result does not reflect the preferences of the general voting population.
In light of that, exit polls, conducted as voters are leaving polling places, are likely to be the most susceptible to sample selection bias.
That was almost 3/4 of an Intro Stats class all condensed into one blog post.
The fact that would testify to the veracity of Sinan's mini-lesson on statistics and polls is found in polls conducted prior to previous elections.
Invariably, the most accurate polls are those that poll "likely voters".
Polls that interview registered voters or simply all adults are not accurate, and the proof is out there since we can go back after elections and see which polls fared better.
The silver lining is that Romney leads Obama handily in all "likely voter" polls.