Bayes’ Theorem of Conditional Probability and the Ambiguity of Data
If it walks like a duck and quacks like a duck, is it possible it's not a duck? In this article, guest contributor Phil Mercy, a Six Sigma Master Black Belt working for Motorola Solutions in the UK, examines this oft-quoted maxim and describes how Bayes’ theorem of conditional probability makes raw data useful for making business decisions.
The Six Sigma practitioner faces a number of problems. While every business strives to be "Data Driven" in its decision making, we often look at the raw data only to find it noisy and potentially misleading. This leads to extensive analysis overhead, misgivings about the conclusions, and debate about the actions required.
Having encountered this situation many times, I’ve had to develop new approaches to work successfully with noisy data, and I have employed a number of these techniques at Motorola. This article aims to show one such approach. Its subject matter is somewhat tongue in cheek, but the math and techniques are very real.
First, let’s look at how the average human deals with data ambiguity.
"If it walks like a duck, and quacks like a duck, then it’s a duck."
This oft-quoted maxim is intuitively "correct" and accurately describes the human reasoning process. Evidence about an object, in this case whether it waddles or quacks, is used to help determine the nature of that object, i.e. whether it’s a duck or not. When the weight of evidence builds up in favour of any single outcome, a human will deduce that this result is the correct one. "If it walks like a duck, and quacks like a duck, then it’s a duck".
This seems ad hoc, and not analytically sound, but in practice this method works really well to guide our day-to-day decisions. We are squeamish about stereotyping, about "judging a book by the cover", but it does actually work most of the time. It works even better if you do more than just look at the cover. If the book jacket features a resplendent dragon, and your cursory reading of page one reveals a hero by the name of Prince F’gaark, then odds are that it’s one of those science fantasy novels you hate, and you make the decision not to buy it. After all, when it comes down to it, we’re all in the business of turning raw data into a correct decision for our business.
Luckily for us analytical types, there is a Mathematical formalism for this technique: Bayes’ theorem of conditional probability.
The probability of an event A occurring is changed if we know something about a related event B:
P(A|B) = P(B|A) * P(A) / P(B)
… and in English: the probability of A, given that B has occurred, is the probability of B, given that A has occurred, times the probability of A, all over the probability of B.
If we know that event A normally occurs when event B has already occurred, then knowing something about B may well change your view of A. For complex systems with multiple events (A, B, C and so on) being considered, a Bayesian Belief Network is often used to model the likelihood of an outcome. You’ll find Bayes used in a number of high technology areas such as complex risk analysis, data mining, machine data learning, artificial intelligence and language recognition.
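As a minimal sketch, the theorem can be written as a one-line Python function. The function name `bayes` and the numbers passed to it are illustrative assumptions only; they are not taken from the wildlife example that follows.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Made-up numbers for illustration: A is true 30% of the time, B is seen
# 40% of the time, and B is seen in 80% of the cases where A is true.
posterior = bayes(p_b_given_a=0.8, p_a=0.3, p_b=0.4)
print(f"P(A|B) = {posterior:.2f}")  # P(A|B) = 0.60
```

Knowing that B occurred doubles our belief in A here, from 30% to 60%.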
Let’s explore further with a (somewhat unlikely) example:
Let’s capture ducks: Imagine a wildlife reserve with a large population of birds of different species, and a whole bunch of scientists trying to learn something about their habits. On our wildlife reserve, the scientists know that approximately 50% of the birds are ducks. They install a humane trap to capture a single bird at a time for potential analysis. The trap has two exits under the control of the scientists, either back into the wild or into a remote lab area where some (humane) tests can be conducted prior to releasing the bird back into the reserve. The trap is concealed from the main lab area and the scientists cannot see what type of bird they have captured before deciding which door to open.
As these tests are time consuming and expensive, they really need some way of determining what species of bird they’ve captured before deciding whether to carry out the tests or not. Currently, they are interested in ducks.
What’s the probability that any single bird in our trap is a duck? Simply, in the absence of any evidence, the answer has to be 50%. All we have is an estimate of the proportion of birds that are ducks, 50%. A 50:50 chance is poor: half of our decisions would be wrong, and we’d end up spending half of the time and money available analysing birds that aren’t ducks. We need more evidence.
Figure 1: Our Unlikely Scenario
So, let’s detect ducks: The scientists ask the University boffins for help. They develop and install a device they call the "Quackometer" and trial it. When a bird enters the trap, the Quackometer monitors the birdcall and flags one of two outputs, either "Quack" or "No Quack". The scientists need some idea as to how good the Quackometer is. Rationally, it is unlikely to be perfect, but just because it isn’t 100% accurate doesn’t make it useless … it may help them a lot.
Let’s test the Quackometer: For an afternoon, the scientists caught birds in the trap, logged the Quackometer reading, and then released the birds into the lab to check whether they were ducks or not before releasing them. The results are as follows: 91 birds caught; 48 of them ducks; 43/48 (89.6%) of the ducks registered a "Quack" on the Quackometer.
Figure 2: Quackometer results
We can now fully characterise our Quackometer in terms of Bayes’ theorem.

We now know the likelihood of a bird emitting a "Quack". P(Quack) = 53.8%

We now have a better estimate of the likelihood of catching a Duck. P(Duck) = 52.7%

We also know the likelihood of getting a "Quack" if the bird is a Duck. P(Quack|Duck) = 89.6%

We can now use Bayes’ theorem to calculate the likelihood of the bird being a duck if the Quackometer output is "Quack":

P(Duck|Quack) = P(Quack|Duck) * P(Duck) / P(Quack) = 89.6% * 52.7% / 53.8% = 87.8%
We’ve just improved our duck capturing accuracy by introducing the Quackometer. We now expect only 12/100 mistakes as opposed to 50/100. But can we do better?
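The Quackometer arithmetic can be reproduced from the trial counts with a few lines of Python. This is only a sketch: the article quotes P(Quack) = 53.8% but not the raw number of "Quack" readings, so the figure of 49 readings below is an assumption, back-calculated from 53.8% of 91 birds.

```python
# Trial counts quoted in the article.
birds = 91
ducks = 48
duck_quacks = 43

# Assumption: total "Quack" readings, back-calculated from P(Quack) = 53.8%.
total_quacks = 49

p_duck = ducks / birds                  # prior probability of a duck
p_quack = total_quacks / birds          # overall chance of a "Quack"
p_quack_given_duck = duck_quacks / ducks

# Bayes' theorem: P(Duck|Quack) = P(Quack|Duck) * P(Duck) / P(Quack)
p_duck_given_quack = p_quack_given_duck * p_duck / p_quack

print(f"P(Duck)       = {p_duck:.1%}")              # 52.7%
print(f"P(Quack)      = {p_quack:.1%}")             # 53.8%
print(f"P(Quack|Duck) = {p_quack_given_duck:.1%}")  # 89.6%
print(f"P(Duck|Quack) = {p_duck_given_quack:.1%}")  # 87.8%
```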
Introducing the Waddleometer: When they developed the Quackometer, the boffins also came up with a second gadget. They called this the Waddleometer. Whilst it performed less well than the Quackometer, it still ostensibly detects ducks. It was trialled at the same time as the Quackometer and of the 48 Ducks caught, 34 produced a Waddle, so P(Waddle|Duck) = 70.8%. Together with the 11 non-ducks that also registered a Waddle, 45 of the 91 birds produced a Waddle, so P(Waddle) = 49.5%. This gives us P(Duck|Waddle) = P(Waddle|Duck) * P(Duck) / P(Waddle) = 70.8% * 52.7% / 49.5% = 75.6%. It’s not as good as the Quackometer, but still better than chance. Using the Waddleometer alone, we’d expect only 24/100 mistakes as opposed to 50/100.
Figure 3: Waddleometer results
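When the raw counts are available, Bayes’ theorem reduces to a simple ratio: of all the birds that produced a reading, what fraction were ducks? The sketch below shows this for the Waddleometer. The helper name `posterior_from_counts` is hypothetical; the counts (34 ducks and 11 non-ducks registering a Waddle) are those quoted from the trial.

```python
def posterior_from_counts(hits: int, false_alarms: int) -> float:
    """P(Duck|reading) as the share of positive readings that were ducks."""
    return hits / (hits + false_alarms)


duck_waddles = 34      # ducks that registered a Waddle
non_duck_waddles = 11  # non-ducks that registered a false Waddle

p_duck_given_waddle = posterior_from_counts(duck_waddles, non_duck_waddles)
mistakes_per_100 = round((1 - p_duck_given_waddle) * 100)

print(f"P(Duck|Waddle) = {p_duck_given_waddle:.1%}")  # 75.6%
print(f"Expected mistakes per 100 decisions: {mistakes_per_100}")  # 24
```

This gives the same answer as the formula because P(Waddle|Duck) * P(Duck) is just the share of all birds that are waddling ducks (34/91), and P(Waddle) is the share of all birds that waddle (45/91).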
At this point in our tale, you’d forgive the Scientists for deciding that the Quackometer was the better of the two devices, and adopting this as their duck detector. But then the scientists would be missing a trick. The Quackometer and Waddleometer have no technology in common, and are therefore independent measurements of whether a bird is a duck or not. In fact the only thing linking the two measurements is the bird. Using Bayes we can mathematically link evidence from these two devices for a more accurate assessment.
Figure 4 shows how we can do this using a Bayesian Belief Network. Each node represents an event which has a probability of occurring, and arrows represent how one event can influence the likelihood of another. Here we have three nodes.

"Bird" represents the probability that the Bird we capture is a Duck.

"Quackometer" represents the probability that the Quackometer produces a Quack

"Waddleometer" represents the probability that the Waddleometer indicates a Waddle
Figure 4: BBN Design
Arrows from "Bird" to "Quackometer" and from "Bird" to "Waddleometer" indicate that the output of both measurement devices depends on the type of bird we capture.
Figure 5 adds probability charts to our network and shows the default probability state we see when there’s no output from either the Quackometer or the Waddleometer. We use our trial data to define the probabilities in our nodes, giving: P(Duck) = 52.7%, P(Quack) = 53.8%, P(Waddle) = 49.5%. (Exactly how this is done is outside the scope of this article, but do look at the references at the end.)
Figure 5: BBN No Evidence
Figure 6 shows the situation where we have registered a "Quack" on the Quackometer. In Bayes speak, this is referred to as "evidence". Here we have evidence of a Quack, and the probability of "Quack" in our Quackometer node is now 100%. The probability of "Duck" in our Bird node increases to 87.8%, the result we calculated during the trials as P(Duck|Quack).
Figure 6: BBN Evidence of Quack
Interestingly, the probability in the Waddleometer node has changed also. Because the probability of our bird being a duck has risen to 87.8%, the probability of the Waddleometer indicating a waddle has also increased. You get a waddle output in one of two ways: either because it’s a duck whose waddle is detected, or because it’s not a duck and there’s a false reading. Mathematically this is [P(Waddle|Duck) * P(Duck)] + [P(Waddle|No Duck) * P(No Duck)]. What we haven’t evaluated before now is P(Waddle|No Duck), but looking at our trial data, 11 out of 43 non-ducks gave us a false "Waddle", which is 25.6%. Our Quackometer reading means that P(Duck) has increased to 87.8%, so P(Waddle) is now: (70.8% * 87.8%) + (25.6% * 12.2%) = 65.3%, the answer in our Bayesian Belief Network.
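The way a change in P(Duck) ripples through to the Waddleometer node can be checked with a short sketch of the two-route sum described above. The function name `predicted_reading` is hypothetical; the probabilities come from the trial data, and the 87.8% is the rounded posterior quoted in the text.

```python
def predicted_reading(p_duck: float,
                      p_reading_given_duck: float,
                      p_reading_given_non_duck: float) -> float:
    """Chance of a positive reading: true detection plus false alarm."""
    return (p_reading_given_duck * p_duck
            + p_reading_given_non_duck * (1 - p_duck))


p_waddle_given_duck = 34 / 48      # 70.8%
p_waddle_given_non_duck = 11 / 43  # 25.6%

# Before any evidence, P(Duck) is the trial estimate of 48/91.
before = predicted_reading(48 / 91, p_waddle_given_duck, p_waddle_given_non_duck)

# After a "Quack", P(Duck) has risen to 87.8%.
after = predicted_reading(0.878, p_waddle_given_duck, p_waddle_given_non_duck)

print(f"P(Waddle) with no evidence: {before:.1%}")  # 49.5%
print(f"P(Waddle) after a Quack:    {after:.1%}")   # 65.3%
```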
Figure 7 shows the situation where we have a single point of evidence from our Waddleometer. The probability of a Duck has increased to 75.6%, and the probability that we’d obtain a Quack has also increased to 71.1%.
Figure 7: BBN Evidence of Waddle
Figure 8 shows the situation where we have evidence of both Quack AND Waddle. With both sets of evidence, our likelihood P(Duck) rises to 95.2%. By using both types of evidence we’ve improved our success rate, and now only 5/100 decisions are wrong.
Figure 8: BBN Evidence of Quack and Waddle
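The combined figure can be reproduced without a BBN tool. The sketch below assumes, as the network design does, that the two readings are independent once the type of bird is known. The function name `posterior_duck` is hypothetical, and P(Quack|No Duck) = 6/43 is an assumption back-calculated from the quoted P(Quack) = 53.8%; the other values come from the trial data.

```python
import math


def posterior_duck(prior, given_duck, given_non_duck):
    """P(Duck | readings), treating readings as independent given bird type.

    given_duck and given_non_duck list the probability of each observed
    reading for a duck and for a non-duck respectively.
    """
    duck = prior * math.prod(given_duck)
    non_duck = (1 - prior) * math.prod(given_non_duck)
    return duck / (duck + non_duck)


P_DUCK = 48 / 91
P_QUACK_DUCK = 43 / 48
P_QUACK_NON_DUCK = 6 / 43     # assumed: back-calculated from P(Quack) = 53.8%
P_WADDLE_DUCK = 34 / 48
P_WADDLE_NON_DUCK = 11 / 43

with_quack = posterior_duck(P_DUCK, [P_QUACK_DUCK], [P_QUACK_NON_DUCK])
with_waddle = posterior_duck(P_DUCK, [P_WADDLE_DUCK], [P_WADDLE_NON_DUCK])
with_both = posterior_duck(P_DUCK,
                           [P_QUACK_DUCK, P_WADDLE_DUCK],
                           [P_QUACK_NON_DUCK, P_WADDLE_NON_DUCK])

print(f"Evidence of Quack:            {with_quack:.1%}")   # 87.8%
print(f"Evidence of Waddle:           {with_waddle:.1%}")  # 75.6%
print(f"Evidence of Quack and Waddle: {with_both:.1%}")    # 95.2%
```

The same arithmetic works in a spreadsheet: multiply the prior by each reading's likelihood for "duck" and for "not duck", then divide the duck product by the sum of the two products.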
Summary so far: We’ve used Bayes’ theorem to combine two pretty noisy measurement systems to provide a pretty good level of accuracy for our scientists, and they’ve reduced their wasted time/money from 50% to 5%. Although I’ve used graphics from a Bayesian Belief Network tool for clarity, I’ve also provided the raw equations so the same feat can be accomplished in a spreadsheet. We’ve achieved a powerful result, but this merely scratches the surface of what is possible using BBNs, and for those readers eager to look deeper I’ll point them to the references section.
"That’s all well and good", I hear you cry, "but normally we don’t get the luxury of setting up our own measurement system, we inherit the data".
Let’s continue with our scenario. Let’s assume that 2 years later a Professor visits the wildlife reserve. He’s interested in Duck populations and knows that the scientists have some raw data. Unfortunately, when he arrives he discovers that the records that have been kept over the years don’t include any detailed information about the Birds themselves, only the Quackometer and Waddleometer outputs from the trap when a bird is captured. What’s more, nobody remembers the initial trial results for the Quackometer and the Waddleometer. What can he do to deduce Duck populations from this noisy raw data?
Subject matter expert input: The Prof assembles the current group of scientists and quizzes them about the workings of the trap. He discovers:

If Quackometer says "Quack", and Waddleometer says "Waddle" then we believe it’s a duck and hold it for analysis

Very few of the birds we hold for analysis are not ducks

If we see a "Quack" but not a "Waddle" then it’s probably a duck, but we let it go

If we see a "Waddle" but not a "Quack" then it’s probably a duck, but we let it go

No "Quack", no "Waddle", then it’s almost certainly not a duck
Our Prof turns this expert testimony into a likelihood table for use in his analysis
Figure 9: Professor’s model
For each bird in the data: If he sees "Quack" and "Waddle", he’ll count 1 duck; if he sees "No Quack" and "No Waddle", he’ll count 0 ducks; if he sees either a "Quack" or a "Waddle" (but not both), he’ll count ½ a duck. When he shows this to the subject matter experts, they think it’s a bit crude but can’t suggest an improvement.
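The Professor’s counting rule is simple enough to express as a small function. This is a sketch only: the names `duck_weight` and `estimate_ducks` are hypothetical, and the five-bird log is made-up sample data for illustration, not part of the reserve’s records.

```python
def duck_weight(quack: bool, waddle: bool) -> float:
    """Professor's likelihood table: how much of a duck to count per bird."""
    if quack and waddle:
        return 1.0   # both readings: count a whole duck
    if quack or waddle:
        return 0.5   # exactly one reading: count half a duck
    return 0.0       # neither reading: count nothing


def estimate_ducks(readings) -> float:
    """Sum the weights over a log of (quack, waddle) pairs."""
    return sum(duck_weight(quack, waddle) for quack, waddle in readings)


# Made-up log of five trapped birds: (Quackometer, Waddleometer) outputs.
sample_log = [(True, True), (True, False), (False, True),
              (False, False), (True, True)]

print(estimate_ducks(sample_log))  # 3.0
```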
In order to test the Professor’s technique, I’ve randomly generated 2000 points of data using the likelihoods we got in the trials. Assuming the scientists trap and record exactly 40 birds a day, that’s 50 days’ worth of data. So how well does the technique perform?
Figure 10: Randomly generated data
Figure 10 is a histogram for the randomly generated data. The mean value is 20.76 ducks a day which, at 51.9%, is close to the trial’s estimate of 52.75% for P(Duck). We can also see that there’s a fair bit of variability in the number of ducks per day over the 50 days with a high of 29 and a low of 13 out of 40.
Figure 11: Professor’s estimate for Duck population
Figure 11 is the professor’s estimate using his likelihood table. The mean value is 21.32 which, at 53.3%, is a good estimate for P(Duck). The professor’s result also has a similar spread, with a high of 29 and a low of 15 out of 40. A two-sample t-test of the data confirms what we’d expect: statistically, you can’t tell the difference between the actual data and the professor’s estimate.
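Readers who want to try a similar experiment can use the sketch below. It is not the code behind Figures 10 and 11, so its numbers will differ from those quoted: it draws its own 2000 birds, assumes the two readings are independent given the bird type, and assumes P(Quack|No Duck) = 6/43 (back-calculated from the quoted P(Quack) = 53.8%). The t statistic is the unequal-variance (Welch) form; with 50 days per sample, an absolute value below roughly 2 indicates no significant difference at the 5% level.

```python
import math
import random
import statistics

# Likelihoods from the trial, keyed by "is the bird a duck?".
# P(Quack|No Duck) = 6/43 is an assumption, back-calculated from P(Quack).
P_DUCK = 48 / 91
P_QUACK = {True: 43 / 48, False: 6 / 43}
P_WADDLE = {True: 34 / 48, False: 11 / 43}

BIRDS_PER_DAY = 40
DAYS = 50


def professor_weight(quack: bool, waddle: bool) -> float:
    """Professor's table: both readings = 1, exactly one = 1/2, none = 0."""
    return (quack + waddle) / 2


def simulate(seed: int = 1):
    """Return (actual ducks per day, professor's estimate per day)."""
    rng = random.Random(seed)
    actual, estimate = [], []
    for _ in range(DAYS):
        ducks = 0
        weight = 0.0
        for _ in range(BIRDS_PER_DAY):
            is_duck = rng.random() < P_DUCK
            quack = rng.random() < P_QUACK[is_duck]
            waddle = rng.random() < P_WADDLE[is_duck]
            ducks += is_duck
            weight += professor_weight(quack, waddle)
        actual.append(ducks)
        estimate.append(weight)
    return actual, estimate


def welch_t(a, b) -> float:
    """Two-sample t statistic without assuming equal variances."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


actual, estimate = simulate()
print(f"Mean actual ducks per day:    {statistics.mean(actual):.2f}")
print(f"Mean estimated ducks per day: {statistics.mean(estimate):.2f}")
print(f"Welch t statistic:            {welch_t(actual, estimate):.2f}")
```

Changing the `seed` argument gives a different set of 50 days, which is a quick way to see how much the daily counts vary from run to run.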
Summary:
This is a significant result. We were confronted with noisy historical data, whose level of accuracy was unknown. We’ve used our knowledge of the working of conditional probabilities, and subject matter expert input, to construct a best guess likelihood table for the variable we are interested in. We’ve used this table to derive Duck population figures where that data wasn’t actually recorded in practice. Effectively, we have reverse engineered an analysis, and it is accurate enough to drive our decisions.
References:
http://agena.co.uk/
http://en.wikipedia.org/wiki/Bayesian_network