Bayes’ Theorem in the Face of Reality
Bayes’ Theorem is one of the most important concepts in probability theory. This simple formula drives many core functions in Machine Learning and Artificial Intelligence. Named after the 18th-century British mathematician Thomas Bayes, it is a formula for determining conditional probability: the likelihood of an outcome occurring, given that another outcome has already occurred.
The theorem first gained mainstream attention when it was used for a treasure hunt. During a hurricane on the evening of September 12, 1857, off the coast of the Carolinas, a ship named SS Central America was caught amidst the tumult of the waves. After being battered and bruised by the unforgiving seas, the ship’s timbers came undone and water seeped in. Only a quarter of the ship’s passengers were able to escape, and the rest, including the captain, were plunged into the depths of the ocean. For weeks, newspapers were filled with stories of horror as the event came to be known as one of the worst sea tragedies in the United States. While most tragedies are eventually forgotten, SS Central America was an exception. With time, people’s curiosity only increased as they began to wonder about the sunken ship’s whereabouts. This was because the ship, when it sank to the ocean floor, was carrying a cargo of gold from mines in California.
As years passed, many people attempted to find the remains of the SS Central America, but none succeeded in tracking it. The ocean was vast, and triangulating the ship’s coordinates was nigh impossible. That was until mathematics came to the rescue.
Fast forward to the 1980s, when an American engineer named Tommy Thompson began a new search for SS Central America. Through his charm, Thompson had persuaded 161 investors to fund his hunt for the long-lost ship. Thompson hired a man named Larry Stone, who was known for his pioneering work on Bayesian search theory, which uses probability to find missing objects, including sunken ships. Stone would use Bayes’ Theorem to weigh up different pieces of evidence based on their strength and then compute the probability of a particular conjecture being true.
Thompson’s team had collected hundreds of eyewitness accounts and reports from the time of the ship’s sinking and produced a list of places where the ship could be found. Stone combined all the information and ran computer simulations, employing Bayes’ Theorem to work out the most likely location of the shipwreck. He produced a map that covered 1,400 square miles of ocean, divided into a grid of two-mile-wide squares, with each square containing the probability of finding the ship at that location. With the help of the map, Thompson and his team began searching the ocean. The beauty of Bayes’ Theorem was that it allowed Thompson’s team to update the probability of finding the treasure in each square as they moved from one location to another. Stone’s map told them where to begin their search, and Bayes’ theorem told them how to update all of the probabilities as they went along and whenever they found clues. And find clues they did: pieces of broken china, washbowls, and toys lying on the seabed, all dating from the nineteenth century. Thompson’s team used the newfound evidence to update the probabilities and narrow down the location of the lost treasure. Finally, on September 11, 1988, the crew’s undersea vessel spotted SS Central America. The lost treasure had been found, thanks in no small part to Bayes’ Theorem.
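To make the idea of updating a search map concrete, here is a minimal Python sketch of the standard Bayesian search update after a square has been searched without success. It is not Stone's actual model: the three-cell grid, the prior probabilities, and the 80% detection chance are made-up numbers for illustration, and the function name is hypothetical.

```python
def update_after_failed_search(probs, searched, detect_prob):
    """Update grid probabilities after searching one cell and finding nothing.

    probs       -- prior probability that the wreck is in each cell (sums to 1)
    searched    -- index of the cell that was just searched
    detect_prob -- assumed chance of spotting the wreck if it really is there
    """
    # Probability of finding nothing: either the wreck is elsewhere,
    # or it is in this cell and the search missed it.
    p_nothing_found = 1.0 - probs[searched] * detect_prob
    updated = []
    for i, p in enumerate(probs):
        if i == searched:
            # The wreck is here but was missed.
            updated.append(p * (1.0 - detect_prob) / p_nothing_found)
        else:
            # Every other cell becomes relatively more likely.
            updated.append(p / p_nothing_found)
    return updated


# Illustrative three-cell "map" (assumed numbers, not historical data).
grid = [0.5, 0.3, 0.2]
after = update_after_failed_search(grid, searched=0, detect_prob=0.8)
print([round(p, 3) for p in after])
```

A failed search does not rule a square out entirely; it only lowers that square's probability and raises everyone else's, which is why the map keeps shifting as the search goes on.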
Put very simply, Bayes’ theorem teaches us that for any views that you hold, you should continually update your conclusions when new evidence emerges. It’s a way of integrating probability thinking into our lives. Before we delve into the formula for Bayes Theorem, let’s learn about how it was discovered.
Thomas Bayes was an English minister in the 18th century whose most famous work, “An Essay towards solving a Problem in the Doctrine of Chances,” was brought to the Royal Society’s attention in 1763, two years after his death, by his friend Richard Price. The essay did not state the theorem in its modern form but could be described as the theorem’s blueprint. In his essay, Bayes argued for adjusting our estimates of probabilities when we encounter new data that bears on a particular situation.
Bayes used a thought experiment to illustrate his arguments. Imagine that Bayes has his back turned to a billiard table, and he asks his assistant to drop a ball on the table. The table design is perfect in that the ball has an equal chance of landing at any one place on the table. Now Bayes has to figure out where the ball is without looking. He asks his assistant to throw another ball on the table and report whether it is to the left or the right of the first ball. If the new ball lands to the left of the first ball, then the first ball is more likely to be on the right side of the table than the left side. He asks his assistant to throw another ball. If it again lands to the left of the first ball, then the first ball is even more likely than before to be on the right side of the table. Bayes keeps asking his assistant to throw balls on the table. Over multiple attempts, he can narrow down the area in which the first ball probably sits, because each new piece of information constrains where the first ball is likely to be. The philosophy behind Bayes’ thought experiment was simple: we ought to update our beliefs about an event as new evidence shows itself.
While Thomas Bayes laid the foundation for Bayesian thinking, it was the French scholar Pierre-Simon Laplace and others who codified the theorem into the formula we use today, one that went on to play a pivotal role during World War II.
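The thought experiment is easy to simulate. The sketch below is only an illustration under simplifying assumptions: the table is reduced to a single left-to-right axis from 0 to 1, the candidate positions of the first ball are a coarse grid, and the function name and parameters are hypothetical.

```python
import random


def estimate_first_ball(true_position, throws, rng, grid_size=101):
    """Estimate where the first ball sits from left/right reports alone."""
    # Candidate positions: 0.0 is the left edge, 1.0 is the right edge.
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Before any throw, every position is equally believable.
    belief = [1.0 / grid_size] * grid_size
    for _ in range(throws):
        # The assistant only reports whether the new ball landed to the left.
        landed_left = rng.random() < true_position
        # If the first ball is at x, a new ball lands left with probability x.
        belief = [b * (x if landed_left else 1.0 - x)
                  for b, x in zip(belief, grid)]
        total = sum(belief)
        belief = [b / total for b in belief]
    # Report the mean of the updated belief as a single best guess.
    return sum(x * b for x, b in zip(grid, belief))


rng = random.Random(42)
for n in (0, 10, 100, 1000):
    print(n, round(estimate_first_ball(0.7, n, rng), 3))
```

With no throws the best guess is the middle of the table; as the reports pile up, the guess tends to settle near the true position, which is the narrowing-down that Bayes described.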
In 1941, German U-boats were devastating Allied shipping. Britain was cut off from its sources of food and couldn’t grow enough on its own soil to feed its citizens, so the need to understand the U-boats’ signals was crucial. The German codes, produced by Enigma machines with customizable wheel settings that allowed the codes to be changed rapidly, were considered unbreakable. This attracted the great mathematician Alan Turing to the problem. He built a machine that could test different code possibilities. But the machine had a problem. It was too slow.
The machine needed four days to test all 336 wheel orders on a particular Enigma code. Until more machines could be built, Turing had to find a way to reduce the burden on the machine. That’s when Bayesian thinking came to his rescue. Turing used a Bayesian system to guess the letters in an Enigma message and to add more clues as new data arrived. With this method, he could reduce the number of wheel settings to be tested by his machine from 336 to as few as 18, allowing him to decrypt Enigma messages far more quickly. The rest, as we know, is history: Turing’s machine played an important role in the outcome of the war.
Now, let’s imagine a hypothetical scenario to understand how Bayes Theorem works.
Suppose you feel sick and go to the doctor. The doctor prescribes you some tests, so you quickly get them done. The test results come in, and your doctor tells you that you’re positive for a rare disease that affects 0.1% of the population; 99.9% of people are unaffected by it. Being a curious person, you ask the doctor about the test’s accuracy. She says that the test correctly identifies the disease 99% of the time in people who have it, and gives a false positive 1% of the time in people who don’t. Now, what are the chances that you have the disease? Bayes’ Theorem can come to the rescue. Let’s look at the formula:
P(A|B) = [P(B|A) * P(A)] / P(B)
Let P(A) be the chance that you have the disease prior to you taking the test.
Let P(B) be the probability of the event that you test positive.
Let’s expand P(B) further.
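P(B) = [P(A) * P(B|A)] + [P(-A) * P(B|-A)]
In general terms, P(B) = [the probability of you having the disease and correctly testing positive] + [the probability of you not having the disease and falsely testing positive]. Here, -A denotes the event that you do not have the disease.
So, we have
P(A|B) = [P(B|A) * P(A)] / [P(B|A) * P(A) + P(B|-A) * P(-A)]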
Before we proceed, let’s take some time to clear some common points of confusion.
P(A|B) is the posterior probability, or the probability of A occurring given that event B has already occurred. When we write P(A|B), it simply means ‘the probability of event A happening when event B has already taken place’. In our case, it is the probability of you having the disease, given you have tested positive.
P(B|A) is the likelihood, or the probability of B given A. In our case, P(B|A) is the probability that you would test positive if you had the disease, i.e., the probability of event B occurring given that event A has already happened.
The | symbol translates as ‘given’. A probability expressed in this way is a conditional probability because it’s the probability of a belief, A, conditioned by the evidence presented by B.
On the other hand, P(A) is the prior probability of event A, and P(B) is the overall probability of event B, before anything else is known.
Mathematically, for any two events A and B with non-zero probabilities,
P(A|B) = P(A∩B) / P(B)
Similarly, P(B|A) = P(A∩B) / P(A)
where P(A∩B) is the probability of both A and B happening.
However, if A and B are independent events,
then P(A|B) = P(A)
and P(B|A) = P(B)
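These definitions can be checked by brute-force counting. The sketch below uses two fair dice, which is an example chosen here for illustration and not part of the disease scenario; the event names are likewise hypothetical. It counts outcomes to compute P(A|B) = P(A∩B) / P(B) for one dependent pair of events and one independent pair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a function of one outcome."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def conditional(event_a, event_b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)


def first_is_six(o):
    return o[0] == 6


def sum_at_least_ten(o):
    return o[0] + o[1] >= 10


def second_is_even(o):
    return o[1] % 2 == 0


# Dependent events: knowing the sum is large makes a six more likely.
print(prob(first_is_six), conditional(first_is_six, sum_at_least_ten))  # 1/6 1/2
# Independent events: the second die says nothing about the first.
print(prob(first_is_six), conditional(first_is_six, second_is_even))    # 1/6 1/6
```

Exact fractions are used instead of floating-point numbers so that the equality P(A|B) = P(A) for independent events holds exactly and is not blurred by rounding.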
The prior probability that a hypothesis is true (in our case, the probability of you having the disease before you were even tested) is often the most challenging part of the equation to figure out, and sometimes it’s no better than a guess.
In our case, P(A) has to be an estimate. Assuming you are a typical member of the general population, in which 0.1% of people have the disease,
P(A) = 0.001 or 0.1%
P(B|A) = 0.99
P(-A) = 0.999
P(B|-A) = 0.01
Solving the equation,
P(A|B) = [P(B|A) * P(A)] / [P(B|A) * P(A) + P(B|-A) * P(-A)]
P(A|B) = [0.99 * 0.001] / [(0.99 * 0.001) + (0.01 * 0.999)]
P(A|B) = 0.00099 / 0.01098
P(A|B) ≈ 0.09 or 9%
So, even after testing positive on a 99% accurate test, the chance of you having the disease is only about 9%.
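For readers who prefer code to algebra, the same calculation can be written as a small Python function. This is only a sketch of the arithmetic in this example; the function and parameter names are assumptions made here, with "sensitivity" standing for P(B|A) and "false_positive_rate" for P(B|-A).

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(A|B): probability of having the disease given a positive test."""
    # P(B) by the law of total probability: true positives plus false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
    return sensitivity * prior / p_positive


p = posterior_given_positive(prior=0.001, sensitivity=0.99,
                             false_positive_rate=0.01)
print(f"P(disease | positive) = {p:.4f}")  # 0.0902
```

The result is small because the disease is so rare: the 1% of healthy people who test positive far outnumber the sick people who do.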
The beauty of the Bayes theorem is not just that it allows us to calculate conditional probabilities; it is also that with each piece of new information, one can update their conclusions.
Suppose you now go to another doctor for a second opinion and have the test run again by a different lab, and it comes back positive again. Once again, you can use Bayes’ theorem with updated inputs, assuming the two tests’ errors are independent of each other.
This time, P(A) is the posterior probability we just calculated, i.e., 9%, which now serves as our new prior.
Then P(A|B) is calculated to be
P(A|B) = [0.99 * 0.09] / [(0.99 * 0.09) + (0.01 * 0.91)]
P(A|B) = 0.0891 / (0.0891 + 0.0091)
P(A|B) = 91% (approx.)
This time, in the face of new evidence, the probability rises to about 91%.
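The repeated update can be sketched as a loop in which each posterior becomes the next prior. The code below is illustrative only: it assumes, as the example does, that every test has the same 99% and 1% error rates and that repeated tests err independently of one another, which may not hold for real lab tests. The names are hypothetical.

```python
def update(prior, sensitivity=0.99, false_positive_rate=0.01):
    """One Bayesian update after a positive test result."""
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


belief = 0.001          # prior: prevalence in the general population
history = [belief]
for _ in range(2):      # two positive results in a row
    belief = update(belief)   # yesterday's posterior is today's prior
    history.append(belief)

print([round(b, 4) for b in history])
```

Carrying the unrounded 9.02% forward instead of the rounded 9% gives a second posterior of roughly 90.75%, which agrees with the approximate 91% above.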
At the heart of Bayesian thinking, Bayes’ theorem tells us how to update our beliefs or conclusions in light of new evidence. However, it doesn’t tell us how to set our prior convictions. For instance, some people may hold a particular idea with 90% certainty while others hold the same belief with 10% certainty. Our prior perception of the truth of certain ideas or events is always relative, and we do not always have a number to attach to it. Often the probabilistic weight we give to our prior beliefs comes not from mathematical introspection but from preconditioned perceptions.
Knowing the exact math of probability calculations in Bayes Theorem is not the key to understanding Bayesian thinking. More important is our willingness to update our prior beliefs as new information becomes manifest to us. But then again, is that a wise way to live our days?
Since Bayesian math governs a significant chunk of our modern algorithms, shouldn’t we look at the costs of adopting such a way of thinking? Today, the Bayes theorem is one of the most essential concepts of probability theory used in Data Science. It allows algorithms to update their behaviour based on the appearance of new events/information. The fundamental idea of Bayesian inference is to become ‘less wrong’ with more data. The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information. Algorithms do the same. But is being less wrong the right way to approach a problem? At least from a human point of view, some apprehensions may arise. The problem lies less with the method and more with the ideology. But let’s see the situation from a methodological point of view.
Bayesian analysis does not tell you how to select a prior, and there is no single correct way to choose one. Bayesian inference requires skill in translating subjective prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results. Furthermore, the analysis can produce posterior distributions that are excessively sensitive to the choice of prior.
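The sensitivity to the prior is easy to see with the disease test from earlier. The sketch below keeps the test fixed (99% detection, 1% false positives) and varies only the assumed prevalence; the four prior values are arbitrary choices made here for illustration, and the names are hypothetical.

```python
def posterior_given_positive(prior, sensitivity=0.99, false_positive_rate=0.01):
    """P(disease | positive test) for a given prior prevalence."""
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Same evidence, different starting beliefs.
results = {prior: posterior_given_positive(prior)
           for prior in (0.0001, 0.001, 0.01, 0.1)}

for prior, post in results.items():
    print(f"prior {prior:<7} -> posterior {post:.3f}")
```

The same positive result leads to a posterior of under 1% or over 90% depending on where we start, so a carelessly chosen prior can dominate the conclusion.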
However, it is the ideological problems with Bayes’ theorem that pose a significant concern. This is because Bayesian thinking, if taken to its logical extreme, discourages innovative and novel interventions. Caution needs to be exercised when using Bayesian models to inform complex human decision-making. We need to be careful about which problems we throw into the Bayesian black box. Think about something as simple as repeated failures to solve a particular problem. If the question is mislabeled, Bayesian analysis may give misleading answers. For instance, in any case where the prior is zero, the posterior will also be zero no matter what evidence arrives, so Bayesian analysis will deem the pursuit impossible. And we all know that many things seem impossible until they are done. Priors are often zero until the thing actually gets done. Bayesian analysis shouldn’t discourage innovative experimentation. Thus it is pertinent that human intuition, will, and grit determine the quantification of priors in cases where the decisions in question demand a risk-taking and novel approach. Then there are priors that are hard to quantify accurately; employing Bayesian thinking in such cases is bound to fail. By its nature, Bayesian analysis tends to ignore the true range of uncertainty. And no matter what, in an open system, matching all causes with their effects and determining how they intersect is hard to achieve. We will only ever be ‘less wrong’.
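The point about zero priors can be demonstrated in a few lines. In this sketch the evidence is assumed to be 99 times more likely if the hypothesis is true than if it is false (0.99 versus 0.01, numbers borrowed from the disease test for convenience), and the function names are hypothetical.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """One application of Bayes' theorem for a true/false hypothesis."""
    evidence = (likelihood_if_true * prior
                + likelihood_if_false * (1.0 - prior))
    return likelihood_if_true * prior / evidence


def belief_after(prior, n_observations):
    """Apply the same strongly supportive evidence n times in a row."""
    belief = prior
    for _ in range(n_observations):
        belief = update(belief, likelihood_if_true=0.99,
                        likelihood_if_false=0.01)
    return belief


print(belief_after(0.0, 10))    # a zero prior never moves
print(belief_after(1e-6, 4))    # a tiny but non-zero prior can still recover
```

A belief that starts at exactly zero is multiplied by the evidence and stays at zero, whereas even a one-in-a-million prior can be overturned by a handful of strong observations. This is one reason to avoid assigning a prior of exactly zero to anything that is not logically impossible.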
Fig- A Bayesian Network Model for Diagnosis of Liver Disorders.
Today Bayes’ theorem sits at the heart of machine learning, both as a probabilistic framework for fitting a model to a training dataset and as the basis of classification models such as the Bayes Optimal Classifier and Naive Bayes. A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. In this way, a model can be thought of as a hypothesis about the relationships in the data, such as the relationship between input (X) and output (y). The practice of applied machine learning is the testing and analysis of different hypotheses (models) on a given dataset. Bayes’ Theorem provides a probabilistic model to describe the relationship between data and a hypothesis by calculating the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself. It serves as a useful framework for thinking about and modeling a machine learning problem. But the uses of Bayesian theory in machine learning extend beyond modeling hypotheses. Bayes’ Theorem is used for data classification, predictive labeling of data, global optimization, and belief networks. The applications keep growing as we delve deeper into the world of algorithms to solve the problems of causal inference.
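As a taste of how Bayes' theorem turns into a classifier, here is a toy Naive Bayes text classifier written from scratch. It is a sketch, not a production implementation: the four training messages and the "spam"/"ham" labels are invented for illustration, add-one smoothing is one common choice among several, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (words, label) pairs. Returns the counts we need."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, words):
    """Pick the label with the highest posterior (compared in log space)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)               # log prior P(label)
        n_words = sum(word_counts[label].values())
        for w in words:
            # "Naive" step: treat words as independent given the label.
            # Add-one smoothing keeps unseen words from zeroing the product.
            likelihood = (word_counts[label][w] + 1) / (n_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


training = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting schedule today".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "free money".split()))      # spam
print(predict(model, "meeting notes".split()))   # ham
```

The shared denominator P(data) is left out because it is the same for every label and does not change which label wins.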
Fig- A Bayesian network (BN) representing the “profile” of the AI-ALS.
The strength of Bayesian networks is that they can be built from human knowledge (i.e., from theory), or they can be machine-learned from data. Thus, the source of the model can be big data, human expertise, or both. Due to their graphical structure, machine-learned Bayesian networks are visually interpretable, allowing human learning and machine learning to work in synchrony. In short, Bayesian networks can be developed from a combination of human and artificial intelligence. In fact, when certain conditions are met, Bayesian network models facilitate causal inference by covering the entire range from correlation to causation. This has special significance for simulations of systems that are governed by a cause-effect relation. Again, the systems in question will have to be linear, operating under the base assumption that knowledge of a system’s initial conditions allows us to calculate the approximate behavior of the system. This is the assumption that a certain convergence exists in how systems work and that minor influences (which are almost impossible to observe or quantify) don’t create arbitrarily large effects. But therein lies the heart of the problem, for most systems are non-linear: their cause-effect relationships are changeable, making the nature of these relationships impossible to identify. Thus Bayesian analysis fails when confronted by chaos, a phenomenon that seems to be embedded in every inch of creation.
The point is that probability, as Thomas Bayes said, is an ‘orderly opinion’. Bayes’ theorem helps us formulate that opinion. We often forget that sometimes it is chaos we are trying to observe, quantify, model, label, and predict. Sometimes universal patterns are bound to emerge, while at other times we may end up feeding only our prior beliefs. Of course, we as humans have always done that, but never before with the help of quantum supremacy aiding our calculations. In choosing to be less wrong, there is a risk that we may become blind to the truth. Nonetheless, Bayes’ theorem offers a coherent representation of domain knowledge under uncertainty when combined with sound knowledge and judgment. How we see our priors is important, and Bayes’ theorem is a powerful tool that can help us see clearly, as long as we ourselves see our biases in quantifying our priors.