# Understanding Naive Bayes Algorithm

Naive Bayes is a machine learning model used for classification. The technique is based on probabilities: the probability of an event happening (or not happening) can be calculated from historical data.

# Let’s Dissect the Meaning of Naive Bayes

- It is called Naive because it is based on the naive assumption that each input variable is independent: simply put, the presence of a feature in a class is unrelated to the presence of any other feature in that same class.
- Bayes’ theorem is then used to build the Naive Bayes algorithm. Its premise is that the more information we gather about an **event**, the better we can estimate its probability.

# Explanation 1

If we want to predict next week’s average temperature using only the current average temperature, that is quite difficult. But if we use temperature data from the last 300 weeks, we can predict the average temperature much better.

# Explanation 2

If I tell you to guess a color that I am thinking of, the chances of you guessing that color correctly are very low. But if I tell you that the color is part of the colors of the rainbow, then your chances of guessing the correct color have increased.

So with Bayes’ theorem, the more information I give you, the closer to accuracy we get.

**NB:** It is called Bayes’ because it is named after **Thomas Bayes**, who came up with Bayes’ Theorem.

# Now to the Specifics:

# Terminologies and Concepts in Bayes’ Theorem

**Prior:** This is the probability of an event before new data is collected. When an event happens, we gain more information, and with more information we get closer to accurately predicting the probability.

Example: If I have three boxes, A, B, and C, and only one of them contains a gift, then the probability of the gift being in any particular box is 0.333. But if I open box A and it turns out not to have the gift, then the probability of the gift being in box B (or in box C) rises to 0.5.
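The box example above can be sketched in a few lines of Python; the `update_after_empty` helper is a name invented here for illustration:

```python
# Sketch: how opening an empty box updates the probabilities
# for the remaining boxes (helper name invented for this example).

def update_after_empty(priors, opened_box):
    """Set the opened (empty) box's probability to 0 and renormalize."""
    posteriors = dict(priors)
    posteriors[opened_box] = 0.0
    total = sum(posteriors.values())
    return {box: p / total for box, p in posteriors.items()}

priors = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}  # gift equally likely in each box
posteriors = update_after_empty(priors, "A")   # box A turned out to be empty

print(posteriors)  # {'A': 0.0, 'B': 0.5, 'C': 0.5}
```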

**Event:** Something that occurs, which gives us information. For example: we chose a box, and there wasn’t a gift inside.

**Posterior:** This is the final probability that we calculate using the prior probability and the event. For example: the probability that a gift is in a given box after opening box A.

The question being answered by Bayes’ Theorem:

“What is the probability of A, given that B occurred?” This is written **P(A|B)** and is known as conditional probability.

In the machine learning world, this would be, for instance: what is the probability of an email being spam, given some features? In machine learning, this is a classification problem.

# Example: Spam Classification Model

*Classification problem: We want to classify emails as either spam (junk email) or ham (not junk email) using the Naive Bayes model.*

The probability that an email is spam depends on the email content, with features such as the words used, the size, etc. Examples: **BUY!, EARN!, OFFER OFFER!, WINNER**, etc.

To start, we will:

- Find the **prior**: the probability that any email is spam. If you have 100 emails and find that 80 of them are ham and 20 are spam, then the probability that a new email is spam is 0.2. This is the prior probability. (It is the only information we have at the beginning.)
- Find the **posterior**: the probability that an email is spam, knowing that it contains a specific word (the **event**).

Let’s use an example from our spam words, say “**WINNER!**”. The truth is, “**WINNER!**” can appear in both spam and ham emails, depending on the context. From our prior, we have 20 spam emails. Out of these 20, 15 are found to contain the word winner and 5 do not. **Therefore our posterior becomes 15/20, or 0.75.**
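A quick sanity check of these counts in Python, using the numbers from the example above:

```python
# Out of 20 spam emails, 15 contain the word "winner"
# (counts taken from the running example).
spam_total = 20
spam_with_winner = 15

p_winner_given_spam = spam_with_winner / spam_total
print(p_winner_given_spam)  # 0.75
```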

Let’s now visualize all this using a tree.

- At the root we start with two branches: an email being spam or not. 20/100 (1/5) of emails are spam and 80/100 (4/5) are not spam.

- We then add more information to the two branches: the probability that spam and ham emails contain the spam word “winner”.

**Out of the 20 spam emails,** 15 are found to have the word winner and 5 do not.

**Out of the 80 ham emails,** 5 are found to have the word winner and 75 do not.

From the tree above, we can now calculate the probability that **an email is spam** *given that* it contains the word “**winner**”. This means we keep only the branches whose emails contain the word “winner”; any branch whose emails don’t contain it is removed.

As a result, the probability that **an email is spam** *given that* it contains the word “**winner**” is found from the remaining branches only, which include both spam and ham emails:

Spam and “winner”: 1/5 × 3/4 = **3/20**

Ham and “winner”: 4/5 × 1/16 = **1/20**
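The four leaves of the tree can be checked with exact fractions; the variable names below are invented for illustration:

```python
# Sketch of the four leaves of the tree: (spam/ham) x (winner/no winner),
# using the counts from the running example.
from fractions import Fraction as F

p_spam, p_ham = F(20, 100), F(80, 100)
p_winner_given_spam = F(15, 20)
p_winner_given_ham = F(5, 80)

spam_and_winner = p_spam * p_winner_given_spam       # 3/20
ham_and_winner = p_ham * p_winner_given_ham          # 1/20
spam_no_winner = p_spam * (1 - p_winner_given_spam)  # 1/20
ham_no_winner = p_ham * (1 - p_winner_given_ham)     # 15/20

print(spam_and_winner, ham_and_winner)  # 3/20 1/20
# the four leaves cover every email, so their probabilities sum to 1
assert spam_and_winner + ham_and_winner + spam_no_winner + ham_no_winner == 1
```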

# Bayes’ Theorem

The tree helped us visualize the probabilities, but there is a formula that does the same job directly: Bayes’ Theorem, which for our problem reads

P(A|B) = P(B|A) × P(A) / P(B)

Where:

**P(A) = P(spam)**: probability that an email is spam, 20/100 or **1/5**.

**P(B) = P(‘winner’|spam)·P(spam) + P(‘winner’|ham)·P(ham)**: probability that an email has the word winner, from both spam (1/5 × 3/4 = 3/20) and ham (4/5 × 1/16 = 1/20).

**P(B|A) = P(‘winner’|spam)**: probability that a spam email has the word winner. From the tree above, that is 15/20 or 3/4.

**P(A|B) = P(spam|‘winner’)**: what is the probability that an email that contains the word winner is spam? This is what we want to answer. If, after computation, the value of P(A|B) is high, then there is a high probability that our email is spam; if it is low, then there is a low probability that it is spam.

Substituting all this into our equation, we get:

P(spam|‘winner’) = (3/4 × 1/5) / (3/20 + 1/20) = (3/20) / (4/20) = 0.75

Therefore there is a **0.75** probability that an email that contains the word winner is spam, which answers our **P(A|B)** in the equation.
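The whole substitution can be verified with exact fractions in Python, a sketch using the numbers from the example:

```python
# Plugging the example's numbers into Bayes' theorem:
# P(spam | 'winner') = P('winner' | spam) * P(spam) / P('winner')
from fractions import Fraction as F

p_spam = F(1, 5)               # P(A): prior, 20 of 100 emails are spam
p_winner_given_spam = F(3, 4)  # P(B|A): 15 of the 20 spam emails contain "winner"
p_winner_given_ham = F(1, 16)  # 5 of the 80 ham emails contain "winner"
p_ham = 1 - p_spam

# P(B): total probability that an email contains "winner"
p_winner = p_winner_given_spam * p_spam + p_winner_given_ham * p_ham

p_spam_given_winner = p_winner_given_spam * p_spam / p_winner
print(p_spam_given_winner)  # 3/4, i.e. 0.75
```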

# Advantages of Naive Bayes Algorithm

- It needs relatively little training data to train the model.
- It is a simple model that works remarkably well.
- It is fast and produces results in a very short time.

# Disadvantages of Naive Bayes Algorithm

- A bigger data set is required to make more reliable predictions.
- With small data sets, precision is lower.
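To tie the ideas together, here is a minimal from-scratch sketch of a word-level Naive Bayes spam classifier on toy data invented for illustration. It adds Laplace (+1) smoothing, a standard trick not covered above, to avoid zero probabilities for unseen words:

```python
# From-scratch sketch of a word-level Naive Bayes spam classifier
# (toy data invented for illustration).
import math
from collections import Counter

def train(emails):
    """emails: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    vocab = set()
    for text, label in emails:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # naive assumption: each word contributes independently;
            # +1 Laplace smoothing keeps unseen words from zeroing the score
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

emails = [
    ("winner buy now", "spam"),
    ("offer winner earn", "spam"),
    ("meeting at noon", "ham"),
    ("lunch with the team", "ham"),
]
model = train(emails)
print(predict("winner offer", *model))  # spam
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer emails.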