Based on the following information collected from emails, what is the probability that a document is spam if it contains the word "llwws"?

Word "aabbdd": Spam = 0.61, Not spam = ---
Word "llwws": Spam = ---, Not spam = 0.6

1 answer

To find the probability that a document is spam given that it contains the word "llwws", we could in general use Bayes' Theorem, but that would require the overall prevalence of spam and not spam documents in your dataset, which is not given. Here the table already lists a conditional probability for each word, so the complement rule is enough.

From what you've provided:

The probability that a document containing the word "aabbdd" is spam is 0.61.
The probability that a document containing the word "llwws" is not spam is 0.6.

We can denote:

\( P(\text{spam} | \text{llwws}) \): the probability that the document is spam given the presence of the word "llwws".
\( P(\text{not spam} | \text{llwws}) = 0.6 \): the probability that the document is not spam given the presence of the word "llwws".

Let’s also introduce some notation that a full Bayes' Theorem calculation would use:

\( P(\text{llwws}) \): the total probability of the word "llwws" appearing in any document (spam + not spam).
\( P(\text{spam}) \): the prior probability of a document being spam.
\( P(\text{not spam}) = 1 - P(\text{spam}) \): the prior probability of being not spam.
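
For reference, Bayes' Theorem in this notation can be written as below. The likelihoods \( P(\text{llwws} | \text{spam}) \) and \( P(\text{llwws} | \text{not spam}) \) are not part of the question; they are introduced here only to state the formula.

\[ P(\text{spam} | \text{llwws}) = \frac{P(\text{llwws} | \text{spam}) \, P(\text{spam})}{P(\text{llwws})}, \qquad P(\text{llwws}) = P(\text{llwws} | \text{spam}) \, P(\text{spam}) + P(\text{llwws} | \text{not spam}) \, P(\text{not spam}). \]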

A document that contains "llwws" is either spam or not spam, so the two conditional probabilities must sum to 1:

\[ P(\text{spam} | \text{llwws}) + P(\text{not spam} | \text{llwws}) = 1. \]

Substituting \( P(\text{not spam} | \text{llwws}) = 0.6 \) gives

\[ P(\text{spam} | \text{llwws}) = 1 - 0.6 = 0.4. \]

Therefore, if the word "llwws" is in a document, the probability that the document is spam is 0.4, or 40%. The value given for "aabbdd" does not enter this calculation.
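
As a quick check, the calculation can be sketched in Python. This is only an illustration: the function names `complement` and `posterior_spam` are made up for this example, and the likelihoods and prior passed to `posterior_spam` are hypothetical numbers chosen to be consistent with the answer, not values from the question.

```python
def complement(p: float) -> float:
    """Return 1 - p after checking that p is a valid probability."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"not a probability: {p}")
    return 1.0 - p


def posterior_spam(p_word_given_spam: float,
                   p_word_given_not_spam: float,
                   p_spam: float) -> float:
    """Bayes' Theorem: P(spam | word) from two likelihoods and a prior."""
    # Total probability of the word appearing in any document.
    p_word = (p_word_given_spam * p_spam
              + p_word_given_not_spam * (1.0 - p_spam))
    if p_word == 0.0:
        raise ValueError("the word never appears, so the posterior is undefined")
    return p_word_given_spam * p_spam / p_word


# Value from the question: P(not spam | llwws) = 0.6.
p_not_spam_given_llwws = 0.6
p_spam_given_llwws = complement(p_not_spam_given_llwws)
print(round(p_spam_given_llwws, 2))  # 0.4

# Hypothetical inputs (assumptions, NOT from the question) showing that
# a Bayes' Theorem calculation can yield the same posterior:
# P(llwws | spam) = 0.2, P(llwws | not spam) = 0.3, P(spam) = 0.5.
check = posterior_spam(0.2, 0.3, 0.5)
print(round(check, 2))  # 0.4
```

The complement rule is all the question needs. The Bayes' Theorem function only becomes necessary when the data give word frequencies per class and a spam prior instead of the conditional probability itself.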