DeAL Tutorial

Naive Bayes Classifier

We use the following example to explain how to build a Naive Bayes classifier for categorical attributes in DeAL. The program uses the following schema:

database({patterns(Id:integer, Col:integer, Val:string),
	labels(Id:integer, Class:string),
	test_data(Id:integer, Col:integer, Val:string),
	smoothing_parameter(K:float)}).

We feed the training and test data into the Naive Bayesian classifier in a verticalized format, as explained below.

Consider the following training data set, taken from the textbook "Machine Learning" by Tom M. Mitchell (McGraw Hill, 1997):

% ID	| Outlook	| Temperature	| Humidity	| Wind	| Play Tennis
% =============================================================================
% 1	| Sunny		| Hot		| High		| Weak	| No	
% 2	| Sunny		| Hot		| High		| Strong| No	
% 3	| Overcast	| Hot		| High		| Weak	| Yes	
% 4	| Rain		| Mild		| High		| Weak	| Yes	
% 5	| Rain		| Cool		| Normal	| Weak	| Yes	
% 6	| Rain		| Cool		| Normal	| Strong| No	
% 7	| Overcast	| Cool		| Normal	| Strong| Yes	
% 8	| Sunny		| Mild		| High		| Weak	| No	
% 9	| Sunny		| Cool		| Normal	| Weak	| Yes	
% 10	| Rain		| Mild		| Normal	| Weak	| Yes	
% 11	| Sunny		| Mild		| Normal	| Strong| Yes	
% 12	| Overcast	| Mild		| High		| Strong| Yes	
% 13	| Overcast	| Hot		| Normal	| Weak	| Yes	
% 14	| Rain		| Mild		| High		| Strong| No	

The values in the Play Tennis column (Yes or No) are the class labels.

Instead of representing each tuple horizontally, we store each attribute of a tuple as a separate patterns fact (tuple id, column number, value), as shown below:

patterns(1,1,'Sunny').
patterns(1,2,'Hot').
patterns(1,3,'High').
patterns(1,4,'Weak').
labels(1,'No').

patterns(2,1,'Sunny').
patterns(2,2,'Hot').
patterns(2,3,'High').
patterns(2,4,'Strong').
labels(2,'No').

patterns(3,1,'Overcast').
patterns(3,2,'Hot').
patterns(3,3,'High').
patterns(3,4,'Weak').
labels(3,'Yes').

patterns(4,1,'Rain').
patterns(4,2,'Mild').
patterns(4,3,'High').
patterns(4,4,'Weak').
labels(4,'Yes').

patterns(5,1,'Rain').
patterns(5,2,'Cool').
patterns(5,3,'Normal').
patterns(5,4,'Weak').
labels(5,'Yes').

patterns(6,1,'Rain').
patterns(6,2,'Cool').
patterns(6,3,'Normal').
patterns(6,4,'Strong').
labels(6,'No').

patterns(7,1,'Overcast').
patterns(7,2,'Cool').
patterns(7,3,'Normal').
patterns(7,4,'Strong').
labels(7,'Yes').

patterns(8,1,'Sunny').
patterns(8,2,'Mild').
patterns(8,3,'High').
patterns(8,4,'Weak').
labels(8,'No').

patterns(9,1,'Sunny').
patterns(9,2,'Cool').
patterns(9,3,'Normal').
patterns(9,4,'Weak').
labels(9,'Yes').

patterns(10,1,'Rain').
patterns(10,2,'Mild').
patterns(10,3,'Normal').
patterns(10,4,'Weak').
labels(10,'Yes').

patterns(11,1,'Sunny').
patterns(11,2,'Mild').
patterns(11,3,'Normal').
patterns(11,4,'Strong').
labels(11,'Yes').

patterns(12,1,'Overcast').
patterns(12,2,'Mild').
patterns(12,3,'High').
patterns(12,4,'Strong').
labels(12,'Yes').

patterns(13,1,'Overcast').
patterns(13,2,'Hot').
patterns(13,3,'Normal').
patterns(13,4,'Weak').
labels(13,'Yes').

patterns(14,1,'Rain').
patterns(14,2,'Mild').
patterns(14,3,'High').
patterns(14,4,'Strong').
labels(14,'No').

Now suppose we want to test our classifier on the following case: {Sunny, Hot, Normal, Strong}. We add this case as test data as shown below:

test_data(1,1,'Sunny').
test_data(1,2,'Hot').
test_data(1,3,'Normal').
test_data(1,4,'Strong').

Here, we also use Laplace smoothing (additive smoothing) to handle attribute values in the test data that never occur together with a given class in the training data.

% add-one smoothing; any non-negative value can be used. smoothing_parameter(0) means no smoothing.
smoothing_parameter(1).
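
For instance, with K = 1 an attribute value that never co-occurs with some class in the training data gets a smoothed count of 0 + 1 = 1 instead of 0, so a single unseen value does not force that class's probability to zero.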

Build the Model

% prior(C, X) denotes X = P(Class = C).
% We do not compute the actual probability, only the numerator (the class count),
% since the denominator is the same for every class.
prior(C, count<Id>) <- labels(Id, C).
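
On the training data above there are 9 examples labeled Yes and 5 labeled No, so this rule derives one fact per class:

% prior('Yes', 9).
% prior('No', 5).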

% We also compute an alternate prior predicate that adds the smoothing parameter;
% it will be needed later as the denominator when deriving the likelihoods.
smooth_prior(Class,Cnt+K) <- prior(Class,Cnt), smoothing_parameter(K).
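
With K = 1 this simply adds one to each class count:

% smooth_prior('Yes', 10).
% smooth_prior('No', 6).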

% likelihood(I,Val,C,X) denotes X = log P(Column I = Val | Class = C),
% the logarithm of the smoothed conditional probability.
% Unlike the prior, this time we do compute an actual probability.
count_matches(Col,Val,Class,count<Id>) <- patterns(Id,Col,Val), labels(Id,Class).
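
For example, Outlook = Sunny (column 1) occurs with class No in training examples 1, 2 and 8, and with class Yes in examples 9 and 11, so among the derived facts we get:

% count_matches(1,'Sunny','No',3).
% count_matches(1,'Sunny','Yes',2).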

% Apply smoothing: if a (Col,Val,Class) combination from the test data never occurs in the
% training data, count_matches derives no fact for it, so we fall back to a raw count of 0 before adding K.
count_matches_smooth(Col,Val,Class,Cnt+K) <- test_data(_,Col,Val), labels(_,Class), smoothing_parameter(K),
					     if(count_matches(Col,Val,Class,_) then 
					     count_matches(Col,Val,Class,Cnt) else Cnt = 0).
likelihood(Col,Val,Class,P) <- count_matches_smooth(Col,Val,Class,M), smooth_prior(Class,N), P = log M - log N.
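
As a quick check, for Outlook = Sunny and class No we have count_matches(1,'Sunny','No',3), smooth_prior('No',6) and K = 1, so the rules above derive (assuming natural logarithms):

% count_matches_smooth(1,'Sunny','No',4).
% likelihood(1,'Sunny','No',-0.41).	% log(4/6)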

Decision Phase

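% total_likelihood(Id,C,X) denotes X = log P(test case Id | Class = C),
% i.e. the sum of the per-attribute log likelihoods under the naive independence assumption.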
total_likelihood(Id,C,sum<Prob>) <- test_data(Id,Col,Val), likelihood(Col,Val,C,Prob).
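
For our test case {Sunny, Hot, Normal, Strong} the smoothed per-attribute probabilities are 4/6, 3/6, 2/6, 4/6 for class No and 3/10, 3/10, 7/10, 4/10 for class Yes, so with natural logarithms the rule derives approximately:

% total_likelihood(1,'No',-2.60).	% log(96/1296)
% total_likelihood(1,'Yes',-3.68).	% log(252/10000)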

% posterior is proportional to prior x likelihood; using the raw class count instead of the prior
% probability shifts every posterior by the same constant, so the ranking of the classes is unaffected.
% Since we work with logarithms, the product becomes a sum.
posterior(Id,C,Prob) <- total_likelihood(Id,C,L), prior(C,Cnt), Prob = L + log Cnt.
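
Continuing the worked example and adding the log class counts (log 5 for No, log 9 for Yes) gives approximately:

% posterior(1,'No',-0.99).
% posterior(1,'Yes',-1.48).

Class No therefore has the higher posterior for this test case (the conclusion does not depend on the base of the logarithm).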

% A user-defined aggregate nbmax that selects the (Class, Posterior) pair with the maximum posterior.
% single initializes the aggregate from the first tuple; multi combines each further tuple with the
% current maximum, keeping whichever pair has the larger posterior.
single(nbmax, (C, P), (C, P)).
multi(nbmax, (C, P), (OC, OP), (C, P)) <- P > OP.
multi(nbmax, (C, P), (OC, OP), (OC, OP)) <- P <= OP.
max_posterior(I, nbmax<(C, P)>) <- posterior(I, C, P).

export max_posterior(Id,_).
query max_posterior(Id,_).
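
For the data above, the query returns the most probable class for each test case; with the worked numbers the answer for our single test case would be something like:

% max_posterior(1, ('No', -0.99)).

i.e. the classifier predicts Play Tennis = No for {Sunny, Hot, Normal, Strong}.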

Note: For continuous attributes, we can either (i) discretize them into categorical values or (ii) use a Gaussian distribution to model the likelihoods.


Last Updated 2014