(a) What are the parameters of the model?
We have 3 independent binary features, so we need one number per
bit/feature per class, for a total of 6 numbers. Let's write p(i,j) for
the probability that bit i is on under class j.
This assumes that the two classes are equally likely. It is also possible
to have a separate parameter giving the probability of selecting class I
(class II then gets one minus that).
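For concreteness, here's a minimal sketch of the parameters in Python (numpy, the 3-bit/2-class array shapes, and the random initialization are my own choices, not part of the problem):

    import numpy as np

    rng = np.random.default_rng(0)

    # p[i, j] = probability that bit i is on under class j:
    # 3 bits x 2 classes = 6 numbers
    p = rng.uniform(0.25, 0.75, size=(3, 2))

    # Optional extra parameter: the prior probability of each class
    # (if we don't assume the classes are equally likely).
    prior = np.array([0.5, 0.5])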
(b) Imagine we are given data consisting of the three feature values (the original statement said "two", a typo) for each sample from the model. We are not given the class labels. Describe an "expectation" procedure to compute class labels for the data given a model.
Ok, we have probabilities p(i,j) and a triple of bits. We want to
know the probability that this triple came from class I vs. class II.
Pr(triple t | class j) = product_{i} [ p(i,j) if bit i is on in triple t, else 1-p(i,j) ]
Pr(class j) = 1/2 (or the separate class-prior parameter, if we have one)
Pr(triple t) = sum_{j} Pr(triple t | class j) Pr(class j)
Pr(class j | triple t) = Pr(triple t | class j) Pr(class j) / Pr(triple t)
Pr(class j | triple t) is the (fractional) class label j for the triple t.
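Here's a sketch of this expectation step in Python, continuing the numpy setup above (the function name expectation and the vectorized layout are my own choices):

    def expectation(triples, p, prior):
        # triples: (n, 3) array of 0/1 bits; p: (3, 2); prior: (2,)
        # Pr(triple t | class j) = product_{i} p[i,j] if bit i is on, else 1-p[i,j]
        lik = np.prod(np.where(triples[:, :, None] == 1,
                               p[None, :, :], 1 - p[None, :, :]), axis=1)  # (n, 2)
        joint = lik * prior[None, :]                 # Pr(triple t | class j) Pr(class j)
        evidence = joint.sum(axis=1, keepdims=True)  # Pr(triple t)
        return joint / evidence                      # Pr(class j | triple t)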
(c) How do you use this procedure to learn a maximum likelihood model for the data?
Use EM: Start by initializing the parameters (randomly, say), then run the expectation step to get the fractional labels. Use fractional counting (the maximization step) to get new parameter values. Repeat until the parameters settle down (converge).
The new parameters are computed by the following formula:
p(i,j) = (sum_{triples t in which bit i is on} Pr(class j | triple t)) / (sum_{all triples t} Pr(class j | triple t))
If the class prior is a parameter too, its new value is the average of Pr(class j | triple t) over all the triples.
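Putting it all together, a sketch of the EM loop in Python (reusing expectation() from above; the fixed iteration count stands in for a proper convergence test):

    def em(triples, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        p = rng.uniform(0.25, 0.75, size=(3, 2))  # p[i, j], initialized randomly
        prior = np.array([0.5, 0.5])
        for _ in range(n_iters):
            post = expectation(triples, p, prior)    # (n, 2): Pr(class j | triple t)
            # Maximization by fractional counting:
            #   numerator   = sum over triples with bit i on of Pr(class j | triple t)
            #   denominator = sum over all triples of Pr(class j | triple t)
            p = triples.T @ post / post.sum(axis=0)  # (3, 2)
            prior = post.mean(axis=0)                # new class prior, if it's a parameter
        return p, prior

Note that EM only finds a local maximum of the likelihood, so the recovered p(i,j) values can depend on the initialization (and the two classes may come out swapped).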