The problem is to find the matrix W.
A frequently proposed "solution" to this problem is to use principal
components analysis, or PCA. Here one forms the
covariance matrix C for the example feature vectors, and finds the eigenvalues
and eigenvectors of C. The m eigenvectors having the largest eigenvalues
are then used as the columns of W. **
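For concreteness, a minimal sketch of this construction in Python with NumPy might look like the following (the names X, m, and pca_projection are illustrative and not part of these notes):

    import numpy as np

    def pca_projection(X, m):
        """Return a d-by-m matrix W whose columns are the m eigenvectors of the
        covariance matrix of X having the largest eigenvalues.
        X is an n-by-d array of example feature vectors (one example per row)."""
        C = np.cov(X, rowvar=False)            # d-by-d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
        order = np.argsort(eigvals)[::-1]      # sort eigenvalues, largest first
        W = eigvecs[:, order[:m]]              # keep the top m eigenvectors as columns
        return W

    # Usage: y = x @ W maps a d-dimensional feature vector x to m dimensions.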
PCA can be shown to be optimal in a least-squares sense for representing
the example feature vectors. Unfortunately, it usually does not provide
the best linear combination of features for discriminating between
the different classes. A more appropriate alternative is to use multiple
discriminant analysis or MDA (see Duda
and Hart). A full exposition of MDA is beyond the scope of these notes.
However, the two-class case is simple. In this case, it turns out that W
is the d-by-1 vector w given by

    w = C^-1 (m1 - m2),

where m1 is the mean vector for Class 1 and
m2 is the mean vector for Class 2. The resulting
single feature y = w' x is often called
Fisher's linear discriminant. Unfortunately, in the two-class case MDA provides
only this one feature, and it offers nothing more if this feature turns out to be inadequate.
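A minimal sketch of the two-class computation, again in Python with NumPy (X1, X2, and fisher_direction are illustrative names; C is taken to be the covariance matrix of all the example feature vectors, as in the formula above, though the pooled within-class covariance is sometimes used instead and yields the same direction up to scale):

    import numpy as np

    def fisher_direction(X1, X2):
        """Return the d-by-1 vector w = C^-1 (m1 - m2) for two classes.
        X1 and X2 are n1-by-d and n2-by-d arrays of example feature vectors."""
        m1 = X1.mean(axis=0)                           # mean vector for Class 1
        m2 = X2.mean(axis=0)                           # mean vector for Class 2
        C = np.cov(np.vstack([X1, X2]), rowvar=False)  # covariance of all examples
        w = np.linalg.solve(C, m1 - m2)                # solve C w = m1 - m2
        return w

    # The single feature for a feature vector x is y = w @ x (i.e., y = w' x).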
__________
* In essence, this is what the use of a "hidden layer" accomplishes
in feedforward neural networks. Feature combination is also sometimes called
feature extraction, the idea being that the combination
draws together or "extracts" the meaningful features from the
primitive features.
** A more modern approach is to perform a singular-value decomposition
of the matrix of example feature vectors. However, this alternative suffers
from the same shortcomings as classical PCA.
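A minimal sketch of this SVD alternative, assuming the example feature vectors are the rows of an n-by-d array X (centering X makes the leading right singular vectors coincide with the PCA eigenvectors):

    import numpy as np

    def svd_projection(X, m):
        """Return a d-by-m projection matrix from the SVD of the centered data."""
        Xc = X - X.mean(axis=0)                  # center the example feature vectors
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:m].T                             # top m right singular vectors as columns
        return W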