I HAVE NO QUESTION FOR THIS WEEK'S LECTURE.
Notes Week 12
IIR: Text classification and Naive Bayes
A standing query is like any
other query except that it is periodically executed on a collection to which
new documents are incrementally added over time.
To capture the generality and scope of the problem space to which standing
queries belong, we now introduce the general notion of a classification
problem.
The example classification scheme contains two instances each of region categories, industry categories, and subject-area categories. A hierarchy can be an important aid in solving a classification problem.
The first supervised learning method we introduce is the multinomial Naive Bayes (multinomial NB) model, a probabilistic learning method. In text classification, our goal is to find the best class for a document: the maximum a posteriori (MAP) class, i.e., the class c that maximizes P(c | d).
The multinomial NB model is formally identical to the multinomial unigram language model. In the language-modeling setting we also used MLE estimates and ran into the problem of zero estimates owing to sparse data; but instead of add-one smoothing, we addressed the problem there with a mixture of two distributions.
There are two different
ways we can set up an NB classifier. An alternative to the multinomial model is
the multivariate Bernoulli model or Bernoulli model. It is equivalent to the
binary independence model.
Vector space classification
Decisions of many vector
space classifiers are based on a notion of distance, e.g., when computing the
nearest neighbors in kNN classification. However, in addition to documents,
centroids or averages of vectors also play an important role in vector space
classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity, and Euclidean distance all behave differently in general.
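The point that the three measures diverge on unnormalized vectors is easy to demonstrate numerically; a small sketch (helper names are my own):

```python
import math

def centroid(vectors):
    """Component-wise average of document vectors (not length-normalized)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# u = [1, 0] compared against a = [10, 0] (same direction, long)
# and b = [1, 1] (different direction, close by):
# dot product and cosine rank a as more similar to u,
# while Euclidean distance ranks b as closer to u.
```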
Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors, where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
1NN is not very robust.
The classification decision of each test document relies on the class of a
single training document, which may be incorrectly labeled or atypical. kNN for
k > 1 is more robust. It assigns documents to the majority class of their k closest
neighbors, with ties broken randomly.
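The kNN decision rule described above (majority vote among the k nearest neighbors, random tie-breaking) fits in a few lines. A minimal sketch assuming Euclidean distance on dense vectors; the function name and input format are my own.

```python
from collections import Counter
import math
import random

def knn_classify(train, x, k):
    """train: list of (label, vector) pairs; x: query vector.
    Assign the majority label among the k nearest training vectors
    by Euclidean distance; ties are broken randomly."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    neighbors = sorted(train, key=lambda lv: dist(lv[1], x))[:k]
    votes = Counter(label for label, _ in neighbors)
    top = max(votes.values())
    return random.choice([c for c, n in votes.items() if n == top])
```

With k = 1 this reduces to 1NN and inherits its sensitivity to a single mislabeled or atypical training document; larger k averages that noise out.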