Thursday, April 10, 2014

Notes Week 12 and NO MUDDIEST POINT THIS WEEK

I HAVE NO QUESTION FOR THIS WEEK'S LECTURE.

Notes Week 12
IIR
Text classification and Naive Bayes
A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time. To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem.

In the example classification scheme, there are two instances each of region categories (e.g., UK, China), industry categories (e.g., poultry, coffee), and subject area categories (e.g., elections, sports). A hierarchy can be an important aid in solving a classification problem.

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. In text classification, our goal is to find the best class for the document.

The multinomial NB model is formally identical to the multinomial unigram language model. We also used MLE estimates and encountered the problem of zero estimates owing to sparse data; but instead of add-one smoothing, we used a mixture of two distributions to address the problem there.
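As a rough sketch of the multinomial model with add-one smoothing (function and variable names are my own, not from the text):

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial NB model with add-one (Laplace) smoothing.
    docs: list of (class_label, token_list) pairs."""
    vocab = set()
    class_count = Counter()              # documents per class
    term_count = defaultdict(Counter)    # term frequencies per class
    for label, tokens in docs:
        class_count[label] += 1
        term_count[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_count.values())
    prior = {c: class_count[c] / n_docs for c in class_count}
    cond = {}
    for c in class_count:
        total = sum(term_count[c].values())
        # add-one smoothing avoids zero estimates for unseen terms
        cond[c] = {t: (term_count[c][t] + 1) / (total + len(vocab))
                   for t in vocab}
    return prior, cond, vocab

def classify_multinomial(prior, cond, vocab, tokens):
    """Pick the class maximizing log P(c) + sum over tokens of log P(t|c)."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in vocab)
    return max(scores, key=scores.get)
```

On the book's small China/Japan worked example, this classifier labels the test document "Chinese Chinese Chinese Tokyo Japan" as China, because the three occurrences of "Chinese" outweigh the two Japan-related terms.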
There are two different ways we can set up an NB classifier. An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model.
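To make the contrast concrete, here is a sketch of the Bernoulli variant (names are mine): it estimates, per class, the fraction of documents containing each term, and at classification time every vocabulary term contributes a factor, including absent terms via 1 - P(t|c).

```python
import math
from collections import Counter, defaultdict

def train_bernoulli_nb(docs):
    """Bernoulli NB: per-class document frequencies, add-one smoothed.
    docs: list of (class_label, token_list) pairs."""
    vocab = set()
    class_count = Counter()
    doc_freq = defaultdict(Counter)   # per-class document frequency
    for label, tokens in docs:
        class_count[label] += 1
        vocab.update(tokens)
        doc_freq[label].update(set(tokens))   # presence, not counts
    n_docs = sum(class_count.values())
    prior = {c: class_count[c] / n_docs for c in class_count}
    cond = {c: {t: (doc_freq[c][t] + 1) / (class_count[c] + 2)
                for t in vocab}
            for c in class_count}
    return prior, cond, vocab

def classify_bernoulli(prior, cond, vocab, tokens):
    """Unlike the multinomial model, iterate over the whole vocabulary:
    absent terms contribute log(1 - P(t|c))."""
    present = set(tokens)
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            s += math.log(p) if t in present else math.log(1 - p)
        scores[c] = s
    return max(scores, key=scores.get)
```

On the same China/Japan worked example, the Bernoulli model reaches the opposite verdict from the multinomial model: it ignores the repeated occurrences of "Chinese" (only presence matters), so the absent and Japan-related terms dominate.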

Vector space classification
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general.
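A tiny sketch (vectors chosen by me for illustration) shows how the three measures can disagree on unnormalized vectors: dot product and cosine favor a vector pointing in the query's direction, while Euclidean distance favors a vector that is spatially closer.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

q = (1.0, 0.0)   # query vector
a = (2.0, 0.0)   # same direction as q, but twice as long
b = (0.5, 0.5)   # different direction, but nearer to q in space

# dot and cosine rank a above b, but Euclidean distance ranks b above a
```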

Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.


1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for k > 1 is more robust. It assigns documents to the majority class of their k closest neighbors, with ties broken randomly.
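The majority-vote rule can be sketched in a few lines (a minimal illustration; names are mine, and for simplicity ties fall to whichever class `Counter` returns first rather than being broken randomly as in the text):

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (label, vector) pairs; x: vector to classify.
    Assign x the majority label among its k nearest neighbors
    under Euclidean distance."""
    nearest = sorted(train, key=lambda lv: math.dist(lv[1], x))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 a single mislabeled training point can flip the decision; with k > 1 an outlier is outvoted by its neighbors, which is the robustness argument above.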
