Thursday, April 17, 2014

Last Week's Notes and No Muddiest Point

NO MUDDIEST POINT FOR LAST WEEK, thank you!


Generalizing from relevance feedback using named entity wildcards


This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.

Learning to rank for information retrieval

The task of "learning to rank" has emerged as an active and growing area of research in both information retrieval and machine learning. The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application. The relevance of this task for IR is without question, because many IR problems are by nature ranking problems. Improved algorithms for learning ranking functions promise improved retrieval quality and less of a need for manual parameter adaptation. In this way, many IR technologies can potentially be enhanced by using learning-to-rank techniques.
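One common family of methods is the pairwise approach: treat every (more relevant, less relevant) document pair as a training constraint, and adjust a linear scoring function whenever it misranks a pair. The sketch below is a minimal, hypothetical perceptron-style illustration on made-up two-dimensional feature vectors, not any specific published algorithm.

```python
# Minimal pairwise learning-to-rank sketch (perceptron-style updates).
# The feature vectors and relevance grades are toy, made-up data.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, dim, epochs=20, lr=0.1):
    """Learn weights w so that dot(w, better) > dot(w, worse) for each pair."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            if dot(w, better) <= dot(w, worse):   # misranked pair: nudge w
                for i in range(dim):
                    w[i] += lr * (better[i] - worse[i])
    return w

# Each document is (feature vector, relevance grade); higher grade = more relevant.
docs = [([3.0, 0.2], 2), ([1.0, 0.9], 1), ([0.2, 0.1], 0)]
pairs = [(a, b) for a, ga in docs for b, gb in docs if ga > gb]
w = train_pairwise(pairs, dim=2)
ranked = sorted(docs, key=lambda d: dot(w, d[0]), reverse=True)
print([g for _, g in ranked])  # → [2, 1, 0]: sorted by learned score
```

The learned function is then applied at query time to sort candidate documents by score, which is exactly the "sort objects according to their degrees of relevance" goal described above.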

Thursday, April 10, 2014

Notes Week 12 and NO MUDDIEST POINT THIS WEEK

I HAVE NO QUESTIONS FOR THIS WEEK'S LECTURE.

Notes Week 12
IIR
Text classification and Naive Bayes
A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time. To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem.

In IIR's Reuters example, there are two instances each of region categories, industry categories, and subject area categories. A hierarchy can be an important aid in solving a classification problem.

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. In text classification, our goal is to find the best class for the document.

The multinomial NB model is formally identical to the multinomial unigram language model. We also used MLE estimates and encountered the problem of zero estimates owing to sparse data; but instead of add-one smoothing, we used a mixture of two distributions to address the problem there.
There are two different ways we can set up an NB classifier. An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model.
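The multinomial model with add-one (Laplace) smoothing can be sketched as follows, using the worked example from IIR Chapter 13 (Table 13.1); the function names are my own.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (token list, class). Returns log priors and smoothed log likelihoods."""
    vocab = {t for toks, _ in docs for t in toks}
    doc_count, token_counts = defaultdict(int), defaultdict(Counter)
    for toks, c in docs:
        doc_count[c] += 1
        token_counts[c].update(toks)
    prior, condprob = {}, {}
    for c in doc_count:
        prior[c] = math.log(doc_count[c] / len(docs))
        total = sum(token_counts[c].values())
        # add-one (Laplace) smoothing avoids zero estimates for unseen terms
        condprob[c] = {t: math.log((token_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return prior, condprob

def classify(prior, condprob, tokens):
    # choose the class with the highest posterior log-probability
    return max(prior, key=lambda c: prior[c] +
               sum(condprob[c][t] for t in tokens if t in condprob[c]))

# Training set from IIR Table 13.1: is a document about China ("yes") or not ("no")?
train = [("Chinese Beijing Chinese".split(), "yes"),
         ("Chinese Chinese Shanghai".split(), "yes"),
         ("Chinese Macao".split(), "yes"),
         ("Tokyo Japan Chinese".split(), "no")]
prior, condprob = train_multinomial_nb(train)
print(classify(prior, condprob, "Chinese Chinese Chinese Tokyo Japan".split()))  # → yes
```

Working in log space avoids floating-point underflow when many small conditional probabilities are multiplied, which is why the priors and likelihoods are stored as logs.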

Vector space classification
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general.
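Centroid-based (Rocchio) classification can be sketched as: compute one centroid per class from its training documents, then assign a test document to the class of the nearest centroid. The class names and vectors below are made-up toy data.

```python
# Rocchio (nearest-centroid) classification sketch on toy 2-D vectors.

def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def rocchio_classify(centroids, x):
    # assign x to the class whose centroid is closest
    return min(centroids, key=lambda c: euclidean(centroids[c], x))

train = {"sports":   [[5.0, 1.0], [4.0, 0.0]],
         "politics": [[0.0, 4.0], [1.0, 5.0]]}
centroids = {c: centroid(vs) for c, vs in train.items()}
print(rocchio_classify(centroids, [4.0, 1.0]))  # → sports
```

Note that the centroids here are plain averages and are not length-normalized, matching the point above that centroids generally are not unit vectors.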

Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.


1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for k > 1 is more robust. It assigns documents to the majority class of their k closest neighbors, with ties broken randomly.

Friday, April 4, 2014

Notes Week 11 and NO MUDDIEST POINT FOR THIS WEEK



NO MUDDIEST POINT FOR the class on 03/31

As the amount of information on the internet continuously increases, a personalized approach is needed to give each user information access tailored to their interests, browsing history, and so on.

The first step is collecting information about users. A basic requirement of such a system is that it must be able to uniquely identify users. Although accurate user identification is not a critical issue for systems that construct profiles representing groups of users, it is a crucial ability for any system that constructs profiles representing individual users. There are five basic approaches to user identification: software agents, logins, enhanced proxy servers, cookies, and session ids.

The second step is user profile representation. The most common representation for user profiles is sets of keywords. These can be automatically extracted from Web documents or directly provided by the user. Weights, which are usually associated with the keywords, are numerical representations of the user's interests. To address the polysemy problem inherent in keyword-based profiles, a profile may instead be represented as a weighted semantic network in which each node represents a concept. Concept-based profiles are similar to semantic network-based profiles in the sense that both are represented by conceptual nodes and relationships between those nodes.
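A weighted keyword-based profile can be sketched as follows. The class name, the plain term-frequency weighting, and the toy page text are all illustrative assumptions; real systems typically use tf-idf weights and apply decay over time.

```python
from collections import Counter

class KeywordProfile:
    """Minimal keyword-based user profile: term weights accumulated from viewed pages."""

    def __init__(self):
        self.weights = Counter()

    def update(self, page_tokens):
        # simple term-frequency accumulation (a stand-in for tf-idf weighting)
        self.weights.update(page_tokens)

    def score(self, doc_tokens):
        # relevance of a new document to the profile: sum of matched term weights
        return sum(self.weights[t] for t in doc_tokens)

profile = KeywordProfile()
profile.update("python tutorial python code".split())
profile.update("machine learning python".split())
print(profile.weights.most_common(1))                 # → [('python', 3)]
print(profile.score("python learning news".split()))  # → 4 (python: 3, learning: 1)
```

The `score` method shows how such a profile is used: new documents can be ranked by how strongly they match the user's accumulated keyword weights.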

The third step is user profile construction. Keyword-based profiles are initially created by extracting keywords from Web pages collected from some information source, e.g., the user’s browsing history or bookmarks. Semantic network-based profiles are typically built by collecting explicit positive and/or negative feedback from users. Similar to keyword vector profile construction techniques, keywords are extracted from the user-rated pages. This section describes three representative systems that build user profiles represented as weighted concept hierarchies. Although each uses a different construction methodology, they each use a reference taxonomy as the basis of the profile.