Thursday, April 17, 2014

Notes for Last Week and No Muddiest Point

NO MUDDIEST POINT FOR LAST WEEK. Thank you!


Generalizing from relevance feedback using named entity wildcards


This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.

Learning to rank for information retrieval

The task of "learning to rank" has emerged as an active and growing area of research both in information retrieval and machine learning. The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application. The relevance of this task for IR is without question, because many IR problems are by nature ranking problems. Improved algorithms for learning ranking functions promise improved retrieval quality and less need for manual parameter adaptation. In this way, many IR technologies can potentially be enhanced by using learning to rank techniques.
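The idea above can be sketched with a minimal pairwise learning-to-rank example: a linear scoring function updated perceptron-style whenever a preference pair is misordered. The feature values and document IDs below are made up purely for illustration.

```python
# Minimal pairwise learning-to-rank sketch: a linear scoring function
# trained with perceptron-style updates on preference pairs.

def score(w, x):
    # Linear score: dot product of weights and feature vector.
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, n_features, epochs=20):
    """pairs: list of (x_pos, x_neg) where x_pos should rank above x_neg."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            if score(w, x_pos) <= score(w, x_neg):  # misordered pair
                w = [wi + (p - n) for wi, p, n in zip(w, x_pos, x_neg)]
    return w

# Toy features per document, e.g. (term-match score, link-based score).
pairs = [((2.0, 1.5), (0.5, 0.2)), ((1.0, 2.0), (0.8, 0.1))]
w = train(pairs, n_features=2)
docs = {"d1": (0.5, 0.2), "d2": (2.0, 1.5)}
ranked = sorted(docs, key=lambda d: score(w, docs[d]), reverse=True)
```

Real learning-to-rank methods use richer losses (pairwise hinge, listwise objectives), but the shape is the same: features in, a learned scoring function out, documents sorted by score.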

Thursday, April 10, 2014

Notes Week 12 and NO MUDDIEST POINT THIS WEEK

I HAVE NO QUESTION FOR THIS WEEK'S LECTURE.

Notes Week 12
IIR
Text classification and Naive Bayes
A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time. To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem.

There are two instances each of region categories, industry categories, and subject area categories. A hierarchy can be an important aid in solving a classification problem.

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. In text classification, our goal is to find the best class for the document.

The multinomial NB model is formally identical to the multinomial unigram language model. We also used MLE estimates and encountered the problem of zero estimates owing to sparse data; but instead of add-one smoothing, we used a mixture of two distributions to address the problem there.
There are two different ways we can set up an NB classifier. An alternative to the multinomial model is the multivariate Bernoulli model or Bernoulli model. It is equivalent to the binary independence model.
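A minimal multinomial NB classifier with add-one smoothing can be sketched as follows. The toy china/japan training set mirrors the worked example in IIR Chapter 13; the function names are my own.

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes with add-one (Laplace) smoothing.

def train_nb(docs):
    """docs: list of (class_label, token_list)."""
    class_counts = Counter(c for c, _ in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for c, toks in docs:
        term_counts[c].update(toks)
        vocab.update(toks)
    # Log priors P(c) and smoothed log conditionals P(t|c).
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        cond[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                   for t in vocab}
    return priors, cond, vocab

def classify(priors, cond, vocab, tokens):
    # Best class = argmax of log prior plus sum of log conditionals.
    scores = {c: priors[c] + sum(cond[c][t] for t in tokens if t in vocab)
              for c in priors}
    return max(scores, key=scores.get)

docs = [("china", "chinese beijing chinese".split()),
        ("china", "chinese chinese shanghai".split()),
        ("china", "chinese macao".split()),
        ("japan", "tokyo japan chinese".split())]
priors, cond, vocab = train_nb(docs)
label = classify(priors, cond, vocab, "chinese chinese chinese tokyo japan".split())
# → "china": the three occurrences of "chinese" outweigh "tokyo" and "japan".
```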

Vector space classification
Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general.
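A minimal sketch of centroid-based (Rocchio) classification under these definitions, using Euclidean distance and toy two-dimensional vectors of my own invention:

```python
import math

# Rocchio classification: represent each class by the centroid of its
# training vectors, assign a test vector to the nearest centroid.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio_classify(training, x):
    """training: dict mapping class label -> list of vectors."""
    centroids = {c: centroid(vs) for c, vs in training.items()}
    return min(centroids, key=lambda c: math.dist(centroids[c], x))

training = {"politics": [[1.0, 0.1], [0.9, 0.0]],
            "sports":   [[0.1, 1.0], [0.0, 0.8]]}
label = rocchio_classify(training, [0.2, 0.9])  # → "sports"
```

Because the decision depends only on distance to one centroid per class, Rocchio draws linear boundaries between classes, which is exactly why the locally-decided kNN below can behave differently.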

Unlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.


1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled or atypical. kNN for k > 1 is more robust. It assigns documents to the majority class of their k closest neighbors, with ties broken randomly.
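The kNN rule above, including random tie-breaking, can be sketched in a few lines; the training vectors and labels are illustrative only.

```python
import math
import random
from collections import Counter

# kNN classification: majority vote among the k nearest training vectors,
# with ties broken randomly.

def knn_classify(training, x, k=3):
    """training: list of (vector, label) pairs."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    top = votes.most_common()
    best = [lab for lab, n in top if n == top[0][1]]
    return random.choice(best)  # ties broken randomly

training = [([1.0, 1.0], "a"), ([1.1, 0.9], "a"),
            ([0.0, 0.1], "b"), ([0.1, 0.0], "b"), ([0.9, 1.1], "a")]
label = knn_classify(training, [1.0, 0.95], k=3)  # → "a"
```

With k=1 a single mislabeled neighbor decides the outcome; with k=3 one bad neighbor is outvoted, which is the robustness point made above.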

Friday, April 4, 2014

Notes Week 11 and NO MUDDIEST POINT FOR THIS WEEK



NO MUDDIEST POINT FOR the class on 03/31

As information on the internet grows continuously, personalized approaches are needed to give each user tailored information access based on their interests, browsing history, and so on.

The first step is collecting information about users. A basic requirement of such a system is that it must be able to uniquely identify users. Although accurate user identification is not a critical issue for systems that construct profiles representing groups of users, it is a crucial ability for any system that constructs profiles that represent individual users. There are five basic approaches to user identification: software agents, logins, enhanced proxy servers, cookies, and session ids.

The second step is user profile representation. The most common representation for user profiles is sets of keywords. These can be automatically extracted from Web documents or directly provided by the user. Weights, which are usually associated with keywords, are numerical representations of the user's interests. In order to address the polysemy problem inherent in keyword-based profiles, the profiles may be represented by a weighted semantic network in which each node represents a concept. Concept-based profiles are similar to semantic network-based profiles in the sense that both are represented by conceptual nodes and relationships between those nodes.

The third step is user profile construction. Keyword-based profiles are initially created by extracting keywords from Web pages collected from some information source, e.g., the user’s browsing history or bookmarks. Semantic network-based profiles are typically built by collecting explicit positive and/or negative feedback from users. Similar to keyword vector profile construction techniques, keywords are extracted from the user-rated pages. This section describes three representative systems that build user profiles represented as weighted concept hierarchies. Although each uses a different construction methodology, they each use a reference taxonomy as the basis of the profile.
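Keyword-based profile construction as described above can be sketched very simply: extract terms from pages in the user's browsing history and weight them by relative frequency. The pages and the tiny stopword list below are made-up examples, not from any real system.

```python
from collections import Counter

# Toy keyword-based user profile: term weights from browsing history.

STOPWORDS = {"the", "a", "of", "and", "to", "in", "for", "with"}

def build_profile(pages, top_n=5):
    """pages: list of page texts; returns {keyword: weight}."""
    counts = Counter()
    for text in pages:
        counts.update(t for t in text.lower().split() if t not in STOPWORDS)
    total = sum(counts.values())
    # Weight = relative frequency of the keyword across the history.
    return {term: n / total for term, n in counts.most_common(top_n)}

history = ["python tutorial for data analysis",
           "data visualization in python",
           "machine learning with python and data"]
profile = build_profile(history)
```

Real systems would use tf-idf or user ratings rather than raw frequency, and semantic network or concept-based profiles would map these keywords onto nodes in a taxonomy, but the extract-then-weight pattern is the same.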

Wednesday, March 26, 2014

Notes Week 10 and NO MUDDIEST POINT FOR THIS WEEK

NO MUDDIEST POINT FOR the class on 03/24

Week 11
IES CH14 Parallel IR
·      Index partitioning and replication are two popular approaches to improve the efficiency of information retrieval.
·      Intra-query parallelism means dividing the index into independent parts so that each node is responsible for a small piece of the overall index, which greatly increases efficiency.
·      The two predominant index partitioning schemes are document partitioning and term partitioning.
·      In a document-partitioned search engine, each of the n nodes is involved in processing all queries received by the engine. In a term-partitioned configuration, a query is seen by a given node only if the node’s index contains at least one of the query terms.
·      The main advantage of the document-partitioned approach is its simplicity. Because all index servers operate independently of each other, no additional complexity needs to be introduced into the low-level query processing routines.
·      Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.

·      Despite its potential performance advantage over the document-partitioned approach, at least for on-disk indices, term partitioning has several shortcomings that make it difficult to use in practice, such as scalability, load imbalance, and term-at-a-time query processing.
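The document-partitioned flow described above can be sketched as a broker fanning a query out to every node and merging the per-shard top-k lists. The shards and the term-overlap "ranker" below are toy stand-ins for a real scoring function.

```python
import heapq

# Document-partitioned query processing: each node scores the query
# against its own shard; the broker merges per-node top-k results.

def node_search(shard, query_terms, k):
    """shard: dict doc_id -> text. Toy score = count of matching terms."""
    scored = [(sum(t in doc.split() for t in query_terms), doc_id)
              for doc_id, doc in shard.items()]
    return heapq.nlargest(k, scored)

def broker_search(shards, query_terms, k=2):
    # Every node sees every query (unlike term partitioning, where a node
    # is contacted only if it holds one of the query terms).
    partials = [hit for shard in shards
                for hit in node_search(shard, query_terms, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partials)]

shards = [{"d1": "parallel index search", "d2": "cooking recipes"},
          {"d3": "distributed search engines", "d4": "index partitioning search"}]
top = broker_search(shards, ["index", "search"])
```

The merge step is cheap because each node returns only k candidates, which is one reason the document-partitioned design stays simple.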

Friday, February 28, 2014

Muddiest Point Week 7

Why is positive feedback more useful than negative feedback to an IR system?

Notes Week 8

Week 8
MIR Ch10 User Interface and Visualization
Human computer interaction
1)   Principle
·      Offering informative feedback is especially important for information access interfaces.
·      Reduce working memory load. Information access is an iterative process, the goals of which shift and change as information is encountered. One key way information access interfaces can help with memory load is to provide mechanisms for keeping track of choices made during the search process.
·      Provide alternative interfaces for novice and expert users. An important tradeoff in all user interface design is that of simplicity vs power.
2)   Role of visualization
·      Humans are highly attuned to images and visual information. Pictures can be captivating and appealing, especially if well designed.
·      The growing prevalence of fast graphics processors and high-resolution color monitors is increasing interest in information visualization.
·      Visualization of inherently abstract information is more difficult, and visualization of textually represented information is especially challenging.
3)   Evaluating
·      An important aspect of HCI is the methodology for evaluation of user interface techniques. Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems.
·      Empirical data involving human users is time-consuming to gather and difficult to draw conclusions from.
The information access process
1)   Model of interaction
·      Start with an information need
·      select a system and collections to search on
·      formulate a query
·      send the query to system
·      receive the results in the form of information items
·      scan, evaluate, and interpret the results
·      either stop, or,
·      reformulate the query and repeat step 4
2)   Earlier interface studies
The bulk of the literature on studies of information-seeking behavior concerns information intermediaries using online systems consisting of bibliographic records.