IIR
CH 11. Probabilistic Information Retrieval
·
The probabilistic approach to IR provides a
different formal basis for a retrieval model and results in different
techniques for setting term weights.
·
The probability ranking principle
1.
The 1/0 loss case
In ranked retrieval, the system has a collection of documents; the user
issues a query, and an ordered list of documents is
returned.
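The PRP under 1/0 loss says to rank documents by decreasing estimated probability of relevance. A minimal sketch (the probability estimates here are purely illustrative):

```python
# Rank documents by their estimated P(R=1|d, q), highest first,
# as the PRP under 1/0 loss prescribes.
def rank_by_relevance(prob_relevant):
    """prob_relevant: dict mapping doc id -> estimated P(R=1|d, q)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)

probs = {"d1": 0.2, "d2": 0.9, "d3": 0.5}  # illustrative estimates
print(rank_by_relevance(probs))  # ['d2', 'd3', 'd1']
```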
2.
The PRP with retrieval costs
Let C1 be the cost of not
retrieving a relevant document and C0 the cost of retrieving a nonrelevant
document. Then the Probability Ranking Principle says that if, for a specific
document d and for all documents d′ not yet retrieved,
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′),
then d is the next document to be retrieved.
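The cost-based criterion above amounts to picking the not-yet-retrieved document with the lowest expected cost. A small sketch, with the cost constants and probability estimates as assumed inputs:

```python
# Cost-based PRP sketch: the next document to retrieve is the one
# minimizing expected cost  C0*P(R=0|d) - C1*P(R=1|d).
def next_document(prob_relevant, c1, c0):
    """prob_relevant: dict doc id -> estimated P(R=1|d, q)."""
    def expected_cost(d):
        p = prob_relevant[d]
        return c0 * (1 - p) - c1 * p
    return min(prob_relevant, key=expected_cost)
```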
·
The Binary Independence Model
BIM is the model that has
traditionally been used with the PRP. It introduces some simple assumptions, which
make estimating the probability function P(R|d, q) practical.
·
Deriving a ranking function for query terms
Given a query q, we wish to
order returned documents by descending P(R = 1|d, q).
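Under the BIM this ordering reduces to ranking by the retrieval status value RSV_d = Σ over query terms present in d of log(p_t(1 − u_t) / (u_t(1 − p_t))), where p_t and u_t are the probabilities of the term appearing in a relevant and a nonrelevant document. A sketch with p_t and u_t supplied as assumed estimates:

```python
import math

# Retrieval status value of a document under the BIM:
# sum of log odds ratios over query terms that occur in the document.
def rsv(doc_terms, query_terms, p, u):
    """p[t] = P(t in doc | relevant), u[t] = P(t in doc | nonrelevant)."""
    score = 0.0
    for t in query_terms & doc_terms:  # terms shared by query and document
        score += math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
    return score
```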
·
Probability estimates in theory
For each term t, the ct numbers for
the whole collection are estimated from a contingency table of counts of documents in the
collection, where dft is the number of documents that contain term t.
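From the contingency table, with N documents in total, S judged relevant, df_t containing t, and s of those relevant, the usual add-1/2 smoothed estimates give the term weight directly. A sketch:

```python
import math

# Smoothed term-weight estimate from the contingency table
# (add-1/2 smoothing, as commonly used with the BIM).
def term_weight(N, S, df_t, s):
    """N docs total, S relevant, df_t contain t, s relevant docs contain t."""
    p_t = (s + 0.5) / (S + 1)             # P(t in doc | relevant)
    u_t = (df_t - s + 0.5) / (N - S + 1)  # P(t in doc | nonrelevant)
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))
```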
·
Probability estimates in practice
Under the assumption that
relevant documents are a very small percentage of the collection, it is
plausible to approximate statistics for nonrelevant documents by statistics
from the whole collection.
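Under this approximation u_t ≈ df_t/N, and (with p_t fixed, e.g. at 0.5) the term weight collapses to an idf-like quantity, log((N − df_t)/df_t) ≈ log(N/df_t). A one-line sketch:

```python
import math

# With nonrelevant statistics approximated by the whole collection
# (u_t ~ df_t/N), the BIM term weight reduces to an idf-like value.
def idf_weight(N, df_t):
    return math.log((N - df_t) / df_t)
```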
CH 12. Language Models for Information Retrieval
·
Finite automata and language models
A traditional generative model
of a language, of the kind familiar from formal language theory, can be used
either to recognize or to generate strings.
·
The simplest form of language model simply
throws away all conditioning context, and estimates each term independently.
Such a model is called a unigram language model: P (t1t2t3t4) =
P(t1)P(t2)P(t3)P(t4)
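A unigram model is just the relative frequency of each term, and a sequence's probability is the product of its terms' probabilities. A toy sketch:

```python
from collections import Counter

# Maximum-likelihood unigram model: each term's probability is its
# relative frequency in the text the model is estimated from.
def unigram_model(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

# Probability of a sequence = product of individual term probabilities.
def sequence_prob(model, tokens):
    p = 1.0
    for t in tokens:
        p *= model.get(t, 0.0)  # unseen terms get probability 0
    return p

model = unigram_model(["a", "a", "b", "c"])
print(sequence_prob(model, ["a", "b"]))  # 0.5 * 0.25 = 0.125
```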
·
Multinomial distributions over words
Under
the unigram language model the order of words is irrelevant, and so such models
are often called “bag of words” models, as discussed in CH6. Even though there
is no conditioning on preceding context, this model nevertheless still gives
the probability of a particular ordering of terms. However, any other ordering
of this bag of terms will have the same probability.
·
The query likelihood model
Language
modeling is a quite general formal approach to IR, with many variant
realizations.
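In the basic query likelihood realization, each document gets its own (here unsmoothed, maximum-likelihood) unigram model M_d, and documents are ranked by P(q|M_d). A sketch:

```python
from collections import Counter

# Query likelihood sketch: score a document by the probability its
# MLE unigram model assigns to the query (no smoothing yet).
def query_likelihood(doc_tokens, query_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= counts[t] / total  # zero if t never occurs in the document
    return p
```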
·
The classic problem with using language models
is one of estimation terms appear very sparsely in documents. In particular,
some words will not have appeared in the document at all, but are possible
words for the in- formation need, which the user may have used in the query.
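A standard remedy is to smooth the document model with a collection model (linear interpolation, i.e. Jelinek-Mercer smoothing), so that query terms absent from the document no longer zero out the score; the mixing weight `lam` below is an assumed parameter:

```python
from collections import Counter

# Jelinek-Mercer (linear interpolation) smoothing: mix the document's
# unigram model with a collection-wide model so unseen query terms
# contribute a small nonzero probability instead of zeroing the score.
def smoothed_query_likelihood(doc_tokens, coll_tokens, query_tokens, lam=0.5):
    doc, coll = Counter(doc_tokens), Counter(coll_tokens)
    dlen, clen = len(doc_tokens), len(coll_tokens)
    p = 1.0
    for t in query_tokens:
        p *= lam * doc[t] / dlen + (1 - lam) * coll[t] / clen
    return p
```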