IIR
CH 11. Probabilistic Information Retrieval
·
The probabilistic approach to IR provides a
different formal basis for a retrieval model and results in different
techniques for setting term weights.
·
The probability ranking principle
1.
The 1/0 loss case
In ranked retrieval, the system has a collection of documents; the user
issues a query, and an ordered list of documents is
returned.
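The PRP under 1/0 loss says to rank documents by decreasing estimated probability of relevance. A minimal sketch (the probability estimates here are purely illustrative):

```python
# Rank documents by their estimated P(R=1|d, q), highest first,
# as the PRP under 1/0 loss prescribes.
def rank_by_relevance(prob_relevant):
    """prob_relevant: dict mapping doc id -> estimated P(R=1|d, q)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)

probs = {"d1": 0.2, "d2": 0.9, "d3": 0.5}  # illustrative estimates
print(rank_by_relevance(probs))  # ['d2', 'd3', 'd1']
```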
2.
The PRP with retrieval costs
Let C1 be the cost of not
retrieving a relevant document and C0 the cost of retrieving a nonrelevant
document. Then the Probability Ranking Principle says that if, for a specific
document d and for all documents d′ not yet retrieved,
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′),
then d is the next document to be retrieved.
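The cost-based criterion above amounts to picking the not-yet-retrieved document with the lowest expected cost. A small sketch, with the cost constants and probability estimates as assumed inputs:

```python
# Cost-based PRP sketch: the next document to retrieve is the one
# minimizing expected cost  C0*P(R=0|d) - C1*P(R=1|d).
def next_document(prob_relevant, c1, c0):
    """prob_relevant: dict doc id -> estimated P(R=1|d, q)."""
    def expected_cost(d):
        p = prob_relevant[d]
        return c0 * (1 - p) - c1 * p
    return min(prob_relevant, key=expected_cost)
```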
·
The Binary Independence Model
BIM is the model that has
traditionally been used with the PRP. It introduces some simple assumptions, which
make estimating the probability function P(R|d, q) practical.
·
Deriving a ranking function for query terms
Given a query q, we wish to
order returned documents by descending P(R = 1|d, q).
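Under the BIM this ordering reduces to ranking by the retrieval status value RSV_d = Σ over query terms present in d of log(p_t(1 − u_t) / (u_t(1 − p_t))), where p_t and u_t are the probabilities of the term appearing in a relevant and a nonrelevant document. A sketch with p_t and u_t supplied as assumed estimates:

```python
import math

# Retrieval status value of a document under the BIM:
# sum of log odds ratios over query terms that occur in the document.
def rsv(doc_terms, query_terms, p, u):
    """p[t] = P(t in doc | relevant), u[t] = P(t in doc | nonrelevant)."""
    score = 0.0
    for t in query_terms & doc_terms:  # terms shared by query and document
        score += math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
    return score
```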
·
Probability estimates in theory
For each term t, the ct numbers for
the whole collection are estimated from a contingency table of counts of documents in the
collection, where dft is the number of documents that contain term t.
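From the contingency table, with N documents in total, S judged relevant, df_t containing t, and s of those relevant, the usual add-1/2 smoothed estimates give the term weight directly. A sketch:

```python
import math

# Smoothed term-weight estimate from the contingency table
# (add-1/2 smoothing, as commonly used with the BIM).
def term_weight(N, S, df_t, s):
    """N docs total, S relevant, df_t contain t, s relevant docs contain t."""
    p_t = (s + 0.5) / (S + 1)             # P(t in doc | relevant)
    u_t = (df_t - s + 0.5) / (N - S + 1)  # P(t in doc | nonrelevant)
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))
```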
·
Probability estimates in practice
Under the assumption that
relevant documents are a very small percentage of the collection, it is
plausible to approximate statistics for nonrelevant documents by statistics
from the whole collection.
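Under this approximation u_t ≈ df_t/N, and (with p_t fixed, e.g. at 0.5) the term weight collapses to an idf-like quantity, log((N − df_t)/df_t) ≈ log(N/df_t). A one-line sketch:

```python
import math

# With nonrelevant statistics approximated by the whole collection
# (u_t ~ df_t/N), the BIM term weight reduces to an idf-like value.
def idf_weight(N, df_t):
    return math.log((N - df_t) / df_t)
```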
CH 12. Language Models for Information Retrieval
·
Finite automata and language models
A traditional generative model
of a language, of the kind familiar from formal language theory, can be used
either to recognize or to generate strings.
·
The simplest form of language model simply
throws away all conditioning context, and estimates each term independently.
Such a model is called a unigram language model: P (t1t2t3t4) =
P(t1)P(t2)P(t3)P(t4)
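A unigram model is just the relative frequency of each term, and a sequence's probability is the product of its terms' probabilities. A toy sketch:

```python
from collections import Counter

# Maximum-likelihood unigram model: each term's probability is its
# relative frequency in the text the model is estimated from.
def unigram_model(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

# Probability of a sequence = product of individual term probabilities.
def sequence_prob(model, tokens):
    p = 1.0
    for t in tokens:
        p *= model.get(t, 0.0)  # unseen terms get probability 0
    return p

model = unigram_model(["a", "a", "b", "c"])
print(sequence_prob(model, ["a", "b"]))  # 0.5 * 0.25 = 0.125
```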
·
Multinomial distributions over words
Under
the unigram language model the order of words is irrelevant, and so such models
are often called “bag of words” models, as discussed in CH6. Even though there
is no conditioning on preceding context, this model nevertheless still gives
the probability of a particular ordering of terms. However, any other ordering
of this bag of terms will have the same probability.
·
The query likelihood model
Language
modeling is a quite general formal approach to IR, with many variant
realizations.
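In the basic query likelihood realization, each document gets its own (here unsmoothed, maximum-likelihood) unigram model M_d, and documents are ranked by P(q|M_d). A sketch:

```python
from collections import Counter

# Query likelihood sketch: score a document by the probability its
# MLE unigram model assigns to the query (no smoothing yet).
def query_likelihood(doc_tokens, query_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    p = 1.0
    for t in query_tokens:
        p *= counts[t] / total  # zero if t never occurs in the document
    return p
```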
·
The classic problem with using language models
is one of estimation terms appear very sparsely in documents. In particular,
some words will not have appeared in the document at all, but are possible
words for the in- formation need, which the user may have used in the query.
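A standard remedy is to smooth the document model with a collection model (linear interpolation, i.e. Jelinek-Mercer smoothing), so that query terms absent from the document no longer zero out the score; the mixing weight `lam` below is an assumed parameter:

```python
from collections import Counter

# Jelinek-Mercer (linear interpolation) smoothing: mix the document's
# unigram model with a collection-wide model so unseen query terms
# contribute a small nonzero probability instead of zeroing the score.
def smoothed_query_likelihood(doc_tokens, coll_tokens, query_tokens, lam=0.5):
    doc, coll = Counter(doc_tokens), Counter(coll_tokens)
    dlen, clen = len(doc_tokens), len(coll_tokens)
    p = 1.0
    for t in query_tokens:
        p *= lam * doc[t] / dlen + (1 - lam) * coll[t] / clen
    return p
```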