IS2140 Information Retrievial: Notes

Week 2

IIR

· Chapter 1 Section 1.2

Inverted index – build the index in advance to reduce the speed

1. Collect the documents to be indexed;

2. Tokenize(令牌化)the text, turning each document into a list of tokens;

Inverted index has two parts: dictionary and postings as the figure below:

Dictionary kept in memory, with pointers to each posting list (much larger), which is stored on disk.

3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;

4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings.

Core indexing step: (see figure 1.4)

1. Sorting the list so that the terms are alphabetical;

2. Multiple occurrences of the same term from the same document are then merged;

3. Instances of the same term are then grouped, and the result is split into a dictionary and postings;

New terms:

Token

docID -integers

Document frequency - the number of documents that contain each term

· Chapter 2

1.Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed.

2. Document delineation and character sequence decoding (How the basic unit of a document can be defined and how the character sequence that it comprises is determined):

1) Obtaining the character sequence in a document

2) Choosing a document unit

3. Determining the vocabulary of terms

1) Tokenization

2) Dropping common terms: stop words

3) Normalization (equivalence classing of terms)

4) Stemming and lemmatization

4. Faster postings list intersection via skip pointers

5. Positional postings and phrase queries

New terms:

Document unit

INDEXING GRANULARITY

STOP WORDS

· Chapter 3

1. I found lots of data structure related stuff in this chapter such as binary tree or B-tree. In Chapters 1 and 2 the author developed the ideas underlying inverted indexes for handling Boolean and proximity queries. In this chapter the author is going to develop techniques that are robust to typographical errors in the query, as well as alternative spellings.

2. In 3.1, data structures are developed to help the search for terms in the vocabulary in an inverted index.In 3.2, the author study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all the five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* would seek documents containing any of the terms automatic, automation and automated.

New terms

WILDCARD QUERY

PERMUTERM INDEX

k-GRAM INDEX

IS2140 Information Retrievial

2014年1月10日星期五

Notes - Week 2

没有评论:

发表评论