IS2140 Information Retrievial: Notes Week 3

IIR

CH4

Chapter 4 is about the indexing construction.

4.1

When building an information retrieval system, many decisions are based on the characteristic of the computer hardware on which the system runs.

· In order to access as much data as possible in a short time, we use the technique called caching.

· To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk.

· Operating systems generally read and write entire blocks. Block sizes of 8, 16, 32, and 64 KB are common.

· The processor can process data when the system bus is transferring data, we can use this feature to store compressed data on disk.

4.2 BLOCKED SORT-BASED INDEXING

· We discussed big documents that need a secondary storage.

· Basic steps in constructing a nonpositional index:

1.Make a pass through the collection assembling all term-docID pairs.

2.Sort the pairs with the term as the dominant key and docID as the secondary key.

3.Organize the docIDs for each term into a postings list and compute statistics like term and document frequency.

· With main memory insufficient, we need to use an external sorting algorithm.

· Inversion involves two steps: First, we sort the termID–docID pairs. Next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. In the final step, the algorithm simultaneously merges the blocks into one large merged index.

· The time complexity of BSBI is O(T logT).

4.3 SINGLE-PASS IN-MEMORY INDEXING

· Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing or SPIMI.

· SPIMI uses terms rather than termIDs.

4.4 DISTRIBUTED INDEXING

· Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines – either according to term or according to document.

4.5 DYNAMIC INDEXING

· We choose the simple option of storing the index as one large file here.

· Because of this complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy. They do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched from the new index and the old index is deleted.

CH5 INDEX COMPRESSION

Previous chapter introduce the inverted index and the dictionary. In CH5, the author mainly talked about the compression techniques that are essential for an IR system.

· The statistical properties of terms in information retrieval

1) Heaps’ law: Estimating the number of terms: estimates vocabulary size as a function of collection size: M = kTb.

2) Zipf’s law: Modeling the distribution of terms

· Dictionary compression

1) Dictionary as a string

The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.

2) Blocked storage

· Postings file compression

New Term

Seek time; buffer; termID; SPIMI; Mapreduce; master node; splits; key-value pairs

IS2140 Information Retrievial

2014年1月24日星期五

Notes Week 3

没有评论:

发表评论