IIR
CH4
Chapter
4 is about the indexing construction.
4.1
When
building an information retrieval system, many decisions are based on the
characteristic of the computer hardware on which the system runs.
·
In order to access as much data as possible in a
short time, we use the technique called caching.
·
To maximize data transfer rates, chunks of data
that will be read together should therefore be stored contiguously on disk.
·
Operating systems generally read and write
entire blocks. Block sizes of 8, 16, 32, and 64 KB are common.
·
The processor can process data when the system
bus is transferring data, we can use this feature to store compressed data on
disk.
4.2
BLOCKED SORT-BASED INDEXING
·
We discussed big documents that need a secondary
storage.
·
Basic steps in constructing a nonpositional
index:
1.Make a pass through the
collection assembling all term-docID pairs.
2.Sort the pairs with the term
as the dominant key and docID as the secondary key.
3.Organize the docIDs for each
term into a postings list and compute statistics like term and document
frequency.
·
With main memory insufficient, we need to use an
external sorting algorithm.
·
Inversion
involves two steps: First, we sort the termID–docID pairs. Next, we collect all
termID–docID pairs with the same termID into a postings list, where a posting
is simply a docID. In the final step, the algorithm simultaneously merges the
blocks into one large merged index.
· The time complexity of BSBI is O(T logT).
4.3 SINGLE-PASS IN-MEMORY INDEXING
·
Blocked sort-based indexing has excellent
scaling properties, but it needs a data structure for mapping terms to termIDs.
For very large collections, this data structure does not fit into memory. A
more scalable alternative is single-pass in-memory indexing or SPIMI.
·
SPIMI uses terms rather than termIDs.
4.4 DISTRIBUTED
INDEXING
·
Web search engines, therefore, use distributed
indexing algorithms for index construction. The result of the construction
process is a distributed index that is partitioned across several machines –
either according to term or according to document.
4.5 DYNAMIC INDEXING
·
We choose the simple option of storing the index
as one large file here.
·
Because of this complexity of dynamic indexing,
some large search engines adopt a reconstruction-from-scratch strategy. They do
not construct indexes dynamically. Instead, a new index is built from scratch
periodically. Query processing is then switched from the new index and the old
index is deleted.
CH5 INDEX COMPRESSION
Previous chapter introduce the
inverted index and the dictionary. In CH5, the author mainly talked about the
compression techniques that are essential for an IR system.
·
The statistical properties of terms in
information retrieval
1) Heaps’
law: Estimating the number of terms: estimates vocabulary size as a function of
collection size: M = kTb.
2) Zipf’s
law: Modeling the distribution of terms
·
Dictionary compression
1) Dictionary
as a string
The simplest data
structure for the dictionary is to sort the vocabulary lexicographically and
store it in an array of fixed-width entries.
2) Blocked
storage
·
Postings file compression
New Term
Seek time; buffer; termID; SPIMI; Mapreduce;
master node; splits; key-value pairs
没有评论:
发表评论