Index compression
When we use index compression to reduce space, it takes some extra time to decompress the data. How can we still ensure the efficiency of IR when we use compression?
Friday, January 31, 2014
Thursday, January 30, 2014
Notes Week 4
Week 4
IIR sections 1.3 and 1.4, chapter 6
CH1
1.3 Processing Boolean queries
· The five steps when processing a simple conjunctive query:
1. Locate the first term in the dictionary
2. Retrieve its postings
3. Locate the second term in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
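The step-5 merge can be sketched as a linear walk over the two sorted lists (a minimal Python sketch; the sample values follow IIR's Brutus/Calpurnia example):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of docIDs in one linear pass."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```

Because both lists are kept sorted by docID, the intersection costs O(x + y) for lists of lengths x and y.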
· For more complicated queries we can use query optimization.
· For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression.
1.4 The extended Boolean model versus ranked retrieval
· The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model, in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.
CH6 Scoring, term weighting and the vector space model
· Parametric and zone indexes serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
· tf-idf(t,d) assigns to term t a weight in document d that is highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents); lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal); and lowest when the term occurs in virtually all documents.
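A minimal sketch of one common tf-idf variant (raw term frequency times log base-10 inverse document frequency; the function name and this exact smoothing-free form are my choices, not the book's notation):

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight for a term in a document.

    tf: occurrences of the term in the document
    df: number of documents in the collection containing the term
    N:  total number of documents in the collection
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(N / df)

# A term occurring 3 times but only in 10 of 1000 docs gets a high weight;
# a term occurring in every document gets weight 0.
print(tf_idf(3, 10, 1000))    # 6.0
print(tf_idf(5, 1000, 1000))  # 0.0
```

This matches the note above: frequent-in-few-documents gives the highest weight; present-in-all-documents drives idf, and hence the weight, to zero.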
Friday, January 24, 2014
Notes Week 3
IIR
CH4
Chapter 4 is about index construction.
4.1
When building an information retrieval system, many decisions are based on the characteristics of the computer hardware on which the system runs.
· In order to access as much data as possible in a short time, we use the technique called caching.
· To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk.
· Operating systems generally read and write entire blocks. Block sizes of 8, 16, 32, and 64 KB are common.
· The processor can process data while the system bus is transferring data; we can use this feature to store compressed data on disk.
4.2 BLOCKED SORT-BASED INDEXING
· We discussed large collections that require secondary storage.
· Basic steps in constructing a nonpositional index:
1. Make a pass through the collection, assembling all term-docID pairs.
2. Sort the pairs with the term as the dominant key and docID as the secondary key.
3. Organize the docIDs for each term into a postings list and compute statistics like term and document frequency.
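The three steps can be sketched in memory for a toy collection (whitespace tokenization assumed; real BSBI processes the collection block by block on disk):

```python
from itertools import groupby

# A toy collection: docID -> text (whitespace tokenization assumed).
docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}

# Steps 1-2: assemble term-docID pairs, then sort by (term, docID).
# The set() removes duplicate occurrences within a document.
pairs = sorted({(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.split()})

# Step 3: group docIDs per term into postings lists;
# document frequency is simply the length of each postings list.
index = {term: [doc_id for _, doc_id in group]
         for term, group in groupby(pairs, key=lambda p: p[0])}

print(index["home"], len(index["home"]))  # [1, 2] 2
```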
· With main memory insufficient, we need to use an external sorting algorithm.
· Inversion involves two steps: first, we sort the termID–docID pairs; next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. In the final step, the algorithm simultaneously merges the blocks into one large merged index.
· The time complexity of BSBI is O(T log T), where T is an upper bound on the number of items we must sort (the number of term-docID pairs).
4.3 SINGLE-PASS IN-MEMORY INDEXING
· Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing, or SPIMI.
· SPIMI uses terms rather than termIDs.
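A sketch of what building one SPIMI block looks like (terms key the dictionary directly, no term-to-termID mapping; writing full blocks to disk and the final block merge are omitted here):

```python
def spimi_invert(token_stream):
    """Build one SPIMI block from a stream of (term, docID) pairs.

    Terms, not termIDs, key the dictionary; a term's postings list grows
    as pairs stream in, and terms are sorted only when the block is
    "written out". Real SPIMI writes each full block to disk and merges
    the blocks afterwards.
    """
    dictionary = {}
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
    # Sort terms before writing the block, so blocks can later be merged
    # with a simple multi-way merge.
    return {term: dictionary[term] for term in sorted(dictionary)}

block = spimi_invert([("brutus", 1), ("caesar", 1), ("caesar", 2)])
print(block)  # {'brutus': [1], 'caesar': [1, 2]}
```

No sort of all pairs is needed while the block is being filled, which is why SPIMI is faster and more memory-efficient than BSBI's sort-everything approach.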
4.4 DISTRIBUTED INDEXING
· Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document.
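Distributed index construction maps naturally onto MapReduce. A single-process sketch of the two phases (function names are illustrative, not from any real framework):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser (map) task: emit (term, docID) key-value pairs for one split."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter (reduce) task: collect all docIDs for each term
    into a postings list."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)

# The master node would assign splits to parsers and term partitions to
# inverters; here we just run both phases in one process.
pairs = map_phase(1, "caesar died") + map_phase(2, "caesar lives")
print(reduce_phase(pairs))  # {'caesar': [1, 2], 'died': [1], 'lives': [2]}
```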
4.5 DYNAMIC INDEXING
· We choose the simple option of storing the index as one large file here.
· Because of the complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy. They do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched to the new index, and the old index is deleted.
CH5 INDEX COMPRESSION
Previous chapter introduce the
inverted index and the dictionary. In CH5, the author mainly talked about the
compression techniques that are essential for an IR system.
· The statistical properties of terms in information retrieval
1) Heaps' law estimates vocabulary size as a function of collection size: M = kT^b, where M is the number of distinct terms and T is the number of tokens.
2) Zipf's law models the distribution of terms: the ith most frequent term has collection frequency proportional to 1/i.
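A quick numeric check of Heaps' law, using the k and b values IIR fits for Reuters-RCV1 (illustrative constants, not universal; other collections need their own fit):

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law estimate M = k * T**b of vocabulary size M for a
    collection of T tokens. k=44 and b=0.49 are the values IIR fits
    for Reuters-RCV1."""
    return k * T ** b

# For a collection of one million tokens, the law predicts a vocabulary
# of roughly 38,000 distinct terms under these constants.
print(round(heaps_vocab(1_000_000)))
```

The key qualitative point is sublinear growth: because b < 1, doubling the collection far less than doubles the vocabulary.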
· Dictionary compression
1) Dictionary as a string: the simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries; storing the dictionary terms as one long string of characters instead avoids the space wasted by fixed-width entries.
2) Blocked storage: group the terms in the string into blocks and keep a term pointer only for the first term of each block.
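A sketch of the dictionary-as-a-string idea (fixed-width entries replaced by one long string plus term pointers; the helper names are mine):

```python
def dict_as_string(sorted_terms):
    """Pack a lexicographically sorted vocabulary into one long string
    plus a list of term pointers (character offsets), instead of an
    array of wasteful fixed-width entries."""
    big_string = "".join(sorted_terms)
    pointers, offset = [], 0
    for term in sorted_terms:
        pointers.append(offset)
        offset += len(term)
    return big_string, pointers

def lookup(big_string, pointers, i):
    """Recover the i-th term: it spans from its own pointer to the next
    term's pointer (or to the end of the string for the last term)."""
    end = pointers[i + 1] if i + 1 < len(pointers) else len(big_string)
    return big_string[pointers[i]:end]

s, p = dict_as_string(["a", "an", "and", "ant"])
print(s)               # aanandant
print(lookup(s, p, 2)  # and
      )
```

Each term now costs its actual length plus one pointer, rather than a fixed maximum width.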
· Postings file compression: store the gaps between successive docIDs instead of the docIDs themselves; the gaps are mostly small and can be encoded compactly.
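One standard scheme for encoding those gaps is variable-byte coding: 7 data bits per byte, with the high bit marking the last byte of each number. A sketch, using the gaps of the postings list 824, 829, 215406 (gaps 824, 5, 214577) as a check:

```python
def vb_encode(numbers):
    """Variable-byte encode a list of gaps: 7 data bits per byte,
    high bit set on the terminating byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n:
            chunk.append(n % 128)
            n //= 128
        chunk[0] |= 128            # mark the last byte of this number
        out.extend(reversed(chunk))
    return bytes(out)

def vb_decode(data):
    """Decode a variable-byte stream back into the list of gaps."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:             # continuation byte: accumulate 7 bits
            n = n * 128 + byte
        else:                      # terminating byte: finish this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = vb_encode(gaps)
print(len(encoded))                 # 6 bytes instead of 3 fixed-width ints
print(vb_decode(encoded) == gaps)   # True
```

Small gaps take a single byte, so frequent terms (whose gaps are tiny) compress best, exactly the distribution Zipf's law predicts.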
New terms:
seek time; buffer; termID; SPIMI; MapReduce; master node; splits; key-value pairs
Friday, January 17, 2014
Notes Week 1
Week 1
FOA
· Linguistic expressions must make sense in meaning (consider word games) so that the search engine can figure out what is being talked about.
· New types of electronic artifacts bring an empirical grounding for new theories of language that may well be revolutionary.
· FOA conversation loop:
1. Asking a question – query (maybe imprecise or vague)
2. Constructing an answer – retrieval by the search engine
3. Assessing the answer – relevance feedback (filtering the answers)
· IR – information retrieval: has existed since 1987 and borrowed heavily from computational linguistics.
· The fundamental operation of a search engine is matching the descriptive features supplied by users against the documents in the database. How? By keywords.
IES
1.1.1 Web search engines usually incorporate layers of caching and replication, taking advantage of commonly occurring queries and exploiting parallel processing.
1.1.2 Search application systems must interface directly with the file system layer of the operating system and must be engineered to handle a heavy update load.
1.1.3 Techniques for categorization and filtering have the widest applicability, and the book provides an introduction to these areas.
1.2.1 To perform relevance ranking, the search engine computes a score, sometimes called a retrieval status value, for each document.
1.2.2 A particular document might be an e-mail message, a Web page, a news article, or even a video.
1.2.3 Response time, the interval between issuing a query and receiving the results, is the most visible aspect of efficiency experienced by a user.
MIR
· Humans have organized information for later retrieval and searching by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and books.
· IR has gained a place with other technologies at the center of the stage.
· How the Web Changed Search
1. The characteristics of the document collection itself:
1) Composed of pages distributed over millions of sites and connected through hyperlinks.
2) Crawling: a new phase introduced by the Web.
2. The size of the collection.
3. In a very large collection, predicting relevance is much harder than before.
Friday, January 10, 2014
Notes - Week 2
Week 2
IIR
· Chapter 1 Section 1.2
Inverted index – build the index in advance to speed up search.
1. Collect the documents to be indexed;
2. Tokenize the text, turning each document into a list of tokens;
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
The inverted index has two parts, a dictionary and postings, as the figure below shows. The dictionary is kept in memory, with a pointer to each postings list (much larger), which is stored on disk.
Core indexing steps (see figure 1.4):
1. Sort the list so that the terms are alphabetical;
2. Multiple occurrences of the same term from the same document are then merged;
3. Instances of the same term are then grouped, and the result is split into a dictionary and postings.
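These steps can be sketched on a toy version of IIR's figure 1.4 example (the term-docID pairs below are abridged from the book's two Julius Caesar documents; tokenization is assumed already done):

```python
# Term-docID pairs as produced by tokenization; note the duplicate ("i", 1).
pairs = [("i", 1), ("did", 1), ("enact", 1), ("caesar", 1), ("i", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("caesar", 2)]

pairs.sort()                       # step 1: alphabetical by term, then docID
index = {}
for term, doc_id in pairs:         # steps 2-3: merge duplicates from the same
    postings = index.setdefault(term, [])   # document, group into postings
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)

print(index["caesar"])  # [1, 2]
print(index["i"])       # [1]  (the duplicate occurrence was merged)
```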
New terms:
token; docID (integers); document frequency – the number of documents that contain each term
· Chapter 2
1. Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.
2. Document delineation and character sequence decoding (how the basic unit of a document can be defined and how the character sequence that it comprises is determined):
1) Obtaining the character sequence in a document
2) Choosing a document unit
3. Determining the vocabulary of terms:
1) Tokenization
2) Dropping common terms: stop words
3) Normalization (equivalence classing of terms)
4) Stemming and lemmatization
4. Faster postings list intersection via skip pointers
5. Positional postings and phrase queries
New terms:
DOCUMENT UNIT; INDEXING GRANULARITY; STOP WORDS
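The skip-pointer intersection from item 4 above can be sketched as follows (using the sqrt-of-length skip spacing IIR suggests as a heuristic; skips are simulated here by index jumps rather than stored pointers):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping ahead in strides of
    about sqrt(len(list)) when the skipped-to entry does not overshoot."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skips on p1 while they don't overshoot p2[j]
            while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            if p1[i] < p2[j]:
                i += 1
        else:
            while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            if p2[j] < p1[i]:
                j += 1
    return answer
```

Skips help most when one list is much longer than the other, letting the walk leap over runs of docIDs that cannot be in the intersection.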
· Chapter 3
1. I found lots of data-structure-related material in this chapter, such as binary trees and B-trees. In Chapters 1 and 2 the author developed the ideas underlying inverted indexes for handling Boolean and proximity queries. In this chapter the author develops techniques that are robust to typographical errors in the query, as well as alternative spellings.
2. In 3.1, data structures are developed to help the search for terms in the vocabulary in an inverted index. In 3.2, the author studies the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* would seek documents containing any of the terms automatic, automation, and automated.
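A permuterm index handles single-* wildcards via rotations of each term with a '$' end marker. A sketch of generating the index keys for one term (helper name is mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation becomes a permuterm
    index key pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```

To answer a query like m*n, append '$' and rotate until the * is at the end (m*n$ becomes n$m*), then do an ordinary prefix lookup on the rotation keys. The price is an index several times larger, since every term contributes one rotation per character.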
New terms:
WILDCARD QUERY; PERMUTERM INDEX; k-GRAM INDEX