Week 2
IIR
·
Chapter 1 Section 1.2
Inverted index – build
the index in advance to reduce the speed
1.
Collect the documents to be indexed;
2.
Tokenize(令牌化)the text, turning each document into a list of tokens;
Inverted index has two parts: dictionary
and postings as the figure below:
Dictionary kept in memory, with pointers to
each posting list (much larger), which is stored on disk.
3.
Do linguistic preprocessing, producing a list of
normalized tokens, which are the indexing terms;
4. Index
the documents that each term occurs in by creating an inverted in- dex,
consisting of a dictionary and postings.
Core indexing
step: (see figure 1.4)
1.
Sorting the list so that the terms are
alphabetical;
2.
Multiple occurrences of the same term from the
same document are then merged;
3. Instances
of the same term are then grouped, and the result is split into a dictionary
and postings;
New terms:
Token
docID -integers
Document frequency -
the number of documents that contain each term
·
Chapter 2
1.Tokenization is the process of chopping character streams into tokens,
while linguistic preprocessing then deals with building equivalence classes of
tokens which are the set of terms that are indexed.
2. Document delineation and
character sequence decoding (How the basic unit of a document can be
defined and how the character sequence that it comprises is determined):
1)
Obtaining the character sequence in a document
2)
Choosing a document unit
3. Determining the
vocabulary of terms
1)
Tokenization
2)
Dropping common terms: stop words
3)
Normalization (equivalence classing of terms)
4)
Stemming and lemmatization
4. Faster postings
list intersection via skip pointers
5. Positional postings and phrase queries
New terms:
Document unit
INDEXING
GRANULARITY
STOP WORDS
·
Chapter 3
1.
I found lots of data structure related stuff in this chapter such as
binary tree or B-tree. In Chapters 1 and 2 the author developed the ideas underlying
inverted indexes for handling Boolean and proximity queries. In this chapter
the author is going to develop techniques that are robust to typographical
errors in the query, as well as alternative spellings.
2. In 3.1,
data structures are developed to help the search for terms in the vocabulary in
an inverted index.In 3.2, the
author study the idea of a wildcard query: a query such as *a*e*i*o*u*, which
seeks documents containing any term that includes all the five vowels in
sequence. The * symbol indicates any (possibly empty) string of characters.
Users pose such queries to a search engine when they are uncertain about how to
spell a query term, or seek documents containing variants of a query term; for
instance, the query automat* would seek documents containing any of the terms
automatic, automation and automated.
New
terms
WILDCARD
QUERY
PERMUTERM
INDEX
k-GRAM
INDEX

没有评论:
发表评论