Index compression
When we use index compression to reduce space, it takes some extra time to decompress the data. How can we still ensure the efficiency of IR when we use compression?
Friday, January 31, 2014
Thursday, January 30, 2014
Notes Week 4
Week 4
IIR sections 1.3 and 1.4, chapter 6
CH1
1.3 Processing Boolean queries
· The five steps when processing a simple conjunctive query:
1. Locate the first term in the dictionary
2. Retrieve its postings
3. Locate the second term in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
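The step-5 merge can be sketched as a linear walk over the two sorted lists (a minimal Python sketch; the sample values follow IIR's Brutus/Calpurnia example):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists of docIDs in one linear pass."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```

Because both lists are kept sorted by docID, the intersection costs O(x + y) for lists of lengths x and y.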
· For more complicated queries we can use query optimization.
· For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression.
1.4 The extended Boolean model versus ranked retrieval
· The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model, in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.
CH6 Scoring, term weighting and the vector space model
· Parametric and zone indexes serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
· tf-idf(t,d) assigns to term t a weight in document d that is highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents); lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal); and lowest when the term occurs in virtually all documents.
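A minimal sketch of one common tf-idf variant (raw term frequency times log base-10 inverse document frequency; the function name and this exact smoothing-free form are my choices, not the book's notation):

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight for a term in a document.

    tf: occurrences of the term in the document
    df: number of documents in the collection containing the term
    N:  total number of documents in the collection
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log10(N / df)

# A term occurring 3 times but only in 10 of 1000 docs gets a high weight;
# a term occurring in every document gets weight 0.
print(tf_idf(3, 10, 1000))    # 6.0
print(tf_idf(5, 1000, 1000))  # 0.0
```

This matches the note above: frequent-in-few-documents gives the highest weight; present-in-all-documents drives idf, and hence the weight, to zero.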
Friday, January 24, 2014
Notes Week 3
IIR
CH4
Chapter 4 is about index construction.
4.1
When building an information retrieval system, many decisions are based on the characteristics of the computer hardware on which the system runs.
· In order to access as much data as possible in a short time, we use the technique called caching.
· To maximize data transfer rates, chunks of data that will be read together should therefore be stored contiguously on disk.
· Operating systems generally read and write entire blocks. Block sizes of 8, 16, 32, and 64 KB are common.
· The processor can process data while the system bus is transferring data; we can use this feature to store compressed data on disk.
4.2 BLOCKED SORT-BASED INDEXING
· We discussed large collections that require secondary storage.
· Basic steps in constructing a nonpositional index:
1. Make a pass through the collection, assembling all term-docID pairs.
2. Sort the pairs with the term as the dominant key and docID as the secondary key.
3. Organize the docIDs for each term into a postings list and compute statistics like term and document frequency.
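The three steps can be sketched in memory for a toy collection (whitespace tokenization assumed; real BSBI processes the collection block by block on disk):

```python
from itertools import groupby

# A toy collection: docID -> text (whitespace tokenization assumed).
docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}

# Steps 1-2: assemble term-docID pairs, then sort by (term, docID).
# The set() removes duplicate occurrences within a document.
pairs = sorted({(term, doc_id)
                for doc_id, text in docs.items()
                for term in text.split()})

# Step 3: group docIDs per term into postings lists;
# document frequency is simply the length of each postings list.
index = {term: [doc_id for _, doc_id in group]
         for term, group in groupby(pairs, key=lambda p: p[0])}

print(index["home"], len(index["home"]))  # [1, 2] 2
```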
· With main memory insufficient, we need to use an external sorting algorithm.
· Inversion involves two steps: first, we sort the termID–docID pairs; next, we collect all termID–docID pairs with the same termID into a postings list, where a posting is simply a docID. In the final step, the algorithm simultaneously merges the blocks into one large merged index.
· The time complexity of BSBI is O(T log T), where T is an upper bound on the number of items we must sort (the number of term-docID pairs).
4.3 SINGLE-PASS IN-MEMORY INDEXING
· Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs. For very large collections, this data structure does not fit into memory. A more scalable alternative is single-pass in-memory indexing, or SPIMI.
· SPIMI uses terms rather than termIDs.
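A sketch of what building one SPIMI block looks like (terms key the dictionary directly, no term-to-termID mapping; writing full blocks to disk and the final block merge are omitted here):

```python
def spimi_invert(token_stream):
    """Build one SPIMI block from a stream of (term, docID) pairs.

    Terms, not termIDs, key the dictionary; a term's postings list grows
    as pairs stream in, and terms are sorted only when the block is
    "written out". Real SPIMI writes each full block to disk and merges
    the blocks afterwards.
    """
    dictionary = {}
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
    # Sort terms before writing the block, so blocks can later be merged
    # with a simple multi-way merge.
    return {term: dictionary[term] for term in sorted(dictionary)}

block = spimi_invert([("brutus", 1), ("caesar", 1), ("caesar", 2)])
print(block)  # {'brutus': [1], 'caesar': [1, 2]}
```

No sort of all pairs is needed while the block is being filled, which is why SPIMI is faster and more memory-efficient than BSBI's sort-everything approach.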
4.4 DISTRIBUTED INDEXING
· Web search engines, therefore, use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document.
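Distributed index construction maps naturally onto MapReduce. A single-process sketch of the two phases (function names are illustrative, not from any real framework):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser (map) task: emit (term, docID) key-value pairs for one split."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Inverter (reduce) task: collect all docIDs for each term
    into a postings list."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)

# The master node would assign splits to parsers and term partitions to
# inverters; here we just run both phases in one process.
pairs = map_phase(1, "caesar died") + map_phase(2, "caesar lives")
print(reduce_phase(pairs))  # {'caesar': [1, 2], 'died': [1], 'lives': [2]}
```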
4.5 DYNAMIC INDEXING
· We choose the simple option of storing the index as one large file here.
· Because of the complexity of dynamic indexing, some large search engines adopt a reconstruction-from-scratch strategy. They do not construct indexes dynamically. Instead, a new index is built from scratch periodically. Query processing is then switched to the new index, and the old index is deleted.
CH5 INDEX COMPRESSION
Previous chapter introduce the
inverted index and the dictionary. In CH5, the author mainly talked about the
compression techniques that are essential for an IR system.
· The statistical properties of terms in information retrieval
1) Heaps' law estimates vocabulary size as a function of collection size: M = kT^b, where M is the number of distinct terms and T is the number of tokens.
2) Zipf's law models the distribution of terms: the ith most frequent term has collection frequency proportional to 1/i.
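A quick numeric check of Heaps' law, using the k and b values IIR fits for Reuters-RCV1 (illustrative constants, not universal; other collections need their own fit):

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law estimate M = k * T**b of vocabulary size M for a
    collection of T tokens. k=44 and b=0.49 are the values IIR fits
    for Reuters-RCV1."""
    return k * T ** b

# For a collection of one million tokens, the law predicts a vocabulary
# of roughly 38,000 distinct terms under these constants.
print(round(heaps_vocab(1_000_000)))
```

The key qualitative point is sublinear growth: because b < 1, doubling the collection far less than doubles the vocabulary.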
· Dictionary compression
1) Dictionary as a string: the simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries; storing the dictionary terms as one long string of characters instead avoids the space wasted by fixed-width entries.
2) Blocked storage: group the terms in the string into blocks and keep a term pointer only for the first term of each block.
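A sketch of the dictionary-as-a-string idea (fixed-width entries replaced by one long string plus term pointers; the helper names are mine):

```python
def dict_as_string(sorted_terms):
    """Pack a lexicographically sorted vocabulary into one long string
    plus a list of term pointers (character offsets), instead of an
    array of wasteful fixed-width entries."""
    big_string = "".join(sorted_terms)
    pointers, offset = [], 0
    for term in sorted_terms:
        pointers.append(offset)
        offset += len(term)
    return big_string, pointers

def lookup(big_string, pointers, i):
    """Recover the i-th term: it spans from its own pointer to the next
    term's pointer (or to the end of the string for the last term)."""
    end = pointers[i + 1] if i + 1 < len(pointers) else len(big_string)
    return big_string[pointers[i]:end]

s, p = dict_as_string(["a", "an", "and", "ant"])
print(s)               # aanandant
print(lookup(s, p, 2)  # and
      )
```

Each term now costs its actual length plus one pointer, rather than a fixed maximum width.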
· Postings file compression: store the gaps between successive docIDs instead of the docIDs themselves; the gaps are mostly small and can be encoded compactly.
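One standard scheme for encoding those gaps is variable-byte coding: 7 data bits per byte, with the high bit marking the last byte of each number. A sketch, using the gaps of the postings list 824, 829, 215406 (gaps 824, 5, 214577) as a check:

```python
def vb_encode(numbers):
    """Variable-byte encode a list of gaps: 7 data bits per byte,
    high bit set on the terminating byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n:
            chunk.append(n % 128)
            n //= 128
        chunk[0] |= 128            # mark the last byte of this number
        out.extend(reversed(chunk))
    return bytes(out)

def vb_decode(data):
    """Decode a variable-byte stream back into the list of gaps."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:             # continuation byte: accumulate 7 bits
            n = n * 128 + byte
        else:                      # terminating byte: finish this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = vb_encode(gaps)
print(len(encoded))                 # 6 bytes instead of 3 fixed-width ints
print(vb_decode(encoded) == gaps)   # True
```

Small gaps take a single byte, so frequent terms (whose gaps are tiny) compress best, exactly the distribution Zipf's law predicts.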
New terms:
seek time; buffer; termID; SPIMI; MapReduce; master node; splits; key-value pairs
Friday, January 17, 2014
Notes Week 1
Week 1
FOA
· Linguistic expressions must make sense in meaning (consider word games) so that the search engine can figure out what is being talked about.
· New types of electronic artifacts bring an empirical grounding for new theories of language that may well be revolutionary.
· FOA conversation loop:
1. Asking a question – query (maybe imprecise or vague)
2. Constructing an answer – retrieval by the search engine
3. Assessing the answer – relevance feedback (filtering the answers)
· IR – information retrieval: has existed since 1987 and borrowed heavily from computational linguistics.
· The fundamental operation of a search engine is matching the descriptive features supplied by users against the documents in the database. How? By keywords.
IES
1.1.1 Web search engines usually incorporate layers of caching and replication, taking advantage of commonly occurring queries and exploiting parallel processing.
1.1.2 Search application systems must interface directly with the file system layer of the operating system and must be engineered to handle a heavy update load.
1.1.3 Techniques for categorization and filtering have the widest applicability, and the book provides an introduction to these areas.
1.2.1 To perform relevance ranking, the search engine computes a score, sometimes called a retrieval status value, for each document.
1.2.2 A particular document might be an e-mail message, a Web page, a news article, or even a video.
1.2.3 Response time, the interval between issuing a query and receiving the results, is the most visible aspect of efficiency experienced by a user.
MIR
· Humans have organized information for later retrieval and searching by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and books.
· IR has gained a place with other technologies at the center of the stage.
· How the Web Changed Search
1. The characteristics of the document collection itself:
1) Composed of pages distributed over millions of sites and connected through hyperlinks.
2) Crawling: a new phase introduced by the Web.
2. The size of the collection.
3. In a very large collection, predicting relevance is much harder than before.
Friday, January 10, 2014
Notes - Week 2
Week 2
IIR
· Chapter 1 Section 1.2
Inverted index – build the index in advance to speed up search.
1. Collect the documents to be indexed;
2. Tokenize the text, turning each document into a list of tokens;
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
The inverted index has two parts, a dictionary and postings, as the figure below shows. The dictionary is kept in memory, with a pointer to each postings list (much larger), which is stored on disk.
Core indexing steps (see figure 1.4):
1. Sort the list so that the terms are alphabetical;
2. Multiple occurrences of the same term from the same document are then merged;
3. Instances of the same term are then grouped, and the result is split into a dictionary and postings.
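These steps can be sketched on a toy version of IIR's figure 1.4 example (the term-docID pairs below are abridged from the book's two Julius Caesar documents; tokenization is assumed already done):

```python
# Term-docID pairs as produced by tokenization; note the duplicate ("i", 1).
pairs = [("i", 1), ("did", 1), ("enact", 1), ("caesar", 1), ("i", 1),
         ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("caesar", 2)]

pairs.sort()                       # step 1: alphabetical by term, then docID
index = {}
for term, doc_id in pairs:         # steps 2-3: merge duplicates from the same
    postings = index.setdefault(term, [])   # document, group into postings
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)

print(index["caesar"])  # [1, 2]
print(index["i"])       # [1]  (the duplicate occurrence was merged)
```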
New terms:
token; docID (integers); document frequency – the number of documents that contain each term
· Chapter 2
1. Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.
2. Document delineation and character sequence decoding (how the basic unit of a document can be defined and how the character sequence that it comprises is determined):
1) Obtaining the character sequence in a document
2) Choosing a document unit
3. Determining the vocabulary of terms:
1) Tokenization
2) Dropping common terms: stop words
3) Normalization (equivalence classing of terms)
4) Stemming and lemmatization
4. Faster postings list intersection via skip pointers
5. Positional postings and phrase queries
New terms:
DOCUMENT UNIT; INDEXING GRANULARITY; STOP WORDS
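The skip-pointer intersection from item 4 above can be sketched as follows (using the sqrt-of-length skip spacing IIR suggests as a heuristic; skips are simulated here by index jumps rather than stored pointers):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, skipping ahead in strides of
    about sqrt(len(list)) when the skipped-to entry does not overshoot."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skips on p1 while they don't overshoot p2[j]
            while i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            if p1[i] < p2[j]:
                i += 1
        else:
            while j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            if p2[j] < p1[i]:
                j += 1
    return answer
```

Skips help most when one list is much longer than the other, letting the walk leap over runs of docIDs that cannot be in the intersection.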
· Chapter 3
1. I found lots of data-structure-related material in this chapter, such as binary trees and B-trees. In Chapters 1 and 2 the author developed the ideas underlying inverted indexes for handling Boolean and proximity queries. In this chapter the author develops techniques that are robust to typographical errors in the query, as well as alternative spellings.
2. In 3.1, data structures are developed to help the search for terms in the vocabulary in an inverted index. In 3.2, the author studies the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* would seek documents containing any of the terms automatic, automation, and automated.
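A permuterm index handles single-* wildcards via rotations of each term with a '$' end marker. A sketch of generating the index keys for one term (helper name is mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation becomes a permuterm
    index key pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```

To answer a query like m*n, append '$' and rotate until the * is at the end (m*n$ becomes n$m*), then do an ordinary prefix lookup on the rotation keys. The price is an index several times larger, since every term contributes one rotation per character.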
New terms:
WILDCARD QUERY; PERMUTERM INDEX; k-GRAM INDEX