Section 5: Information retrieval and information extraction
Commentary
Section Goals
- To introduce two active fields of research and application in natural language text processing, information retrieval and information extraction, both of which have attracted considerable attention through popular applications such as Web search engines (e.g., Google and Yahoo) and Web mining.
Learning Objectives
Learning Objective 1
- Outline the information retrieval (IR) and information extraction (IE) tasks, and compare the differences between them.
- Explain the principles of the language modeling approach to IR, based on the "bag of words" representation and naive Bayes models, and how it developed from the Boolean keyword model.
- Describe the issues involved in evaluating IR systems, including measures such as precision, recall, and the ROC curve.
- Summarize how NLP can help refine IR systems, and how to present result sets by clustering and relevance feedback.
- Describe the data structures and components of real IR systems, such as the inverted index and the vector space model.
- Describe the main steps and techniques involved in information extraction systems.
- Explain the following concepts or terms:
  - Information retrieval
  - Boolean keyword model
  - Bag of words
  - Precision
  - Recall
  - ROC curve
  - Relevance feedback
  - Document classification
  - Document clustering
  - Agglomerative clustering
  - K-means clustering
  - Vector space model
  - Inverted index
  - Stop words
  - TREC
  - Information extraction
  - Tokenization
  - Cascaded finite-state transducer
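As a concrete illustration of two of the evaluation terms listed above, the following minimal sketch computes precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved) for a single query. The function name `precision_recall` and the document IDs are hypothetical and chosen only for illustration; they do not come from the course readings.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Here two of the four retrieved documents are relevant (precision 0.50), and two of the three relevant documents were found (recall 0.67). Returning more documents tends to raise recall while lowering precision.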
Objective Readings
Required readings:
Reading topics:
Information Retrieval, Information Extraction (see Sections 22.3 and 22.4 of AIMA3ed)
--. (2006). Special issue on Web information retrieval. Journal of Information Retrieval, 9(2). Springer (ISSN: 1573-7659)
Objective Questions
- What are the widely adopted main data structures and models in practical IR systems?
- What are the main techniques involved in information extraction?
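When working on the first question, it may help to experiment with one of the data structures named in the learning objectives. The sketch below builds a small inverted index (a map from each term to the set of documents containing it) and answers a Boolean AND keyword query. It is a simplified illustration under stated assumptions, not how any particular IR system is implemented: the stop-word list, the regular-expression tokenizer, the sample documents, and the names `tokenize`, `build_inverted_index`, and `boolean_and` are all hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


# Hypothetical three-document collection.
docs = {
    1: "Information retrieval finds relevant documents",
    2: "Information extraction fills templates from text",
    3: "The inverted index speeds up retrieval",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, "information retrieval")))  # [1]
```

Only document 1 contains both query terms, so it is the only result. Because lookup goes directly from a term to its posting set, the query never scans the full text of the collection.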
Objective Activities
- Explore recent methods for IR systems that go beyond vector space models and inverted indexes. Report your findings and discuss them in the course conference.
- Explore recent, well-known open-source tools for IR and IE, such as Lucene and GATE, and test them on your computer. Try to build prototype systems on top of them and test those prototypes.