Athabasca University

Section 5: Information retrieval and information extraction

Commentary

Section Goals

  • To introduce information retrieval and information extraction, two active research and application fields in natural language text processing that have attracted a great deal of attention through popular applications such as Web search engines (e.g., Google and Yahoo) and Web mining.

Learning Objectives

Learning Objective 1

  • Outline the information retrieval (IR) and information extraction (IE) tasks, and compare the differences between them.
  • Explain the principles of the language modeling approach to IR, which is based on the "bag of words" and naive Bayes models, and how it relates to the Boolean keyword model.
  • Describe the issues involved in evaluating IR systems, including the precision, recall, and ROC curve measures.
  • Summarize how NLP can help refine IR systems, and how to present result sets by clustering and relevance feedback.
  • Describe the data structures and components used in real IR systems, such as the inverted index and the vector space model.
  • Describe the main steps and techniques involved in information extraction systems.
  • Explain the following concepts or terms:
    • Information retrieval
    • Boolean keyword model
    • Bag of words
    • Precision
    • Recall
    • ROC curve
    • Relevance feedback
    • Document classification
    • Document clustering
    • Agglomerative clustering
    • K-means clustering
    • Vector space model
    • Inverted index
    • Stop words
    • TREC
    • Information extraction
    • Tokenization
    • Cascaded finite-state transducer
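
Several of the terms above (inverted index, stop words, Boolean keyword model, precision, recall) can be seen working together in a few lines of code. The following is a minimal sketch only, not a method taken from the readings: the three-document corpus, the stop-word list, the relevance judgments, and all function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text (invented for illustration).
DOCS = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}

# A tiny hand-picked stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "on", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Boolean keyword model: return documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def precision_recall(retrieved, relevant):
    """Precision is hits / retrieved; recall is hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


index = build_inverted_index(DOCS)
retrieved = boolean_and(index, "the cat")
relevant = {1, 2, 3}  # assumed relevance judgments for this query
precision, recall = precision_recall(retrieved, relevant)
print(retrieved, precision, recall)
```

In this sketch the query matches documents 1 and 2 but misses document 3, which contains only "cats". Every retrieved document is relevant, so precision is high, while recall suffers because the tokenizer does no stemming. This is one small example of how NLP techniques might help refine an IR system.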

Objective Readings

Required readings:

Reading topics:

Information Retrieval, Information Extraction (see Sections 22.3 and 22.4 of AIMA3ed)

--. (2006). Special issue on Web information retrieval. Journal of Information Retrieval, 9(2). Springer (ISSN: 1573-7659)

Objective Questions

  • What are the widely adopted main data structures and models in practical IR systems?
  • What are the main techniques involved in information extraction?

Objective Activities

  • Explore recent methods for IR systems that go beyond vector space models and inverted indexes. Report your findings and discuss them in the course conference.
  • Explore recent, well-known open source tools for IR and IE, such as Lucene and GATE, and test them on your computer. Try to implement prototype systems based on them for testing.

Updated November 17 2015 by FST Course Production Staff