Section 4: Probabilistic language models and machine translation
Commentary
Section Goals
- To introduce probabilistic language models for natural language processing.
- To introduce machine translation, a typical application of NLP, and statistical machine translation methods.
Learning Objectives
Learning Objective 1
- Outline the representation, smoothing, and evaluation of probabilistic language models.
- Explain the methods of learning probabilities for probabilistic context free grammars (PCFGs), and give a small example of their application.
- Describe a general schematic diagram of machine translation systems, and name the tasks for which machine translation can be useful.
- Explain the principle of statistical machine translation, and how to learn probabilities for machine translation.
- Explain the following concepts or terms:
- Corpus-based approach
- Probabilistic language model
- N-gram models
- Unigram, bigram, and trigram models
- Add-one smoothing
- Linear interpolation smoothing
- Probabilistic context free grammar (PCFG)
- Lexicalized PCFG
- Memory-based, interlingua-based, and transfer-based machine translation systems
- Language model
- Translation model
- Sentence alignment
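To make the n-gram and smoothing terms above concrete, the following is a minimal sketch of a bigram model estimated from a corpus, with add-one smoothing and linear interpolation smoothing. The toy corpus, the function names, and the interpolation weight `lam` are assumptions chosen for illustration; they do not come from the textbook.

```python
from collections import Counter

# Hypothetical toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the cat sat </s>",
    "<s> the dog sat </s>",
    "<s> the cat ran </s>",
]
sentences = [line.split() for line in corpus]

# Corpus-based counts for unigrams and bigrams.
unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter((a, b) for sent in sentences for a, b in zip(sent, sent[1:]))
vocab_size = len(unigrams)          # boundary markers are counted as words here
total_tokens = sum(unigrams.values())


def p_unigram(w):
    """Unsmoothed unigram estimate P(w)."""
    return unigrams[w] / total_tokens


def p_bigram_mle(w, prev):
    """Unsmoothed bigram estimate P(w | prev); zero for unseen bigrams."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0


def p_bigram_add_one(w, prev):
    """Add-one smoothing: pretend every possible bigram was seen once more."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)


def p_interpolated(w, prev, lam=0.7):
    """Linear interpolation of the bigram and unigram estimates.

    lam is an assumed weight; in practice it would be tuned on held-out data.
    """
    return lam * p_bigram_mle(w, prev) + (1 - lam) * p_unigram(w)


# "the ran" never occurs in the corpus, so the unsmoothed estimate is zero,
# while both smoothed estimates assign it some probability.
print(p_bigram_mle("ran", "the"))
print(p_bigram_add_one("ran", "the"))
print(p_interpolated("ran", "the"))
```

The unsmoothed estimate gives unseen bigrams probability zero, which is the problem both smoothing methods address: add-one smoothing shifts probability mass uniformly, while interpolation backs off toward the unigram distribution.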
Objective Readings
Required readings:
Reading topics:
Probabilistic Language Models, Machine Translation (see Sections 22.1 and 23.4 of AIMA3ed)
Supplemental Readings
Krieger, H.-U. (2007). From UBGs to CFGs: A practical corpus-driven approach. Natural Language Engineering, 13(4), 317-351.
Basili, R., Pazienza, M. T., and Velardi, P. (1996). An empirical symbolic approach to natural language processing. Artificial Intelligence, 85(1-2), 59-99.
Manning, C. D., and Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Objective Questions
- What are the desirable properties and difficulties of rule-based methods for uncertain reasoning?
- How can we learn different probabilistic models (such as a language model, or a fertility model) for statistical machine translation from a bilingual corpus?
- How can the EM algorithm be used to improve the estimated probabilistic models?
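As a starting point for thinking about the last two questions, the following is a minimal sketch of EM applied to a word-level translation model in the style of IBM Model 1. This is a deliberately simpler model than one with fertility or distortion, and it omits the usual NULL word. The toy English-German corpus and the function name `train_ibm_model1` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy bilingual corpus: (English words, German words) per sentence pair.
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]


def train_ibm_model1(pairs, iterations=20):
    """Estimate t[(f, e)] = P(f | e) with EM, without a NULL word."""
    f_vocab = {f for _, fs in pairs for f in fs}

    # Uniform initialization over word pairs that co-occur in some sentence pair.
    t = {}
    for es, fs in pairs:
        for e in es:
            for f in fs:
                t[(f, e)] = 1.0 / len(f_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times e translates to f
        total = defaultdict(float)  # expected number of times e translates to anything

        # E-step: distribute each foreign word over the English words in its
        # sentence pair, in proportion to the current translation probabilities.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: re-estimate P(f | e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]

    return t


t = train_ibm_model1(pairs)
for (f, e), p in sorted(t.items(), key=lambda item: -item[1]):
    print(f"P({f} | {e}) = {p:.3f}")
```

No word alignments are given in the corpus; EM starts from uniform probabilities and sharpens them because "das" co-occurs with "the" in two sentence pairs while "haus" co-occurs with it in only one.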
Objective Activities
- Explore research work or systems incorporating both probabilistic and logic representation and reasoning in language processing. Report your findings in the course conference.
- Test some machine translation systems, such as Google Translate, to see how satisfactory a general translation system is at this time. To test, you can translate English into another language you know (e.g., French), and vice versa.
- Explore the following probabilistic (statistical) language processing algorithms that are related to this section of the textbook.
- Complete Exercises 22.1 and 23.10 of AIMA3ed.