Corpus i eines informàtiques

Universitat Pompeu Fabra


Programs and Tools for Corpus Analysis

Màster Lingüística Teòrica i Aplicada

Academic year 2011-2012

Núria Bel

15/09/2011

Course guide, academic year 2011-2012

1 Introduction

The course "Corpus i Eines Informàtiques" is part of the "Màster en Lingüística Teòrica i Aplicada" of the Departament de Traducció i Ciències del Llenguatge and the Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra.

The course is about the methodology for carrying out empirical, corpus-based research in linguistics and applied linguistics. In particular, it is about the use of software programs as basic tools to handle large quantities of data or to find particular data exhaustively.

The course aims to give students the basis for becoming autonomous in the use of current and future tools for handling linguistic data.

2 Objective of the course

The objective of the course is to acquaint students with the rationale behind the programs and tools that assist researchers in handling linguistic data taken from texts, that is, from the object of study known as a corpus.

3 Competences

Following the guidelines supplied by the Bologna Framework, the course is defined in terms of the competences that students must acquire by participating in it.

Generic competences

• Analysis and synthesis: ask good questions, work out what to do to get an answer ...

• Critical reasoning: critical thinking involves determining the meaning and significance of what is observed or expressed or, for a given inference or argument, determining whether there is adequate justification to accept the conclusion as true.

• Autonomy in searching for and finding information and sources of information.

• Use of computing tools for their own needs.

• Learning autonomy: create their own routines for learning.

Specific competences

- Use tools to compile a corpus for specific purposes (your research?)

- Apply state-of-the-art criteria for corpus design to guide the use of the tool

- Know the basics of representativeness, significance, sources of texts and corpus annotation, including the relevant tools

- Define needs and find tools (and information sources) and functionalities to solve them.

- Know the basics of the different tool functionalities so as to know when and why to use them

- Find, install and use tools that perform typical Corpus Linguistics functions, including full understanding of pattern matching with Regular Expressions.

- Be familiar with the terminology of NLP and Text Processing.

4 Syllabus

• Section 1

- What is a corpus? Why use computers?

- Tools for basic functions. Keyword in Context (KWIC) and concordances. Counting frequencies of words. Significance of frequency related to contexts. Counting frequencies of sequences of words. Counting frequencies of sequences of words that are specially related, i.e. collocations. Assessing the strength of this relation, i.e. Mutual Information. Pattern search and Regular Expressions.

• Section 2

- Representativeness, balance and sampling. Reference corpora. Best-known reference corpora and other sources of texts.

- Types of corpora. General corpora. Specialized corpora. Written corpora. Spoken corpora. Synchronic corpora. Diachronic corpora. Learner corpora. Monitor corpora. Copyright and other legal issues.

• Section 3

- Corpus mark-up. From character encoding to corpus mark-up languages. Metadata for describing corpora.

- Corpus annotation. Levels of Linguistic Annotation. Tools for the annotation of corpora.

• Section 4

- Parallel corpora and specific tools. Tools for finding parallel texts. Alignment of parallel texts. Exploitation of parallel corpora.
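As a rough illustration of the basic functions listed under Section 1 (pattern search, word frequencies, KWIC concordances and Mutual Information), the following Python sketch may help. It is not one of the course tools: the function names, the toy text and the simple `\w+` tokenization are assumptions made only for this example, and the association measure is a simplified pointwise mutual information computed over adjacent word pairs rather than over a window of words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and extract word tokens with a regular expression."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2).

    Simplification (assumption): only directly adjacent words count as a pair.
    """
    n = len(tokens)
    if n < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (n - 1)
    if p_pair == 0:
        return float("-inf")  # the pair never occurs in the text
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))


# Toy text, invented for this example.
text = "Strong tea is nice. She likes strong tea. He drinks strong coffee and weak tea."
tokens = tokenize(text)

print(Counter(tokens).most_common(3))      # word frequencies
for line in kwic(tokens, "tea"):           # concordance lines for "tea"
    print(line)
print(round(pmi(tokens, "strong", "tea"), 2))
```

On a real corpus the same counts are obtained with the tools introduced in class; the sketch only shows what those tools compute.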

5 Methodology

The main characteristics of the course are the following. The course is based mainly on practice and exercises, so that students acquire the competences listed in section 3 of this document. Competences are understood as knowledge, understanding, attitudes and skills. Since a competence is defined as a learned ability to perform a task adequately, and encompasses professional knowledge, skills and attitudes together with personality traits and abilities, we want students to feel, by the end of the course, that they can successfully perform specific corpus-based tasks.

The course will be organized in two main blocks of roughly 5 weeks each. In the first half we will follow two selected studies (and their publications) as a quick and dirty introduction to prototypical tools, retracing experiments and studies carried out by others. Students will thus work on exercises that follow the contents introduced in class.

In the second half of the course, students are asked to define an experiment that involves the definition and creation of a corpus and its exploitation by means of the tools they tried in the first part of the course. So that the project can be evaluated, students are asked to write a paper about it. The paper will be peer-reviewed by other students of the course (with the help of guidelines based on the current peer-review process of leading conferences). Class time will be devoted to the introduction of contents about tools. Seminar time will be devoted to discussions and exercises.

6 Activity Plan (2011-2012)

Session	Topics	Seminar	Readings
26-09-2011	Course introduction, objectives and practical issues
3-10-2011	Corpus and tools, an introduction (1)	Assignment 1. Observing linguistic data. What are the data observed? List them.	Oncin, 2009
10-10-2011	Corpus Analysis Tools (2)	Discussion of Assignment 1. Reading 2	Biber & Conrad
17-10-2011	Statistical measures for corpus analysis. Dispersion, distribution, MI, collocations (3)	Presentation Hèctor Martínez
24-10-2011	Exploitation tools: pattern search and Regular Expressions. (4)	Assignment 2. Measuring significance	Cantos, 2002
31-10-2011	HOLIDAY
7-11-2011	Corpus Design (5)	Discussion of Assignment 2. My question? My corpus?	Day, 1997
14-11-2011	Metadata and markup languages. Levels of linguistic annotation (5-6)	My experiment
21-11-2011	Levels of linguistic annotation	My paper
28-11-2011	Parallel corpora (7)	My evaluation

7 Evaluation

Evaluation will be based on evidence of the acquisition of the competences listed in section 3, and the final mark will be calculated with the following weights:

- Class participation: 5%

- Homework assignments: 50%

- Final project, the paper: 40%

- Participation in the evaluation process: 5%

Homework assignments will be evaluated only on the following scale. Most students are expected to be graded 2; a 3 is reserved for those who contribute particularly original ideas or especially generous work.

- 1 - Below average

- 2 - Average

- 3 - Excellent

8 Recommended readings and further resources

Section 1

Definitions of Corpus

John Sinclair (2005) "Corpus and Text - Basic Principles" in Martin Wynne (ed.) "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]

Tools

John Sinclair (1982). Reflections on computer corpora in English language research. In Computer corpora in English language research, ed. Stig Johansson: 1-6. Bergen.

Christopher D. Manning, Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, Mass. : MIT Press, Chapter 1.4. "Dirty Hands".

Church, K. and P. Hanks (1989). Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, pp. 76-83.

Daniel Jurafsky and James H. Martin. 2000 (or newer). Speech and Language Processing. Prentice Hall. Chapter 2.1. Regular Expressions.

Jorge Vivaldi (2009). "Catálogo de herramientas informáticas relacionadas con la creación, gestión y explotación de corpus textuales" in Tradumática (7). Bellaterra: Universitat Autònoma de Barcelona, Departament de Traducció i d'Interpretació. ISSN 1578-7559
http://webs2002.uab.es/tradumatica/revista/num7/articles/10/10art.htm

Oakes, Michael P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.

Section 2

Representativeness.

D. Biber. 1993. "Representativeness in corpus design". Literary and Linguistic Computing 8/3: 243-257. [A PDF copy can be found online, and an extract of the paper is legally reprinted in McEnery, Xiao and Tono (2006): Corpus-Based Language Studies. Routledge]

William H. Fletcher (2010) Corpus Analysis of the World Wide Web, to appear in Chapelle, Carol A, (Ed.). (2011). Encyclopedia of Applied Linguistics. Wiley-Blackwell. http://www.encyclopediaofappliedlinguistics.com/ and available at:

http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

Adam Kilgarriff. Googleology is bad science. Computational Linguistics. 33, 1 (Mar. 2007), 147-151. DOI= http://dx.doi.org/10.1162/coli.2007.33.1.147

Some reference corpora

[BNC] British National Corpus: http://www.natcorp.ox.ac.uk/

[CORDE] Corpus Diacrónico del Español: http://corpus.rae.es/cordenet.html

[CREA] Corpus de Referencia del Español Actual: http://corpus.rae.es/creanet.html

[ICE] International Corpus of English: http://www.ucl.ac.uk/english-usage/ice/

[CTILC] Corpus Textual Informatitzat de la Llengua Catalana: http://ctilc.iec.cat/

[Europarl] http://www.statmt.org/europarl/

Types of corpora

S. Atkins, J. Clear and N. Ostler (1991) "Corpus design criteria". http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf [last accessed September 2010]. Published in 1992 in Literary and Linguistic Computing 7/1: 1-16.

Section 3

Jorge Vivaldi Palatresi (2009). "Corpus and exploitation tool: IULACT and bwanaNet" in Cantos Gómez, Pascual; Sánchez Pérez, Aquilino (ed.) A survey on corpus-based research = Panorama de investigaciones basadas en corpus [Actas del I Congreso Internacional de Lingüística de Corpus (CICL-09), 7-9 Mayo 2009, Universidad de Murcia]. Murcia: Asociación Española de Lingüística del Corpus. pp. 224-239. ISBN 978-84-692-2198-3

Tony McEnery, Richard Xiao and Yukio Tono (2006): Corpus-Based Language Studies. Routledge. Unit A3 and A4.

Lou Burnard (2005) "Metadata for Corpus Work" in Martin Wynne (ed.) "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]

Section 4

Gómez Guinovart, Xavier and Alberto Simões (2009): Parallel corpus-based bilingual terminology extraction. In Proceedings of the 8th International Conference on Terminology and Artificial Intelligence, IRIT (Institut de recherche en Informatique de Toulouse), Université Paul Sabatier, Toulouse. http://webs.uvigo.es/sli/arquivos/TIA09.pdf

Bowker, Lynne; Pearson, Jennifer (2002). Working with Specialized Language: A Practical Guide to Using Corpora. London; New York: Routledge.

----------------------------

Some useful addresses when working on your project

http://www.kwicfinder.com/KWiCFinder.html (Concordancer to be installed)

http://www.kwicfinder.com/kfNgram/kfNgramHelp.html (for n-grams)

http://webascorpus.org/searchwac.html (Web-based concordancer as a web application)

http://ucrel.lancs.ac.uk/wmatrix/ Wmatrix is a software tool for corpus analysis and comparison.

We will mainly use Sketch Engine. Students will receive a username and password.

http://www.sketchengine.co.uk/

http://trac.sketchengine.co.uk/wiki/SkE/DocsIndex