Corpus i Eines Informàtiques (Programs and Tools for Corpus Analysis)

8037-31373-T1-Corpus i Eines Informàtiques (Programs and Tools for Corpus Analysis)

Programme: Màster en Lingüística Teòrica i Aplicada
Academic year: 2013-2014
Term: 1
ECTS credits: 5
Student workload (hours): 100
Course type: Compulsory
Lecturer(s): Núria Bel
Language of instruction: English

1. Introduction

The course "Corpus i Eines Informàtiques" is part of the "Màster en Lingüística Teòrica i Aplicada" offered by the Departament de Traducció i Ciències del Llenguatge and the Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra.

The course covers the methodology for carrying out empirical, corpus-based research in linguistics and applied linguistics. It focuses in particular on the use of specific software as basic tools both for handling large quantities of data and for exhaustively analyzing particular data.

The aim of the course is to provide the fundamentals that enable students to use both current and future tools autonomously to handle and exploit linguistic data.

2. Competences to be acquired

The students will acquire the following competences:

Generic competences:

- Analysis and synthesis: The students will acquire the skills needed to ask thoughtful questions, as well as knowledge of the steps required to obtain a well-founded answer.
- Critical reasoning: The students will develop the skills to determine the meaning and significance of what is observed or expressed and, for a given inference or argument, to determine whether there is adequate justification to accept a conclusion as true.
- Autonomy: The students will develop the skills to create their own learning routines, as well as to search for and find both information and sources of information.
- Use of computing tools: The students will be able to use computing tools suited to their needs.

Specific competences:

- Application of state-of-the-art criteria for corpus design and use of tools to compile a corpus for specific purposes
- Use of concepts, such as representativeness and significance, for empirical research in linguistics
- Definition of the requirements needed to find tools (and information sources) and of the functionalities needed to use them
- Discovery, installation and use of tools that perform typical Corpus Linguistics functions, including the understanding of pattern matching with Regular Expressions and corpus annotation tools
- Familiarity with the terminology of NLP and Text Processing

3. Content

Section 1
- What is a corpus? Why use computers?
- Tools for basic functions. Keyword in Context (KWIC) and concordances. Counting word frequencies. Significance of frequency in relation to context. Counting frequencies of word sequences. Counting frequencies of word sequences that are specially related, i.e. collocations. Assessing the strength of a relation with Mutual Information. Pattern search and Regular Expressions.
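
As an illustration of the basic functions listed above, the following is a minimal Python sketch of tokenization with a regular expression, a KWIC concordance, frequency counts and mutual information for adjacent word pairs. It is not one of the course tools: the toy text and the function names (tokenize, kwic, pmi) are invented for this example, and mutual information is computed here as pointwise mutual information with log base 2 over raw relative frequencies, which is only one of several possible formulations.

```python
import math
import re
from collections import Counter

# Toy text invented for this illustration; it is not course material.
TEXT = (
    "The strong tea was hot. She likes strong tea. "
    "He drinks strong coffee and strong tea."
)


def tokenize(text):
    """Lowercase the text and extract word tokens with a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information (log base 2) of the adjacent pair w1 w2.

    Assumes both words occur in the token list. Returns -inf when the
    pair itself never occurs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return float("-inf")
    n = len(tokens)
    p_pair = pair_count / (n - 1)  # there are n - 1 adjacent pairs
    p1 = unigrams[w1] / n
    p2 = unigrams[w2] / n
    return math.log2(p_pair / (p1 * p2))


if __name__ == "__main__":
    tokens = tokenize(TEXT)
    # Word frequencies
    print(Counter(tokens).most_common(3))
    # Frequencies of two-word sequences
    print(Counter(zip(tokens, tokens[1:])).most_common(3))
    # Concordance lines for one keyword
    for left, key, right in kwic(tokens, "tea"):
        print(f"{left:>15} [{key}] {right}")
    # Strength of association for one candidate collocation
    print(pmi(tokens, "strong", "tea"))
```

On a text this small the scores are not meaningful; the sketch only shows how the counts feed into the measure.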

Section 2
- Representativeness, balance and sampling. Reference corpora. The best-known reference corpora and other sources of texts.
- Types of corpora. General corpora. Specialized corpora. Written corpora. Spoken corpora. Synchronic corpora. Diachronic corpora. Learner corpora. Monitor corpora. Copyright and other legal issues.

Section 3
- Corpus mark-up. From character encoding to corpus mark-up languages. Metadata for describing corpora.
- Corpus annotation. Levels of Linguistic Annotation. Tools for the annotation of corpora.

Section 4
- Parallel corpora and specific tools. Tools for finding parallel texts. Alignment of parallel texts. Exploitation of parallel corpora.

4. Evaluation

Evaluation will be based on evidence of the acquisition of the competences listed in section 2, and the final mark will be calculated using the following weights:

Class participation: 5%
Homework assignments: 50%
Final project (the paper): 40%
Participation in the final project peer-evaluation process: 5%

If a student fails the main evaluation, they must submit a (revised) final project within two months.

5. Methodology

The course is based mainly on practical exercises through which the students acquire the competences listed in section 2 of this document. Since a competence is defined as a learned ability to perform a task adequately, encompassing knowledge, skills and attitudes, the goal of the course is for the students to be able to perform specific corpus-based tasks successfully using processing tools. Class time will be devoted to introducing content about these tools. Seminar time will be devoted to discussions and exercises.

The course will be organized in two main blocks of roughly five weeks each. In the first half, we will follow two selected studies (and their publications), which serve as a quick introduction to prototypical tools by way of experiments and studies carried out by others. The students will work on exercises that follow the content introduced in class.

In the second half, the students are required to define an experiment that involves designing and creating a corpus and exploiting it with the tools they learned about in the first half of the course. To evaluate the project, each student will be required to write a paper about it. The paper will be peer-reviewed by other students on the course, using guidelines based on the peer-review processes currently used by leading conferences.

6. Recommended readings and further resources

Section 1

Definitions of Corpus:

John Sinclair (2005) "Corpus and Text - Basic Principles" in Martin Wynne (ed.) "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]

Tools:

John Sinclair. 1982. Reflections on computer corpora in English language research. In Computer corpora in English language research, ed. Stig Johansson: 1-6. Bergen.

Christopher D. Manning, Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, Mass.: MIT Press, Chapter 1.4. "Dirty Hands".

Church, K. and P. Hanks. 1989. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, pp. 76-83.

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice Hall. Chapter 2.1. Regular Expressions.

Jorge Vivaldi (2009). "Catálogo de herramientas informáticas relacionadas con la creación, gestión y explotación de corpus textuales" in Tradumática (7). Bellaterra: Universitat Autònoma de Barcelona, Departament de Traducció i d'Interpretació. ISSN 1578-7559. http://webs2002.uab.es/tradumatica/revista/num7/articles/10/10art.htm

Oakes, Michael P. (1998) Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.

Section 2

Representativeness:

D. Biber. 1993. "Representativeness in corpus design". Literary and Linguistic Computing 8/3: 243-257. [A PDF copy can be found on the internet, and an extract of the paper is legally reprinted in McEnery, Xiao and Tono (2006): Corpus-Based Language Studies. Routledge]

William H. Fletcher (2010) Corpus Analysis of the World Wide Web, to appear in Chapelle, Carol A, (Ed.). (2011). Encyclopedia of Applied Linguistics. Wiley-Blackwell. http://www.encyclopediaofappliedlinguistics.com/ and available at: http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf

Adam Kilgarriff. 2007. "Googleology is bad science". Computational Linguistics 33/1: 147-151. DOI: http://dx.doi.org/10.1162/coli.2007.33.1.147

Some reference corpora:

[BNC] British National Corpus: http://www.natcorp.ox.ac.uk/

[CORDE] Corpus Diacrónico del Español: http://corpus.rae.es/cordenet.html

[CREA] Corpus de Referencia del Español Actual: http://corpus.rae.es/creanet.html

[ICE] International Corpus of English: http://www.ucl.ac.uk/english-usage/ice/

[CTILC] Corpus Textual Informatitzat de la Llengua Catalana: http://ctilc.iec.cat/

[Europarl] http://www.statmt.org/europarl/

Types of corpora:

S. Atkins, J. Clear and N. Ostler (1991) "Corpus design criteria". http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf [last accessed September 2010]. Published in 1992 in Literary and Linguistic Computing 7/1: 1-16.

Section 3

Jorge Vivaldi Palatresi (2009). "Corpus and exploitation tool: IULACT and bwanaNet" in Cantos Gómez, Pascual; Sánchez Pérez, Aquilino (eds.) A survey on corpus-based research = Panorama de investigaciones basadas en corpus [Actas del I Congreso Internacional de Lingüística de Corpus (CICL-09), 7-9 Mayo 2009, Universidad de Murcia]. Murcia: Asociación Española de Lingüística del Corpus, pp. 224-239. ISBN 978-84-692-2198-3

Tony McEnery, Richard Xiao and Yukio Tono (2006): Corpus-Based Language Studies. Routledge. Units A3 and A4.

Lou Burnard (2005) "Metadata for Corpus Work" in Martin Wynne (ed.) "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]

Section 4

Gómez Guinovart, Xavier and Alberto Simões (2009): Parallel corpus-based bilingual terminology extraction. In Proceedings of the 8th International Conference on Terminology and Artificial Intelligence, IRIT (Institut de recherche en Informatique de Toulouse), Université Paul Sabatier, Toulouse. http://webs.uvigo.es/sli/arquivos/TIA09.pdf

Bowker, Lynne; Pearson, Jennifer (2002). Working with Specialized Language: A Practical Guide to Using Corpora. London; New York: Routledge.

We will mainly use Sketch Engine. Students will receive a username and password.

http://www.sketchengine.co.uk/

http://trac.sketchengine.co.uk/wiki/SkE/DocsIndex