Corpus i Eines Informàtiques (Programs and Tools for Corpus Analysis)
8037-31373-T1-Corpus i Eines Informàtiques (Programs and Tools for Corpus Analysis)
Programme: Màster en Lingüística Teòrica i Aplicada
1. Introduction
The course "Corpus i Eines Informàtiques" is part of the "Màster en Lingüística Teòrica i Aplicada" of the Departament de Traducció i Ciències del Llenguatge and the Institut Universitari de Lingüística Aplicada of the Universitat Pompeu Fabra. The course covers the methodology for carrying out empirical, corpus-based research in linguistics and applied linguistics. It focuses in particular on the use of specific software as basic tools, both for handling large quantities of data and for analyzing particular data exhaustively. The aim of the course is to provide the fundamentals that allow students to work autonomously with both current and future tools for handling and exploiting linguistic data.
The students will acquire the following competences:
Generic competences:
Specific competences:
3. Content
4. Evaluation
Evaluation will be based on evidence that the student has acquired the competences mentioned in section 3, and the final mark will be calculated from the following weightings:
In case of failure after the main evaluation, the student must submit a (revised) final project within two months.
5. Methodology
The main characteristics of the course are the following:
The course is based mainly on practical exercises through which the student acquires the competences listed in section 3 of this document. Since a competence is defined as a learned ability to perform a task adequately, and encompasses knowledge, skills and attitudes, the goal of this course is for students to be able to perform specific corpus-based tasks successfully using processing tools.
Class time will be devoted to introducing content about these tools. Seminar time will be devoted to discussions and exercises.
The course will be organized in two main blocks of roughly 5 weeks each. In the first block, we will follow two selected studies (and their publications), which serve as a quick introduction to prototypical tools by way of experiments and studies carried out by others. The students will work on exercises based on the content introduced in class. The second block requires each student to define an experiment that involves defining and creating a corpus and exploiting it with the tools learned in the first block.
To evaluate the project, each student will be required to write a paper about it. The paper will be peer-reviewed by other students of the course (using guidelines based on the peer review processes currently used by leading conferences).
6. Recommended readings and further resources
Section 1
Definitions of Corpus:
John Sinclair (2005). "Corpus and Text - Basic Principles". In Martin Wynne (ed.), "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]
Tools:
John Sinclair (1982). Reflections on computer corpora in English language research. In Stig Johansson (ed.), Computer corpora in English language research: 1-6. Bergen.
Christopher D. Manning and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press. Chapter 1.4, "Dirty Hands".
Church, K. and P. Hanks (1989). Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, pp. 76-83.
Daniel Jurafsky and James H. Martin (2000). Speech and Language Processing. Prentice Hall. Chapter 2.1, "Regular Expressions".
Jorge Vivaldi (2009). "Catálogo de herramientas informáticas relacionadas con la creación, gestión y explotación de corpus textuales". In Tradumàtica (7). Bellaterra: Universitat Autònoma de Barcelona, Departament de Traducció i d'Interpretació. ISSN 1578-7559.
Oakes, Michael P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Section 2
Representativeness:
D. Biber (1993). "Representativeness in corpus design". Literary and Linguistic Computing 8/3: 243-257. [A PDF copy can be found on the internet, and an extract of the paper is legally reprinted in McEnery, Xiao and Tono (2006): Corpus-Based Language Studies. Routledge]
William H. Fletcher (2010). Corpus Analysis of the World Wide Web. To appear in Chapelle, Carol A. (ed.) (2011), Encyclopedia of Applied Linguistics. Wiley-Blackwell. http://www.encyclopediaofappliedlinguistics.com/ Available at: http://webascorpus.org/Corpus_Analysis_of_the_World_Wide_Web.pdf
Adam Kilgarriff (2007). Googleology is bad science. Computational Linguistics 33, 1 (Mar. 2007): 147-151. DOI: http://dx.doi.org/10.1162/coli.2007.33.1.147
Some reference corpora:
[BNC] British National Corpus: http://www.natcorp.ox.ac.uk/
[CORDE] Corpus Diacrónico del Español: http://corpus.rae.es/cordenet.html
[ICE] International Corpus of English: http://www.ucl.ac.uk/english-usage/ice/
[CTILC] Corpus Textual Informatitzat de la Llengua Catalana: http://ctilc.iec.cat/
[Europarl] http://www.statmt.org/europarl/
Types of corpora:
S. Atkins, J. Clear and N. Ostler (1991). "Corpus design criteria". http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf [last accessed September 2010]. Published in 1992 in Literary and Linguistic Computing 7/1: 1-16.
Section 3
Jorge Vivaldi Palatresi (2009). "Corpus and exploitation tool: IULACT and bwanaNet". In Cantos Gómez, Pascual; Sánchez Pérez, Aquilino (eds.), A survey on corpus-based research = Panorama de investigaciones basadas en corpus [Actas del I Congreso Internacional de Lingüística de Corpus (CICL-09), 7-9 Mayo 2009, Universidad de Murcia]. Murcia: Asociación Española de Lingüística del Corpus, pp. 224-239. ISBN 978-84-692-2198-3.
Tony McEnery, Richard Xiao and Yukio Tono (2006). Corpus-Based Language Studies. Routledge. Units A3 and A4.
Lou Burnard (2005). "Metadata for Corpus Work". In Martin Wynne (ed.), "Developing Linguistic Corpora: a Guide to Good Practice". Oxford: Oxbow Books. [Available online from http://ahds.ac.uk/linguistic-corpora/, last accessed September 2010]
Section 4
Gómez Guinovart, Xavier and Alberto Simões (2009). Parallel corpus-based bilingual terminology extraction. In Proceedings of the 8th International Conference on Terminology and Artificial Intelligence, IRIT (Institut de recherche en Informatique de Toulouse), Université Paul Sabatier, Toulouse. http://webs.uvigo.es/sli/arquivos/TIA09.pdf
Bowker, Lynne; Pearson, Jennifer (2002). Working with Specialized Language: A Practical Guide to Using Corpora. London; New York: Routledge.
We will mainly use Sketch Engine. Students will receive a username and password.