Academic year 2013-14
Speech Processing
Degree | Code | Type |
---|---|---|
Bachelor's Degree in Computer Science | 21480 | Optional subject |
Bachelor's Degree in Telematics Engineering | 21762 | Optional subject |
Bachelor's Degree in Audiovisual Systems Engineering | 21610 | Compulsory subject, 2nd year |
ECTS credits: | 4 | Workload: | 100 hours | Trimester: | 3rd |
Department: | Dept. of Information and Communication Technologies |
Coordinator: | Emilia Gómez |
Teaching staff: | Emilia Gómez, Mireia Farrús, Martí Umbert |
Language: | Catalan (explanations), English (material) |
Timetable: | |
Building: | Communication campus - Poblenou |
This is an intermediate course on digital sound signal processing, designed for students of Audiovisual Systems Engineering.
The course focuses on the main techniques for the analysis, description, synthesis and processing of voice signals.
The course builds on foundations from previous courses, especially Acoustic Engineering and Signals and Systems (second year, Audiovisual Systems Engineering).
Competences to be worked during the course according to the degree description:
Cross-disciplinary competences | Specific competences |
---|---|
Instrumental: G1. Analysis and synthesis; G2. Organization and planning; G3. Application of knowledge to novel problems and situations; G4. Information retrieval and management; G5. Decision making; Oral and written communication. Interpersonal: G8. Team work; Interdisciplinary contexts. Systemic: G11. Flexibility and creativity to apply knowledge to novel contexts; G12. Ability for self-training. |
Specific from basic training: B4-INF / B4-A. Complex variable functions; B7-INF / B7-A. Fourier transforms and the sampling theorem; B8-INF. Linear and Time-Invariant systems and related functions; B7-T. Probability; B9-A. Sound propagation, acoustics, and digital signal processing. Related to audiovisual systems: AU1. Telecommunication applications and systems; AU3. Sound and image systems and technical requirements; AU4. Audio processing techniques; AU5. Analysis, synthesis, coding and recognition of speech signals; AU6. Sound and music signal processing; AU22. Audio and music coding principles. |
The evaluation is split among the three main activities of the course: theoretical concepts (T), seminars (S) and labs (L), as follows:
Activity | Description | Timing | Recoverable |
---|---|---|---|
Written tests | Final exam (70% of T): covers all the conceptual material of the course, including questions related to the labs. | End of term | Yes |
Written products | Test (30% of T): partial exam of concepts, including questions related to the labs. | Middle of term | No |
Seminar activities (S) | | Along the term | No |
Practical work | Labs (L): submission of lab reports (35% of L), individually or in pairs, and a lab interview with the teacher along the term (5% of L). | Along the term | No |
Minimum requirements:
• T: concept evaluation. A minimum of 5/10 is required to pass the course.
• L: A minimum of 5/10 is required to pass the course.
The final mark is obtained through the following formula:
Final Mark = 0.5*T + 0.4*L + 0.1*S
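As a quick sanity check, the weighting above can be expressed directly (a minimal sketch; the function name is illustrative and the minimum-requirement rule from the previous section is only noted, not enforced):

```python
def final_mark(t, l, s):
    """Weighted final mark on a 0-10 scale, per the syllabus formula.

    t: theoretical concepts (T), l: labs (L), s: seminars (S).
    Note: passing additionally requires T >= 5 and L >= 5.
    """
    return 0.5 * t + 0.4 * l + 0.1 * s

# Example: T=7, L=6, S=9 gives 0.5*7 + 0.4*6 + 0.1*9 = 6.8
```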
This course is intended to provide the foundations of the techniques for voice analysis, recognition and synthesis, with a focus on speech signals.
• Acoustic, physiological and perceptual foundations of the human voice.
• Introduction to digital signal analysis.
• Methods for voice modeling and processing.
• Usage and implementation of voice processing algorithms.
These concepts are structured into the following blocks:
Block 1. Introduction:
• Voice generation/perception chain.
• Sound acoustics.
• Speech processing applications.
Block 2. Foundations:
• Acoustical foundations: voice production mechanisms, speech vs. singing, classification of speech sounds, introduction to phonetics.
• Perceptual basis: pitch, loudness, timbre.
Block 3. Short-time analysis of voice signals.
• STFT, multiresolution.
• Features: energy, ZCR, ST-ACF, pitch.
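The frame-based features listed above can be sketched as follows, here in Python/NumPy rather than the course's MATLAB/Octave (the frame and hop sizes are illustrative choices, e.g. 25 ms / 10 ms at 16 kHz, not values fixed by the course):

```python
import numpy as np

def short_time_features(x, frame_len=400, hop=160):
    """Frame-wise short-time energy and zero-crossing rate (ZCR)
    of a 1-D signal x. frame_len and hop are in samples."""
    frames = np.array([x[start:start + frame_len]
                       for start in range(0, len(x) - frame_len + 1, hop)])
    energy = np.sum(frames ** 2, axis=1)  # short-time energy per frame
    # ZCR: fraction of adjacent sample pairs whose sign changes
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr
```

Voiced speech typically shows high energy and low ZCR, while unvoiced (noise-like) sounds show the opposite, which is why these two features are often paired for voiced/unvoiced decisions.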
Block 4. Perceptual models.
• Physical vs perceptual vs formant-based models.
• Speech perception.
• Voice transformations.
Block 5. Production-based models. Linear Predictive Coding (LPC).
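A minimal sketch of the autocorrelation method for LPC, assuming NumPy (the prediction order 12 is a typical choice for 8-16 kHz speech, not a value prescribed by the course; production code would use the Levinson-Durbin recursion instead of a direct solve):

```python
import numpy as np

def lpc(x, order=12):
    """LPC coefficients via the autocorrelation (Yule-Walker) method.

    Returns a with a[0] = 1, so the prediction error is
    e[n] = sum_k a[k] * x[n-k].
    """
    # Autocorrelation at lags 0..order
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Toeplitz system of normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))
```

Applied to a speech frame, 1/A(z) models the vocal-tract filter; the same coefficients drive both LPC-based coding and formant analysis.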
Block 6. Text-To-Speech Synthesis.
Block 7. Automatic Speech Recognition.
• Cepstral analysis.
• Hidden Markov Models.
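The cepstral analysis named above can be sketched in a few lines of NumPy (the window choice and the 1e-12 log floor are illustrative assumptions; speech recognizers usually go one step further to mel-frequency cepstral coefficients):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude
    spectrum. Low quefrencies capture the vocal-tract (spectral
    envelope); a peak at higher quefrency reveals the pitch period
    of voiced speech."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # floor avoids log(0)
    return np.fft.irfft(log_mag)
```

Because the log turns the source-filter product into a sum, envelope and excitation separate along the quefrency axis, which is the basis of cepstral pitch detection and liftering.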
For each topic, there is a theory session, a seminar and a lab. During the theory session, the main concepts are presented to the whole class. The student should then review them with the help of the material provided by the teacher.
After that, a seminar session is planned, where the student solves exercises and problems related to the theoretical concepts. This activity is carried out interactively in small groups.
Finally, a practical session with computers is scheduled in order to solve, in pairs, practical problems requiring algorithm implementations and practical work with sound and the proposed software.
Lessons | Large group (2h) | Small group (1h) | Mid-size group - Labs (2h) | Hours for personal work |
---|---|---|---|---|
1. Introduction / 2. Foundations | 2 | 1 | | 5 |
3. Spectral analysis | 1 | 1 | 1 | 8 |
4. Perceptual models | 1 | 1 | | 7 |
5. LPC | 1 | 1 | 1 | 7 |
6. Text-To-Speech Synthesis | 1 | 1 | | 7 (control) |
7. Cepstral analysis | 1 | 1 | 1 | 7 |
8. Recognition | 2 | 2 | 1 | 8 |
Summary | 1 | | | 8 |
Final exam preparation | | | | 7 |
Total | 18 | 8 | 10 | 64 |
Total: 100 hours
Theory: 18 hours (9 sessions, 2 hours each).
• Lesson 1: Introduction.
• Lesson 2: Speech production. Classification of speech sounds.
• Lesson 3: Spectral analysis.
• Lesson 4: Perceptual models.
• Lesson 5: Linear Predictive Coding (LPC).
• Lesson 6: Text-To-Speech Synthesis.
• Lesson 7: Cepstral analysis.
• Lessons 8-9: Speech Recognition: Hidden Markov Models.
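The core HMM computation behind these lessons, the forward algorithm for the likelihood P(observations | model), fits in a few lines. This is a toy discrete-observation sketch with NumPy (real recognizers emit continuous cepstral features via Gaussian mixtures, and work in the log domain for numerical stability):

```python
import numpy as np

def forward(pi, A, B, obs):
    """HMM forward algorithm: P(obs | model).

    pi: initial state probabilities, shape (N,)
    A:  state transition matrix, shape (N, N)
    B:  discrete emission probabilities, shape (N, M)
    obs: sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one time step
    return alpha.sum()                 # sum over final states
```

Summed over all possible observation sequences of a given length, these likelihoods add to 1, which is a handy correctness check.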
Seminars: 8 sessions, 1 hour each.
• Seminar 1: Voice acoustics.
• Seminar 2: Spectral analysis.
• Seminar 3: LPC.
• Seminar 4: Test.
• Seminar 5: Cepstrum.
• Seminar 6: Speech recognition.
• Seminar 7: Voice transformations.
• Seminar 8: Review and summary.
Labs: 5 sessions, 2 hours each.
• Lab 1: Spectral analysis.
• Lab 2: Spectral models.
• Lab 3: Analysis and synthesis.
• Lab 4: Cepstrum.
• Lab 5: Speech recognition.
Main references
• Quatieri, T. F. 2001. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall.
• Rabiner, L. R. and R. W. Schafer. 2007. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing, Vol. 1, Nos. 1-2.
Complementary references
• Rabiner, L. R. and Schafer, R. W. 1978. Digital Signal Processing of Speech Signals. Prentice Hall.
• O'Shaughnessy, D. 1999. Speech Communications: Human and Machine. John Wiley & Sons.
• Rabiner, L. R. and B. H. Juang. 1993. Fundamentals of Speech Recognition. Prentice Hall.
• Park, Sung-won. Linear Predictive Speech Processing.
• Park, Sung-won. Discrete Wavelet Transform.
• Spanias, Andreas. 1994. "Speech Coding: A Tutorial Review". Proceedings of the IEEE.
• Pan, Davis. 1995. "A Tutorial on MPEG/Audio Compression". IEEE Multimedia Journal.
• Rabiner, Lawrence. 1989. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE.
Teaching material
• Slides and notes.
• Seminar activities.
• Lab instructions.
Software
• PRAAT http://www.fon.hum.uva.nl/praat/
• Octave http://www.gnu.org/software/octave/
• MATLAB