Understanding Acoustics and Audio Systems
Unit – 3
1) Acoustics, Nature of Sound Waves, Fundamental Characteristics of Sound
2) Microphone, Amplifier, Loudspeaker, Audio Mixer, Digital Audio, Synthesizers
3) MIDI, Basics of Staff notation, Sound Card, Audio Transmission
4) Audio File Formats & CODECs, Audio Recording Systems, Audio and Multimedia
5) Voice Recognition and Response, Audio Processing Software
ACOUSTICS
Sound is a form of energy similar to heat and light. Sound is generated by
vibrating objects and can flow through a material medium from one place to another.
During generation, the kinetic energy of the vibrating body is converted to sound energy.
SUB: MULTIMEDIA TECHNOLOGY Class: III BSc., (CS&IT)
When a sound wave travels through a medium, it sets up alternate regions of
compression and rarefaction by shifting the particles of the medium, e.g., molecules of air.
iv) Speed- A sound wave is also characterized by its speed. The speed of sound
depends on the medium through which the sound travels and the temperature of the
medium but not on the pressure. The speed is about 340 m/s in air and 1500 m/s in
water.
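Since the speed varies with temperature, a short sketch can illustrate the quoted 340 m/s figure; the linear fit v = 331.3 + 0.606T is a standard approximation, not taken from the text:

```python
def speed_of_sound_air(temp_c):
    """Approximate speed of sound in dry air (m/s) at temp_c degrees Celsius,
    using the common linear fit v = 331.3 + 0.606*T."""
    return 331.3 + 0.606 * temp_c

v = speed_of_sound_air(15)   # about 340 m/s, matching the figure quoted above
```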
1) MICROPHONE
A microphone records sound by converting the acoustic energy into electrical
energy.
Based on the constructional features,
microphones may be of two types:
Moving coil type
Condenser type
Based on the functional features,
microphones may be divided into three types:
1. Omni-directional
2. Bi-directional
3. Uni-directional
An omni-directional microphone is equally sensitive to sounds coming from all directions.
A bi-directional microphone is sensitive to sounds coming from two directions: the front and the rear.
A uni-directional microphone is designed to record sound from a single source, e.g., a single individual speaking.
2) AMPLIFIER
The amplifier takes power from a power supply and uses the energy to produce
an output electrical signal which has the same shape as the input signal but is larger in
amplitude.
Amplifier circuits are designated as classes A, B, AB, and C for analogue designs and D
and E for digital designs. Five of these classes are described below:
i) Class-A:
Class-A amplifiers use 100% of the input cycle for generating the
output. They are not very efficient (a theoretical maximum of 50% efficiency is
obtained) and are usually used for small signal levels.
ii) Class-B:
Class-B amplifiers only use half of the input cycle for amplification.
Though they produce a large amount of distortion, these amplifiers are more efficient
than class-A because the amplifying element is switched off during half of the cycle.
iii) Class-C:
Class-C amplifiers use less than half of the input cycle for amplification.
Though they produce a huge amount of distortion, they are the most efficient. One way of
reducing distortion is to introduce negative feedback.
iv)Class-D:
Class-D digital amplifiers use a series of transistors as switches. The
input signal is sampled and converted to digital pulses using an ADC.
v) Class-E:
Class-E digital amplifiers use Pulse Width Modulation (PWM) to
produce output waves whose widths are proportional to the desired amplitudes.
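The pulse-width idea behind these digital designs can be sketched in a few lines (a toy illustration with hypothetical parameters, not production amplifier code):

```python
import math

def pwm_encode(samples, steps=16):
    """Encode each sample (normalized to [0, 1]) as a pulse whose width,
    out of `steps` on/off slots, is proportional to the sample amplitude."""
    pulses = []
    for s in samples:
        width = round(s * steps)                 # pulse width ~ desired amplitude
        pulses.append([1] * width + [0] * (steps - width))
    return pulses

# Half a sine cycle, 8 samples, turned into pulse trains
wave = [math.sin(math.pi * i / 8) for i in range(8)]
pulses = pwm_encode(wave)
```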
3) LOUDSPEAKER
A loudspeaker is a device that converts electrical energy back to acoustic energy and
therefore functions just opposite to a microphone. A loudspeaker is divided into
smaller units, each of which is tuned for a small frequency range through the use of
filters known as crossover circuits; the appropriate frequencies are filtered from the
input signal and fed to the corresponding units.
4) AUDIO MIXER
In professional studios, multiple microphones may be used to record
multiple tracks of sound at a time, e.g., recording the performance of an orchestra.
Controls are also provided for adjusting the overall volume and tempo of the audio,
as well as for providing special effects like chorus, echo, reverb, and panning.
DIGITAL AUDIO
Analog quantities are converted to digital form through the processes of sampling,
quantization, and code word generation. According to the Nyquist sampling theorem, the
sampling frequency F needs to be at least twice the highest input frequency f, i.e.,
F = 2f. Human hearing ranges from 20 Hz to 20 kHz, so for the full high-frequency range
sampling frequencies of 44 to 48 kHz are used; CD audio is always sampled at 44.1 kHz.
Example: Recording Human Speech
The data rate D of the digitized audio signal, the amount of data flowing per second in
bits/second, is D = F.b.c, where b is the bit depth and c the number of channels. If T is
the duration of the audio in seconds, the total number of samples is N = F.T.
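The two formulas can be checked with a short sketch; the CD-quality figures below (44,100 Hz, 16 bits, 2 channels) are standard values used purely for illustration:

```python
def audio_data_rate(F, b, c):
    """Data rate D = F * b * c in bits/second: sampling frequency F,
    bit depth b, number of channels c."""
    return F * b * c

def audio_size_bytes(F, b, c, T):
    """Total size in bytes of T seconds of uncompressed audio."""
    return audio_data_rate(F, b, c) * T // 8

D = audio_data_rate(44_100, 16, 2)           # 1,411,200 bits/s for CD-quality stereo
size = audio_size_bytes(44_100, 16, 2, 60)   # about 10.6 MB per minute
```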
2) Wavetable Synthesizer
It produces sound by retrieving high-quality digital recordings of actual
instruments from memory and playing them on demand. Modern synthesizers are
generally of the wavetable type.
3) Polyphony
It refers to the ability to play more than one note at a time. Polyphony is
generally measured or specified as a number of notes or voices. Most of the early
music synthesizers were monophonic, meaning that they could only play one note at a
time.
4) Multi-Timbral
A synthesizer is said to be multi-timbral, if it is capable of producing two or
more different instrument sounds simultaneously.
MIDI Connections
The MIDI data stream is usually originated by a MIDI controller, or a MIDI
sequencer. A MIDI controller is a device that is played as an instrument, like a
keyboard.
a) The MIDI data might be temporarily stored in another device called a
MIDI sequencer, which allows MIDI data sequences to be captured, stored, edited,
combined, and replayed.
b) Each sound module can be configured to play a specific part of the music,
e.g., a specific instrument sound like drums or piano.
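As an illustration of the MIDI data stream itself, a Note On channel-voice message is three bytes: a status byte (0x90 ORed with the channel), a note number, and a velocity, per the MIDI 1.0 specification:

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On message: status 0x90 | channel,
    then note number and velocity (each 0-127)."""
    assert 0 <= channel < 16 and 0 <= note < 128 and 0 <= velocity < 128
    return bytes([0x90 | channel, note, velocity])

msg = note_on(0, 60, 100)   # middle C (note 60) on channel 1, velocity 100
```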
A staff is simply five horizontal parallel lines on which we write music. It does
nothing by itself.
Fig: Staff
A note is an open dot, or the letter O turned on its side. It can have a tail and/or flag
depending on its value. A note represents a single tone, and a series of notes represents a melody.
There are 12 notes in western music. Seven of them are labelled using the first seven
letters of the alphabet; the remaining five are labelled by adding sharp (#) or flat (b)
to one of these letters. The twelve notes of the western scale are C, C#/Db, D, D#/Eb,
E, F, F#/Gb, G, G#/Ab, A, A#/Bb and B. After that they keep repeating (C, C#/Db, etc.).
Sharp (#) refers to the key directly next to and above a note; flat (b) refers to the key
directly next to and below. For example, the black key between C and D is called
C-sharp (C#) and/or D-flat (Db). Writing black keys on the staff is easy: simply put the
flat (b) or sharp (#) before the note. A note that is neither sharp nor flat is called
natural. When referring to note (A), it is assumed that you mean (A) natural.
Music is made up of short and long notes. We measure the length of a note by
counting. The combination of notes into patterns is called rhythm. Silence, or rest, is
also a part of music. The time signature denotes how many beats per measure and
what note gets one beat. It is composed of two numbers written in fraction form: the
top number indicates how many beats per measure; the bottom tells us what type of
note gets one beat.
The time signature is placed after the key signature. Notes written together on the
same beat are to be played, or sung, together. The key signature is a localized place for
all sharps or flats, depending on the scale, appearing in the score. Bars denote measures
of appropriate length as determined by the time signature.
SOUND CARD
The sound card is an expansion board in a multimedia PC that interfaces with
the CPU via slots on the motherboard. A well-known example is the SoundBlaster from
Creative Labs.
MIDI - MIDI port for interfacing with an external synthesizer is present in some cards
like SoundBlaster 16.
AUDIO TRANSMISSION
To convey digital audio in real time between different hardware devices, there
must be both a data communications channel and common clock synchronization. In
addition, the interconnection requires an audio format recognized by both transmitting
and receiving devices. Some important interfaces are:
i) AES/EBU Standard: The Audio Engineering Society/European Broadcasting
Union standard carries digital audio signals between devices and components; it was
published in 1992 and subsequently revised a number of times. It is the same as Part 4
of the IEC (International Electrotechnical Commission) standard 60958 and is officially
known as AES3. It specifies the format for serial digital transmission of two channels of
periodically sampled and uniformly quantized audio signals on a single twisted-pair
wire.
Two samples of audio data, one from each channel, are transmitted in
time-division multiplex in one sample period. Three types of connectors are defined:
Type I: Balanced 3 conductor, 110-ohm twisted pair cabling with XLR
connector.
Type II: Unbalanced 2 conductor 75-ohm coaxial cable with RCA connector
Type III: Optical Fiber with TOSLINK connector.
NOTE: The most common connector is the 3-pin XLR.
ii) S/PDIF Standard: The Sony/Philips Digital Interconnect Format- It is a
standard for transmission of digital audio signal between devices and components. It
was developed from the AES/EBU standard used in DAT systems. Data is sent as a
stream of 192 words for each channel alternately; each word is composed of 32 bits.
A set of words for each sample in each channel is called a data frame, identified by a
code in the first four bits of each word. The word format provides
support for the SCMS- Serial Copy Management System to control copying. This
standard is used to transmit compressed audio data from DVD player to a home
theater system supporting Dolby Digital or DTS surround sound formats.
iii) TRS Connectors: The most common audio connector is the phone audio jack,
also known as the TRS (Tip-Ring-Sleeve) connector. It is available in 3 sizes: 2.5 mm,
3.5 mm and 6.5 mm. The 6.5 mm size was used in manual telephone exchanges. The
3.5 mm miniature and 2.5 mm sub-miniature jacks were designed for audio output from
transistor radios. All three sizes are available in 2-conductor (mono) and 3-conductor
(stereo) versions; in the stereo version two conductors carry the left and right channel
audio and the third conductor is ground.
Male (plug) and female (socket) jacks are used as headphone and earphone jacks of
audio equipment, microphone inputs on cassette recorders, line-in and line-out ports of
PC sound cards, audio outputs for electric guitars and keyboards, I/O ports of external
audio amplifiers, and in portable devices like Walkmans. In a TRS jack, the Tip carries
the left channel signal, the Ring carries the right channel signal, and the Sleeve is the
ground conductor.
iv) RCA Connectors: Developed by the Radio Corporation of America (RCA), these
are used mainly for home applications. The male connector consists of a central pin
surrounded by a metal ring and is found at cable ends. The female (jack) connector,
found on devices, consists of a central hole with a ring of metal around it.
The jack also has a small plastic ring which is color coded for the signal type: yellow
– composite video, red – right audio channel and white or black – left audio channel.
v) XLR Connectors: These are mainly used in professional audio/video recording
and transmission applications. Originally manufactured by Cannon, the connector
acquired a Latch and a Rubber gasket over successive versions, which together gave
the name XLR. Some types are:
1) XLR3 - a connector with three pins, used for high-quality microphones. An
XLR3M (male) connector is used for output and an XLR3F (female) connector is used
for input. The pin assignment is: pin 1 - chassis ground, pin 2 - normal polarity
voltage (hot), and pin 3 - reverse polarity voltage (cold).
is used for uncompressed 8-, 12- and 16-bit audio files, both mono and multi-channel.
The sampling rate is 44.1 kHz, and the format can hold audio compressed using lossless CODECs.
2) AIFF: (Audio Interchange File Format) – It is used for storing audio data on PCs
and was co-developed by Apple, based on Electronic Arts' Interchange File Format (IFF).
The audio in an AIFF file is uncompressed, so it is much larger than files that use
lossless or lossy compression. Types of chunks found in AIFF include the common,
sound data, marker, instrument, comment, name, author, and copyright chunks. The
following structures are related to AIFF:
a) IFF: (Interchange File Format) – It is used in Macintosh systems to facilitate
data transfer between software programs of different vendors. It is built up from
chunks; each chunk begins with a type ID (an OSType) followed by a 32-bit integer
that specifies the size of the following data. Chunks hold different data types such as
text, numerical, or raw data.
b) RIFF: (Resource Interchange File Format) – It was introduced by Microsoft in
Windows for multimedia files. It is the little-endian counterpart of IFF, which was
designed for the big-endian 68k processors used in the Apple Macintosh. Microsoft
file formats such as WAV and AVI use RIFF as their core design. The optional INFO
chunk allows RIFF files to be tagged with information in a number of categories,
such as copyright (ICOP), comments (ICMT), and artist (IART).
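Since WAV and AVI are built from this chunk layout, a minimal sketch of walking top-level RIFF chunks may help; the hand-built 'fmt ' chunk below is illustrative only:

```python
import struct

def riff_chunks(data):
    """Walk the top-level chunks of a RIFF file: 'RIFF', a little-endian
    32-bit size, a 4-byte form type, then (id, data) chunks."""
    assert data[:4] == b"RIFF"
    form_type = data[8:12]                       # e.g. b"WAVE" for WAV files
    chunks, pos = [], 12
    while pos + 8 <= len(data):
        cid, size = struct.unpack("<4sI", data[pos:pos + 8])
        chunks.append((cid, data[pos + 8:pos + 8 + size]))
        pos += 8 + size + (size & 1)             # chunks are word-aligned
    return form_type, chunks

# A minimal hand-built RIFF/WAVE skeleton containing only a 'fmt ' chunk
fmt = struct.pack("<HHIIHH", 1, 2, 44100, 176400, 4, 16)   # PCM, stereo, 16-bit
body = b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
data = b"RIFF" + struct.pack("<I", len(body)) + body
form, chunks = riff_chunks(data)
```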
c) NIFF: (Notation Interchange File Format) – It is a musical notation format based
on Microsoft's RIFF structure. It is designed for exchanging musical information
between different musical editing and typesetting programs. OCR software can scan a
musical score sheet and convert the data to a NIFF file.
d) RMI: (RIFF-based MIDI) – It was introduced by Microsoft as a standard MIDI
file enclosed in a RIFF wrapper. Later it was embraced by the MMA (MIDI
Manufacturers Association) as an extended MIDI format serving as a container for
MIDI and DLS files.
3) MID: (MIDI) – These are binary files containing instructions on how to play a
piece of music on digital musical instruments, published by the MMA (MIDI
Manufacturers Association).
4) DLS: (Downloadable Sounds) – It allows game developers and composers to add
their own custom sounds to the GM sound set stored in a sound card's ROM; the
sounds are automatically downloaded from disk/CD-ROM into system RAM for use by
MIDI music. At the same time, it enables the wavetable synthesizers in computer
sound cards to deliver improved audio.
5) XMF: (Extensible Music Format) – It is a container format that can hold one or
more MIDI files, DLS files, WAV files, and other digital audio files, creating a
collection of all the resources needed to present a musical piece.
structure. Its features are an elliptic curve cryptography key exchange, the DES block
cipher, and the SHA-1 hashing function.
18) RA: (RealAudio) – It was developed by RealNetworks, designed to conform to
low bandwidths, and is used as a streaming audio format, using the Real-Time
Streaming Protocol (RTSP) and a proprietary protocol, Real Data Transport (RDT), to
send the actual audio data. The latest version supports lossless compression.
19) OGG: (Ogg Vorbis) – It is a free and open audio compression project to create
multimedia and signal processing formats. It has replaced MP3 as the de facto
standard audio CODEC in many newer video game titles, which employ Ogg Vorbis.
20) AAC: (Advanced Audio Coding) – It is part of the MPEG-2 audio standard, with
improvements over MP3 to provide better sound quality at sampling frequencies of
8-96 kHz. It was subsequently updated in MPEG-4 (Audio) Part 3 with Perceptual
Noise Substitution (PNS) and a Long Term Predictor (LTP).
21) OMG, OMA, AA3: – These are used to store information on MiniDiscs and other
Sony-branded audio players, using the ATRAC (Adaptive Transform Acoustic Coding)
audio compression algorithm. ATRAC3 LP2 mode uses a 132 kbps data rate; ATRAC3
LP4 mode reduces the data rate to 66 kbps by using joint stereo and a lowpass filter at
around 13.5 kHz.
22) MPC: (Musepack) – It is an open-source derivative of the MP2 (MPEG-1 Layer 2)
format that uses Huffman coding, noise substitution techniques, and a variable bit rate
between 3 kbps and 1.3 Mbps. It uses the APEv2 metadata container and is also a
streamable format.
23) SPX: (Speex) – It is a patent-free audio compression format designed for speech
coding in VoIP applications, which need to handle lost packets. CELP is the encoding
technique, with three sampling rates: 8, 16, and 32 kHz. It supports Variable Bit Rate
(VBR) for specific sounds to improve sound quality, and Voice Activity Detection
(VAD) allows the encoder to detect the presence of speech versus silence and adjust bit
rates. DTX (discontinuous transmission) allows the encoder to start and stop
transmission asynchronously.
24) M4A, M4P: (MPEG-4 Audio, Part 14) – It is a container file format that uses the
Apple Lossless Audio Codec (ALAC), produced by the Apple Lossless Encoder (ALE),
for lossless encoding of digital music; the data is stored within an MP4 container under
the filename extension M4A. It was developed by Apple Computer.
25) AMR: (Adaptive Multi-Rate) – It is a patented file format for speech coding,
adopted by 3GPP as the standard speech CODEC for mobile phones. It supports a
sampling frequency of 8 kHz at 13-bit depth and is partitioned into 20 ms audio frames,
each containing 160 samples. It uses 8 different CODECs with different bit rates
ranging from 4.75 kbps to 12.2 kbps.
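The frame arithmetic quoted above checks out directly (a trivial sketch):

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one audio frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000

n = samples_per_frame(8000, 20)   # 160 samples per 20 ms AMR frame, as stated
```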
Dolby A type NR
Dolby B type NR
Dolby C type NR
Dolby S type NR
Dolby A type NR
Introduced in 1965, it was Dolby Laboratories' first innovation. It was
originally intended for use by professional recording studios to make quieter master-
tape recordings.
Dolby B type NR
It is the original system designed for consumer tapes and is nowadays included
in most tape-recording systems.
Dolby C type NR
It acts on the same basic principle but improves on the B-type by providing
twice the noise reduction.
Dolby S type NR
It is the highest-performance Dolby system for analog cassette recording; it is
derived from Dolby SR and shares several of its advanced features.
a) Cinema Systems
Dolby SR
It is Dolby's next advancement in analog film sound. This technology delivered
a significantly improved dynamic range over Dolby Stereo and is still included today
on nearly all 35 mm film prints.
Dolby Stereo
Dolby Stereo is the original Dolby multi-channel film sound format that
revolutionized the movie experience. This analog optical technology was developed
for 35 mm prints and encodes four sound channels: Left/Center/Right, plus a
surround channel for ambient sound and special effects.
b) Encoding/Decoding
Dolby AC-2
Dolby AC-2 is an adaptive-transform-based algorithm that combines
professional audio quality with a low bit rate, substantially reducing the data
capacity required in such applications as satellite and terrestrial links.
Dolby AC-3
It refers to the surround sound technology that delivers high-quality digital audio for
up to 5.1 discrete channels. The five speaker channels produce a directional and more
realistic effect, and the low-frequency effects channel can often be felt as well.
Dolby Digital EX
It refers to a surround sound format that introduces a center rear channel to the
5.1 playback format of Dolby Digital.
Dolby Digital Live
It is the real-time encoding technology that brings surround sound to
interactive audio, such as video gaming and PCs.
Dolby Digital Plus
It is a highly sophisticated and versatile audio CODEC based on Dolby Digital
and designed specifically to adapt to the changing demands of future audio and video
delivery.
Dolby E
It is a professional audio coding system developed to assist the conversion of
broadcast and other channel facilities to multi-channel audio. Among other benefits,
Dolby E encoded audio can be edited and decoded many times without any degradation.
Meridian Lossless Packing
Meridian Lossless Packing is a true "lossless" coding system defined for DVD-
Audio that compacts PCM data with bit-for-bit accuracy, unlike "lossy" perceptual
coding such as Dolby Digital.
c) Matrix
Dolby Pro Logic - It is the foundation of the multi-channel home theatre experience.
This technology decodes source audio encoded in two-channel Dolby Surround for
four-channel playback.
Dolby Pro Logic IIx - It is a state-of-the-art matrix decoding technology that expands
native two-channel and 5.1-channel source audio for 6.1- or 7.1-channel playback,
resulting in a seamless, wrap-around sound field.
Dolby Surround - Dolby Surround is the original Dolby multi-channel film sound
format that revolutionized movie soundtracks. This technology encodes four channels
of audio onto just two audio tracks for media such as TV broadcasts.
d) Virtual
3) DTS Neo:6: It provides up to six channels of matrix decoding from stereo matrix
material. Such sound systems offer discrete multi-channel sound, provide optimum
decoding of extended surround matrix soundtracks, and can also generate a center
surround channel from 5.1 material.
2) Playing Audio Content: Playback is initiated by pressing the play button in the
toolbar. Other functions such as stop, pause, rewind, go to start, go to end, and loop
play are also provided. Editors also allow positioning the playhead at a specific point
in the file by specifying the time in hh:mm:ss format, and the sound is heard on the
speakers.
3) Pasting and Mixing: A portion of the audio waveform can be selected, copied, and
pasted either into the same or a different file. Pasting changes the duration of the
original file, and hence the file size, as new data is inserted. A mix function allows one
sound to be mixed with another so that both of them are heard simultaneously; the
total duration of the sound remains unchanged, since no new samples are added but
existing samples are changed.
4) Changing Digitization Parameters: Editors provide functions to change the audio
digitization parameters, such as the sample rate (2 to 96 kHz), bit depth (8 to 16 bits),
and the number of channels, from one channel (mono) to two channels (stereo).
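The stereo-to-mono case of such a channel conversion can be sketched as a simple downmix that averages the two channels (`stereo_to_mono` is a hypothetical helper for illustration):

```python
def stereo_to_mono(left, right):
    """Downmix two channels to one by averaging corresponding samples."""
    return [(l + r) // 2 for l, r in zip(left, right)]

mono = stereo_to_mono([100, 200], [300, 400])   # [200, 300]
```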
5) Cut and Trim: The cut function enables the user to select a portion of the audio file
and discard that portion; the file duration is shortened in this case. The trim function
allows one to select a portion of a file and discard the remaining portion; the file
duration becomes the same as the duration of the selected portion.
6) Zooming: It implies displaying a magnified view of the sound waveform without
actually changing the data stored in the file. When an audio waveform is displayed
pictorially, the mapping between the number of samples and pixels is expressed as a
zoom ratio. We can zoom in and zoom out of the view along the time axis as well as
the amplitude axis. Ex: 1:64 means 64 samples of the sound are represented by 1
pixel in the pictorial representation.
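A 1:64 zoom ratio can be sketched as a min/max decimation, where each group of 64 samples supplies the vertical extent of one pixel column (an illustrative helper, not any editor's actual code):

```python
def zoom_out(samples, ratio=64):
    """Map each group of `ratio` samples to a (min, max) pair so one
    pixel column can draw the group's full vertical extent."""
    return [(min(samples[i:i + ratio]), max(samples[i:i + ratio]))
            for i in range(0, len(samples), ratio)]

columns = zoom_out(list(range(128)))   # [(0, 63), (64, 127)]
```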
7) Amplitude Normalization: If in an audio waveform the maximum amplitude level
falls below the upper ceiling of the dynamic range, the normalize function can be
used to raise the amplitude levels so that they just cover the entire dynamic range
without any clipping.
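A sketch of such a normalize function, assuming 16-bit samples with a full-scale ceiling of 32767 (an assumption made for illustration):

```python
def normalize(samples, peak=32767):
    """Scale samples so the largest magnitude just reaches the dynamic
    range ceiling `peak`, without clipping."""
    m = max(abs(s) for s in samples)
    if m == 0:
        return list(samples)          # silence: nothing to raise
    return [round(s * peak / m) for s in samples]

out = normalize([1000, -2000, 500])   # loudest sample now sits at -32767
```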
8) Recording: It consists of two types: external and internal recordings.
a) External recording – It is done by using an external device connected to the
computer for recording purposes, such as voice and music recording.
Voice Recording – A microphone needs to be connected to the input port
of the sound card using a cable and connector.
Music Recording – It is done by connecting the output port of an
external playback device to the sound card.
b) Reasons for using audio: Audio helps ensure that the goal and the objectives are
met; redundancy is the key to better communication with the receiver or learner,
through two sensory channels. The capacity of the human information-processing
system is limited, and hence information loss and errors do occur. Redundancy is the
key to minimizing these factors.
Two conditions influence the effectiveness of audio:
Information presented in each mode should be congruent and not contradictory.
Identical presentations of words in sound and text should be avoided.
Audio is also connected to motivation. A multimedia developer needs to
stimulate the user's motivation and curiosity to involve them in all parts of the
program. It mainly acts as a learning mode to attract the listeners.
c) Suggestions for using Audio Content:
i) Speech: To produce high-quality recorded speech, a script should be written and
professionally recorded. High-quality microphones and a professional narrator with
clear diction and the ability to correctly pronounce the words are plus points. In
general, an effective narrator should:
Vary intonation to motivate, explain, emphasize, etc.
Use a conversational tone.
Be amiable, sincere, and straightforward.
Avoid a lecturing tone.
To develop narrative scripts for integrating speech, multimedia developers should:
Write the way people speak.
Use language that the audience can understand; write clearly, straight-
forwardly, and in short sentences.
Avoid slang and very informal language.
Interpret what the user is seeing rather than just describing it.
Adhere to time limit requirements.
Synchronize narration with visuals wherever possible.
ii) Music: A variety of pre-recorded music is available in digital form on CDs. In
addition, music can be recorded and edited by connecting synthesizers, keyboards, and
other musical instruments to computers using MIDI. Music can be used to establish
mood, signal a turn of events, provide transitions and continuity, accompany titles
and introductory information, and emphasize important points alongside visual
information. The following are rough guidelines for using music:
Choose a music style that conveys the mood you wish to create.
In a voice-dictation system, this would mean that all the words would be
recognized and processed, whereas in a voice-command system only the command
words need to be picked out of a continuous stream of speech, the other words being
ignored. Because of this difficulty, discrete-word systems were much easier to implement.
Small-vocabulary systems handle fewer than 1000 words or phrases,
while large vocabulary means a larger number of words or phrases. As
vocabulary size increases, the processing time as well as the resources required
for recognition can grow in a non-linear fashion.
Large-vocabulary systems base their recognition on elements smaller than
a word, such as the syllable or phoneme. Phonemes are the smallest distinguishable
sounds in the dialect of a language. Each phoneme represents a family of
sounds, since the actual pronunciation of a specific phoneme varies according
to the surrounding phonemes, an effect called co-articulation. Recognition by
phonemes may be the right direction for tackling problems related to large
vocabularies, since there are only about 40 phonemes in the English language,
which comprises over 40,000 words.
Speaker-Dependent Systems:
1. Need to be trained to recognize each user's voice patterns.
2. Commercial systems in use recognize about 30,000 words and 250 phrases
recited by the speaker.
Speaker-Independent Systems:
1. A general voice model is sufficient, given all the differences in accent, pitch, etc.
2. Recognize and respond to hundreds of words on a single Digital Signal
Processing (DSP) chip.
A recent development is the speaker-adaptive system, which constantly
updates its word models based on the actual speaking patterns employed during
use, allowing an operator to start using the system sooner.
Voice response is divided into synthesized and digitized speech. In
the latter case, words or phrases are spoken, recorded, indexed, and saved. These
speech fragments are then pieced together by the application to form the
spoken response. Although the vocabulary is limited, the quality is high, since it
is actual recorded human speech.
Synthesized speech systems allow a much larger vocabulary, but
their quality is limited by that of the synthesis hardware/software. The voice
produced may also sound mechanical. Speech recognition is nowadays