Sis Eng
STC-H203M
SIS 2М
ГНОМ
Version 7.x
Copyright
Copyright © 1993 - 2009 by Speech Technology Center Limited (STC Ltd.). All rights
reserved. SIS Speech Interactive Software, in part or as a whole, may be used
in accordance with the corresponding license. To receive more copies or other
information, please contact STC.
Disclaimer
Speech Technology Center accepts no liability whatsoever for any loss or injury
incurred by the owner or by any third party while using this SIS software and
specifically disclaims any warranties of merchantability or fitness for any particular
purpose.
The contents of the SIS software and User Manual are subject to change without
notice.
TABLE OF CONTENTS
6.3.1 Reading text files without headings
7 DATA INPUT FROM SOUNDCARD
7.1 Sound signal presentation by SIS and its relation with sound card bit range
7.2 Sound input parameters
7.3 Digital waveform viewer
7.4 Visible speech of currently input signal
8 PLAYING SOUND DATA
9 SAVING DATA
9.1 Saving data segments and fragments to disk
9.2 Saving data in the text form
9.3 Saving current state of processing
10 ZOOMING AND SCROLLING DATA IN THE WINDOW
10.1 Data scrolling
10.2 Horizontal data zooming
10.3 Selecting data fragment for display
10.4 Vertical data zooming
10.5 Shifting and resizing the data box
10.6 Zoom mode
10.7 Editing the signal
10.8 Visualization options
11 OPERATIONS WITH CURSORS USING TEMPORARY AND CONSTANT MARKS FOR SELECTING DATA FRAGMENTS
11.1 Creating, shifting and deleting the vertical cursor
11.2 Temporary marks
11.3 Constant marks
11.4 Marks list
11.5 Horizontal cursor and horizontal temporary marks
11.6 Saving the marks together with data
12 GENERATING A TEST SIGNAL
13 DATA PROCESSING
13.1 Selecting source segment/fragment of data
13.2 Signal normalization
13.3 Operations with constants
13.4 Linear signal transformation
13.5 Deleting a signal
13.6 Appending a fragment to the end of the segment
13.7 Inserting a fragment of the current segment into another segment
13.8 Copying the data
13.9 Moving a signal
13.10 Using the contextual menu of an active window
13.11 Shifting the current segment
13.12 Data smoothing
13.13 Mixing the segments
13.14 Clipping
13.15 Changing signal sampling rate
13.16 Reverse
13.17 Inversion
13.18 Modification of a signal tempo without voice pitch distortions
13.19 Modulation
13.20 Waveform accuracy transformation
13.21 Mono/stereo signal transformation
13.22 Operations with a service time channel
14 NOISE REDUCTION
14.1 Adaptive inverse filtering
14.1.1 Method overview
14.1.2 Method parameters
14.2 Adaptive wideband noise filtering
14.2.1 Method overview
14.2.2 Method parameters
14.3 Adaptive filtering of stationary wideband noise
14.3.1 Method overview
14.3.2 Method parameters
14.4 Adaptive filtration of tonal and regular noise
14.4.1 Method overview
14.4.2 Method parameters
14.5 Filtering of impulse noise
14.6 Dynamic filtering
14.7 Filtering stereo signals
14.7.1 Method overview
14.7.2 Method parameters
14.7.3 Using stereo filter
14.8 Signal processing with DirectX plug-ins
14.8.1 Sound Cleaner plug-in
14.8.2 Other DirectX plug-ins
14.9 Equalizer
14.9.1 Equalizer controls
14.9.2 Equalizer toolbar
14.9.3 Equalizer options
14.9.4 Zooming and scrolling
14.9.5 Adjusting filter FC
14.9.6 Elastic mode
14.9.7 Additional FC adjustment
14.9.8 Equalizer operation modes
14.9.9 Stereo signal filtering
15 ANALYSIS OF SPEECH SIGNALS
15.1 Producing 3-dimensional representations for 3-D data
15.1.1 3-D scaling options
15.1.2 Scaling the third dimension for not yet calculated 3-D data
15.1.3 Changing the image limits
15.1.4 Adjusting brightness and contrast of the image
15.1.5 Visible speech normalization
15.1.6 Changing the graphic representation
15.2 Windowing for FFT
15.2.1 Theoretical substantiation of windowing
15.2.2 Description of 5 most commonly used weighting windows
15.2.3 Equi-periodic window
15.2.4 Window selection tips
15.3 Calculating a dynamic spectrogram
15.3.1 Parameter settings
15.4 Calculating periodicity function (dynamic cepstrogram)
15.4.1 Parameter settings
15.5 Calculating dynamic characteristics of the autoregressive model of the speech signal
15.5.1 Parameter settings
15.6 Spectral analysis based on linear prediction of speech. Obtaining a graphical representation of LPC frequency response
15.6.1 Parameter settings
15.7 Calculating the energy (root-mean-square)
15.8 Calculating zero crossing frequency
15.9 Averaging
15.10 Pitch extraction
15.10.1 General information
15.10.2 LLK method
15.10.3 Pitch extraction step by step
15.10.4 Spectral method
15.10.5 Speaker comparison
15.10.6 Controlling the correctness of the pitch extraction
15.10.7 Editing pitch over cepstrum image
15.11 Calculating instantaneous Fourier spectrum
15.11.1 Parameter settings
15.11.2 Spectrum calculation process
15.12 Calculating average Fourier spectrum
15.12.1 Parameter settings
15.13 Formants analysis
15.13.1 Formants calculation
15.13.2 Editing formants over spectrum image
15.14 Histogram calculation
15.15 Comparing histograms and calculating their parameters
15.16 EdiTracker module
16 USING PLUG-INS
17 TESTING THE INPUT/OUTPUT CHANNEL
17.1 Testing the input channel
17.2 Testing the output channel
17.3 Input/output channel loop test
GLOSSARY
SIS 7.0.X END-USER LICENSE
In accordance with the existing international and Russian copyright laws and
treaties, the Software may not be installed and used on more than one computer
simultaneously. You may install the Software onto a second computer provided that
it is first uninstalled (all its files deleted) from the hard disk of the first computer.
Copies of the Software can be stored on other storage devices only for backup or
archival purposes to prevent software loss as a result of disk failure.
If the user violates the terms of this license in any way, Speech Technology Center will
refuse to fulfill its obligations on warranty services, technical assistance and privileged
installations of subsequent versions of the Software. Speech Technology Center also
reserves the right to take other measures to protect its copyright in accordance with
software copyright law.
For all information about the software and hardware produced by STC refer to:
When requesting assistance, you should have the following information readily
available:
• the name of the operating system being used and its version number;
1 INTRODUCTION
The current software package SIS produced by Speech Technology Center (STC) is
intended for speech signal editing and analysis, noise reduction and speaker
identification. It is a powerful multi-purpose tool allowing you to improve speech
intelligibility, enhance signal quality and reduce noise in recorded speech. It provides
the operator with unrivalled opportunities for signal input/output, editing, analysis,
display and processing of speech and other low frequency signals.
SIS allows experts to perform additional operations at the following stages of
digital signal research:
• Visual analysis: displaying on the screen the required images of the parametric
representations of the selected signals.
SIS can be delivered along with the STC-H246 sound I/O device, which is intended for
measuring the characteristics of electrical signals and generating them in the sound
frequency range. This device provides:
• Copying signals onto a computer's hard disk without any distortions and with all
signal properties relevant for forensic examination preserved.
• Measuring amplitudes and frequencies of variable electrical signals, with the
voltage, phase and time intervals determined.
It should be noted that this software was designed especially for experts in low
frequency signal processing, linguistics and forensic audio. Please contact Speech
Technology Center specialists to find out more about training and consultation.
2 SYSTEM CAPABILITIES
• Displaying the dynamic spectrogram in real time during signal input.
• Loading several signals into one window, including signals with different
sampling rate and capacity values.
• Combining two mono signals into one stereo signal and vice versa; producing a
composite stereo signal.
• Multiwindow interface.
• The main characteristics of the speech signal can be displayed in one data window
(via superposition of the images) to check calculation accuracy:
    • Formant trajectories, pitch curve and dynamic spectrogram.
    • Cepstrogram and pitch curve.
    • Waveforms of several signals with different sampling rate and capacity
    values.
• Manual correction of the pitch curve and formant trajectories using different types
of visualization and calculation.
4. Using plug-ins.
3 SYSTEM INSTALLATION AND ADJUSTMENT
Once the uninstallation has been completed, run Setup.exe again and follow the
installation wizard instructions appearing on the screen.
The software is protected with a HASP software protection key. Therefore, at the final
stage of the installation process the program will install a HASP device driver. When
the relevant message appears on the screen, check that your HASP device is connected
to your PC and press OK.
• Right-click anywhere on your desktop background and select Properties from the
context menu.
If your display is in a different mode, you will not be able to work with SIS and the
following message will appear at program start:
To operate properly, SIS requires 2 GB of disk (virtual) memory on one of the drives,
namely the drive specified in the YMS_DISK = line. By default it is drive C. If there
isn't enough free space available on this drive, specify another drive for reading and
writing data by changing the drive letter (for example, YMS_DISK = E). When started,
SIS will check the amount of available free disk space and, if necessary, prompt the
user to change the drive.
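The start-up check described above can be sketched outside SIS as follows. This is only an illustration, not the program's own logic: the helper names `working_drive` and `has_enough_space` are hypothetical, the file is treated as plain text, and 2 GB is taken as 2 × 1024³ bytes, which is an assumption.

```python
import re
import shutil

# Assumption: "2 GB" is interpreted as 2 * 1024**3 bytes.
REQUIRED_BYTES = 2 * 1024**3


def working_drive(ini_text, default="C"):
    """Return the drive letter from a 'YMS_DISK = X' line, or the default (C)."""
    match = re.search(r"^\s*YMS_DISK\s*=\s*([A-Za-z])", ini_text, re.MULTILINE)
    return match.group(1).upper() if match else default


def has_enough_space(drive, disk_usage=shutil.disk_usage):
    """True if the root of the given drive reports at least REQUIRED_BYTES free.

    disk_usage is injectable so the check can be exercised without a real drive.
    """
    return disk_usage(f"{drive}:\\").free >= REQUIRED_BYTES
```

On a Windows machine, `has_enough_space(working_drive(text))` would report whether the configured drive currently meets the stated requirement.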
The SOUND_CARD = line specifies the sound card you use. If you use a 24-bit sound
card STC-H246 (default setting):
SOUND_CARD = STC-H246
SOUND_CARD = STC-H216
Delete the SOUND_CARD = line if you use the sound card integrated in your PC.
The BIT24_MODE parameter specifies the allowed operating mode. If 24-bit mode is
not allowed:
BIT24_MODE = 0
2. Software supplied for the other cards (with HASP key) operates only with
the Windows preferred device for playback and recording. It is also advisable
to select the specified sound card as the preferred device for playback and
recording. You can do this via Start ► Settings ► Control Panel ►
(Sounds and) Multimedia ► Audio tab.
4 GETTING STARTED
If you wish to have several configuration files for different purposes, save each
configuration under a different name using the Options►Save configuration
command. You can subsequently restore the desired configuration with the
Options►Load configuration command. A dialog window will be displayed where you
can choose any of the existing configuration files.
Do not confuse the dialog language with the font. The font can be changed with the
key combination you normally use on your computer for this purpose (e.g. <Ctrl>
+ <Shift>).
4.3 Changing font size of dialog windows
The size of dialog windows can vary depending on the screen size and resolution. To
change the size of dialog windows (as well as the font size), use the Options►Font
size menu item. A dialog window will appear on the screen (Fig. 3).
Neither option affects the font size of the main menu (this parameter can be changed
using standard Windows procedures), but both change the font size of all dialog
windows. The size of the windows is modified in proportion to the selected font size.
The Small font size is recommended for a screen resolution of 1024×768, the Normal
size for 1280×1024 and higher.
The Small buttons size is recommended for a screen resolution of 1024×768, the Normal
size for 1280×1024 and higher. The Buttons style submenu allows you to choose the
Plastic or Metallic style.
The menu bar contains the menus of all functions and operations provided in the system.
Figure 5. SIS main screen
The main toolbar below the menu bar consists of the following toolbars:
File
    • Open file
    • Save fragments
    • Copy image
Sound
    • Start playback
    • Pause playback
    • Go to play point
    • Stop playback
DirectX plug-ins
    • Sound Cleaner
    • EdiTracker
    • other DirectX plug-ins
Information
    • Window information
    • Windows list
Show
    • Logarithmic/linear scale
    • Vertical self-scaling
Edit
    • Copy data
    • Insert data
    • Move data
    • Delete data
Interrupt operation
The appearance of the main toolbar can be customized by the user. Double-click a
toolbar to open its Customize Toolbar window. Each toolbar has its own customizing
window. Figure 6 shows the Customize Toolbar window of the File toolbar.
The right area of the window contains the list of current toolbar buttons. The left
area contains the list of buttons that can be added to the toolbar (the Available
toolbar buttons list). By default all available buttons are placed on the toolbar.
To remove a button from the toolbar, select it in the Current toolbar buttons
list and press <-Remove. To put a removed button back on the toolbar, select it in the
Available toolbar buttons list and press Add->. A separator between buttons on a
toolbar can be added or removed in the same way.
Figure 6. The Customize Toolbar window
To change the order in which buttons are displayed on the toolbar, use Move Up and
Move Down.
To restore the default settings, press Reset. To close the Customize Toolbar window,
press Close.
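The behaviour of the two lists can be pictured with a small model. The sketch below is illustrative only: the class `ToolbarLayout` and its methods are hypothetical names, and the assumption that a re-added button is appended at the end of the toolbar is not stated in this manual.

```python
class ToolbarLayout:
    """Hypothetical model of the two lists in a Customize Toolbar window."""

    def __init__(self, buttons):
        self._default = list(buttons)  # layout restored by Reset
        self.current = list(buttons)   # "Current toolbar buttons" list
        self.available = []            # "Available toolbar buttons" list

    def remove(self, name):
        """<-Remove: take a button off the toolbar and make it available."""
        self.current.remove(name)
        self.available.append(name)

    def add(self, name):
        """Add->: put an available button back (assumed: at the end)."""
        self.available.remove(name)
        self.current.append(name)

    def move_up(self, name):
        """Move Up: swap the button with the one before it, if any."""
        i = self.current.index(name)
        if i > 0:
            self.current[i - 1], self.current[i] = self.current[i], self.current[i - 1]

    def move_down(self, name):
        """Move Down: swap the button with the one after it, if any."""
        i = self.current.index(name)
        if i < len(self.current) - 1:
            self.current[i + 1], self.current[i] = self.current[i], self.current[i + 1]

    def reset(self):
        """Reset: all buttons back on the toolbar in the default order."""
        self.current = list(self._default)
        self.available = []
```

For example, `ToolbarLayout(["Open file", "Save fragments", "Copy image"])` models the File toolbar with all of its buttons in place.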
Key Function
The Data►Undo option, implemented in SIS software versions starting from v. 7.0.1,
allows you to cancel the last signal processing operation and return to the
previous signal state. You can use the Undo command for most data processing
operations. The undo Depth, i.e. the number of times you can use the Undo operation,
is limited by the disk space allowed by the user. You can change this parameter in the
sis_70.ini file (by default the Depth value is 8):
[Data Storage]
Undo Depth = 8
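A depth-limited undo history of this kind can be sketched as below. This is not the SIS implementation: `read_undo_depth` and `UndoHistory` are hypothetical names, the limit is modelled as a simple count rather than as disk space, and the ini value is assumed to be a bare integer.

```python
import configparser
from collections import deque


def read_undo_depth(ini_text, default=8):
    """Read 'Undo Depth' from the [Data Storage] section, falling back to 8."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getint("Data Storage", "Undo Depth", fallback=default)


class UndoHistory:
    """Keeps at most `depth` previous signal states; the oldest are dropped."""

    def __init__(self, depth):
        self._states = deque(maxlen=depth)

    def record(self, state):
        """Store the state that existed before a processing operation."""
        self._states.append(state)

    def undo(self):
        """Return the most recent stored state, or None if nothing is left."""
        return self._states.pop() if self._states else None
```

With a depth of 8, a ninth recorded state silently pushes out the first one, so only the eight most recent operations can be cancelled.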
can choose a new window to be created (indicated by the "_" sign). However, the arrow
will allow you to browse through the first 5 suitable windows only. To see the full list of
available windows, press Choose destination window at the bottom of the menu and
select an appropriate window from the displayed list.
To create a new destination window, press Create destination window. The menu will
disappear and a dashed rectangular frame showing the dimensions of the prospective
window will be displayed on the screen. The mouse cursor will also disappear. By moving
the mouse or pressing the arrow keys you can shift the window frame to the desired
position. By moving the mouse while holding down the left mouse button, or by using the
<Shift>/s (t) key combination, you can change the frame dimensions: the top left
corner stays fixed, while the bottom right corner can be moved as desired.
To display the new window within the rectangular frame, press <Enter> or click
the mouse button. Once the new window has been displayed, the dialog window will
appear again on top of it.
The OPTIONS button opens a window where you can set the parameters of the
selected operation.
If no destination window has been chosen, the system will automatically create a new
empty window and place it below the source window, aligned with its left and right
borders.
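The automatic placement rule can be expressed as a small geometric sketch. The `Rect` type and `default_destination` function are hypothetical; screen coordinates are assumed to grow downwards, and the height of the new window (here defaulting to the source window's height) is an assumption, since the text only fixes its position and its left and right borders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Window rectangle in screen coordinates (y grows downwards)."""
    left: int
    top: int
    right: int
    bottom: int


def default_destination(source, height=None):
    """Place a new window directly below `source` with the same left/right borders.

    Assumption: when no height is given, the source window's height is reused.
    """
    if height is None:
        height = source.bottom - source.top
    return Rect(source.left, source.bottom, source.right, source.bottom + height)
```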
When the All segments in window option is checked, the selected operation will be
applied to all segments in the current window and not only to the active one.
The source-destination menu also allows you to specify which part of the speech signal
should be processed. The following options are available:
• All data. All sound data in the current window will be processed.
• Between temporary marks. Only data located between temporary marks will be
processed (there should be two of them).
To start an operation, click the field of the desired data part type.
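The two data part types listed above amount to a choice of sample range. The following sketch illustrates that choice only; `select_part` is a hypothetical helper, marks are assumed to be sample indices, and the range between marks is assumed to be half-open (the first mark included, the second excluded).

```python
def select_part(samples, part="all", marks=()):
    """Return the samples an operation would process.

    part="all"            -> every sample in the window
    part="between_marks"  -> samples between exactly two temporary marks
    """
    if part == "all":
        return list(samples)
    if part == "between_marks":
        if len(marks) != 2:
            raise ValueError("exactly two temporary marks are required")
        start, end = sorted(marks)  # the marks may be given in either order
        return list(samples[start:end])
    raise ValueError(f"unknown data part type: {part!r}")
```

Calling it with one mark or three raises an error, which mirrors the requirement that there be two temporary marks.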
5 WORKING WITH DATA WINDOWS
If a window is deleted, all data displayed in it that is not also displayed in at
least one other window will be deleted as well. Windows may overlap; they can be
moved, and their size and shape can be modified.
At least one window should be created to start working with the data.
The newly created window will be of the UNI (universal) type and can be used for any
kind of input data.
An unlimited number of windows can be created in the program, but once their number
exceeds 107, the short window names (consisting of one symbol) begin to repeat, which
makes linking windows and some other operations difficult to perform.
It is therefore undesirable to have more than 108 windows open simultaneously.
A newly created window always appears on top of all previously opened or created
windows. It becomes current (active), i.e. all operations (editing, analysis, processing,
etc.) will be related to the data contained in this window.
Several different segments of data can be loaded into one window. The names of all
segments currently contained in the data box will be displayed in the window heading,
as long as its width allows. Each segment name will be displayed in the same colour as
its data. The heading background colour is black if the window is active and grey if it is
inactive.
The contextual menu contains commands from the Show and Windows menus of the main
screen and the commands for copying and pasting a fragment. Depending on the data
type, it may also contain the Background color or Drawing type selection items. If the
window contains constant marks, the contextual menu has the Make highlighted option.
If data in the window represents the results of calculations, the Recalculate option is
available. It allows you to repeat calculations with new parameters.
Figure 8. Context menu of an active window
Switching between windows does not require closing previously opened dialog windows:
after you activate another window, any currently open submenu will be redrawn and
appear again on the screen.
Keep in mind that any modifications of values in a menu field are considered
temporary until you leave this field. That is, if you activate another
window while the modified field is still active, the changes will be lost and the
previous settings will be restored after the menu is redrawn.
5.3.3 Switching between windows using the windows list
Using the Windows►Windows list menu item you can view the list of all windows currently
open in the program. The list looks as shown in Figure 9; it contains the window
activity indicator, the window selection indicator, the window name and the window type.
Use the mouse or the <Tab> or <Shift>+<Tab> keys to switch between the columns.
To move up or down in a column, use either the mouse or the arrow keys.
You can make any window active by clicking on the activity indicator or by pressing the
<Spacebar> key. Only one window at a time can be activated.
The menu is scrollable, so if you do not see the desired item in the visible part of the
list, browse through the list using the scroll bar at the right (or pressing the arrow keys
or <Page Up>/<Page Down> on the keyboard).
To move the active window to the center of the main window, press Move to center.
To quit the windows list, press OK (to apply the changes) or Cancel (to quit without
applying the changes).
To make the comments field active, click there with the mouse – its background colour
will get lighter and a blinking blue cursor will appear in it. To quit the comments line,
press <Enter> or click with the mouse anywhere outside it.
You can append as many comments lines to the window as you like, but if the vertical
size of the window is insufficient to display all the lines, some of them will not be
displayed. As you reduce the size of the window, the lines located above the data will
disappear first (starting with the lowest), then the lines below the data will be hidden
(also starting with the lowest). As you increase the window size, the lines will come up
again.
You can also add mark text using the contextual menu of the active window. To do this,
move the mouse pointer to the constant vertical mark. When the pointer changes its shape,
press the right mouse button and select the Mark text item in the contextual menu. The
viewer panel will appear below the data box (if it has not been created before) and the
Mark text dialog window will open. Enter the mark text and press OK.
The marks in the viewer panel are the same as in the data box. If the mark is provided
with an inscription, the text will be positioned over the mark (bottom to top, left to
right), but independent of the text length at least one dash of the mark will remain at
the bottom of the viewer panel.
If the text doesn't fit onto one vertical line, it will be displayed in several lines. But if
different marks are placed too close to each other, their texts will overlap. To
display the text as completely as possible, move apart the vertical borders of the comment
fields or zoom in on the data box.
The viewer is refreshed every time the horizontal size of the data box changes,
as well as when segments are deleted or inserted.
The text of any mark or the viewer height may be changed without entering the main
menu. To change the mark text, move the mouse pointer to the corresponding vertical
line (+/- 5 pixels) until the pointer changes its shape, and press the left mouse button. A
pop-up menu containing the Mark text, Copy and Delete items will appear on the screen.
To edit the mark text, select the Mark text item.
To change the viewer panel height, click the right mouse button within the viewer panel
area and choose The panel height in the contextual menu that appears. Enter the
panel height. It should be an even number (SIS will adjust it otherwise) in the range
from 4 to 42 (symbols).
New values will be applied after you press OK and will be ignored if you press Cancel.
If you choose Close panel, the marks viewer panel will be removed from the window. All
texts will remain unchanged.
To dissociate the current window from any of the linked windows, just remove its name
from the link field.
The name of the current window itself is not entered into the field.
After you have entered the names of the windows you want to link, press OK. The
specified windows will be linked. If for some reason linking is impossible, the system will
display a message “Link is not set” in the message line.
Once the link has been established, all operations with the horizontal cursor and
constant/temporary marks will be performed simultaneously in all linked windows.
To bring the visible area along the horizontal axis in all linked windows in line with
that of the active window, press <F11>. To make the system synchronize the
horizontal zoom rate and data box boundaries in linked windows automatically, select
Synchronize linked windows in the Options►Extra options menu.
mouse while holding its left button down, you can drag this frame to any desired
position on the screen.
The whole SIS main window area is available for moving, except the menu line, the
toolbar and the bottom line (reserved for system messages).
Type - contains information about the window type (e.g. Waveform, Cepstrum,
Spectrogram, etc);
X axis: - contains information about the physical parameter, units and scale type
represented along the horizontal axis (see section 5.6.1);
Y axis: - contains information about the physical parameter, measure units and
scale type represented along the vertical axis (see section 5.6.1);
Z axis: - contains information about the third axis (see section 5.6.1). Not used
for 2-D data.
Figure 10. Window information box
For some types of windows you can press the X axis or Y axis to receive the following
choice lists:
Linear, Logarithmic, Barks – scale choice for the Y-axis of spectrogram and for
zero crossing frequency, for the X-axis of the FFT spectrum;
When you press Data drawing mode, an option list appears allowing you to choose the
method of data representation in the box. It contains the following options:
Copy - new data will be drawn on top of the previous data opaquely;
OR - new data will be drawn on top of the previous data, with colours combined by
bitwise “logical OR” (0 OR 1 = 1; 0 OR 0 = 0; 1 OR 0 = 1; 1 OR 1 = 1);
XOR - new data will be drawn on top of the previous data, with colours combined by
bitwise “exclusive OR” (0 XOR 1 = 1; 0 XOR 0 = 0; 1 XOR 0 = 1; 1 XOR 1 = 0);
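The three drawing modes listed above can be illustrated with a short Python sketch. It assumes that a colour is stored as a 24-bit integer of the form 0xRRGGBB; the function name `combine_pixel` and this colour representation are illustrative assumptions, not part of SIS.

```python
def combine_pixel(old, new, mode):
    """Combine an existing pixel colour with a newly drawn one.

    Assumption: colours are 24-bit integers (0xRRGGBB). This mirrors the
    Copy / OR / XOR options of the manual but is not SIS code.
    """
    if mode == "copy":
        return new            # opaque: the new colour replaces the old one
    if mode == "or":
        return old | new      # bitwise logical OR of the two colours
    if mode == "xor":
        return old ^ new      # bitwise exclusive OR of the two colours
    raise ValueError("mode must be 'copy', 'or' or 'xor'")


red, green = 0xFF0000, 0x00FF00
yellow = combine_pixel(red, green, "or")         # red OR green
restored = combine_pixel(yellow, green, "xor")   # drawing green again in XOR mode
```

A known property of XOR drawing is that drawing the same data twice restores the original picture, which is why `restored` equals `red` in this sketch.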
Free memory (Mb): XXXX - an information field showing the amount of free virtual
memory in the SIS system in Mbytes. All data about all signals are stored in virtual
memory, and once it is exhausted, signal analysis and filtering become impossible.
Segments list - contains the list of segments displayed in the current window. Each
line corresponds to one segment. It has two fields, one containing segment names, the
other - segment types. The colour of the text corresponds to the colour of the segment.
The name of the segment can be modified like ordinary text.
All changes in the active window will be applied only after you quit the described menu.
To get information about another window, just click on the desired window without
closing the information box. The box will for a moment disappear, the new active
window will be displayed on top, then the information box will pop up again, but now it
will provide information about the new active window.
- Linear.
- Logarithm.
The pop-up menu also contains the information on the axis: physical parameter
measured, measure units, scale type.
The <F5> key quickly switches the scale type of the vertical axis from linear to dB and
vice versa. For reference: the signal value in dB equals 20 times the decimal logarithm
of the value (for negative values the absolute value is taken).
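The conversion between the linear and dB scales can be written out as a small Python sketch. The function name `to_db` and the treatment of a zero sample (returned as negative infinity) are assumptions made for illustration; the manual does not say how SIS handles zero values.

```python
import math


def to_db(value):
    """Return 20 * log10(|value|), the dB value described in the text.

    Assumption: a zero sample maps to -infinity; SIS may treat it differently.
    """
    magnitude = abs(value)          # negative values: take the absolute value
    if magnitude == 0:
        return float("-inf")
    return 20.0 * math.log10(magnitude)


print(to_db(10))     # a sample of 10 counts
print(to_db(-100))   # the sign does not affect the result
```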
To change the name and color of an existing segment, use the
Show►Segments list menu item. The Segments list window will pop up on the
screen (see Fig. 11). It contains two submenus: the list of colors and the list of
segments.
In the list of colors each line corresponds to one color and is shown in that
color. In the list of segments, each segment is described by one line. This line contains
the name, the type, the stereo-segment indicator, the sampling rate of the segment, and a
field for changing the color, indicated by the letter C. The name of the segment can be
edited as an ordinary text line up to 12 symbols long. If it is a stereo-segment,
you will see the letter S after its type. The list contains information about
all (up to 255) segments of the active window.
If you make another window active by clicking on it with the mouse without leaving the
menu, the menu will disappear and reappear with the list of segments
of the new active window.
To change the color of a segment, select the desired color in the color list
(double-click on the respective color field). Then enter the right submenu (by clicking
anywhere in it with the left mouse button) and, in the Color column, click with the left
mouse button on the C field in the line of the corresponding segment. The segment
name and the letter C will immediately change their colour. After you quit the Segments
list window (by pressing OK), the active window will be displayed with the new names
and colors of the segments.
For “visible speech”, the selected color will be applied only if a segment is represented by
deviation right/up.
The type of the processing fragment is indicated, and can be changed, with the
button at the right of the signal window. Left-clicking this button displays a set of
control buttons specifying the data fragment type:
- all data;
- highlighted;
- between temporary marks;
- visible in window;
- selected.
To choose a new type or confirm the old one, click one of these buttons. The
button in the signal window then changes its appearance accordingly.
6 READING FILES FROM DISK
SIS allows you to read various file types, generated both by the SIS system itself (e.g.
*.DAT files, also from previous versions of the system), and by systems supporting
*.WAV, *.MP3, *.WMA, *.AVI file formats.
The actual file extension is of no consequence, since the specified file types contain, in
addition to the sound data proper, all associated information. Thus, if you saved a
spectrogram as a *.DAT file, it will still be read as a spectrogram, not as a waveform.
However, it is advisable to make use of the typical extension types given below, to
facilitate file search.
It should be noted that if the file is generated by the SIS system or by some other
software in the Windows-compatible format (*.WAV), the type of data will be checked
on compatibility with the type of window for data input. For example, if the file contains
a Fourier spectrum, the data will be entered as a Fourier spectrum, irrespective of the
user's desire and the file extension.
For files with the .DAT extension it is possible to read only a part of the file rather
than the whole file. To activate this option, clear the Read whole file flag in the Open
file dialog box. After this, each time you open a .DAT file the Big files window will appear:
In this window you have to enter the start position and the length (in seconds) of
the fragment to read.
To disable this option, check Read whole file in the Open file dialog box.
If the file has an .ALW extension (A-law waveform), SIS will read it as an 8-bit file in
A-law format. The sampling rate is 8 kHz.
If the file has a .SND extension, SIS will read it as an 8-bit file in mu-law format. The
sampling rate is 8 kHz.
If the file has a .VOX extension (waveform in covox format), SIS will read it as an 8-bit file
in ADPCM format. The sampling rate is 8 kHz.
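For the mu-law (.SND) case above, the expansion of each 8-bit code into a linear sample can be sketched as follows. The sketch uses the standard ITU-T G.711 mu-law expansion. It is an assumption that SIS uses exactly this variant and that the file is a headerless stream of such bytes; the function names are hypothetical.

```python
MU_LAW_BIAS = 0x84  # bias used by the standard G.711 mu-law expansion


def mulaw_byte_to_linear(code):
    """Expand one 8-bit G.711 mu-law code into a signed linear sample."""
    code = ~code & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + MU_LAW_BIAS) << exponent) - MU_LAW_BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data, sampling_rate=8000):
    """Decode raw mu-law bytes (assumed headerless) into (times, samples)."""
    samples = [mulaw_byte_to_linear(b) for b in data]
    times = [i / sampling_rate for i in range(len(samples))]
    return times, samples
```

With the 8 kHz sampling rate stated in the text, consecutive samples are 1/8000 s apart.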
The Data►Open file command also allows you to open a file of any unidentified format
as an oscillogram (mono, 16 bits, PCM). In this case the sampling rate can be assigned
arbitrarily.
The entered data will be immediately displayed in the current window. A waveform is
typically displayed in counts – arbitrary units (along the Y axis); the time is represented
in seconds (along the X axis). If desired, you can change the mode of signal
representation to dB instead of counts. To do this, use the Show►Linear/dB scaling
menu or press the <F5> key.
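Reading a file of unidentified format as a mono 16-bit PCM waveform, as described above, can be sketched in Python. The little-endian byte order, the absence of any header and the function name `read_raw_pcm16` are assumptions; the manual only states "mono, 16 bits, PCM" with a user-assigned sampling rate.

```python
import array
import sys


def read_raw_pcm16(path, sampling_rate):
    """Read a headerless file as mono 16-bit PCM and return (times, samples).

    Assumptions: samples are little-endian and the file has no header.
    """
    with open(path, "rb") as f:
        data = f.read()
    samples = array.array("h")                             # signed 16-bit integers
    data = data[: len(data) - len(data) % samples.itemsize]  # drop a trailing odd byte
    samples.frombytes(data)
    if sys.byteorder == "big":
        samples.byteswap()                                 # file is assumed little-endian
    # Time axis in seconds, values in counts, as in the waveform display.
    times = [i / sampling_rate for i in range(len(samples))]
    return times, list(samples)
```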
To read compressed WAV (RIFF) files or MP3, AVI, MPEG and similar files, the
system uses the DirectShow filters of the operating system to build a file
reading graph. Speech Technology Center does not supply DirectShow filters and
is not responsible for their incorrect operation. Installation and correct
configuration of the filters is the responsibility of the user’s
administrative and technical support services. Free filter packages can be found on
the Internet (for instance, K-Lite Mega Codec Pack). Their documentation usually
describes their composition and range of application. If, on an attempt to open
such a file, the program opens the unidentified format window (see Fig. 13), it
means that the program failed to open the file through a DirectShow filter and
suggests opening it as a file of unidentified format.
Several sound signals can be displayed in the same window. Each signal (called
segment) is displayed in a different color (see Fig. 14). The names of the segments are
represented at the top of the window by the same colours as corresponding segments.
In the case of stereo-segments the right channel data is represented by a darker shade
of the same color as the left channel data (for example, left channel data is light-blue,
right channel data is dark-blue).
You can change the color of the segments using the Show►Segments list menu (see
section 5.6.4).
Playing a file from the file open menu before loading it into the program is not supported in
the current version.
In any mode, the file open menus allow you to delete one or several files from disk (for
instance, when free disk space is required). Just right-click on the file name and press <Delete>
(or select the Delete option in the context-sensitive menu). The system will ask for
confirmation and the file will be sent to the recycle bin.
6.2 Standard file extensions
The following file extensions are supported:
- .AUC - autocorrelation,
- .CEP - cepstrum,
- .DAT - waveform,
- .ENE - energy,
- .PIT - pitch,
- .SPE - spectrogram,
- The third line describes the data format and should contain one of the following:
X,Y
F=%%%%%%
S=%%%%%%.
If the third line contains X,Y, each following line contains two numeric
values, the first of which is an abscissa value (X) and the second an ordinate value (Y).
Please note that all abscissas should be arranged in ascending order and change
with a constant step. The abscissa and ordinate values should be separated by a space.
Example:
STCautoidentification Text segment
Waveform
X,Y
0 523
6.25e-005 459
0.000125 463
0.0001875 527
0.00025 602
0.0003125 536
If the third line contains F=%%%%%%, then the fourth line should contain the Y
symbol. F means the signal sampling frequency value in Hz, e.g. F=11025. Then each
subsequent line should contain one numeric value (signal value in the next point). The
first signal value corresponds to abscissa (time) value equal to zero.
Example:
STCautoidentification Text segment
Waveform
F=11025
Y
-87
-78
-88
-97
-111
If the third line contains S=%%%%%%, then the fourth line should contain Y. S
means the step with which the abscissa value (X) changes, e.g. S=10.7666.
Then each subsequent line contains one numeric value (for the next point). The step for
each signal should be written in standard units, e.g. in Hz for an FFT spectrum type
and in seconds for a Waveform. The first signal value corresponds to abscissa (time)
value equal to zero.
Example:
STCautoidentification Text segment
Spectrum
S=10.7666
Y
34.1674
59.8291
85.2392
96.8834
104.305
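The three layouts shown in the examples above (X,Y, F=... and S=...) can be read with a short parser. This is a minimal sketch based only on those examples: it assumes that the first line is the identification string, the second line is the segment type, and that the file contains no blank or malformed lines. The function name `parse_text_segment` is hypothetical.

```python
def parse_text_segment(text):
    """Parse a text segment into (segment_type, xs, ys).

    Sketch based on the three examples in the manual; real files may
    contain details that this parser does not handle.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    segment_type, fmt, body = lines[1], lines[2], lines[3:]

    if fmt == "X,Y":
        # Each line: abscissa and ordinate separated by a space.
        pairs = [ln.split() for ln in body]
        return segment_type, [float(p[0]) for p in pairs], [float(p[1]) for p in pairs]

    key, _, value = fmt.partition("=")
    if key not in ("F", "S") or not body or body[0] != "Y":
        raise ValueError("unsupported format line: %r" % fmt)

    # F gives the sampling frequency in Hz, S gives the abscissa step directly.
    step = 1.0 / float(value) if key == "F" else float(value)
    ys = [float(ln) for ln in body[1:]]
    xs = [i * step for i in range(len(ys))]   # first value is at abscissa zero
    return segment_type, xs, ys
```

For the F=11025 example, the returned abscissas are 0, 1/11025, 2/11025 and so on, in seconds.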
6.3.1 Reading text files without headings
The user can read text files containing one or two columns with numeric values. Signal
values with a constant step are placed in one column, while two-column arrangement is
used to represent abscissa and ordinate values.
The file should have a .TXT extension. When you open such a file from SIS, the system
will display the following query: "File type is detected by extension as text file. Do you
agree?" If you choose No, the system will read the file in binary format, PCM 16 bit. If
you choose OK, the system will start setting the signal parameters. You will first see a
dialog window shown in Fig. 15.
Under First three the actual first three lines of the processed file are displayed – for the
user to make sure this is the right file. Up to 30 characters can be displayed from each
line.
Clicking with the mouse on one of the last four lines, the user can either cancel the
procedure (pressing Cancel) or select the desired segment type (Waveform, Pitch or
FFT spectrum). When you click OK, a format selection window will appear (see
Fig. 16).
Under First three the actual first three lines of the processed file are displayed – for the
user to make sure this is the right file. Up to 30 characters can be displayed from each
line.
Selecting X,Y means that each line contains two numeric values, the first considered as
a coordinate, the second – as a signal value. The coordinates (abscissas) must change
with a constant step. If not, SIS will calculate the mean step value and will use it for this
segment.
Y means that each line contains one numeric value standing for signal value in the
specified point.
Clicking Cancel, you will cancel the operation and the file will not be read.
After you have selected one of the suggested formats, the following window will be
displayed:
One of the values, sampling rate or step, has to be entered. If you enter a non-zero
sampling rate, the step will be calculated from the sampling rate. If you enter a
non-zero step while the sampling rate is zero or not entered, the system will use
this step value. If both values, sampling rate and step, are zero or not entered, the
system will calculate the step for a sampling rate of 11025 Hz.
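The priority rules for the sampling rate and the step can be summarized in a few lines of Python. The function name `resolve_step` and the use of zero to mean "not entered" are assumptions for illustration.

```python
DEFAULT_SAMPLING_RATE_HZ = 11025.0  # default stated in the manual


def resolve_step(sampling_rate=0.0, step=0.0):
    """Return the abscissa step according to the rules described above.

    Assumption: a value of zero stands for "not entered".
    """
    if sampling_rate:
        return 1.0 / sampling_rate            # a sampling rate takes priority
    if step:
        return step                           # otherwise use the entered step
    return 1.0 / DEFAULT_SAMPLING_RATE_HZ     # neither entered: 11025 Hz


print(resolve_step(sampling_rate=8000))   # step derived from the sampling rate
print(resolve_step(step=0.5))             # step used as entered
print(resolve_step())                     # default sampling rate
```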
After you press OK, the system will read values until the file ends, until 100
format errors are detected, or until SIS virtual memory is exhausted.
7 DATA INPUT FROM SOUNDCARD
When recording data via the line input or from a microphone, you need not synchronize
the start of the signal on your input device with the moment you press the record button
in the program menu. Initially, after you press the record button (or select
Data►Input from sound card), a settings dialog box will be displayed and sound
input will start in the so-called “tuning mode”, which lets you adjust the incoming
signal level. To start signal input, press <spacebar>.
In the tuning mode you can control the volume of the input sound with the Windows
mixer. During recording you can open the mixer window and adjust the volume to the
desired level using the sliders.
Signal input can be performed only if the sound window is active. Thus, after you
have adjusted the signal level in the mixer window, activate the sound window again
by clicking on it with the mouse pointer.
When a 24-bit signal is input through the analog-to-digital converter, the program
receives a 24-bit integer from the sound card, converts it to a floating-point number
(float) and divides the result by 256 (i.e. normalizes the signal to the 32767 limit). As a
result, the amplitude of a 24-bit signal will not exceed the 32767 limit, but its
resolution will be 256 times finer than that of a 16-bit signal. This allows you to use
signals of both formats in the same window.
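The normalization of 24-bit input can be sketched as follows. The helper names and the little-endian packing of the three bytes are assumptions; the manual only specifies the conversion to float and the division by 256.

```python
def decode_int24_le(raw):
    """Interpret 3 bytes as a signed 24-bit integer.

    Assumption: the bytes are little-endian; the manual does not say.
    """
    if len(raw) != 3:
        raise ValueError("expected exactly 3 bytes")
    return int.from_bytes(raw, "little", signed=True)


def int24_to_sis_float(sample):
    """Scale a 24-bit integer sample into the 16-bit amplitude range."""
    return float(sample) / 256.0


full_scale = int24_to_sis_float(decode_int24_le(b"\xff\xff\x7f"))  # largest 24-bit value
smallest = int24_to_sis_float(1)                                   # one 24-bit step
```

Here `smallest` is 1/256 of a count, which is the finer resolution the text refers to, while `full_scale` stays just below 32768.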
SIS does not forbid working with float signals whose amplitude exceeds 32767;
a tenfold reserve is kept, with the amplitude limited to 320000.
This reserve is needed if a signal value exceeds 32767 after some kind of processing
(e.g. filtering). In this case, the result will not be clipped and the useful signal will
remain unaffected. To avoid degrading signal quality after such processing, it is
recommended to normalize the signal to within the 32767 limit.
SIS does not prohibit generating signals with an amplitude greater than
32767. However, we do not recommend working with such
signals, as they cannot be reproduced correctly by the sound card.
When a float signal is output to a 24-bit card, the program multiplies the signal
amplitude by 256 and converts the result to a 24-bit integer. Thus, if the signal
amplitude exceeds 32767, a bit range overflow will occur and the signal (sound) will
be clipped. That is, as mentioned above, signals whose amplitude is greater
than 32767 cannot be reproduced correctly even by a 24-bit sound card.
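The output conversion and the clipping it causes can be illustrated with a short sketch. The function name, the rounding step and the exact clipping limits (the signed 24-bit range) are assumptions; the manual states only the multiplication by 256 and that overflowing signals are clipped.

```python
INT24_MAX = 2 ** 23 - 1    # 8388607
INT24_MIN = -(2 ** 23)     # -8388608


def sis_float_to_int24(value):
    """Scale a float sample by 256 and clip it to the signed 24-bit range.

    Assumption: out-of-range values saturate at the 24-bit limits.
    """
    scaled = int(round(value * 256.0))
    return max(INT24_MIN, min(INT24_MAX, scaled))


in_range = sis_float_to_int24(1000.0)    # fits into 24 bits
clipped = sis_float_to_int24(40000.0)    # exceeds 32767, so it is clipped
```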
In the Sampling rate field the user has to specify the sampling rate (frequency) at
which the sound will be recorded. By default it is 11025 Hz. Any value within the
[4001 - 192000] range can be entered here.
- Mixed mono - the sound is recorded as a stereo signal from two channels and
the channels are mixed (added up) during the input.
File format of the recorded signal depends on the input accuracy (16- or 24-bit).
Recorded signal can be saved either as 16-bit, or as 24-bit data, regardless of the
formats supported by your sound card. The maximum sound card resolution is shown
below the Input signal resolution group.
The program automatically checks whether the sound card supports 24-bit data. To
suppress this check, add the following lines to the sis_70.ini file:
[SOUND CARD]
BIT24_MODE = 0
If the program is supplied for use with third-party sound cards, the sis_70.ini file
always contains this line.
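Whether this setting is present in an ini file can be checked from a script, for example with Python's standard configparser module. This is only a sketch: it assumes that sis_70.ini follows the usual section/key layout that configparser accepts, and the function name is hypothetical. How SIS itself parses the file is not described in the manual.

```python
import configparser

SAMPLE_INI = """\
[SOUND CARD]
BIT24_MODE = 0
"""


def bit24_check_suppressed(ini_text):
    """Return True if the ini text contains BIT24_MODE = 0 under [SOUND CARD]."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # fallback=None covers both a missing section and a missing key.
    return parser.get("SOUND CARD", "BIT24_MODE", fallback=None) == "0"


print(bit24_check_suppressed(SAMPLE_INI))
```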
To suppress signal output during signal recording, check Mute output. In this case, you
can estimate a signal level by its waveform.
Having set the necessary parameters, press OK. Then a new window will be created, the
system will pass into the digital waveform viewer mode and wait for the user’s request
to start the recording of the sound signal.
Pressing Cancel you will close the menu and return to the previously active window or
menu.
In the digital waveform viewer mode the user has an opportunity to control the source
signal and to start/suspend/stop sound recording into computer memory.
Once the signal input mode is activated, a box appears in the current window with the
input signal displayed in the yellow color. To start recording, press the <spacebar>.
The signal will change its color from yellow to blue. When you press the <spacebar>
for the second time, the recording process will be suspended. To resume recording,
press the <spacebar> again. To quit the digital waveform viewer mode press <ESC>.
The input signal will be displayed in the window and automatically assigned a name
NONAME1. The next such window will be called NONAME2 etc. You can subsequently
edit its name in the usual way.
The following key combinations are available for data zooming and shifting during sound
input:
7.4 Visible speech of currently input signal
If during signal input from the sound card you press the <F4> key, a new window will
appear under the waveform window. It will contain visible speech (spectrum) of the
input signal displayed continuously and simultaneously with the signal waveform.
The following key combinations are available for data zooming and shifting in the visible
speech window:
Press <TAB> to move to the source signal (waveform) window and back.
To stop visible speech display, make the visible speech window active and press
<ESC>. The source signal will still be continuously displayed in the waveform window.
To suspend/resume signal input press <spacebar>, to quit signal input mode press
<ESC>.
8 PLAYING SOUND DATA
During playback the cursor remains stationary in the center of the window; the
“visible speech” moves relative to the cursor. Pressing the <spacebar>
stops sound playback and “visible speech” movement simultaneously.
While playback is paused (after pressing the <spacebar> or the corresponding button on
the toolbar), you can set one or several constant marks by pressing the <Insert> key;
each mark appears at the current cursor position. Pressing the <spacebar> once more
resumes sound playback from where it stopped.
In the Playback menu you can choose the desired fragment to be played:
- Between temporary marks. Only data located between temporary marks will be
played (there should be two of them).
- From temporary marks to end. Sound data from the second (right) temporary
mark to the end will be played.
- From start to temporary marks. Sound data from the beginning to the first
(left) temporary mark will be played.
- Visible data and later. Data from that currently visible in window to the end will
be played.
- All segments in window. All segments in the active window will be played one
by one, beginning with the top one. The system makes a pause between different
segments; the pause duration can be specified in the Playback►Options menu.
- Selected. Only data fragments marked by selected (ticked) marks will be played.
Marks can be selected in the Marks list.
If the loop mode button is not pressed, playback starts from the beginning of
the fragment and stops at its end. Playback can be interrupted only by pressing
<Esc> or the corresponding button on the toolbar. Other user actions, except playback
mode operations, will be ignored.
If the loop mode button is pressed, the sound fragment will be played repeatedly.
To play the fragment again, press <Home>. To return to the loop mode, press
<spacebar>.
Note that while playback is paused, pressing <Home> switches the system
into the one-time playback mode, while pressing the <spacebar>
returns the system to the loop mode.
To start playback, press <F6> on the keyboard. Also, the following key combinations
can be used for playback of speech fragments:
You can change signal playback frequency in the Playback►Options menu. After you
press OK, the signal in the current window will be played at a specified frequency. To
reset the frequency, enter the Playback►Options menu again and quit it by pressing
Cancel (or <ESC>).
The Options menu also allows you to set the time interval (pause) between the
playback of segments in the current window. Press the <PLAY> button to start
playback right from the Playback►Options menu.
Without quitting the playback mode (in pause of playback) you can also set one or
several constant marks by pressing <Insert>, scroll image, change size and location of
boxes. To assign names to the marks, use the Marks►Marks list menu. To resume
playback, just press the <spacebar> again.
The default window length is 5 sec (2.5 sec to the left and 2.5 sec to the right of the
active cursor) and can be changed within the 0.1 to 100 sec range in the
Options►Extra options menu (Data view size (sec) for speech visualization). To
resume playback, just press the <spacebar> again.
Using the button on the right side of each signal window you can control the playback
speed. When you place the mouse cursor on this button, a pop-up box shows the
current value of the playback speed coefficient (a value of 1.00 corresponds to the
original playback speed).
To change the playback speed, press the left mouse button on the speed control button and
move the mouse cursor to the right (to speed up playback) or to the left (to slow down
playback) without releasing the button. The playback speed coefficient can be chosen from
the 0.33..3.03 range. When you release the left mouse button, the current coefficient
value will apply to the playback speed until you change it again.
9 SAVING DATA
- if a *.DAT file was loaded, the whole data segment with marks and comments is
saved as a *.DAT file;
- if a *.WAV file was loaded, the whole data segment is saved without marks and
comments;
A window can include several segments, but only the top (active) segment is saved. The
name of the active segment comes first (on the left) in the window name. To make
any other segment in the window active, click its name with the right mouse button.
When you select the Data►Save as command, a Save As dialog box will appear on the
screen, where you have to specify the location, name and type of the file to be
saved. In addition to the standard options, it has the Now you work with group box,
which specifies the type of data fragment to be saved. You can modify the currently
indicated type using the buttons available under the fragment type field:
- all data;
- highlighted;
- between temporary marks;
- visible in window;
- selected.
The file types suggested for saving correspond to the types of data being saved (see
standard extensions in section 6.2).
The type of the fragment to be saved is shown under the Now you work with field. It can
be changed using the five buttons below it. To save a whole data segment, choose All
data as the type of processing fragment.
To the right of the fragment type selection area, there are two check-boxes: Save
comments and Save marks. If the signal contains no marks or comments, these
options will be dimmed.
Remember that only marks inside the saved fragment or at its boundaries are saved.
If the signal contains constant marks and/or textual comments, these check-boxes will
be accessible and the user can tick them to store this information with
the sound data.
Remember that marks, comments and other overhead information are saved only in
*.DAT files.
To save the data (fragment or segment) as text, use the Data►Text export option
available in the main menu. The Save as window described in the section 9.1 will
appear on the screen.
Each file is saved in the text format described in section 6.2. The data will be converted
into text without accuracy loss.
If several empty windows or windows with different types of visualization are opened,
you can save this state of processing with all its settings using the Data►Save current
state menu command.
Later you can continue working with the saved settings using the Data►Load saved
state menu command.
10 ZOOMING AND SCROLLING DATA IN THE WINDOW
As soon as the data box appears in the window, you can make use of the zooming and
scrolling tools: a data scroll bar, buttons for horizontal and vertical zooming (double-
headed-arrows) and the third axis zoom button for 3-dimensional data.
DEFINITIONS:
A segment is a chunk of data which forms an entity and is not connected with other
data. For example, data read from a file will form a segment. Similarly, all data written
from the sound card within one session, upon the completion of sound input, will form
one segment. Each new segment in the window is represented by a different color, as
long as the number of colors is sufficient.
A fragment is a part of data which is singled out in some way from the segment but
has not lost its connection with the rest of the data. It can be, for example, some part of
a segment limited by temporary marks, or part of a segment in the highlighted interval
between constant marks, or part of a segment visible in the box.
(1) Select the Show►Change box position/length command in the menu and
modify the box left and right boundaries in seconds. This may change the position and
the width of the box. If, contrary to your expectations, the data disappears from the
box, use Show►All data (or press <F8>) to display all loaded data. If this does not
help, expand the upper and lower limits of the box (Show►Change box
position/height) or adjust them using the Show►Y-autoscaling command or press
<5> on the number pad.
(2-3) Use the horizontal scroll bar. The functions of the SIS scroll bar are similar to those
of a standard Windows scroll bar, but it also has some specific features. The scroll bar is
located at the bottom of the data box.
If you activate the left or the right arrow of the scroll bar, the coordinate
boundaries (seconds) of the horizontal axis in the box will be shifted by 3/4 of the
current box width and the data will be redrawn (3/4 value can be changed in the
Options►Extra options menu item).
The black marker in the middle part of the scroll bar can also be used to shift
the data. For smooth scrolling, drag the marker with the left mouse button held down.
If the horizontal box boundaries don't exceed the data boundaries, the total width of the
scroll bar corresponds to the total length of data; and the width of its black part
(marker) corresponds to the width and position of the box respectively. If you activate
any point on the scroll bar middle part using the mouse, the box boundaries will be
changed so that the left border of the black marker will be precisely in this activated
point. You should activate the left border of the scroll bar middle part to see the
beginning of the data. If the horizontal box boundaries exceed the data boundaries,
then the left border of the scroll bar middle part corresponds to the minimum of the left
box boundary and the left data boundary in the box; and the right border of the scroll
bar middle part corresponds to the maximum of the right box boundary and the right
data boundary in the box.
If there is an area without sound data inside the windowed selection, it will be marked
by a thin horizontal line at half height of the marker between the scroll arrows.
You can change the shift step of the visible area in the box using the Options►Extra
options menu (Data view shift). The shift value is always set as a proportion of the
visible area width and can be changed within the range from 0.001 to 1.0.
If you press <ESC>, the rectangle will disappear without any changes.
If you click with the mouse on the border between the black and grey fields, no changes
will take place.
If you click on the black area of the rectangle, the width of the box (in seconds) will be
reduced according to the proportion between the distance from the left boundary of the
black area to the cursor and the whole length of the black area.
If you click on the grey area, the width of the box (in seconds) will be enlarged
according to the proportion between the distance from the left boundary of the black
area to the cursor and the whole length of the black area.
Thus, without entering the menu, you can increase the width of the box by 3 times and
reduce it by 15 times. You can make several clicks to obtain the desired zoom degree.
The third way of rescaling the image in the data box is to use the mouse wheel. Move
the mouse pointer to the horizontal axis until the pointer takes the form of the double-
headed arrow, and rotate the mouse wheel to spread or compress the signal image in
the box.
• All data. The whole current segment will be displayed. Alternatively, you can press
<F8>.
• From temporary marks to end. Data from the right temporary mark to the end
will be displayed.
• From start to temporary marks. Data from the beginning to the left temporary
mark will be displayed.
If you press <ESC>, the rectangle will disappear without any changes.
If you click with the mouse on the border between the black and grey fields, no changes
will take place either.
If you click on the black area of the rectangle, the height of the box will be reduced (i.e.
the image will be enlarged) according to the proportion between the distance from the
left boundary of the black area to the cursor and the whole length of the black area.
If you click on the grey area, the height of the box will be enlarged (i.e. the image will
shrink) according to the proportion between the distance from the left boundary of the
black area to the cursor and the whole length of the black area.
Another way to modify the top and bottom borders of the data box is by using the
Show►Change box position/height menu. The values of the top and bottom box
boundaries are given in the same coordinates as those currently used for data display in
the box.
You can also use the Show►Y-autoscaling command or press <5> on the number pad
to adjust the bottom and top borders of the box to the minimum and maximum signal
amplitude in the window.
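As a rough illustration of Y-autoscaling, the sketch below finds the new bottom and top box limits from the samples that fall inside the visible time interval. The function name y_autoscale and its parameters are assumptions made for this example; how SIS itself selects the samples is not described here.

```python
def y_autoscale(samples, sample_rate, left, right):
    """Return (bottom, top) limits matching the visible signal extremes.

    samples:     amplitude values of one segment.
    sample_rate: samples per second.
    left, right: visible box boundaries in seconds.
    """
    first = max(0, int(left * sample_rate))
    last = min(len(samples), int(right * sample_rate) + 1)
    visible = samples[first:last]
    if not visible:
        raise ValueError("no data inside the box")
    return min(visible), max(visible)


limits = y_autoscale([0, 5, -3, 9, -8, 2], 1.0, 1, 3)
```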
Another way of rescaling the image in the data box is to use the mouse wheel. Move the
mouse pointer to the vertical axis until the pointer takes the form of the double-headed
arrow, and rotate the mouse wheel to spread or compress the signal image in the box.
• Shift the data box upwards/downwards without resizing it. If one of the horizontal
lines of the frame goes out of the window boundaries during the shifting, it
becomes invisible on the screen.
• Resize the data box by modifying its bottom and right boundaries. To do this,
press and hold down the left mouse button. Now you can move the mouse either
up – to reduce data box height, or down – to enlarge it. Similarly, without
releasing the mouse button, you can modify the data box length by dragging its
right boundary. You can only reduce the box length relative to its current length in
this mode. To enlarge it, use the horizontal zoom control described in section
10.2. You can simultaneously modify the box height and length by dragging the
bottom right corner of the frame diagonally across the window, while the
top left corner of the frame remains fixed.
• After you have sized the data box as desired, you can change its position. Release the
left mouse button and move the frame with the mouse until it occupies the desired
position on the screen. The left and right boundaries will always remain within
the current data box limits, while the top and bottom coordinates can exceed them.
After you have positioned the frame as desired, click the right mouse button
and the data box will be modified accordingly.
If you press <ESC>, you will quit the mode without any changes.
10.6 Zoom mode
The Zoom mode is intended for a detailed examination of any part of the data in the
box. To use this mode, choose the Show►Zoom menu item.
If the active window contains 3-D data (spectrogram, cepstra, etc.), set at least one
temporary mark in the area of interest.
The Zoom mode is not used for spatial representations of data (axonometry,
L right/up deviation).
If the active window contains 2-D data (for example, a waveform), it is not necessary to
set marks in it. After selecting the Show►Zoom you will see a dashed frame appearing
on the screen. You can resize and position it to include the desired piece of data using
the mouse and arrow keys (similarly to the sizing of a newly created window - see 4.5
for details). After you modified the frame size and position the desired way, click the
right mouse button. A new window will appear containing the zoomed data. All
segments located inside the frame will be represented in the zoom window.
If you zoom on 3-D data (e.g. spectrogram) no frame will appear. To specify the point
you wish to zoom on, set one or two temporary marks there. You can move along the
segment in the zoom window (up, down, forward, backward) using the arrow keys. If
you press the <+> key on the number pad, the previous slice will be displayed in the
zoom window. No more than 4 slices, each in a different color, can be displayed in the
window simultaneously. The slice shift interval relative to the first slice (in msec) is used
for the slice name.
For stereo segments both channels will be represented (the right one has a darker
color). You can move along the segment (up, down, forward, backward) using arrow
keys. Each pressing will shift the cursor by one frame size.
While working in the Zoom mode, the mouse cursor is located within the zoom window
and you can't actually use the main menu. Therefore SIS provides a special set of hot
keys and key combinations for basic menu commands. Using these keys you can do the
following:
Key      Function
Ctrl/s   Shifts the frame by 3/4 of its width to the left (not used for "visible speech")
Ctrl/t   Shifts the frame by 3/4 of its width to the right (not used for "visible speech")
The message line (at the bottom of the screen) lists the most frequently used hotkeys.
While in the "zoom" window, you can make use of the buttons available for other
window types as well. Thus, when viewing 2-D data (waveform, pitch) you can press the
cursor source button, move the cursor to the required place and set a constant mark
(pressing <Insert>). The constant mark will appear both in the zoom and in the source
window. For 3-D data the mark will indicate the current section of the "visible speech".
In 2-D data windows the mark will appear in the position corresponding to the middle
point of the zoom window.
Using the contextual menu (right mouse button), you can switch to the main
window to perform standard actions (listening, scaling, etc.) without closing the “zoom”
window and losing its settings (size, shift). Later you can return from the
main window to the “zoom” window.
“Zoomed” fragments can be played in the following modes (the 0.5 sec standoff can be changed
in the options that appear when you press the right mouse button):
• Data visible in the “zoom” window (with 0.5 sec before and after, if necessary).
• Data between temporary marks in the “zoom” window (with 0.5 sec before and
0.5 sec after the selection, if necessary).
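The playback standoff can be pictured with the following sketch. It assumes that the played interval is simply widened by the standoff on both sides and then limited to the available data; this is one reading of "if necessary", and the function name playback_interval is hypothetical.

```python
def playback_interval(start, end, data_start, data_end, standoff=0.5):
    """Widen [start, end] by `standoff` seconds, clamped to the data limits."""
    play_from = max(data_start, start - standoff)
    play_to = min(data_end, end + standoff)
    return play_from, play_to


# Selection well inside a 10-second segment.
inner = playback_interval(2.0, 3.0, 0.0, 10.0)
# Selection close to both ends of the segment: the standoff is cut off.
edge = playback_interval(0.2, 9.8, 0.0, 10.0)
```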
In the Options►Extra options you can change the step of visible data shift performed
by pressing right/left arrow keys in the Zoom mode. By default this value is 0.25 of the
visible area.
To enable editing, open the signal in the Zoom mode and press the right mouse
button. Then in the pop-up menu select Edit mode►Draw or Edit mode►Erase.
Once in the Draw mode, you can change the amplitude value of the signal by dragging
with the left mouse button pressed. The signal is corrected according to the principle of
linear interpolation.
Once in the Erase mode, you can use the left mouse button to erase the signal. Erasing
means that the amplitude value is set to zero. To adjust the size of the eraser,
rotate the mouse wheel.
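The two editing modes can be illustrated by a small sketch. It assumes that Draw replaces the samples between two successive pointer positions with a straight line, and that Erase zeroes the samples covered by the eraser; the function names and the half_width parameter are hypothetical and are not taken from SIS.

```python
def draw(samples, i0, v0, i1, v1):
    """Overwrite samples between two pointer positions with a straight line."""
    if i0 > i1:
        i0, v0, i1, v1 = i1, v1, i0, v0
    span = i1 - i0
    for i in range(i0, i1 + 1):
        t = (i - i0) / span if span else 0.0
        samples[i] = v0 + t * (v1 - v0)  # linear interpolation
    return samples


def erase(samples, center, half_width):
    """Set to zero all samples covered by an eraser centred on `center`."""
    first = max(0, center - half_width)
    last = min(len(samples), center + half_width + 1)
    for i in range(first, last):
        samples[i] = 0
    return samples


drawn = draw([0.0] * 5, 0, 0.0, 4, 8.0)
erased = erase([1, 2, 3, 4, 5], 2, 1)
```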
The waveform and the pitch must be shown in the linear scale (along the vertical axis),
while spectra and other signals may be shown in the dB scale; the signal value in the
count is shown in dB and ordinary units.
To disable editing, deselect the corresponding item (Draw or Erase) in the pop-up
menu. To quit the Zoom mode, press <ESC>.
10.8 Visualization options
If you enter the Options►Extra options menu, you will be able to change the
following parameters:
• Visible data shift step (upon clicking on the arrows of the scrolling
indicator/control in any window). The default value is 0.75 of the visible area
width. The value is always set proportionally to the visible area width in the [0.001
- 1.0] range.
• Visible data shift step for the Zoom mode (upon pressing the arrow keys). The
default value is 0.25 of the visible area width.
• Visible data size (upon pressing <F9> in the playback mode). The default value is
5 sec (2.5 sec to the left and 2.5 sec to the right of the stop point) and can be
changed in the [0.1 - 100 sec] range.
• Scaling width is fixed. This mode makes the scaling width equal for all
windows. Scale divisions of the vertical axis will expand until the numbers at the
graduations exceed 5 digits (with floating point). Then the scaling width will increase
again.
• Shift data to zero after copying. When data is copied from one window to another,
it will be shifted to zero. When data is deleted from the beginning of a segment,
the remaining data will also be shifted to zero. If this option is not selected, the coordinate
position of the beginning of the data does not change.
• Playback process visualization. The cursor will follow the point of playback.
• Change box limits after copying. Each time a new file is read, the data will
be automatically scaled along the vertical axis. If this option is not selected,
autoscaling will be performed only when reading the first file.
• Mode COPY for drawing. In this mode upper segments will cover lower segments
during drawing. Alternatively, in the OR mode, the upper segments will be
transparent.
• To cycle highlighted data copying. The highlighted data copying dialog will appear
again after copying. The highlighted region between the constant marks can be edited
without exiting the copying dialog window.
• Use hours and minutes for time scaling.
11 OPERATIONS WITH CURSORS USING TEMPORARY
AND CONSTANT MARKS FOR SELECTING
DATA FRAGMENTS
To evoke the cursor, either click with the left mouse button on the cursor source button
located in the top right corner of the current window, or double-click with the left
mouse button on any place of working area of the window. The mouse cursor will
disappear, and a yellow dashed line (cursor) will appear in the middle of the box. This
line will follow the horizontal mouse movements. A message line displaying the X-
coordinate of the window cursor will appear at the top of the box. When a message line
appears, the data box will be reduced in size and redrawn.
"Visible speech" (spectrogram, cepstrum etc.) may be redrawn rather slowly when
an empty message line is inserted. To speed up work, select Add comments to
"Visible Speech" in the Options►Extra options menu. An empty message line will then
be created simultaneously with the creation of visible speech.
To quit the mode, press the middle (right) mouse button or <ESC>. The standard
mouse cursor will appear on the screen, and the vertical cursor in the box will
disappear.
You can alternatively evoke the vertical cursor using the Marks►Set cursor/mark
command. The cursor position along the X-coordinate will be shown in the top left
corner of the window. If there are suitable signals in the box, the coordinates of the current
signal at the given point will also be shown, separated by a colon.
To quit the mode, press <ESC>. The standard mouse cursor will appear on the screen,
and the vertical cursor in the box will disappear.
To set a temporary mark, evoke the cursor pressing the cursor source button, move the
cursor to the desired position and click the left mouse button or press
<Ctrl>/<Insert> key combination. The X-coordinate of the mark will be displayed in
the comments line.
If there are two temporary marks already set in the window, the older mark will
disappear. If you press <Ctrl>/<Delete>, the temporary mark nearest to the cursor
will disappear.
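The behavior of temporary marks (at most two, the older one dropped, the nearest one deleted) can be modelled with a short sketch. The class name TemporaryMarks and its methods are invented for this illustration and do not correspond to any SIS interface.

```python
from collections import deque


class TemporaryMarks:
    """Holds at most two temporary marks; a third one pushes out the oldest."""

    def __init__(self):
        self._marks = deque(maxlen=2)

    def set(self, x):
        """Set a mark at X-coordinate x (Ctrl/Insert or left click)."""
        self._marks.append(x)

    def delete_nearest(self, cursor_x):
        """Remove the mark nearest to the cursor (Ctrl/Delete)."""
        if self._marks:
            nearest = min(self._marks, key=lambda m: abs(m - cursor_x))
            self._marks.remove(nearest)

    def positions(self):
        return sorted(self._marks)


marks = TemporaryMarks()
for x in (1.0, 2.0, 3.0):   # the third mark replaces the oldest one (1.0)
    marks.set(x)
after_third = marks.positions()
marks.delete_nearest(2.9)   # cursor is closest to the mark at 3.0
after_delete = marks.positions()
```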
Beginning from version 6.2.1, the SIS system offers a new mode for operating with
temporary marks. To enable this mode, in the Options►Extra options menu tick the
Select signal between temporary marks item. To set a temporary mark, just move
the mouse pointer to the desired position and click the left mouse button. The second
mark can be set in the same way. If the data box already contains two temporary
marks, the older one will disappear. The selected fragment between temporary marks is
displayed as a contrasting area.
To select a fragment you can also press and hold down the left mouse button and drag the
pointer to the desired position. The selected area is marked out by color. The left and right
edges of the selection correspond to the positions of temporary marks. The positions of
the edges of the selected fragment can be changed by dragging them with the left
mouse button.
To set a constant (permanent) mark, evoke the cursor, move it to the desired position and press
<Insert>.
If you press <Delete>, the constant mark nearest to the cursor will disappear. To
delete all marks simultaneously, select Marks►Delete all marks.
Beginning from version 6.2.1, the SIS system offers a new mode for operating with
constant marks. To enable this mode in the Options►Extra options menu, tick the
Select signal between temporary marks item. To set a constant mark, move the
mouse pointer to the desired position and press <Insert> key.
Any fragment between two constant marks can be highlighted. For that, just click with
the right mouse button anywhere within the marked region and select Make
highlighted item in the context menu. You will see yellow highlighting appear above
the selected fragment. To remove this highlighting, move the cursor to any place of the
highlighted fragment, press the right mouse button and select Make highlighted menu
item once more.
11.4 Marks list
All constant marks in the window are entered into the marks list. You can view this list
via the Marks►Marks list menu. The window containing the marks list additional menu
will appear on the screen (see Fig. 19).
• Highlighting index (for the fragment from the current to the following mark).
• Length of the fragment between the current and the following mark.
Apart from the marks list proper, there are some additional fields allowing you to select
and deselect certain marks. If you have a highlighted fragment in your data, the
coordinates of the beginning and end points of the highlighted area, as well as its overall
duration will be indicated.
Besides, each mark within the highlighted fragment will be denoted with one of the
following characters (depending on its position in the highlighted area): s (start) –
starting mark, h (highlighted) – all marks between the highlighted fragment’s
boundaries, f (finish) – final mark. These symbols appear between the mark’s
coordinate and signature fields in the marks list.
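The s/h/f notation can be reproduced by the following sketch, which assumes that the highlighted fragment starts and ends exactly at constant marks. The function name label_marks is hypothetical.

```python
def label_marks(marks, highlight_start, highlight_end):
    """Return (coordinate, label) pairs for a sorted list of constant marks.

    Labels: 's' for the starting mark, 'f' for the final mark, 'h' for marks
    strictly inside the highlighted fragment, '' for marks outside it.
    """
    labelled = []
    for x in sorted(marks):
        if x == highlight_start:
            label = "s"
        elif x == highlight_end:
            label = "f"
        elif highlight_start < x < highlight_end:
            label = "h"
        else:
            label = ""
        labelled.append((x, label))
    return labelled


table = label_marks([0.5, 1.0, 1.4, 2.0, 3.0], 1.0, 2.0)
```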
If the mark is currently inside the data box, you will see the V symbol in the View field.
If you click on this symbol with the left mouse button, the corresponding data fragment
(from the current to the following mark) will be displayed.
To select a mark in the marks list, activate the check-box in the desired line by pressing
<spacebar> or clicking the left mouse button.
To save marks list to a text file, use the <Export marks list> button. The following
dialog window will appear:
This window contains a hint on how to convert a text file into a table using
Microsoft Word. By checking Save selected only, you can exclude unnecessary
marks from saving. After pressing OK, you will be asked to choose a name and destination
folder for the text file.
“Visible speech” (spectrogram, cepstrum etc.) may be redrawn rather slowly when
an empty message line is inserted. To speed up work, select Add comments to
“Visible Speech” in the Options►Extra options menu. An empty message line
will then be created simultaneously with the creation of visible speech.
If you press the left mouse button, the temporary horizontal mark will appear in the
cursor position, and its coordinate position will be shown in the message line to the right
of the current cursor position value.
If you press the left mouse button once more, the second temporary horizontal mark
will appear.
Marks will appear until their total amount exceeds the Number of horizontal marks
parameter value. This parameter can be varied in the range from 2 to 5 using the
Options►Extra options menu. If you press the left mouse button again, the nearest
mark will be replaced by the new one, and the other marks remain unchanged.
Coordinate positions of the marks will be displayed in the message line to the right of
the current cursor position value in ascending order. Temporary horizontal marks are
set in the linked windows simultaneously, but their values are represented only if a
suitable comment line already exists.
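The rule for temporary horizontal marks can be sketched as follows: marks are added until the configured number is reached, after which each new mark replaces the nearest existing one. The function name set_horizontal_mark is an assumption made for the example.

```python
def set_horizontal_mark(marks, y, max_marks=2):
    """Return the new list of horizontal marks, in ascending order.

    max_marks corresponds to the Number of horizontal marks parameter (2 to 5).
    """
    if not 2 <= max_marks <= 5:
        raise ValueError("Number of horizontal marks must be from 2 to 5")
    marks = list(marks)
    if len(marks) < max_marks:
        marks.append(y)
    else:
        nearest = min(range(len(marks)), key=lambda i: abs(marks[i] - y))
        marks[nearest] = y  # only the nearest mark is replaced
    return sorted(marks)


marks = []
for y in (10, 90, 50):
    marks = set_horizontal_mark(marks, y, max_marks=3)
filled = list(marks)
replaced = set_horizontal_mark(marks, 55, max_marks=3)
```

The sorted result matches the ascending order in which the coordinates are listed in the message line.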
After pressing the right (middle) mouse button or <ESC>, the horizontal cursor
disappears and the mouse cursor appears again. The comments line and temporary
mark values will remain in the window.
It should be kept in mind, however, that marks and comments (as well as all other
service information) can be saved only for *.DAT sound files.
12 GENERATING A TEST SIGNAL
The system provides an opportunity to form test signals of several types with various
parameters (amplitude, period, frequency). To generate a test waveform, select
Data►Generate test waveform. The following menu will appear on the screen
(Fig. 21):
• Choose between mono/stereo data format (by either enabling or disabling Stereo
option).
• Specify the signal bit capacity (by either enabling or disabling 24-bits signal).
Note that waveform type selection is performed by setting its magnitude to any value
above 0. You can generate a mixed signal by selecting several waveform types. All
signals with amplitudes above zero will be mixed during generation.
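The mixing rule (every waveform type with a magnitude above zero contributes to the result) can be illustrated with a sketch. The two waveform types used here, sine and square, and the function name generate_test_signal are assumptions for the example only; the actual list of types is the one shown in the menu.

```python
import math


def generate_test_signal(n_samples, sample_rate, components):
    """Mix several test waveforms into one signal.

    components: list of (kind, amplitude, frequency_hz) tuples, where kind is
    'sine' or 'square'. Components with amplitude <= 0 are not selected.
    """
    out = [0.0] * n_samples
    for kind, amplitude, freq in components:
        if amplitude <= 0:
            continue  # zero magnitude means this type is switched off
        for i in range(n_samples):
            s = math.sin(2 * math.pi * freq * i / sample_rate)
            if kind == "square":
                s = 1.0 if s >= 0 else -1.0
            out[i] += amplitude * s
    return out


only_sine = generate_test_signal(80, 8000, [("sine", 100.0, 1000.0)])
with_off = generate_test_signal(
    80, 8000, [("sine", 100.0, 1000.0), ("square", 0.0, 50.0)])
mixed = generate_test_signal(
    80, 8000, [("sine", 100.0, 1000.0), ("square", 10.0, 50.0)])
```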
Having selected the desired signal type and having set all necessary parameters for
signal generation, press OK. The generated signal will be drawn in the active window.
13 DATA PROCESSING
To edit the data, enter the Edit menu using the keyboard or the mouse. The drop-down
menu lists all available operations.
The Now you work with item at the top of the list allows you to choose the data fragment
to be processed. The currently selected type is indicated below.
1. Move the mouse cursor to the name of the necessary segment in the active
window and click the second (right) mouse button. The selected segment will
become active and the window name will be redrawn.
2. Move the mouse cursor to the symbol in the active window name line and click
the right (middle) mouse button. The bottom segment in the window will become
active, while the order of other segments will not change.
3. Select Data►Activate next and thus activate the next segment (second from
above) or select the menu Data►Activate previous to activate the previous
(bottom) segment.
The user can choose to edit not the whole segment, but some part of it. The currently
specified data type to be processed is indicated under the Now you work with entry at
the top of the drop-down menu. To change fragment type, enter this field and choose
the desired fragment type from the list. The following options are available:
• All data. The whole current segment of the active window will be processed.
• Visible in window. Part of the current segment displayed in the box will be
processed.
You can also change the fragment type by pressing the Data interval selection button
located at the right part of the window.
The specified fragment type will be immediately displayed under the Now you work
with entry.
• Before playback – to bring the signal into conformity with the bit range of the
digital-analog converter or to increase the volume.
• Specify data fragment type for processing. This can be done by pressing the Now
you work with field at the top of the menu. A list of available options will appear,
identical to that described in section 13.1. Fragment selection can also be
performed using the buttons below:
- all data;
- highlighted;
- between temporary marks;
- visible in window;
- selected.
To apply normalization to all segments of the active window, check All segments in
window option.
After pressing OK, normalization will be performed and active window will be redrawn.
The top fields are intended for data fragment selection and are described in sections
13.1 and 13.2.
" - " – subtraction of constant (the constant is subtracted from the signal),
You should choose one of the operations and enter the necessary constant value in the
field located to the right of the Signal field under the Constant field.
Then, if you process a stereo signal, there will be the To be processed field with a
button next to it for channel selection. By pressing this button you can change the
processing mode. The one currently selected appears on the button. Three modes are
available:
• Both channels.
• Left channel.
• Right channel.
Channel selection option is not available for mono signals – you will see dashed lines in
place of these fields.
To apply the chosen operation with constant to all segments of the active window, check
the All segments in window option.
Press OK to start the process. After the operation has been performed, the active
window will be redrawn.
When processing 16-bit signals you should keep in mind that if the constant value is too
big, so that the result of the operation in one of the counts exceeds the bit range of the
integer (-32767, 32767), you will see the message: "Overflow. Request killed"
displayed in the message line, and the operation will not run. This doesn’t apply to
24-bit signals. For signal bit accuracy transformation see section 13.19.
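The overflow rule for 16-bit signals can be sketched for the subtraction case. The sketch checks every count before changing anything, so the operation either runs completely or not at all. The function name subtract_constant is hypothetical; the limit of 32767 and the message text are the ones quoted above.

```python
INT16_LIMIT = 32767  # bit range quoted in the text: (-32767, 32767)


def subtract_constant(samples, constant):
    """Subtract a constant from every count of a 16-bit signal.

    Raises OverflowError and leaves the input untouched if any result
    would fall outside the 16-bit range.
    """
    result = [s - constant for s in samples]
    if any(abs(r) > INT16_LIMIT for r in result):
        raise OverflowError("Overflow. Request killed")
    return result


shifted = subtract_constant([100, -200], 50)
```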
The top fields are intended for data fragment selection and are described in sections
13.1 and 13.2.
Linear transformation consists in multiplying the signal by a linear function defined by
its values at the left and right edges of the fragment. These values are set in the
Multiplier on left edge and Multiplier on right edge fields.
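A minimal sketch of such a transformation is given below. It assumes that the multiplier changes linearly from the left-edge value at the first count to the right-edge value at the last count; the function name linear_transform is made up for the illustration.

```python
def linear_transform(samples, left_multiplier, right_multiplier):
    """Multiply the signal by a multiplier varying linearly across it."""
    n = len(samples)
    result = []
    for i, s in enumerate(samples):
        t = i / (n - 1) if n > 1 else 0.0
        multiplier = left_multiplier + t * (right_multiplier - left_multiplier)
        result.append(s * multiplier)
    return result


# A fade-in: multiplier 0 on the left edge and 1 on the right edge.
fade_in = linear_transform([10, 10, 10, 10, 10], 0.0, 1.0)
```

With multipliers 1 and 0 the same function produces a fade-out.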
To apply linear transformation to all segments of the active window, check the
All segments in window option.
Press OK to start the process. After the operation has been performed, the active
window will be redrawn.
When processing 16-bit signals, keep in mind that if a multiplier value is too
big, so that the result of the operation in one of the counts exceeds the bit range of the
integer (-32767, 32767), you will see the message: "Overflow. Request killed"
displayed in the message line, and the operation will not run. This doesn’t apply to
24-bit signals. For signal bit accuracy transformation see section 13.19.
Figure 24. The Linear transformation menu
If you press OK, the fragment or the whole current segment will be immediately deleted
and the window will be redrawn. If the fragment is deleted, all the next nearest data will
be moved to the released place, so the segment will become continuous again, but the
position of the temporary marks will not change and they will mark other data chunks of
the current segment. Constant marks will be processed along with the signal if the To
process marks option is selected. If this option is not selected, the position of the
constant marks will not change.
To perform this operation, enter the Edit►Append menu. The operation menu will
appear on the screen (Fig. 25).
The top fields are intended for data fragment selection and are described in sections
13.1 and 13.2.
The lower part of the menu is intended for destination window selection and is similar to
the source–destination menu type described in section 4.6.
Figure 25. The Append menu
After pressing OK the operation will be performed and the destination window will be
updated. The only thing that can prevent this operation from being performed is an
unsuitable destination window type.
If you append a stereo signal to a mono signal, the data of the left and right channels will
be mixed. If you append a mono signal to a stereo signal, the same data will be appended
to both channels.
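The two conversions can be pictured with the sketch below. It assumes that "mixed" means the average of the two channels; the exact mixing formula used by SIS is not stated here, and both function names are hypothetical.

```python
def stereo_to_mono(left, right):
    """Mix two channels into one (assumed here to be their average)."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def mono_to_stereo(mono):
    """Copy the same data into both channels."""
    return list(mono), list(mono)


mono = stereo_to_mono([2, 4], [4, 8])
left, right = mono_to_stereo([1, 2])
```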
The menu contains 5 fields for selecting the place in the target signal for inserting the
specified data fragment:
• Before temporary mark. If there are two temporary marks in the destination
window, the data will be inserted before the left temporary mark.
• Before highlighted.
• After highlighted.
• To signal start.
Select the required option with the mouse cursor or using the arrow keys and press OK.
The insertion will then be performed, provided the destination window has a destination
segment of the same type as the source segment. The data inserted in the destination
window will not be deleted from the source window. All the data following the point of
insertion will be moved to the right (to greater abscissa values) to make room for
the insertion, but the positions of the temporary and constant marks will not change,
so they will mark other data of the destination segment.
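The effect of an insertion on the marks can be seen in a small sketch that treats the segment as a list of counts and the marks as count indices. The function name insert_fragment and this index-based representation are assumptions for the illustration.

```python
def insert_fragment(destination, fragment, index, marks):
    """Insert `fragment` before position `index`; mark positions stay as they were."""
    result = destination[:index] + list(fragment) + destination[index:]
    return result, list(marks)


segment = [1, 2, 3, 4]
marks = [3]                      # the mark initially points at the value 4
new_segment, new_marks = insert_fragment(segment, [9, 9], 2, marks)
value_under_mark = new_segment[new_marks[0]]
```

After the insertion the mark keeps its coordinate but now points at one of the inserted counts instead of the original value.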
If you insert stereo signal into mono signal, the data of the left and right channels will
be mixed. If you insert mono signal into stereo signal, the same data will be inserted
into both channels.
After you press OK, the operation will be performed. Regardless of the source data type,
the copied data will form a uniform segment with the destination window segment. All
constant marks will be copied with the data and placed in the new segment at the same
points as in the old one.
Fragments of waveforms and pitch curves are copied with shifting to zero by
default (the left border of each copied fragment is set to zero). You can change this
behavior by entering the Options►Extra options menu and unchecking the Shift
data to zero after copying field. If this option is not checked, the fragment’s horizontal
borders will not be changed after copying. This may be convenient, but scattered segments
can be hard to find. To display the whole upper segment, use the <F8> hot key.
Here you have to create or choose a destination window and then press OK.
“Moving” consists in re-assigning the current segment to another window; data is not
copied. It saves a lot of time, especially when you work with long segments. The
destination window becomes active after segment moving.
This operation is not applicable to fragments of data – only entire segments can be
moved.
For example, to copy (cut) and subsequently paste a data fragment, you can do the
following:
• Click on it with the right mouse button and choose the Copy fragment (Cut
fragment) command in the contextual menu that appears.
• Click the right mouse button in the destination window where the copied (cut)
fragment is to be pasted and choose the Paste fragment command in the
contextual menu.
• In the dialog window, choose the point of fragment insertion (see Figure 25) and
confirm the command.
The signal is displayed with temporary marks and is shown in the center of the window.
The image scale is changed so that the inserted fragment occupies 50% of the window
length.
To smooth the data, select the Edit►Data smoothing menu item. The operation menu
will appear on the screen (Fig. 28).
The top fields are intended for data fragment selection and are described in sections
13.1 and 13.2.
Figure 28. The Data smoothing menu
Using the Create new segment field, you can confirm or cancel the creation of a new
segment. If you confirm the new segment creation (tick the field), the
smoothed spectrum will be drawn in the window in a different color together with the
processed segment; if this mode is cancelled, only the smoothed spectrum will be drawn
in the active window, and the processed spectrum will not be shown in the
window.
Using the Geometrical average field, you can transfer the segment smoothing from
the linear domain into the logarithmic one. If this mode is enabled (ticked), the
logarithm of the data will be taken before smoothing, and the inverse operation will be
performed after it. The data must not have negative values for this mode to work.
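The effect of the Geometrical average mode can be illustrated as follows. Note that this sketch swaps in a plain moving average in place of the sliding polynomial smoothing used by SIS, purely to keep the example short, and it requires strictly positive data because the logarithm of zero is undefined. All names are hypothetical.

```python
import math


def moving_average(values, length):
    """Simple centred moving average (a stand-in for the real smoother)."""
    half = length // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def smooth(values, length, geometrical=False):
    """Smooth in the linear domain, or in the logarithmic one if requested."""
    if not geometrical:
        return moving_average(values, length)
    if any(v <= 0 for v in values):
        raise ValueError("geometrical average needs positive data")
    logs = [math.log(v) for v in values]                        # logarithm first
    return [math.exp(v) for v in moving_average(logs, length)]  # inverse after


linear = smooth([1.0, 100.0, 1.0], 3)
geometric = smooth([1.0, 100.0, 1.0], 3, geometrical=True)
```

In the logarithmic domain a single large peak pulls the smoothed value up far less than in the linear domain.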
There is the Filter length submenu, which lets you choose the necessary smoothing
window length from the list. The window length is specified both in counts and in Hertz.
The minimum smoothing practically doesn't change the window length of the spectrum,
the maximum smoothing results in square polynomial of it. The approximation is
performed by the least-squares method using a method of the sliding polynomial.
The calculation starts after pressing OK. The smoothed spectrum or its fragment will be
drawn in the active window in another color.
To mix signals, load the signals to be processed into the active window and select the
Edit►Mixing menu item. The operation menu will appear on the screen (Fig. 29).
• Choose the processed fragment type: all data, highlighted data, data between
temporary marks, data visible in window, selected data.
• Set the Segment result type: current or new. If the current segment is chosen
as the result, the mixing result will be drawn instead of the current segment; if a
new segment is chosen as the result, the mixing result will be drawn on top of the
mixed signals in a different color.
• Set the length of the resulting segment (Result length): it can be set equal
either to the length of the current segment, or to the mixing interval (for example,
between temporary marks).
• Assign the desired weights to each mixed signal. There is a table in the menu
which contains all signals to be mixed and you can enter necessary weights for
each signal using the mouse and keyboard.
• Set the result signal bit capacity (Result type): 16-bits signal or 24-bits signal.
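Weighted mixing can be sketched as a weighted sum of the signals, count by count. In this illustration the result is simply cut to the shortest signal, which is a simplification of the Result length setting; the function name mix is hypothetical.

```python
def mix(signals, weights):
    """Return the weighted sum of several signals of equal sampling rate."""
    if len(signals) != len(weights):
        raise ValueError("one weight is needed for each signal")
    length = min(len(s) for s in signals)
    return [
        sum(w * s[i] for s, w in zip(signals, weights))
        for i in range(length)
    ]


mixed = mix([[1, 2, 3], [10, 20, 30]], [1.0, 0.5])
```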
13.14 Clipping
Clipping is often applied for the partial reduction of extended impulse noise (when there
are pulse bursts and each has a long duration). To perform clipping, select
the Edit►Clipping menu item. The operation menu will appear on the screen (Fig. 30).
The top fields are intended for data fragment selection and are described in sections
13.1 and 13.2.
To apply clipping to all segments of the active window, check the All segments in
window option.
After pressing OK, the operation will be performed, and the active window will be
redrawn.
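Amplitude clipping in general can be sketched as follows. The threshold argument is a hypothetical stand-in for whatever level is set in the operation menu, which this section does not describe, so treat the code as an illustration of the idea rather than of the SIS dialog.

```python
def clip(samples, threshold):
    """Limit every sample to the range [-threshold, +threshold]."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return [max(-threshold, min(threshold, s)) for s in samples]
```

Samples below the threshold pass unchanged, which is why clipping only partially reduces long pulse bursts.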
In the average spectrum of a signal with a 10000 Hz sampling rate, two
spectral peaks are seen separately if the distance between them is no less than 10 Hz (after
applying the Analysis►FFT power spectrum operation: frame size - 2048 counts,
Hann window). If you lower the sampling rate by a factor of 80 (down to 125 Hz), it is
possible to resolve peaks that are as little as 0.12 Hz apart.
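The figures above can be checked with a short calculation. The sketch assumes that two peaks are resolved when they are about two FFT bins apart; this rule of thumb is an assumption that happens to reproduce the quoted values, not a statement of how SIS defines resolution. The function names are hypothetical.

```python
def bin_spacing(sampling_rate_hz, frame_size):
    """Distance in Hz between neighbouring FFT bins."""
    return sampling_rate_hz / frame_size


def min_peak_distance(sampling_rate_hz, frame_size, bins_apart=2):
    """Smallest peak separation in Hz, assuming peaks must be bins_apart bins apart."""
    return bins_apart * bin_spacing(sampling_rate_hz, frame_size)
```

With a 2048-count frame this gives about 9.77 Hz at 10000 Hz and about 0.122 Hz at 125 Hz, in line with the 10 Hz and 0.12 Hz mentioned in the text.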
rate to a lower or higher value than the initial rate. This procedure is slower than
the first one.
If the user has selected Divide by integer, the following Sampling rate menu will
appear on the screen. This is a standard menu of the source-destination type described
in section 4.6.
• Divisor [2-102] (the value can range from 2 to 102). The new value
(in Hz) will appear in the New sampling rate field according to this value. You
should remember that the spectral range is limited to half the sampling rate.
During sampling rate division, the whole spectrum in the range from half
the new sampling rate up to half the previous sampling rate will be
suppressed by more than 72 dB. The high-frequency area of the remaining signal
spectrum (10%) falls into the transition region and will be distorted. So the
maximum undistorted frequency is shown in the Pass band frequency field
(half the new sampling rate minus 10%).
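The relation between the divisor, the new sampling rate and the pass band frequency can be sketched as below. The function name is hypothetical, and the sketch assumes that "minus 10%" means 10% of half the new sampling rate.

```python
def divided_rate(sampling_rate_hz, divisor):
    """Return (new sampling rate, pass band frequency) in Hz."""
    if not 2 <= divisor <= 102:
        raise ValueError("divisor must be in the range 2..102")
    new_rate = sampling_rate_hz / divisor
    half_rate = new_rate / 2       # upper limit of the spectral range
    pass_band = half_rate * 0.9    # assumption: 10% of half_rate is lost
    return new_rate, pass_band
```

For a 10000 Hz signal divided by 80, this gives a new rate of 125 Hz and a pass band of about 56 Hz.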
Having set necessary parameters, press OK. Operation will be performed. The active
window will be redrawn with the result segment. Pressing <ESC> you can interrupt the
operation (only after the new segment has been created).
If the user has selected the Set to arbitrary option, the following Sampling rate menu
will appear on the screen. This is a standard menu of the source-destination type
described in section 4.6. By pressing the OPTIONS field, you can enter the specific
menu for this operation (Fig. 33).
• New sampling rate. The new sampling rate value, in the range from 100 to
48000 Hz. The value must be an integer; it is set with an accuracy of 1 Hz.
• Submenu items: Fast transformation or Best accuracy. If you choose the Fast
transformation option, the processing will be performed four times faster, but
the signal within the bandwidth will be suppressed only by 55 dB. If you choose
the Best accuracy option, the signal will be suppressed by more than 75 dB.
Having set the necessary parameters, press OK. The operation will be performed. The
active window will be redrawn with the resulting segment. Pressing <ESC> you can
interrupt the operation (only after the new segment has been created).
13.16 Reverse
To reverse a signal, select the Edit►Reverse menu command. The
operation will be performed immediately and the active window will be redrawn.
Reversal inverts the order of the data points in the waveform within the given
interval. For example, if the user reverses the part of the signal between 1
and 2 sec, with data counts from 10000 to 20000, then after the operation
the data counts will be swapped as follows: 10000 with 20000, 10001 with 19999, and
so on.
It should be noted that if several neighboring intervals between the marks are selected
continuously, they will be interpreted by the reverse operation as one large interval in
the Selected data mode. And so the values will be swapped only for this large interval
but not within each of the small selected intervals.
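The reversal described above amounts to the following sketch, where first and last are the inclusive count indices of the interval (hypothetical names). Adjacent selected intervals behave like one call covering the whole span, not like several separate calls.

```python
def reverse_interval(samples, first, last):
    """Return a copy with samples[first..last] (inclusive) in reverse order."""
    if not 0 <= first <= last < len(samples):
        raise ValueError("interval is outside the signal")
    result = list(samples)
    result[first:last + 1] = result[first:last + 1][::-1]
    return result
```

Applying the function twice to the same interval restores the original signal.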
13.17 Inversion
To perform the inversion, select the Edit►Negative menu command.
For signals having the integer format (waveforms, pitch, energy), this operation means
that all the signal values within the given interval will change their sign to the opposite
(i.e. they are multiplied by (-1)).
For average spectrum, this operation means that each value of the signal within the
given interval will be replaced by its reciprocal (i.e. 1 divided by the
initial value), and thus the inverse spectrum is obtained.
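The two meanings of inversion can be written out as a short sketch. The function names are hypothetical, and the manual does not say how SIS treats zero spectrum values; here a zero simply raises ZeroDivisionError.

```python
def negate(samples):
    """Inversion for integer signals: every value changes its sign."""
    return [-s for s in samples]


def invert_spectrum(values):
    """Inversion for an average spectrum: every value becomes its reciprocal."""
    return [1.0 / v for v in values]
```

Both operations are their own inverse: applying either one twice returns the original data.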
To obtain a tempocorrected signal from the initial one, enter the menu
Edit►Tempocorrection. The Tempocorrection window will appear on the screen
(Fig. 34). This is the standard menu of the source-destination type. The work with this
menu is described in section 4.6.
• Moderation coefficient. You can choose any value in the range [0.33..3].
• Period. An assumed pitch period (in sec). The envelope of the output signal
becomes sawtooth and specific extra-sounds arise when increasing this
parameter; the clicking arises when decreasing the period.
Having set the necessary parameters, press OK. The operation will be performed, and
the active window will be redrawn.
13.19 Modulation
To perform signal modulation, select Edit►Modulation. Modulation consists in point-
by-point multiplication of two signals with a floating normalization of the result.
Modulation can be applied to 16-bit or 24-bit waveforms recorded in the mono format
and having the same sampling frequency. A 24-bit number is represented in SIS as a
32-bit number where 24 bits are for the mantissa and 8 bits for the exponent.
After you have selected this operation, a dialog window with a notification "You
modulate the active segment by second segment" will be displayed.
This option allows you to convert a 16-bit signal into a 24-bit signal or vice versa.
• To roll stereo segment channels.
• To merge two mono segments into one stereo segment (provided they have the
same sampling and bit rates).
To use these features, enter the menu Edit►Mono/stereo operations and select the
desired operation type.
The operation of merging two mono signals may be lengthy if signals start from different
time points, because the part of the data from the earlier signal will be deleted.
Other operations are pretty straightforward and don't require any further explanation.
14 NOISE REDUCTION
The Noise reduction menu is intended for filtering sound signals. Depending on the
type of noise present in the signal, the user can choose one of the options listed in the
menu:
• Adaptive filtering:
    Inverse filtering;
    Wideband noise;
    Stereo noise.
• DYNAMIC filtering.
Gain range reduction leads to the weakening of this effect, but it also leads to the
weakening of the useful signal and the suppressed (residual) interference sounds
become more distinct. So, useful signal extraction requires a compromise between the
gain level of speech and noise on the one hand and the rejected interference on the
other.
Figure 36. The Inverse filtering options menu
Time delay of filter, sec – [1..1000] sec – sets the time the correcting filter takes to adapt to
signal spectrum changes. The recommended value is 3-4 sec. For non-stationary noise and
music, 1-2 sec is recommended.
Gain, dB – [0..10] dB - sets the signal amplitude at the filter output. The recommended
value is 3-9 dB. Adjusted by ear.
Frame length (counts) – 16, 32, ...2048 - sets the spectrum resolution, the number
of spectrum bands and the duration of the processed fragment of data. For a large
number of narrowband interferences (tonal pulses etc.) the recommended value is
1024..2048.
Method version:
• Inversion.
• Contrasting.
The Inversion method version cancels high-power tonal noises, while the Contrasting
version, conversely, emphasizes frequency maxima in speech (if tonal interferences are
absent).
• Inverse timbre – if enabled, creates an addition to the timbre correction filter for
passing the signal outside the timbre correction filter passband. This allows you to
hear which part of the signal is not covered by the filter band.
Low frequency, Hz – [100..700] Hz - sets the filter lower passband frequency. The
recommended value is 200 Hz. In case of intensive low-frequency noise (hum) this
parameter should be increased.
High frequency, Hz – [2500..Fmax] Hz - sets the filter higher passband frequency. The
recommended value is 3600 Hz. In case of intensive broadband noise this parameter
should be decreased.
Low frequency gain, dB – [-24..+24] dB - amplifies the output signal in the low-
frequency area, to obtain more natural timbre.
• Speech processing.
• Music processing.
Adaptive wideband noise filtration allows cancelling noise both in speech and in pauses.
This does not improve speech intelligibility, but considerably reduces the tiring effect
during listening. This method of adaptive filtration allows you to cancel narrowband
noise types as well, which makes it quite a universal noise cancelling means. Adaptive
background enhancement is used for the unmasking of background acoustic signal
environment.
The Speech treatment mode is used for speech signal enhancement on the
background of unsteady broadband and tonal noise types related to industrial
electromagnetic hum, mechanical vibrations, apartments and street noise,
communication channels or recording devices noise. It allows speech signal unmasking
with signal-to-noise ratio in the range from -5 up to -10 dB.
The Music treatment mode is used for removing unsteady broadband noise from the
music signal.
The Removing noise in pauses mode differs from the Speech treatment mode in
that the initial signal remains unchanged in the speech activity areas.
The Extract background mode is used for background acoustic environment extraction
(i.e. the result will be the very components which are usually suppressed).
The Timbre correction mode is used for providing convenient speech sounding,
reduction of strong noise in the high-frequency and low-frequency areas, smoothing of
channel and recording devices frequency response function, enhancement of weak
speech signals and signals after filtration in the most informative spectrum band. This
mode allows you to "select" the most informative spectral area by suppressing the
signal in HF or LF area, where the noise exceeds useful signal.
One of the features of filtering based on adaptive spectral subtraction is the appearance
of "bells", "clicking" and "purling" in the output signal, caused by lowering
the noise level of the initial signal and unmasking some of its peaks. Increasing the noise reduction
level will make these sounds disappear, but the useful signal will
also be suppressed. So a compromise between noise reduction and speech
distortion is needed for useful signal extraction. Unmasked noise can be "masked" once
more by increasing the reduction depth. Moreover, partial "bells" reduction can be
achieved by increasing the smoothing values (Spectrum smoothing and Smoothing
by time in the data smoothing parameters menu).
Finally, when adjusting adaptive spectral subtraction settings, keep in mind
that background sounds can change dramatically after noise cancellation has been
performed. For example, the processed roar of a passing car can sound like "purling". If you
want to rely on your ear to recognize familiar sounds, you should
decrease the suppression range to 15-25 dB. This will enhance the background and acoustic
environment. Then, slowly increasing the reduction range, you can
achieve the desired compromise between the noise reduction level and suppression of
the useful speech signal.
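The trade-off between noise reduction and the suppression range can be illustrated with a textbook form of spectral subtraction applied to one frame of spectral magnitudes. This is a generic sketch, not the SIS algorithm: the function name, the over_subtraction factor and the way max_suppression_db limits the attenuation are all assumptions made for illustration.

```python
def subtract_noise(magnitudes, noise_magnitudes,
                   over_subtraction=1.0, max_suppression_db=25.0):
    """Subtract a noise estimate from each spectral magnitude.

    No component is attenuated by more than max_suppression_db, so a
    smaller suppression range leaves more of the background audible.
    """
    floor_gain = 10 ** (-max_suppression_db / 20)
    cleaned = []
    for mag, noise in zip(magnitudes, noise_magnitudes):
        residual = mag - over_subtraction * noise
        cleaned.append(max(residual, floor_gain * mag))
    return cleaned
```

Components well above the noise estimate are barely changed, while components at the noise level are lowered only down to the floor set by the suppression range.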
If signal-to-noise ratio is small, the filtered useful signal will be weak. To increase its
amplitude it may be useful to apply additional amplification, set by the Maximum
amplification for one frequency, dB parameter.
The parameters are adjusted by ear. The influence of each parameter and
recommended values are listed below.
The options menu allows you to control the filtration process by manipulating the
following parameters.
• Speech treatment.
• Music treatment.
• Removing noise in pauses.
Reduction intensity – allows you to set one of the 10 available reduction degrees
(from Weak,1 to Strong,10). You can disable this parameter by choosing Turn off.
Disabling is necessary to enter the Advanced options menu (see Fig. 38 below),
otherwise, a warning message will appear at the bottom of the screen: "Turn off
DEFAULT SETTINGS to get access to advanced options", and access will be denied.
Low frequency, Hz – [100..700] Hz - sets the filter lower passband frequency. The
recommended value is 200 Hz. In case of intensive low-frequency noise (hum) this
parameter should be increased.
High frequency, Hz – [2500..Fmax] Hz - sets the filter higher passband frequency. The
recommended value is 3600 Hz. In case of intensive broadband noise this parameter
should be decreased.
Low frequency gain, dB – [-24..+24] dB - amplifies the output signal in the low-
frequency area. Can be used to obtain more natural timbre.
Advanced options - this field provides access to the additional options menu (Fig. 38).
This menu is used when the standard set of options proved insufficient.
Time delay of filter, sec – [1..1000] sec - sets time of correcting filter adaptation to
signal spectrum changes. The recommended value is 3-4 sec. For non-stationary noise
1-2 sec is recommended.
Frame length (counts) – 16, 32, ...2048 - sets the spectrum resolution, the number
of spectrum bands and the duration of the processed fragment of data. As you increase
the frame length, an ever larger number of different sounds become mixed in one
frame, and speech-free bands are eliminated. For spectral subtraction the recommended
value is 128..2048.
Suppression [1-40] - controls noise and, to some extent, speech suppression level in
the signal. Increasing this value will increase the suppression level.
Contrast [10-100] - controls the transition level between suppression and passing
areas in each spectral component of the signal. Increasing this value will lead to larger
speech signal amplitude and sharper residual noise sound. The recommended value is
80.
Best accuracy – when enabled, this parameter increases filter performance and
accelerates its adaptation to noise, at the cost of roughly halving the processing
speed.
14.3 Adaptive filtering of stationary wideband noise
The user should mark in the signal an area of pure noise with temporary marks. This
fragment will be used as a sample.
If the noise is stationary or almost stationary (the noise of a record, street, hall etc.),
this method is preferable to the adaptive ones, since the latter cannot accurately
estimate the background noise and, in addition to that, produce noise themselves. This
noise, called the adaptation noise, is perceived by human ear and worsens the quality
and even speech intelligibility in some nontrivial cases.
Using manual mode of average noise spectrum determination, the user has the
possibility to specify the signal fragment containing noise without accurately marking its
borders (Noise print autodetection).
The Options menu allows you to control the filtration process by manipulating the
following parameters.
• Denoiser by noise print.
Both methods reduce noise, but if you select the first method, you should accurately
mark the noise sample in the signal using temporary marks while the second method
allows the selected fragment to contain some useful signal as well (i.e. the sample can
be marked with overlapping).
Reduction intensity – allows you to set one of the 10 available reduction degrees
(from Weak,1 to Strong,10). Increasing this value will steadily increase noise
reduction degree. You can disable this parameter by choosing Turn off. Disabling is
necessary to enter the Advanced options menu; otherwise, a warning message will
appear at the bottom of the screen: "Turn off DEFAULT SETTINGS to get access to
advanced options", and access will be denied.
Low frequency, Hz – [100..700] Hz - sets the filter lower passband frequency. The
recommended value is 200 Hz. In case of intensive low-frequency noise (hum) this
parameter should be increased.
Low frequency gain, dB – [-24..+24] - amplifies the output signal in the low-
frequency area. Can be used to obtain more natural timbre.
Advanced options - this field provides access to the additional options menu
completely identical to that used in the Noise reduction►Wideband noise operation
(see section 14.2.2, Fig. 37).
One-channel adaptive noise filtration mode works with mono signals and is used for
periodic or similar to periodic interferences cancellation (vibrations, power-line hum,
household device noises, slow music, cars, etc). This mode can be used for unmasking
the speech signal (suppressing the tonal noise by 20..40 dB) or, in some cases, for
concert hall noise cancellation.
There are four method implementations: spectral, temporal, echo cancellation and
harmonics suppression.
The temporal method (In time domain) provides better rate of convergence, but
requires more computing power at an equal number of coefficients.
The main difference between these noise cancellation methods and others is that, for
certain noise types, they preserve the speech signal far better.
This is because the filter works by
noise compensation (subtraction) rather than multiplication by zero. In this case, the useful
signal remains unaffected, provided the adaptation is good.
However, the application of this method has some restrictions. Thus, using adaptive
noise reduction for mono signal, you can suppress only harmonics with relatively stable
phase. This mode is effective for stereo signals only in cases when the noise ratio in the
primary and reference channels changes relatively slowly.
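The idea of noise compensation by subtraction can be shown with a generic adaptive noise canceller. The manual does not name the adaptation algorithm used in SIS, so this sketch substitutes the standard normalized LMS (NLMS) update; all names, the default number of coefficients and the step size are assumptions.

```python
import math


def nlms_cancel(primary, reference, num_coefficients=8, step=0.5, eps=1e-9):
    """Subtract from primary the part that can be predicted from reference.

    primary   - channel with useful signal plus noise
    reference - channel containing (mostly) the noise
    Returns the residual, i.e. primary minus the estimated noise.
    """
    weights = [0.0] * num_coefficients
    history = [0.0] * num_coefficients  # most recent reference sample first
    output = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # compensation by subtraction
        power = sum(h * h for h in history) + eps
        weights = [w + step * error * h / power
                   for w, h in zip(weights, history)]
        output.append(error)
    return output


def hum(num_samples, freq_hz=50.0, sampling_rate_hz=1000.0):
    """Synthetic tonal interference, used only to exercise the sketch."""
    return [math.sin(2 * math.pi * freq_hz * n / sampling_rate_hz)
            for n in range(num_samples)]
```

When the primary channel contains only a scaled copy of the reference hum, the residual decays towards zero as the coefficients adapt. A slowly changing relation between the two channels gives the filter time to track it, which matches the restriction stated above.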
The main parameters defining noise reduction level are the number of filter coefficients
N (the Frame length parameter in the options) and delay time. These parameters are
different for mono and stereo modes.
The delay time should be set to less than N/2. Increasing the number of
coefficients makes it possible to cancel more noise spectral peaks (spectral peaks of
the tonal noise follow with the interval equal to the voice fundamental period
frequency). So you should set 512-2048 coefficients for 50 Hz interference. To
prevent speech signal distortion, the delay time must be no less than 25 msec (250
counts at the 10000 Hz sampling rate).
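The two rules for the delay can be checked with a small helper. The function names are hypothetical; the limits (at least 25 msec and less than N/2) are the ones quoted in the paragraph above.

```python
def delay_in_counts(delay_ms, sampling_rate_hz):
    """Convert a delay from milliseconds to counts (samples)."""
    return round(delay_ms * sampling_rate_hz / 1000)


def delay_is_acceptable(delay_counts, num_coefficients, sampling_rate_hz,
                        min_delay_ms=25):
    """True if the delay is at least min_delay_ms and less than N/2."""
    lower = delay_in_counts(min_delay_ms, sampling_rate_hz)
    return lower <= delay_counts < num_coefficients / 2
```

At 10000 Hz the lower limit is 250 counts, so 512 coefficients (N/2 = 256) leave only a narrow range of acceptable delays, while 2048 coefficients leave much more room.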
The filter adjustment time is defined by the adaptation rate. The rate value is set in the
16..29 range for unstable noise types and in the 2..15 range for stable (slowly
changing) noise types.
14.4.2 Method parameters
To perform adaptive filtering of tonal and regular noise, select the Noise reduction►
Tonal and regular noise menu. The Tonal and regular noise menu will appear on
the screen. This is a standard menu of the source-destination type described in section
4.6. By pressing the OPTIONS field you can open parameter menu for this operation
(Fig. 40).
The Options menu allows you to control the filtration process by adjusting the following
parameters.
Frame length (counts) – 16, 32, ...2048 - sets the number of spectrum bands and the
duration of the processed fragment of data. 256-2048 is recommended.
Method version:
• Spectral,
• In time domain,
• Echo cancellation,
• Harmonics suppression.
Delay (points) - ranges from 0 to 1024. It is recommended to set 250 or more, but
less than the number of filter coefficients. If the delay is equal to 0.02 sec (200 counts
at a 10000 Hz sampling rate), the quality of the processed speech signal will degrade.
signal output will be equal to input. It is necessary to choose a signal fragment, if
possible without useful signal, enable Adaptation and Save coefficients. Once the
filtration is finished, disable Adaptation and process the signal using the saved (fixed)
filter coefficients. If noise features are stable or the noise source is stationary, the noise
will be cancelled (weakened) and the speech will be kept safe.
Save coefficients - if this mode is enabled, filter coefficients will be saved after
filtration and can be used for subsequent filtering, when the adaptation mode is
disabled. Otherwise, these values will be equal to zero after processing.
Show deleting noise - if enabled, this parameter will output cancelled noise without
signal. This parameter is useful to estimate how much of the useful signal is mistakenly
cancelled.
Adaptation rate – [2..30] - sets filter adjustment (convergence) time to the changes in
the noise spectrum. The recommended speed is 10-15. For quickly changing noise types
20-25 is recommended. The quality of the processed speech signal will become worse at
high rates.
High freq. contrasting – when enabled, greatly increases high frequencies, which
sometimes leads to intelligibility improvement.
Low frequency, Hz – [100..700] Hz - sets the filter lower passband frequency. The
recommended value is 200 Hz. In case of intensive low-frequency noise (hum) this
parameter should be increased.
High frequency, Hz – [2500..Fmax] Hz - sets the filter higher passband frequency. The
recommended value is 3600 Hz. In case of intensive broadband noise this parameter
should be decreased.
Low frequency gain, dB – [-24..+24] dB - amplifies the output signal in the low-
frequency area. Can be used to obtain more natural timbre.
Having set the necessary parameters, press OK. Choose the processed fragment type in
the displayed menu and filtration will start. Pressing <ESC> will interrupt the filtration.
If the filtration was interrupted, part of the data will remain unprocessed. Once the
filtration is completed, the filtered segment will be displayed in the active window on top
of the source segment.
Impulse noise filtering method is based on the substitution of pulse areas of the signal
by the interpolated and smoothed values. The signal remains unchanged in the areas
where pulses are not detected. The effectiveness of the filtration depends on the correct
choice of the processing parameters.
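The substitution principle can be illustrated with a deliberately simplified sketch. It detects pulses with a plain amplitude threshold and fills them by linear interpolation between the neighbouring clean samples. Both choices are assumptions: SIS detects pulses by duration and energy and also smooths the substituted values, and the function name is hypothetical.

```python
def remove_impulses(samples, threshold):
    """Replace runs of samples above threshold by linear interpolation."""
    result = [float(s) for s in samples]
    n = len(result)
    i = 0
    while i < n:
        if abs(result[i]) <= threshold:
            i += 1  # no pulse here: the signal stays unchanged
            continue
        start = i
        while i < n and abs(result[i]) > threshold:
            i += 1
        end = i  # index one past the detected pulse
        # Clean neighbours on both sides; fall back at the signal edges.
        right = result[end] if end < n else None
        left = result[start - 1] if start > 0 else right
        if left is None:
            left = 0.0  # the whole signal is one pulse
        if right is None:
            right = left
        span = end - start + 1
        for k in range(start, end):
            result[k] = left + (right - left) * (k - start + 1) / span
    return result
```

Lowering the threshold catches weaker pulses but also starts replacing parts of the useful signal, which mirrors the trade-off described for the detection parameters.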
To run the filtration, enter the Noise reduction►Filtering of IMPULSE noise menu.
The operation menu will appear on the screen. This is a standard menu of the source-
destination type described in section 4.6. By pressing the OPTIONS field you can open
parameter menu for this operation (Fig. 41).
The Options menu allows you to control the filtration process by adjusting the following
parameters:
Impulse length [1-5] - performs pulse detection and localization based on its
duration. It is set in relative units. If this parameter is decreased, weak and short pulses
will be detected better, while long ones – worse; and vice versa – if this parameter is
increased. The default value is 3.
Detecting threshold [1-10] - performs pulse detection and localization based on its
energy. It is set in relative units and is equal to 6 by default. If this parameter is
decreased, weak and short pulses will be detected well, but useful signal can be
damaged.
Contrast [-9,9] – similarly to the detecting threshold, performs pulse detection and
localization based on its energy. The value is set in relative units and is equal to 0 by
default. Increasing the contrast leads to a greater number of detected pulses. The
contrast stretches pulse energy axis and thus influences the pulse width estimation.
Calculation rate [1, 2, 4] - defines the compromise between speed and quality of
processing. The processing speed increases proportionally to step increase, which
results in some quality loss. The default value is 1.
Method version - allows you to choose one of three methods of signal processing:
Pressing the field corresponding to the type of processed fragment will initiate the
filtration process. The filtered fragment will be drawn in the destination window or, if it
is not available, in the active window above the initial fragment in another color.
To run dynamic filtration, select the Noise reduction►DYNAMIC filtering menu item.
The operation menu will appear on the screen. This is a standard menu of the source-
destination type described in section 4.6. After pressing the OPTIONS field you can
choose one of the following methods:
• Compressor,
• Limiter,
The AGC (Auto Gain Control) and the Compressor methods compress the dynamic
range of the signal, which weakens impulse noise and equalizes signal levels for speakers
talking at different loudness. Both methods perform the same function using different
means.
The Limiter method weakens strong lengthy impulses and weak background noise.
The single parameter used for all these methods is Threshold. Once current mean
amplitude of the signal starts to differ from the specified threshold level, the signal will
be processed. The signal above the threshold level will be weakened in the AGC (Auto
Gain Control), Compressor and Limiter methods. The signal below the threshold
level will be weakened if the Removing noise in pauses method is selected. The
threshold level is 2000 (counts) by default.
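The role of the threshold can be sketched with two simplified per-sample operations. They are illustrations only: SIS compares the current mean amplitude with the threshold, while the sketch looks at each sample separately, and the ratio and gain parameters as well as the function names are assumptions.

```python
def compress(samples, threshold=2000, ratio=4.0):
    """Weaken the part of each sample that exceeds the threshold."""
    out = []
    for s in samples:
        mag = abs(s)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if s >= 0 else -mag)
    return out


def attenuate_pauses(samples, threshold=2000, gain=0.1):
    """Weaken samples that stay below the threshold (noise in pauses)."""
    return [s * gain if abs(s) < threshold else s for s in samples]
```

With the default threshold of 2000 counts, compress maps a peak of 6000 to 3000, while attenuate_pauses leaves it untouched and lowers only the quiet samples.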
If the noise to be cancelled comes from a stationary source, both channels can contain
signal and noise, but then there must be an area in both channels without useful signal
(only source noise). The effectiveness of cancellation in standard acoustic conditions is
12-25 dB in this case.
The Options menu allows you to control the filtration process by adjusting the following
parameters.
DEFAULT SETTINGS – restores default method settings.
Frame length (counts) – 16, 32, ...2048 - sets the number of filter coefficients.
Theoretically, the necessary number of the filter coefficients should correspond to
reverberation (echo) time in the room where the signal was recorded. In practice,
however, increasing the frame size results in slower processing and is sometimes
inefficient, because the noise to be cancelled is exposed to nonlinear distortions before
recording.
Save coefficients - if this mode is enabled, filter coefficients will be saved after
filtration and can be used for subsequent filtering, when the Adaptation mode is
disabled.
Show deleting noise - if this parameter is enabled, the output will contain the
difference between the input and the filtered signal (the cancelled noise); otherwise,
the filtered signal will be at the output.
Delay (points) - sets the shifting value of the input signal relative to input noise (the
left channel relative to the right channel). For stereo signal this value should be set to 0.
Method version - allows you to choose one of the following processing methods:
• Delay estimation - the signal is not processed; the delay value (in counts) is
estimated and set.
• Space reflection - allows you to isolate the useful signal if it arrives in both
channels in phase. This happens if the speaker is located symmetrically with regard to
both microphones (delay = 0). An asymmetrical position can be taken into account by
setting a nonzero delay.
Adaptation rate [2..30] - sets adaptation speed. If the parameter value is too big, the
filter will become unstable. If the value is too small, the noise will be suppressed
weakly.
Timbre correction - enables/disables a timbre correction filter.
Low frequency, Hz – [100..700] Hz - sets the filter lower passband frequency. The
recommended value is 200 Hz. In the case of intensive low-frequency noise (hum) this
parameter should be increased.
High frequency, Hz – [2500..Fmax] Hz - sets the filter higher passband frequency. The
recommended value is 3600 Hz. In case of intensive broadband noise this parameter
should be decreased.
Low frequency gain, dB – [-24..+24] dB - amplifies the output signal in the low-
frequency area. Can be used to obtain more natural timbre.
To start adaptive filtration, press OK. To cancel operation, use the Cancel button.
Having returned to the source-destination menu, choose the processed fragment type.
It is recommended to specify All data if the noise is stationary throughout the whole
speech signal. Then the adaptive filtration of the specified signal fragment will start.
Pressing <ESC>, you can interrupt filtering.
You are strongly recommended to load Sound Cleaner first and convert it to
DX-plug-in mode before running it as a plug-in from SIS.
To run Sound Cleaner as a plug-in, press the toolbar button in SIS main window.
Preview. This option is used to adjust Sound Cleaner filtering parameters for current
signal processing. The signal will be transferred from SIS to Sound Cleaner for
processing and then back to SIS for playback. Selecting this option, you can adjust the
Sound Cleaner filter using all capabilities available there with constant audio control of
the filter performance. Having achieved the desired effect, return to the SIS window,
press <ESC> and select the next menu item.
Process. After you select this option, you will see a standard menu of the source-
destination type described in section 4.6. The Options menu includes two processing
options:
• Mute processing,
• Edit signal.
If you tick Mute processing, the signal will be filtered without being played, which will
significantly increase the processing speed. If this option is disabled, processing will be
performed simultaneously with playback in real time.
If Edit signal mode is enabled, the source signal will be replaced with the result of the
filtering. If you disable this option, SIS will create a new segment for the processed
data.
Keep in mind that Sound Cleaner version 6.02 and below will not be able to process
24-bit signals – you are recommended to upgrade the Sound Cleaner software to
version 6.03.
See the Sound Cleaner User Manual for more information about working with its
different sound filters.
14.8.2 Other DirectX plug-ins
SIS allows you to use any DirectX modules installed on your computer. To call external
filter press the button on the toolbar. The DX filters window shown in Figure 44 will
appear on the screen.
The DX filters window contains toolbar and filters list. Functions of buttons on the
toolbar are described in the table below.
Button Action
To add a filter to the filters list, press the button on the toolbar and select the filter in the
drop-down list. To include a filter in processing, tick its name in the filters list. Filters that
are not selected will not be included in filtering.
During processing, filters will be applied in the same order as they appear in the list.
To delete a filter from the list, click on its name in the list with the left mouse button and,
holding it down, press <Delete>.
• Process mode is used to process the signal and save the result of filtering in the SIS
signal window.
If (Mute) option is selected, the signal will be filtered without being played, which
will significantly increase the processing speed. If this option is disabled, processing will
be performed simultaneously with playback in real time. If (Edit mode) is selected
the source signal will be replaced with the result of the filtering. If you disable this
option, SIS will create a new segment for the processed data.
14.9 Equalizer
This process displays the signal spectrum and enables the user to correct it via inverse filtering
and filter contrasting. The equalizer may work in automatic or semi-automatic mode; it is
also possible to tune the filter manually to make fine spectrum adjustments. This
module can suppress any stationary components of a signal regardless of their
frequency and location; it can also be used to raise the amplitude in a chosen spectral
band. This filter works well for phonograms containing considerable stationary noise
such as power-line noise, mechanical and engine noise and so on.
The number displayed in the window header is the current FFT window size. The larger it is, the
larger the number of equalizer bands and the finer and more precise the adjustments that can be
made. To select the number of bands, open the Options dialog box (described
further). Try to set the largest possible number of bands to achieve the best filtering quality and
precision, but remember that this increases the system load as well. The FFT window size is
strictly determined by the number of bands (in fact, it is equal to the number of bands
multiplied by four).
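The relation between the number of bands and the FFT window size is simple arithmetic, sketched below with hypothetical function names. The bin spacing formula is the general FFT relation (sampling rate divided by window size), not something specific to the equalizer.

```python
def fft_window_size(num_bands):
    """FFT window size for a given number of equalizer bands."""
    return 4 * num_bands


def fft_bin_spacing(sampling_rate_hz, num_bands):
    """Distance in Hz between neighbouring FFT bins for that window size."""
    return sampling_rate_hz / fft_window_size(num_bands)
```

For example, 512 bands correspond to a 2048-count FFT window, which at a 10000 Hz sampling rate gives a bin spacing of about 4.9 Hz.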
Figure 45. The DX filters window
wish to view, and release the mouse button. The selected area will be enlarged to fit the
spectrum window.
Switch Y-axis from linear to logarithmic (dB) scale. When this button is pressed, the
equalizer displays the signal in logarithmic scale; otherwise the scale is linear.
Turn spectrum accumulation on and off. While this button is pressed, the program
accumulates the signal, calculating and displaying the average spectrum. Release it to
switch back to the instant spectrum; this automatically stops spectrum accumulation.
Build an inverse or harmonic filter (selected by the user) based on the current spectrum
(either instant or average, if spectrum accumulation is on). Pressing this button
automatically stops spectrum accumulation. Note that either filter is calculated within the
passband borders only.
Contrast the filter. See section 14.9.3 for more information about filter contrasting.
Automatic filtration. This button turns on spectrum accumulation and then calculates the
inverse or harmonic filter (the filter type is selected by the user). You may set the time of
automatic spectrum accumulation in the Options menu.
Save settings. Allows you to save equalizer settings to a file with the .eq_cfg extension.
Load settings. Allows you to load previously saved equalizer settings from an .eq_cfg
file.
This indicator button turns red if the signal was over-amplified, causing an
overflow at the equalizer output. Press the button to return the indicator to its passive
state until the next overflow.
There you may choose the number of bands for the equalizer. Generally speaking, a
greater number of bands improves filtering quality but also increases the load on your PC.
Figure 46. The Equalizer Additional Options window
The Filter Contrasting Options field occupies the center of the window. Contrasting
means that the program automatically detects narrow gaps in the filter FC and then broadens
and deepens them. This operation may considerably improve filtering quality, especially if
there are clear local noise peaks in the spectrum of a signal. To enable filter contrasting,
check the respective flag in the bottom part of the window. Then adjust the contrasting
options:
• The Analysis Window Width value determines the maximum width of a gap (in Hz)
that will be considered "narrow" and, therefore, will be subject to contrasting.
The default value is 70 Hz.
• The High Intensity flag, when checked, slightly broadens the gaps in addition to
the common contrasting effect.
In the lower part of the window there is a group of additional controls.
The Accumulation time value is the duration of spectrum accumulation for automatic filter
calculation. The Filter contrasting flag turns on the respective operation, as mentioned
earlier. Graphics drawing may be turned off to disable drawing of the signal spectrum
and thus improve performance on slow PCs.
The OK button closes the window and saves all changes; Cancel discards them.
14.9.4 Zooming and scrolling
To zoom the signal spectrum window in and out, you may use the appropriate toolbar
buttons and the horizontal zoom bar located below the spectrum area. The blue-marked
fragment of this bar indicates which part of the whole spectrum is currently displayed. If
the whole bar is blue, then the whole spectrum is displayed (if, for example, the
corresponding button was pressed).
Click the left or the right mouse button over the bar to set the respective border of
the displayed spectrum. You may also scroll it with the arrow buttons to the left of the bar
or with the arrow keys. In the latter case you have to set the focus on the bar. Pressing the
button or key moves the displayed area to the right or left by 1/16 (1/32 for a large
window) of the whole signal area.
To include sliders in an “elastic” group, click the left and the right mouse button on
the bar below the sliders, thus setting the left and right borders of the group. The selected
group will be marked with a blue indicator in the bar.
Note that only sliders (not signal bands!) may be grouped and bound together. This
means that if you zoom in or out, the sliders that you have previously included in a
group will control other signal bands.
The button to the left of the Elastic mode bar turns linearization (smoothing) on and
off. When this mode is on, the outline of a filter built in Elastic mode will be smooth
(not jagged).
The Q1 slider adjusts the FC convexity within the 100-800 Hz frequency band. The Q2 slider
changes the FC increase/decrease per 1000 Hz step, starting from 1000 Hz.
Both sliders work within a -18/+18 dB range, with the current value indicated to the left of
each slider.
The Play sound while processing parameter, if enabled, plays the signal during
filtering. If it is disabled, the signal will be filtered without being played, which
significantly increases the processing speed.
The buttons at the top of the options window allow you to select the data fragment to be
processed by the equalizer (all data, highlighted, between temporary marks, visible in
window). The box located below these buttons contains the list of channels that are
available for processing (left channel, right channel, both channels).
• Preview mode – press the Preview button. The filtered signal will not be saved
in SIS window. This mode is used to adjust filtering parameters or to accumulate
spectrum.
• Process mode – press the Process button. This mode is used to filter the signal
and to save the result of filtering in SIS signal window. By default the filtered
signal is saved in the same window as the source signal. You can select another
window or create a new one in source-destination menu.
2. Process the left channel. To do this, select Left channel in the Equalizer options
window.
4. Process the right channel. To do this, select Right channel in the Equalizer
options window.
If there is no need to adjust filters for each channel separately, then after adjusting the
left channel filter select Both channels in the Equalizer options window. In this case
the channels will be processed one after another without playback.
Note: In the Equalizer options window, the Both channels and Play sound while
processing options cannot be selected simultaneously.
15 ANALYSIS OF SPEECH SIGNALS
SIS provides various methods of speech signal analysis. Enter the Analysis menu and
select one of the analysis options listed below:
• Spectrogram,
• Cepstrum,
• Autocorrelation,
• LPC Spectrogram,
• Average…,
• Formants analysis,
• Energy (r.m.s.),
• Average spectrum,
• Histogram,
• Histogram parameters,
• Partial correlation,
• Compare speakers.
To use any of the methods listed above, select the respective Analysis menu item. After
you choose a method, the operation menu will appear on the screen. This is a standard
menu of the source-destination type described in section 4.6.
3-D images (time-frequency-intensity) are used to visualize the 3-dimensional
characteristics listed above. The third axis is perpendicular to the screen plane,
so 3-D data are represented in a particular way, for example, by
deviation to the right, using hues of grey, axonometry, etc.
15.1 Producing 3-dimensional representations for 3-D data
To receive 3-D images of the specified characteristics, enter the Analysis menu item
and choose the required characteristic. The operation menu will appear on the screen.
This is a standard menu of the source-destination type described in section 4.6.
First you should enter the field Options and set the required parameters for calculation.
The third dimension can be represented in various ways, namely: by deviation to the
right and upwards, using axonometry, different colors, hues of grey.
The first two modes (right deviation and up deviation) produce practically the same
images.
Unlike the first two modes, the axonometric mode introduces a third axis into the
spectra images to represent intensity.
The main advantage of these three representation modes (as compared to colors and
hues of grey) is that they cover a wider dynamic range. The dynamic range is limited, on
the one hand, by what the screen can display and, on the other hand, by the user's ability
to distinguish hues of grey and different colors.
Enter the field Drawing type in the Options menu for the specific operation and
choose the desired representation type in the list of the specified modes of data
representation (by clicking on the selected option with the mouse or by pressing
<Enter>). Your choice will be reflected in the corresponding field of the Options menu.
Having set the necessary parameters, press OK and you will return to the standard
menu. There you can create or choose a destination window for the output. Start the
procedure by pressing on the field with the desired data fragment type for processing.
To change this factor, enter the Show►Redraw 3-Dimensional data menu and
modify the factor value in the Multiplier for the 3-rd dimension field. The
multiplier value indicated there is the one used for the currently displayed 3-D data. If
there are no 3-D data in the active window, this menu cannot be entered.
15.1.2 Scaling the third dimension for not yet calculated 3-D data
Everything described in the previous section applies to the already displayed 3-D data.
The scale multiplier for newly calculated and displayed data is 1 by default. To change
this multiplier, select the Show►3-D drawing options menu item and modify
the multiplier value in the Multiplier for the 3-rd dimension field. When you enter
this menu item, you will see the factor value as set for the last representation. The
modified value of the multiplier will be used for the next 3-D calculations.
The Show menu contains the Change box position/length item for changing the
image boundaries along the horizontal axis. Select this item and in the displayed Box
length window enter the values for the left and right box boundaries into the
corresponding fields. Press OK and the 3-D image will be redrawn within the new
boundaries.
The Show menu also contains the Change box position/height item for changing the
image boundaries along the vertical axis. To modify the image upper and lower limits,
select this item and in the displayed Box height window enter the desired values for
the top and bottom box boundaries into the corresponding fields. Press OK and the 3-D
image will be redrawn within the new boundaries.
Besides changing the data box location and size, you can also adjust the frequency
band when computing a spectrogram. To set the frequency band, enter Show►3-D
drawing options and set the Lower frequency limit and Upper frequency limit
values.
After left-clicking the button, you can control the brightness using a
special slider that appears in the box. The button located above the slider switches
between brightness adjustment and contrast adjustment.
To exit the mode, press the <Esc> key, press the button once more, or
click with the left mouse button anywhere in the SIS workspace.
You can also change the brightness smoothly by pointing the mouse at the button and
turning the wheel.
When you press the button, an additional red axis and a red circle (indicating the
current normalization value) appear at the top of the window. Move the red circle
to the required point while holding the left mouse button down. The data will be redrawn
after you release the mouse button.
To exit the mode, press the <Esc> key or press the button once more or click with
the left mouse button anywhere in the SIS workspace.
• Right deviation,
• Up deviation,
• Axonometry,
• Colour,
• Grey scale.
When a rectangular window is applied to the signal, the edges of a windowed area may
be very abrupt; this may result in distortions of spectrum structure, with spectral peaks
determined by the form and position of the window, rather than by the signal
properties. To reduce this effect, signal edges are commonly smoothed within the
analysis area, which is implemented by using a roll-off window function with values
decaying from the window centre to the borders.
Applying such windows smooths the spectrum values and suppresses amplitude spikes.
However, this somewhat degrades the spectrum resolution.
The application of an analysis window in the time domain corresponds (in the spectrum
domain) to the convolution of the signal spectrum with the window function spectrum. Thus,
the application of a rectangular window (i.e. when no weighting window is applied)
corresponds to the convolution of the signal spectrum with the spectrum of a rectangular
function (known as the Dirichlet convolution kernel). This convolution yields the so-called
"window" effects: the spectra of closely located signals are smeared together, and the
influence of powerful interference that is distant in frequency is enhanced.
Each window function spectrum contains a main lobe and a number of side-lobes -
parasitic additional lobes affecting each spectrum value and thus degrading the initial
spectrum calculations. Side-lobes having large amplitudes may significantly affect a
given spectral count, even if they are quite distant. The side-lobe amplitudes of the
window function can be reduced only at the cost of increasing the width of the main
lobe, that is, at the cost of spectral resolution degradation. The choice of the weighting
window is used for controlling the effects related to the presence of side-lobes in the
computed spectrum.
The minimum width of spectral peaks in the data fragment weighted by the window is
determined by the width of the main lobe of this window and does not depend on the
initial data.
The effect of side-lobes, sometimes referred to as “spectral leakage”, alters the amplitudes
of adjacent spectral peaks. Since the discrete-time Fourier transform is a periodic
function, the superposition of side-lobes from adjacent spectral periods may cause
further shifts in the frequencies of spectral peaks. Leakage leads not only to peak
amplitude errors in discrete signal spectra, but can also mask the presence of weak
signals against the background of strong ones (formants with weak amplitudes against
strong ones) and, hence, hamper their detection.
A number of parameters are used for the classification of window functions and
evaluation of their quality.
The main lobe bandwidth allows the frequency resolution to be assessed. Two parameters
are used to quantify the main lobe width. One conventional parameter is
the bandwidth at the half-power point, i.e. 3 dB below the main lobe maximum. The
other parameter is the equivalent bandwidth.
Likewise, two parameters are used to evaluate side-lobes. One of them is the
side-lobe peak (maximum), which shows how effectively the window suppresses leakage.
The other is the side-lobe roll-off, i.e. the rate at which the level of the side-lobes
nearest to the main lobe drops. This value depends strongly on the number of
counts N used and, as N increases, tends to an asymptotic value expressed in decibels per
octave of frequency change.
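• Hamming window:
$$\mathrm{SIGNAL}[i] = \mathrm{SIGNAL}[i] \cdot \left(0.54 - 0.46 \cdot \cos\left(\frac{2 \cdot \pi}{N} \cdot i\right)\right).$$
• Hann window:
$$\mathrm{SIGNAL}[i] = \mathrm{SIGNAL}[i] \cdot \left(0.5 - 0.5 \cdot \cos\left(\frac{2 \cdot \pi}{N} \cdot i\right)\right).$$
• NUTTALL – Nuttall window, i = 0..N − 1,
$$\mathrm{ARG} = \frac{i - (N - 1)/2}{N - 1},$$
$$\mathrm{SIGNAL}[i] = \mathrm{SIGNAL}[i] \cdot \bigl(0.3635819 + 0.4891775 \cdot \cos(2 \cdot \pi \cdot \mathrm{ARG}) + 0.1365995 \cdot \cos(4 \cdot \pi \cdot \mathrm{ARG}) + 0.0106411 \cdot \cos(6 \cdot \pi \cdot \mathrm{ARG})\bigr).$$
• Gaussian window:
$$\mathrm{ARG} = \frac{i - (N - 1)/2}{N - 1},$$
$$\mathrm{SIGNAL}[i] = \mathrm{SIGNAL}[i] \cdot \exp\left(-\frac{1}{2} \cdot (2 \cdot \alpha \cdot \mathrm{ARG})^{2}\right), \quad \alpha \ge 2.$$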
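The window formulas in this section can be written out as a short sketch. The weights with 0.54/0.46 and 0.5/0.5 are the standard Hamming and Hann forms, the four-term cosine sum is the Nuttall window, and the exponential is the Gaussian window. The function names, the default α = 2 and the `apply_window` helper are assumptions made for illustration; this is not the SIS source code.

```python
import math


def hamming(n):
    """Hamming weights: 0.54 - 0.46*cos(2*pi*i/N), i = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def hann(n):
    """Hann weights: 0.5 - 0.5*cos(2*pi*i/N), i = 0..N-1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def _arg(i, n):
    """ARG = (i - (N-1)/2) / (N-1); runs from -0.5 to +0.5 (requires N >= 2)."""
    return (i - (n - 1) / 2.0) / (n - 1)


def nuttall(n):
    """Four-term Nuttall weights built from ARG."""
    return [0.3635819
            + 0.4891775 * math.cos(2.0 * math.pi * _arg(i, n))
            + 0.1365995 * math.cos(4.0 * math.pi * _arg(i, n))
            + 0.0106411 * math.cos(6.0 * math.pi * _arg(i, n))
            for i in range(n)]


def gaussian(n, alpha=2.0):
    """Gaussian weights exp(-0.5*(2*alpha*ARG)^2); the text requires alpha >= 2."""
    return [math.exp(-0.5 * (2.0 * alpha * _arg(i, n)) ** 2) for i in range(n)]


def apply_window(signal, weights):
    """SIGNAL[i] = SIGNAL[i] * w[i], element by element."""
    return [s * w for s, w in zip(signal, weights)]


frame = [1.0] * 9                      # hypothetical 9-count frame
print(apply_window(frame, nuttall(9)))
```

The Nuttall and Gaussian weights are symmetric about the frame centre, where ARG = 0 and the weight equals 1.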
These windows have the following properties:

                               Rectangular  Hamming  Hann   Nuttall  Gaussian*
Maximum side-lobe level (dB)   -13.3        -43      -31.5  -98      -115
Half-power bandwidth (3 dB)    0.89         1.30     1.44   1.70     3.0

* Gaussian windows applied to 16-bit and 24-bit signals have different widths, so the
side-lobe level may equal -139 dB.
Of all the windows listed in the table, the rectangular window has the narrowest main
lobe and the highest level of side-lobes.
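The side-lobe figures for the rectangular and Hann windows can be checked numerically. The sketch below evaluates the window spectrum on a dense frequency grid with a direct (slow) Fourier sum, walks down the main lobe to its first minimum, and reports the highest level beyond it relative to the main lobe peak. The frame size of 64 counts, the oversampling factor and all names are assumptions chosen for illustration.

```python
import cmath
import math


def peak_sidelobe_db(weights, oversample=16):
    """Highest side-lobe level of a window, in dB relative to the main lobe peak."""
    n = len(weights)
    steps = n * oversample // 2            # grid covers 0 .. half the sampling rate
    mags = []
    for k in range(steps + 1):
        f = k / (n * oversample)           # normalized frequency, cycles per count
        acc = sum(w * cmath.exp(-2j * math.pi * f * i)
                  for i, w in enumerate(weights))
        mags.append(abs(acc))
    # Walk down the main lobe until the magnitude stops decreasing.
    k = 0
    while k + 1 < len(mags) and mags[k + 1] < mags[k]:
        k += 1
    # Everything from the first minimum onwards belongs to the side-lobes.
    return 20.0 * math.log10(max(mags[k:]) / mags[0])


N = 64                                     # hypothetical frame size in counts
rect_db = peak_sidelobe_db([1.0] * N)
hann_db = peak_sidelobe_db(
    [0.5 - 0.5 * math.cos(2.0 * math.pi * i / N) for i in range(N)])
print(round(rect_db, 1), round(hann_db, 1))
```

The same function can be applied to any other list of window weights to compare how strongly each window suppresses leakage.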
The "square cosine" type of window is named after the Austrian meteorologist Julius von
Hann. It is often mistakenly called "Hanning" window.
The Gaussian window side-lobes in the double logarithmic scale do not tend to a straight
line, but the side slopes drop much faster than in any other described window.
The window of the "raised cosine" type was introduced by R. W. Hamming and, thus, is
frequently referred to as the Hamming window. The multipliers 0.54 and 0.46 were
chosen to almost completely cancel the largest side-lobe.
When the normal spectrum and its associated functions are computed, a different
number of function periods is taken into account for each harmonic amplitude.
Thus, if a spectrum is computed in a 256-count window, the first harmonic, with a period
of 256 counts, fits into the window only once, while the last one, with a period of 2
counts, fits as many as 128 times. As a result, high-frequency harmonics (with small
periods) are considerably averaged out, while low-frequency ones are not.
The window length increases linearly with the harmonic period (the period is
inversely proportional to frequency), without exceeding the frame limits. The
proportionality coefficient is specified in the interval between 0.3 and 9. Click with the
mouse button anywhere outside the window selection menu. The Enter
period multiplier [0.3,…,9] dialog will appear on the screen. Enter the desired value
and press OK. Pressing Cancel will restore the previous value.
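The dependence of the window length on the harmonic period can be sketched as follows. The manual states only that the length grows linearly with the period, is scaled by a multiplier between 0.3 and 9, and does not exceed the frame. The exact rule used by SIS is not given, so the formula below (multiplier times period, rounded and clipped to the frame) and all names are assumptions.

```python
def harmonic_period(frame_len, harmonic):
    """Period, in counts, of the given harmonic of a frame of frame_len counts."""
    return frame_len / harmonic


def equi_periodic_window_length(frame_len, harmonic, multiplier=2.0):
    """Assumed rule: window length = multiplier * period, clipped to the frame."""
    if not 0.3 <= multiplier <= 9.0:
        raise ValueError("multiplier must lie in [0.3, 9]")
    length = round(multiplier * harmonic_period(frame_len, harmonic))
    return min(frame_len, max(1, length))


# 256-count frame: harmonic 1 has a 256-count period, harmonic 128 a 2-count period.
for h in (1, 2, 64, 128):
    print(h, harmonic_period(256, h), equi_periodic_window_length(256, h))
```

With such a rule every harmonic is analysed over roughly the same number of its own periods, which is why a small frame shift step matters for the short high-frequency windows.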
You are recommended to set a small frame shift step to prevent skipping high-frequency
data.
Thus, if strong signal components are located both close to and at a distance from weak
signal components, you should choose an analysis window with a uniform level of
side-lobes around the main lobe, to keep the shift of spectral peaks small.
If there is only one strong component far removed from the weak signal components,
you should choose a window with a quickly decreasing level of side-lobes (high roll-off
rate); whereas their level in the immediate proximity to the main lobe will be of no
consequence in this case.
If high resolution is required between closely located signal components, and there are
no distant components, the optimal choice would be a window with a very narrow main
lobe, while even a rising level of side-lobes may prove acceptable in this case.
If the signal has a restricted dynamic range, the characteristics of side-lobes are
irrelevant.
spectrogram, enter the Analysis►Spectrogram menu item. The operation menu will
appear on the screen. This is a standard menu of the source-destination type described
in section 4.6.
A progress bar for the current process will be displayed at the bottom of the screen in
the message line. You can interrupt the process by pressing the <ESC> key.
Press the OPTIONS field to set the necessary parameters. After setting the parameters
you should create or choose a destination window and start the calculating process by
pressing on the desired data fragment type.
By adjusting these parameters you can obtain the most suitable visible speech
representation. The parameters included in this menu (and in the menus nested in it)
are listed below.
Frame length. The user can choose the frame length from the list: 16, 32, ... 16384
counts. When the frame length is set in counts, it is displayed in milliseconds in the
adjoining field. Depending on the frame size, you obtain a narrowband or a
broadband spectrum. The narrowband spectrum shows more spectral detail, while the
broadband spectrum gives a more general picture. To obtain the narrowband spectrum,
the frame size should exceed the maximum pitch period: 256 counts or more for male
voices and 128 counts or more for female voices. To obtain the broadband spectrum, the
frame size should be lower than the maximum pitch period: 64 counts for male voices and
32 counts for female voices.
Frame shift. The frame shift step determines the distance by which the window shifts
along the signal. Remember that if the frame shift step exceeds the frame size
itself, not all signal points will be involved in the spectrogram calculation.
The optimum frame shift step lies between 1/4 and 1/2 of
the frame length. When the frame shift step is set in counts, it is displayed in
milliseconds in the adjoining field.
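The relation between counts and milliseconds, and the effect of the frame shift step, can be illustrated with a small sketch. The function names and the sample values are hypothetical, and dropping a trailing partial frame is an assumption about edge handling, not a description of SIS.

```python
def counts_to_ms(counts, sample_rate_hz):
    """Convert a length in counts (samples) to milliseconds."""
    return 1000.0 * counts / sample_rate_hz


def frame_starts(signal_len, frame_len, shift):
    """Start positions of all full frames (a trailing partial frame is dropped)."""
    if shift <= 0:
        raise ValueError("shift must be positive")
    return list(range(0, signal_len - frame_len + 1, shift))


def skipped_counts(frame_len, shift):
    """Counts between consecutive frames that no frame covers (0 if they overlap)."""
    return max(0, shift - frame_len)


print(counts_to_ms(256, 10000))             # frame length shown in milliseconds
print(len(frame_starts(1024, 256, 128)))    # shift = 1/2 of the frame length
print(skipped_counts(256, 300))             # shift larger than the frame
```

A shift between 1/4 and 1/2 of the frame length makes neighbouring frames overlap, so every count contributes to several spectra.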
• Hamming,
• Hann,
• Nuttall,
• None (rectangular),
• Equi-periodic, 2.0.
See section 15.2 for the description of window types and window selection tips.
Use normalization. If you enter this field, the additional menu will appear on the
screen. There you should set the following parameters:
Using these parameters, you set the starting frequency for the spectrum amplitude
increase and the size of the rise. For example, a spectrum amplitude increase of
6 dB/oct starting from 200 Hz means that the spectrum amplitude at 400 Hz is raised by
6 dB compared with 200 Hz, and so on. By changing the spectrum increase you can obtain
the clearest spectrum image at high frequencies. The optimum amplitude increase is
found by trial and error for each particular signal.
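The dB-per-octave rise can be expressed as a simple formula: each doubling of frequency above the starting frequency adds the specified number of decibels. The sketch below assumes that frequencies at or below the starting frequency are left unchanged; the function name and defaults are hypothetical.

```python
import math


def rise_gain_db(freq_hz, start_hz=200.0, slope_db_per_oct=6.0):
    """Gain in dB applied at freq_hz for a rise of slope_db_per_oct from start_hz."""
    if freq_hz <= start_hz:
        return 0.0                      # assumption: no change below the start
    return slope_db_per_oct * math.log2(freq_hz / start_hz)


for f in (100.0, 200.0, 400.0, 800.0, 1600.0):
    print(f, rise_gain_db(f))
```

With the default values the gain grows by 6 dB for every octave above 200 Hz, matching the example in the text.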
• Normalize signal whose level is higher than ... % of the maximum.
Having set the normalization parameters, press OK and the Options menu will appear
on the screen again.
Number of formants. If the specified value is negative (below zero), the system itself
will determine the number of formants to be represented. If the value is zero, the
formants will not be calculated or represented. If you enter a positive value here, the
specified number of formants will be represented, but not more than 10.
• Formants:
  Male voice: Tenor (130-520 Hz), Baritone (110-390 Hz), Basso (80-350 Hz)
  or ContrBasso (50-220 Hz).
  Female voice: Soprano (260-1050 Hz), Mezzo Soprano (220-880 Hz) or
  Contralto (165-700 Hz).
  Child.
• Broadband:
  Male voice.
  Female voice.
  Child.
• Harmonics:
  Tone harmonics: High pitch (>220 Hz) or Low pitch (<220 Hz).
  Technical harmonics: Maximal, Medium or Minimal resolution.
Using the OPTIONS field you can set the necessary parameters. After setting the
parameters, create or choose a destination window and start the calculation
by clicking the desired data fragment type. The only reason why this
operation may not be performed is an unsuitable type of the source window.
A progress bar for the current process will be displayed at the bottom of the screen in
the message line. You can interrupt the process by pressing the <ESC> key.
15.4.1 Parameter settings
You can set the necessary parameters in the Options menu (Fig. 49).
By adjusting these parameters you can obtain the most suitable visible speech
representation. The parameters included in this menu (and in the menus nested in it)
are listed below:
Frame length. The user can choose the frame length from the list: 16, 32, ... 16384
counts. When the frame length is set in counts, it is displayed in milliseconds in the
adjoining field. For cepstrum calculation, the frame size is typically 256 counts for male
voices and 128 counts for female voices.
Frame shift. The optimum frame shift step lies between
1/4 and 1/2 of the analysis frame length. When the frame shift step is set in counts, it
is immediately displayed in milliseconds in the adjoining field.
Weighting window type. The following options are available:
• Hamming,
• Hann,
• Nuttall,
• None (rectangular),
• Equi-periodic, 2.0.
See section 15.2 for the description of window types and window selection tips.
Use result filtering. You can choose a filter of 0 to 55 points. When the filter is
enabled, the image is smoothed by a sliding average over the chosen number of
points. As a rule, averaging is not performed for cepstrum calculation.
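Sliding averaging over a chosen number of points can be sketched as a plain moving average. The manual does not say along which axis of the image the averaging runs or how the edges are treated, so the sketch below works on a single row of values and shortens the window at the edges; both choices, and the function name, are assumptions.

```python
def sliding_average(values, points):
    """Moving average over `points` neighbouring values; 0 or 1 leaves data as is."""
    if points <= 1:
        return list(values)
    before = (points - 1) // 2          # values taken to the left of the centre
    out = []
    for i in range(len(values)):
        lo = max(0, i - before)
        hi = min(len(values), i - before + points)
        window = values[lo:hi]          # shorter near the edges
        out.append(sum(window) / len(window))
    return out


print(sliding_average([0, 0, 3, 0, 0], 3))
```

A larger number of points gives a smoother image but blurs narrow details.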
Use normalisation. If you enter this field, the additional menu will appear on the
screen. There you should set the following parameters:
For cepstrum calculation, the rise from 1 count by 3 dB is usually specified. More
accurate values can be obtained manually using a dynamic spectrogram.
Having set the normalization parameters, press OK. The Options menu will appear on
the screen again.
• Male voice:
  Tenor (130-520 Hz).
  Baritone (110-390 Hz).
  Basso (80-350 Hz).
• Female voice:
  Soprano (260-1050 Hz).
  Contralto (165-700 Hz).
• Additional:
  Singing (male voice).
  Noised signal.
  Infrasound.
  Constant Harmonics (max resolution).
15.5 Calculating dynamic characteristics of the autoregressive
model of the speech signal
The following dynamic characteristics of the autoregressive model of the speech signal
can be calculated in the system:
Using the OPTIONS field you can set the necessary parameters. After setting the
parameters, create or choose a destination window and start the calculation
by clicking the desired data fragment type. The only reason why this
operation may not be performed is an unsuitable type of the source window. A
progress bar for the current process will be displayed at the bottom of the screen in the
message line. You can interrupt the process by pressing the <ESC> key.
Frame length. The user can choose the frame length: 16, 32, ...16384 counts. When
the frame length is set in counts, it is immediately displayed in milliseconds in the
adjoining field.
Frame shift. The user can set the frame shift step value. When the frame shift step is
set in counts, it is immediately displayed in milliseconds in the adjoining field.
• Hamming,
• Hann,
• Nuttall,
• None (rectangular),
• Equi-periodic, 2.0.
See section 15.2 for the description of window types and window selection tips.
Use result filtering. You can use a filter of 0 to 55 points. If enabled,
the filter smooths the image by a sliding average over the chosen number of
points.
Use normalisation. If you enter this field, the additional menu will appear on the
screen. There you should set the following parameters:
For the calculation of dynamic characteristics of the speech signal autoregressive model
cepstrum the rise is set in counts (points).
Having set the normalization parameters, press OK, and the Options menu will appear
again on the screen.
• Male voice:
  Tenor (130-520 Hz).
  Baritone (110-390 Hz).
  Basso (80-350 Hz).
• Female voice:
  Soprano (260-1050 Hz).
  Contralto (165-700 Hz).
• Additional:
  Singing (male voice).
  Noised signal.
  InfraSound.
  Constant Harmonics (max resolution).
15.6 Spectral analysis based on linear prediction of speech.
Obtaining a graphical representation of LPC frequency
response
The spectrum analysis based on Linear Prediction Coefficients (LPC) is performed in
the following way. The LPC are calculated using the Levinson-Durbin method, and the
autocorrelation coefficients are simultaneously calculated in a time window. The
window size cannot be less than the number of coefficients or more than 1024. The
speech signal is weighted in this window by a weighting function (determined by the
chosen weighting window type).
Once the LPC are determined, a sequence is formed in which the first (m+1) members
are equal to the LPC (m is the order of the all-pole model) and the remaining members are
zeros. The length of the sequence is selected according to the required frequency
resolution: at a 10000 Hz sampling rate and a resolution no coarser than 30 Hz, the
length N is no less than 10000/30, and the nearest power of two exceeding N gives the
length of the sequence.
By calculating the FFT of this sequence and taking its reciprocal, you
obtain a smoothed spectrum of the speech signal, in which the search for maxima is
performed. This method of spectral representation gives a clearer
picture of the formant structure than the common analysis methods.
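The procedure described above can be sketched end to end: autocorrelation of a frame, the Levinson-Durbin recursion, choice of the sequence length from the required resolution, and the reciprocal of the transform of the zero-padded coefficients. This is a minimal textbook sketch, not the SIS implementation: all names are hypothetical, a direct Fourier sum replaces the FFT for brevity, and windowing of the frame is omitted.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation coefficients r[0..max_lag] of one (already weighted) frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for LPC a[0..order] (a[0] = 1) from autocorrelation r."""
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err                       # reflection coefficient
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + refl * a[m - k]
        new_a[m] = refl
        a = new_a
        err *= 1.0 - refl * refl
    return a, err


def sequence_length(sample_rate_hz, resolution_hz):
    """Smallest power of two not less than sample_rate / resolution."""
    needed = math.ceil(sample_rate_hz / resolution_hz)
    n = 1
    while n < needed:
        n *= 2
    return n


def lpc_spectrum(a, n_fft):
    """1 / |A(f)| on n_fft/2 + 1 points; zero padding only sets the frequency grid."""
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(c * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, c in enumerate(a))
        spectrum.append(1.0 / abs(acc))
    return spectrum


# Autocorrelation of an ideal first-order process, used as a simple check input.
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
n_fft = sequence_length(10000, 30)
envelope = lpc_spectrum(a, n_fft)
print(a, n_fft, envelope[0], envelope[-1])
```

For a real frame, `autocorrelation(frame, order)` would supply the input of `levinson_durbin`, and the maxima of the resulting envelope would be the formant candidates.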
To obtain a 3-D image of the LPC frequency response, select the Analysis►LPC
Spectrogram menu item. The operation menu will appear on the screen. This is a
standard menu of the source-destination type described in section 4.6.
Using the OPTIONS field you can set the necessary parameters. After setting the
parameters, create or choose a destination window and start the calculation
by clicking the desired data fragment type. The only reason why this
operation may not be performed is an unsuitable type of the source window. A
progress bar for the current process will be displayed at the bottom of the screen in the
message line. You can interrupt the process by pressing the <ESC> key.
By adjusting these parameters you can obtain the most suitable visible speech
representation. The parameters included in this menu (and in the menus nested in it)
are listed below:
Frame length (counts). The user can set the desired frame length in counts. The
default value is 512.
Frame shift step (counts). The user can set the desired frame shift step in counts. The
optimum frame shift step lies between 1/4 and 1/2 of the
frame length. The default value is 128.
Frequency resolution (Hz). The value is selected from the list.
Number of LPC coefficients. The user can set the number of LPC coefficients to be
used for obtaining frequency response.
• Hamming,
• Hann,
• Nuttall,
• None (rectangular).
See section 15.2 for the description of window types and window selection tips.
Use result filtering. You can use a filter of 0 to 55 points. If enabled,
the filter smooths the image by a sliding average over the chosen number of
points.
Use normalisation. If you enter this field, an additional menu will appear on the
screen. There you should set the following parameters:
None (normalization is disabled),
All signal (the whole signal is normalized),
Highest level only (fragments exceeding the set level are normalized).
Having set the normalization parameters, press OK. The Options menu will appear
again on the screen.
Number of formants. If the specified value is negative (below 0), the system itself will
determine the number of formants to be represented. If the value is zero, the
formants will not be calculated or represented. If you enter a positive value here, the
specified number of formants will be represented, but not more than 10.
• Formants:
  Male voice: Tenor (130-520 Hz), Baritone (110-390 Hz), Basso (80-350 Hz)
  or ContrBasso (50-220 Hz).
  Female voice: Soprano (260-1050 Hz), Mezzo Soprano (220-880 Hz) or
  Contralto (165-700 Hz).
  Child.
• Smoothen formants:
  Male voice, Basso (80-350 Hz).
  Male voice, Tenor (130-520 Hz).
  Female voice.
• Harmonics:
  Tone harmonics: High pitch (>220 Hz) or Low pitch (<220 Hz).
  Technical harmonics: Maximal or Minimal resolution.
Thus, if you set the frame length equal to 10 ms and the active segment sampling rate is
10000 Hz, it corresponds to 100 counts. Then, to obtain an energy value at any point
of the signal, the system takes 50 values to the left of the point and 50 to the right,
then each value is squared (raised to the second power), the results are summed,
the sum is divided by 101, and the square root of the result is taken.
In all cases the frame where the averaging is performed shifts along the signal with a
step of 1.
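The energy (r.m.s.) calculation described above can be sketched as a centred sliding window. For a frame of 100 counts the window takes 50 values on each side of the current point plus the point itself, 101 values in total, and moves along the signal one count at a time. Shortening the window at the signal edges is an assumption, as are the names used.

```python
import math


def sliding_rms(signal, frame_counts):
    """R.m.s. value at every point, using frame_counts // 2 values on each side."""
    half = frame_counts // 2
    out = []
    for i in range(len(signal)):            # the window moves with a step of 1
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        segment = signal[lo:hi]             # 2 * half + 1 values away from the edges
        out.append(math.sqrt(sum(x * x for x in segment) / len(segment)))
    return out


energy = sliding_rms([3.0] * 300, 100)      # constant test signal
print(energy[150])
```

The result has the same number of points as the source signal, so it can be drawn as a curve aligned with the waveform.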
To calculate the signal energy, select the Analysis►Energy (r.m.s) menu item. A
standard menu of the source-destination type will appear on the screen (described in
detail in section 4.6).
After pressing the OPTIONS field a dialog window will be displayed where the user can
specify parameter Frame length, ms.
Having set the desired step value, press OK. To start the process, click on the field with
the desired data fragment type.
To calculate the zero crossing frequency, select Analysis►More… and in the
drop-down list select Zero crossing frequency. The operation menu will appear on the
screen. This is a standard menu of the source-destination type described in section 4.6.
Click on the OPTIONS field to set the operation parameters. For this operation only one
parameter should be specified - Frame length, ms.
Having set the required value, press OK. To start the process, click on the field with the
desired data fragment type. Once the calculations are completed, the curve of zero
crossing frequency will be drawn in the destination window.
15.9 Averaging
Averaging is performed for previously calculated 3-D data.
To average the data, select Analysis►Average…. Then the Average menu will appear
on the screen. This is a standard menu of the source-destination type described in
section 4.6.
To start the process, click on the field with the desired data fragment type. Once the
calculations are completed, the curve of the averaged characteristic will be drawn in the
destination window.
15.10 Pitch extraction
15.10.1 General information
Pitch extraction is performed for voiced (vocalized) areas of the speech signal. The signal
is considered tonal if the analyzed frame (window) shows periodicity of the whole signal,
or at least of its low-frequency part.
The pitch period is the duration of one period of the signal. The period length is
measured in counts or milliseconds. The average period length is about 70-80
counts for male voices and 40-50 counts for female voices at a sampling rate of
10000 Hz.
The pitch (or fundamental frequency, FF) is the reciprocal of the pitch period duration
in seconds, or equivalently the sampling rate divided by the period length in counts. The
fundamental frequency averages 120-140 Hz for male voices and 200-240 Hz for female
voices.
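The relation between the pitch period and the fundamental frequency can be checked with a short Python sketch. The function names are hypothetical and serve only as an illustration of the arithmetic.

```python
def pitch_hz_from_counts(period_counts, sample_rate_hz):
    """Fundamental frequency from a period length given in counts."""
    return sample_rate_hz / period_counts

def pitch_hz_from_ms(period_ms):
    """Fundamental frequency from a period length given in milliseconds."""
    return 1000.0 / period_ms
```

For example, a period of 80 counts at 10000 Hz gives 125 Hz, and a period of 40 counts gives 250 Hz.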
Pitch extraction is implemented for mono signals. For a stereo signal, pitch is extracted
from the left channel only.
• You can apply standard parameter settings by selecting Default for men or
Default for women.
• Alternatively, you can set the desired parameters manually by selecting the Set
options manually field.
The initial parameter values will correspond to those currently stored in the
configuration file, but they can be modified with the Copy default options to manual
field.
When you copy parameter values for manual parameter settings, standard values
specified in the Options menu are used: Default for men or Default for women,
depending on which of the options is currently active.
Once you have made manual settings, ensure that the Manually set field is selected
(highlighted in blue).
Press Set options manually to set the parameters manually. A menu containing
three active fields (Signal options, Noise and pauses options and Cepstrum
specific options), as well as a number of information fields, will appear on the screen
(Fig. 52).
To change the currently displayed values, press one of the active fields above.
A parameter value of "-1" means that the parameter is not used in this method.
Signal options:
• Abs. min. (Hz) – absolute pitch minimum – sets the lower threshold for pitch
extraction, so that pitch frequency will be computed only above this value. The
absolute minimum of the pitch frequency can be read from the
cepstrogram as the lowest value of the pitch curve.
• Abs. max. (Hz) – absolute pitch maximum – sets the upper threshold for pitch
extraction, so that pitch frequency will be computed only below this value. The
absolute maximum of the pitch frequency can be read from the
cepstrogram as the highest value of the pitch curve.
Setting minimum and maximum pitch values allows you to get a more accurate result
when you deal with a noisy signal. If after the first pitch extraction you notice
that the method returns frequencies that correspond to tonal noise
located outside the range of the real pitch frequencies, set the threshold pitch
values so that the real pitch range lies within them and the tonal noise frequency lies
outside these limits.
Noise and pauses options:
• Frame length, ms. Analysis frame length for detecting noise/pause intervals.
• Frame shift, ms. Frame shift step for identifying frames as noise/pause intervals.
• Pause threshold. Magnitude threshold for pause detection (in counts). A frame whose
energy level (to be accurate: the square root of the energy) is below this threshold
is considered a pause. The threshold is set by trial and error:
the correctness of the selected threshold is checked by ear. It makes sense to
change this parameter if you deal with a noisy signal and some of the noisy
pauses were interpreted by the system as tonal fragments at the first pitch extraction.
• Noise zero cross. Threshold value for the zero crossing frequency (in Hz). A frame
whose zero crossing frequency is above this threshold is considered noise. The
threshold is set by trial and error: the correctness of the selected threshold is
checked by ear. It makes sense to change this parameter
if your signal contains high-frequency noise components. The default parameters are
optimal for processing clean signals.
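The way the two thresholds act on a frame can be illustrated with the following Python sketch. It is not the SIS algorithm: the function name `classify_frames`, the label strings and the order of the two checks are hypothetical, and the zero crossing frequency is assumed here to be half the number of sign changes per second (two crossings per period), which the manual does not specify.

```python
import math

def classify_frames(signal, sample_rate_hz, frame_length_ms, frame_shift_ms,
                    pause_threshold, noise_zero_cross_hz):
    """Label each analysis frame as 'pause', 'noise' or 'signal'."""
    frame_len = max(1, int(round(sample_rate_hz * frame_length_ms / 1000.0)))
    shift = max(1, int(round(sample_rate_hz * frame_shift_ms / 1000.0)))
    labels = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        # Square root of the frame energy, compared with the pause threshold.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        # Assumption: two zero crossings correspond to one period.
        zc_hz = crossings * sample_rate_hz / (2.0 * frame_len)
        if rms < pause_threshold:
            labels.append("pause")
        elif zc_hz > noise_zero_cross_hz:
            labels.append("noise")
        else:
            labels.append("signal")
    return labels
```

With a 20 ms frame at 10000 Hz, a silent frame is labelled as a pause, a 100 Hz tone as signal, and a rapidly alternating sequence as noise, provided the thresholds lie between the corresponding values.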
1. Calculate the dynamic cepstrogram of the analyzed speech sample and use the
horizontal cursor to define the pitch minimum and maximum values.
2. Calculate the narrow-band spectrum of the analyzed signal and use the horizontal
cursor to determine the bottom and top borders of the signal periodicity
(banding).
3. Determine the threshold amplitude value to cut off the pauses; this value is
adjusted manually while listening to the signal.
4. Determine noise cut-off threshold value using the ZC parameter – zero crossing
frequency, which is adjusted manually while listening to the signal.
One of the ways to check pitch extraction accuracy is to overlay the obtained pitch on
the cepstrum. Therefore, before pitch extraction, calculate the signal cepstrum and use
the cepstrum window as the destination window.
Having set the desired parameters, press OK. To start pitch extraction, click on the field
with the desired data fragment type.
A progress bar for the current process will be displayed at the bottom of the screen in
the message line. You can interrupt the process by pressing the <ESC> key.
Once the calculations are completed, the calculated pitch will be drawn in the
destination window. Time is plotted along the horizontal axis from left to right, on
the same scale as the source waveform used for pitch extraction. Pitch
frequency in Hz is plotted along the vertical axis. Remember that the selected
window should either contain no segments or contain segments of the same
type.
• Female voice.
• Phone channel.
• Equal time. The equal lengths of the signals under consideration will be used for
comparison.
When you press START, the program performs pitch extraction (if sound data are open)
and compares the pitch statistics. The results of the comparison are presented in the
summary table (Fig. 54).
If you need to save the result of the comparison as text, use the COPY button.
The result is copied to the clipboard and can then be pasted into an MS Word or MS Excel
file.
15.10.6 Controlling the correctness of the pitch extraction
You can control the correctness of the pitch extraction in several ways:
• By comparing pitch extraction results obtained using various methods and taking
pitch values on different fragments.
The first way is the most convenient and obvious. To use it, calculate the pitch curve
into the same window as the cepstrum. Now you can easily check the correctness of
pitch extraction: the pitch is extracted correctly if it coincides with the darkest
band of the dynamic cepstrogram on the vast majority of the fragments.
Otherwise, calculate the pitch again, having modified the boundary frequency values (see
section 15.10.2).
If you still see a discrepancy after repeated pitch extraction, apply
step-by-step approximation to achieve correct pitch extraction. SIS
provides several ways to do this:
• Increase or reduce the average (initial) pitch in the settings so that this value +/-
initial smoothness "cuts off" undesirable errors on partials
(harmonics).
• Slightly increase the absolute minimum or reduce the absolute maximum, thus cutting
off undesirable frequency ranges.
Having specified the final settings, perform pitch extraction throughout the whole signal.
If different settings are required for different signal fragments, these fragments should
be processed separately and the obtained results should be saved into a file.
To do this, load the pitch and cepstrum into one window and open the fragment to be
corrected in the Zoom mode (see section 10.6). Then edit the pitch using the standard SIS
editing tools (for a more detailed description see section 10.7).
Frame length (counts) – set the required size of the Fast Fourier
transform (FFT).
Weighting window type - choose the required type of the weighting window:
• Hamming,
• Hann,
• Nuttall,
• None.
See section 15.2 for the description of window types and window selection tips.
will be drawn in the destination window. Before running the operation, make sure that
the destination window you specified is empty or contains segments of the same type.
Using the OPTIONS field you can set the necessary parameters. After setting the
parameters, create or choose a destination window and start the calculation
(spectrum accumulation) by clicking on the desired data fragment type. A
progress bar for the current process will be displayed at the bottom of the screen in the
message line. You can interrupt the process by pressing the <ESC> key.
Frame length (counts). The user can choose the frame length: 32..16384 counts.
When the frame length is set in counts, it is immediately displayed in milliseconds
above.
Frame shift step (counts). The user can set the frame shift step value. Frame shift
step determines the size by which the window shifts along the signal for the next FFT
calculation. It should be kept in mind that if the frame shift step exceeds the frame size
itself, not all the signal points will be included in the spectrum accumulation process.
The optimum frame shift step value should be set within the limits from 1/4 to 1/2 of
the frame length.
When the frame shift step is set in counts, it is displayed in milliseconds above.
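The effect of the frame shift step on signal coverage can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes that only complete frames are processed, which the manual does not state.

```python
def counts_to_ms(counts, sample_rate_hz):
    """Convert a length in counts to milliseconds."""
    return 1000.0 * counts / sample_rate_hz

def frame_starts(n_samples, frame_length, frame_shift):
    """Start positions of all complete frames (assumption: partial frames are dropped)."""
    return list(range(0, n_samples - frame_length + 1, frame_shift))

def skipped_samples(n_samples, frame_length, frame_shift):
    """Number of samples that do not fall into any frame."""
    covered = [False] * n_samples
    for start in frame_starts(n_samples, frame_length, frame_shift):
        for i in range(start, start + frame_length):
            covered[i] = True
    return covered.count(False)
```

For a signal of 4096 counts and a frame of 1024 counts, a shift of 512 counts (half the frame) covers every sample, while a shift of 2048 counts leaves half of the signal outside the accumulation.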
Weighting window type - choose the required type of the weighting window:
• Hamming,
• Hann,
• Nuttall,
• Gauss,
• None.
See section 15.2 for the description of window types and window selection tips.
Normalization. If this option is enabled, the signal in each frame will be normalized to
the specified amplitude value before FFT is performed. After normalization parts of the
signal recorded with different gain levels (differing in amplitude maxima) will contribute
evenly to the averaging of the spectral components. However, enabling only this option
can result in the accumulation of undesirable distortions, because noise regions located
in the current frame will be amplified as well. To avoid this effect, enable the option
Skip pauses along with Normalization.
Skip pauses. This option detects pauses, that is, signal fragments with amplitude below the
specified threshold. If this mode is enabled and the maximum amplitude value in the
current FFT frame does not exceed the threshold value, the frame is skipped.
Geometrical average. If this option is enabled, the geometrical average spectrum is
accumulated instead of the arithmetical average (i.e. the N-th root of the product of N
multipliers).
Normalization level. Sets the amplitude value for frame normalization when the
Normalization option is enabled. The maximum actual amplitude of the signal within the
frame is scaled to the specified normalization level. The remaining amplitudes are
scaled proportionally.
Magnitude for pause detection. Maximum amplitude value for pause detection. This
parameter is used only if the option Skip pauses is enabled.
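The interplay of the Normalization, Skip pauses and Geometrical average options described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SIS code: the function names are hypothetical, the spectra are assumed to be already computed as lists of magnitudes, and a bin containing a zero value is assumed to give a zero geometric average.

```python
import math

def prepare_frame(frame, normalization_level=None, pause_magnitude=None):
    """Return the frame ready for FFT, or None if it is skipped as a pause."""
    peak = max(abs(x) for x in frame)
    # Skip pauses: the frame maximum does not exceed the threshold.
    if pause_magnitude is not None and peak <= pause_magnitude:
        return None
    # Normalization: scale so that the frame maximum equals the given level.
    if normalization_level is not None and peak > 0:
        scale = normalization_level / peak
        return [x * scale for x in frame]
    return list(frame)

def average_spectrum(spectra, geometric=False):
    """Average a list of equally long magnitude spectra bin by bin."""
    n = len(spectra)
    result = []
    for k in range(len(spectra[0])):
        values = [s[k] for s in spectra]
        if not geometric:
            result.append(sum(values) / n)
        elif any(v <= 0 for v in values):
            result.append(0.0)  # assumption: a zero factor makes the product zero
        else:
            # N-th root of the product of N values, computed through logarithms.
            result.append(math.exp(sum(math.log(v) for v in values) / n))
    return result
```

Computing the geometric average through logarithms avoids overflow when many frames are accumulated. Enabling pause skipping together with normalization prevents low-level frames from being amplified to the normalization level.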
Having set all the necessary parameters, press OK to return to the Average spectrum
menu.
Using the Edit►Segment shift option you can shift formants to the left or to the right
to check the coincidence or non-coincidence of formant structures in the speech of
different speakers.
Two different methods of calculation are provided in the system: LPC (based on the
model of linear prediction coefficients) and Spectral (based on smoothed Fourier
spectra with noise subtraction). The LPC method produces more accurate results on
tonal sounds, while the Spectral method is more effective for non-tonal sounds and
noisy signals. The Spectral method determines the number of formants itself; if this number
exceeds the number specified in the menu, it discards the extra (high-frequency)
formants. The values of non-existing formants are set to the invalid
value (-1) and are not displayed. The LPC method always calculates as
many formants as specified in the menu.
To analyze the signal formant structure, select the Analysis►Formants analysis item.
The operation menu will appear on the screen. This is a standard menu of the source-
destination type described in section 4.6.
Pressing the OPTIONS field opens the options menu (Fig. 57) where the
necessary parameters are set. After setting the parameters, create or
choose a destination window and start the calculation by clicking on the
desired data fragment type. The only reason why this operation may not be performed
is an unsuitable type of the source window. A progress indicator for the current
process will be displayed at the bottom of the screen in the message line, so that you
can monitor the calculation. You can interrupt the process by pressing the <ESC>
key.
Figure 57. The Formant analysis options menu
When you change the method version (by pressing the respective field) the menu will
look slightly different, as the methods use different sets of parameters.
• for the LPC method - an arbitrary value exceeding twice the number of
formants,
• for the Spectral method - any power of 2 from 16 to 2048 (as FFT is used for the
analysis).
Frequency resolution (Hz). Used only in the LPC method; ranges from 5.3 to
689.0 Hz. The finer the resolution, the slower the formant calculation.
Method version. Used for changing the calculation method. To change the method,
press the Method version field. The selected method is shown one line lower.
• Male voice:
  Tenor (130-520 Hz).
  Baritone (110-390 Hz).
  Basso (80-350 Hz).
• Female voice:
  Soprano (260-1050 Hz).
  Mezzo Soprano (220-880 Hz).
  Contralto (165-700 Hz).
In the Zoom mode each formant is displayed in its own color (for example, the first
in green, the second in red, etc.).
Then open the contextual menu of the "zoom" window by pressing the right mouse
button (Figure 58).
To start editing formants, select the Editing formant X item in the contextual menu,
where X is the number of the formant. The corresponding label will then
appear at the bottom of the "zoom" window, and the cursor will take the form of a
pencil. Press and hold down the left mouse button and move the mouse to edit the
formant. You can change the formant number by rotating the mouse wheel.
To erase formants, select the Erasing formants or Erasing formants in region (all) item.
The Erasing formants mode allows you to delete formants by pressing the left mouse
button. Adjust the size of the eraser by rotating the mouse wheel.
After you select the Erasing formants in region (all) item, a dashed frame
appears on the screen. You can resize and position it using the mouse. After you have
set the frame size and position as desired, click the right mouse button. All
formants that fall within the dashed frame along the X axis will be deleted.
To exit the Editing formant or Erasing formants mode, select the Zoom formants item
in the contextual menu.
15.14 Histogram calculation
SIS allows calculation of histograms for waveforms and pitch.
In the Options menu, called by pressing the OPTIONS button, you can set the
necessary parameters.
For histogram calculation the entire interval between the lower and the upper boundary
is divided into sub-intervals at a specified step. Each sub-interval is related to a specific
histogram value. If the analyzed signal value falls within the specified interval, the
histogram value there is increased by 1. Having analyzed all source signal values in this
way, we can obtain a histogram. The histogram is then normalized so that the sum of all
its values multiplied by the interval length is equal to 1. Thus, the actual histogram
value after normalization equals the probability density of detecting the specified signal
value. If the histogram is smooth, it is almost independent of the step value. This can
be easily verified by computing a histogram for a loud speech signal waveform in the
interval between -500 and 500 at a step of 2, 5 or 10 counts.
Low edge. All values below the specified threshold fall into the first bin of the
histogram.
High edge. All values above the specified threshold fall into the last bin of the
histogram.
Step. Step value at which the interval between the lower and upper thresholds is
divided into sub-intervals.
All parameters are set in counts for waveforms and in Hz for pitch.
If the program runs out of memory while creating a histogram, the system will inform
you that the histogram step is too small. In this case you should either increase the step
or modify the upper and lower threshold values.
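The histogram procedure and the role of the Low edge, High edge and Step parameters can be sketched in Python as follows. The function name `histogram` is hypothetical, and the sketch is a minimal illustration of the described normalization, not the SIS implementation.

```python
import math

def histogram(values, low_edge, high_edge, step):
    """Normalized histogram: the sum of all bins multiplied by the step equals 1."""
    if not values:
        raise ValueError("no data to analyze")
    n_bins = int(math.ceil((high_edge - low_edge) / step))
    counts = [0] * n_bins
    for v in values:
        index = int(math.floor((v - low_edge) / step))
        # Values below the low edge go to the first bin,
        # values above the high edge go to the last bin.
        index = min(max(index, 0), n_bins - 1)
        counts[index] += 1
    total = len(values)
    return [c / (total * step) for c in counts]
```

Because each bin is divided by both the number of values and the step, the result approximates a probability density, which is why a smooth histogram changes little when the step is changed.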
SIS allows for reading any type of source data as text. Once read, this data can be
subsequently used to compute a histogram.
• Same/impostor pair,
• Similar histograms.
If the first option is selected, the following menu will be displayed upon the completion
of calculations (Fig. 59):
The menu contains four columns. The first column lists the calculated parameters. The
second and third columns show the names of the two compared files and the estimated
values for their histograms. The fourth column (Ratio) displays the ratios of the parameter
values for the first and second file respectively.
Median – coordinate of the point to the right and to the left of which the areas
under the histogram are equal.
Center of gravity – the first moment of the histogram. For symmetrical histograms it
coincides with the median.
Equal Error Rate (EER) – at a certain abscissa value, the probability of false rejection
equals the probability of false acceptance. This probability is called the Equal Error Rate.
EER coordinate – coordinate (abscissa value) of the Equal Error Rate point.
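The median and the center of gravity defined above can be illustrated with the following Python sketch. The function names are hypothetical; the sketch assumes equally wide bins starting at a given low edge, takes bin centers as coordinates for the first moment, and interpolates linearly inside the bin that contains the median. SIS may compute these values differently.

```python
def histogram_median(density, low_edge, step):
    """Coordinate that splits the area under the histogram into two equal parts."""
    total = sum(density)
    if total <= 0:
        raise ValueError("empty histogram")
    half = total / 2.0
    cumulative = 0.0
    for i, d in enumerate(density):
        if cumulative + d >= half:
            # Linear interpolation inside the bin (assumption).
            fraction = (half - cumulative) / d
            return low_edge + (i + fraction) * step
        cumulative += d

def center_of_gravity(density, low_edge, step):
    """First moment of the histogram, using bin centers as coordinates."""
    total = sum(density)
    if total <= 0:
        raise ValueError("empty histogram")
    weighted = sum(d * (low_edge + (i + 0.5) * step) for i, d in enumerate(density))
    return weighted / total
```

For the symmetrical histogram `[1, 2, 1]` with a step of 10 starting at 0, both functions return 15; for the skewed histogram `[3, 1]` the median lies to the left of the center of gravity.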
If the second option is selected, the following menu will be displayed upon the
completion of calculations (Fig. 60):
This menu also contains four columns. The first column lists the calculated parameters.
The second and third columns show the names of the two compared files and the estimated
values for their histograms. The fourth column (Ratio) displays the ratios of the
parameter values for the first and second file respectively.
Root mean square – square root of the histogram dispersion (variance, the second
central moment).
16 USING PLUG-INS
SIS allows you to use plug-ins: separately compiled program modules which are
attached dynamically to SIS to extend its functionality.
To use plug-ins successfully, you may need to update some dynamic libraries included
in SIS.
1. Copy the folder with the module files to the plugins folder located in the SIS
software folder. By default it is C:\Program Files\Speech Technology
Center\SIS.
If the necessary module is not found in the Plugins registration window, make sure that
the folder with the module files is in the plugins folder. Otherwise, copy it to the plugins
folder. If its contents are incomplete, copy the entire folder to the plugins folder again.
If the folder with the module files is stored in a folder other than the one mentioned
above, you can define the path to the necessary module files. Press the Paths button in the
Plugins registration window (see Fig. 61). The Paths to find modules window will
appear (Fig. 62). Press the New button, choose the necessary folder in the standard
Browser window and press the OK button.
Here press the New button, choose the necessary folder using the standard Browser
window and press the OK button.
17 TESTING THE INPUT/OUTPUT CHANNEL
Analog-to-digital and digital-to-analog converters used for digital signal processing are
characterized most completely by dynamic factors, such as signal-to-noise ratio (SNR),
total harmonic distortions (THD) and effective bit resolution (Nd) of the ADC/DAC,
obtained as a result of conversion and spectral analysis of the sample sine wave signal.
Distortions of the signal, arising in the course of the transformation process, are
reflected in the transformation noise level and the level of the first and the second
harmonics in the main signal.
$$N_d = \frac{\mathrm{SNR} - 1.76}{6.02}.$$
$$\mathrm{THD} = 10 \cdot \lg \frac{V[2] + V[3] + \dots + V[k]}{V[1]}, \ \mathrm{dB}$$
or
$$\mathrm{THD} = \frac{V[2] + V[3] + \dots + V[k]}{V[1]} \cdot 100\%,$$
where V[1], V[2], ...V[k] stand for the power of the first and subsequent harmonics,
and k equals the ratio of the maximum spectrum frequency to the frequency of the
first harmonic (fundamental frequency).
$$\mathrm{SNR} = \frac{V[1]}{\text{total sum of harmonic powers} - V[1] - V[2] - \dots - V[k]}.$$
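The three formulas of this section can be combined in a short Python sketch. It rests on several assumptions that are not stated in the manual: the function name `channel_metrics` and the dictionary keys are hypothetical, the "total sum" is taken to be the total power of all spectral components, and the SNR power ratio is converted to dB as 10·lg before it is substituted into the formula for Nd.

```python
import math

def channel_metrics(harmonic_powers, total_power):
    """harmonic_powers = [V[1], V[2], ..., V[k]]; total_power = power of the whole spectrum."""
    v1 = harmonic_powers[0]
    distortion = sum(harmonic_powers[1:])          # V[2] + V[3] + ... + V[k]
    thd_ratio = distortion / v1
    thd_db = 10.0 * math.log10(thd_ratio) if thd_ratio > 0 else float("-inf")
    noise = total_power - sum(harmonic_powers)     # everything except the harmonics
    if noise <= 0:
        raise ValueError("total power must exceed the sum of harmonic powers")
    snr_db = 10.0 * math.log10(v1 / noise)         # assumption: SNR expressed in dB
    return {
        "thd_db": thd_db,
        "thd_percent": thd_ratio * 100.0,
        "snr_db": snr_db,
        "nd_bits": (snr_db - 1.76) / 6.02,
    }
```

For example, with V[1] = 1, two harmonics of power 10^-6 each and a noise power of 10^-8, the sketch gives an SNR of about 80 dB, a THD of 0.0002 % and an effective resolution of about 13 bits.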
1. Input a sample sine wave signal (for example, from any signal generator) into the
computer. The signal frequency should be not less than 500 Hz and should not
exceed a quarter of the sampling rate; the duration should be not less than 5 sec and
the amplitude close to the maximum value (but without overload). Typically you should
input a signal of 6-7 seconds duration with a frequency of about 900 Hz.
2. Set two temporary marks to cut off fragments of 1 sec length at the beginning and
at the end of the input signal.
6. Create a destination window of any height, but spanning the entire screen width
using the field Create destination window. The window width doesn't influence
the result, but it influences the representation.
7. Press the All data field. Then the Fourier signal spectrum will be calculated and
the result will be written into the destination window.
8. Enter the menu Data►Input/output test. First the system will inquire: "Have
you inputted in the computer an ordinary sine wave signal and calculated its
Fourier spectrum by 2048 points with the Gauss weighting window?". If all steps,
described above, were performed, you should answer OK by pressing the
corresponding field. Otherwise, you should press Cancel and perform all the
necessary actions exactly as described above.
9. If you have answered OK to the previous question, a vertical multi-cursor, that is, a
system of connected vertical cursors, will appear in the active window. Each cursor
corresponds to a frequency obtained by multiplying
a base frequency F0 by an integer multiplier equal to or greater than 1.
The system processes the signal (active segment) values at these frequencies as
the square roots of V[0], V[1],..., V[k] (see the formula for THD given at the
beginning of this section). The system automatically takes into account the finite
width of the signal harmonics and does not confuse them with the noise.
10. If you start shifting the mouse to the left/right or pressing the arrow keys on the
keyboard, the cursor system will come into motion and in the Output window at
the bottom of the screen the following text will appear:
where SNR stands for the signal-to-noise ratio, THD - total harmonic distortions,
Nd - effective resolution of the input/output cards (bit), f – frequency
corresponding to the first cursor (F0).
These values are meaningless until you align the first cursor with the
main peak of the sample sine wave. You can try to do it manually, but it is rather
difficult, since even one screen pixel corresponds to a large frequency shift.
Therefore it is recommended to use the option described below.
11. Press the <F3> key. An informational menu will appear on the screen
(Fig. 63). It contains the Signal/noise ratio and Total Harmonic Distortions
fields. The harmonic distortion value is given both in dB and in percent.
If you are satisfied with the measured values, the test may be considered
complete. If you are not satisfied with the results, try to repeat the test. But
before that:
Replace the standard signal generator.
Make sure that the external input/output unit is not placed on the monitor,
computer or any other electronic device.
2. Set the sine amplitude close to the maximum value (e.g. 30000, since 32767 is
the maximum regardless of the I/O board bit capacity). Set the sine frequency to
500 Hz or more, but not greater than a quarter of the sampling rate. Set the
amplitudes of the other test signal types to zero.
3. Set the desired signal duration (about 30 sec) in the Signal length field.
4. Set the signal bit capacity in the 24-bit signal field. If this option is enabled, the
result will be a 24-bit signal; otherwise a 16-bit signal will be generated.
6. For the obtained test sine wave, repeat all operations described in the previous
section 17.1 (items 2-10). The obtained values of the signal-to-noise ratio
(105-140 dB) and total harmonic distortion (0.0013 % - 0 % for a signal with an
amplitude of 30000) are the best achievable for the given sample size
(determined by the card capacity) and the given analysis method (Gauss
window).
7. Output the sound through the line output of the board using the menu
Playback►All data and analyze its parameters with a high-quality spectrum
analyzer. The values obtained with this type of analysis will be no better than
those described in item 6 above or than the values specified in the spectrum
analyzer manual. If you are satisfied with the measured values, the test can be
considered complete. If you are not satisfied with the results, repeat the test. But
before that:
Replace the standard signal generator.
Make sure that the external input/output unit is not placed on the monitor,
computer or any other electronic device.
Note that the technical characteristics of devices are measured in SIS
without A-weighting (a frequency weighting that imitates the frequency
sensitivity of the human ear). Therefore the values obtained with SIS differ from the
A-weighted values given in the device's user manual. The difference between these
values should not exceed 10 dB.
Generate 10 seconds of a white noise signal (mono or stereo) via the Data►Generate
test waveform menu item with an amplitude of 5000 and a sampling rate between 10000 and
30000 Hz. Run the In/Out Loop Test command and calculate the spectrum of the
result (items 2-6 from section 17.1). The resulting signal will be mono or stereo,
depending on the initial one. The spectrum will be proportional to the frequency
response function. Deviations of 1.5 dB may appear in the spectrum because the
white noise generator within the system is not ideal.
To measure the signal-to-noise ratio and total harmonic distortion, generate
a sine wave signal with a frequency of 1000 Hz at a sampling rate between 10000 and
192000 Hz and with an amplitude of 30000.
Then split stereo signals into two mono signals (Edit►Mono/stereo operations
►Separate stereo-signal to 2 mono). After that select Data►In/Out Loop Test
and follow steps 2-10 from section 17.1 for the obtained result.
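If you want to prepare a comparable test signal outside SIS, a sine wave with the parameters mentioned above can be generated with the following Python sketch. The function name `make_sine` is hypothetical, and the range checks simply restate the recommendations of this chapter (frequency from 500 Hz up to a quarter of the sampling rate, amplitude not above 32767); this is not how SIS generates its test waveform.

```python
import math

def make_sine(frequency_hz, sample_rate_hz, duration_s, amplitude=30000):
    """Return a list of integer samples of a sine wave."""
    if not 500 <= frequency_hz <= sample_rate_hz / 4:
        raise ValueError("frequency should be between 500 Hz and a quarter of the sampling rate")
    if not 0 < amplitude <= 32767:
        raise ValueError("amplitude should not exceed 32767")
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        int(round(amplitude * math.sin(2.0 * math.pi * frequency_hz * i / sample_rate_hz)))
        for i in range(n_samples)
    ]
```

For example, `make_sine(1000, 48000, 1.0)` returns 48000 samples whose values stay within the range from -30000 to 30000.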
Note that before testing any external equipment you should test the
input/output channel. Avoid signal overload, but remember that
low-amplitude signals degrade the measured device parameters as a result of
round-off errors. Use software mixers to adjust the signal level.
GLOSSARY
A
Acoustics
(gr. akustikos – of or for hearing, ready to hear)
1. The science of sound, which studies elastic vibrations and waves.
2. The sound characteristics of an enclosed space or an object (audio device).
3. Acoustic level of (speech) signal – a description of the characteristics of the signal
in question (especially a speech signal) as a whole and of its elements, treated as a
physical sound process without regard to the information carried by the
signal. A spectral description of the signal is usually used at the acoustic level.
4. Speech acoustics – a part of general acoustics that studies speech signal structure and
the processes of speech production and speech perception. It is concerned with developing
methods and means of analysis, as well as with speech modeling, identification,
synthesis and compression.
Acoustic and phonetic attributes of oral speech
The attributes reflecting acoustic qualities of the vocal tract and articulation skills of the
person. These attributes are perceived and revealed with the help of technical means
and form the basis of instrumental analysis of speech signals; the attributes can be
evaluated quantitatively.
Acoustic depth of sound record
The distance between the microphone and the sound source as estimated by ear. Such
estimation is possible mainly because the sound timbre changes gradually with the
distance to the source, as does the ratio between the sound level of the given source
and the surrounding acoustic noise.
Acoustic event
A single, relatively independent, short- or long-term event heard in real time or
on a record. The term is commonly used to denote the sound of events that occur
simultaneously with the main speech signal (e.g. a knock, music, the sound of a passing
car or a TV set, etc.).
Adaptive noise reduction (adaptive filter, adaptive signal extraction)
An algorithm for signal extraction/noise reduction. An adaptive filter is a filter that
self-adjusts according to the noise characteristics within the noise reduction method
being used. An adaptive filter is usually characterized by an adaptation time constant –
the time the filter takes to respond to a changing input signal. Adaptive signal
extraction is a procedure which extracts the useful signal from a processed input signal.
Background noise and other signals are suppressed by continuously adjusting the
procedure parameters to the characteristics of this useful signal.
AGC (Automatic gain control)
A device or software tool used for automatically leveling the input speech signal.
The AGC reduces the volume when the signal is strong and raises it when it is
weak. The AGC is typically characterized by the following variables: AGC range, attack
time (for weak signals) and decay time (for strong signals). For communication
equipment, the standard values are 12..20 dB, 20 ms and 500 ms respectively.
Amplitude (magnitude)
(lat. amplitudo – size)
The maximum deviation (from the equilibrium position) of an oscillating quantity,
for example, the deviation from zero of the current or voltage in a circuit, of sound
pressure, etc. It represents the size of the vibration. In strictly
periodic vibrations, the amplitude is a constant.
In the study of harmonic sound vibrations, the amplitude means the sound pressure in a
signal, expressed by the amplitude of a current, voltage or other electrical quantity at
the output of sound converting equipment (microphone). In a signal waveform plot,
the amplitude represents the size of the deviation of the curve up or down from the zero
position.
Analog-to-digital converter (ADC)
A device which converts continuous analog signals to digital numbers. The signal is
digitized: its values are measured and stored at equal time intervals, set by a fixed
sampling rate (Fs), and its amplitude is converted to a sequence of digital codes
(quantized). The reverse operation is performed by a digital-to-analog converter (DAC).
Typical Fs values are 11025 Hz for speech and 8000 Hz for digital telephone
channels.
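The two steps described above, sampling at equal intervals and quantizing the amplitude, can be sketched in a few lines of Python. This is only an illustration: the function names, the 16-bit depth and the [-1.0, 1.0] input range are assumptions, not a description of how SIS or any particular sound card works.

```python
import math

def quantize(value, bits=16):
    """Map a value in [-1.0, 1.0] to a signed integer code (bit depth is an assumption)."""
    max_code = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, value))  # values outside the range are clipped
    return round(clipped * max_code)

def digitize_sine(freq_hz, fs_hz, n_samples, bits=16):
    """Sample a unit-amplitude sine every 1/fs_hz seconds and quantize each sample."""
    return [
        quantize(math.sin(2 * math.pi * freq_hz * n / fs_hz), bits)
        for n in range(n_samples)
    ]

# One period of a 1000 Hz tone at the telephone rate Fs = 8000 Hz gives 8 codes.
codes = digitize_sine(1000, 8000, 8)
```

A higher Fs gives more codes per second of signal, and a larger bit depth gives finer amplitude steps.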
Antiformant
A suppression (dip) in the sound spectral envelope, resulting from the resonant properties
of the sound source. The effect of antiformants is typical for nasalized sounds.
Articulation
(lat. articulo – enunciate clearly)
Articulation organs
The organs involved in the articulation process, such as the lungs, bronchi, windpipe,
larynx with vocal folds, pharynx, oral cavity, soft palate, nasal cavity, lips, teeth and
tongue (root, back, forepart and tip). The sinuses of the skull, the soft tissues of the
face and neck, as well as the whole thorax have an indirect influence on the speech signal.
Auditory estimation of a speaker
A subjective expert evaluation of a speaker’s identification (stable) characteristics,
based on (human) listener perception. An integral auditory estimate is obtained by rating
the speaker’s characteristics on narrow scales, which must be as simple as possible
(“one-dimensional”) in order to improve accuracy.
Authenticity
(gr. authentikos – reliable)
Reliability and full conformity in all significant aspects; the absence of accidental or
intentional distortions that matter for the problem concerned.
Authentic audio record
A record whose sound content fully conforms to the original sound at
the place where the recording was made. An authentic record contains no blank areas, no
removed, erased, inserted, added or superimposed audio material, no cuts and no selective
recording, and it is not obtained by staged recording.
Synonyms of the term: genuine, reliable, identical with the original.
B
Beating
A periodic change of oscillation amplitude that occurs when two harmonic vibrations with
close frequencies are added together. Beating is commonly heard as interference in
radio-electronic devices.
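As an illustration, the sum of two equal-amplitude sinusoids can be rewritten as a carrier at the mean frequency multiplied by a slowly varying envelope; the loudness then rises and falls at the difference frequency. The Python sketch below checks this identity numerically. The function names and the 440 Hz / 444 Hz example are illustrative assumptions.

```python
import math

def beat_sum(f1, f2, t, amplitude=1.0):
    """Sum of two equal-amplitude sinusoids with frequencies f1, f2 (Hz) at time t (s)."""
    return amplitude * (math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t))

def beat_product(f1, f2, t, amplitude=1.0):
    """The same signal written as a slow envelope times a carrier at the mean frequency."""
    carrier = math.sin(2 * math.pi * (f1 + f2) / 2 * t)
    envelope = 2 * amplitude * math.cos(2 * math.pi * (f1 - f2) / 2 * t)
    return envelope * carrier

def beat_frequency(f1, f2):
    """Rate (Hz) at which the amplitude of the sum rises and falls."""
    return abs(f1 - f2)

# Two tones at 440 Hz and 444 Hz beat four times per second.
beats_per_second = beat_frequency(440, 444)
```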
Bel (B)
A logarithmic unit of measurement that expresses the base-10 logarithm of the ratio of two
like physical quantities. The “bel” is named in honour of the American scientist
Alexander Graham Bell. For power or intensity quantities P1 and P2, the level in bels is
lg(P2/P1), so 1 B corresponds to P2 = 10·P1. For amplitude, voltage or current
quantities F1 and F2, the level in bels is 2·lg(F2/F1), so 1 B corresponds to
F2 = √10·F1 (approximately 3.16·F1), and F2 = 10·F1 corresponds to 2 B. The commonly
used decibel is one tenth of a bel (B).
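The relations between ratios, bels and decibels can be checked with a short Python sketch. The function names are illustrative assumptions; the formulas are the standard 10·lg (power) and 20·lg (amplitude) rules.

```python
import math

def power_ratio_to_db(p2, p1):
    """Level in decibels of power (or intensity) p2 relative to p1."""
    return 10 * math.log10(p2 / p1)

def amplitude_ratio_to_db(f2, f1):
    """Level in decibels of amplitude (voltage, current) f2 relative to f1."""
    return 20 * math.log10(f2 / f1)

def db_to_bel(level_db):
    """One bel is ten decibels."""
    return level_db / 10

# A tenfold power ratio is 1 B (10 dB); a tenfold amplitude ratio is 2 B (20 dB).
power_level_b = db_to_bel(power_ratio_to_db(10, 1))
amplitude_level_b = db_to_bel(amplitude_ratio_to_db(10, 1))
```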
C
Cepstrum
A representation of the speech signal as a set of coefficients obtained by taking the
Fourier transform of the decibel (logarithmic) spectrum of the given signal. This
primary representation of the speech signal is applied in automatic speech and speaker
recognition systems. The cepstrum is typically used in the form of MFCC (Mel Frequency
Cepstral Coefficients), where the cepstral coefficients are calculated on a nonlinear mel
scale of frequency. The mel scale is considered to approximate the human auditory system’s
response more closely.
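To give a feel for the nonlinear mel scale, the sketch below converts between hertz and mels. It uses one widely used formula, mel = 2595·lg(1 + f/700); this particular formula is an assumption here, since the manual does not say which variant of the mel scale is meant.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (2595 * lg(1 + f/700), an assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is nearly linear at low frequencies and compresses high frequencies:
# doubling 4000 Hz to 8000 Hz adds far fewer mels than doubling 500 Hz to 1000 Hz
# would suggest on a linear axis.
step_low = hz_to_mel(1000) - hz_to_mel(500)
step_high = hz_to_mel(8000) - hz_to_mel(7500)
```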
Composite stereo
The mode in which two mono signals are played simultaneously: one noise reduction method
is applied to the signal in the left channel, while another noise reduction method is
applied to the same signal in the right channel. Composite stereo
mode can increase speech intelligibility.
Constant mark
A white vertical dashed line in the data box. It is used for demarcating fragments of
data. Marks can be assigned textual inscriptions (via the Marks list menu) and selected
(checked). When a constant mark is deleted, all data in the window will be redrawn.
Current (active) window
A graphic window which serves as the source of data at the current moment. It is always
located above all other windows. The short name of the current window is outlined in the
top left corner of the window.
D
(Data) box
In SIS, a black rectangle in the graphic window with numbered coordinate axes. If
data is loaded in the window, it is displayed in the data box.
Decibel (dB)
In acoustics, a relative unit used to quantify sound pressure level; it equals one tenth
of a bel (B).
Diagnostic attributes of oral speech
The attributes which allow one to determine the accent/dialect and the social,
psychological, physiological and other characteristics of the speaker.
Dropout
A momentary loss or considerable weakening of the signal without a change in its
duration. Dropout may be caused by malfunctions of the recording medium or by features
of the recording and reproducing device.
Dynamic cepstrogram
A flat graphic representation of the speech signal in the 3-dimensional rectangular
system of coordinates: time, frequency and intensity (it shows how sound (speech)
cepstrum changes with time). Cepstrogram is a synchronous graphical representation of
a group of successively calculated instantaneous cepstra. Time is commonly laid off
along the horizontal axis, while the vertical axis shows the frequency of cepstral
components (Hz). The cepstral amplitude for a given frequency at the current time point
is shown in the image as darkening with different degrees of saturation or with the help
of a specially created colour scale. A cepstrogram is commonly used to show pictorially
the degree of voice periodicity and the fundamental frequency of that periodicity, i.e.
the pitch and its harmonics.
Dynamic range
A system operational characteristic calculated as the ratio of the maximum operational
intensity of the input quantity to the minimum intensity at which this quantity can
still be discerned against the background noise of the system; it is measured in dB. For
linear systems, the dynamic range is practically equal to the SNR, while in real systems
the noise level rises in the presence of the input signal. The term is often used to
describe the ratio of the signal’s maximum undistorted amplitude to the root-mean-square
amplitude of the background noise in the presence of a weak signal (usually 60 dB below
the allowable maximum).
Discreteness
(lat. discretus – separated)
E
Echo cancellation
The means and methods of removing echo interference (repetition of a signal and
superposition of one or more secondary, usually slightly attenuated, copies of the
initial signal) from a useful sound signal. Adaptive filters are commonly used for
automatic echo cancellation. Echo arises in communication channels from electromagnetic
signal reflections at different nodes of a distributed system, as well as from acoustic
reflection of a signal and its reappearance at the microphone.
EER (Equal Error Rate)
A reliability index of the object (speaker) recognition systems; computed at the point
where both FAR and FRR errors are equal. When quick comparison of two systems is
required, the EER is commonly used. The lower the EER, the more accurate the system
is considered to be.
Extraction of a (useful) signal/speech
A signal processing procedure that improves how well the signal is perceived against
background noise or surrounding sounds. Extraction provides greater signal
intelligibility, better speaker recognition and speech/non-speech discrimination, more
comfortable listening, etc.
F
FAR (False Acceptance Rate)
A parameter used for assessing reliability of the object (speaker) recognition systems;
defined in % as the probability that the system incorrectly declares a successful match
between the input pattern and a non-matching pattern in the database. It measures the
percent of invalid matches.
FFT (fast Fourier transform)
A simple and efficient algorithm for computing the signal spectrum via the discrete
Fourier transform. The FFT commonly requires that the number of signal points processed
in a single analysis frame be a power of 2: 16, 32, 64, 128, 256, 512, 1024, etc. In
speech technologies, the following types of spectrum are typically used. The
instantaneous signal spectrum is calculated in a single analysis frame. The average
signal spectrum is calculated by averaging the instantaneous spectra taken at fixed time
intervals (e.g. one fourth of the analysis frame length) within a specified fragment of
the speech signal (e.g. within the whole signal).
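The following Python sketch illustrates both ideas: a textbook recursive radix-2 FFT that accepts only power-of-2 frame lengths, and an average spectrum built from instantaneous spectra taken every quarter frame. It is a minimal teaching example with assumed function names, not the algorithm SIS uses, and it applies no window function.

```python
import cmath
import math

def fft(samples):
    """Recursive radix-2 FFT; the frame length must be a power of 2."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of 2")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result

def magnitude_spectrum(frame):
    """Instantaneous spectrum: magnitudes for bins 0 .. N/2 of one analysis frame."""
    return [abs(c) for c in fft(frame)[: len(frame) // 2 + 1]]

def average_spectrum(signal, frame_len):
    """Average of instantaneous spectra taken every quarter of the frame length."""
    hop = frame_len // 4
    starts = range(0, len(signal) - frame_len + 1, hop)
    spectra = [magnitude_spectrum(signal[s:s + frame_len]) for s in starts]
    return [sum(column) / len(spectra) for column in zip(*spectra)]

# A sine with exactly 4 cycles per 64-point frame peaks in bin 4.
signal = [math.sin(2 * math.pi * 4 * i / 64) for i in range(128)]
spectrum = magnitude_spectrum(signal[:64])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
averaged = average_spectrum(signal, 64)
```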
Filter
An electronic device or a mathematical algorithm implemented in software, used to remove
vibrations of certain frequencies from a composite wideband signal while allowing
vibrations in a narrower band to pass. A high-pass filter attenuates low frequencies and
lets the high ones pass through. A low-pass filter does the opposite. In a broader
sense, a filter is any means of linear modification of the input signal spectrum.
Formant
The amplitude maximum, area of energy concentration in the speech sound spectrum,
determined by the resonant properties of the vocal tract. In the speech sound 3-6
formants are commonly distinguished within the frequency range from 250 to 5000
Hz. Formant is a phonetic characteristic of sound; it contains information about the
speaker’s individual speech features. Formant with the lowest frequency is denoted F1,
the second F2, and so on to the highest frequencies.
Formant bandwidth
An interval on the frequency axis occupied by a formant; denoted as B1, B2, etc.
according to the formant number.
Fourier transform
The main tool of spectral (frequency) analysis of signals. It is based on the
assumption that all signals (processes) under consideration consist of a certain number
of harmonic (sine and/or cosine) components (called harmonics) with different
frequencies; each component has its own amplitude and initial phase angle (phase).
Fragment
In SIS, the part of data which is singled out in some way from the segment, but has not
lost its connection with the remaining data. It can be, for example, part of a segment
limited by temporary marks or part of a segment included in the highlighted interval
between permanent marks or part of a segment visible in the box.
FRR (False Rejection Rate)
A parameter used for assessing the reliability of the object (speaker) recognition
systems; defined in % as the probability that the system incorrectly declares failure of
match between the input pattern and the matching template in the database. It
measures the percent of valid inputs being rejected.
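The FAR, FRR and EER entries can be tied together with a small Python sketch. It assumes that the recognition system outputs a similarity score for each trial and accepts the trial when the score reaches a threshold; the function names and the toy scores are illustrative assumptions. The EER is estimated by scanning the observed scores as candidate thresholds and taking the point where FAR and FRR are closest.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) in percent; a trial is accepted when score >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = 100.0 * false_accepts / len(impostor_scores)
    frr = 100.0 * false_rejects / len(genuine_scores)
    return far, frr

def approximate_eer(genuine_scores, impostor_scores):
    """Scan observed scores as thresholds; return (threshold, EER estimate in percent)."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, (far + frr) / 2.0)
    return best[1], best[2]

# Toy scores: matching (genuine) trials and non-matching (impostor) trials.
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
impostor = [0.1, 0.2, 0.4, 0.5, 0.65]
eer_threshold, eer_percent = approximate_eer(genuine, impostor)
```

Raising the threshold lowers the FAR and raises the FRR, which is why a single operating point such as the EER is convenient for quick comparison of two systems.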
G
Gain-frequency characteristic (frequency response)
The gain-frequency characteristic of a sound data transmission channel is the dependence
of the signal level at the output on the frequency of a constant-amplitude input signal.
It is typically characterized by two parameters: passband and gain flatness. For an
ideal acoustic system, the gain-frequency characteristic must be flat.
H
Hamming window
A weighting (window) function applied in spectral analysis of signals and in filter
design. The following window functions also exist: Hann (Hanning) window, Blackman
window, Nuttall window, Gaussian window, etc.
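For reference, the symmetric Hamming window of N points is commonly defined as w[n] = 0.54 − 0.46·cos(2πn/(N − 1)). The Python sketch below computes it and applies it to an analysis frame; the function names are illustrative assumptions.

```python
import math

def hamming(n_points):
    """Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_points == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]

def apply_window(frame, window):
    """Multiply each sample of the analysis frame by the matching window weight."""
    return [s * w for s, w in zip(frame, window)]

# The weights taper from 0.08 at the frame edges to 1.0 in the middle.
weights = hamming(5)
windowed = apply_window([1.0] * 5, weights)
```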
Harmonic components (harmonics, overtones)
The simple harmonic (sinusoidal) oscillations which form together the complex sound
oscillations. Harmonic is a component frequency of these oscillations that is an integer
multiple of the fundamental frequency. The totality of harmonic components values
defines a voice timbre and is individual for every speaker.
Harmonic distortion
A result of various nonlinear transformations that affect the properties of the useful
signal, for example amplitude limiting or signal compression using specific encoding
algorithms. In most cases, harmonic distortion irrecoverably destroys useful
information in the signal. Total harmonic distortion (THD) is a measure of the harmonic
distortion, defined as the ratio of the sum of the powers of all harmonic frequencies
above the fundamental frequency to the power of the fundamental. THD is measured in % or
dB. For ADC/DAC systems, a THD value below 0.01% corresponds to a high quality recorder,
below 0.1% to a good quality recorder, 1% to an average quality recorder, and 10% to a
bad quality recorder.
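The THD definition can be illustrated with a short Python sketch. The power ratio follows the definition given in this entry. The percentage is computed here on the amplitude scale (the square root of the power ratio), which is a common convention but an assumption about what this manual intends; the function names and example powers are likewise illustrative.

```python
import math

def thd_power_ratio(fundamental_power, harmonic_powers):
    """Sum of the powers of the harmonics above the fundamental over the fundamental power."""
    return sum(harmonic_powers) / fundamental_power

def thd_db(fundamental_power, harmonic_powers):
    """THD expressed in decibels (10 * lg of the power ratio)."""
    return 10 * math.log10(thd_power_ratio(fundamental_power, harmonic_powers))

def thd_percent(fundamental_power, harmonic_powers):
    """THD in percent on the amplitude scale (assumed convention, see the text above)."""
    return 100 * math.sqrt(thd_power_ratio(fundamental_power, harmonic_powers))

# Harmonics carrying 0.01% of the fundamental power: about -40 dB, or 1% by amplitude.
level_db = thd_db(1.0, [6e-5, 4e-5])
level_percent = thd_percent(1.0, [6e-5, 4e-5])
```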
Harmonic (sinusoidal) oscillation
An oscillation of a physical (or any other) quantity whose value changes with time
according to a sinusoidal law: X = A·sin(ωt + φ), where X is the displacement (the value
of the oscillating quantity) at the current point of time t, A is the amplitude of
oscillation, ω is the angular frequency, (ωt + φ) is the current oscillation phase, and
φ is the initial oscillation phase. Any non-harmonic oscillation may be uniquely
represented as the sum of different harmonic oscillations, i.e. as a spectrum of
harmonic oscillations (harmonic expansion, Fourier expansion).
Hertz (Hz)
A unit of frequency, defined as the number of cycles of a periodic process per
second. One hertz means “one cycle per second”. It is named after the German physicist
Heinrich Hertz. Commonly used multiples of Hz are kHz (kilohertz, 10³ Hz), MHz
(megahertz, 10⁶ Hz), etc.
I
Impulse noise
A noise occurring as short signals with sharply increasing and decreasing amplitudes.
Interference
A process that hinders auditory perception of useful signals during playback of a
record; it appears as various kinds of noise, hum and other signals that carry
no useful information.
Inverse spectrum
A spectrum having its maximums converted into minimums of the same value and vice
versa.
Inversion
A reversal of a set (succession) of any elements. Reverse order of words.
L
Linear prediction coefficients (LPC)
One of the methods of primary speech signal representation frequently used in the
compression, recognition and synthesis systems.
LPC spectrum
The kind of speech spectrum analysis based on speech linear prediction methods (or
equivalently, autoregressive model of speech signal). LPC spectrum sometimes results
in more effective representation of the signal’s spectral characteristics than the classic
signal spectrum calculated with the help of the Fourier transform.
M
Masking of sound (speech)
A property of human hearing, a psycho-physiological phenomenon in which some components
of one signal (e.g. speech) are not heard, or are heard weakened, in the presence of
another (masking) signal in a signal mixture. For example, in the frequency
neighbourhood of a large-amplitude harmonic, weaker harmonic signals are not heard.
Masking of speech by noise is one of the main reasons for decreased speech
intelligibility in noise. Sound masking is quantified as the number of decibels by
which the hearing threshold rises in the presence of noise. Low frequency tones have a
stronger masking effect than high frequency tones. In some cases, some speech
components can be masked by other components of the same speech signal and, as a
result, speech intelligibility decreases. This phenomenon is called self-masking of
speech.
Mu-law (μ-law, u-law) algorithm
A method of time-domain (speech) signal processing, transforming instantaneous
amplitude according to the law similar to logarithmic one; allows compressing input
speech data without noticeable quality loss; primarily used in telephony (see ITU-T
Recommendation G.711).
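The near-logarithmic law can be written for a sample x normalized to [−1, 1] as F(x) = sgn(x)·ln(1 + μ|x|)/ln(1 + μ). The Python sketch below implements this continuous formula and its inverse with μ = 255. It is an illustration with assumed function names, not the segmented 8-bit codec that G.711 actually specifies.

```python
import math

MU = 255  # value of mu commonly used in telephony

def mu_law_compress(x, mu=MU):
    """Compress a sample x in [-1, 1]: sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress: sgn(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

# Quiet samples are boosted before quantization: 0.01 maps to roughly 0.23.
quiet_compressed = mu_law_compress(0.01)
restored = mu_law_expand(quiet_compressed)
```

Because small amplitudes occupy a larger share of the compressed range, a coarse quantizer applied after compression preserves quiet speech better than the same quantizer applied to the raw samples.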
N
Noise
1. Disorderly oscillations of a different physical nature, having continuous spectrum in a
sound frequency range.
2. Unwanted sound that complicates the useful signal determination and use. Any
oscillation in solids, liquids and gases can be the source of an audible and inaudible
noise. Radio-electronic (electromagnetic) noise is a random variation of current or
voltage in radio-electronic devices (for example, audio recording and reproducing
equipment).
Noise reduction (noise cancellation)
The process of removing unwanted noise (background noise) from a signal.
O
Octave
The interval spanning 6 tones (12 semitones); the interval between two frequencies
having a ratio of 2 to 1. It is used when specifying the fall-off slope of a filter’s
frequency response and when normalizing the signal in spectral analysis. For instance,
12 dB per octave means that the amplitude falls off by 12 dB (to approximately one
quarter) each time the frequency doubles.
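The arithmetic behind a "dB per octave" slope can be sketched as follows. The function names are illustrative assumptions, and the slope is idealized as constant above the cutoff frequency, which real filters only approach.

```python
import math

def octaves_between(f_low, f_high):
    """Number of octaves (frequency doublings) from f_low up to f_high."""
    return math.log2(f_high / f_low)

def rolloff_attenuation_db(f_cutoff, f, slope_db_per_octave):
    """Attenuation at frequency f for an idealized constant slope above f_cutoff."""
    return slope_db_per_octave * octaves_between(f_cutoff, f)

def db_to_amplitude_ratio(level_db):
    """Amplitude ratio corresponding to a level difference in decibels."""
    return 10 ** (level_db / 20)

# With a 12 dB/octave slope, two octaves above a 1 kHz cutoff the signal is down 24 dB.
attenuation = rolloff_attenuation_db(1000, 4000, 12)
ratio_per_octave = db_to_amplitude_ratio(12)  # about 3.98, i.e. roughly 4 times
```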
Oscillation frequency
A quantity defined as the number of oscillation periods (the number of oscillations)
occurring per time unit; is the reciprocal of the oscillation period (the duration of one
cycle in an oscillating process).
P
Pause
(lat. pausa, gr. pausis – stop, termination)
speed of vocal cords vibrations. In oral speech, this feature determines voice type (bass,
tenor, descant, etc.).
Pitch of voice (sound)
A property of voice measured by the oscillation frequency of the vocal folds: the more
oscillations occur per unit of time, the higher the pitch.
Pixel
In digital imaging, the smallest item in an image.
Pseudo stereo
The mode of stereo playback of a mono signal in which the signal in one playback channel
is delayed by a certain time and shifted in phase relative to the other playback
channel. This mode can reduce operator fatigue during listening and enhance
speech intelligibility in noise.
R
Range
A quantity setting the utmost limits of attribute change (e.g., sounding speech
attributes); difference between minimum and maximum values of the attribute.
Reverberation
The process of gradual sound attenuation in an enclosed space after the source is
removed. Each frequency component keeps sounding for a period of time that depends on
the absorption, at that frequency, of the sound reflected from the walls and objects
inside the room. Vibrations with different frequencies, excited by a sound
source, decay non-simultaneously and add to the original signal, distorting its
properties. Reverberation strongly affects the intelligibility of speech and music in an
enclosed space. The length of reverberation is characterized by the reverberation time,
i.e. the period of time during which the sound pressure decreases by a factor of 1000
(the sound pressure level drops by 60 dB). Reverberation time receives special
consideration when assessing the acoustic quality of an enclosure. The greater the room
volume (or the mean free path time of sound) and the smaller the absorption by the
bounding surfaces, the longer the reverberation time. To measure the reverberation
time, you need to record the decay of the sound pressure level after the
source stops.
S
Sampling rate (sampling frequency)
The number of samples per time unit taken from a continuous signal to make a discrete
signal. For time-domain signals, it can be measured in hertz (Hz). The inverse of the
sampling frequency is the sampling period or sampling interval, which is the time
between samples.
Segment
In SIS, a chunk of data which forms an entity and is not connected with other data.
For example, data read from a file or input from a sound card form one segment.
When data are recorded via the sound card, all the data read form a single segment. Each
new segment in the data box is shown in a different color.
Signal distortion
Any modification (loss or addition) of the sound signal that affects its
significant characteristics and worsens auditory perception parameters (intelligibility,
naturalness, speaker recognizability, etc.). Typical distortions: limitation of the
level of weak and strong sounds, clipping, limitation of the frequency range, uneven
gain across frequencies, and frequency shift due to heterodyning.
Signal energy
A root-mean-square signal value in a frame of a set width (in milliseconds), located
symmetrically relative to the current point in the signal.
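A minimal Python sketch of this definition follows. The function name, the handling of frames that run past the ends of the signal (they are simply truncated) and the rounding of the frame width to an odd number of samples are assumptions made for illustration.

```python
import math

def frame_rms(signal, center, frame_ms, fs_hz):
    """RMS value in a frame of about frame_ms milliseconds centred on sample `center`."""
    half = int(round(frame_ms * fs_hz / 1000.0)) // 2  # samples on each side of the centre
    start = max(0, center - half)
    stop = min(len(signal), center + half + 1)  # frame is truncated at the signal ends
    frame = signal[start:stop]
    return math.sqrt(sum(s * s for s in frame) / len(frame))

# A square wave of amplitude 1 has an RMS value of 1 in any frame.
square = [1.0, -1.0] * 50
rms_value = frame_rms(square, 50, 10, 1000)
```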
Signal/sound power
A quantity defined as the energy transported by the sound wave per concerned area per
time unit. Time-average power value related to unit area is called sound intensity.
Signal-to-noise ratio (SNR)
A measurement defined as the ratio in decibels of the average (or other normalized)
level of a useful signal to the average (or other normalized) level of background noise.
An audio component with a high signal-to-noise ratio (>30 dB) has relatively little
background noise accompanying the signal; a component with a low signal-to-noise
ratio (<10 dB) is noisy.
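As an illustration, the SNR can be estimated from two sets of samples, one containing the useful signal and one containing only background noise. The sketch below uses the RMS value as the "average level", which is one of several possible normalizations; the function names are assumptions.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal_samples, noise_samples):
    """SNR in decibels from the RMS levels of a signal fragment and a noise fragment."""
    return 20 * math.log10(rms(signal_samples) / rms(noise_samples))

# A signal ten times stronger in amplitude than the noise gives an SNR of 20 dB.
example_snr = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```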
Sound
A mechanical oscillation travelling through elastic media or bodies (solids, liquids and
gases), composed of frequencies within the limits of human hearing (between about
17-20 Hz and 20 000 Hz). The human ear is most sensitive in the
frequency range from 1 kHz to 5 kHz. A mechanical oscillation which is lower in frequency
than 17 Hz is called infrasound, while ultrasound is an oscillation with a frequency
greater than the upper limit of human hearing (20 000 Hz).
Sound intensity
A quantity defined as the time-average energy transported by the sound wave per unit
time through a unit area normal to the direction of wave propagation. Sound intensity is
measured as an intensity level, expressed in logarithmic units (decibels).
Sound spectrum
An acoustic representation of complex sound providing information about the frequency
of sound source, pitch harmonics and relative intensity of all its frequency components.
Speaker identification
The process of comparing the speech of an unknown speaker against a database of the
speech samples of known speakers to determine whether it matches any of the
templates or not, i.e. to identify the submitted unknown speaker with any of known
speakers.
Speaker identification characteristics
The stable individual characteristics of a speaker that are obtained from his speech:
appearance and speech characteristics, as well as subjective auditory estimation of a
speaker.
Speaker recognition
A generalized term including identification, verification and speaker separation.
Speaker verification
A procedure of checking whether the speaker whose speech is analyzed is the person
he or she claims to be (e.g. by entering a specific PIN code). Verification itself is
one of the pattern recognition problems, in which it is necessary to accept or reject a
hypothesis of identity of the two given classes (patterns).
Spectral range
The frequency range of the spectrum within which the given event or object (e.g. the
signal) is considered. It is commonly specified by the lower and upper frequency bounds
in Hz. For example, the spectral range of the standard telephone channel is 300-3400 Hz.
Spectrogram
A graphic representation for the results of sound vibrations spectral analysis.
Spectro-temporal analysis of speech recording
The instrumental method of speech signal analysis used to establish dependences
between the frequency and peak characteristics of speech spectrum and the duration of
the speech process. Spectro-temporal analysis provides the most complete
representation of speech in the form of a continuously changing spectrum of sound
vibrations produced by the resonator parameters of the vocal tract constantly varying in
the time domain.
Spectrum (frequency, harmonic, Fourier spectrum)
A parametric representation of a signal as a set of coefficients of its decomposition
into sinusoids with fixed frequencies, called harmonics or Fourier functions. In most
devices and algorithms that process or use the speech signal for recognition, synthesis,
compression or cleaning, the signal is first converted from the time domain to a
spectral representation (the spectrum is calculated) by means of the Fourier transform
or its implementation optimized for computer calculations, the fast Fourier transform
(FFT). Spectral analysis is the procedure of producing the spectrum; it is carried out
on a sequence of signal points called an analysis frame or a signal sample. From the
computational standpoint, it is important how many points this procedure uses in one
analysis frame; a typical number is 256. In the graphic representation of a spectrum,
the distance in Hz between adjacent spectrum points is called the spectral resolution.
For each spectral component, the width of the spectrum analysis band is called the
spectral width. Spectral analysis of a speech signal is the decomposition of the signal
into resonant frequencies with certain amplitudes by means of a set of filters (a filter
bank) or fast Fourier transform algorithms.
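For a spectrum computed by the FFT, the distance between adjacent spectrum points follows directly from the sampling rate and the number of points in the analysis frame: resolution = Fs / N. The Python sketch below shows the arithmetic; the function names are illustrative assumptions.

```python
def spectral_resolution_hz(fs_hz, n_points):
    """Distance in Hz between adjacent spectrum points for an N-point analysis frame."""
    return fs_hz / n_points

def bin_frequency_hz(k, fs_hz, n_points):
    """Frequency in Hz of the k-th spectrum point (k = 0 .. N/2)."""
    return k * fs_hz / n_points

# A 256-point frame at Fs = 11025 Hz gives spectrum points about 43 Hz apart;
# at Fs = 8000 Hz the spacing is 31.25 Hz.
resolution_speech = spectral_resolution_hz(11025, 256)
resolution_phone = spectral_resolution_hz(8000, 256)
```

A longer analysis frame gives finer frequency resolution but averages the signal over a longer time interval.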
Speech intelligibility
A measure of sound clarity that indicates the ease of speech understanding. It is a
composite function of the speaker’s clarity of articulation, the properties of the room
acoustics, and the quality of the transmission channel and recording/playback
equipment. Quantitatively, intelligibility is defined as the ratio of the number of
phrases (words, syllables or sounds) accurately understood by a listener to their
total number. Accordingly, phrasal, verbal, syllabic and sound intelligibility are
distinguished.
Speech perception
Refers to the processes by which humans are able to interpret and understand the
speech sound signal. Speech perception includes the initial auditory analysis, acoustic
features extraction, as well as their phonetic, prosodic and semantic representation.
Speech sound
A minimum unit of speech flow resulting from human articulation activity. Speech sound
is characterized by specific acoustic and perceptive properties.
Speech tempo
The speed of pronouncing the speech elements: sounds, syllables and words. Is
measured by either a number of sounds, syllables, etc. being pronounced in a time unit,
or by their average length.
T
Temporary mark
A yellow vertical dashed line in the data box used for temporarily marking fragments of
data. There can be from 0 to 2 temporary marks in the box. If you try to set a third
mark, the first of the two marks already set will disappear.
Threshold of hearing
The minimum sound level that an average person can hear in a noiseless environment.
This point varies from person to person, but is generally reported as an RMS
(root-mean-square) sound pressure of 20 micropascals, or 2×10⁻⁴ dynes per square
centimeter, at a frequency of 1 kHz.
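Taking the 20 micropascal threshold as the reference pressure, a sound pressure level in decibels can be computed as 20·lg(p / p0). The Python sketch below shows this together with the unit conversion quoted above (1 Pa = 10 dyn/cm²). The function names are illustrative assumptions.

```python
import math

P_REF_PA = 20e-6  # reference RMS sound pressure: 20 micropascals

def spl_db(p_rms_pa):
    """Sound pressure level in dB relative to the 20 micropascal threshold."""
    return 20 * math.log10(p_rms_pa / P_REF_PA)

def pascal_to_dyn_per_cm2(p_pa):
    """Convert pressure from pascals to dynes per square centimeter (1 Pa = 10 dyn/cm^2)."""
    return p_pa * 10.0

# The threshold itself is 0 dB; an RMS pressure of 1 Pa is about 94 dB.
threshold_level = spl_db(P_REF_PA)
one_pascal_level = spl_db(1.0)
```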
Timbre
A subjective quality or color of a speech sound perceived as the impression of the
totality and ratio of the spectral component levels. The following timbre types are
distinguished: epiglottal timbre (also called sound timbre) – the sound quality
depending on location of different articulation organs, not participating in the voice
production, and acoustic processes pertaining to them; glottal timbre – the sound
quality determined by operation of organs involved in the voice production.
Transcription
(lat. transcriptio – rewriting)
The process of matching the sound units of human speech (segmental transcription) and
its intonation units (suprasegmental transcription) to special written symbols, using a
set of exact rules, so that these sounds can be reproduced later. There are two types of
transcription: phonetic and phonematic. Phonetic transcription renders spoken speech in
writing, taking into account all its sound and intonation features. Commonly, special
symbols based on English and Russian letters (the Roman and Cyrillic alphabets) and
additional superscript or interlinear diacritical marks are used. For linguistic
purposes, the IPA (International Phonetic Alphabet) standard transcription scheme is
frequently used.
V
Visible speech (sonogram, dynamic spectrogram)
A flat graphic representation of the speech signal in a 3-dimensional rectangular
system of coordinates: time, frequency and intensity (it shows how the spectrum of
compound sounds, including speech, changes with time). A spectrogram is a synchronous
graphical representation of a group of successively calculated instantaneous spectra.
Time is commonly laid off along the horizontal axis, while the vertical axis shows
frequency (Hz). The intensity of every frequency component at the current time point is
shown in the image as darkening with different degrees of saturation or with the help
of a specially created colour scale. The visible speech representation describes almost
uniquely the key characteristics of the speech signal sounding (its formant and harmonic
components).
Vocal tract
The totality of articulation organs.
W
Waveform
Waveform of the speech signal is a graphic representation of the signal vibration
amplitude as a function of time. Waveforms can be obtained using signal processing
equipment: loop waveform viewers, signal level recorders and electronic waveform
viewers. Waveforms can be used to extract fragments of data for further research.
Window
In SIS, a rectangle displayed on the main screen, together with the graphical
representation in it and the data contained only inside this window.
White noise
A noise in which sound vibrations of different frequencies are equally represented,
that is, the signal contains equal power within a fixed bandwidth at any center
frequency (e.g. the noise of a waterfall). White noise draws its name from white light.
The amplitudes of a white noise signal at successive points of time are independent of
each other. To the ear, white noise sounds like a uniform hiss. It typically
appears as tape noise, amplifier background noise, etc. Widely available hardware and
software white noise generators are intended for testing audio equipment and signal
processing algorithms.
Wow (and flutter)
A signal distortion, resulting from parasitic frequency modulation with frequencies taken
within the 0.2-200 Hz range which is caused by irregular tape motion during recording
or playback. “Wow” effect is typical for analog tape recorders.
02-090609–7.0.1.