This chapter consists of a number of walk-through tutorials that illustrate how to build simple systems using the CASA Definition Language.
In the first example a system will be built which superimposes three sinusoidal signals and displays the spectrogram of the resulting signal.
In any text editor open a new file and call it `example1.ctk'.
At the top of the file type:
#!/usr/local/bin/CTKScript
This line tells the shell that the file is a script and should be interpreted by the CTKScript command in /usr/local/bin/ . (If your CTK binaries are installed somewhere other than /usr/local/bin then make sure you use the appropriate path.)
Next start the description of a new processing block. On a new line type:
BLOCK main
This creates a new block and assigns it the name `main'. A script file can define any number of blocks. The CTKScript command takes the name of the block to execute as an optional argument; by default it searches for a block called `main'. If the script contains only one top-level block then it makes sense to call this block `main' so that CTKScript works with its default block name.
The top-level block main describes the entire system and is composed of a number of interconnected lower-level blocks. These lower-level blocks can be either intermediate-level blocks (themselves composed of a network of lower level blocks) or one of the fixed library of lowest level inbuilt blocks.
For this example we must start out by adding three inbuilt sine-wave generator blocks. To do this we use the CTK script ADD command. Type:
ADD i1=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=2)
ADD i2=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=3)
ADD i3=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=4)
These SineWave blocks are generators as they have signal outputs (1 each) but no signal inputs. They have tunable parameters and in this case the parameters are set to produce sinusoids of 1 second duration at a sample rate of 1000 samples/second and with frequencies of 2, 3 and 4 Hz respectively.
We now need a way of superimposing these signals. This is done using the inbuilt block called Adder, which takes an arbitrary number of inputs and sums them to produce one output. Type:
ADD a=Adder(NINPUTS=3)
By default Adder expects two inputs. In this case we have 3 sinusoids to add, so the NINPUTS (number of inputs) parameter has been set to 3.
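As a rough illustration of what the three SineWave generators and the Adder compute together, the following Python sketch builds the same three sinusoids and sums them sample by sample. It is not toolkit code: the function names are hypothetical, and unit amplitude and zero starting phase are assumptions, since the amplitude and phase used by the inbuilt SineWave block are not stated here.

```python
import math


def sine_wave(duration, sample_rate, freq):
    """Return a sinusoid as a list of samples.

    Assumes unit amplitude and zero phase (not confirmed for CTK's SineWave).
    """
    n_samples = int(round(duration * sample_rate))
    return [math.sin(2.0 * math.pi * freq * k / sample_rate)
            for k in range(n_samples)]


def adder(*signals):
    """Sum any number of equal-length signals sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


# Same parameters as the ADD lines in the script.
i1 = sine_wave(duration=1, sample_rate=1000, freq=2)
i2 = sine_wave(duration=1, sample_rate=1000, freq=3)
i3 = sine_wave(duration=1, sample_rate=1000, freq=4)

# Three inputs, one output, as with Adder(NINPUTS=3).
mixed = adder(i1, i2, i3)
print(len(mixed), "samples in the superimposed signal")
```

One second at 1000 samples/second gives 1000 output samples, each the sum of the three input samples at that instant.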
We can now request a graphical display of the superimposed waveform. The basic mechanism for graphical output is to use the inbuilt block Display which generates a Qt-based output window. An alternative for users with MATLAB is to use the MDisplay block which uses the CTK/MATLAB interface to create a graphic in a MATLAB window. So next type either:
ADD d1=Display (Qt-based output)
ADD d1=MDisplay (MATLAB-based output)
As well as specifying the system's subblocks we must also describe how they are connected. By default the toolkit will try to connect the blocks in series in the same order as they occur in the script. If, as in this case, this is not what is wanted, then connections must be specified explicitly. In this example each of the sine wave generators must connect to one of the three inputs on the adder block, and the output of the adder block must connect to the display block. To make these connections add the lines:
CONNECT i1:out1 a:in1
CONNECT i2:out1 a:in2
CONNECT i3:out1 a:in3
The first line connects the output socket named out1 of the block called i1 to the input socket named in1 of the Adder block called a. The second and third lines likewise connect the outputs of blocks i2 and i3 to the second and third inputs of the Adder block a. Note that for all blocks the input sockets are named in1, in2, etc. and the output sockets are named out1, out2, etc. If a block has only one input socket or only one output socket, then the socket name can be omitted from the CONNECT statement without any ambiguity. So in this case we could more simply write:
CONNECT i1 a:in1
CONNECT i2 a:in2
CONNECT i3 a:in3
To connect the adder block to the display block we could add the line:
CONNECT a d1
But in this case there is no need to make this connection explicitly as it follows the default behaviour of connecting blocks in series in the order in which they are defined in the script. So this line can be omitted.
Finally add the line:
ENDBLOCK
This marks the end of a block definition.
We have now defined a simple system that will add three sinusoids and display the resulting waveform. This system is illustrated in Figure 4.1. We can now `execute' the script and see the result. From the editor first save the script, and then in the Unix shell window make the script executable by using the Unix chmod command:
chmod u+x example1.ctk
Now simply execute it by typing:
A window should appear displaying a periodic waveform. If the MATLAB interface display is used (i.e. MDisplay) there will be a short pause while the MATLAB engine starts up.
The waveform is the addition of sinusoids with frequencies of 2, 3 and 4 Hz. By changing the FREQ parameters of the SineWave blocks and rerunning the script you can see the effect of adding sinusoids with different frequencies. Alternatively you can set the sinusoid frequencies from the command line. To do this you must first edit the script so the SineWave blocks are defined as:
ADD i1=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=$1)
ADD i2=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=$2)
ADD i3=SineWave(DURATION=1, SAMPLE_RATE=1000, FREQ=$3)
The $1, $2 and $3 are command-line variables. You can now run the script by following the command name with three argument values. For example, try typing:
example1.ctk 2 3 4
or
example1.ctk 100 200 300
or with any other three frequency values after the command name.
We now wish to make a spectrogram of the signal. To do this we need to break the signal into a series of windowed frames and calculate the magnitude of the complex FFT of each frame. This is done using two more inbuilt blocks, named Frame and FFT, connected together in series. However, a spectrogram is a commonly used representation, and it is rather obscure and cumbersome to have to write out its definition in terms of inbuilt blocks every time we wish to use it. For example, to construct a spectrogram we need to add something like the following to the script:
ADD frame=Frame
ADD fft=FFT
and it would be better if we could simply type:
We can in fact use this neater alternative if we first define an intermediate-level block called Spectrogram in terms of the inbuilt blocks Frame and FFT. A number of useful intermediate-level block definitions (including the spectrogram block) come supplied with the toolkit and are stored in the toolkit scripts directory. So rather than defining our own Spectrogram intermediate-level block we can simply use the INCLUDE command to add this prewritten block to our script. So directly before the line `BLOCK main,' type:
Now add the Spectrogram block at the end of the main block definition (i.e. immediately before the ENDBLOCK)
We do not need a CONNECT command as by default the Spectrogram block will take its input from the preceding Display block (Display blocks are designed to both display their input and pass it on to their output). Finally we want to display the output of the Spectrogram. So we need to add another display block. For Qt-based output type:
ADD d2=Display (Qt-based output)
and for MATLAB output, first add:
and then edit the definition of block d1 to be
The BEFORE_PLOT parameter of the MDisplay block allows the MATLAB plot to be customised by issuing MATLAB commands that are executed immediately before the plot is generated. In this case it is used to make the waveform and the spectrogram appear as subplots in a single window.
The script is now complete. If you now go back to the Unix command shell and type:
example1.ctk 10 100 200
a window showing both the waveform and the spectrogram something like that shown in Figure 4.2 should appear. If the script fails to run examine the error message and check your script against the script listed in Appendix A.1.
This example also serves to illustrate an important feature of the data flow underlying the CASA toolkit. Whereas the data entering the Spectrogram block is a simple signal with time as the only dimension (i.e. it is 1-dimensional), the data leaving the Spectrogram block has two dimensions, time and frequency. All data must have time as a dimension, but on top of that it may have any number of other dimensions. Each inbuilt block is designed to examine the dimensionality of the data entering it and to act appropriately. So, in this example the output of the first Display block is a simple graph with time along the x-axis, while the output of the second Display block is a colour image map with time and frequency along the x and y axes. There is only one type of Display block but its behaviour alters according to the nature of its inputs.
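The change in dimensionality can be seen in a small Python sketch of the frame-then-FFT idea. This is an illustration only, not the toolkit's Frame or FFT code: the function names, the Hann window, the frame size of 64 samples and the shift of 32 samples are all assumptions chosen for the example, and a direct DFT is used in place of a fast FFT to keep the sketch dependency-free.

```python
import cmath
import math


def frame(signal, size, shift):
    """Split a 1-D signal into overlapping Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size)
              for n in range(size)]
    return [[signal[start + n] * window[n] for n in range(size)]
            for start in range(0, len(signal) - size + 1, shift)]


def dft_magnitude(samples):
    """Magnitude of the DFT for the non-negative frequency bins."""
    size = len(samples)
    return [abs(sum(samples[n] * cmath.exp(-2j * math.pi * k * n / size)
                    for n in range(size)))
            for k in range(size // 2 + 1)]


def spectrogram(signal, size=64, shift=32):
    """1-D input (time) becomes 2-D output (time by frequency)."""
    return [dft_magnitude(f) for f in frame(signal, size, shift)]


sample_rate = 1000
signal = [math.sin(2.0 * math.pi * 125 * k / sample_rate) for k in range(256)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "frames of", len(spec[0]), "bins; peak near", peak_hz, "Hz")
```

The input is a flat list indexed by time, while the output is a list of frames, each holding one magnitude per frequency bin. With these settings the 125 Hz test tone falls exactly on bin 8 of each frame.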
In the previous tutorial we saw how we can use an intermediate level block to perform the function of several connected inbuilt blocks. We saw how an intermediate block called Spectrogram was used which was composed of the series combination of a Frame block and an FFT block.
Why go to the trouble of defining an intermediate block when we could just write it out longhand in terms of inbuilt blocks? The obvious advantage is that it simplifies scripts, making them both shorter and easier to read and write. However, a far more important advantage of intermediate blocks is that they allow a degree of `design reuse'. Intermediate block definitions can be written and stored in a library. These intermediate blocks can then be used as parts of a more complex system through use of the INCLUDE mechanism. So a simpler unit, such as a spectrogram, can be designed once and then reused in any number of more complex systems. It should be noted that an intermediate-level block may itself be designed in terms of other intermediate-level blocks and there is no limit to the level of such nesting that may be used.
In the previous tutorial we used the pre-defined Spectrogram intermediate block that is provided as part of the CTK intermediate block library. All we had to do was add the appropriate INCLUDE line at the top of our script file; then we could use Spectrogram exactly as if it were one of the toolkit's inbuilt blocks. In this tutorial we will see how to define our own intermediate-level blocks so that they can be used like inbuilt blocks in this way.
In this tutorial we will construct a lossy delay line like that shown in Figure 4.3. Each block in the delay line performs a delay operation, a scaling operation, and a teeing operation to split the data flow into two outputs. No single inbuilt block can perform all these operations in one go, so we must build an intermediate block from the inbuilt blocks Delay, Scale and Tee. Figure 4.4 shows the architecture of this intermediate block.
So first we will construct the intermediate block shown in Figure 4.4 and give it a typename `tap' and then we will construct the delay line by arranging these tap blocks in series as in Figure 4.3.
In any text editor open a new file and call it `example2.ctk'.
At the top of the file type:
#!/usr/local/bin/CTKScript
Next we need to start the description of a new block to which we will give the typename `tap'. To do this we just type:
BLOCK tap
Now we add the subblocks Delay, Scale, and Tee which go together to make up the tap block. To do this we just add the lines:
ADD d1=Delay(DELAY=20)
ADD s1=Scale(X=0.5)
ADD t1=Tee
These blocks need to be connected in series. We do not need any explicit CONNECT statements because the default series connections will give the right result. However, once these three are connected there will still be a few loose connections, namely the input of the Delay block and the two outputs of the Tee block. These connections are intended to be the inputs and outputs of the intermediate block itself. This must be made explicit, which is done using the INPUT and OUTPUT commands. To handle the input add the line:
This declares that the intermediate block has an input socket and that this socket feeds into the input of the subblock called d1, i.e. it feeds into the Delay block. Likewise for the outputs we need to add the lines:
OUTPUT out1=t1:out1
OUTPUT out2=t1:out2
This declares that the intermediate block has two output sockets and they are respectively output socket 1 and output socket 2 of the block called t1 (i.e. the Tee block).
We can now complete the block definition with the line:
ENDBLOCK
With this definition the tap block has no parameters. The delay and scale factor will be fixed at 20 and 0.5 as stated in the definition of its Delay and Scale subblocks. If we want tap blocks with adjustable delays and scale factors we must make the parameters of its subblocks visible as parameters of the tap block itself (see in Figure 4.4 how tap has a parameter DELAY that is linked to the parameter DELAY of its subblock Delay). To do this we use the PARAMETER command. So just before the ENDBLOCK insert the lines:
PARAMETER DELAY=d1:DELAY
PARAMETER SCALE=s1:X
This gives the tap block parameters DELAY and SCALE, and attaches these to the DELAY parameter of the Delay block and the X parameter of the Scale block respectively.
Now that the tap block has been defined it is ready to be used in the construction of the main block. The main block will have a SineWave source block, then a series of tap blocks. Add the following lines:
BLOCK main
ADD i1=SineWave(DURATION=0.5, FREQ=5, SAMPLE_RATE=1000)
ADD tap1=tap
ADD tap2=tap
ADD tap3=tap
ADD tap4=tap
ADD tap5=tap
The default connections will string these blocks together in series. Now let's display the signal at a couple of different points along the delay line. To do this we simply add a couple of display blocks and connect them to the delay line using the spare outputs of the tap blocks. For example, if you are using the MATLAB interface add the following:
ADD first_output=MDisplay(BEFORE_PLOT="subplot(2,1,1)")
ADD second_output=MDisplay(BEFORE_PLOT="subplot(2,1,2)")
or if you are using the Qt-based graphical output add:
ADD first_output=Display
ADD second_output=Display
If you have no graphical output you can monitor the output on the terminal using:
ADD first_output=Output
ADD second_output=Output
Then to connect the outputs to the delay line add the following:
CONNECT tap1:out2 first_output
CONNECT tap4:out2 second_output
Here we have connected the displays to tap1 and tap4, but they could equally well be connected at any two other positions.
And finally finish the main block definition by adding:
ENDBLOCK
The example is now ready to be run. To do this you first need to make sure the script file is executable. In the Unix command shell type:
chmod u+x example2.ctk
Then to run the script, simply type:
If all goes well two sinewaves should appear with one reduced in magnitude and delayed with respect to the other. Using the MATLAB interface this should look something like Figure 4.5. If the script fails to run then you have probably made a typing mistake. Examine the error message and check your script against the script listed in Appendix A.2.
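To see why the two displayed waveforms differ in this way, the delay line can be mimicked in a few lines of Python. This is a sketch under stated assumptions rather than toolkit code: the function name is hypothetical, DELAY is taken to be a number of samples, the delayed signal is padded with zeros and kept at its original length, and the Tee is modelled as returning two identical copies.

```python
import math


def tap(signal, delay=20, scale=0.5):
    """Delay, then scale, then tee into two identical outputs.

    Assumes DELAY is in samples and that the output keeps the input length.
    """
    delayed = ([0.0] * delay + signal)[:len(signal)]
    scaled = [scale * x for x in delayed]
    return scaled, list(scaled)  # out1 feeds the next tap, out2 is spare


sample_rate = 1000
source = [math.sin(2.0 * math.pi * 5 * k / sample_rate) for k in range(500)]

# String five taps together in series, keeping each spare output.
line = source
spare_outputs = []
for _ in range(5):
    line, spare = tap(line)
    spare_outputs.append(spare)

first_output = spare_outputs[0]   # corresponds to tap1:out2
second_output = spare_outputs[3]  # corresponds to tap4:out2

print(max(first_output), max(second_output))
```

Under these assumptions the signal at tap1 is the source delayed by 20 samples and halved, while the signal at tap4 has passed through four taps and so is delayed by 80 samples and scaled by 0.5 to the fourth power, i.e. 0.0625.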
In this final tutorial we will look at how the CASA toolkit can be used to run speech recognition experiments.
Figure 4.6 shows the block diagram for a system that performs `missing data' speech recognition. At the bottom of the diagram is the block which performs the `missing data' Viterbi decoding. This is similar to a standard speech decoder but as well as taking a representation of the speech signal, it also has an input called a `mask' which tells the decoder which elements of the representation may be considered to be reliable. Typically some form of spectro-temporal representation is used, and the mask should indicate which spectro-temporal elements have a favourable local SNR (i.e. which elements are `clean').
Following Figure 4.6 we see that the first operation is to read the noisy speech from an AU sound file. In this example recognition will be based on a `ratemap' representation. The ratemap is a kind of auditory-inspired spectrogram and is formed by passing the signal through a gammatone filterbank, and then performing some leaky integration and downsampling. The ratemap is then duplicated using a tee block and one copy passed directly to the decoder and the other copy is passed down the right hand side of the figure through the blocks that generate the mask. The mask is generated by performing a simple noise estimation and then using a comparator to set the mask to true at the spectro-temporal points where the estimated clean signal (i.e. the noisy signal after subtraction of the estimated noise) makes up over half the energy of the noisy signal. The mask is then passed into the second input of the decoder.
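The mask rule described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the ratemap values are made up for the example, and the noise estimator (the per-channel mean of the first few frames) is an assumption standing in for the unspecified `simple noise estimation'. Only the comparator rule, which marks a point as reliable when the estimated clean signal exceeds half of the noisy signal, is taken from the description.

```python
def estimate_noise(ratemap, n_frames=2):
    """Per-channel noise estimate.

    Assumption: the first n_frames contain noise only, so their mean is used.
    """
    n_channels = len(ratemap[0])
    return [sum(frame[c] for frame in ratemap[:n_frames]) / n_frames
            for c in range(n_channels)]


def make_mask(ratemap, noise):
    """True where (noisy - noise estimate) is over half of the noisy value."""
    return [[(energy - noise[c]) > 0.5 * energy
             for c, energy in enumerate(frame)]
            for frame in ratemap]


# Toy ratemap: 4 time frames by 3 frequency channels (invented values).
ratemap = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [5.0, 1.5, 1.0],
    [1.2, 4.0, 2.5],
]

noise = estimate_noise(ratemap)
mask = make_mask(ratemap, noise)
for row in mask:
    print(row)
```

The mask has the same time-by-frequency shape as the ratemap, so each element of the representation passed to the decoder has a matching reliability flag.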
The CTK script to describe this process is listed in Appendix A.2. The script shown differs slightly from the diagram in that it omits the initial ratemap generation stage and instead reads in precomputed ratemaps. Note that the default series block connection means that very few of the block connections need to be specified explicitly.
We will now see how this script can be used either to recognise a single utterance or to recognise a small corpus of test utterances. First make a new directory called example3 and copy the example3 CTK script supplied with the distribution into this directory:
mkdir example3
cd example3
cp $CTKROOT/tutorial/example3.ctk .
The script takes two arguments: first, the name of the ratemap file to be recognised and second a string representing the correct transcription against which the result will be compared. A set of ratemaps for TIDigits mixed with factory noise at various SNRs is distributed with the toolkit under the $CTKROOT/data directory. So, to test the script on the utterance 1159 at 10dB SNR type:
example3.ctk $CTKROOT/data/factory/1159a.10 1159 ""
After typing this there will be a short pause while the HMM files are read in. Once the HMMs have been read a few lines appear summarising the HMM data. After this, recognition will commence and digits should appear on the screen as they are recognised. When recognition is complete a line of statistics will appear summarising the system's performance. The final output should look something like this:
(yelp)85% example3.ctk $CTKROOT/data/factory/1159a.10 1159
num_HMMs = 12
num_states = 8
num_mixes = 10
vec_size = 64
num_dist = 960
1157
1159
Nin: 4 Nout: 4 (H=3 I=0 D=0 S=1) Cor: 75 Acc: 75 -- Nin: 4 Nout: 4 (H=3 I=0 D=0 S=1) Cor: 75 Acc: 75 --
This shows that the utterance `1159' was recognised as `1157'. The statistics on the left show that 4 digits were `read in' and 4 digits were output; there were 3 hits, no insertions, no deletions and 1 substitution. Word correctness is 75% and word accuracy is 75%.
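The relationship between the counts and the two percentages can be written out as a short Python sketch. The toolkit's own scoring code is not shown in this chapter, so the formulas below are an assumption: they are the conventional definitions of word correctness and word accuracy, and they reproduce the figures in the example output. The function name is hypothetical.

```python
def score(hits, insertions, deletions, substitutions):
    """Return (Nin, Nout, Cor, Acc) from the four alignment counts.

    Nin  = words in the correct transcription = H + D + S
    Nout = words output by the recogniser     = H + S + I
    Cor  = 100 * H / Nin
    Acc  = 100 * (H - I) / Nin
    """
    n_in = hits + deletions + substitutions
    n_out = hits + substitutions + insertions
    correct = 100.0 * hits / n_in
    accuracy = 100.0 * (hits - insertions) / n_in
    return n_in, n_out, correct, accuracy


# `1159' recognised as `1157': 3 hits and 1 substitution.
print(score(hits=3, insertions=0, deletions=0, substitutions=1))
```

Correctness and accuracy differ only when there are insertions, which is why the two figures are equal for this utterance.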
The system seems to work well with this single utterance, but to really test it we need to run it over a large test set of several hundred utterances. One approach would be to write a shell script to repeatedly execute the example3.ctk command each time with a different ratemap file and a different transcription string. However, there are two problems with this. First, reading in the HMMs represents a sizeable computational overhead. We do not want to have to reread the HMMs for every utterance to be recognised. Second, there is no convenient way of calculating the overall performance if we have invoked example3.ctk separately for each utterance to be recognised. In order to overcome these problems we can use the CTK script argument list mechanism.
A script argument list is a text file where each line supplies the arguments for a separate invocation of the block process described by the script. These argument lists are introduced to the script on the command line using CTKScript's -S option. If a script requires more than one argument, these arguments can all be specified in a single argument list containing several columns, or they can be split across multiple argument lists with fewer columns. When more than one argument list is used each list is introduced on the command line with a separate -S. An example makes this clear. Make your window as wide as possible and then try the following command:
example3.ctk -S $CTKROOT/data/flists/test240.10.flist -S $CTKROOT/data/transcripts/transcripts_240 ""
The first -S parameter introduces the argument list containing the paths of the complete test set of 240 files to be recognised. The second argument list is a list of the correct transcriptions of these 240 utterances. Note that all the argument lists should have the same length; if they do not, an error will be reported.
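The way several argument lists combine into one set of arguments per invocation can be pictured with the following Python sketch. It is not CTKScript's implementation: the function names and the file names in the example are hypothetical, and splitting columns on whitespace and skipping blank lines are assumptions about the file format.

```python
def parse_argument_list(text):
    """One row of whitespace-separated arguments per non-blank line."""
    return [line.split() for line in text.splitlines() if line.strip()]


def combine_argument_lists(*argument_lists):
    """Join the rows of several lists column-wise, one row per invocation."""
    lengths = {len(rows) for rows in argument_lists}
    if len(lengths) > 1:
        raise ValueError(
            "argument lists differ in length: %s" % sorted(lengths))
    return [[arg for row in rows for arg in row]
            for rows in zip(*argument_lists)]


# Two single-column lists (hypothetical contents).
file_list = parse_argument_list("utt_a.10\nutt_b.10\n")
transcripts = parse_argument_list("1159\n1273o73\n")

invocations = combine_argument_lists(file_list, transcripts)
for arguments in invocations:
    print(arguments)
```

A single two-column list would give the same rows as the two one-column lists used here, which mirrors the choice described above between one wide argument list and several narrower ones.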
This command should produce output that starts like that below:
(yelp)333% example3.ctk -S $CTKROOT/data/flists/test240.10.flist -S $CTKROOT/data/transcripts/transcripts_240
num_HMMs = 12
num_states = 8
num_mixes = 10
vec_size = 64
num_dist = 960
1157
1159
Nin: 4 Nout: 4 (H=3 I=0 D=0 S=1) Cor: 75 Acc: 75 -- Nin: 4 Nout: 4 (H=3 I=0 D=0 S=1) Cor: 75 Acc: 75 --
12773o73
1273o73
Nin: 7 Nout: 8 (H=7 I=1 D=0 S=0) Cor: 100 Acc: 85.7143 -- Nin: 11 Nout: 12 (H=10 I=1 D=0 S=1) Cor: 90.9091 Acc: 81.8182 --
1627
127
Nin: 3 Nout: 4 (H=3 I=1 D=0 S=0) Cor: 100 Acc: 66.6667 -- Nin: 14 Nout: 16 (H=13 I=2 D=0 S=1) Cor: 92.8571 Acc: 78.5714 --
6128o
128o
Nin: 4 Nout: 5 (H=4 I=1 D=0 S=0) Cor: 100 Acc: 75 -- Nin: 18 Nout: 21 (H=17 I=3 D=0 S=1) Cor: 94.4444 Acc: 77.7778 --
12
12
Nin: 2 Nout: 2 (H=2 I=0 D=0 S=0) Cor: 100 Acc: 100 -- Nin: 20 Nout: 23 (H=19 I=3 D=0 S=1) Cor: 95 Acc: 80 --
Recognising the full set of 240 utterances may take some time. The recognition can be halted at any time by pressing Ctrl-C. The statistics on the left of the screen apply to the current utterance, and those on the right are a running total showing the performance so far. The final recognition accuracy after all 240 utterances have been processed should be about 84%.
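The running totals in the sample output appear to be computed from the pooled counts rather than by averaging the per-utterance percentages. The Python sketch below checks this against the first five utterances shown. It assumes the conventional definitions Cor = 100 H / Nin and Acc = 100 (H - I) / Nin with Nin = H + D + S; the toolkit's own code is not shown, and the names used are hypothetical.

```python
# (H, I, D, S) for the first five utterances in the sample output.
utterances = [
    (3, 0, 0, 1),
    (7, 1, 0, 0),
    (3, 1, 0, 0),
    (4, 1, 0, 0),
    (2, 0, 0, 0),
]


def percentages(hits, insertions, deletions, substitutions):
    """Return (Nin, Cor, Acc) for a set of alignment counts."""
    n_in = hits + deletions + substitutions
    return (n_in,
            100.0 * hits / n_in,
            100.0 * (hits - insertions) / n_in)


totals = [0, 0, 0, 0]
running = []
for counts in utterances:
    # Pool the raw counts, then recompute the percentages from the totals.
    totals = [t + c for t, c in zip(totals, counts)]
    running.append(percentages(*totals))

for n_in, cor, acc in running:
    print("Nin: %d Cor: %.4f Acc: %.4f" % (n_in, cor, acc))
```

Pooling the counts weights each utterance by its length, so a long utterance influences the overall figure more than a short one.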