Initially, our set of files was judged by two human experts "by ear" to match the voices to actual identities. It was noted that this was not an easy task. Out of the 246 files, there were 23 mismatches between the human judges and the automatic results. These 23 recordings were then re-verified by the experts and the automatic result was deemed correct in all but one recording, which turned out to be outside of the enrollment set.
While we are very pleased with our results, there are a few things that can be done to improve the outcome if it does not work as intended:
improve the quality of the data – e.g. clean up the audio, fix endpointing, make sure voice activity detection works properly
retrain the normalization, transformation and PLDA classifier on our data – this is not very difficult (a rough sketch follows this list), but it depends on the number of samples available
adapt the XVector models – this would be very difficult to develop and also very time-consuming
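For the second point, the sitw/v2 recipe that ships with Kaldi shows how the mean, the transform and the PLDA model are estimated. A rough sketch of redoing this on your own data could look like the example below; the directory names are only placeholders, it assumes XVectors for the adaptation data have already been extracted, and the exact options should be taken from the recipe itself:
ivector-mean scp:exp/adapt_xvectors/xvector.scp exp/adapt_xvectors/mean.vec
ivector-compute-lda --total-covariance-factor=0.0 --dim=128 \
  "ark:ivector-subtract-global-mean scp:exp/adapt_xvectors/xvector.scp ark:- |" \
  ark:data/adapt/utt2spk exp/adapt_xvectors/transform.mat
ivector-compute-plda ark:data/adapt/spk2utt \
  "ark:ivector-subtract-global-mean scp:exp/adapt_xvectors/xvector.scp ark:- | transform-vec exp/adapt_xvectors/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
  exp/adapt_xvectors/plda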
How it was done
Here we will explain step by step how to obtain the above results using Kaldi. Installing it is not too difficult if you are used to open-source projects. In fact, Kaldi is designed to be as self-contained as possible, so it minimizes the number of components that affect the system as a whole and mostly lives within a single chosen directory. This was done to ease deployment in cluster computing environments. The idea is to clone the official repository, https://github.com/kaldi-asr/kaldi, and follow the INSTALL instructions therein. They are targeted mostly at Linux and Mac, but a Windows setup also exists in the windows sub-directory.
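On a typical Linux machine, the build boils down to something like the following (the INSTALL files remain the authoritative reference, and the -j value is just an example):
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools
extras/check_dependencies.sh   # reports any missing system packages
make -j 4
cd ../src
./configure --shared
make -j 4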
The models we used are available on the project's official model page: http://kaldi-asr.org/models.html We used the model with the code M8, which seemed to be trained on the largest amount of data (at the time of publishing). We used version 1a, which is the XVector model referred to above. After unpacking the archive, you will find the exp/xvector_nnet_1a folder containing the XVector and PLDA models; everything else is safe to ignore and delete. To make things work, you will also need to copy or symlink the following files and folders from the Kaldi distribution:
$KALDI_HOME/egs/sitw/v2/path.sh – you need to copy and edit this file to point to the location where you installed Kaldi
$KALDI_HOME/egs/sitw/v2/conf – contains the configurations files with parameters used to train the model
$KALDI_HOME/egs/sitw/v2/sid – contains scripts related specifically to speaker identification
$KALDI_HOME/egs/sitw/v2/steps – contains generic scripts to different procedures used in Kaldi
$KALDI_HOME/egs/sitw/v2/utils – contains small utility scripts useful for data manipulation and error checking
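Assuming the model archive has been unpacked somewhere and we work in a fresh directory (the paths and directory names below are only an example), the setup could look like this:
mkdir speaker-id && cd speaker-id
cp -r /path/to/unpacked_model/exp .          # should end up as exp/xvector_nnet_1a
cp $KALDI_HOME/egs/sitw/v2/path.sh .         # edit KALDI_ROOT inside to point to your Kaldi install
ln -s $KALDI_HOME/egs/sitw/v2/conf .
ln -s $KALDI_HOME/egs/sitw/v2/sid .
ln -s $KALDI_HOME/egs/sitw/v2/steps .
ln -s $KALDI_HOME/egs/sitw/v2/utils .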
Next, we create the data directory and, inside it, two subdirectories: enrollment and test. Each of them contains the following files:
wav.scp – contains the list of audio files and their paths, e.g.:
utt1 /mnt/audio/utt1.wav
utt2 /mnt/audio/utt2.wav
utt3 /mnt/audio/utt3.wav
utt4 /mnt/audio/utt4.wav
text – this would normally hold transcriptions for the above files, but since we don't need them for this task, we can leave the transcriptions empty. All that is needed is a list of the utterance IDs, which you can generate with the following command:
cut -f1 -d' ' wav.scp > text
utt2spk – a mapping of files to the names of the speakers. For enrollment this could look like this:
utt1 spk1
utt2 spk1
utt3 spk2
utt4 spk3
For test, the speakers aren't known, so we can map each file to a new, unknown speaker. It is easiest to simply use the name of the file as the name of the speaker, e.g.:
test1 test1
test2 test2
test3 test3
test4 test4
spk2utt – the inverse of the above mapping. This can be easily obtained using the following utility script:
./utils/utt2spk_to_spk2utt.pl data/enrollment/utt2spk > data/enrollment/spk2utt
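If your audio files follow a consistent naming convention, the whole enrollment directory can be generated in one go. The sketch below assumes files named <speaker>_<utterance>.wav under /mnt/audio/enrollment (both the naming scheme and the path are only examples); Kaldi also expects the resulting lists to be sorted:
mkdir -p data/enrollment
for f in /mnt/audio/enrollment/*.wav ; do
  utt=$(basename "$f" .wav)
  spk=${utt%%_*}                              # speaker name: part before the first underscore
  echo "$utt $f" >> data/enrollment/wav.scp
  echo "$utt $spk" >> data/enrollment/utt2spk
done
sort -o data/enrollment/wav.scp data/enrollment/wav.scp
sort -o data/enrollment/utt2spk data/enrollment/utt2spk
cut -f1 -d' ' data/enrollment/wav.scp > data/enrollment/text
./utils/utt2spk_to_spk2utt.pl data/enrollment/utt2spk > data/enrollment/spk2utt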
Once these are created, we can check their correctness using the following script:
./utils/validate_data_dir.sh --no-feats data/enrollment
If everything is okay, you should get a "Successfully validated data-directory" message. Otherwise, detailed information about the error will be provided. Note that this and all the following examples use the enrollment directory, but they should also be repeated on the test directory.
After that, we can extract the features from the files. This script will automatically use the options provided in the conf directory. Note that the files should be saved as uncompressed WAV files, with one channel and a sampling frequency of 16 kHz. Otherwise, this step will likely fail:
./steps/make_mfcc.sh --nj 10 data/enrollment
The nj option stands for "number of jobs" and allows speeding up the computation using multiple processes. It cannot be larger than the number of speakers in the directory and should be similar to the number of CPUs available on your system.
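As mentioned above, this step expects 16 kHz, single-channel, uncompressed WAV files. If your recordings are not already in that format, a tool such as sox (not part of Kaldi) can convert them beforehand; the paths below are placeholders:
mkdir -p /mnt/audio/converted
for f in /mnt/audio/original/*.wav ; do
  sox "$f" -r 16000 -c 1 -b 16 "/mnt/audio/converted/$(basename "$f")"
done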
The next step is to compute the VAD:
./sid/compute_vad_decision.sh --nj 10 data/enrollment
As mentioned earlier, this is a fairly trivial, energy-based detector that is sufficient for our files (which were recorded in a booth and have very little background noise). For more challenging recordings, you should look at neural-network-based solutions, such as the one included in model M4 on the Kaldi models page.
Now we can extract the XVectors using the following command:
./sid/nnet3/xvector/extract_xvectors.sh exp/xvector_nnet_1a data/enrollment exp/enrollment_xvectors
The last argument is the destination directory where the XVector files are going to be stored. To visualize them, we first need to do some preprocessing:
ivector-subtract-global-mean $plda/mean.vec scp:$xvector/xvector.scp ark:- | transform-vec $plda/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark,t:xvectors.txt
Here $plda points to the exp/xvector_nnet_1a/xvectors_train_combined_200k directory and $xvector points to the chosen XVector directory, e.g. exp/enrollment_xvectors. The resulting xvectors.txt file will contain a list of XVectors that can easily be parsed in Python:
import numpy as np

def load(file):
    label = []
    vec = []
    with open(file) as f:
        for l in f:
            tok = l.strip().split()
            label.append(tok[0])                        # utterance ID
            vec.append([float(v) for v in tok[2:-1]])   # skip the '[' and ']' tokens
    return label, np.array(vec)
Using tSNE is easy in scikit-learn:
from sklearn.manifold import TSNE
# data: the XVector matrix, e.g. the vec array returned by load() above
emb = TSNE(n_components=2).fit_transform(data)
Note that if you want to compare different sets (e.g. enrollment and test), you should run the above command on all your data together, as doing it separately can generate a different embedding for each set, and then they wouldn't match in the final plot. To draw the result, we can use matplotlib's scatter function:
import matplotlib.pyplot as P
P.scatter(emb[:,0],emb[:,1])
Finally, we want to compute the scores and find a speaker for each file. First, we need to create a trials file describing which scores we want to compute:
cut -f1 -d' ' data/enrollment/spk2utt | while read spk ; do
cut -f1 -d' ' data/test/utt2spk | while read utt ; do
echo $spk $utt
done
done > trials
We can then compute the scores using the following command:
ivector-plda-scoring --normalize-length=true --num-utts=ark:exp/enrollment_xvectors/num_utts.ark \
$plda/plda \
"ark:ivector-mean ark:data/enrollment/spk2utt scp:exp/enrollment_xvectors/xvector.scp ark:- | \
ivector-subtract-global-mean $plda/mean.vec ark:- ark:- | \
transform-vec $plda/transform.mat ark:- ark:- | \
ivector-normalize-length ark:- ark:- |" \
"ark:ivector-subtract-global-mean $plda/mean.vec scp:exp/test_xvectors/xvector.scp ark:- | \
transform-vec $plda/transform.mat ark:- ark:- | \
ivector-normalize-length ark:- ark:- |" \
trials scores
This command takes all the files we created earlier and generates a text file called scores, which contains a score for each speaker-utterance pair listed in the trials file. To convert it into something more readable, we used this Python program to generate an Excel file:
import argparse
import xlsxwriter

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('scores')
    parser.add_argument('output')
    args = parser.parse_args()

    spks = {}   # speaker name -> column index
    utts = {}   # utterance -> {speaker: score}
    with open(args.scores) as f:
        for l in f:
            tok = l.strip().split()
            spk = tok[0]
            utt = tok[1]
            score = float(tok[2])
            if spk not in spks:
                spks[spk] = len(spks)
            if utt not in utts:
                utts[utt] = {}
            utts[utt][spk] = score

    workbook = xlsxwriter.Workbook(args.output)
    ws = workbook.add_worksheet()
    ws.write(0, 0, 'Utterance')
    ws.write(0, 1, 'Best speaker')
    ws.write(0, 2, 'Best score')
    for i, spk in enumerate(spks.keys()):
        ws.write(0, 3 + i, spk)
    for r, (utt, scs) in enumerate(utts.items()):
        ws.write(r + 1, 0, utt)
        best_spk = ''
        best_sc = -999999
        for spk, sc in scs.items():
            c = spks[spk] + 3
            ws.write(r + 1, c, sc)
            if best_sc < sc:
                best_sc = sc
                best_spk = spk
        ws.write(r + 1, 1, best_spk)
        ws.write(r + 1, 2, best_sc)
    workbook.close()
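Saving the above program as, say, scores_to_xlsx.py (the name is arbitrary) and installing the xlsxwriter package, it can then be run like this:
pip install xlsxwriter
python3 scores_to_xlsx.py scores results.xlsx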