Topic: Speaker identification. Group: 051-20. Completed by

Date: 08.01.2024

Training data

Tools and methods

Speaker identification by voice audio analysis is an area with a long tradition and many applications in research, government, and commerce. The NIST SRE (Speaker Recognition Evaluation) events have been organized since 1996, and in 2018 they attracted 48 teams from all around the world. This sustained attention has led to many methods being developed over the years.
For much of the first half of this decade, the prevailing approach was the i-vector. This technique uses factor analysis to convert any recording into a low-dimensional representation of the speaker known as the identity vector. The idea of using vector representations for large-scale data is not unique to speaker analysis: the popularity of embeddings made the method attractive in many situations where speaker information may be useful. Apart from the aforementioned speaker identification and diarization, such a vector is easy to integrate into other models, for example one for speech recognition. As shown below, it can also be easy to visualize, which makes it more accessible to people with different backgrounds.
It didn’t take long for others to try to improve on that approach with another technique that is well suited to generating vector output: the deep neural network. The idea of using artificial neural networks (ANNs) for speaker recognition is not new, but recent advances in methods and available data have made it possible to achieve better results than ever before. This created a form of competition in which people came up with original names like the d-vector and the x-vector to secure their prominence in the field. Regardless of the motive, the idea stays close to that of its progenitor, at least from the perspective of the user: each recording, regardless of length, is converted into a constant-length, low-dimensional embedding in a vector space that represents the speakers of a chosen population present in the data.
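The property that matters to the user, a constant-length vector regardless of recording length, can be illustrated with a minimal sketch. It assumes frame-level features are already available and uses plain mean and standard-deviation pooling as a stand-in. A real i-, d-, or x-vector extractor is a trained model, and the name `pool_embedding` and the toy data are hypothetical.

```python
from statistics import fmean, pstdev


def pool_embedding(frames):
    """Collapse a variable-length list of frame feature vectors into one
    fixed-length vector: per-dimension means followed by per-dimension
    standard deviations. Illustrative stand-in, not a trained extractor."""
    if not frames:
        raise ValueError("need at least one frame")
    columns = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


# Two toy "recordings" with 3 features per frame but very different lengths.
short_recording = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
long_recording = [[float(i), 1.0, -float(i)] for i in range(500)]

short_vec = pool_embedding(short_recording)
long_vec = pool_embedding(long_recording)

# Both embeddings have the same length (2 * number of features).
print(len(short_vec), len(long_vec))
```

Whatever the extractor, the downstream steps only ever see vectors of one fixed size, which is what makes them easy to compare, average, and plug into other models.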
Speaking of data, these novel approaches would not be possible without large annotated speech databases. Speech recognition corpora like Librispeech and CommonVoice can often be used for this purpose, but the two most popular corpora mentioned in competitions like SRE are Speakers-In-The-Wild and VoxCeleb.
This is a good place to mention the nomenclature used for describing data subsets in these circles. Normally in machine learning, “training data” (used for training the model) is kept completely separate from “test data” (also called evaluation data, used for assessing the final performance of the model) and “development data” (also called validation data, used for tuning the hyper-parameters of the model). In speaker id competitions, these terms often mean something else. Training data usually stands for the enrollment data used to designate the speakers we are trying to identify; test data is the collection of unlabeled recordings that we need to identify and on which we measure our results; and development data are the large corpora (mentioned above) used to teach the model the common aspects of voices in the general population. It is not uncommon to use more than one “train/test” pair while developing a model. One is made available during the competition, but if you want to make sure you are making progress while preparing for it, you had better build a train/test set of your own or use one from a previous competition, if available.
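To keep the three roles apart in code, one option is a small container like the sketch below. The class name, field layout, and file paths are all hypothetical; the only point is that enrollment data is keyed by speaker, while development and test data are not.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpeakerIdData:
    """Data roles as the terms are used in speaker id competitions."""
    # Large general-population corpora used to train the extractor
    # and to estimate global statistics ("development data").
    development: List[str] = field(default_factory=list)
    # Known speakers and their recordings ("training" / enrollment data).
    enrollment: Dict[str, List[str]] = field(default_factory=dict)
    # Unlabeled recordings to identify and score ("test data").
    test: List[str] = field(default_factory=list)


data = SpeakerIdData(
    development=["corpus/utt0001.wav", "corpus/utt0002.wav"],
    enrollment={
        "speaker_a": ["enroll/a1.wav", "enroll/a2.wav"],
        "speaker_b": ["enroll/b1.wav"],
    },
    test=["test/unknown1.wav"],
)

enrolled_speakers = sorted(data.enrollment)
print(enrolled_speakers)
```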
It is important to note that the attractiveness of the vector representation does not make its use in problems like speaker identification trivial. The whole algorithm, as optimized for use in the competitions, has several steps:

voice activity detection – it is important to discard any audio that does not belong to the voice of the person we are trying to identify, as it would skew the results.
?-vector extraction (i-, d-, or x-vector) – the extraction process can be tuned in many ways; it can produce a series of vectors for short segments that are later averaged, or a single vector for one long piece of audio.
training data – each speaker can have several samples (i.e. recordings) of their voice available in the enrollment data. We need to average the vectors to obtain one vector per speaker we wish to identify. It may also be useful to keep the number of vectors that went into each average, as that can be used in the final classification step.
global mean subtraction and normalization – this is a standard preprocessing step useful for any classification task, but here it is worth noting that the mean is usually computed on the much larger development set rather than on the test data alone.
LDA transformation – dimensionality reduction with linear discriminant analysis (LDA) is also common in classification. The transformation is likewise estimated on a larger data set and tuned to maximize the performance of the classifier.
PLDA classification – probabilistic LDA (PLDA) is the most popular classification algorithm used in speaker id challenges. It scores every speaker-test pair individually. The maximum score determines the winner, and its value can also indicate that no enrolled speaker matches the provided sample.
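The enrollment averaging, mean subtraction and normalization, and scoring steps above can be sketched roughly as follows. This is a simplified illustration under loud assumptions: the vectors are toy two-dimensional values rather than real embeddings, voice activity detection, extraction, and LDA are omitted, and cosine similarity is swapped in for PLDA scoring. All function names and the threshold value are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def center_and_normalize(vec, global_mean):
    """Subtract the development-set mean, then scale to unit length."""
    centered = [v - m for v, m in zip(vec, global_mean)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm > 0 else centered


def cosine_score(a, b):
    """Dot product of unit-length vectors (stand-in for a PLDA score)."""
    return sum(x * y for x, y in zip(a, b))


def identify(test_vec, speaker_models, threshold):
    """Score the test vector against every enrolled speaker.
    Returns (best speaker, or None if below threshold; all scores)."""
    scores = {name: cosine_score(test_vec, model)
              for name, model in speaker_models.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores


# Toy embeddings. The global mean comes from the development set.
development = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
enrollment = {
    "alice": [[2.0, 0.1], [2.0, -0.1]],
    "bob": [[0.1, 2.0], [-0.1, 2.0]],
}
global_mean = mean_vector(development)

speaker_models = {}
sample_counts = {}  # number of vectors behind each average
for name, vecs in enrollment.items():
    sample_counts[name] = len(vecs)
    speaker_models[name] = center_and_normalize(mean_vector(vecs), global_mean)

test_vec = center_and_normalize([3.0, 0.3], global_mean)
winner, scores = identify(test_vec, speaker_models, threshold=0.5)

# A sample that resembles nobody falls below the (assumed) threshold.
stranger = center_and_normalize([-1.0, -1.0], global_mean)
no_match, _ = identify(stranger, speaker_models, threshold=0.5)

print(winner, no_match)
```

The threshold is what lets the maximum score double as a "no speaker matches" signal; in practice it would be tuned on held-out data rather than fixed by hand.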
