Topic: speaker identification. Group: 051-20. Completed by:
Date: 08.01.2024
We collected a small sample of voices from 32 interpreters. Each interpreter had between 1 and 3 recordings, and each of the 57 recordings was between 21 and 192 seconds long. Altogether, the data set was about 83 minutes long.
After extracting the x-vector of each recording, we averaged the x-vectors per speaker to get 32 speaker vectors, each with 128 dimensions. Below, we used the popular t-SNE algorithm to visualize this multidimensional space in 2D:
Each colored dot represents a file, and files belonging to the same speaker share a color (we assigned fake names to the speakers' samples). The annotated crosses are the vectors representing the "average" for each speaker. We can observe several potential issues with the data. The samples of a given speaker are not always close to that speaker's average. Furthermore, several speakers are very close to each other, which means they will be difficult to differentiate. As an aside, note how the method separated male speakers from female ones. This was not intentional and simply stems from the similarity between voices.
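The per-speaker averaging described above can be sketched as follows. This is a minimal illustration and not the code used for the experiment: the function name `average_speaker_vectors`, the toy speaker names, and the 3-dimensional embeddings are assumptions made for readability (the real vectors have 128 dimensions).

```python
from collections import defaultdict


def average_speaker_vectors(recordings):
    """Average per-recording embeddings into one vector per speaker.

    `recordings` is a list of (speaker_name, embedding) pairs, where every
    embedding is a list of floats of the same length.
    """
    grouped = defaultdict(list)
    for speaker, embedding in recordings:
        grouped[speaker].append(embedding)

    averages = {}
    for speaker, embeddings in grouped.items():
        count = len(embeddings)
        # zip(*embeddings) walks over dimensions, so each sum is per-dimension.
        averages[speaker] = [sum(dim) / count for dim in zip(*embeddings)]
    return averages


# Hypothetical toy data: two recordings for "anna", one for "ben".
recordings = [
    ("anna", [1.0, 0.0, 2.0]),
    ("anna", [3.0, 2.0, 0.0]),
    ("ben", [0.5, 0.5, 0.5]),
]
speaker_vectors = average_speaker_vectors(recordings)
```

A speaker with a single recording simply keeps that recording's vector as the average, which is one reason some crosses may be less representative than others.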
The next step was to take a collection of unlabeled audio files and apply the method to them. For this demonstration, we took 246 files, but this time we always extracted a 30-second segment from the middle of each file. For the visualization, we use the same crosses as in the plot above, but the dots represent the still unidentified files.
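Taking a 30-second segment from the middle of a file might look like the sketch below. The function name `middle_segment` and the handling of recordings shorter than 30 seconds (returned unchanged) are assumptions; the text does not say how such files were treated.

```python
def middle_segment(samples, sample_rate, segment_seconds=30.0):
    """Return the centered slice of `samples` lasting `segment_seconds`.

    Assumption: a recording shorter than the requested segment is
    returned unchanged.
    """
    segment_len = int(round(segment_seconds * sample_rate))
    if len(samples) <= segment_len:
        return samples
    start = (len(samples) - segment_len) // 2
    return samples[start:start + segment_len]


# Toy example: a 100-second "recording" at 10 samples per second.
sample_rate = 10
samples = list(range(100 * sample_rate))
segment = middle_segment(samples, sample_rate)
```

Here the 1000-sample input yields the 300 samples in the range 350 to 649, leaving equal amounts of audio on both sides.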
We can use this information to try to assign each file to a speaker, but there are a few observations we can make first. Not all speakers are present in this collection of files (some crosses don't have any dots close to them). There are also a few files (in the center, around coordinates [7,-7]) that don't match any of the speakers in our enrollment database (these dots have no crosses close to them). To make the matching possible, we compute a score between each speaker and each file (32 × 246 = 7872 scores in total) and, for each file, choose the speaker with the highest score. However, if the highest score is too low (usually below 0), we can assume that the file doesn't match any of the speakers in our enrollment database.
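The matching step can be sketched as below. The text does not name the scoring function, so this example uses cosine similarity purely as a stand-in; the names `cosine_similarity` and `assign_speakers`, the toy 2-dimensional vectors, and the default threshold of 0 are assumptions for illustration. The sketch also assumes that no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors (stand-in score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_speakers(speaker_vectors, file_vectors, threshold=0.0):
    """Map each file to its best-scoring speaker, or None if below threshold."""
    assignments = {}
    for file_name, file_vec in file_vectors.items():
        # One score per (speaker, file) pair.
        scores = {
            name: cosine_similarity(vec, file_vec)
            for name, vec in speaker_vectors.items()
        }
        best = max(scores, key=scores.get)
        # A best score below the threshold means "not in the enrollment set".
        assignments[file_name] = best if scores[best] >= threshold else None
    return assignments


# Hypothetical toy data with two enrolled speakers and three unlabeled files.
enrolled = {"anna": [1.0, 0.0], "ben": [0.0, 1.0]}
unlabeled = {
    "file_1": [0.9, 0.1],
    "file_2": [0.2, 0.8],
    "file_3": [-1.0, -1.0],
}
assignments = assign_speakers(enrolled, unlabeled)
```

In this toy run, `file_3` scores negatively against both enrolled speakers and is therefore left unassigned, mirroring the files around [7,-7] that have no cross nearby.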