Topic: Speaker identification | Group: 051-20 | Completed by
Page | 4/7 | Date | 08.01.2024 | Size | 0.52 MB | #132475
How Does It Work?
Let’s explain the working logic of the module in simple terms. First, the unstructured audio is processed and transformed into structured data. During this structuring step, artifacts that would degrade the model, such as background noise and other acoustic pollution, are cleaned out. The system then detects whether speech is present at all. At this stage it also checks whether there is more than one speaker, or whether the speaker has changed. If multiple speakers overlap, their voices are separated. A clustering algorithm, a classic machine learning technique, is used for this. The conversations re-segmented by these operations are then classified and tagged with their owners.
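The clustering step above can be sketched in a few lines. This is a minimal illustration, not the module's actual algorithm: it assumes each speech segment has already been reduced to a fixed-size speaker embedding (real systems would use a neural encoder for this), and it greedily groups segments whose embeddings point in the same direction.

```python
import numpy as np

def cluster_segments(embeddings, threshold=0.8):
    """Greedy online clustering of per-segment speaker embeddings.

    Each segment joins the first existing cluster whose centroid has
    cosine similarity above `threshold`; otherwise it starts a new
    cluster. Returns one anonymous label (0, 1, 2, ...) per segment.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb)
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = float(emb @ (c / np.linalg.norm(c)))
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(emb.copy())
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched cluster.
            centroids[best] = (centroids[best] * counts[best] + emb) / (counts[best] + 1)
            counts[best] += 1
            labels.append(best)
    return labels

# Toy embeddings: two distinct voices, slightly perturbed.
rng = np.random.default_rng(0)
voice_a = np.array([1.0, 0.0, 0.0])
voice_b = np.array([0.0, 1.0, 0.0])
segments = [voice_a + rng.normal(0, 0.05, 3),
            voice_b + rng.normal(0, 0.05, 3),
            voice_a + rng.normal(0, 0.05, 3)]
print(cluster_segments(segments))  # [0, 1, 0]
```

The labels are anonymous on purpose: at this stage the module only needs to know that segments 1 and 3 share a voice, not who that voice belongs to.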
The part described so far forms one side of the module. On the other side, a deep learning algorithm recognizes whom the structured audio belongs to. If the identified speaker has previously left any enrollment information, it is matched against the new audio to verify whose voice it is. Moreover, all these stages take place in real time, which expands the usage of this module across various industries.
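The matching step can be hedged into a small sketch. The names and the cosine-similarity scoring here are illustrative assumptions, not the module's actual API: each enrolled speaker is represented by the average of their enrollment embeddings, and a new recording is matched to the closest reference, or rejected as unknown.

```python
import numpy as np

def enroll(samples):
    """Average a speaker's enrollment embeddings into one reference
    vector (hypothetical fixed-size embeddings assumed)."""
    ref = np.mean(np.asarray(samples, dtype=float), axis=0)
    return ref / np.linalg.norm(ref)

def match_enrolled(embedding, enrolled, threshold=0.7):
    """Return the best-matching enrolled speaker, or None when no
    reference is similar enough (an 'unknown speaker' decision)."""
    emb = np.asarray(embedding, dtype=float)
    emb = emb / np.linalg.norm(emb)
    scores = {name: float(emb @ ref) for name, ref in enrolled.items()}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name if score >= threshold else None

enrolled = {
    "alice": enroll([[0.9, 0.1], [1.0, 0.0]]),
    "bob":   enroll([[0.1, 0.9], [0.0, 1.0]]),
}
print(match_enrolled([0.95, 0.05], enrolled))   # alice
print(match_enrolled([-1.0, -1.0], enrolled))   # None
```

The rejection threshold is what makes the system usable in the open world: a voice that matches nobody's enrollment data should not be forced onto the nearest known identity.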
Conclusion
The speaker identification module shows that big data, transformed into a truly meaningful form, can be used in many different fields of industry. A treasure lies hidden in unstructured sound data. Built to reveal this treasure, the module is designed to push the limits of the imagination.
28 APRIL 2020 ~ DANIJEL KORZINEK
Speaker identification is a process of determining the person who spoke a particular piece of recorded speech. In some cases, it may be a single long recording, but in other situations, we can have people exchanging roles frequently within a single recording session – in that case we first segment the speech into portions where hopefully only one person speaks at a time and apply the aforementioned procedure to each segment separately.
Speaker identification is often referred to as speaker verification. Although the two are usually solved in a very similar manner (using the same tools and models), the terms aren’t synonymous. Verification means confirming the claimed identity of a particular person given a recording, whereas identification tries to match a set of recordings to a set of identities.
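The distinction can be made concrete with a sketch. Both functions below share the same scoring backend, which mirrors the observation that the two tasks use the same tools; the function names and the cosine scoring are illustrative assumptions, not a standard API. Verification answers yes/no about one claimed identity; identification picks the closest identity from a known set.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(embedding, claimed_ref, threshold=0.7):
    """Verification: accept or reject a *claimed* identity."""
    return cosine(embedding, claimed_ref) >= threshold

def identify(embedding, references):
    """Identification: choose the best match among known identities."""
    return max(references, key=lambda name: cosine(embedding, references[name]))

refs = {"speaker_1": np.array([1.0, 0.0]),
        "speaker_2": np.array([0.0, 1.0])}
probe = np.array([0.9, 0.1])

print(verify(probe, refs["speaker_2"]))  # False: the claim is rejected
print(identify(probe, refs))             # speaker_1
```

Note that identification as written always returns some label; combining it with a verification-style threshold is what turns it into an open-set recognizer.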
Another very similar process is known as speaker diarization. This, however, is a completely different problem. The key difference between identification and diarization is that the former assumes the existence of a collection of voice samples for each person (known as the enrollment data) whereas the latter has no information about the actual identity of the people it is trying to recognize. Diarization simply assigns random or sequential anonymous labels to each of the determined voices, while also making sure that the same voice gets the same label, regardless of where it occurs in the audio. Technically, whereas speaker identification can be solved using regular classification methods, speaker diarization is phrased as a problem requiring clustering due to its unsupervised nature.
In our project, we wanted to assign speaker labels to recordings of interpretations of parliamentary speeches. These recordings usually contain the voice of a single person, and if there are interjections by other speakers (usually at the beginning or the end of each file), we analyze only the voice of the person who speaks for the majority of the time. Furthermore, we were able to collect a small database of samples of each person we wished to identify, i.e. the interpreters, which was the main reason for attempting to use machine learning to solve the problem. The identity of the actual parliament speakers (i.e. politicians) is usually known and in most cases announced at the start of each recording. The identity of the interpreters is not included in the recordings of the European Parliament sessions available online, but it is very relevant for our study of the interpreting process.
One final important note is that this problem is generally regarded as completely language-independent. That is good news when working on a problem concerning speech in many languages, like that in our project. It also means that we can use models trained on large amounts of high-quality data, regardless of the language we have to analyze.