In one of our monthly innodays, where we try out new technologies and different approaches to old problems, we had the idea to collaborate with another company. Slowsoft is a provider of text to speech (TTS) solutions. To my knowledge they are the only ones who are able to generate Swiss German speech synthesis in various Swiss accents. We thought it would be a cool idea to combine it with our existing automatic speech recognition (ASR) expertise and build a cooking assistant that you can operate completely hands-free. So no more touching your phone with your dirty fingers only to check again how many eggs you need for that cake. We decided that it would be great to go with some recipes from a famous Swiss cookbook provider.

Generally there are quite a few text to speech solutions out there on the market. In this first of two blog posts I would like to give you a short overview of the available options. In the second blog post I will then describe which insights we arrived at in the UX workshop and how we combined wit.ai with the solution from Slowsoft in a quick and dirty web-app prototype built on socket.io and Flask. But first let us get an overview of existing text to speech (TTS) solutions.

To showcase the performance of existing SaaS solutions I've chosen a random recipe from Betty Bossi and had it read by them: Ofen auf 220 Grad vorheizen. 1 1/2 cm dicke Scheiben schneiden, auf einem mit Backpapier belegten Blech verteilen. in der Mitte des Ofens. Essig, Öl und Dattelsirup verrühren, Schnittlauch grob schneiden, beigeben, Vinaigrette würzen. Broccoli aus dem Ofen nehmen. Einige Chips mit den Edamame auf dem Broccoli verteilen. (Roughly, in English: Preheat the oven to 220 degrees. Cut into slices 1 1/2 cm thick, spread on a baking sheet lined with baking paper. In the middle of the oven. Mix vinegar, oil and date syrup, coarsely chop the chives, add them, season the vinaigrette. Take the broccoli out of the oven. Spread some chips with the edamame over the broccoli.)

The classical way works like this: You have to record at least dozens of hours of raw speaker material in a professional studio. Depending on your use case, the material can range from navigation instructions to jokes. The next trick is called "unit selection", where recorded speech is sliced into a high number (10k - 500k) of elementary components called phones, so that they can be recombined into new words that the speaker has never recorded. The recombination of these components is not an easy task, because their characteristics depend on the neighboring phonemes and on the accentuation, or prosody. The problem is to find the right combination of units that satisfies the input text and the accentuation and that can be joined together without generating glitches. The raw input text is first translated into a phonetic transcription, which then serves as the input for selecting the right units from the database; these are then concatenated into a waveform. Below is a great example from Apple's Siri engineering team showing how the slicing takes place. Using the Viterbi algorithm, the units are then concatenated in such a way that they create the lowest "cost", with the cost resulting from selecting the right unit and from concatenating two units together. Below is a great conceptual graphic from Apple's engineering blog showing this cost estimation.

Now, in contrast to the classical way of TTS, new methods based on deep learning have emerged. Here deep learning networks are used to predict the unit selection. If you are interested in how the new systems work in detail, I highly recommend the engineering blog entry describing how Apple created the Siri voice. As a final note I'd like to add that there is also a format called Speech Synthesis Markup Language (SSML), which allows users to manually specify the prosody for TTS systems. This can be used, for example, to put an emphasis on certain words, which is quite handy.

So enough with the boring theory, let's have a look at the available solutions. When thinking about SaaS solutions, the first thing that comes to mind these days is obviously Google's TTS solution, which they used to showcase Google's virtual assistant capabilities at this year's Google I/O conference. Have a look at it if you haven't been wowed today yet. When you go to their website, I highly encourage you to try out their demo with a German text of your choice. It really works well - the only downside for us was that it's not really Swiss German. I doubt that they will offer it for such a small user group - but who knows. I've taken a recipe and had it read by Google and frankly liked the output.

Microsoft also offers TTS as part of their Azure Cognitive Services (ASR, intent detection, TTS). Similar to Google, having ASR and TTS from one provider definitely has the benefit of saving us one round trip compared with using separate services. Having ASR and TTS in one place reduces it to: sending our text to be transformed to speech (TTS) from client to server, and getting the response back to the client (dispatching the message on the client). Judging the speech synthesis quality, I personally think that Microsoft's solution didn't sound as great as Google's synthesis.
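
For readers who want to see the unit-selection idea from the theory part in code, here is a minimal sketch of a Viterbi-style search over candidate units. It is not how Apple or any of the providers mentioned here implement it. The function `select_units`, the cost functions and the toy data (units as name and pitch pairs) are all made up for illustration: the target cost is set to zero and the join cost is simply the pitch difference between two neighboring units.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Unit = TypeVar("Unit")


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[float, List[Unit]]:
    """Return (total cost, cheapest unit sequence) using a Viterbi search."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # best[j]: cheapest cost of any path that ends in candidate j of the current position
    best = [target_cost(0, u) for u in candidates[0]]
    # back[t - 1][j]: index of the best predecessor for candidate j at position t
    back: List[List[int]] = []

    for t in range(1, len(candidates)):
        new_best: List[float] = []
        pointers: List[int] = []
        for unit in candidates[t]:
            # Cost of reaching this unit from each candidate of the previous position
            costs = [
                best[i] + join_cost(prev, unit)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, unit))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position to the first
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[t][j] for t, j in enumerate(path)]


# Toy example (hypothetical data): each unit is (name, pitch in Hz).
toy_candidates = [
    [("h1", 100), ("h2", 120)],
    [("e1", 118), ("e2", 90)],
    [("y1", 119), ("y2", 60)],
]
total, chosen = select_units(
    toy_candidates,
    target_cost=lambda position, unit: 0.0,       # assume every unit fits equally well
    join_cost=lambda a, b: abs(a[1] - b[1]),      # penalize pitch jumps at the joins
)
print(total, [name for name, _ in chosen])
```

In this toy setup the search prefers the units whose pitches are closest to each other, which is the "no glitches at the joins" intuition. A real system would use far richer target and join costs.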