Tag Archives: Deep Learning


Verbal communication with machines has been one of the main goals of computer science since its early days. In 1968, Arthur C. Clarke caught the imagination of half the planet with his novel 2001: A Space Odyssey, which was made into an enormously successful film by Stanley Kubrick in the same year as the book was launched. Computer science was still in its infancy; the first microprocessor had not even been developed. Still, the idea of artificial intelligence such as that of HAL 9000 had already seduced a generation, even though personal computers were still a thing of the future.


From left to right: Aitor González Aguirre, Dr. LLuís Padró and Horacio Rodríguez Hontoria.

As often happens in these situations, in mid-2016 we are still far from replicating the imaginary communication capabilities of HAL 9000 in 2001. It may appear to be extremely easy to understand human language – people do it every day – but it is a very difficult task for machines. Language is full of factors that make comprehension complicated: polysemy, irony, sarcasm and double meanings. There are many ways of saying the same thing. As if this was not enough, comprehension depends on our knowledge of the world, which we use to reason and understand each other, in a process that we carry out almost without realising. The following two statements are a clear example:

  • Yesterday I saw a plane flying over New York.
  • Yesterday I saw a train flying over New York.

Although the statements are exactly the same apart from one word, they have very different meanings. It is easy for us to imagine a plane flying over New York, but when we read the second statement we immediately realise that a train cannot fly. Consequently, we know that the meaning of this sentence is different. For example, it may be that a plane is flying over New York and the person inside the plane (because people cannot fly either) sees a train from the window. It is very difficult for computers to make these kinds of inferences, but relatively easy for people.

Semantic textual similarity (STS) is a task that was introduced originally in SemEval 2012 to tackle one of the aspects of artificial intelligence that will enable machines to communicate naturally with people: evaluation of meaning. Knowledge of whether two original sentences have the same meaning or not is vital to good communication. However, meaning is not black and white: there is an entire spectrum of greys. The aim of STS is to automatically assess the similarity of two sentences on a scale from 0 to 5. Each band represents, in an easily understandable way, the differences that make two sentences equivalent or not.

There are many techniques and resources for tackling this task, such as WordNet, Wikipedia and ontologies such as SUMO, which address many of the aforementioned difficulties, including polysemy, the identification of entities (including people and companies) and reasoning. However, these resources are created manually, which is a very expensive task in terms of the time required and the money. Recently, advances made via Deep Learning have led to new useful resources that have been generated automatically to improve STS systems. One of these is Word Embedding: a system that can map the semantic characteristics of words in a vector. Words with similar meanings are mapped closer to each other, and those that are different are further away. These new systems can analyse the context in which words appear in a very wide range of texts. Assuming that similar words appear in similar contexts, the system puts each of the words into an N-dimensional space, and maintains the premise that similar words should remain close to each other. When the vectors have been generated, we can calculate the similarity of the words by calculating the cosine of the vectors.

The same vector that converts man into King, converts woman into Queen (figure 1).

The same vector that converts man into King, converts woman into Queen (figure 1).

However, the most interesting characteristic of these vectors is that the meaning is preserved, so we can operate on them using additions and subtractions. For example, the result of the operation King-Man+Woman would be the same as Queen, or put another way, if the vector that codes the meaning of Man is taken away from the vector that means King and the vector for Woman is added, the result is the vector that means Queen (see Figure 1).

These measures of equivalence of meaning are useful for many tools, for example, in the well-known Siri, the personal assistant for iOS, or to assess voice commands for automated systems for dwellings, so that the house can deduce that the phrase “I need more light” could mean “Put up the blinds” or “Turn on the lights”, depending on how much light there is outside at the time. Other potential applications are assistance for the elderly, as in general people in this age range find it harder to learn and use voice commands, and assistance for teaching, where an STS system can assess whether a student’s response means the same as the correct response assigned by the teacher, which makes the task of teaching easier.

Another very popular task is Question Answering (QA), an area in which our group is working actively. QA can be defined as a task in which the user asks a question in natural language and the system must give an answer (instead of different documents, which is what Google would do). The first QA systems tackled questions of a factual nature: Where was Obama born?, When did the Second World War end? Who discovered penicillin? Then, more complex questions were tackled, in which the response was not found in just one document, but needed to be built from partial components, for example: Who were the first three chancellors of the Federal Republic of Germany?; questions related to definitions: Who was Picasso?, What is paracetamol used for?; and questions on opinions, such as arguments for and against arms control in the USA. In parallel, restricted domain systems have been developed, as well as systems that seek a response not only in textual documents but also in structured sources, such as Linked Open Data repositories (FreeBase, DBPedia, BioPortal and others). In all of these systems, distance measures between the question and potential responses are used widely.

One type of QA that has become fairly well-known is Community QA (CQA). In CQA, an initial question is asked by a member of the community. This activates a complex structure of interventions of members of the community who reply, give opinions, reconsider and refine, among other actions. This leads to the use of different measures of similarity between questions and other questions, between questions and answers, and between different answers, etc. Obviously, the measures are different for different sub-tasks. Recently, our group participated in a CQA competition as part of SemEval 2016 in which there was the added difficulty that texts in their original language, Arabic, were compared with their translation in English.

Both STS and QA are partial approximations to Natural Language Understanding (NLU), which consists of a programme that reads a text and, on this basis, constructs a conceptual representation of its meaning. If two texts have the same meaning, or similar meanings, they will have similar conceptual representations. However, this representation could be used in other applications (such as translation and creating summaries). In the future, research will be carried out in this direction, which will combine Deep Learning and major semantic online resources for the automatic comprehension of language. Perhaps HAL is not so far away after all.

Dr Lluís Padró, Dr Horacio Rodríguez Hontoria, researchers of the Center for Language and Speech Technologies and Applications (TALP UPC), and Aitor González from IXA research group


Deep Learning revolutionises speech technologies

One of the objectives of language technology experts is speaking and interacting with machines in any language. This is nothing new; but this type of technology is increasingly common at user level.

The new generation of speech recognition and natural language processing systems has already begun to filter down to users, through improvements in personal assistants (such as Apple Siri, Google Now and Microsoft Cortana) and new products such as Skype’s voice translator or Amazon Echo’s smart speaker.

grafico 1

Fig. 1: General diagram of a neural network

Deep Learning is no more than the evolution of traditional neural networks (algorithms designed to reproduce the mechanics of the human brain in the encoding and decoding of messages, enabling self-learning). However, their widespread use seems to be reinventing research and development in various fields, such as image processing, language processing, and speech technologies. Neural networks are learning systems that use relatively simple mathematical functions, called neurons, which work in an interconnected way to create various layers (see Fig. 1). The layers enable different levels of abstraction and, to a certain extent, a form of learning that is more similar to that of humans. This contrasts with previous techniques, such as statistics, in which the system learns from the data, as in Deep Learning, but without considering abstractions.

Neural networks have existed since the 1950s, but now they really work

A frequently asked question is why artificial neural networks (a mathematical technique that goes back to the 1950s) are so popular today. In fact, in 1969, Minsky & Papert clearly defined their limitations, including the fact that single-layer neural networks cannot compute the XOR operation for two inputs (the result is true if one of the two inputs is true, but not both, e.g. Juan is tall or short); and the fact that computational capacity at the time was not enough to process multilayer neural networks. The problem of training for the XOR operation was solved with the backpropagation algorithm (Rumlelhart et al., 1986), and computational capacity has increased dramatically with the use of Graphics Processing Units (GPUs) that enable hundreds of operations to be performed at the same time. An important milestone was reached in 2006: the year in which the concept of Deep Learning was introduced with the development of an effective way of training very deep neural networks (networks of several layers) (Hinton et al., 2006). Currently, neural algorithms are advancing at such a pace that the specific architectures that are effective today could be replaced by others tomorrow. So we are talking about a field that is in constant evolution.

The use of deep learning in speech applications

In a natural way, neural networks have become a tool for automatic learning. Multiple-tier architectures with a range of network types have been used satisfactorily for classification and prediction techniques. Recently, automatic learning has crossed boundaries and successfully entered other areas. As a result, voice recognition and automatic translation can be approached as automatic learning problems that can be solved via specific neural architectures. Neural architectures have been effective in the fields of speech technologies, voice recognition, automatic translation and voice synthesis.

Voice recognition and its jump to quality

In recent years, Deep Learning algorithms have been the key to obtaining significant advances in the features of automatic speech recognition systems. Neural networks have proved to be a versatile tool that can model all the acoustic, phonetic and linguistic aspects of this task. Complex, traditional systems based on many specific components have already been replaced by general structures that are highly versatile and have better features. Every year, new architectures appear that are based on recurrent neural networks (Karparthy, 2014) and offer significant improvements in the hit rate.

The new systems may not increase knowledge of the complexity of the problem, but they have helped to resolve it. Now we can build machines that have a surprising capacity to learn models, through the use of examples, as complex as those involved in speech recognition.

Text translation in one step


Fig. 2: Autoencoder

Text translation via Deep Learning relies on an autoencoder structure (Cho, 2015) to translate from a source to a target language (see Fig. 2). The encoder learns a representation (M) of the input data (source language, S) and then decodes it to output data (target language, T). The autoencoder is trained using translated texts. Source words are mapped to a small space (see Fig. 3). This operation reduces the vocabulary and takes advantages of synergies between similar words (in morphological, syntactical and/or semantic terms). The new representation of words is encoded in a summary vector (a representation of the source language that we have to decode into the target language), using a recurrent neural network. This kind of network has the advantage of helping us to find the most accurate words, according to the context. Decoding is carried out using the reverse steps to the encoder.

Recently, this architecture has been improved by an attention mechanism, which uses the context of the word that is being translated instead of using the sentence as such.


Fig. 2: Example of mapping of words

Voice synthesis with natural intonation in broad contexts

In voice synthesis, written phrases are transformed into one of the possible correct ways of reading them. The most mature technology to date concatenated pre-recorded segments. However, in the last decade, advances have been made in statistical synthesis, which is the traditional technology. In this technique, the voice is modelled through parameters that define oral discourse, for example, pauses and intonation, learnt statistically. Given a phrase, appropriate models are selected for the phonemes and a generation algorithm produces what is finally transformed into voice. One of the first applications of Deep Learning was to broaden the definition of the contexts. In statistical synthesis, a precise phoneme is defined according to the context. In contrast, in Deep Learning, the appropriate parameters can be found without an explicit definition of contexts (Zen et al., 2013).

Recently, recurrent neural networks have been used that model time sequences: the learning system itself learns the continuity of speech. Specifically, recurrent neural networks called long short-term memory (LSTM) facilitate learning of elements such as the intonation of a question or the reading of relative clauses, which occur in broad contexts of reading.

Neither traditional statistical systems nor current systems can read in a way that shows comprehension of the text. In both systems, learning is based on three elements: algorithms, data and calculation capacity. The major advances in Big Data processing and in the capacity to carry out complex calculations and multiple operations simultaneously open up new perspectives for Deep Learning. If the observable data and calculation capacity are increased, would this paradigm be able to model the intricate mental processes we use to read?

What remains to be seen

For the sceptics who saw neural networks, the precursors to Deep Learning, fail in the 1980s, we should explain that advances in the new neural techniques can be seen in the quality of speech systems, which are revolutionary (LeCun et al., 2015). For example, in voice recognition, the error rate has been brought down by around 10% (see TIMIT). The fact that the functions that characterise a problem do not need to be designed, as they are learnt automatically, has enabled challenges to be extended in the field of speech technologies to end-to-end voice recognition (learning one function that enables the direct transfer of voice to text) (Hannun et al., 2014), multilingual translation (Firat et al., 2016) and multimodal translation (image and text) (Elliot et al., 2015).

From left to right: José A. R. Fonollosa, Marta R. Costa-jussà and Antonio Bonafonte.

At the Center for Language and Speech Technologies and Applications (TALP UPC) we are leaders in the development of research on the granularity of characters (Costa-jussà & Fonollosa, 2016). This technique enables learning of text translation by detecting subsequences of words with meaning, such as prefixes and suffixes, and reducing the size of the vocabulary. This leads to a high level of generalisation, particularly for morphologically flexible languages, which has a positive impact on the quality of the translation.

We may reach certain limits in neural networks again, but these architectures have already opened up new horizons for research and for a wide range of applications. These include the “simple” design of a machine that can win the popular Japanese game GO, and the major impact that neural networks could have in growing sectors such as artificial intelligence (AI), the internet of things (IoT), social computing, or image recognition, in addition to those mentioned above. Neural networks are beginning to be implemented in areas such as the aerospace industry, education, finances, defence, e-commerce and health (diagnostics and drug prescription). Some technological innovations are being tackled that would not have been feasible or manageable with traditional techniques such as statistics. So it seems reasonable to consider that neural systems are here to stay for a while.

Will you join in?

Marta R. Costa-jussà, José A. R. Fonollosa y Antonio Bonafonte
Investigadors del Centre de Tecnologies i Aplicacions del Llenguatge i de la Parla (TALP UPC)