Paraphrasing project

or how we can help machines improve their linguistic capabilities

The problem

Humans easily understand that all of the sentences in the previous animation have the same meaning. For machines, however, this is a complicated task and, in fact, one of the biggest current limitations in human-machine communication. Paraphrasing is the linguistic mechanism we use to express the same message with different structures and words.

Paraphrasing introduces a great deal of variability into our language. Managing this variability is so complex that it takes a toll on our experience with numerous applications in our day-to-day life. For example, it makes us rewrite our queries over and over in the text field of search engines, and narrow down our language when we use voice assistants. It also affects many fields of work, such as machine translation, interaction with chatbots and document classification.

A great deal of effort is invested in controlling this limitation and in making systems respond to any given input, that is, in making them resilient to language variability. This is done through dictionaries, rules, statistical techniques or neural learning. These approaches have to deal with several issues, including context-specific ones.

Instead of making a system that is resistant to language flexibility, our approach focuses on learning from this variability. We search for efficient mechanisms to generate said flexibility, ensuring we can recognise and then use it in specific applications.

Acquisition and text segmentation

Paraphrase generation with diverse techniques

Classification and validation of paraphrases

Our approach

Variability in contrast with the rigidity of previous solutions
From a general approach to more specific solutions
Multilingual — English, Spanish and Portuguese
Big Data — high quality data, diverse and numerous
A heterogeneous solution in three phases: artificial intelligence, statistical techniques and language modelling

Technological innovation in three phases

2019

Nov to Dec: Management and dissemination

2020

Jan to Dec: Management and dissemination
Jan to Dec: Data collection

2021

Jan to Dec: Management and dissemination
Jan to Jun: Generation
Apr to Dec: Classification

Phase 1

Acquisition and text segmentation

Detection of phrases likely to be paraphrased
Gathering and cleaning of textual resources
Implementation of a method for segmenting text into meaningful units
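
As a rough illustration of the cleaning and segmentation steps listed above, here is a minimal sketch in Python. It assumes that a "meaningful unit" can be approximated by a sentence and that sentences end with ".", "!" or "?" followed by whitespace. The function names `clean` and `segment` are hypothetical and are not taken from the project itself; a real pipeline would likely need language-specific rules for abbreviations, numbers and clause boundaries.

```python
import re

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def clean(text: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> list[str]:
    """Split cleaned text into sentence-like units (hypothetical helper)."""
    cleaned = clean(text)
    if not cleaned:
        return []
    return [unit for unit in _SENTENCE_END.split(cleaned) if unit]


if __name__ == "__main__":
    raw = "Where is the station?  How do I get\nto the station? Thanks."
    for unit in segment(raw):
        print(unit)
```

Each unit returned by `segment` could then be passed on as a candidate for paraphrasing. Note that this simple rule would wrongly split after abbreviations such as "e.g.".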

Phase 2

Paraphrase generation with diverse techniques

Statistical and neural machine translation (advanced encoder-decoder models, Transformer, BERT)
Synonym dictionaries and lemmatizers
Context vectors, word, phrase and sentence embeddings
Language models
Further research into other applicable technologies
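
To give an idea of the simplest technique in the list above, the following Python sketch generates paraphrase candidates by substituting words from a synonym dictionary. The dictionary `SYNONYMS` is a toy example invented for illustration, and the function `paraphrases` is hypothetical, not the project's implementation. It assumes whitespace tokenisation and exact lowercase matching, with no lemmatization.

```python
from itertools import product

# Toy synonym dictionary, invented for this illustration only.
SYNONYMS = {
    "buy": ["purchase"],
    "cheap": ["inexpensive", "affordable"],
}


def paraphrases(sentence: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return every variant obtained by swapping words for listed synonyms."""
    tokens = sentence.split()
    # For each token, the options are the token itself plus its synonyms.
    options = [[tok] + synonyms.get(tok.lower(), []) for tok in tokens]
    candidates = [" ".join(choice) for choice in product(*options)]
    # The first combination keeps every original token, so it is dropped.
    return candidates[1:]


if __name__ == "__main__":
    for candidate in paraphrases("I want to buy cheap tickets"):
        print(candidate)
```

Substitution of this kind ignores context and agreement, so some candidates will be unnatural or change the meaning. That is why generated candidates still need to be classified and validated afterwards.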

Phase 3