Paraphrasing project

or how we can help machines improve their linguistic capabilities

The problem

Humans easily understand that all of the sentences in the animation above have the same meaning. For machines, however, this is a complicated task and, in fact, one of the biggest current limitations in human-machine communication. Paraphrasing is the linguistic mechanism we use to express the same message with different structures and words.

Paraphrasing introduces a great deal of variability into our language. Managing this variability is so complex that it takes a toll on our experience with many everyday applications. For example, it makes us rewrite our search engine queries over and over, and narrow down our language when we use voice assistants. It also affects many fields of work, such as machine translation, interaction with chatbots, and document classification.

A great deal of effort is invested in controlling this limitation and in making systems respond to any given input, that is, in making them resilient to language variability. This is done through dictionaries, rules, statistical techniques, or neural learning. These approaches have to deal with a number of issues, including context-specific ones.

Instead of making a system that is resistant to language flexibility, our approach focuses on learning from this variability. We search for efficient mechanisms to generate said flexibility, ensuring we can recognise and then use it in specific applications.

Acquisition and text segmentation
Paraphrase generation with diverse techniques
Classification and validation of paraphrases

Our approach

  • Variability in contrast with the rigidity of previous solutions
  • From a general approach to more specific solutions
  • Multilingual — English, Spanish and Portuguese
  • Big Data — high quality data, diverse and numerous
  • A heterogeneous solution in three phases: artificial intelligence, statistical techniques and language modelling

Technological innovation in three phases

2019

  • Nov to Dec: Management and dissemination

2020

  • Jan to Dec: Management and dissemination
  • Jan to Dec: Data collection

2021

  • Jan to Dec: Management and dissemination
  • Jan to Jun: Generation
  • Apr to Dec: Classification

Phase 1

Acquisition and text segmentation

  • Detection of phrases likely to be paraphrased
  • Gathering and cleaning of textual resources
  • Implementation of a method for segmenting text into meaningful units
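
As an illustration of the segmentation step, the sketch below splits raw text into sentence-like units with a regular expression. It is only a minimal example, not the project's actual method: the function name `segment_sentences` and the boundary rule are assumptions made here, and the rule does not handle abbreviations such as "Dr." or "etc.".

```python
import re
from typing import List

# Assumed boundary rule (hypothetical): a sentence ends at ".", "!" or "?"
# when followed by whitespace and then an uppercase letter, an opening
# Spanish mark, or a quotation mark. Abbreviations are not handled.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÑ¿¡\"'])")


def segment_sentences(text: str) -> List[str]:
    """Split raw text into sentence-like units, dropping empty pieces."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    normalized = " ".join(text.split())
    return [piece for piece in _SENTENCE_BOUNDARY.split(normalized) if piece]


if __name__ == "__main__":
    sample = "Hello there.  How are you? ¡Muy bien! Fine."
    for unit in segment_sentences(sample):
        print(unit)
```

A production segmenter would more likely rely on a trained or rule-rich tokenizer per language; the regular expression is used here only to keep the example self-contained.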

Phase 2

Paraphrase generation with diverse techniques

  • Statistical and neural machine translation (advanced encoder-decoder models, Transformer, BERT)
  • Synonym dictionaries and lemmatizers
  • Context vectors, word, phrase and sentence embeddings
  • Language models
  • Further research into other applicable technologies
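
Of the techniques listed above, synonym substitution is the simplest to sketch. The example below generates paraphrase candidates by replacing words with entries from a synonym dictionary. The dictionary `SYNONYMS` and the function `generate_paraphrases` are hypothetical and exist only for illustration; a real system would use a curated lexical resource, lemmatization, and context to filter out substitutions that change the meaning.

```python
from itertools import product
from typing import Dict, List

# Hypothetical toy synonym dictionary, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "buy": ["purchase"],
    "car": ["automobile", "vehicle"],
}


def generate_paraphrases(
    sentence: str, synonyms: Dict[str, List[str]] = SYNONYMS
) -> List[str]:
    """Return every variant obtained by swapping words for listed synonyms."""
    tokens = sentence.split()
    # For each token, the options are the token itself plus its synonyms.
    options = [[tok] + synonyms.get(tok.lower(), []) for tok in tokens]
    original = " ".join(tokens)
    candidates = (" ".join(choice) for choice in product(*options))
    # Exclude the unchanged sentence from the output.
    return [c for c in candidates if c != original]


if __name__ == "__main__":
    for candidate in generate_paraphrases("I want to buy a car"):
        print(candidate)
```

The number of candidates grows multiplicatively with the number of replaceable words, which is one reason a later classification and validation step is needed.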

Phase 3

Classification and validation of paraphrases

  • Automatic classification: neural networks, support vector machines, random forest, etc.
  • Human validation
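
The sketch below shows one way automatic classification and human validation could be combined: confident cases are labelled automatically and uncertain ones are sent to a person. It is not one of the classifiers named above. As a stand-in, it scores a sentence pair with Jaccard token overlap, which is a weak signal because good paraphrases often share few words. The function names and both thresholds are assumptions chosen for illustration.

```python
from typing import Set


def _tokens(sentence: str) -> Set[str]:
    """Lowercased whitespace tokens of a sentence, as a set."""
    return set(sentence.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


def triage(a: str, b: str, accept: float = 0.6, reject: float = 0.2) -> str:
    """Label a candidate pair, deferring uncertain cases to a human.

    The thresholds are illustrative assumptions, not tuned values.
    """
    score = jaccard(a, b)
    if score >= accept:
        return "paraphrase"
    if score <= reject:
        return "not_paraphrase"
    return "needs_human_validation"


if __name__ == "__main__":
    print(triage("the cat sat", "the cat slept"))
```

In a fuller pipeline, the overlap score would be replaced by the output of a trained model, while the routing of the uncertain band to human validators could stay the same.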

Project results

Download corpora
Monolingual corpora from diverse high-quality sources
  • Spanish: 10M sentences (8.6M from books, 1.4M from online sources), 18 variants
  • Portuguese: 10M sentences (9M from books, 1M from online sources), 4 variants
  • English: 10M sentences (5M from books, 5M from online sources), 15 variants