Paraphrasing project
or how we can help machines improve their linguistic capabilities
The problem
Humans easily understand that all of the sentences in the previous animation have the same meaning. For machines, however, this is a complicated task and, in fact, one of the biggest current limitations in human-machine communication. Paraphrasing is the linguistic mechanism we use to express the same message with different structures and words.
Paraphrasing introduces a great deal of variability into our language. Managing this variability is so complex that it degrades our experience with many everyday applications. For example, it makes us rewrite our queries over and over in the text field of search engines, and narrow down our language when we use voice assistants. It also affects many fields, such as machine translation, interaction with chatbots and document classification.
A great deal of effort is invested in overcoming this limitation, making systems resilient to language variability so that they can respond to any given input. This is attempted through dictionaries, rules, statistical techniques or neural learning. These approaches have to deal with several issues, including context-specific ones.
Instead of making a system that is resistant to language flexibility, our approach focuses on learning from this variability. We search for efficient mechanisms to generate said flexibility, ensuring we can recognise and then use it in specific applications.
Our approach
- Variability in contrast with the rigidity of previous solutions
- From a general approach to more specific solutions
- Multilingual: English, Spanish and Portuguese
- Big Data: high-quality, diverse and abundant data
- A heterogeneous solution in three phases: artificial intelligence, statistical techniques and language modelling
Technological innovation in three phases
2019
- Nov to Dec: Management and dissemination
2020
- Jan to Dec: Management and dissemination
- Jan to Dec: Data collection
2021
- Jan to Dec: Management and dissemination
- Jan to Jun: Generation
- Apr to Dec: Classification
Acquisition and text segmentation
- Detection of phrases likely to be paraphrased
- Gathering and cleaning of textual resources
- Implementation of a method for segmenting text into meaningful units
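As a rough illustration of the cleaning and segmentation steps listed above, the following minimal sketch normalises whitespace and splits text into sentence-like units. It assumes that a "meaningful unit" is a sentence and that sentences end in ".", "!" or "?"; the project's actual segmentation method is not described here, and the function names `clean_text` and `segment_sentences` are hypothetical.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def segment_sentences(text: str) -> list[str]:
    """Split cleaned text after '.', '!' or '?' when followed by whitespace.

    Assumption: sentence-final punctuation marks the unit boundary. This
    naive rule will mis-split abbreviations such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", clean_text(text))
    return [part for part in parts if part]


sample = "Where is the station?   I need to\ncatch a train. Thanks!"
for unit in segment_sentences(sample):
    print(unit)
```

A rule-based splitter like this is only a starting point; a production system would probably need language-specific handling for English, Spanish and Portuguese.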
Paraphrase generation with diverse techniques
- Statistical and neural machine translation (advanced encoder-decoder models, Transformer, BERT)
- Synonym dictionaries and lemmatizers
- Context vectors, word, phrase and sentence embeddings
- Language models
- Further research into other applicable technologies
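To make the simplest of the techniques above concrete, here is a minimal sketch of paraphrase generation by synonym substitution. It is not the project's implementation: the `SYNONYMS` dictionary is a toy, hypothetical resource, the function name `generate_paraphrases` is invented for illustration, and no lemmatization or context check is applied.

```python
from itertools import product

# Toy synonym dictionary (hypothetical entries, for illustration only).
SYNONYMS = {
    "buy": ["purchase"],
    "ticket": ["pass"],
}


def generate_paraphrases(sentence: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return every variant obtained by swapping words for listed synonyms."""
    tokens = sentence.split()
    # For each token, the alternatives are the token itself plus its synonyms.
    options = [[tok] + synonyms.get(tok.lower(), []) for tok in tokens]
    candidates = [" ".join(choice) for choice in product(*options)]
    # Drop the unchanged sentence so only genuine variants remain.
    return [cand for cand in candidates if cand != sentence]


for variant in generate_paraphrases("I want to buy a ticket", SYNONYMS):
    print(variant)
```

Because substitution ignores context, some candidates may not preserve the meaning (a "pass" is not always a "ticket"), which is one reason generated candidates need a later classification and validation step.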
Classification and validation of paraphrases
- Automatic classification: neural networks, support vector machines, random forests, etc.
- Human validation
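The interplay between automatic classification and human validation can be sketched as a triage step. The example below deliberately swaps in a much simpler technique than the classifiers listed above: a Jaccard token-overlap score with two thresholds. The thresholds (0.7 and 0.3), the labels and the function names are assumptions chosen for illustration, not values from the project.

```python
import re


def token_set(sentence: str) -> set[str]:
    """Lower-case the sentence and return its set of word tokens."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens the two sentences have in common."""
    ta, tb = token_set(a), token_set(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def triage(a: str, b: str, accept: float = 0.7, reject: float = 0.3) -> str:
    """Label a sentence pair, routing uncertain cases to a human validator.

    Assumption: scores between the two thresholds are too uncertain to
    decide automatically.
    """
    score = jaccard(a, b)
    if score >= accept:
        return "paraphrase"
    if score <= reject:
        return "not_paraphrase"
    return "human_review"


print(triage("I want to buy a ticket", "I want to purchase a ticket"))
print(triage("Book a flight to Madrid", "Book a hotel in Madrid"))
print(triage("I want to buy a ticket", "The weather is nice today"))
```

Lexical overlap misses paraphrases that share few words, so in practice a score like this would more plausibly serve as one input feature to a trained classifier than as the decision rule itself.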