Babylon’s Peer-Reviewed Research

Estimating Mutual Information Between Dense Word Embeddings

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Nils Hammerla

Some of the top approaches to semantic textual similarity rely on various correlations between word embeddings, including the famous cosine similarity. We show that mutual information between dense word embeddings, despite being difficult to estimate, is another excellent candidate for semantic similarity and rivals existing state-of-the-art unsupervised methods.
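
As an informal illustration of this correlation view (not the estimator used in the paper), the sketch below treats two d-dimensional word vectors as d paired observations and scores them with cosine similarity, Pearson correlation, and a crude histogram-based mutual information estimate; the 300-dimensional random vectors and the binning are assumptions for demonstration only.

```python
# Illustrative sketch (not the paper's estimator): treat two d-dimensional word
# vectors as d paired observations and compare cosine similarity, Pearson
# correlation, and a crude histogram-based mutual information estimate.
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def mutual_information(x, y, bins=16):
    # Histogram-based MI in nats; robust estimators (e.g. nearest-neighbour
    # based) are needed in practice, which is part of why MI is hard to estimate.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Hypothetical 300-d embeddings for two related words.
rng = np.random.default_rng(0)
v_doctor = rng.normal(size=300)
v_nurse = 0.7 * v_doctor + 0.3 * rng.normal(size=300)
print(cosine(v_doctor, v_nurse), pearson(v_doctor, v_nurse),
      mutual_information(v_doctor, v_nurse))
```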

Published in 2020 | ACL


Hybrid Reasoning Over Large Knowledge Bases Using On-The-Fly Knowledge Extraction

Giorgos Stoilos, Damir Juric, Szymon Wartak, Claudia Schulz, Mohammad Khodadadi

The success of logic-based methods for comparing entities heavily depends on the axioms that have been described for them in the Knowledge Base (KB). Due to the incompleteness of even large and well engineered KBs, such methods suffer from low recall when applied in real-world use cases. To address this, we designed a reasoning framework that combines logic-based subsumption with statistical methods for on-the-fly knowledge extraction.

Published in 2020 | European Semantic Web Conference


Can Embeddings Adequately Represent Medical Terminology? New Large-Scale Medical Term Similarity Datasets

Claudia Schulz, Damir Juric

We create various large-scale datasets for testing whether embeddings correctly encode the similarity between medical terms and test existing state-of-the-art embeddings on these datasets. Our results reveal that existing embeddings cannot adequately represent medical terminology. Our new datasets are thus challenging new benchmarks for testing the adequacy of new medical embeddings in the future.

Published in 2020 | AAAI 2020


An Ontology-Based Interactive System for Understanding User Queries

Giorgos Stoilos, Szymon Wartak, Damir Juric, Jonathan Moore, Mohammad Khodadadi

We present a framework for automatically building a short dialogue that bridges the gap between user queries and a set of pre-defined (target) ontology concepts. We show how the ontology and statistical techniques can be used to select an initial small set of candidate concepts from the targets, and how these can then be grouped into categories using their properties in the ontology.

Published in 2019 | European Semantic Web Conference


Multiverse: Causal Reasoning using Importance Sampling in Probabilistic Programming

Yura Perov, Logan Graham, Kostis Gourgoulias, Jonathan G. Richens, Ciarán M. Lee, Adam Baker, Saurabh Johri

We describe the design and analysis of a probabilistic programming engine for counterfactual probabilistic inference, both in general and, in particular, using importance sampling.
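
For readers unfamiliar with counterfactual inference by importance sampling, the toy sketch below walks through the standard abduction, action, prediction recipe on a hand-written structural causal model; the model, the evidence, and the soft-evidence likelihood trick are illustrative assumptions, not the Multiverse engine itself.

```python
# Toy sketch of counterfactual inference by importance sampling (abduction,
# action, prediction) in a hand-written structural causal model. The model,
# evidence, and soft-likelihood trick are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Exogenous noise priors.
u_z = rng.normal(0.0, 1.0, N)
u_x = rng.normal(0.0, 1.0, N)
u_y = rng.normal(0.0, 0.1, N)

# Structural equations of the (assumed) SCM.
z = u_z
x = 0.8 * z + u_x
y = 1.5 * x - z + u_y

# Abduction: weight noise samples by how well they reproduce the evidence
# (evidence treated as noisy observations so the weights stay tractable).
x_obs, y_obs = 1.0, 0.5
w = gaussian_pdf(x_obs, x, 0.1) * gaussian_pdf(y_obs, y, 0.1)
w /= w.sum()

# Action: intervene do(X = -1.0), keeping the abducted noise.
x_cf = np.full(N, -1.0)
y_cf = 1.5 * x_cf - z + u_y

# Prediction: counterfactual expectation under the importance weights.
print("E[Y | do(X=-1), evidence] ~", float(np.sum(w * y_cf)))
```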

Published in 2019 | AABI 2019


Masking schemes for universal marginalisers

Kostis Gourgoulias, Maria Lomeli, Daniel Thompson, Divya Gautam

In this paper we study generative models that mimic reasoning under partially observed evidence and make decisions about which diseases are most likely for a patient. Specifically, we explore different generative models in terms of learning efficiency.
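
The abstract is terse, so here is a small, assumption-laden sketch of the kind of evidence masking typically used when training a universal marginaliser: take a complete sample of the model, hide a random subset of nodes, and use the pair (masked input, full sample) as a training example. The masking distribution and sentinel encoding below are illustrative choices, not the schemes evaluated in the paper.

```python
# Hedged sketch of an evidence-masking step for training a universal
# marginaliser: hide a random subset of nodes in a complete binary sample.
import numpy as np

rng = np.random.default_rng(0)

def mask_sample(sample, p_observed=0.5):
    """Return a masked copy of a binary sample plus the observation mask.

    Unobserved nodes are encoded with a sentinel value (here 0.5) so a
    network can tell "absent" apart from "observed false".
    """
    observed = rng.random(sample.shape) < p_observed
    masked = np.where(observed, sample.astype(float), 0.5)
    return masked, observed

# Hypothetical complete sample over 8 binary nodes (e.g. diseases/symptoms).
full_sample = rng.integers(0, 2, size=8)
masked_input, observed_mask = mask_sample(full_sample)
print(full_sample, masked_input, observed_mask, sep="\n")
```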

Published in 2019 | AABI 2019


Integrating overlapping datasets using bivariate causal discovery

Anish Dhir and Ciarán M. Lee

Knowing that a disease is highly correlated with symptoms, or a drug highly correlated with recovery, is not enough, and basing medical decisions on such information can be dangerous. To truly begin to revolutionise healthcare, AI must learn to distinguish cause and effect. Our work solves this by utilising new physics-inspired ideas about what it means for one variable to cause another, and showing how causal relationships in one dataset limit the possibilities in other overlapping datasets. To illustrate our algorithm, we apply it to breast cancer data, showing how to extract causal relations between two important features despite the fact that they were never measured in the same dataset.

Published in 2019 | AAAI 2020


A System for Medical Information Extraction and Verification From Unstructured Text

Damir Juric, Giorgos Stoilos, Andre Melo, Jonathan Moore and Mohammad Khodadadi

A wealth of medical knowledge has been encoded in terminologies like SNOMED CT, NCI, FMA, and more. However, these resources usually lack information such as relations between diseases, symptoms, and risk factors, which prevents their use in diagnostic or other decision-making applications. In this paper we present a pipeline for extracting such information from unstructured text and enriching medical knowledge bases.

Published in 2019 | IAAI 2020 at AAAI 2020


Correlations between Word Vector Sets

Vitalii Zhelezniak, April Shen, Daniel Busbridge, Aleksandar Savkov, Nils Hammerla

We interpret word similarity as correlations between word embeddings and generalise this view to sentence-level similarity by considering either vector pooling or multivariate correlation coefficients. Both approaches rival state-of-the-art methods on standard semantic textual similarity benchmarks.

Published in 2019 | EMNLP-IJCNLP


Copy, paste, infer: a robust analysis of twin network counterfactual inference

Logan Graham, Ciarán M. Lee, Yura Perov

We provide an efficient way to conduct counterfactual simulation with twin networks, benchmarked against the state of the art.

Published in 2019 | NeurIPS Causal Machine Learning workshop


Multilingual Factor Analysis

Vargas et al.

In this work we approach the task of learning multilingual word representations in an offline manner by fitting a generative latent variable model to a multilingual dictionary. We model equivalent words in different languages as different views of the same word generated by a common latent variable representing their latent lexical meaning.

Published in 2019 | ACL


Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, Nils Y. Hammerla

We push the limits of word embeddings on semantic textual similarity tasks by introducing DynaMax, a novel unsupervised non-parametric similarity measure based on word vectors and fuzzy bag-of-words. This method is efficient and easy to implement, yet outperforms current baselines on STS tasks by a large margin.
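
As a rough sketch of the max-pooled fuzzy bag-of-words idea (consult the paper for the exact DynaMax formulation), the snippet below max-pools each sentence's word vectors against a shared "universe" of vectors and compares the resulting fuzzy membership vectors with a fuzzy Jaccard score; the random embeddings are placeholders.

```python
# Sketch of a max-pooled fuzzy bag-of-words similarity in the spirit of
# DynaMax-Jaccard. Word vectors are rows of X and Y (one sentence each).
import numpy as np

def dynamax_jaccard(X, Y):
    # Universe: word vectors from both sentences act as fuzzy "features".
    U = np.vstack([X, Y])
    # Fuzzy membership of each sentence in every feature: max-pooled,
    # clipped similarity scores.
    x_feat = np.maximum(X @ U.T, 0).max(axis=0)
    y_feat = np.maximum(Y @ U.T, 0).max(axis=0)
    # Fuzzy Jaccard similarity between the two membership vectors.
    return float(np.minimum(x_feat, y_feat).sum() /
                 np.maximum(x_feat, y_feat).sum())

# Hypothetical word embeddings for two short sentences.
rng = np.random.default_rng(0)
sent_a = rng.normal(size=(4, 300))   # 4 words
sent_b = rng.normal(size=(6, 300))   # 6 words
print(dynamax_jaccard(sent_a, sent_b))
```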

Published in 2019 | ICLR


Model Comparison for Semantic Grouping

Vargas et al.

We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate the task of semantic similarity as a model comparison task in which we contrast a generative model which jointly models two sentences versus one that does not. We illustrate how this framework can be used for the Semantic Textual Similarity tasks using clear assumptions about how the embeddings of words are generated.

Published in 2019 | ICML


Correlation Coefficients and Semantic Textual Similarity

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Nils Hammerla

We introduce a novel statistical view on semantic textual similarity and cast it as correlations between word embeddings. We study the statistics of popular embedding models and show that simple word embeddings together with rank correlations can easily rival the strongest deep representations on semantic textual similarity tasks.
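
A minimal sketch of this view, assuming mean-pooled word vectors as the sentence representation (one of several variants one could study): treat the two pooled vectors as paired samples and score their similarity with a rank correlation.

```python
# Minimal sketch: sentence similarity as a rank correlation between
# mean-pooled word vectors (an assumed, simple variant of the idea).
import numpy as np
from scipy.stats import spearmanr

def sentence_similarity(word_vecs_a, word_vecs_b):
    a = word_vecs_a.mean(axis=0)   # mean-pool words into a sentence vector
    b = word_vecs_b.mean(axis=0)
    # Treat the two d-dimensional vectors as paired samples and rank-correlate.
    rho, _ = spearmanr(a, b)
    return float(rho)

rng = np.random.default_rng(0)
print(sentence_similarity(rng.normal(size=(5, 300)), rng.normal(size=(7, 300))))
```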

Published in 2019 | NAACL-HLT


Decoding Decoders: Finding Optimal Representation Spaces for Unsupervised Similarity Tasks

Vitalii Zhelezniak, Dan Busbridge, April Shen, Samuel L. Smith, Nils Y. Hammerla

Intriguingly, simple models outperform complex deep networks on many unsupervised text similarity tasks. We provide an intuitive yet rigorous explanation for this behaviour by introducing the concept of an optimal representation space, in which similarity is induced by the model's objective function.

Published in 2018 | ICLR Workshop


A Novel Approach and Practical Algorithms for Ontology Integration

Giorgos Stoilos, David Geleta, Jetendr Shamdasani and Mohammad Khodadadi

In this paper we present a framework and novel approach for integrating independently developed ontologies. Starting from an initial seed ontology, which may already be in use by an application, new sources are used to iteratively enrich and extend it. To deal with structural incompatibilities, we present a novel fine-grained approach based on mapping repair and alignment conservativity, formalise it, and provide both exact and approximate but practical algorithms.

Published in 2018 | ISWC


Supporting Digital Healthcare Services Using Semantic Web Technologies

Gintaras Barisevičius, Martin Coste, David Geleta, Damir Juric, Mohammad Khodadadi, Giorgos Stoilos, Ilya Zaihrayeu

In this paper we report on our efforts and the challenges we faced in using Semantic Web technologies to support the healthcare services provided by Babylon Health.

Published in 2018 | ISWC


Reasoning with Textual Queries: A Case of Medical Text

Damir Juric, Giorgos Stoilos, Szymon Wartak, Mohammad Khodadadi

Published in 2018 | ISWC


Methods and Metrics for Knowledge Base Engineering and Integration

Stoilos et al.

In this paper we investigate the possibility of integrating different and largely heterogeneous biomedical ontologies. We report on our Knowledge Base construction pipeline, which is based on ontology integration, and focus on the various metrics, techniques, and tools we have developed to assist in achieving this large-scale integration task.

Published in 2018 | WOP


Medical Knowledge Graph Construction by Aligning Large Biomedical Datasets

Stoilos et al.

Large Knowledge Bases can be built by aligning and integrating existing data sources. To support AI-based digital healthcare services within Babylon Health, a significant effort to build a large medical KB was recently undertaken. To realise this goal, a highly configurable and modular ontology integration pipeline has been created that contains three phases: a Matching phase, an Aggregation phase, and a final Post-Processing phase.

Published in 2018 | OM


Offline bilingual word vectors, orthogonal transformations and the inverted softmax

Smith et al.

Pre-trained word embeddings can be aligned with a linear transformation, using dictionaries compiled from expert knowledge. In this work, we prove that the linear transformation between two embedding spaces should be orthogonal and that it can be obtained using the singular value decomposition. We also introduce a novel “inverted softmax” for identifying translation pairs, improving on the precision@1 of previous work.
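
The orthogonal map can be recovered with a few lines of linear algebra; the sketch below solves the Procrustes problem via the SVD, as the abstract describes, using randomly generated placeholder matrices in place of real dictionary-aligned embeddings.

```python
# Sketch of the SVD-based orthogonal alignment: given row-aligned embedding
# matrices X (source) and Y (target) for a bilingual dictionary, the best
# orthogonal map is recovered from the SVD of Y^T X. Toy data only.
import numpy as np

def orthogonal_map(X, Y):
    # Solve min_W ||X W^T - Y|| subject to W orthogonal (Procrustes problem).
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 300, 5000
true_W = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden orthogonal map
X = rng.normal(size=(n, d))                         # source-language vectors
Y = X @ true_W.T                                    # aligned target vectors
W = orthogonal_map(X, Y)
print(np.allclose(W, true_W))                       # recovered up to numerics
```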

Published in 2017 | ICLR