Babylon Health

Babylon's Medical Peer-Reviewed Research of Babylon Tech

Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

In this paper, we present three rounds of user studies, carried out in the context of developing a medical note generation system. We present, analyse and discuss the participating clinicians’ impressions and views of how the system ought to be adapted to be of value to them. Next, we describe a three-week test run of the system in a live telehealth clinical practice.

Published in 2022 | Presented at the 60th Annual Meeting of the Association for Computational Linguistics, Dublin

Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient’s clinical safety. To address this we present an extensive human evaluation study of consultation notes where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors, both quantitative and qualitative. We then carry out a correlation study with 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore.

Published in 2022 | Presented at the 60th Annual Meeting of the Association for Computational Linguistics, Dublin
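
For readers unfamiliar with the character-based Levenshtein distance mentioned in the abstract above, the following is a minimal sketch of the standard dynamic-programming algorithm. It is not the paper's evaluation code: the function names, the length-based normalisation, and the example note strings are assumptions made purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insertions, deletions and
    substitutions each cost 1."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ch_a
                    current[j - 1] + 1,                   # insert ch_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation (an assumption, not taken from the paper):
    1.0 means identical strings, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Made-up strings standing in for a generated note and its post-edited version.
    generated = "Patient reports dry cough for 3 days."
    post_edited = "Patient reports a dry cough for three days."
    print(levenshtein(generated, post_edited))
    print(round(normalised_similarity(generated, post_edited), 3))
```

A metric of this kind only compares surface characters, so it needs no trained model; the abstract's finding is that it nonetheless correlates with human judgements about as well as model-based metrics.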

Joseph Enguehard, Dan Busbridge, Adam Bozson, Claire Woodcock, Nils Hammerla

To address their temporal nature, we treat EHRs as samples generated by a Temporal Point Process (TPP), enabling us to model what happened in an event with when it happened in a principled way. Our proposed attention-based Neural TPP performs favourably compared to existing models, and provides insight into how it models the EHR, an important step towards a component of clinical decision support systems.

Published in 2020 | ML for Health Workshop, NeurIPS 2020

Adam Baker, Yura Perov, Katherine Middleton, Janie Baxter, Daniel Mullarkey, Davinder Sangar, Mobasher Butt, Arnold DoRosario

We performed a validation study of the accuracy and safety of the Babylon AI triage system and human doctors using a set of identical clinical cases. Overall, we found that the AI system is able to provide patients with triage and diagnostic information with a level of clinical accuracy and safety comparable to that of human doctors.

Published in 2020 | Frontiers in Artificial Intelligence Medicine and Public Health

Yuanzhao Zhang, Robert Walecki, Joanne Winter, Felix Bragman, Sara Lourenco, Chris Hart, Adam Baker, Yura Perov and Saurabh Johri

AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We demonstrate that context-aware machine learning models can be used for estimating disease incidence. These methods are quicker to implement than traditional epidemiological approaches. We therefore suggest that they complement existing modelling efforts where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate.

Published in 2020 | Frontiers in Artificial Intelligence Medicine and Public Health

Claudia Schulz, Josh Levy-Kramer, Camille Van Assel, Miklos Kepes and Nils Hammerla

A promising application of AI to healthcare is the retrieval of information from electronic health records (EHRs), e.g. to aid clinicians in finding relevant information for a consultation or to recruit suitable patients for a study. This requires search capabilities beyond simple string matching, including the retrieval of medical concepts (diagnoses, symptoms, medications, etc.) related to the one in question. We open-source a novel medical concept relatedness benchmark, which is six times larger than existing datasets and consists of concept pairs that co-occur in EHRs, ensuring their relevance for medical information retrieval from EHRs.

Published in 2020 | COLING 2020

Jonathan G. Richens, Ciarán M. Lee & Saurabh Johri

Machine learning promises to revolutionize clinical decision making and diagnosis. In medical diagnosis a doctor aims to explain a patient’s symptoms by determining the diseases causing them. However, existing machine learning approaches to diagnosis are purely associative, identifying diseases that are strongly correlated with a patient’s symptoms. In this paper we show that this inability to disentangle correlation from causation can result in sub-optimal or dangerous diagnoses. To overcome this, we reformulate diagnosis as a counterfactual inference task and derive counterfactual diagnostic algorithms. We compare our counterfactual algorithms to the standard associative algorithm and 44 doctors using a test set of clinical vignettes. While the associative algorithm achieves an accuracy placing in the top 48% of doctors in our cohort, our counterfactual algorithm places in the top 25% of doctors, achieving expert clinical accuracy. Our results show that causal reasoning is a vital missing ingredient for applying machine learning to medical diagnosis.

Published in 2020 | Nature Communications

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Nils Hammerla

Some of the top approaches to semantic textual similarity rely on various correlations between word embeddings, including the famous cosine similarity. We show that mutual information between dense word embeddings, despite being difficult to estimate, is another excellent candidate for semantic similarity and rivals existing state-of-the-art unsupervised methods.

Published in 2020 | ACL Journal

Giorgos Stoilos, Damir Juric, Szymon Wartak, Claudia Schulz, Mohammad Khodadadi

The success of logic-based methods for comparing entities heavily depends on the axioms that have been described for them in the Knowledge Base (KB). Due to the incompleteness of even large and well engineered KBs, such methods suffer from low recall when applied in real-world use cases. To address this, we designed a reasoning framework that combines logic-based subsumption with statistical methods for on-the-fly knowledge extraction.

Published in 2020 | European Semantic Web Conference

Claudia Schulz, Damir Juric

We create various large-scale datasets for testing whether embeddings correctly encode the similarity between medical terms and test existing state-of-the-art embeddings on these datasets. Our results reveal that existing embeddings cannot adequately represent medical terminology. Our new datasets are thus challenging new benchmarks for testing the adequacy of new medical embeddings in the future.

Published in 2020 | AAAI 2020

Yura Perov, Logan Graham, Kostis Gourgoulias, Jonathan G. Richens, Ciarán M. Lee, Adam Baker, Saurabh Johri

The paper describes a probabilistic programming engine design and its analysis for counterfactual probabilistic programming, in general and in particular using importance sampling.

Published in 2019 | AABI 2019

Anish Dhir and Ciarán M. Lee

Knowing that a disease is highly correlated with symptoms, or a drug highly correlated with recovery, is not enough, and basing medical decisions on such information can be dangerous. To truly begin to revolutionise healthcare, AI must learn to distinguish cause and effect. Our work solves this by utilising new physics-inspired ideas about what it means for one variable to cause another, and showing how causal relationships in one dataset limit the possibilities in other overlapping datasets. To illustrate our algorithm, we apply it to breast cancer data, showing how to extract causal relations between two important features despite the fact that they were never measured in the same dataset.

Published in 2019 | AAAI 2020

Logan Graham, Ciarán M. Lee, Yura Perov

Provides an efficient way to conduct counterfactual simulation, benchmarked against the state of the art.

Published in 2019 | NeurIPS Causal Machine Learning workshop

Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, Nils Y. Hammerla

We push the limits of word embeddings on semantic textual similarity tasks by introducing DynaMax, a novel unsupervised non-parametric similarity measure based on word vectors and fuzzy bag-of-words. This method is efficient and easy to implement, yet outperforms current baselines on STS tasks by a large margin.

Published in 2019 | ICLR

Gintaras Barisevičius, Martin Coste, David Geleta, Damir Juric, Mohammad Khodadadi, Giorgos Stoilos, Ilya Zaihrayeu

In this paper we report on our efforts, and the challenges we faced, in using Semantic Web technologies to support the healthcare services provided by Babylon Health.

Published in 2018 | ISWC

Douglas et al.

Published in 2017 | NIPS Workshop, NIPS 2017