Data Challenges of building an AI doctor (4/5): Understanding Unstructured Data
Written by Christina Hu, 3 min read
When we think of data, we often imagine numbers sitting neatly in tables. Easy to transform, analyse and use.
Much medical data isn’t so conveniently structured.
Doctors’ notes, test results and referrals are but a few examples. And at Babylon, there’s also the free-text input from our members when they use our symptom checker chatbot.
Much of handling unstructured healthcare data involves making sense of free text.
Problem: How do we extract meaning from free-form, unstructured medical text?
Think about when you read a piece of text for the first time.
There are usually two things you’re doing: 1) Making sense of it within the immediate context, and 2) Storing what you’ve learnt in a way that you can apply it later when the situation calls for it.
Machine reading medical text is no different.
LESSON 10: Understanding unstructured text should take both the current use case and potential future use cases into account.
1) Making sense of text within the immediate context:
At Babylon, we’re developing various natural language processing (NLP) techniques specialised for handling different types of medical text input for different purposes.
Some key properties we’ve realised are important for our NLP techniques to have are:
- Separating salient information from background noise
- Distinguishing different categories of medical information, e.g. conditions, symptoms, drugs, body parts
- Knowing when inputs are clear versus ambiguous or nonsensical, so we don't return potentially misleading outputs
- Using knowledge-based methods, both to provide deeper understanding and to open the door to explainable algorithm behaviour
- Scalability, so we can process vast amounts of medical text with high accuracy and speed
Diagram 7: Example to illustrate the key information our NLP techniques can extract from a doctor’s note
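To make the idea concrete, here is a deliberately minimal sketch of what "extracting key information from a doctor's note" can look like. This is not Babylon's actual pipeline (real systems use trained NER models rather than a hand-written lexicon); the tiny lexicon and categories below are illustrative assumptions.

```python
# Hypothetical mini-lexicon mapping surface forms to medical categories.
# A production system would use a trained model and a full terminology.
LEXICON = {
    "chest pain": "symptom",
    "asthma": "condition",
    "salbutamol": "drug",
    "left arm": "body_part",
}

def extract_concepts(note):
    """Return the lexicon terms found in the note, with category and offset."""
    note_lower = note.lower()
    found = []
    for term, category in LEXICON.items():
        idx = note_lower.find(term)
        if idx != -1:
            found.append({"term": term, "category": category, "offset": idx})
    # Order concepts by where they appear in the note.
    return sorted(found, key=lambda c: c["offset"])

note = "Pt reports chest pain radiating to left arm. Hx of asthma, on salbutamol."
for concept in extract_concepts(note):
    print(concept)
```

Even this toy version shows two of the properties listed above: salient terms are separated from background text, and each is tagged with a medical category.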
2) Storing what you’ve learnt to apply later:
We treat each data point we’re given like gold dust.
That’s because we know that, by combining each point in various ways with other data points members give us over time, we can create vast amounts of insight and value that we can use to provide more effective and personalised care than ever before.
To combine data points from disparate sources and allow different Babylon services to talk to each other, we must express all the concepts we extract from our data in a single, common language. This point is so crucial that we'll unpack it in its own section below.
LESSON 11: To extract meaning from data in a way that’s conducive to future usage, use a single medical language.
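A sketch of what a single medical language buys you: different surface forms from different sources normalise to one canonical concept, so data points can be joined later. The concept codes below are made up for illustration (real systems typically map to a standard terminology such as SNOMED CT).

```python
# Hypothetical synonym table: many surface forms, one canonical code.
# The codes C001/C002 are illustrative, not real terminology IDs.
CANONICAL = {
    "heart attack": "C001",
    "myocardial infarction": "C001",
    "mi": "C001",
    "high blood pressure": "C002",
    "hypertension": "C002",
}

def normalise(term):
    """Map a free-text term to its canonical concept code, or None if unknown."""
    return CANONICAL.get(term.strip().lower())

# A GP's note saying "myocardial infarction" and a chatbot message saying
# "heart attack" now refer to the same concept and can be combined:
assert normalise("Myocardial Infarction") == normalise("heart attack") == "C001"
```

The design point is that the mapping happens at extraction time, so every downstream service only ever sees canonical concepts, never raw strings.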
We’re making great strides towards machine understanding of complex, free-flowing human text input.
But, like with the Accuracy challenge we discussed previously, the quality of the output is limited by the quality of the input.
Just take a look at these extracts taken from example doctors’ notes:
Diagram 8: Extracts from example doctors’ notes illustrating the potential variable quality
When we put poorly written doctors’ notes through even our most sophisticated NLP algorithms, the output is often complete garbage. But notes that are clear and well-structured tend to yield extracted concepts that completely and accurately reflect the intended meaning.
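One practical safeguard against garbage output is to abstain: if too little of a note can be confidently recognised, flag it for human review rather than emit a possibly misleading structured record. A minimal sketch, assuming a hypothetical token-coverage heuristic and an illustrative threshold:

```python
def coverage(note, recognised_terms):
    """Fraction of the note's tokens accounted for by recognised concepts."""
    tokens = note.lower().split()
    if not tokens:
        return 0.0
    recognised = sum(len(term.split()) for term in recognised_terms)
    return recognised / len(tokens)

def safe_extract(note, recognised_terms, threshold=0.3):
    """Return extracted terms only when enough of the note was understood."""
    if coverage(note, recognised_terms) < threshold:
        return None  # abstain: route the note to human review instead
    return recognised_terms
```

Real systems would use model confidence scores rather than raw token coverage, but the principle is the same: no output is safer than a misleading one.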
That’s why we’re also developing tools and guidance for our doctors to write consultation notes that can be easily understood by fellow humans and machines alike.
The information provided is for educational purposes only and is not intended to be a substitute for professional medical advice, diagnosis, or treatment. Seek the advice of a doctor with any questions you may have regarding a medical condition. Never delay seeking or disregard professional medical advice because of something you have read here.