Along with our marquee cult.fit brand, care.fit has long been an integral part of our long-term vision. A strong focus on clinic design, zero wait times, and a stellar customer experience led to care.fit's consistently high ratings for its offline experience before COVID-19 hit.
This understandably changed with the pandemic. Almost overnight, there was a need to provide critical medical advice to all our customers locked down in their homes. This meant that care.fit had to pivot to a digital-first teleconsultation model, where customers could pre-book a slot and consult doctors online via phone or laptop.
This shift to digital-first wasn’t without challenges
Doctors, who were extremely comfortable with their offline workflows, had to go digital-first overnight. Along with the usual problems of figuring out video-conferencing software and online consultation etiquette, there was the more formidable problem of creating digital prescriptions.
Doctors are typically used to writing freehand on paper. Most of the time, the prescription can be interpreted only by a handful of trained people (like assistants and pharmacists), which leads doctors to use convenient shorthand that is practically inscrutable to those not in the know.
A prescription records a lot of information like prior conditions, symptoms, probable diagnosis, medicines, lab tests, and lifestyle advice (eg: avoiding spicy food). Not all of the doctors were used to typing on a computer, especially given the information-dense nature of these prescriptions and the limited time they had. This led to problems on both fronts: doctors complained that the whole process was very inefficient, and customers were dissatisfied with the reduced patient-doctor interaction time.
And so, care.fit had a new business problem to solve:
“Decrease the prescription creation time by reducing keystrokes, scrolls and clicks required by the doctor”
At cult.fit, we always start with a business problem and then look for the right technique to solve the problem. In this case, it required a combination of UX designers, product managers, engineers and data scientists to design a comprehensive solution that could solve this problem.
In this blog, we seek to shed light on the data science part of the solution while also discussing how all the other parts of the puzzle fit into it.
An Up-to-date Prescription Model:
To aid the doctors in their digital flows, we already had a state-of-the-art <DocsApp> built by our team. This was, at that time, web-first. <DocsApp> allowed doctors to roster themselves, manage their schedules, and interact with patients while simultaneously referring to supporting documents and creating a prescription.
An online prescription consists of the following key subsections:
Along with these sections, there is a freeform section to record additional information like previous medical history, travel history, blood sugar levels, and doctor comments, among others.
A prescription, more formally, is also called an Electronic Health Record (EHR). Our tech team surveyed the global best practices of storing EHRs. We then decided to create a limited list of terms for each of these sections that doctors could search over. We also acknowledged that there could be cases where this list wouldn't suffice, so we allowed doctors to add to it. The benefits were twofold: it ensured convenience for the doctors and constantly updated the system's knowledge base with terms we may have missed.
An Efficient UI:
Our UI team decided to use pills to show these suggestions. Since <DocsApp> was web-first at the time, we had the screen real estate to show 10 suggestions per prescription category. Additionally, we wanted to dynamically update the recommendations after every doctor input, which we slotted for the second rollout.
A Problem Statement to start with
With a clear picture of the business problem and method of delivery, we now had to define our problem statement.
For motivation, let's start by defining the problem statement for medicines themselves. A first attempt would look something like this:
“Given a set of diagnoses, predict the medicine”
While this statement is correct, it could do with a little more detail.
- First, we are not predicting a single medicine; we actually want to show the most likely recommendations. So we want our model to output probabilities.
- Secondly, there is an implicit assumption here that medicines depend only on the diagnosis, which is not true. We know that doctors also recommend medicines to alleviate the pain of some symptoms.
Irrespective of the truth of that assumption, we always want to decouple our model design choices from the problem statement itself. Correcting these problems, here is the refined problem statement:
“Given an arbitrary prescription with fields like symptoms and diagnosis, output the n most likely medicines”
We are almost there. Let's just rework the problem statement to cover all fields.
“Given an arbitrary prescription consisting of multiple fields, output the most likely n predictions for each of these fields”
This problem statement captures all aspects of the problem. As a bonus, this very directly defines the input and output interfaces to the model. The tech team will call our model with a prescription object, we reply with an object with the same structure containing the predictions for each of the fields. Note that this formulation also supports dynamic predictions. We just keep calling the prediction system with an updated prescription after every doctor entry.
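As a concrete sketch of that contract, here is a minimal prediction interface. The field names and the canned suggestions are illustrative placeholders, not the actual <DocsApp> schema; the real system ranks candidates by model probability.

```python
# Hypothetical sketch of the prediction interface described above.
# Field names and suggestion lists are illustrative, not the real schema.

def predict(prescription: dict, n: int = 10) -> dict:
    """Given a (partially filled) prescription object, return an object
    with the same structure containing the top-n suggestions per field."""
    suggestions = {
        "symptoms": ["fever", "cough", "headache"],
        "diagnosis": ["viral fever", "migraine"],
        "medicines": ["Paracetamol 650mg"],
        "lab_tests": ["CBC"],
    }
    # Same structure in, same structure out: one suggestion list per field.
    return {field: suggestions.get(field, [])[:n] for field in prescription}

# Dynamic predictions: simply call again after every doctor entry.
draft = {"symptoms": ["fever"], "diagnosis": [], "medicines": [], "lab_tests": []}
top_two = predict(draft, n=2)
```

The important point is the shape of the interface, not the toy logic: the caller never needs to know which models produced which suggestions.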
The AI Philosophy at cult.fit
Before we jump into the implementation details, we wanted to comment on the usage of AI in our systems.
Claiming that whatever system we build understands the human body and most of modern medical science would be patently untrue. Our team strongly believes we have miles to go before we build intelligence that can reason, interpolate, and extrapolate like a human does, particularly in critical systems like healthcare.
However, there is absolutely no need to frame the problem as Human vs. AI, as our current zeitgeist presents it. This is a false dichotomy. Instead, at cult.fit we coined something called the Human+ strategy. The core idea is as follows: we have various experts like doctors, trainers, and nutritionists in our system. Currently, a lot of their time is wasted on non-productive tasks like data entry, logistics, tracking, and insight generation. We want them to leverage AI and tech to automate these ancillary tasks so they can focus more of their time on solving health issues and meeting the needs of our customers/their patients.
If our experts are currently able to serve 10 customers effectively, we wish to take that number up to 50 and eventually 100 using automation. This is a win-win situation. On the customers’ side, this solves the supply problem because high-quality healthcare knowledge is mostly accessible to the affluent in India due to scarcity. On the doctors’ end, we are able to help the experts serve more customers.
So our team built this system as a virtual aid to doctors—an assistant of sorts. This mindset promoted a collaborative approach between the doctors and tech, rather than a combative one.
The core solution is a weighted ensemble of Recency Frequency and Naive Bayes models. The individual models range from the most generic, with the least weight, to the most specific, with the highest weight.
For medicines in particular, we split the model into two steps.
- One step predicts chemical molecules.
- The second step predicts medicine brand names from the chemical molecules.
In this section, we elaborate on the solution and the design choices involved.
Recency Frequency Models
Recency Frequency (RF) models are quite useful when your problem has all of the following aspects.
- Features are categorical
- Target is categorical
- Number of categories per feature/target is quite high
- Training data is quite limited
Note: We call a feature categorical when it can take values only from a limited set.
Eg : Rating (Terrible/Bad/Good/Awesome), State of residence etc.
Almost all of these aspects hold for the problem at hand. In particular, the number of categories is huge for symptoms, diseases, medicines, and lab tests.
Here is how a frequency model works:
- For a given set of features and their current values, subset all the rows where the features take those values.
- Calculate the number of times each medicine was recommended
- Divide it by the total number of rows in the subset resulting in a predicted probability distribution
Now, for the recency frequency model, steps 2 and 3 change as follows:
- Instead of counting each row as 1, we weight rows closer to the current time higher and older rows lower.
- Instead of dividing by the total number of rows, we divide by the aggregate weight, again resulting in a probability distribution.
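The steps above can be sketched as follows. The exponential-decay weighting and the 90-day half-life are our illustrative choices, not the production formula; setting every weight to 1 recovers the plain frequency model.

```python
from collections import defaultdict
from datetime import datetime, timedelta  # timedelta is handy for building timestamps

def recency_frequency_predict(rows, feature_values, now, half_life_days=90):
    """Recency Frequency model sketch (not the production code).
    rows: list of (features: dict, medicine: str, timestamp: datetime).
    feature_values: dict of feature -> value used to subset the rows.
    Returns a probability distribution over medicines."""
    weights, total = defaultdict(float), 0.0
    for features, medicine, ts in rows:
        # Step 1: keep only the rows where the features take the given values.
        if any(features.get(f) != v for f, v in feature_values.items()):
            continue
        # Step 2: weight recent rows higher (here, exponential decay).
        w = 0.5 ** ((now - ts).days / half_life_days)
        weights[medicine] += w
        total += w
    # Step 3: divide by the aggregate weight to get a probability distribution.
    return {m: w / total for m, w in weights.items()} if total else {}
```

Note that the empty-subset case returns an empty distribution; this is exactly the sparsity problem the ensemble section addresses.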
This pictograph illustrates the same calculation, hopefully in a much more intuitive manner.
Now, if you are thinking this model is flawed, it definitely is. The main problem is sparsity: it is not guaranteed that previous records exist for every combination of feature values. The other is selecting the features by which we subset.
Both problems are addressed by ensembling techniques, which we discuss in a later section.
Naive Bayes Models
Naive Bayes models work on the unstructured data (text) present in the prescription, specifically the public and private notes that doctors use to jot down freeform the details the patient is telling them.
Naive Bayes models work on the text tokens present in the data. To predict a category for a document given its words, we flip the question and ask which category maximizes the probability of the words present in the document.
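A minimal from-scratch sketch of that flip, using Bayes' rule with add-one (Laplace) smoothing in log space. The example documents and labels are made up; the production model and its preprocessing were more involved.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns class priors, per-label
    token counts, and the vocabulary."""
    priors, token_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in docs:
        priors[label] += 1
        tokens = text.lower().split()
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, token_counts, vocab

def predict_nb(priors, token_counts, vocab, text):
    """Pick the label maximising P(label) * prod P(token | label),
    with add-one smoothing, computed in log space for stability."""
    scores, n_docs = {}, sum(priors.values())
    for label in priors:
        total = sum(token_counts[label].values())
        score = math.log(priors[label] / n_docs)
        for tok in text.lower().split():
            score += math.log((token_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

In practice a library implementation (e.g. scikit-learn's `MultinomialNB`) would be used; the sketch is only to show the mechanics.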
As alluded to before, most of the individual models we build are weak learners. By combining these models, we get a higher-level model that is more powerful than each of the individual ones. This technique is called ensembling, a very powerful idea that is also used to build random forests from weak decision trees. Even in the deep learning era, random forests remain state of the art in use cases where the training data is tabular.
Here, we illustrate how to combine two models (probability distributions) in a weighted manner:
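As a sketch of that combination, the snippet below takes the weighted average of two distributions. The weights and the example distributions are made up for illustration.

```python
def combine(dist_a, dist_b, weight_a, weight_b):
    """Weighted average of two probability distributions.
    Categories missing from one distribution contribute 0 there."""
    total = weight_a + weight_b
    categories = set(dist_a) | set(dist_b)
    return {c: (weight_a * dist_a.get(c, 0.0) + weight_b * dist_b.get(c, 0.0)) / total
            for c in categories}

# A specific (high-weight) model blended with a generic (low-weight) fallback.
specific = {"Paracetamol": 0.8, "Ibuprofen": 0.2}
generic = {"Paracetamol": 0.5, "Ibuprofen": 0.3, "Aspirin": 0.2}
blended = combine(specific, generic, weight_a=3.0, weight_b=1.0)
```

Because both inputs sum to 1 and the weights are normalised, the output is again a valid probability distribution.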
The most specific models are often the most accurate and strongly influence the prediction whenever possible. However, if, because of sparseness, a specific model is unable to predict, we fall back to a more generic model.
The sparsity problem is addressed by the inclusion of very general models with small weights that always have adequate data to make predictions.
The Path to Production
These ensemble models finally gave us impressive accuracy. However, we used the following techniques to improve it even further. While most of them may look obvious, they were absolutely vital to the functioning of the system. Most fall into the data-cleaning bucket, which is generally considered to be 80% of the work involved in building intelligent systems.
Stemming symptom field
Symptoms were one category where doctors kept adding new entries instead of picking from the pre-existing list. To clean this data, we used stemming, stop-word removal, and sorting to reduce the complaint dimensionality.
(Eg: “Right lung is paining”, “pain in right lung” both resolve to “lung pain right”)
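A toy version of that normalisation is sketched below. The suffix-stripping stemmer and the stop-word list are deliberately tiny stand-ins; the real pipeline would use a proper stemmer (e.g. Porter's) and a full stop-word list.

```python
import re

STOP_WORDS = {"is", "in", "the", "of", "a", "an"}  # illustrative subset

def stem(token):
    # Toy suffix-stripping stemmer, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize_complaint(text):
    """Lowercase, drop stop words, stem, de-duplicate, and sort, so that
    differently worded complaints resolve to one canonical string."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = sorted({stem(t) for t in tokens if t not in STOP_WORDS})
    return " ".join(stems)
```

With this, both phrasings from the example above collapse to the same key, so their historical records can be pooled.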
Simple description for diseases
For diagnoses, we used ICD-10 codes, an internationally recognised standard for recording diagnoses. However, most of them have quite descriptive names (“I10: Essential (primary) hypertension” instead of just “blood pressure” or “BP”). This is quite useful for avoiding ambiguity in medical records, but it is not the most user-friendly. Doctors sometimes searched for the simple name and, upon not finding it in the existing list, added a new entry.
This was solved by creating a new column called alternative names, where our doctors added popular simple names for the most-used ICD-10 codes. This column was included as a searchable field in our Elasticsearch config.
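For illustration, the search request could then look something like this `multi_match` query body. The field names (`name`, `alternative_names`) and the boost factor are our assumptions, not the actual config.

```python
# Hypothetical Elasticsearch query body: the doctor's search term is matched
# against both the official ICD-10 name and the new alternative-names column.
search_body = {
    "query": {
        "multi_match": {
            "query": "BP",
            # "^2" boosts matches on the simple names doctors actually type.
            "fields": ["name", "alternative_names^2"],
        }
    }
}
```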
Modelling similarity of categories
You might've noticed that our models treat every category independently. The model could be improved by building in a notion of similarity between categories.
“E11.1 : Type 2 diabetes mellitus with ketoacidosis” and
“E11.2 : Type 2 diabetes mellitus with renal complications” are quite similar.
“E11.1 : Type 2 diabetes mellitus with ketoacidosis” is somewhat similar to
“E12 : Malnutrition-related diabetes mellitus”
As you can notice from the ICD-10 codes above, the code itself encodes some notion of similarity. We used this to alleviate the sparsity problems. For example, if records for ICD code E11.2 are not available, we can use records of E11.1 with reduced weights.
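A sketch of that prefix-based fallback: trim the ICD-10 code one character at a time, discounting the weight at each step. The decay factor and the data layout are illustrative, not the production logic.

```python
def similar_code_records(records, code, decay=0.5):
    """records: dict mapping ICD-10 code -> list of past prescriptions.
    If the exact code has no records, fall back to codes sharing a
    shorter prefix, discounting their weight by `decay` per trim.
    Returns a list of (record, weight) pairs."""
    prefix, weight = code, 1.0
    while prefix:
        matches = [(r, weight) for c, rows in records.items()
                   if c.startswith(prefix) for r in rows]
        if matches:
            return matches
        # Trim the code, e.g. 'E11.2' -> 'E11' -> 'E1' -> 'E'.
        prefix = prefix[:-1].rstrip(".")
        weight *= decay
    return []
```

The discounted records can then feed the same recency-frequency counting as before, just with smaller per-row weights.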
This only partially solves the problem, even for diagnoses, and such convenient codes don't exist for other fields like symptoms. A more elegant solution takes inspiration from word embeddings in NLP, which we discuss in the future work section.
While manually inspecting a sample of cases where the system was completely off (a process we recommend to all data scientists trying to debug their models), we realised the most frequent case was a prescription with no data. Some doctors were verbally communicating the diagnosis to the patient and then including only the medicines in the prescription. Here, the central doctor team was vital in conveying to the doctors the importance of filling in the full prescription.
Structuring the Medicine field
For predicting medicines, we started by predicting the medicine-name strings present in our database. These strings look like this:
“DOLO 250MG ORAL SUSP 60ml”
“DOLO 650MG TAB 8’s”
“CALPOL 650MG TAB 15’s”
The unstructured string “DOLO 650MG TAB 8’s” can be converted to the following structured data:
Brand Name : Dolo
Strength : 650MG
Form : Tablet
Quantity : 8 nos
Molecule : Paracetamol
The molecule information is not available in the name but can be included from third-party sources.
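A simplified parser for such strings is sketched below. The regex covers only the forms shown above (the real catalogue needed far more cases), and the brand-to-molecule table stands in for the third-party lookup.

```python
import re

# Covers only the example forms above; real medicine names need more cases.
PATTERN = re.compile(
    r"^(?P<brand>\S+)\s+"
    r"(?P<strength>\d+MG)\s+"
    r"(?P<form>TAB|ORAL SUSP)\s+"
    r"(?P<quantity>\S+)$"
)

# Hypothetical stand-in for the third-party brand -> molecule source.
MOLECULES = {"DOLO": "Paracetamol", "CALPOL": "Paracetamol"}

def parse_medicine(name):
    """Turn an unstructured medicine string into structured fields,
    or return None when the string doesn't match the pattern."""
    m = PATTERN.match(name)
    if not m:
        return None
    record = m.groupdict()
    record["molecule"] = MOLECULES.get(record["brand"])
    return record
```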
As mentioned above, we split the medicine prediction model into two parts. The first part learns that when the diagnosis is fever, it should recommend the molecule Paracetamol. The second part uses this molecule, along with the doctor ID, to learn the doctor's preferred company and predict the brand name “Dolo”.
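The two steps can be sketched with simple frequency counts. The history formats and names here are illustrative; the production models were the weighted ensembles described earlier.

```python
from collections import Counter

def predict_brand(diagnosis, doctor_id, molecule_history, brand_history):
    """Two-step medicine prediction sketch.
    molecule_history: list of (diagnosis, molecule) pairs.
    brand_history: list of (doctor_id, molecule, brand) triples."""
    # Step 1: most likely molecule given the diagnosis.
    molecules = Counter(m for d, m in molecule_history if d == diagnosis)
    if not molecules:
        return None
    molecule = molecules.most_common(1)[0][0]
    # Step 2: the doctor's preferred brand for that molecule.
    brands = Counter(b for doc, m, b in brand_history
                     if doc == doctor_id and m == molecule)
    # Fall back to the bare molecule if this doctor has no brand history.
    return brands.most_common(1)[0][0] if brands else molecule
```

The split keeps the medically meaningful signal (diagnosis to molecule) separate from the preference signal (doctor to brand), so each can be learned from more data.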
Explicitly including doctor input
During the manual inspection phase, we also noticed we were making some structural errors. To solve this problem, we identified the top diagnosis-medicine combinations being recommended and asked doctors to tell us whether the medicine was:
- Always used
- Sometimes used
- Never used
This manual input was plugged back into the system. However, this didn’t yield the gains we desired. We still believe there is a better way to do this. Refer to our discussion on knowledge graphs in the future work section.
Impact and Learnings
Here are the accuracies of our prediction models for various fields. One Prediction accuracy is when we were able to predict at least one of the entries for that field; All Prediction accuracy is when we were able to predict all of the entries for that field.

| Field | One Prediction Accuracy @ 8 | All Prediction Accuracy @ 8 |
| --- | --- | --- |
| Medicine Prediction (Medicine Brand) | 72.83% | 20.13% |
| Medicine Molecule Prediction | 81.44% | 29.33% |
| Lab Test Prediction | 90.00% | 60.00% |
The particularly low accuracy for complaints is because it is the first field to be entered; we rarely have any information to make relevant predictions.
On the business front, we were able to reduce consultation time by one minute, leading to saved doctor-hours. More importantly, there was a significant increase in doctor NPS.
Future Work
As mentioned in a couple of places, we required the model to have a notion of similarity between categories.
The cleanest way to encode this is with medical word embeddings. Embedding algorithms typically process a huge amount of text to automatically learn word similarity, looking at the words surrounding the word of interest. Consider:
The King presided over his court.
The Queen presided over her court.
This leads the algorithm to infer that the words “King” and “Queen” are similar. While such embeddings are openly available for general text, there hasn't been much significant work on medical word embeddings (i.e. on terms such as “diabetes mellitus”).
Another idea we had was to use state-of-the-art Graph Neural Networks. With the help of experts, we could create a Medical Knowledge Graph capturing relationships between concepts, like “Headache” being an ailment of the “Head” that can be treated by the medicine “Crocin”. Once we have this graph, we could use Graph Neural Networks to generate embeddings.
We found an organised terminology called SNOMED-CT, created by international medical experts, which encodes some of this information. There was some existing literature on creating embeddings from it, but it was at a fairly nascent stage. Also, SNOMED-CT did not capture the relationships between diseases and medicines, something that was very important to us.
Overall, although we couldn't pursue this direction for now, we believe it will be a promising one to explore in the future, especially once Medical Knowledge Graphs garner more interest in the wider ML research community.
Credits: Pratibha Yadav (Data Science), Darshak Bhagat (Product), Aditya Gupta (Engineering) and Satwik Gokina (Data Science)