How to make the most of AI in order to extract insights from a massive text corpus

Since the beginning of the COVID pandemic, more than 700,000 scientific papers have been published on the subject. A human researcher cannot possibly get acquainted with such a huge text corpus, so some help from AI is badly needed.

In this post, we will show how to extract knowledge from scientific papers, gain insights, and build a tool that helps researchers navigate the paper collection in a meaningful way. Along the way, we will also meet a number of cloud tools that can be useful for a data scientist.

If you want to follow along and do this experiment yourself, you may find all the source code together with step-by-step instructions here:

Automatic Paper Analysis

Automatic scientific paper analysis is a fast-growing area of study, and recent improvements in NLP techniques have advanced it greatly in the last few years.

In this post, we will show you how to derive specific insights from COVID papers, such as changes in medical treatment over time, or joint treatment strategies using several medications.

The main approach I will describe in this post is to extract as much semi-structured information from the text as possible, and then store it in a NoSQL database for further processing.

Storing information in a database allows us to run very specific queries to answer some of the questions, as well as to provide a visual exploration tool that medical experts can use for structured search and insight generation. The overall architecture of the proposed system is shown below:

Image by author

We will use several Azure technologies to gain insights into the paper corpus, such as Text Analytics for Health, Cosmos DB, and Power BI. Now let's focus on the individual parts of this diagram and discuss them in detail.

If you want to experiment with text analytics yourself, you will need an Azure account. You can always get a free trial if you do not have one. You may also want to check out other AI technologies for developers.

COVID Scientific Papers and CORD Dataset

The idea of applying NLP methods to scientific literature seems quite natural. First of all, scientific texts are already well-structured: they contain keywords and an abstract, and use well-defined terms.

Thus, at the very beginning of the COVID pandemic, a research challenge was launched on Kaggle to analyze scientific papers on the subject. The dataset behind this competition is called CORD (publication), and it contains a constantly updated corpus of everything that is published on topics related to COVID.

This dataset consists of the following parts:

  • Metadata file Metadata.csv contains the most important information for all publications in one place. Each paper in this table has a unique identifier cord_uid (which in fact turns out not to be completely unique once you actually start working with the dataset). The information includes: title of publication, journal, authors, abstract, date of publication, and DOI.
  • Full-text papers in document_parses directory, represented as structured text in JSON format, which greatly simplifies the analysis.
  • Pre-built document embeddings that map cord_uids to float vectors reflecting the overall semantics of each paper.
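Such document embeddings are typically compared with cosine similarity to find semantically related papers. A minimal sketch with toy 3-dimensional vectors (real CORD embeddings are much longer, and the `cord_uid` keys and values here are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy embeddings for two hypothetical cord_uids
emb = {"ug7v899j": [0.1, 0.9, 0.0], "02tnwd4m": [0.2, 0.8, 0.1]}
print(round(cosine(emb["ug7v899j"], emb["02tnwd4m"]), 3))  # → 0.984
```

A value close to 1 means the two abstracts cover similar topics.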

In this post, we will focus on paper abstracts, because they contain the most important information from the paper. However, for a full analysis of the dataset, it definitely makes sense to apply the same approach to the full texts as well.

What Can AI Do with Text?

In recent years, there has been huge progress in the field of Natural Language Processing, and very powerful neural language models have been trained. In the area of NLP, the following tasks are typically considered:

  • Text classification / intent recognition — In this task, we need to classify a piece of text into a number of categories. This is a typical classification task.
  • Sentiment Analysis — We need to return a number that shows how positive or negative the text is. This is a typical regression task.
  • Named Entity Recognition (NER) — In NER, we need to extract named entities from text, and determine their type. For example, we may be looking for names of medicines, or diagnoses. Another task similar to NER is keyword extraction.
  • Text summarization — Here we want to be able to produce a short version of the original text, or to select the most important pieces of text.
  • Question Answering — In this task, we are given a piece of text and a question, and our goal is to find the exact answer to this question from text.
  • Open-Domain Question Answering (ODQA) — The main difference from the previous task is that we are given a large corpus of text, and we need to find the answer to our question somewhere in that corpus.

In one of my previous posts, I have described how we can use the ODQA approach to automatically find answers to specific COVID questions. However, this approach is not suitable for serious research.

To derive insights from text, NER seems to be the most promising technique to use. If we can identify the specific entities present in the text, we can then perform semantically rich searches that answer specific questions, as well as obtain data on co-occurrence of different entities, uncovering the specific scenarios that interest us.

To train a NER model, as with any other neural language model, we need a reasonably large, properly labeled dataset. Finding such datasets is often not easy, and producing them for a new problem domain often requires an initial human effort to mark up the data.

Pre-Trained Language Models

Luckily, modern transformer language models can be trained in a semi-supervised manner using transfer learning. First, the base language model (for example, BERT) is trained on a large corpus of text, and then it can be specialized to a specific task, such as classification or NER, on a much smaller dataset.

This transfer learning process can also include an additional step: further training of the generic pre-trained model on a domain-specific dataset. For example, in the area of medical science, Microsoft Research has pre-trained a model called PubMedBERT (publication), using texts from the PubMed repository. This model can then be further adapted to different specific tasks, provided we have some specialized datasets available.

Text Analytics Cognitive Services

However, training a model requires a lot of skill and computational power, in addition to a dataset. Microsoft (as well as some other large cloud vendors) also makes pre-trained models available through a REST API. These services are called Cognitive Services, and one of them, for working with text, is called Text Analytics. It can do the following:

  • Keyword extraction and NER for some common entity types, such as people, organizations, dates/times, etc.
  • Sentiment analysis
  • Language Detection
  • Entity Linking, which automatically adds internet links to some of the most common entities. This also performs disambiguation: for example, Mars can refer to either the planet or a chocolate bar, and the correct link is used depending on the context.

For example, let’s have a look at one medical paper abstract analyzed by Text Analytics:

Image by author

As you can see, some specific entities (for example, HCQ, which is short for hydroxychloroquine) are not recognized at all, while others are poorly categorized. Luckily, Microsoft provides a special version of the service, Text Analytics for Health.

Text Analytics for Health

Text Analytics for Health is a cognitive service that exposes the pre-trained PubMedBERT model with some additional capabilities. Here is the result of extracting entities from the same piece of text using Text Analytics for Health:

Image by author

To perform the analysis, we can use the recent version of the Text Analytics Python SDK, which we need to pip-install first:

pip install azure-ai-textanalytics

The service can analyze a batch of text documents, up to 10 at a time. You can pass either a list of documents or a dictionary. Provided we have the text of an abstract in the txt variable, we can use the following code to analyze it:

poller = client.begin_analyze_healthcare_entities([txt])
res = list(poller.result())
print(res)

Before making this call, you need to create a TextAnalyticsClient object, passing in your endpoint and access key. You get those values from the Cognitive Services / Text Analytics resource that you need to create in your Azure subscription, either through the portal or via the command line.

In addition to just the list of entities, we also get the following:

  • Entity Mapping, which links entities to standard medical ontologies, such as UMLS.
  • Relations between entities inside the text, such as TimeOfCondition, etc.
  • Negation, which indicates that an entity was used in a negative context, for example, COVID-19 diagnosis did not occur.
Image by author

In addition to using the Python SDK, you can also call Text Analytics through the REST API directly. This is useful if you are using a programming language that does not have a corresponding SDK, or if you prefer to receive the Text Analytics result in JSON format for further storage or processing. In Python, this can easily be done using the requests library:

uri = f"{endpoint}/text/analytics/v3.1/entities/health/jobs?model-version=v3.1"
headers = { "Ocp-Apim-Subscription-Key": key }
resp = requests.post(uri, headers=headers, data=doc)
res = resp.json()
if res['status'] == 'succeeded':
    result = res['results']
else:
    result = None

The resulting JSON will look like this:

{ "id": "jk62qn0z",
  "entities": [
    { "offset": 24, "length": 28, "text": "coronavirus disease pandemic",
      "category": "Diagnosis", "confidenceScore": 0.98,
      "isNegated": false },
    { "offset": 54, "length": 8, "text": "COVID-19",
      "category": "Diagnosis", "confidenceScore": 1.0, "isNegated": false,
      "links": [
        { "dataSource": "UMLS", "id": "C5203670" },
        { "dataSource": "ICD10CM", "id": "U07.1" }, ... ] }, ... ],
  "relations": [
    { "relationType": "Abbreviation", "bidirectional": true,
      "source": "#/results/documents/2/entities/6",
      "target": "#/results/documents/2/entities/7" }, ... ]
}
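For downstream processing, it is handy to flatten such a document into simple tuples. A minimal sketch, assuming only the field names visible in the sample above (`entities`, `text`, `category`, `links`, `dataSource`):

```python
def extract_entities(doc):
    """Collect (text, category, UMLS id) tuples from one Text Analytics
    for Health result document shaped like the sample above."""
    rows = []
    for e in doc.get("entities", []):
        # pick the UMLS link if present, otherwise None
        umls = next((l["id"] for l in e.get("links", [])
                     if l["dataSource"] == "UMLS"), None)
        rows.append((e["text"], e["category"], umls))
    return rows

# a trimmed-down document in the format shown above
doc = {
    "id": "jk62qn0z",
    "entities": [
        {"offset": 54, "length": 8, "text": "COVID-19",
         "category": "Diagnosis", "confidenceScore": 1.0,
         "links": [{"dataSource": "UMLS", "id": "C5203670"},
                   {"dataSource": "ICD10CM", "id": "U07.1"}]},
    ],
}
print(extract_entities(doc))  # [('COVID-19', 'Diagnosis', 'C5203670')]
```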

Note: In production, you may want to add code that retries the operation when the service returns an error. For more guidance on properly implementing Cognitive Services REST clients, you can check the source code of the Azure Python SDK, or use Swagger to generate client code.
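A minimal retry helper along those lines might look like this (a sketch with exponential backoff, not the SDK's actual retry policy; the `flaky` stub stands in for the real REST call):

```python
import time

def with_retries(fn, attempts=3, delay=0.5, retriable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retriable errors."""
    for i in range(attempts):
        try:
            return fn()
        except retriable:
            if i == attempts - 1:
                raise              # out of attempts: propagate the error
            time.sleep(delay * (2 ** i))

# usage with a flaky stub standing in for the REST call
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"status": "succeeded"}

print(with_retries(flaky, attempts=5, delay=0.01))  # {'status': 'succeeded'}
```

Production code should also honor the Retry-After header returned by the service when it throttles requests.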

Processing All Papers in Parallel

Since the dataset currently contains 800K paper abstracts, processing them sequentially through Text Analytics would be quite time-consuming, likely taking a couple of days. To run this code in parallel, we can use technologies such as Azure Batch or Azure Machine Learning. Both of them allow you to create a cluster of identical virtual machines and run the same code on all of them in parallel.

Image by author

Azure Machine Learning is a service intended to satisfy all the needs of a data scientist. It is typically used for training and deploying models and ML pipelines; however, we can also use it to run our parallel sweep job across a compute cluster. To do that, we need to submit a sweep_job experiment.

There are a few ways you can work with Azure ML and submit experiments:

  • Interactively through Azure Portal. This is probably best for beginners, but not easy to replicate or document.
  • Using the Azure ML Python SDK. It allows you to define all properties of an experiment in code; however, the Python code tends to contain a lot of boilerplate.
  • From command-line, using YAML files to define the parameters. This is currently the recommended way to go.
  • From Visual Studio Code Azure ML Extension — it is essentially very similar to the approach above, but VS Code helps you by providing autocomplete options to simplify authoring all configuration files, and it can also submit the commands for you.

First of all, we need to create an Azure Machine Learning workspace, and a cluster to run our experiment on. This is done via the Azure CLI:

$ az ml workspace create -w AzMLWorkspace -l westus -g MyGroup
$ az ml compute create -n AzMLCompute --size Standard_NC --max-node-count 8

We also need to upload our CORD dataset into Azure ML. We first define the dataset in the YAML file data_metacord.yml:

name: metacord
version: 1
local_path: Metadata.csv

Then we upload the dataset to the cloud:

$ az ml data create -f data_metacord.yml

We also need to define an environment our script will run in. An environment is essentially a container, defined by specifying a base container image and applying some additional configuration on top of it. Here, we define the environment in env.yml:

name: cognitive-env
version: 1
docker:
  image: mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
conda_file: file:./cognitive_conda.yml

We start with a standard Azure ML Ubuntu container, and specify additional Python dependencies in cognitive_conda.yml:

channels:
  - conda
dependencies:
  - python=3.8
  - pip
  - pip:
    - azure-cosmos
    - azure.ai.textanalytics
    - requests

We create the environment by running

az ml environment create -f env.yml

To define a sweep job, we will use the following YAML file sweepexp.yml:

experiment_name: sweep_experiment
algorithm: grid
type: sweep_job
search_space:
  number:
    type: choice
    values: [0,1,2,3,4,5,6,7]
trial:
  command: python process.py --number {search_space.number} --nodes 8 --data {inputs.metacord}
inputs:
  metacord:
    data: azureml:metacord:1
    mode: download
code:
  local_path: .
environment: azureml:cognitive-env:1
compute:
  target: azureml:AzMLCompute
max_concurrent_trials: 8
timeout_minutes: 10000

Here we define a search space with an integer parameter number, which takes values from 0 to 7. We allow up to 8 concurrent runs, each of which calls the process.py script, passing it command-line parameters for the dataset, the total number of concurrent runs, and the individual run --number, which varies from 0 to 7.
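This partitioning scheme is easy to verify in isolation; `partition` below is a hypothetical helper mirroring the `i % nodes == number` check used in process.py:

```python
def partition(records, nodes, number):
    """Records handled by run `number` out of `nodes` parallel runs,
    mirroring the i % nodes == number check in process.py."""
    return [r for i, r in enumerate(records) if i % nodes == number]

records = list(range(20))   # stand-in for the metadata rows
parts = [partition(records, 8, n) for n in range(8)]
print(parts[0])             # [0, 8, 16]

# every record is handled by exactly one run
assert sorted(sum(parts, [])) == records
```

Because the split is by row index rather than by content, all runs receive nearly equal shares without any coordination between nodes.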

Note that we also specify the environment name here, and the compute name. If you are using Visual Studio Code with Azure ML Extension to create those scripts, you can use auto-complete (press Ctrl-Space) to populate the names of the fields, and required values (such as available compute names, container names, etc.) automatically.

The processing logic will be encoded in the Python script, and will be roughly the following:

## process command-line arguments using ArgParse

df = pd.read_csv(args.data)  # get Metadata.csv into a Pandas DataFrame

## connect to the database
coscli = azure.cosmos.CosmosClient(cosmos_uri, credential=cosmoskey)
cosdb = coscli.get_database_client("CORD")
cospapers = cosdb.get_container_client("Papers")

## process papers
for i, (id, x) in enumerate(df.iterrows()):
    if i % args.nodes == args.number:  # process only our portion of the records
        # process the record using a REST call (see code above)
        # store the JSON result in the database
        cospapers.upsert_item(json)

For simplicity, we will not show the complete script here.

Using CosmosDB to Store Analytics Result

Using the code above, we have obtained a collection of papers, each having a number of entities and corresponding relations.

This structure is inherently hierarchical, and the best way to store and process it is to use a NoSQL approach. In Azure, Cosmos DB is a universal database that can store and query semi-structured data like our JSON collection, so it makes sense to upload all JSON files into a Cosmos DB collection. The code shown above demonstrates how we can store JSON documents directly into the Cosmos DB database from our processing scripts running in parallel.

Image by author

We have assumed that you have created a Cosmos DB database called 'CORD', and stored the required credentials in the cosmos_uri and cosmoskey variables.

After running this code, we end up with a container Papers holding all the metadata. We can now work with this container in the Azure portal by going to Data Explorer:

Image by author

Now we can use Cosmos DB SQL in order to query our collection. For example, here is how we can obtain the list of all medications found in the corpus:

-- unique medication names
SELECT DISTINCT e.text
FROM papers p
JOIN e IN p.entities
WHERE e.category='MedicationName'

Using SQL, we can formulate some very specific queries. Suppose a medical specialist wants to find all proposed dosages of a specific medication (say, hydroxychloroquine), and see all papers that mention those dosages. This can be done using the following query:

-- dosage of specific drug with paper titles
SELECT p.title, r.source.text
FROM papers p JOIN r IN p.relations
WHERE r.relationType='DosageOfMedication'
AND r.target.text LIKE 'hydro%'

A more difficult task is to select all entities together with their corresponding ontology IDs. This would be extremely useful, because eventually we want to be able to refer to a specific entity (hydroxychloroquine) regardless of the way it is mentioned in a paper (for example, HCQ refers to the same medication). We will use UMLS as our main ontology.

--- get entities with UMLS IDs
SELECT e.category, e.text,
ARRAY (SELECT VALUE l.id
FROM l IN e.links
WHERE l.dataSource='UMLS')[0] AS umls_id
FROM papers p JOIN e IN p.entities

Creating Interactive Dashboards

While the ability to use a SQL query to answer a specific question, like medication dosages, is a very useful tool, it is not convenient for non-IT professionals without a high level of SQL mastery. To make the collection of metadata accessible to medical professionals, we can use Power BI to create an interactive dashboard for entity/relation exploration.

Image by author

In the example above, you can see a dashboard of different entities. One can select the desired entity type on the left (e.g. Medication Name in our case) and observe all entities of this type on the right, together with their counts.

You can also see associated UMLS IDs in the table, and from the example above one can notice that several entities can refer to the same ontology ID (hydroxychloroquine and HCQ).
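The normalization this enables can be sketched in plain Python: different surface forms collapse to one ontology ID. The (text, id) pairs below are hypothetical examples, with C0020336 standing in for the hydroxychloroquine concept:

```python
from collections import defaultdict

# hypothetical (text, UMLS id) pairs as returned by the entity query
mentions = [("hydroxychloroquine", "C0020336"),
            ("HCQ", "C0020336"),
            ("azithromycin", "C0052796")]

by_id = defaultdict(set)
for text, umls_id in mentions:
    by_id[umls_id].add(text)      # all surface forms of one concept

print(sorted(by_id["C0020336"]))  # ['HCQ', 'hydroxychloroquine']
```

Counting mentions per ontology ID, rather than per surface form, is what makes the dashboard statistics meaningful.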

To make this dashboard, we need Power BI Desktop. First we import the Cosmos DB data; the tool supports direct import of data from Azure.

Image by author

Then we provide the SQL query to get all entities with their corresponding UMLS IDs (the one shown above), and one more query to list all unique categories. Then we drag those two tables onto the Power BI canvas to get the dashboard shown above.

The tool automatically detects that the two tables are linked by the field named category, and supports filtering the second table based on the selection in the first one.

Similarly, we can create a tool to view relations:

Image by author

With this tool, we can make queries similar to the one we made above in SQL, to determine dosages of a specific medication. To do so, we select the DosageOfMedication relation type in the left table, and then filter the right table by the medication we want.

It is also possible to create further drill-down tables to display specific papers that mention the selected dosages of a medication, making this tool a useful research instrument for medical scientists.

Getting Automatic Insights

The most interesting part of the story, however, is to draw some automatic insights from the text, such as the change in medical treatment strategy over time. To do this, we need to write some more code in Python to do proper data analysis.

The most convenient way to do that is to use Notebooks embedded into Cosmos DB:

Image by author

Those notebooks support embedded SQL queries; thus we can execute a SQL query and load the results into a Pandas DataFrame, the Python-native way to explore data:

%%sql --database CORD --container Papers --output meds
SELECT e.text, e.isNegated, p.title, p.publish_time,
       ARRAY (SELECT VALUE l.id FROM l IN e.links
              WHERE l.dataSource = 'UMLS')[0] AS umls_id
FROM papers p
JOIN e IN p.entities
WHERE e.category = 'MedicationName'

Here we end up with meds DataFrame, containing names of medicines, together with corresponding paper titles and publishing date. We can further group by ontology ID to get frequencies of mentions for different medications:

unimeds = meds.groupby('umls_id') \
              .agg({'text': lambda x: ','.join(x),
                    'title': 'count',
                    'isNegated': 'sum'})
unimeds['negativity'] = unimeds['isNegated'] / unimeds['title']
unimeds['name'] = unimeds['text'] \
    .apply(lambda x: x if ',' not in x else x[:x.find(',')])
unimeds.sort_values('title', ascending=False).drop('text', axis=1)

This gives us the following table:

Image by author

From this table, we can select the top-15 most frequently mentioned medications:

top = {
    x[0]: x[1]['name'] for i, x in zip(range(15),
        unimeds.sort_values('title', ascending=False).iterrows())
}

To see how frequency of mentions for medications changed over time, we can average out the number of mentions for each month:

# First, get table with only top medications
imeds = meds[meds['umls_id'].apply(lambda x: x in top.keys())].copy()
imeds['name'] = imeds['umls_id'].apply(lambda x: top[x])

# Create a computable field with month
imeds['month'] = imeds['publish_time'].astype('datetime64[M]')

# Group by month
medhist = imeds.groupby(['month','name']) \
               .agg({'text': 'count',
                     'isNegated': [positive_count, negative_count]})
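The positive_count and negative_count custom aggregators are not shown in the post; one plausible definition (an assumption on my part) counts the boolean isNegated flags within each group:

```python
import pandas as pd

def positive_count(series):
    """Mentions in a positive (non-negated) context."""
    return int((~series.astype(bool)).sum())

def negative_count(series):
    """Mentions in a negated context."""
    return int(series.astype(bool).sum())

s = pd.Series([False, True, False])  # isNegated flags for three mentions
print(positive_count(s), negative_count(s))  # 2 1
```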

As a result, medhist contains the number of positive and negative mentions of each medication per month. From there, we can plot the corresponding graphs using Matplotlib:

medh = medhist.reset_index()
fig, ax = plt.subplots(5, 3)
for i, n in enumerate(top.keys()):
    medh[medh['name'] == top[n]] \
        .set_index('month')['isNegated'] \
        .plot(title=top[n], ax=ax[i//3, i%3])
fig.tight_layout()

Image by author

Visualizing Terms Co-Occurrence

Another interesting insight is to observe which terms occur frequently together. To visualize such dependencies, there are two types of diagrams:

  • Sankey diagram allows us to investigate relations between two types of terms, e.g. diagnosis and treatment.
  • Chord diagram helps to visualize co-occurrence of terms of the same type (e.g. which medications are mentioned together).

To plot both diagrams, we need to compute a co-occurrence matrix, whose element in row i and column j contains the number of co-occurrences of terms i and j in the same abstract (when both axes use the same set of terms, this matrix is symmetric). The way we compute it is to manually select a relatively small number of terms for our ontology, grouping some terms together if needed:

treatment_ontology = {
    'C0042196': ('vaccination', 1),
    'C0199176': ('prevention', 2),
    'C0042210': ('vaccines', 1), ... }

diagnosis_ontology = {
    'C5203670': ('COVID-19', 0),
    'C3714514': ('infection', 1),
    'C0011065': ('death', 2),
    'C0042769': ('viral infections', 1),
    'C1175175': ('SARS', 3),
    'C0009450': ('infectious disease', 1), ... }

Then we define a function to compute co-occurrence matrix for two categories specified by those ontology dictionaries:

def get_matrix(cat1, cat2):
    d1 = {i: j[1] for i, j in cat1.items()}
    d2 = {i: j[1] for i, j in cat2.items()}
    s1 = set(cat1.keys())
    s2 = set(cat2.keys())
    a = np.zeros((len(cat1), len(cat2)))
    for i in all_papers:
        ent = set(get_entities(i))  # as a set, so we can intersect with s1/s2
        for j in ent & s1:
            for k in ent & s2:
                a[d1[j], d2[k]] += 1
    return a

Here the get_entities function returns the list of UMLS IDs of all entities mentioned in a paper, and all_papers is a generator that yields the complete list of paper abstract metadata.
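To sanity-check the counting logic, we can run it on toy data (the paper IDs and entity sets below are hypothetical, and plain lists stand in for the numpy array):

```python
# toy stand-ins for the corpus helpers used by get_matrix
paper_entities = {
    "p1": {"C5203670", "C0042196"},              # COVID-19 + vaccination
    "p2": {"C5203670", "C0042196", "C0011065"},  # ... + death
    "p3": {"C0011065"},                          # death only
}
all_papers = list(paper_entities)
def get_entities(pid):
    return paper_entities[pid]

cat1 = {"C5203670": ("COVID-19", 0), "C0011065": ("death", 1)}
cat2 = {"C0042196": ("vaccination", 0)}

# same counting logic as get_matrix above, with nested lists
d1 = {k: v[1] for k, v in cat1.items()}
d2 = {k: v[1] for k, v in cat2.items()}
a = [[0] * len(cat2) for _ in cat1]
for p in all_papers:
    ent = set(get_entities(p))
    for j in ent & set(cat1):
        for k in ent & set(cat2):
            a[d1[j]][d2[k]] += 1
print(a)  # [[2], [1]]
```

COVID-19 co-occurs with vaccination in two papers, death with vaccination in one, matching the counts by hand.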

To actually plot the Sankey diagram, we can use Plotly graphics library. This process is well described here, so I will not go into further details. Here are the results:

Image by author
Image by author

Plotting a chord diagram is not easy with Plotly, but it can be done with a different library, Chord. The main idea remains the same: we build the co-occurrence matrix using the same function described above, passing the same ontology twice, and then pass this matrix to Chord:

def chord(cat):
    matrix = get_matrix(cat, cat)
    np.fill_diagonal(matrix, 0)
    names = list(cat.keys())
    Chord(matrix.tolist(), names, font_size="11px").to_html()

The resulting chord diagrams for treatment types and medications are shown below:

Image by author

The diagram on the right shows which medications are mentioned together (in the same abstract). Well-known combinations, such as hydroxychloroquine + azithromycin, are clearly visible.

Conclusion

In this post, we have described the architecture of a proof-of-concept system for knowledge extraction from large corpora of medical texts. We use Text Analytics for Health to perform the main task of extracting entities and relations from text, and then a number of Azure services together to build a query tool for medical scientists and to extract some visual insights.

This post is quite conceptual at the moment, and the system can be further improved by providing more detailed drill-down functionality in the Power BI module, as well as by doing more data exploration on the extracted entity/relation collection. It would also be interesting to switch to processing full-text articles, in which case we need to think about slightly different criteria for co-occurrence of terms (e.g. in the same paragraph vs. the same paper).

The same approach can be applied in other scientific areas, but we would need to be prepared to train a custom neural network model to perform entity extraction.

This task has been briefly outlined above (when we talked about the use of BERT), and I will try to focus on it in one of my next posts. Meanwhile, feel free to reach out to me if you are doing similar research, or have any specific questions on the code and/or methodology.

Author

Cloud Developer Advocate / Software Engineer at Microsoft, Associate Professor at MIPT, HSE and MAI