MEDINFO 2019: Health and Wellbeing e-Networks for All 1263
L. Ohno-Machado and B. Séroussi (Eds.)
© 2019 International Medical Informatics Association (IMIA) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI190429
SNOMEDtxt: Natural Language Generation from SNOMED Ontology
Olga Lyudovyk, Chunhua Weng
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Abstract

SNOMED Clinical Terms (SNOMED CT) defines over 70,000 diseases, including many rare ones. Meanwhile, descriptions of rare conditions are missing from online educational resources. SNOMEDtxt converts ontological concept definitions and relations contained in SNOMED CT into narrative disease descriptions using Natural Language Generation techniques. Generated text is evaluated using both computational methods and clinician and lay user feedback. User evaluations indicate that lay people prefer generated text to the original SNOMED content, find it more informative, and understand it significantly better. This method promises to improve access to clinical knowledge for patients and the medical community and to assist in ontology auditing through natural language descriptions.

[Figure 1 – Counts of Diseases in [Link], [Link], [Link], [Link], [Link], [Link], [Link], SNOMED [2] (Nov. 10, 2018); counts shown: 952, 1016, 1500, 1570, 2215, 2938, 7600, 78561]
Keywords:
Systematized Nomenclature of Medicine, Natural Language Processing, Access to Information

Introduction

SNOMED CT is the world’s most comprehensive clinical terminology [1]. The March 2018 release of the US version contains 347,231 unique concepts, including 78,561 diseases, and defines 1,088,068 unique active relationships between these concepts [2]. In contrast, the largest professional medical reference source, Medscape ([Link]), contains 7,600 diseases, representing less than 10% of the diseases defined in SNOMED CT, and the largest consumer health resource, Mayo Clinic ([Link]), describes 2,215 diseases. Disease descriptions in these resources are manually curated, limiting the number of diseases that can be covered. Topics may be chosen according to popularity in search results [3]; thus, rare diseases are often excluded from these resources. Counts of disease concepts in major medical information sources are shown in Figure 1. A count for the Google Knowledge Graph is not available, but since it is curated from the sources listed in Figure 1, it is likely on the same order of magnitude.

While extensive, SNOMED CT is not easily accessible to the public and is known to be difficult to use even for clinicians without training in ontologies [4,5]. Like other structured ontologies, SNOMED CT is not designed to be used directly by lay people. The US version of SNOMED CT contains only 4,372 text definitions easily interpretable by untrained personnel, covering 2,608 diseases, corresponding to 1.3% of all SNOMED CT concepts and 3.3% of disease concepts.

We propose a method called SNOMEDtxt to automatically generate disease descriptions from SNOMED CT in order to make the valuable clinical knowledge contained in SNOMED CT available to both patients and the medical community.

An additional use case for SNOMEDtxt is to enable clinicians and domain experts without specialized technical training or experience working with structured terminologies to review and critique clinical knowledge defined in SNOMED CT. This task is critically important, as biomedical knowledge is growing exponentially, with numerous data types and tools emerging rapidly on a daily basis. For example, Campbell et al. reported that the absence of a robust granular ontology represents a barrier to capturing and analyzing data in the fields of cancer research and precision medicine [6], while Fung et al. made a similar observation in the area of rare diseases [7]. However, ontology auditing or quality ascertainment is largely performed by knowledge engineers with specialized training in ontology design and maintenance. This workforce is rare, creating a bottleneck for scalable ontology expansion and for crowdsourcing ontology auditing. SNOMEDtxt represents concepts and related information as natural text, thus expanding the group of potential reviewers to include medical professionals who are not necessarily familiar with structured ontologies. Wider review of SNOMED CT by clinicians can be expected to improve accuracy, reduce missing information, and enable faster SNOMED CT evolution as the body of clinical knowledge expands.

Natural Language Generation (NLG) is a technology that uses computational methods to generate natural language descriptions from structured knowledge or data representations. Attempts to apply NLG to generate text from SNOMED CT have been reported by Liang et al. [8] and Kanhov et al. [9]. Liang and colleagues developed OntoVerbal, a generic tool for ontology verbalization that was then applied to SNOMED CT. While Kanhov and colleagues utilized an off-the-shelf natural language generator, they developed a methodology for user evaluation of the fluidity and readability of NLG texts in the biomedical domain. OntoVerbal was developed as a Protégé 4.2 plugin and is not available in more recent Protégé versions or as a standalone application. The NLG system developed by Kanhov et al. was not made available for download or use.

OntoVerbal implements a generic verbalization approach for ontologies, with an emphasis on the ability to handle any OWL ontology and generate natural language descriptions for any entity type in that ontology [8]. This approach restricts the handling of relationships, or ontology axioms, to generic lexical choices and results in some redundant and inelegant phrases, such as “chronic disease of the genitourinary system … has a finding site in a structure of the genitourinary system.” In contrast, our method trades off generalizability for improved readability and comprehensibility through more specific verbalizations of SNOMED CT axioms and simplifying structures tailored to SNOMED CT concepts, so that the same construct is simplified by SNOMEDtxt as “… affects the genitourinary system.” Moreover, OntoVerbal takes the generic approach of ordering information from simpler sentences to more complex ones, whereas SNOMEDtxt follows the common flow of information found in disease descriptions in reference medical texts: the definition is followed by possible causes, presentation, diagnosis, clinical course, and finally additional information.

SNOMEDtxt is a novel NLG engine and interface, intended to evolve and improve over time with user feedback. The current version focuses specifically on disease concepts and can be easily extended to summarize procedures, treatments, and other information contained within SNOMED CT and relevant to the wider audience.

Methods

SNOMEDtxt follows a 4-step framework outlined in Figure 2 to generate a disease description for a given disease.

[Figure 2 – Framework for Disease Description Generation]

Concept Search And Information Retrieval

The current implementation of SNOMEDtxt is based on the 03/01/2018 release of the SNOMED CT terminology, US edition, Snapshot version, available for download from the SNOMED CT website [2]. SNOMEDtxt uses a local copy of this database. The system can randomly sample diseases from SNOMED CT and search for disease names entered by a user. The search is undertaken in two steps: first, a simple match on concept names and synonyms in the SNOMED CT database is attempted. If the search term is not found, the system then uses the SNOMED CT Analyzer API (snomedct.t3as.org) to search for the term, provided that the API site is online.

Concepts are the key component of SNOMED CT. They are organized in a polyhierarchical structure with the “Is-A” (parent-child) relationship and can be additionally defined or described through other relationships. Each relationship has a type, a source concept, and a destination concept. Once a disease Concept ID is found, relevant relationships are retrieved from the SNOMED CT database:

• Relationships where the searched disease Concept ID is the source

• Is-A relationships where the searched disease Concept ID is the destination: the source concepts represent subtypes or examples of the disease and are included in the definition

Concept names are then retrieved for the corresponding target concepts. The generated text is the product of concept names arranged in lexical patterns corresponding to the types of relationship between these concepts. Concept names undergo minimal string cleaning to remove non-informative structures such as “(Disorder)” and “(Body Structure)”.

Structure And Aggregation

In order to produce fluid and coherent text and avoid redundancy wherever possible, SNOMEDtxt aggregates and structures information in three steps: first, it groups all target nodes for the same relationship; second, it organizes relationships into broad logical groups; third, it orders relationships within each group, and the groups themselves, following a typical flow of information in a disease description in medical reference texts. This stepwise grouping of relationships is a simplified application of Rhetorical Structure Theory [10], which describes a recursive approach to organizing relationships in a text.

Table 1 – Organizing Relationships

Group | Relationship | Lexical Pattern
Definition | IS-A | “is a kind of”
Definition | Finding site | “that affects the”
Definition | Has definitional manifestation | “It manifests itself in”
Definition | Associated morphology | “The associated morphology is”
Definition | Pathological process | “Pathological process associated with … is”
Definition | Children: IS-A, searched term = destination | “An example of … is” / “Examples of … are”
Causality | Causative agent | “is caused by”
Causality | Due to | “occurs due to”
Causality | Associated with | “is associated with”
Temporality | Occurrence | “presents in” (period)
Temporality | During/Following/After | “can occur during / following / after”
Temporality | Temporally related to | “can be temporally related to”
Diagnosis | Finding method | “is discovered by”
Diagnosis | Finding informer | “<is discovered> through”
Clinical Course | Clinical course | “Clinical course is”
Clinical Course | Severity | “The severity of … is”
Clinical Course | Episodicity | “The episodicity of … is”
Other | Interprets | “interprets or evaluates”
Other | Has interpretation | “… as”
Other | Other | “Other related concepts include…”
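The retrieval-and-verbalization flow just described (group the retrieved relationships by type, map each type to a lexical pattern, and concatenate the target names) can be sketched as follows. This is a minimal illustration in Python rather than the authors' R implementation: the pattern strings are simplified from Table 1, the input tuples are hypothetical, and the sentence-combining step described later under Text Realization is omitted.

```python
# Sketch of the SNOMEDtxt grouping-and-verbalization idea: relationships
# retrieved for a concept are grouped by type, each type is mapped to a
# lexical pattern (simplified from Table 1), and target names for the same
# type are concatenated in "A, B, and C" form. Patterns and inputs are
# illustrative only, not the authors' exact tables.

PATTERNS = {  # relationship type -> (group order, lexical pattern)
    "Is a":            (0, "{name} is a kind of {targets}."),
    "Finding site":    (1, "It affects the {targets}."),
    "Causative agent": (2, "{name} is caused by {targets}."),
    "Occurrence":      (3, "{name} presents in {targets}."),
}

def join_names(names):
    """Concatenate target names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"

def verbalize(name, relationships):
    """relationships: list of (relationship_type, target_concept_name)."""
    grouped = {}
    for rel_type, target in relationships:
        grouped.setdefault(rel_type, []).append(target)
    sentences = []
    # order sentences by the group order assigned to each relationship type
    for rel_type in sorted(grouped, key=lambda t: PATTERNS[t][0]):
        pattern = PATTERNS[rel_type][1]
        sentences.append(
            pattern.format(name=name, targets=join_names(grouped[rel_type])))
    return " ".join(sentences)

print(verbalize("Asthma", [
    ("Is a", "Respiratory disorder"),
    ("Finding site", "Airway"),
]))
# prints: Asthma is a kind of Respiratory disorder. It affects the Airway.
```

A real implementation would additionally merge the Is-A and Finding site sentences into one ("… is a kind of … that affects the …"), as the Text Realization step describes.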
Text Realization

The first task SNOMEDtxt undertakes in the Text Realization phase is constructing an informative disease name. If the search term is significantly different from the preferred term for the disease concept, as measured by the Jaro-Winkler string distance [11], the disease description will combine both in the form of “<Preferred disease concept name> (also known as <searched term>)”, e.g. “Influenza (also known as flu)”.

Additionally, SNOMEDtxt concatenates all target nodes for the same relationship type which were aggregated in the previous step, following the “A and B” / “A, B, and C” format. When concatenating examples of a given disease, SNOMEDtxt selects a maximum of three examples, based on the largest string dissimilarity with the given disease name, as a tradeoff between completeness and relevancy.

Finally, relationship types are converted into corresponding lexical patterns (see Table 1) and sentences are generated. For the sake of conciseness, relationships in the same group are combined into one sentence wherever this approach produces fluid text. For example, the Is-A and the Finding site relationships are combined into one sentence that forms the concise definition of the disease: “Asthma is a kind of Respiratory disorder that affects the Airway”. Sentences are then ordered according to the order of relationships in Table 1.

Results

User Interface of SNOMEDtxt

A simple user interface is implemented in RShiny and is available online at [Link]

[Figure 3 – Screenshot of SNOMEDtxt Interface]

An example disease description generated by SNOMEDtxt and the corresponding concatenated SNOMED CT content are illustrated below.

SNOMEDtxt Disease Description

Lupus erythematosus (also known as Lupus) is a kind of Autoimmune disease and Connective tissue disease that affects Connective tissue. Some examples of Lupus erythematosus are Systemic lupus erythematosus, Drug-induced lupus erythematosus, and Neonatal lupus erythematosus. Pathological process associated with Lupus erythematosus is AI - autoimmune. Other related concepts are Cutaneous lupus erythematosus, Lupus erythematosus profundus, and Discoid lupus erythematosus of eyelid.

SNOMED CT Content

ConceptID: 200936003. Terms: Lupus erythematosus, LE - Lupus erythematosus, Lupus, Lupus erythematosus (disorder). Relationships: Disorder of connective tissue (disorder) = Is a (attribute). Connective tissue structure (body structure) = Finding site (attribute). Autoimmune disease (disorder) = Is a (attribute). Autoimmune (qualifier value) = Pathological process (attribute). Related concepts: Systemic lupus erythematosus (disorder) - Is a (attribute). Drug-induced lupus erythematosus (disorder) - Is a (attribute). Neonatal lupus erythematosus (disorder) - Is a (attribute). Discoid lupus erythematosus (disorder) - Due to (attribute).

We evaluated disease descriptions generated by SNOMEDtxt against the concatenated SNOMED CT content using computed metrics and user evaluations. Both sets of evaluations indicate that SNOMEDtxt succeeds in making SNOMED CT content more readable and comprehensible.

Computed Metrics

We computed readability and redundancy metrics for disease definitions of the top 20 most searched diseases in 2017 [13] and of 20 diseases randomly retrieved from SNOMED CT:

1. Readability: the Flesch-Kincaid grade level (FK) and the Automated Readability Index (ARI) estimate the number of years of education needed to understand a text. We calculated both with the sylcount R package [12].

2. Redundancy: calculated as the ratio of unique word count to total word count after removing stop words.

Full summaries of health concepts retrieved from the Medline Plus web service ([Link]) were used as a reference for the first set of disease concepts. Since only 4 out of 20 randomly sampled disease concepts had a reference health topic in Medline Plus, a comparison with a reference is not provided for the second set.

Table 2 – Evaluation with Computed Metrics

 | Readability: FK | Readability: ARI | Words | Redundancy: Unique/All
Top 20 most searched diseases
SNOMEDtxt | 14.3 | 12.0 | 49.3 | 0.74
SNOMED CT | 17.9 | 15.0 | 64.1 | 0.55
Reference | 6.6 | 6.1 | 263 | 0.77
Random 20 SNOMED CT disease concepts
SNOMEDtxt | 11.7 | 9.7 | 47.3 | 0.69
SNOMED CT | 15.7 | 13.8 | 69.7 | 0.56

For both measures of readability, a lower score indicates a lower grade of education needed to understand the text and therefore better readability. These metrics indicate that SNOMEDtxt texts are more readable than the original SNOMED CT content. For the 20 most searched diseases, the average FK score for SNOMEDtxt texts (14.3) is equivalent to the second year of an undergraduate degree, while the FK score for SNOMED CT content (17.9) corresponds to the graduate school level. The ARI score of 12.0 for SNOMEDtxt is equivalent to twelfth grade, while the ARI of 15.0 for SNOMED CT content indicates that the text is appropriate for readers at the professor level. Readability scores for the MedlinePlus reference texts are significantly lower, indicating that they can be read by a much wider audience than either SNOMEDtxt or the original SNOMED CT content.

SNOMEDtxt texts also improve on the redundancy metric compared to SNOMED CT content, for both the top 20 searched diseases (0.74 vs. 0.55) and the 20 randomly sampled diseases (0.69 vs. 0.56).
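The readability and redundancy computations above can be reproduced with a short script. The sketch below is a crude Python approximation for illustration only: the paper computed FK and ARI with the sylcount R package, whereas here syllables are estimated as vowel runs and the stop-word list is a small assumed sample, so scores will not match Table 2 exactly. The FK and ARI formulas themselves are the standard published ones.

```python
import re

# Crude re-implementation of the metrics reported in Table 2. The syllable
# counter (runs of vowels) and the stop-word list are simplified
# approximations, not the sylcount package the authors used.

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "it"}

def count_syllables(word):
    """Approximate syllables as runs of vowel letters (a common heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return (Flesch-Kincaid grade level, Automated Readability Index)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    chars = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    fk = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    ari = 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43
    return fk, ari

def redundancy(text):
    """Ratio of unique words to all words after removing stop words."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    content = [w for w in words if w not in STOP_WORDS]
    return len(set(content)) / len(content)

sample = "Asthma is a kind of Respiratory disorder that affects the Airway."
fk, ari = readability(sample)
print(round(fk, 1), round(ari, 1), round(redundancy(sample), 2))
# prints: 10.2 7.2 1.0
```

Lower FK and ARI scores mean fewer years of schooling are needed; a redundancy ratio closer to 1.0 means fewer repeated content words.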
User Evaluation

A survey evaluating results of SNOMEDtxt was conducted among 51 lay people recruited using Amazon Mechanical Turk (MTurk) and 6 clinicians from Columbia University Medical Center. MTurk is a crowdsourcing marketplace that enables outsourcing tasks such as surveys to a distributed workforce for a small reward. Evaluations of all MTurk workers who applied and did not self-identify as clinicians were included in the results, as were evaluations of all 6 clinicians who responded to the survey. All evaluators were provided with a basic description of the project but were not aware of the study design or the research question.

We randomly selected a set of 20 disease concepts from SNOMED CT for evaluating readability, preference, accuracy, and completeness (set 1). Helpfulness was evaluated on a set of 20 disease concepts for which a medical reference text was available (set 2). Questions probing the degree of understanding were constructed for 10 diseases with sufficient information selected from 40 randomly sampled disease concepts (set 3). Comparison with OntoVerbal was restricted to 4 diseases for which an OntoVerbal description was available in [8] (set 4). For all 4 sets, we generated a SNOMEDtxt disease description and a concatenation of SNOMED CT content. The survey was conducted using the Qualtrics survey platform ([Link]) and included randomization: each evaluator was presented with 3 randomly selected diseases from set 1, 2 from set 2, 2 from set 3, and 1 from set 4.

In order to assess readability and general preference, we presented evaluators with SNOMEDtxt disease descriptions and the SNOMED CT content for 3 diseases from set 1 and asked whether one or the other was more readable or generally preferred, or whether there was no difference. Evaluators were not informed which text represented SNOMEDtxt output. Lay people found 76.5% of SNOMEDtxt disease descriptions easier to read than the SNOMED CT content, and preferred 69% of SNOMEDtxt descriptions to SNOMED CT content. Clinicians found 83% of SNOMEDtxt descriptions easier to read and preferred 44% of them to the SNOMED CT content.

We tested understanding by presenting the evaluators with either the SNOMEDtxt description or the SNOMED CT content for a concept, followed by a multiple choice question designed to test whether the evaluator understood the text; we then compared the number of correct answers given when presented with the SNOMEDtxt description or with the SNOMED content. The SNOMEDtxt format appeared to be significantly easier to understand for lay users: they gave the correct answer 72% of the time when presented with a SNOMEDtxt description and only 51% when presented with the original SNOMED CT content. There was no difference for clinicians: they gave the correct answer 100% of the time regardless of which text they were presented with.

To evaluate helpfulness, we presented evaluators with the SNOMEDtxt description, the SNOMED CT content, and a description of the same concept from either Medline Plus or the Google Knowledge Graph as a reference, and asked “How helpful was the terminology content compared to” the reference, on a scale from 1 to 10. Lay people found SNOMEDtxt descriptions more helpful: the average helpfulness score for SNOMEDtxt texts was 5.7, compared to 4.8 for SNOMED CT content. On the other hand, clinicians found SNOMEDtxt descriptions on average minimally less helpful than SNOMED CT content (3.50 versus 3.58).

Clinician evaluators were also asked to assess accuracy and completeness for disease concepts from set 1. In most cases clinicians thought the SNOMEDtxt descriptions were as accurate (72%) and as complete (78%) as the original content, while they found 28% of descriptions to be somewhat less accurate, 6% somewhat less complete, and 17% significantly less complete.

Table 3 – User Evaluation: SNOMEDtxt vs. SNOMED

Readability and Preference
 | SNOMEDtxt | SNOMED CT | No Difference
Lay Audience (n=51)
Easier to read | 76.5% | 14.4% | 9.2%
Preferred | 68.6% | 21.6% | 9.8%
Clinicians (n=6)
Easier to read | 83% | 11% | 6%
Preferred | 44% | 28% | 28%

Helpfulness and Understanding
 | SNOMEDtxt | SNOMED CT
Lay Audience (n=51)
Helpful (1-10) | 5.7 | 4.8
Correctly understood | 72.1% | 51%
Clinicians (n=6)
Helpful (1-10) | 3.50 | 3.58
Correctly understood | 100% | 100%

Accuracy and Completeness
SNOMEDtxt vs. SNOMED CT | Significantly Worse | Somewhat Worse | Same
Clinicians (n=6)
Accuracy | 0% | 28% | 72%
Completeness | 17% | 6% | 78%

A conclusive comparison between OntoVerbal and SNOMEDtxt was not feasible since only 4 disease descriptions were available for OntoVerbal. We conducted a limited comparison by presenting all evaluators with the SNOMED CT content and with a disease description from either OntoVerbal or SNOMEDtxt for the same disease (evaluators were unaware of the source of each text). All evaluators were asked which text they found easier to read and generally preferred; clinicians were additionally asked whether the text description was less accurate / complete than (denoted in Table 4 as “Worse”) or as accurate / complete as (denoted as “Same”) the SNOMED CT content. This limited comparative evaluation points to a preference for SNOMEDtxt disease descriptions, with the same or better performance on readability, accuracy, and completeness.

Table 4 – User Evaluation: Comparison with OntoVerbal

 | SNOMEDtxt | OntoVerbal | SNOMED CT | No Difference
Lay Audience (n=51)
Easier to read | 49% | 43% | 3.9% | 3.9%
Preferred | 52% | 31% | 11.8% | 3.9%
Clinicians (n=6)
Easier to read | 50% | 50% | 0% | 0%
Preferred | 50% | 17% | 17% | 17%

 | SNOMEDtxt vs. SNOMED CT | | OntoVerbal vs. SNOMED CT |
Clinicians (n=6) | Worse | Same | Worse | Same
Accuracy | 0% | 45% | 27% | 27%
Completeness | 18% | 37% | 18% | 27%

User evaluation demonstrates the potential utility of SNOMEDtxt for lay users: they find the generated disease descriptions more readable and easier to understand than the structured SNOMED
CT content. The accuracy and completeness of SNOMEDtxt’s natural language descriptions are close to the original SNOMED CT content. The use case of assisting in SNOMED CT content review would require some adjustments to the SNOMEDtxt design in order to produce more faithful representations of the SNOMED CT content.

Discussion

We introduce a method to generate disease descriptions directly from the SNOMED CT ontology for two main applications: providing access to definitions of rare diseases or disease variants not described in clinical reference resources, and enabling easier comprehension of SNOMED CT content for those reviewing, verifying, and extending the ontology.

In the design of SNOMEDtxt, we have made several choices that favor fluidity and ease of comprehension over a faithful and complete representation of information, at the risk of possible loss of information. The human evaluation of results confirms that we achieved this goal. However, these choices may not be appropriate when SNOMEDtxt output is used to verify the content of SNOMED CT. It may be desirable to provide users with configurations such as “more precise” and “easier to understand” when generating the natural language texts.

Another tradeoff made in the design of SNOMEDtxt was readability at the expense of generalizability. In order to extend SNOMEDtxt to other types of concepts or to other terminologies, the verbalizations of relationships and the handling of aggregated sentence structures would need to be adjusted.

A significant limitation to the use of SNOMEDtxt for the wider audience is the amount of content available for each disease concept in SNOMED CT. Expanding, i.e. explaining, some related nodes, for example the parent disease node or the finding site, may add meaningful and relevant information to the generated disease descriptions. A navigable user interface where a user could click on confusing terms and see them explained would be an alternative approach to this challenge. Developing APIs to access SNOMEDtxt would enable integration of textual disease descriptions into other electronic resources and reference materials, such as EHR help functions or patient portals. The search functionality in the current implementation is limited to an exact string match with either the SNOMED term name or any of the term’s synonyms and can be further improved with string search algorithms.

Results of the evaluation by lay people and clinicians presented in this paper are encouraging for the potential use of SNOMEDtxt in making SNOMED CT content more accessible and easier to review; however, a more rigorous evaluation with a larger audience and a greater number of tested concepts is recommended.

Finally, to allow the system to continuously learn and evolve, evaluation and feedback elicitation can be built into the user interface. Presenting users with different verbalization options at random and gathering user feedback would enable the system to learn verbalization patterns favored by users and evolve the NLG engine accordingly.

More broadly, natural language processing is growing in importance, with many potential applications in healthcare systems. NLG involves several important tradeoffs, which should be made with a specific application in mind. Two such tradeoffs are balancing completeness and accuracy on the one hand with fluidity and comprehensibility on the other, and generalizability versus linguistic polish and expressiveness.

Conclusion

This work presents an ontology verbalizer for SNOMED CT disease concepts: a tool that generates natural language concept descriptions balancing completeness and accuracy with ease of human comprehension. User evaluation shows that lay people prefer to read natural text instead of structured ontologies and understand textual descriptions better.

References

[1] J. Millar, The Need for a Global Language - SNOMED CT Introduction, Stud Health Technol Inform 225 (2016), 683-5.
[2] US National Library of Medicine, SNOMED CT United States Edition, [Link]/healthit/snomedct/us_edition.html (Accessed April 28, 2018).
[3] N. Miller, E.M. Lacroix, J.E. Backus, MEDLINEplus: building and maintaining the National Library of Medicine's consumer health Web service, BMLA 88(1) (2000), 11-7.
[4] S.Y. Kim et al., Comparison of Knowledge Levels Required for SNOMED CT Coding of Diagnosis and Operation Names in Clinical Records, Healthc Inform Res 18(3) (2012), 186-90.
[5] J.E. Andrews, R.L. Richesson, J. Krischer, Variation of SNOMED CT Coding of Clinical Research Concepts among Coding Experts, JAMIA 14(4) (2007), 497–506.
[6] W.S. Campbell et al., A computable pathology report for precision medicine: extending an observables ontology unifying SNOMED CT and LOINC, JAMIA 25(3) (2018), 259–266.
[7] K.W. Fung, Coverage of rare disease names in standard terminologies and implications for patients, providers, and research, AMIA Annu Symp Proc (2014), 564-72.
[8] S.F. Liang et al., OntoVerbal: A Generic Tool and Practical Application to SNOMED CT, IJASCA 4(6) (2013), 227-239.
[9] M. Kanhov, X. Feng, H. Dalianis, Natural Language Generation from SNOMED Specifications, CLEFeHealth 2012 Workshop (2012).
[10] M. Taboada, W.C. Mann, Applications of Rhetorical Structure Theory, Discourse Studies 8(4) (2006), 567–588.
[11] W.E. Winkler, String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, American Statistical Association (1990), 354–359.
[12] D. Schmidt, ‘sylcount’ R package (2017).
[13] K. Sheridan, The 20 most Googled diseases, StatNews (June 6, 2017).

Address for correspondence

Chunhua Weng, Department of Biomedical Informatics, Columbia University, email: chunhua@[Link]