Semantic Data Platform for Healthcare

ICT-611388
D3.1 Sketch of system architecture specification
WP3 – Architecture and Requirements
V1.0 Final
Lead beneficiary: MUG
Date: 31/03/2014
Nature: Report
Dissemination level: PU (Public)
D3.1 – Sketch of system architecture specification
WP3: Architecture and Requirements
Dissemination level: Public
Authors: Philipp Daumke, Carla Haid, Luke Mertens (Averbis),
Stefan Schulz (MUG)
Version: 1-0 Final
TABLE OF CONTENTS
DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
EXECUTIVE SUMMARY
KEY WORDS (WORDLE STYLE)
1. INTRODUCTION
1.1. ABOUT SEMCARE
1.1.1. MOTIVATION AND BACKGROUND
1.1.2. PROJECT DESCRIPTION
1.2. ABOUT THIS DOCUMENT
1.2.1. AIM OF THIS DOCUMENT
1.2.2. DOCUMENT STRUCTURE
2. APPLICATION SCENARIO / REQUIREMENTS
2.1. USE CASE
2.1.1. BACKGROUND AND MOTIVATION
2.1.2. APPROACH
2.1.3. TOPICS OF INTEREST AND THEIR (TEXTUAL) REPRESENTATION IN EHRS
2.2. REQUIREMENTS
2.2.1. FUNCTIONAL REQUIREMENTS
2.2.2. NON-FUNCTIONAL REQUIREMENTS
3. ARCHITECTURE
3.1. OVERVIEW
3.2. INTERFACES
3.3. DATA MODELS
3.3.1. INPUT DATA
3.3.2. TERMINOLOGIES
3.3.3. I2B2 STAR SCHEMA
3.3.4. SEMCARE PATIENT RECORD SOLR DOCUMENT
3.3.5. SEMCARE DATA LOADING FLOW
3.4. MODULES & FUNCTIONAL VIEW
3.4.1. SEMCARE DATA IMPORTER
3.4.2. SOLR
3.4.3. AVERBIS TEXT ANALYTICS (AEP)
3.4.4. SEMCARE PORTAL WEB APPLICATION
3.4.5. I2B2 APPLICATIONS
3.4.6. THIRD PARTY TOOLS AND APPLICATIONS
3.4.7. SCALABILITY
3.5. USERS & ROLES
3.6. OPEN POINTS
4. DATA PRIVACY / TECHNICAL AND ORGANIZATIONAL SECURITY PROCEDURES
4.1. DATA PROCESSING
4.2. DATA TRANSFER AND DATA LOCATION
4.3. ROLE CONCEPT
4.4. AVAILABILITY CONTROL
4.5. DATA SEPARATION CONTROL
© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union’s
Seventh Programme for research, technological development and demonstration under grant agreement No 611388.
TABLE OF FIGURES
FIGURE 1: SYSTEMS INVOLVED IN THE SEMCARE ARCHITECTURE
FIGURE 2: ARCHITECTURE SKETCH
FIGURE 3: ARCHITECTURE LAYERING
FIGURE 4: AVERBIS SEARCH REST API
FIGURE 5: REFINEMENT PROCESS FOR CRITERIA
FIGURE 6: I2B2 STAR SCHEMA
FIGURE 7: I2B2 CUSTOM_META TABLE
FIGURE 8: CUSTOM METADATA IN I2B2 TERM NAVIGATOR
FIGURE 9: SOLR TO I2B2 MAPPING
FIGURE 10: MAPPING OF SOLR DOCUMENTS TO I2B2 DATABASE
FIGURE 11: SEMCARE DATA LOADING FLOW
FIGURE 12: TALEND OPEN STUDIO DATA IMPORTER
FIGURE 13: SEMCARE COMPONENTS
DOCUMENT INFORMATION
Grant Agreement Number: ICT-611388
Acronym: SEMCARE
Full title: Semantic Data Platform for Healthcare
Project URL: www.semcare.eu
EU Project officer: Saila Rinne ([email protected])
Deliverable number: 3.1
Deliverable title: Sketch of system architecture specification
Work package number: 3
Work package title: Architecture and Requirements
Delivery date (contractual): 31.03.2014
Delivery date (actual):
Status: Version V1.0 Final
Nature: Report
Dissemination Level: Public
Authors (Partner): Philipp Daumke, Carla Haid, Luke Mertens (Averbis), Stefan Schulz (MUG)
Responsible Author: Stefan Schulz
Partner: MUG
Email: [email protected]
Phone: +43 699 150 96 270
DOCUMENT HISTORY
NAME                      DATE        VERSION  DESCRIPTION
Philipp Daumke            03.03.2014  0.1      Initial Creation
Carla Haid                07.03.2014  0.2      Additions
Luke Mertens              14.03.2014  0.3      Additions
Carla Haid                17.03.2014  0.4      Additions
Stefan Schulz             19.03.2014  0.5      Corrections, comments, additions
Carla Haid                21.03.2014  0.6      Corrections, additions
Luke Mertens              24.03.2014  0.7      Corrections, additions
Jan Kors                  24.03.2014  0.8      Corrections, additions
A. Honrado, E. Chavarría  30.03.2014  0.9      Internal formal review
Stefan Schulz             31.03.2014  1.0      Final version
DEFINITIONS
• Partners of the SEMCARE Consortium are referred to herein according to the following codes:
AVERBIS - Averbis GmbH (Germany) – Coordinator
EMC - Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
MUG - Medical University of Graz (Austria) – Beneficiary
SGUL - Saint George's University of London (UK) – Beneficiary
SYNAPSE - Synapse Research Management Partners S.L. (Spain) – Beneficiary
• Project: The sum of all activities carried out in the framework of the Grant Agreement.
• AEP: Averbis Extraction Platform; text analysis tool that extracts information units such as facts and
relations from unstructured text
• CUI: Concept unique identifier in the Unified Medical Language System (UMLS)
• EHR: Electronic health record; clinical data record of a patient
• ETL: Extract, transform, load; data warehousing process that is often used to integrate data
from multiple sources. A common ETL tool is Talend Open Studio.
• GUI: Graphical user interface of the application
• Graph DB: Database using graph structures with nodes, edges, and properties to represent and
store data. For highly interconnected data it can be faster and scale better than a relational database.
• HL7v2 format: Health Level Seven, version 2; widely used standard for the exchange of electronic
health information
• i2b2: Informatics for Integrating Biology and the Bedside; scalable informatics framework for
clinical data
• REST: Representational State Transfer; architectural style used here for communication between two
components, with messages encoded in JSON (JavaScript Object Notation)
• Solr: Open-source search platform built on Apache Lucene, with the Java client SolrJ
• Terminology: General term for information artefacts that provide controlled terms for a domain,
identifiers of meaning and semantic relations, e.g. SNOMED, ICD-10, MeSH
• Term Browser: Tool provided by Averbis to load, view, modify and export terminologies. It can
also be used to create new terminologies.
• UIMA: Unstructured Information Management Architecture; framework by Apache enabling the
generation of analysis pipelines for arbitrary content such as text, image or video data
EXECUTIVE SUMMARY
The initial task in work package 3 is to agree on a generic architecture for the semantic data
platform SEMCARE. This document gives a first overview of the planned system architecture for the
project. Considerations about the architecture are a fundamental step in the development of such
a data platform and therefore constitute an essential task right from the beginning of the project.
The architectural design decisions are driven by several dimensions. First, the use cases covered by
the project must be defined in order to evaluate the resulting requirements. As the SEMCARE
software will be installed within the different partner hospitals, integration into the clinical IT
landscape must be simple. Furthermore, data governance, privacy and security must be taken into
account when developing the system architecture. Another important requirement is the scalability
of the system, so that large data sets can be processed. Finally, the SEMCARE architecture should
enable the seamless integration of other platforms and applications, an approach also called an
‘Open Architecture’. This can be achieved by using standard components.
The main goal of the architecture is to provide a framework for extracting meaningful information
from a broad range of structured and unstructured information in the Electronic Health Record. To
this end, several systems and resources have to be integrated within a common framework. Some of
these components are brought in and adapted by the partners, such as an extraction platform and a
terminology browser. Others are available as free software, such as indexing tools and semantic
repositories. Domain terminology resources constitute another cornerstone of this framework.
Whereas the coverage of existing terminologies is already very good for English, the two other
languages addressed by SEMCARE, viz. Dutch and German, are less well served. Filling these
terminology gaps will require a combination of automated and manual term acquisition approaches.
KEY WORDS (Wordle style, http://www.wordle.net/)
[Figure: word cloud of key words]
1. Introduction
1.1. About SEMCARE
1.1.1. Motivation and Background
The exploitation of medical data from clinical trials, and thus the monitoring and improvement of
healthcare delivery, is of increasing interest. However, up to 80% of clinical trials fail to meet their
patient enrolment quotas on time. This recruitment delay currently costs the pharmaceutical industry
up to $8 million per day in lost revenue. The SEMCARE project will provide a more efficient
way of recruiting patients, which will help to prevent recruitment delays.
SEMCARE also addresses another challenge in health care: the identification of rare diseases.
Doctors often find such diseases hard to diagnose because they are little known, which results in a
number of undiagnosed or even wrongly diagnosed patients. SEMCARE will use available clinical
patient data to combine signs and symptoms, thereby detecting undiagnosed patients suffering from
rare diseases. This will help to speed up research on this group of diseases. For pharmaceutical
companies, every newly diagnosed patient is of great interest, as each generates up to $300,000 in
drug revenue per year.
1.1.2. Project description
The two-year research project SEMCARE ‘Semantic Data Platform for Healthcare’ is funded by the
European Commission’s Seventh Framework Programme. The aim of the project is to develop
a software platform that facilitates the diagnosis of rare diseases in various health care contexts
and supports the selection of appropriate patients for clinical studies, based on the automated,
contextual evaluation of existing patient data. SEMCARE will combine current text-mining
technologies with multilingual terminologies in order to develop solutions for typical problems that
arise when interpreting medical narratives, e.g. ambiguities, abbreviations, spelling variations or
typos. The software will be tested and optimized for the analysis of routine clinical data in
leading European health centres in Great Britain, the Netherlands and Austria.
1.2. About this document
1.2.1. Aim of this document
A fundamental task in work package 3 is to agree on a generic architecture for the semantic
data platform SEMCARE. This document gives a first overview of the planned system architecture for
the project. Considerations about the architecture are a crucial step in the development of such a
data platform, which makes them essential right from the beginning of the project. In order to
design the architecture, it must be defined which use cases are covered in the project and which
requirements result from them. This deliverable contains only the basic requirements. More specific,
user-defined requirements related to the prototype will be provided in D3.2.
1.2.2. Document Structure
This document is structured into four main parts. Following an introduction to the SEMCARE
project and the document, the use case that the project will focus on is described. The
definition of the use case is necessary to identify the requirements and demands placed on the
platform and the underlying architecture. Third, we show the generic design of the SEMCARE
architecture and describe the different modules and how they interact. Finally, the document
includes information about the technical and organizational procedures performed at the hospitals
with regard to data privacy and security in the context of the SEMCARE systems and
architecture.
2. Application Scenario / Requirements
2.1. Use Case
The three participating European health centres have agreed on a first general use case on which
they will focus during the project. The use case is called ‘Risk Stratification and Differential
Diagnosis of Patients suffering from transient loss of consciousness’. This use case is
described in detail in the following subsections.
2.1.1. Background and motivation
Cardiovascular disease is the cause of 47% of all deaths in Europe, the majority of which are related
to underlying coronary artery disease [2]. Sudden cardiac death accounts for approximately half of
coronary artery disease related deaths [3] and also occurs in those with non-coronary artery disease
related cardiovascular diseases such as cardiomyopathies and inherited channelopathies. Sudden
death is also more prevalent in patients with epilepsy and is often unexplained, in which case it is
known as SUDEP (Sudden Unexplained Death in Epilepsy). We are currently unable to determine who
is most at risk of SUDEP.
The symptom of transient loss of consciousness (T-LOC) occurs in up to 50% of the general
population and leads to 1% of all hospital admissions [4,5,6]. A wide range of conditions can lead to
T-LOC. Causes of T-LOC can be broadly categorized as cardiac (such as arrhythmia, in which case it
is known as syncope) or non-cardiac (such as epilepsy). Cardiac syncope carries a much more sinister
prognosis as it is associated with sudden cardiac death. Fortunately, effective treatments, such as
anti-ischemic and heart failure medication and implantation of an implantable cardioverter-defibrillator
(ICD), can dramatically improve outcomes. Unfortunately, the clinically assigned aetiology and
prognosis of T-LOC is frequently incorrect [4], predominantly due to an inability to differentiate between
cardiac syncope and epilepsy and a lack of appreciation of high-risk markers such as exertional
T-LOC, T-LOC with palpitation and function and/or pre-existent coronary and/or structural heart disease.
[2] European Cardiovascular Disease Statistics, 2012 edition. European Heart Network and European Society of Cardiology.
[3] Myerburg RJ, Junttila MJ. Sudden cardiac death caused by coronary heart disease. Circulation. 2012 Feb 28;125(8):1043-52.
[4] Fitzpatrick AP, Cooper P. Diagnosis and management of patients with blackouts. Heart. 2006 Apr;92(4):559-68.
[5] Petkar S, Jackson M, Fitzpatrick A. Management of blackouts and misdiagnosis of epilepsy and falls. Clin Med. 2005 Sep-Oct;5(5):514-20.
[6] Brignole M et al. A new management of syncope: prospective systematic guideline-based evaluation of patients referred
urgently to general hospitals. Eur Heart J. 2006 Jan;27(1):76-82. Epub 2005 Nov 4.
2.1.2. Approach
A number of phenotypic features can help to risk-stratify patients, most of which are available from
routine assessment and investigations. Using a semantic data platform, we seek to identify high-risk
patient cohorts based on patient-level criteria scattered across the heterogeneous clinical data
contained in electronic health records (EHRs).
Subjects belonging to a universal set of interest will have their electronic medical records processed
for natural language expressions that denote, often in detail, the patients’ clinical history and the
procedures or investigations planned or carried out. In our specific use case, the cases of
interest are patients with prior myocardial infarction (MI), syncope of presumed cardiac origin or
seizure disorder. The universal set of interest is defined as patients with Transient Loss of
Consciousness and/or Sudden Cardiac Arrest and/or sustained Ventricular Arrhythmia and/or
Cardiomyopathy and/or Ischemic Heart Disease and/or Seizure Disorder.
However, the information extraction methodology we develop and describe is generic and could be
adapted to any patient cohort and medical inquiry.
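The "and/or" definition of the universal set of interest can be read as a logical disjunction: a patient qualifies if at least one of the listed conditions is recorded. The following minimal Python sketch illustrates this reading only. The condition labels and the patient data are hypothetical and are not part of any SEMCARE schema.

```python
# Hypothetical condition labels; the names are illustrative, not a SEMCARE schema.
UNIVERSAL_SET_CONDITIONS = frozenset({
    "transient_loss_of_consciousness",
    "sudden_cardiac_arrest",
    "sustained_ventricular_arrhythmia",
    "cardiomyopathy",
    "ischemic_heart_disease",
    "seizure_disorder",
})


def in_universal_set(patient_conditions):
    """Return True if at least one qualifying condition is present ("and/or")."""
    return not UNIVERSAL_SET_CONDITIONS.isdisjoint(patient_conditions)


# Made-up example patients, each with a set of recorded conditions.
patients = {
    "p1": {"cardiomyopathy", "hypertension"},
    "p2": {"asthma"},
    "p3": {"seizure_disorder", "ischemic_heart_disease"},
}
selected = sorted(pid for pid, conds in patients.items() if in_universal_set(conds))
print(selected)  # ['p1', 'p3']
```

In practice the conditions would be derived from the extracted EHR content rather than given as ready-made labels.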
2.1.3. Topics of interest and their (textual) representation in EHRs
The table below lists medical topics of interest, such as procedures and investigations as well as
information about the patients’ medication and history, that will be used to identify subjects
belonging to the universal set of interest described in the use case above. Only the most frequent
topics of interest are listed. The contents of electronic medical records will be processed for
typical phrases denoting topics and attributes. The values of the attributes are assumed to be numeric
or Boolean and are therefore not of terminological interest. This means that, e.g., “normal ECG” would
be represented by the attribute “ECG normal” and the value “true”, and “QRS interval 0.12 s” would be
represented by the attribute “QRS interval in seconds” and the numeric value “0.12”.
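The attribute/value representation described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the extraction method used in SEMCARE. The two regular expressions and the function name `extract_attributes` are hypothetical and cover only the two examples from the text.

```python
import re

# Illustrative patterns only; real clinical text needs far richer rules.
QRS_PATTERN = re.compile(r"QRS interval\s+(\d+(?:\.\d+)?)\s*s\b", re.IGNORECASE)
NORMAL_ECG_PATTERN = re.compile(r"\bnormal ECG\b", re.IGNORECASE)


def extract_attributes(text):
    """Map indicative phrases to attribute/value pairs (Boolean or numeric)."""
    attributes = {}
    if NORMAL_ECG_PATTERN.search(text):
        attributes["ECG normal"] = True
    match = QRS_PATTERN.search(text)
    if match:
        attributes["QRS interval in seconds"] = float(match.group(1))
    return attributes


print(extract_attributes("Normal ECG, QRS interval 0.12 s."))
# {'ECG normal': True, 'QRS interval in seconds': 0.12}
```

The word boundary in the first pattern keeps “abnormal ECG” from being read as “normal ECG”. Negation and other context would need additional handling.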
For each topic of interest, some examples of indicative phrases and related attributes are listed in the
table below.
Topics of interest, with indicative phrases for topics and for related attributes (selected examples in English, Dutch and German):

Topic of Interest: Electrocardiogram
Indicative phrases for topics: “electrocardiogram”, “elektrocardiogram”, “Elektrokardiogram”, “ECG”, “EKG”
Indicative phrases for related attributes: “normal”, “normaal”, “abnormal ECG”, “abnormal electrocardiogram”, “PR Interval Duration”, “atrioventricular”, “atrioventriculaire”, “AV”, “QRS Interval Duration”, “T wave inversion”, “T wave abnormality”, “ST segment depression”, “ST segment elevation”, “Bundle Branch Block”, “RBBB”, “LBBB”, “pathological Q Waves”, “Atrial fibrillation”

Topic of Interest: Exercise Tolerance Test
Indicative phrases for topics: “exercise tolerance test”, “ETT”, “exercise test”
Indicative phrases for related attributes: “normal”, “ischemic”, “ST segment depression”, “T wave inversion”, “blood pressure response”, “ventricular tachycardia”, “VT”, “ventricular ectopics”, “VEs”, “ectopics present”, “ectopics absent”, “couplets”, “triplets”, “salvos”, “PVCs”, “premature ventricular contractions”

Topic of Interest: Holter Monitoring
Indicative phrases for topics: “holter monitoring”, “holter”, “24 hour tape”, “48 hour tape”, “event recorder”
Indicative phrases for related attributes: “normal”, “non sustained VT”, “non sustained ventricular tachycardia”, “nsVT”, “ventricular tachycardia”, “ventricular ectopics”, “VEs”, “ectopics present”, “couplets”, “triplets”, “salvos”, “PVCs”, “premature ventricular contractions”

Topic of Interest: Coronary Angiogram
Indicative phrases for topics: “cardiac catherisation”, “catherization”, “cath”, “angiogram”, “coronary angiogram”
Indicative phrases for related attributes: “normal”, “unobstructed”, “normal coronaries”, “normal coronary arteries”, “normal coronary angiography”, “smooth coronary arteries”, “stenosis”, “stenoses”, “obstruction”

Topic of Interest: Echocardiogram
Indicative phrases for topics: “echocardiogram”, “echocardiografie”, “echo”, “TTE”, “heart scan”, “Echokardiogramm”
Indicative phrases for related attributes: “normal heart”, “no cardiomyopathy”, “normal echo”, “ejection fraction”, “ventricular function”, “ventricular dysfunction”, “poor ventricular function”, “impaired LV”, “impaired left ventricular”, “aortic stenosis”, “mitral stenosis”, “pulmonary hypertension”

Topic of Interest: MRI - Cardiac
Indicative phrases for topics: “CMR”, “CMRI”, “Cardiac MRI”, “MRI Cardiac”, “Kardiales MRI”, “Herz-MRI”
Indicative phrases for related attributes: “normal”, “ejection fraction”, “ventricular function”, “late gadolinium enhancement”, “Scar”, “regional wall motion abnormality”

Topic of Interest: Blood Tests
Indicative phrases for topics: “Blood Tests”, “Bloods”, “Biochemistry”, “Full Blood Count”, “FBC”, “Troponin”, “Toxicology”, “Blutbild”
Indicative phrases for related attributes: “normal”, “abnormal”, “elevated”, “raised”, “low”, “hypoglycemia”

Topic of Interest: Age
Indicative phrases for topics: “age”, “DOB”, “date of birth”, “Geburtsdatum”, “Alter”, “geboortedatum”
Indicative phrases for related attributes: “years old”, “alte”, “alter”

Topic of Interest: Medications
Indicative phrases for topics: “medications”, “meds”, “drugs History”, “is on”, “Medikamente”
Indicative phrases for related attributes: “drug name”, “substance name”, “dose”, “Furosemide”, “Frusemide”, “Metolazone”, “Eplerenone”, “Spironolactone”, “Dosis”

Topic of Interest: Family History of Sudden Cardiac Arrest
Indicative phrases for topics: “family history of”, “sudden cardiac arrest”, “unexplained death”, “brother died suddenly”, “cousin died suddenly”
Indicative phrases for related attributes: “age of death”, “degree of relative”, “first”, “second”, “mother”, “father”, “brother”, “sister”, “aunt”, “uncle”, “son”, “daughter”, “Vater”, “Mutter”, “Onkel”, “vader”, “broer”, “zuster”, “tante”, “oom”, “zoon”, “dochter”

Topic of Interest: Sudden Cardiac Arrest or sustained Ventricular Arrhythmia
Indicative phrases for topics: “VT”, “VF”, “ventricular tachycardia”, “polymorphic VT”, “ventricular fibrillation”, “torsades”, “resuscitated sudden death”, “resuscitated SCD”, “Arrest”, “Cardiac arrest”, “VF arrest”, “Plötzlicher Herztod”, “Sekundentod”
Indicative phrases for related attributes: in context of ventricular fibrillation: “idiopathic”, “no cause”, “no aetiology”, “idiopathisch”, “ohne erkennbare Ursache”

Topic of Interest: Syncope
Indicative phrases for topics: “syncope”, “near syncope”, “presyncope”, “blackout”, “black-out”, “collapse”, “faint”, “loss of consciousness”, “LOC”, “TLOC”, “T-LOC”, “pass out”, “passing out”, “passed out”, “Ohnmacht”
Indicative phrases for related attributes: “on exertion”, “exertional”, “on exercise”, “exercise related”, “exercise induced”, “stress related”, “catecholamine related”, “emotion induced”, “while running”, “whilst running”, “mid-stride”, “in Verbindung mit Stress”, “prolonged standing”, “prodromal symptoms”, “coughing”, “micturition”, “passing water”, “urinating”, “swallowing”

Topic of Interest: Heart Failure
Indicative phrases for topics: “heart failure”, “HF”, “CCF”, “cardiomyopathy”, “breathlessness”, “NYHA II”, “NYHA III”, “NYHA IV”, “Herzversagen”, “Herzinsuffizienz”
Indicative phrases for related attributes: “Severe”, “Gross”, “Moderate”, “Mild”, “NYHA Class I”, “NYHA Class II”, “NYHA Class III”, “NYHA Class IV”, “NYHA I”

Topic of Interest: Ischemic Heart Disease
Indicative phrases for topics: “Myocardial infarction”, “STEMI”, “nonSTEMI”, “non-STEMI”, “NSTEMI”, “acute coronary syndrome”, “ACS”, “ischaemic heart disease”, “IHD”, “CAD”, “Angina”, “Previous stents”
Indicative phrases for related attributes: Time of event, “STEMI”, “NSTEMI”, “Unstable Angina”, “Stable Angina”, “Troponin rise”, “Previous stents”, “PCI”, “angioplasty”, “CABG”

Topic of Interest: Seizure Disorder
Indicative phrases for topics: “seizure disorder”, “epilepsy”, “seizure”, “fitting”, “fits”, “convulsions”, “limb jerking”, “status epilepticus”, “Krämpfe”, “Epilepsie”, “Anfall”
Indicative phrases for related attributes: “Type”, “Petit Mal”, “Grand Mal”, “Status Epilepticus”, “Frequency”
© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union’s
Seventh Programme for research, technological development and demonstration under grant agreement No 611388.
2.2. Requirements
This section summarises the basic requirements arising from the use case described above. A more
detailed description of the user-specific requirements will be provided after the first prototype has
been developed. That description will be part of D3.2.
2.2.1. Functional Requirements
In order to identify candidates matching the aforementioned criteria, arbitrary types of free-text
documents in patient records have to be gathered, pre-processed and analysed. Hence, in a first
stage, interfaces to existing clinical IT systems have to be established to consolidate the data from
each relevant resource. This stage in general also includes a data transformation process mapping,
for instance, HL7 encoded data to a target schema of a central knowledge store. Such tasks are well
supported by ETL (extract, transform, load) tools such as Talend Open Studio⁷ or Pipeline Pilot⁸.
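As a minimal sketch of such a transformation step, the following Java fragment maps a single HL7v2 PID segment to a flat target record. The class name, the target field names and the sample segment are illustrative assumptions and not part of the SEMCARE specification; a real deployment would rely on an ETL tool or an HL7 library and would also handle escaping, repetitions and missing segments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps an HL7v2 PID segment to a hypothetical target record. */
final class PidSegmentMapper {

    /** Returns the n-th field of a segment, or "" if absent. Index 0 holds the segment name. */
    static String field(String[] fields, int n) {
        return n < fields.length ? fields[n] : "";
    }

    /** Returns the first component of a field, i.e. the text before the first '^'. */
    static String firstComponent(String field) {
        int caret = field.indexOf('^');
        return caret < 0 ? field : field.substring(0, caret);
    }

    static Map<String, String> map(String pidSegment) {
        // The limit -1 keeps trailing empty fields so that positions stay stable.
        String[] f = pidSegment.split("\\|", -1);
        if (!"PID".equals(f[0])) {
            throw new IllegalArgumentException("not a PID segment");
        }
        // The keys below are hypothetical column names of the target schema.
        Map<String, String> target = new LinkedHashMap<>();
        target.put("patient_id", firstComponent(field(f, 3)));  // PID-3 patient identifier list
        target.put("family_name", firstComponent(field(f, 5))); // PID-5 patient name
        target.put("birth_date", field(f, 7));                  // PID-7 date/time of birth
        target.put("sex", field(f, 8));                         // PID-8 administrative sex
        return target;
    }

    public static void main(String[] args) {
        // Made-up sample segment, not real patient data.
        System.out.println(map("PID|1||12345^^^HOSP^MR||Doe^John||19700101|M"));
    }
}
```

The mapping is deliberately positional: each target column names the HL7 field it is read from, which is the kind of rule an ETL job would encode graphically.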
Furthermore, the identification of use case specific criteria (e.g. ‘loss of consciousness whilst running’)
within clinical narratives requires an information extraction system that is prepared for the
variety of isosemantic lexical and syntactic variants found in the texts. Consequently, for each criterion
and attribute of interest numerous synonymous expressions have to be considered in order to
achieve a high recall of relevant candidates. To handle this complexity we will use a Solr
search engine combined with several domain terminologies such as SNOMED CT, ICD-10 or MeSH. One
main focus of the SEMCARE platform is the end-user support in the criteria refinement process. This
is not trivial, as it will require a dialogue with the users in order to acquire custom expressions that
enhance the terminological coverage. Details on this refinement process are described in
section 3 below.
Another key aspect is the language of the document. Text processing tools have to consider the
particular syntax and grammar of a language, and the terminology used has to be specific to that
language as well. Furthermore, regional particularities such as punctuation have to be accounted for.
Examples are the decimal point in English, as opposed to the decimal comma in German and Dutch, or
different units of measurement used for the same laboratory observations.
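The decimal separator issue can be illustrated with the standard Java number formatting classes. The sketch below is only an illustration; the class and method names are assumptions, and the values are made up.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

/** Illustrative only: parses numeric values according to the document language. */
final class LocaleAwareNumbers {

    /** Parses a number written with the conventions of the given locale. */
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text.trim()).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number for " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        // The same laboratory value written with a decimal point and a decimal comma.
        System.out.println(parse("7.5", Locale.ENGLISH));
        System.out.println(parse("7,5", Locale.GERMAN));
        System.out.println(parse("7,5", Locale.forLanguageTag("nl-NL")));
    }
}
```

Parsing with the wrong locale may silently produce a different number instead of failing, which is why the document language should be known before values are extracted.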
⁷ http://talend.com/products/talend-open-studio
⁸ http://accelrys.com/products/pipeline-pilot
2.2.2. Non-Functional Requirements
The non-functional requirements describe the quality characteristics of the SEMCARE system.
Requirements: Implementations
Intuitive user interface: The SEMCARE graphical user interface (GUI) should be easy and intuitive to use.
Transparent ranking: The ranking of the results after submission of a user query should be transparent and traceable. Users should be able to understand how they can refine their query in order to get better results.
Compatibility of GUI for browsers in use: The web-based GUI of the system must be compatible with the browser versions used in the hospitals.
Low response time: The response times while using the SEMCARE platform should be short in order to provide a user-friendly service. The performance of the system depends on several parameters such as:
a) update phase
b) size of the index and main storage
c) number of parallel requests
d) strategy of authorization
Platform independent: Each component of the SEMCARE architecture is platform independent as Java will be used for the implementation.
Security / privacy: It must be guaranteed that only authorized people can access the clinical data.
3. Architecture
3.1. Overview
3.1.1. Involved Systems
Figure 1 shows an overview of the different systems involved in the SEMCARE architecture and how
data is transferred from one system to another.
Figure 1: Systems involved in the SEMCARE architecture
Each of the systems is briefly described below.
Production data system
The production data system contains the hospital production data that may be structured or
unstructured and is spread over different sources.
Possible components of the system are:
• Databases
• File storage
• HL7v2 messages
• Multiple components that constitute a HIS (hospital information system)
Staging data system
The staging data system is a copy of the hospital production data used for feeding the SEMCARE
staging system. The hospital data is copied because operating directly on the live data is usually not
permitted. By operating on a copy of the data, potential damage to the live system is
avoided.
The staging data system has the same components as the production system:
• Databases
• File storage
• HL7v2 messages
• HIS (hospital information system)
SEMCARE staging system
The SEMCARE staging system reads the data from the staging data system. This is done via an
ETL process that aggregates data from different data sources into one data store. A common tool for
such an ETL process is Talend Open Studio. Once the data is loaded, patient data of interest is
analysed, and the resulting data populates the SEMCARE staging databases as well as the Solr
index.
The different components are:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Database importer process: An ETL process that loads data from the staging data system
into the SEMCARE staging system.
• Solr server and index: Indexes documents and searches the index.
• Graph database: Stores concept hierarchies and relations between documents and
concepts. For now, this is an experimental extension to the system. Whether it adds value
to the SEMCARE platform will be evaluated further.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for testing: Provides capability for configuring and testing the staging
system.
SEMCARE production system
The SEMCARE production system contains the structured data exported from the SEMCARE staging
system. It is the system that is used by the end users to perform search queries and view reports.
The system contains the following components:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Solr server and index: Indexes documents and searches indexes.
• Graph database: Stores concept hierarchies and relations between documents and
concepts. For now, this is an experimental extension to the system. It will be further evaluated
if it can add additional value to the SEMCARE platform.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for end users: The portal for building queries and searching the system.
3.1.2. Architecture sketch
Figure 2 shows an overview of the complete architecture planned for the semantic analysis platform
SEMCARE. The individual components have been described in section 3.1.1 above.
Furthermore, the figure shows that it will be possible to apply third party tools on the data store of the
SEMCARE production system in order to perform further analytics like visualisation or statistics. This
will be enabled by providing a common data model (the i2b2 star schema) that can easily be used by
third party applications (e.g. tranSMART, QlikView, Rapidminer). As a consequence, hospitals can
install third party tools if they want to use them on the SEMCARE data.
Figure 2: Architecture sketch
3.1.3. Architecture Layering
The SEMCARE system can be divided into three layers, which are described in the following
paragraphs from the bottom to the top and shown graphically in Figure 3 below.
The bottom layer contains the data sources, which consist of different types of patient data arising in
a hospital, for example unstructured data like discharge summaries or findings reports, and structured
data like lab results or other routine data acquired and structured for health care, research and quality
assessment. Coded data, which is mainly used for reimbursement, may also be available. The data is
scattered over different databases or stored in files, which can be of different formats (e.g. Word, XML,
plain text, and PDF). Data may also be available as messages, generally in HL7v2 format as a universal
health care messaging standard.
The second layer is the semantic middleware. First, it contains tools for information extraction, ETL
and text mining as an interface to the data sources. The loaded and analysed data is then stored in a
unifying semantic database. This layer also includes terminologies and texts stored in a graph
database and Solr index. The third part of the middleware is the communication between the
SEMCARE data store and the topmost layer, which is the presentation layer.
The presentation layer is the highest level and represents the interface to the user who could be a
researcher, clinician or administrator. Possible components of the presentation layer are:
• the terminology editor
• a search interface including a query generator
• dashboards and analytics
• study management tools
[Figure 3 labels. Users: Researchers, Administration, Clinicians. Presentation Layer: Terminology Editor, Search Interface, Dashboards and Analytics, Study Management, … Semantic Middleware: Communication Layer; Unifying Semantic Database (contains Data, Terminologies, and Texts) with Rel DB, Search Index, Triple Store / NoSQL DB; Information Extraction, Text Mining, ETL. Data Sources: Reimbursement data, XML, Lab Results, Discharge Summaries, Hospital Data.]
Figure 3: Architecture layering
3.2. Interfaces
The following interfaces between components of the SEMCARE system have been identified:
• Staging data to data importer: Imports data from the hospital information system as
documents or messages. Expected formats are XML, HL7 and plain text, and possibly JPEG
or other image formats for scanned documents, as well as DICOM.
• Data importer to staging Solr: The SolrJ (Solr Java client) API is used for sending patient
record information to Solr to be analysed.
• SEMCARE staging portal to staging Solr: The Averbis search REST API will be used for
the communication between the two components. This API uses JSON messages to
communicate with Solr. Example message definitions are shown in Figure 4.
• SEMCARE production portal to production Solr: The Averbis search REST API will be
used for the communication between the two components. Example message definitions are
shown in Figure 4.
import java.util.List;

// Search request message (each class would live in its own source file).
public class Request {
    private String query;
    private Integer rows;
    private Integer start;
    private String highlightQuery;
    private List<SortField> sortFields;
    private Boolean facetHighlighting;
    private Integer facetLimit;
    private String facetPrefix;
    private String facetSort;
    private List<Facet> facets;
    private List<Field> fields;
    private List<Param> params;
    private User user;
}

// Search result message.
public class Result {
    private String query;
    private Integer start;
    private String highlightQuery;
    private Integer numFound;
    private List<Facet> facets;
    private List<Document> documents;
    private String didYouMean;
}
Figure 4: Averbis search REST API
3.3. Data Models
SEMCARE employs a number of different data formats and systems. These include unstructured
input data, relational databases, terminologies, and Solr indexes.
3.3.1. Input Data
The input data for the SEMCARE project may vary with regard to the data source and the data
format.
For each treatment episode, several sources are of interest:
• documents, either original ones (e.g. findings reports) or aggregated ones (discharge letters)
• messages, e.g. HL7v2 messages
• raw data, e.g. images, measurement data (e.g. ECG)
• database entries
The input data may exhibit different degrees of structure, such as
• unstructured, e.g. free text, images
• semi-structured, e.g. free text with standardized organizing patterns (e.g. headings)
• structured, e.g. tables of lab values
• coded, e.g. LOINC-coded lab values, ICD-10 coded diseases
The SEMCARE system will import these different formats from the various data sources with an ETL
process.
3.3.2. Terminologies
Medical terminologies provide meaning identifiers (codes) for terms or groups of synonymous terms,
the latter generally referred to as concepts. In SEMCARE, terminologies will enrich the search
process by knowledge about the meaning of domain terms, their groupings into concepts, and certain
relations between concepts such as broader / narrower. In addition, SEMCARE will enable users to
add new concepts and terms to the existing terminology, where needed, e.g. when they miss an
important synonym.
As some terminologies support several languages they will also allow for multilingual text analysis by
grouping terms from different languages into the same concept. The continual process for refining
terminologies is described in this section.
Figure 5: Refinement process for criteria
As shown in Figure 5 above, the SEMCARE platform will provide a term browser and a dictionary
creator for users to view and edit their terminology. The term browser will be able to import standard
terminologies such as SNOMED CT, ICD-10 or MeSH and store them in a relational database (RDB).
The users can then build their own terminology by enhancing, merging, or modifying existing
terminologies. The most important medical terminologies are contained in the UMLS metathesaurus,
which is a rich source of synonyms in different languages that also groups concepts into top-level
categories via the UMLS Semantic Network. We will make use of these resources by enhancing the user
interface of the term browser, so that non-English terms can also be used to search for concepts.
In all stages of the terminology creation process, the terminology can be exported to the AEP analysis
pipeline. The terminology can then be used to index and search documents via the SEMCARE search
interface.
When the users build their search query, they may find that their terminology needs to be modified in
order to produce better search results. They can then go back to the term browser to make changes
to the terminology. This refinement process is crucial for optimizing the SEMCARE platform. Users
should be able to quickly see how terminology changes affect search facets and results.
Although there is a certain preference for SNOMED CT, ICD-10, and MeSH, a final decision on which
terminologies to use for annotation will have to be made at the start of the work in WP2. Another
decision to be made is how the known vocabulary gap for Dutch and German will be filled. One
possible strategy is the use of machine translation, together with human review of the terms
generated by this method. Manual additions to the terminologies, mainly driven by the use case, will
be the option of choice wherever queries have to be fine-tuned.
3.3.3. i2b2 Star Schema
In order to use a standard schema for the data storage and to ensure that we provide a common data
model that is also widely used by third party providers (e.g. tranSMART), the i2b2 star schema will be
used in SEMCARE to store the data.
In the i2b2 star schema, observations or, more precisely, factoid (fact-like) statements, are stored in the
observation_fact table and linked to four so-called “dimension” tables for patient, visit, provider and
concept details. These dimension tables contain descriptive information about factoid statements.
Figure 6 below shows an overview of the i2b2 star schema.
Figure 6: i2b2 star schema
i2b2 also uses metadata tables to define terminologies. SEMCARE terminologies can be stored in the
i2b2 custom_meta table (Figure 7). This table stores hierarchical terminologies that are used to build
queries in the i2b2 query and analysis tool. The c_fullname column is used to store the full path of
each term with the '\' character delimiting the hierarchical levels. After the custom_meta table is filled
with SEMCARE terminologies via an import process, concept_dimension entries can be created that link to
the custom_meta terms.
custom_meta
c_hlevel              integer
c_fullname            character varying(700)
c_name                character varying(2000)
c_synonym_cd          character(1)
c_visualattributes    character(3)
c_totalnum            integer
c_basecode            character varying(50)
c_metadataxml         text
c_facttablecolumn     character varying(50)
c_tablename           character varying(50)
c_columnname          character varying(50)
c_columndatatype      character varying(50)
c_operator            character varying(10)
c_dimcode             character varying(700)
c_comment             text
c_tooltip             character varying(900)
m_applied_path        character varying(700)
update_date           timestamp without time zone
Figure 7: i2b2 custom_meta table
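The path convention used in c_fullname can be sketched in a few lines of Java. The class name and the example terms are hypothetical, and the sketch assumes that paths start and end with the delimiter and that the root term has level 0; both assumptions should be checked against the i2b2 installation in use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: builds and splits c_fullname-style paths delimited by '\'. */
final class MetaPath {

    /** Joins hierarchy levels into a full path with leading and trailing delimiter. */
    static String fullName(List<String> levels) {
        StringBuilder sb = new StringBuilder("\\");
        for (String level : levels) {
            sb.append(level).append('\\');
        }
        return sb.toString();
    }

    /** Splits a full path back into its hierarchy levels. */
    static List<String> levels(String fullName) {
        // Strip leading and trailing backslashes, then split on single backslashes.
        String trimmed = fullName.replaceAll("^\\\\+|\\\\+$", "");
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\\\"));
    }

    /** Hierarchy level of the last term in the path; the root term is assumed to be level 0. */
    static int hLevel(String fullName) {
        return levels(fullName).size() - 1;
    }

    public static void main(String[] args) {
        String path = fullName(Arrays.asList("SEMCARE", "Cardiology", "Syncope"));
        System.out.println(path + " -> level " + hLevel(path));
    }
}
```

An import process along these lines would write the path to c_fullname, the level to c_hlevel and the last path element to c_name.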
An example of how a custom terminology may look in the i2b2 term navigator is shown in Figure 8.
Figure 8: Custom metadata in i2b2 term navigator
In addition to the standard i2b2 tables, a new table will be created to map i2b2 records to Solr
documents. This table will contain the encounter_num key, the original unstructured record, the Solr
document and ID, and a copy of the CAS (Common Analysis System) object from the text analysis.
Figure 9 shows this additional SEMCARE record table and its relation to the existing i2b2 tables.
Figure 9: Solr to i2b2 mapping
3.3.4. SEMCARE Patient Record Solr Document
Solr documents will be used to store patient record information for text search. Each Solr document
will contain IDs that map the Solr document to corresponding records in the i2b2 database (see also
Figure 10 below). With this linkage, only data required for search indexing will be stored in the Solr
document, and additional information can be pulled from the i2b2 database if needed. Dynamic fields
can be used in the Solr document to store multiple concepts.
References to terminology codes or concepts are stored in Solr as CUIs (concept unique identifiers) to
enable multilingual searches. Preferred terms and synonyms will not be stored in Solr because all
documents and queries will be processed by the AEP, which replaces synonyms and preferred terms with
CUIs before the query is sent to Solr.
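The effect of this normalisation can be illustrated with a small stand-in for the AEP. The sketch below is an assumption-laden simplification: it handles only single-word terms, the CUIs are invented, and the class is not part of any Averbis API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: replaces known terms of any language by their (invented) CUI. */
final class CuiNormalizer {

    private final Map<String, String> termToCui = new HashMap<>();

    /** Registers all terms (synonyms in any language) of one concept. */
    void addConcept(String cui, String... terms) {
        for (String term : terms) {
            termToCui.put(term.toLowerCase(Locale.ROOT), cui);
        }
    }

    /** Replaces each known whitespace-separated token by its CUI; unknown tokens are kept. */
    String normalize(String query) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            out.add(termToCui.getOrDefault(token.toLowerCase(Locale.ROOT), token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CuiNormalizer normalizer = new CuiNormalizer();
        normalizer.addConcept("a1234", "syncope", "blackout", "Ohnmacht");
        normalizer.addConcept("b5678", "epilepsy", "Epilepsie");
        // English and German queries end up as the same CUI sequence.
        System.out.println(normalizer.normalize("syncope epilepsy"));
        System.out.println(normalizer.normalize("Ohnmacht Epilepsie"));
    }
}
```

Because documents and queries pass through the same mapping, a query in one language can match documents written in another.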
The Solr system will provide a faceted search, which means that the search results are organized
according to a faceted classification system, thus allowing the user to explore a collection of
information by applying multiple filters. Facets correspond to properties of the search result.
Solr will store multiple dynamic fields for each concept:
• a list of all the types of concepts used for faceting
(Note that this field is the set of all concept types in the document and it has no linkage to the
relational database. Only individual concepts are linked to the database.)
• a value for searching
• an ID to link to the relational database
• a path for hierarchical faceting
For example, for a medication with the CUI a1234 Solr would store the following fields:
• concept_medications=“a1234,b5678,c2313”
• concept_med_val_a1234=50
• concept_med_id_a1234=123456
• concept_med_path_a1234=/c1000/b1023/a1234
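The naming pattern of these fields can be sketched as follows. The class and method names are hypothetical; only the field name pattern and the example values are taken from the list above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: derives the dynamic field names for one concept of a document. */
final class ConceptFields {

    /** Builds the value, ID and path fields for a concept, e.g. prefix "med" and CUI "a1234". */
    static Map<String, Object> forConcept(String prefix, String cui,
                                          Object value, long recordId, String path) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("concept_" + prefix + "_val_" + cui, value);     // value for searching
        fields.put("concept_" + prefix + "_id_" + cui, recordId);   // link to the relational database
        fields.put("concept_" + prefix + "_path_" + cui, path);     // path for hierarchical faceting
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Facet field listing all medication concepts found in the document.
        doc.put("concept_medications", String.join(",", Arrays.asList("a1234", "b5678", "c2313")));
        doc.putAll(forConcept("med", "a1234", 50, 123456L, "/c1000/b1023/a1234"));
        doc.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

With SolrJ, such entries could be copied into a SolrInputDocument via addField; the Solr schema would need matching dynamic field definitions, which are not specified in this document.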
Figure 10: Mapping of Solr documents to i2b2 database
3.3.5. SEMCARE Data Loading Flow
The data loading flow begins when the data importer ETL process loads unstructured data. The
unstructured data is stored in the relational database and then sent to Solr to be analysed and
indexed. The Solr process and text analysis pipeline store data in a graph database, e.g. Neo4j⁹, and
build the Solr index. Finally, the structured data from the analysis is added to the relational database
to enhance the unstructured data. A diagram of the data import flow is shown in Figure 11.
Figure 11: SEMCARE data loading flow
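The order of the steps can be summarised in a short sketch that uses in-memory maps as stand-ins for the relational database and the Solr index. All names are hypothetical, the analyser is a toy replacement for the text analysis pipeline, and the experimental graph database step is omitted.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: the loading steps with in-memory stand-ins for the real stores. */
final class LoadingFlowSketch {

    /** Stand-in for the text analysis pipeline: returns the CUIs found in a text. */
    interface TextAnalyzer {
        List<String> extractConcepts(String text);
    }

    final Map<String, String> rawRecords = new LinkedHashMap<>();            // relational DB, raw text
    final Map<String, List<String>> searchIndex = new LinkedHashMap<>();     // Solr index
    final Map<String, List<String>> structuredFacts = new LinkedHashMap<>(); // relational DB, facts
    private final TextAnalyzer analyzer;

    LoadingFlowSketch(TextAnalyzer analyzer) {
        this.analyzer = analyzer;
    }

    void load(String recordId, String text) {
        rawRecords.put(recordId, text);                          // 1. store the unstructured record
        List<String> concepts = analyzer.extractConcepts(text);  // 2. analyse the text
        searchIndex.put(recordId, concepts);                     // 3. build the search index
        structuredFacts.put(recordId, concepts);                 // 4. write structured data back
    }

    public static void main(String[] args) {
        // Toy analyser with an invented CUI; a real pipeline would use terminologies.
        TextAnalyzer toy = text -> text.toLowerCase(Locale.ROOT).contains("syncope")
                ? Collections.singletonList("a1234")
                : Collections.<String>emptyList();
        LoadingFlowSketch flow = new LoadingFlowSketch(toy);
        flow.load("doc-1", "Patient reports syncope whilst running.");
        System.out.println(flow.searchIndex);
    }
}
```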
The data importer process could be created with an ETL tool such as Talend Open Studio. Figure 12
below shows an example Talend job that reads a directory of plain text files and commits them to Solr
and a PostgreSQL database.
⁹ http://www.neo4j.org/
Figure 12: Talend Open Studio data importer
3.4. Modules & Functional View
The SEMCARE system contains the following modules and components as shown in Figure 13 below
and described in this section.
Figure 13: SEMCARE components
3.4.1. SEMCARE Data Importer
The SEMCARE data importer is the entry point for health care data in the SEMCARE system. It could
be an ETL process defined by a tool such as Talend Open Studio, or a custom coded software
process. When it receives data, the data importer will write the unstructured data to the database and
then send the unstructured data to Solr for analysis.
3.4.2. Solr
Solr is an open source search platform from the Apache Lucene project. In the SEMCARE project it is
used to index and search patient record data. Solr will use the Averbis text analytics tools to create
structured data from unstructured text. After the text is analysed, Solr will write the structured data
to the database.
3.4.3. Averbis Text Analytics (AEP)
The Averbis Extraction Platform (AEP) is a text analysis tool that can be applied to arbitrary
information extraction scenarios. It extracts the individual information units, such as facts and
relations, that are most relevant to a user from unstructured text. The AEP consists of a number of
modular text analysis components, so-called Analysis Engines (AEs), which are combined in the
Apache UIMA¹⁰ framework to build an overall solution for different use cases.
Depending on the requirements, rule-based methods, statistical methods or a combination of both are
used to reveal the semantics of the content.
Annotations between AEs are exchanged using an object named Common Analysis System (CAS).
The CAS is UIMA’s object-based data structure that allows memory based storage and exchange of
annotations with respect to pre-defined type systems of hierarchically organized annotations. With the
aid of this data structure it is possible to generate a common base to analyse unstructured text.
3.4.4. SEMCARE Portal Web Application
The SEMCARE portal provides a graphical user interface, which allows users to build queries on the
clinical data and to manage the system. The users will get immediate feedback from a search, which
helps them to decide how to refine their query in order to get better results. The portal will also provide
users with an interface for defining and refining terminologies. More specific requirements and details
about the user interface will be provided in D3.2.
¹⁰ http://uima.apache.org
3.4.5. i2b2 Applications
i2b2 tools and components such as the i2b2 query and analysis tool can be installed in the system if
needed. i2b2 runs on the JBoss application server.
3.4.6. Third Party Tools and Applications
Third party tools can also be installed in the system as required. These tools could possibly interface
with the i2b2 database or the Solr server, but because of the varying requirements and functionality of
third party applications, they are not shown in Figure 13 or described in detail here.
3.4.7. Scalability
All of the components in the SEMCARE system can be deployed across multiple machines to support
the processing of large data sets if needed. Multiple data importer processes can be launched to
read input data. SolrCloud can be used to distribute Solr indexes and search processing across
multiple machines. The Averbis text analysis pipeline can also be deployed as a distributed system.
By adding more machines and distributing its components, the SEMCARE system can scale
to meet the processing requirements of large data sets.
3.5. Users & Roles
In the context of the SEMCARE project different types of users can be distinguished. Their roles are
briefly described below:
Production Database Administrator
The production database administrator manages the copying of production patient data into the
staging data system. He/she also manages the subsequent export to the SEMCARE staging system via
the ETL process. The production database administrator is located at the hospital site.
SEMCARE Administrator
The SEMCARE administrator is responsible for managing the SEMCARE databases, the Solr
configuration and the SEMCARE portal. He/she configures the terminology and the text analysis.
The administrator manages the copying of data from the SEMCARE staging environment to the SEMCARE
production environment and creates custom dashboards, scripts and third party integrations.
SEMCARE User
Typically, SEMCARE users will be researchers and clinicians who use the SEMCARE portal for
search and analytics.
The role concept will be further verified during the project and refined if needed. Furthermore, it must
be guaranteed that all roles have the access rights to the data to be analysed at the level of the
hospital information system.
3.6. Open points
A few points that are still open and need further clarification within the course of the project are listed
below. More specific details about these points will be given in deliverable D3.2.
• One challenge for the SEMCARE platform is the search for constellations of symptoms that are
  spread over several documents. A strategy will be developed to cover this requirement.
• As the SEMCARE system will be installed within the hospital, a further analysis of the IT
  landscape within the hospitals will have to be performed. The interfaces need to be defined
  and the interchange formats specified.
• Another open point is a possible weighting of criteria for a specific use case. For
  example, it should be possible to define mandatory and optional criteria when creating a
  search query.
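To make the first and third points more concrete, the sketch below shows one conceivable approach: findings from all documents of a patient are merged, patients must match every mandatory criterion, and optional criteria only affect the ranking. This is an illustrative assumption, not the strategy chosen by the project; the field names (patient_id, concepts) and the sample concepts are hypothetical.

```python
from collections import defaultdict


def concepts_by_patient(documents):
    """Merge the concepts found in each document into one set per patient."""
    merged = defaultdict(set)
    for doc in documents:
        merged[doc["patient_id"]].update(doc["concepts"])
    return dict(merged)


def rank_patients(documents, mandatory, optional):
    """Return (patient_id, score) pairs for patients meeting all mandatory criteria.

    The score is the number of optional criteria matched; results are sorted
    by descending score, then by patient ID.
    """
    mandatory, optional = set(mandatory), set(optional)
    results = []
    for patient_id, concepts in concepts_by_patient(documents).items():
        if mandatory <= concepts:
            results.append((patient_id, len(optional & concepts)))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical sample: the symptoms of patient P1 are spread over two documents.
documents = [
    {"patient_id": "P1", "concepts": {"fever", "rash"}},
    {"patient_id": "P1", "concepts": {"joint pain"}},
    {"patient_id": "P2", "concepts": {"fever"}},
    {"patient_id": "P3", "concepts": {"fever", "joint pain"}},
]
print(rank_patients(documents, mandatory=["fever", "joint pain"], optional=["rash"]))
# prints [('P1', 1), ('P3', 0)]
```

Patient P1 is found although no single document contains both mandatory criteria, and is ranked first because the optional criterion also matches.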
4. Data Privacy / Technical and organizational security procedures
4.1. Data processing
The data processing within the scope of the SEMCARE project takes place entirely within each
participating hospital. The project integrates into the existing IT landscape of the hospital with regard
to physical access, system access, and data access control for the IT components used
(servers and network components). This also affects the security of particularly sensitive health care
data arising in a hospital.
The architectural design of the SEMCARE platform permits data processing and storage on separate
hard drives where the involvement of different departments and their respective user rights
requires it.
4.2. Data transfer and data location
In the scope of the SEMCARE project, patient data will not leave the hospitals at any time. Patient
data may, however, be shared between different departments of each hospital. In these cases,
the (pseudo-)anonymisation processes already in place will be applied. The de-identification procedures
for each of the three participating hospitals are explained in detail in deliverable D1.1.
Regarding test data, SGUL will prepare anonymised data to be used by Averbis GmbH for the
development of algorithms, interfaces and the final product. The legal basis for the transfer of such
test data is section 251 of the NHS Act 2006. Transferred test data will be encrypted either at rest or
in transit. The hospitals EMC and MUG will not provide any data to Averbis or to any other clinical
partner.
Both data processing and the operation of the data platform will be performed within a dedicated
server infrastructure in the hospital. It will be ensured that no project-related data is stored in
locations to which unauthorised persons have access.
Furthermore, additional encryption of the data that is stored, e.g., in the Solr index is possible by
using TrueCrypt (http://www.truecrypt.org/).
4.3. Role concept
A role concept will be applied that assures that only authorised users can access the data related to
the SEMCARE project.
What                             Who
Data Upload, Query generation    SEMCARE User
Data Deletion                    SEMCARE Administrator
Create, Edit, Delete Users       SEMCARE Administrator
System maintenance               Local system administrator
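The role assignments in the table above could be represented in code roughly as follows. This is a minimal sketch for illustration only: the action identifiers are hypothetical labels for the table rows, and the strict reading used here (an action is permitted only for the role listed) is an assumption, since the document does not state whether administrators also inherit user rights.

```python
from enum import Enum


class Role(Enum):
    USER = "SEMCARE User"
    ADMINISTRATOR = "SEMCARE Administrator"
    LOCAL_SYSADMIN = "Local system administrator"


# Action names are hypothetical labels for the rows of the role table.
PERMISSIONS = {
    "data_upload": {Role.USER},
    "query_generation": {Role.USER},
    "data_deletion": {Role.ADMINISTRATOR},
    "manage_users": {Role.ADMINISTRATOR},
    "system_maintenance": {Role.LOCAL_SYSADMIN},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True only if the table grants the action to the role."""
    # Unknown actions are denied by default.
    return role in PERMISSIONS.get(action, set())
```

Denying unknown actions by default keeps the check conservative if new actions are added before the table is updated.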
A connection to the local LDAP (Lightweight Directory Access Protocol) can be implemented in order
to take over existing access rights.
Activities will be logged so that it can be examined whether personal data has been
entered, changed or deleted, and by whom. Only designated personnel will have access to
the system components and applications of the SEMCARE platform.
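A log entry that records who entered, changed or deleted data could look like the following sketch. The logger name, the field names and the JSON layout are assumptions chosen for this example; the actual log format is not specified in this document.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("semcare.audit")  # hypothetical logger name

ALLOWED_ACTIONS = {"entered", "changed", "deleted"}


def audit_event(user: str, action: str, record_id: str) -> str:
    """Build one JSON audit line and send it to the audit logger."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "record_id": record_id,
    }
    line = json.dumps(entry, sort_keys=True)
    # Where the line ends up (file, syslog, ...) depends on the handlers
    # configured for this logger by the local administrator.
    audit_logger.info(line)
    return line
```

Each line answers the three questions named above: what happened to the data, to which record, and by whom.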
4.4. Availability control
Measures will be considered to protect personal data against accidental destruction or loss. For
example, the SEMCARE systems will not work directly on the hospital live data but on a copy (staging
system) to ensure that no real patient data is affected in any way.
High availability of the SEMCARE platform is not a priority as the application is not crucial for patient
care.
4.5. Data separation control
It must be assured that data from different scenarios or different departments are separated from
each other. The SEMCARE architecture allows this separation if needed, e.g. different Solr indexes
can be used.
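The separation by Solr index mentioned above could be sketched as follows: each department is mapped to its own index, and a query URL can only be built for a configured department. The base URL, department names and index names are hypothetical; only the general Solr URL pattern (index name followed by /select) is taken from standard Solr usage.

```python
from urllib.parse import urlencode

SOLR_BASE_URL = "http://localhost:8983/solr"  # assumed local Solr instance

# Hypothetical mapping: one Solr index per department.
INDEX_BY_DEPARTMENT = {
    "cardiology": "semcare_cardiology",
    "oncology": "semcare_oncology",
}


def select_url(department: str, query: str) -> str:
    """Build a Solr select URL restricted to the department's own index."""
    try:
        index = INDEX_BY_DEPARTMENT[department]
    except KeyError:
        raise PermissionError(
            f"no index configured for department: {department}"
        ) from None
    params = urlencode({"q": query, "wt": "json"})
    return f"{SOLR_BASE_URL}/{index}/select?{params}"


print(select_url("cardiology", "text:fever"))
# prints http://localhost:8983/solr/semcare_cardiology/select?q=text%3Afever&wt=json
```

Since a query is always routed to exactly one index, data from other departments cannot appear in its results.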
The SEMCARE systems will only be run locally and queries will only be performed on relevant patient
data. Other information that is not relevant for the defined use case will not be extracted from the
hospital systems. A development system and a production system will be provided separately.