Semantic Data Platform for Healthcare
ICT-611388

D3.1 Sketch of system architecture specification
WP3 – Architecture and Requirements
V1.0 Final

Lead beneficiary: MUG
Date: 31/03/2014
Nature: Report
Dissemination level: PU (Public)

D3.1 – Sketch of system architecture specification WP3: Architecture and Requirements Dissemination level: Public Authors: Philipp Daumke, Carla Haid, Luke Mertens (Averbis), Stefan Schulz (MUG) Version: 1-0 Final

© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union’s Seventh Programme for research, technological development and demonstration under grant agreement No 611388.

TABLE OF CONTENTS

DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
EXECUTIVE SUMMARY
KEY WORDS (WORDLE STYLE)
1. INTRODUCTION
  1.1. ABOUT SEMCARE
    1.1.1. MOTIVATION AND BACKGROUND
    1.1.2. PROJECT DESCRIPTION
  1.2. ABOUT THIS DOCUMENT
    1.2.1. AIM OF THIS DOCUMENT
    1.2.2. DOCUMENT STRUCTURE
2. APPLICATION SCENARIO / REQUIREMENTS
  2.1. USE CASE
    2.1.1. BACKGROUND AND MOTIVATION
    2.1.2. APPROACH
    2.1.3. TOPICS OF INTEREST AND THEIR (TEXTUAL) REPRESENTATION IN EHRS
  2.2. REQUIREMENTS
    2.2.1. FUNCTIONAL REQUIREMENTS
    2.2.2. NON-FUNCTIONAL REQUIREMENTS
3. ARCHITECTURE
  3.1. OVERVIEW
  3.2. INTERFACES
  3.3. DATA MODELS
    3.3.1. INPUT DATA
    3.3.2. TERMINOLOGIES
    3.3.3. I2B2 STAR SCHEMA
    3.3.4. SEMCARE PATIENT RECORD SOLR DOCUMENT
    3.3.5. SEMCARE DATA LOADING FLOW
  3.4. MODULES & FUNCTIONAL VIEW
    3.4.1. SEMCARE DATA IMPORTER
    3.4.2. SOLR
    3.4.3. AVERBIS TEXT ANALYTICS (AEP)
    3.4.4. SEMCARE PORTAL WEB APPLICATION
    3.4.5. I2B2 APPLICATIONS
    3.4.6. THIRD PARTY TOOLS AND APPLICATIONS
    3.4.7. SCALABILITY
  3.5. USERS & ROLES
  3.6. OPEN POINTS
4. DATA PRIVACY / TECHNICAL AND ORGANIZATIONAL SECURITY PROCEDURES
  4.1. DATA PROCESSING
  4.2. DATA TRANSFER AND DATA LOCATION
  4.3. ROLE CONCEPT
  4.4. AVAILABILITY CONTROL
  4.5. DATA SEPARATION CONTROL

TABLE OF FIGURES

FIGURE 1: SYSTEMS INVOLVED IN THE SEMCARE ARCHITECTURE
FIGURE 2: ARCHITECTURE SKETCH
FIGURE 3: ARCHITECTURE LAYERING
FIGURE 4: AVERBIS SEARCH REST API
FIGURE 5: REFINEMENT PROCESS FOR CRITERIA
FIGURE 6: I2B2 STAR SCHEMA
FIGURE 7: I2B2 CUSTOM_META TABLE
FIGURE 8: CUSTOM METADATA IN I2B2 TERM NAVIGATOR
FIGURE 9: SOLR TO I2B2 MAPPING
FIGURE 10: MAPPING OF SOLR DOCUMENTS TO I2B2 DATABASE
FIGURE 11: SEMCARE DATA LOADING FLOW
FIGURE 12: TALEND OPEN STUDIO DATA IMPORTER
FIGURE 13: SEMCARE COMPONENTS
DOCUMENT INFORMATION

Grant Agreement Number: ICT-611388
Acronym: SEMCARE
Full title: Semantic Data Platform for Healthcare
Project URL: www.semcare.eu
EU Project officer: Saila Rinne ([email protected])
Deliverable: Number 3.1, Title: Sketch of system architecture specification
Work package: Number 3, Title: Architecture and Requirements
Delivery date: Contractual 31.03.2014
Status: Version V1.0, Final
Nature: Report
Dissemination Level: Public
Authors (Partner): Philipp Daumke, Carla Haid, Luke Mertens (Averbis), Stefan Schulz (MUG)
Responsible Author: Stefan Schulz, Partner: MUG, Email: [email protected], Phone: +43 699 150 96 270

DOCUMENT HISTORY

NAME                      DATE        VERSION  DESCRIPTION
Philipp Daumke            03.03.2014  0.1      Initial Creation
Carla Haid                07.03.2014  0.2      Additions
Luke Mertens              14.03.2014  0.3      Additions
Carla Haid                17.03.2014  0.4      Additions
Stefan Schulz             19.03.2014  0.5      Corrections, comments, additions
Carla Haid                21.03.2014  0.6      Corrections, additions
Luke Mertens              24.03.2014  0.7      Corrections, additions
Jan Kors                  24.03.2014  0.8      Corrections, additions
A. Honrado, E. Chavarría  30.03.2014  0.9      Internal formal review
Stefan Schulz             31.03.2014  1.0      Final version
DEFINITIONS

• Partners of the SEMCARE Consortium are referred to herein according to the following codes:
    AVERBIS - Averbis GmbH (Germany) – Coordinator
    EMC - Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
    MUG - Medical University of Graz (Austria) – Beneficiary
    SGUL - Saint George's University of London (UK) – Beneficiary
    SYNAPSE - Synapse Research Management Partners S.L. (Spain) – Beneficiary
• Project: The sum of all activities carried out in the framework of the Grant Agreement.
• AEP: Averbis Extraction Platform; text analysis tool to extract information units such as facts and relations from unstructured text
• CUI: Concept unique identifier in the Unified Medical Language System (UMLS)
• EHR: Electronic health record; clinical data record of a patient
• ETL: Extract, transform, load; process in data warehousing that is often used to integrate data from multiple sources. A common ETL tool is Talend Open Studio.
• GUI: Graphical user interface of the application
• Graph DB: Database using graph structures with nodes, edges, and properties to represent and store data. For large, highly connected data sets it can be faster and scale better than a relational database.
• HL7v2 format: Health Level Seven, version 2; widely used standard for the exchange of electronic health information
• i2b2: Informatics for Integrating Biology and the Bedside; scalable informatics framework for clinical data
• REST: Representational State Transfer; architectural style for communication between two components, here using JSON (JavaScript Object Notation) messages
• Solr: Open-source search platform built on Apache Lucene, with the Java client SolrJ
• Terminology: General term for information artefacts that provide controlled terms for a domain, identifiers of meaning and semantic relations, e.g. SNOMED, ICD-10, MeSH
• Term Browser: Tool provided by Averbis to load, view, modify and export terminologies. It can also be used to create new terminologies.
• UIMA: Unstructured Information Management Architecture; framework by Apache enabling the generation of analysis pipelines for arbitrary content such as text, image or video data

EXECUTIVE SUMMARY

The initial task in work package 3 is the agreement on a generic architecture for the semantic data platform SEMCARE. This document gives a first overview of the planned system architecture for the project. Architectural considerations are a fundamental step in the development of such a data platform and therefore constitute an essential task right from the beginning of the project.

The architectural design decisions are driven by several dimensions. First, the use cases covered by the project must be defined in order to evaluate the resulting requirements. As the SEMCARE software will be installed within the different partner hospitals, integration into the clinical IT landscape should be simple. Furthermore, aspects of data governance, privacy and security should be kept in mind when developing the system architecture. Another important requirement is the scalability of the system to allow processing of large data sets. Finally, the SEMCARE architecture should be constructed in a way that enables a seamless integration of other platforms and applications, which is also called an ‘Open Architecture’.
This can be achieved by using standard components.

The main goal of the architecture is to provide a framework to extract meaningful information from a broad range of structured and unstructured information in the Electronic Health Record. To this end, several systems and resources have to be integrated within a common framework. Some of these components, such as an extraction platform and a terminology browser, are brought in and adapted by the partners. Others, such as indexing tools and semantic repositories, are available as free software. Domain terminology resources constitute another cornerstone of this framework. Whereas the coverage of existing terminologies is already very good for English, the two other languages addressed by SEMCARE, viz. Dutch and German, are less well served. Filling these terminology gaps will require a combination of automated and manual term acquisition approaches.

KEY WORDS (Wordle style) [1]

[1] http://www.wordle.net/

1. Introduction

1.1. About SEMCARE

1.1.1. Motivation and Background

The exploitation of medical data from clinical trials, and thus the monitoring and improvement of healthcare delivery, is of increasing interest. However, up to 80% of clinical trials fail to meet their patient enrolment quotas on time. This recruitment delay currently costs the pharmaceutical industry up to $8 million per day in lost revenue. The SEMCARE project will provide a more efficient way of recruiting patients, which will help to prevent recruitment delays.

Furthermore, SEMCARE addresses another challenge in the field of health care, which is the identification of rare diseases. Doctors often find such diseases hard to diagnose because they are little known, which results in a number of undiagnosed or even wrongly diagnosed patients. SEMCARE will use available clinical patient data to combine signs and symptoms, thereby detecting undiagnosed patients suffering from rare diseases. This will help to speed up research on this group of diseases. For pharmaceutical companies, every newly diagnosed patient is of great interest, as each one generates up to $300,000 in drug revenue per year.

1.1.2. Project description

The two-year research project SEMCARE ‘Semantic Data Platform for Healthcare’ is funded by the European Commission’s Seventh Framework Programme. The aim of the project is the development of a software platform that facilitates the diagnosis of rare diseases in various health care contexts and supports the selection of appropriate patients for clinical studies, on the basis of the automated, contextual evaluation of existing patient data. SEMCARE will combine current text-mining technologies with multi-lingual terminologies in order to develop solutions for typical problems that arise when interpreting medical narratives, e.g. ambiguities, abbreviations, spelling variations or typos. Testing and optimization of the software for the analysis of routine medical data in clinics will be performed in leading European health centres in Great Britain, the Netherlands and Austria.

1.2. About this document

1.2.1. Aim of this document

A fundamental task in work package 3 is the agreement on a generic architecture for the semantic data platform SEMCARE. This document gives a first overview of the planned system architecture for the project. Architectural considerations are a crucial step in the development of such a data platform, which makes them essential right from the beginning of the project. To design the architecture, the use cases covered by the project and the resulting requirements must first be defined. This deliverable contains only the basic requirements. More specific, user-defined requirements related to the prototype will be provided in D3.2.

1.2.2. Document Structure

This document is structured into four main parts. Following an introduction to the SEMCARE project and this document, the use case on which the project will focus is described. The definition of the use case is necessary for identifying the requirements and the demands made on the platform and the underlying architecture. As a third step, we show the generic design of the SEMCARE architecture and describe the different modules and how they interact.
Finally, the document describes the technical and organizational procedures applied at the hospitals with regard to data privacy and security in the context of the SEMCARE systems and architecture.

2. Application Scenario / Requirements

2.1. Use Case

The three participating European health centres have agreed on a first general use case on which they will focus during the project. The use case is called ‘Risk Stratification and Differential Diagnosis of Patients suffering from transient loss of consciousness’. It is described in detail in the following subsections.

2.1.1. Background and motivation

Cardiovascular disease is the cause of 47% of all deaths in Europe, the majority of which are related to underlying coronary artery disease [2]. Sudden cardiac death accounts for approximately half of coronary artery disease related deaths [3] and also occurs in those with non-coronary artery disease related cardiovascular diseases such as cardiomyopathies and inherited channelopathies. Sudden death is also more prevalent in patients with epilepsy and is often unexplained, in which case it is known as SUDEP (Sudden Unexplained Death in Epilepsy). We are currently unable to determine those who are at greatest risk of SUDEP. The symptom of transient loss of consciousness (T-LOC) occurs in up to 50% of the general population and leads to 1% of all hospital admissions [4,5,6]. A wide range of conditions can lead to T-LOC.
Causes of T-LOC can be broadly categorized as cardiac (such as arrhythmia, in which case it is known as syncope) or non-cardiac (such as epilepsy). Cardiac syncope carries a much more sinister prognosis as it is associated with sudden cardiac death. Fortunately, effective treatments, such as anti-ischemic and heart failure medication and implantation of an implantable cardioverter-defibrillator (ICD), can dramatically improve outcomes. Unfortunately, the clinically assigned aetiology and prognosis of T-LOC are frequently incorrect [4], predominantly due to an inability to differentiate between cardiac syncope and epilepsy and a lack of appreciation of high-risk markers such as exertional T-LOC, T-LOC with palpitation and function and/or pre-existent coronary and/or structural heart disease.

[2] European Cardiovascular Disease Statistics, 2012 edition. European Heart Network and European Society of Cardiology.
[3] Myerburg RJ, Junttila MJ. Sudden cardiac death caused by coronary heart disease. Circulation. 2012 Feb 28;125(8):1043-52.
[4] Fitzpatrick AP, Cooper P. Diagnosis and management of patients with blackouts. Heart. 2006 Apr;92(4):559-68.
[5] Petkar S, Jackson M, Fitzpatrick A. Management of blackouts and misdiagnosis of epilepsy and falls. Clin Med. 2005 Sep-Oct;5(5):514-20.
[6] Brignole M et al. A new management of syncope: prospective systematic guideline-based evaluation of patients referred urgently to general hospitals. Eur Heart J. 2006 Jan;27(1):76-82. Epub 2005 Nov 4.

2.1.2. Approach

A number of phenotypic features can help risk-stratify patients, most of which are available from routine assessment and investigations. Using a semantic data platform, we seek to identify high-risk patient cohorts based on patient-level criteria scattered across heterogeneous clinical data contained in electronic healthcare records (EHRs). Subjects belonging to a universal set of interest will have their electronic medical records processed for natural language expressions that give often detailed descriptions of patients’ clinical history and of procedures or investigations planned or carried out.

In our specific use case, the cases of interest are patients with prior myocardial infarction (MI), syncope of presumed cardiac origin or seizure disorder. The universal set of interest is defined as patients with Transient Loss of Consciousness and/or Sudden Cardiac Arrest and/or sustained Ventricular Arrhythmia and/or Cardiomyopathy and/or Ischemic Heart Disease and/or Seizure Disorder. However, the information extraction methodology we develop and describe is generic and could be adapted to any patient cohort and medical inquiry.

2.1.3. Topics of interest and their (textual) representation in EHRs

This section lists medical topics of interest, such as procedures and investigations, as well as information about patients’ medication and history, that will be used to identify subjects belonging to the universal set of interest described in the use case above. Only the most frequent topics of interest are listed in the table below. Contents of electronic medical records will be processed for typical phrases for topics and attributes. The values of the attributes are assumed to be numeric or Boolean and are therefore not of terminological interest. This means that, e.g.,
“normal ECG” would be represented by the attribute “ECG normal” and the value “true”, and “QRS interval 0.12 s” would be represented by the attribute “QRS interval in seconds” and the numeric value “0.12”. For each topic of interest, some examples of indicative phrases and related attributes are listed in the table below.

Table columns: Topic of Interest; Indicative phrases for topics; Indicative phrases for related attributes (both phrase columns give selected examples in English, Dutch and German).

Topic of Interest: Electrocardiogram
Indicative phrases for topics: “electrocardiogram”, “elektrocardiogram”, “Elektrokardiogram”, “ECG”, “EKG”
Indicative phrases for related attributes: “normal”, “normaal”, “abnormal ECG”, “abnormal electrocardiogram”, “PR Interval Duration”, “atrioventricular”, “atrioventriculaire”, “AV”, “QRS Interval Duration”, “T wave inversion”, “T wave abnormality”, “ST segment depression”, “ST segment elevation”, “Bundle Branch Block”, “RBBB”, “LBBB”, “pathological Q Waves”, “Atrial fibrillation”

Topic of Interest: Exercise Tolerance Test
Indicative phrases for topics: “exercise tolerance test”, “ETT”, “exercise test”
Indicative phrases for related attributes: “normal”, “ischemic”, “ST segment depression”, “T wave inversion”, “blood pressure response”, “ventricular tachycardia”, “VT”, “ventricular ectopics”, “VEs”, “ectopics present”, “ectopics absent”, “couplets”, “triplets”, “salvos”, “PVCs”, “premature ventricular contractions”

Topic of Interest: Holter Monitoring
Indicative phrases for topics: “holter monitoring”, “holter”, “24 hour tape”, “48 hour tape”, “event recorder”
Indicative phrases for related attributes: “normal”, “non sustained VT”, “non sustained ventricular tachycardia”, “nsVT”, “ventricular tachycardia”, “ventricular ectopics”, “VEs”, “ectopics present”, “couplets”, “triplets”, “salvos”, “PVCs”, “premature ventricular contractions”

Topic of Interest: Coronary Angiogram
Indicative phrases for topics: “cardiac catherisation”, “catherization”, “cath”, “angiogram”, “coronary angiogram”
Indicative phrases for related attributes: “normal”, “unobstructed”, “normal coronaries”, “normal coronary arteries”, “normal coronary angiography”, “smooth coronary arteries”, “stenosis”, “stenoses”, “obstruction”

Topic of Interest: Echocardiogram
Indicative phrases for topics: “echocardiogram”, “echocardiografie”, “echo”, “TTE”, “heart scan”, “Echokardiogramm”
Indicative phrases for related attributes: “normal heart”, “no cardiomyopathy”, “normal echo”, “ejection fraction”, “ventricular function”, “ventricular dysfunction”, “poor ventricular function”, “impaired LV”, “impaired left ventricular”, “aortic stenosis”, “mitral stenosis”, “pulmonary hypertension”

Topic of Interest: MRI - Cardiac
Indicative phrases for topics: “CMR”, “CMRI”, “Cardiac MRI”, “MRI Cardiac”, “Kardiales MRI”, “Herz-MRI”
Indicative phrases for related attributes: “normal”, “ejection fraction”, “ventricular function”, “late gadolinium enhancement”, “Scar”, “regional wall motion abnormality”

Topic of Interest: Blood Tests
Indicative phrases for topics: “Blood Tests”, “Bloods”, “Biochemistry”, “Full Blood Count”, “FBC”, “Troponin”, “Toxicology”, “Blutbild”
Indicative phrases for related attributes: “normal”, “abnormal”, “elevated”, “raised”, “low”, “hypoglycemia”

Topic of Interest: Age
Indicative phrases for topics: “age”, “DOB”, “date of birth”, “Geburtsdatum”, “Alter”, “geboortedatum”
Indicative phrases for related attributes: “years old”, “alte”, “alter”

Topic of Interest: Medications
Indicative phrases for topics: “medications”, “meds”, “drugs History”, “is on”, “Medikamente”
Indicative phrases for related attributes: “drug name”, “substance name”, “dose”, “Furosemide”, “Frusemide”, “Metolazone”, “Eplerenone”, “Spironolactone”, “Dosis”

Topic of Interest: Family History of Sudden Cardiac Arrest
Indicative phrases for topics: “family history of”, “sudden cardiac arrest”, “unexplained death”, “brother died suddenly”, “cousin died suddenly”
Indicative phrases for related attributes: “age of death”, “degree of relative”, “first”, “second”, “mother”, “father”, “brother”, “sister”, “aunt”, “uncle”, “son”, “daughter”, “Vater”, “Mutter”, “Onkel”, “vader”, “broer”, “zuster”, “tante”, “oom”, “zoon”, “dochter”

Topic of Interest: Sudden Cardiac Arrest or sustained Ventricular Arrhythmia
Indicative phrases for topics: “VT”, “VF”, “ventricular tachycardia”, “polymorphic VT”, “ventricular fibrillation”, “torsades”, “resuscitated sudden death”, “resuscitated SCD”, “Arrest”, “Cardiac arrest”, “VF arrest”, “Plötzlicher Herztod”, “Sekundentod”
Indicative phrases for related attributes: in context of ventricular fibrillation: “idiopathic”, “no cause”, “no aetiology”, “idiopathisch”, “ohne erkennbare Ursache”

Topic of Interest: Syncope
Indicative phrases for topics: “syncope”, “near syncope”, “presyncope”, “presyncope”, “blackout”, “black-out”, “collapse”, “faint”, “loss of consciousness”, “LOC”, “TLOC”, “T-LOC”, “pass out”, “passing out”, “passed out”, “Ohnmacht”
Indicative phrases for related attributes: “on exertion”, “exertional”, “on exercise”, “exercise related”, “exercise induced”, “stress related”, “catecholamine related”, “emotion induced”, “while running”, “whilst running”, “mid-stride”, “in Verbindung mit Stress”, “prolonged standing”, “prodromal symptoms”, “coughing”, “micturition”, “passing water”, “urinating”, “swallowing”

Topic of Interest: Heart Failure
Indicative phrases for topics: “heart failure”, “HF”, “CCF”, “cardiomyopathy”, “breathlessness”, “NYHA II”, “NYHA III”, “NYHA IV”, “Herzversagen”, “Herzinsuffizienz”
Indicative phrases for related attributes: “Severe”, “Gross”, “Moderate”, “Mild”, “NYHA Class I”, “NYHA Class II”, “NYHA Class III”, “NYHA Class IV”, “NYHA I”

Topic of Interest: Ischemic Heart Disease
Indicative phrases for topics: “Myocardial infarction”, “STEMI”, “nonSTEMI”, “non-STEMI”, “NSTEMI”, “acute coronary syndrome”, “ACS”, “ischaemic heart disease”, “IHD”, “CAD”, “Angina”, “Previous stents”
Indicative phrases for related attributes: Time of event, “STEMI”, “NSTEMI”, “Unstable Angina”, “Stable Angina”, “Troponin rise”, “Previous stents”, “PCI”, “angioplasty”, “CABG”

Topic of Interest: Seizure Disorder
Indicative phrases for topics: “seizure disorder”, “epilepsy”, “seizure”, “fitting”, “fits”, “convulsions”, “limb jerking”, “status epilepticus”, “Krämpfe”, “Epilepsie”, “Anfall”
Indicative phrases for related attributes: “Type”, “Petit Mal”, “Grand Mal”, “Status Epilepticus”, “Frequency”

2.2. Requirements

This section describes the basic requirements arising from the use case. A more detailed description of the user-specific requirements will be provided after the first prototype has been developed. This description will be part of D3.2.

2.2.1. Functional Requirements

In order to identify candidates matching the aforementioned criteria, arbitrary types of free-text documents in patient records have to be gathered, pre-processed and analysed. Hence, in a first stage, interfaces to existing clinical IT systems have to be established to consolidate the data from each relevant resource. This stage generally also includes a data transformation process that maps, for instance, HL7-encoded data to a target schema of a central knowledge store.
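As an illustration of the transformation step just described, the following minimal Python sketch maps a single HL7 v2 PID (patient identification) segment to a flat record. It is not part of the SEMCARE specification: the function name `map_pid_segment`, the target field names `patient_id`, `birth_date` and `sex`, and the synthetic sample segment are assumptions made for this example. A production import would rely on an ETL tool or a full HL7 parser rather than plain string splitting.

```python
from datetime import datetime
from typing import Dict, Optional


def map_pid_segment(segment: str) -> Dict[str, Optional[str]]:
    """Map one HL7 v2 PID segment to a flat record of a hypothetical target schema."""
    fields = segment.strip().split("|")  # HL7 v2 fields are separated by '|'
    if fields[0] != "PID":
        raise ValueError("not a PID segment")

    def field(index: int) -> str:
        # Trailing empty fields may be omitted in HL7 v2, so guard the lookup.
        return fields[index] if index < len(fields) else ""

    # PID-3: patient identifier list; components are separated by '^',
    # and the first component carries the identifier itself.
    patient_id = field(3).split("^")[0] or None

    # PID-7: date/time of birth, written as YYYYMMDD (optionally followed by a time).
    raw_birth = field(7)
    birth_date = (
        datetime.strptime(raw_birth[:8], "%Y%m%d").date().isoformat()
        if raw_birth
        else None
    )

    # PID-8: administrative sex code, e.g. 'M' or 'F'.
    sex = field(8) or None

    # The keys below are illustrative names, not a schema defined in this document.
    return {"patient_id": patient_id, "birth_date": birth_date, "sex": sex}
```

Calling `map_pid_segment("PID|1||12345^^^HOSP^MR||DOE^JOHN||19600101|M")` on this synthetic segment yields a record whose fields could then be written to the target store by the import job.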
Tasks of this kind are well supported by ETL (extraction, transformation, loading) tools such as Talend Open Studio [7] or Pipeline Pilot [8].

[7] http://talend.com/products/talend-open-studio
[8] http://accelrys.com/products/pipeline-pilot

Furthermore, the identification of use case specific criteria (e.g. ‘loss of consciousness whilst running’) within clinical narratives requires an information extraction system that is prepared for the variety of isosemantic lexical and syntactic variants found in the texts. Consequently, for each criterion and attribute of interest, numerous synonymous expressions have to be considered in order to achieve a high recall of relevant candidates. To handle this complexity we will use a Solr search engine combined with several domain terminologies such as SNOMED CT, ICD-10 or MeSH.

One main focus of the SEMCARE platform is end-user support in the criteria refinement process. This is not trivial, as it requires a dialogue with the users in order to acquire custom expressions that enhance the terminological coverage. Details of this refinement process are described in section 3 below.

Another key aspect is the language of the document. Text processing tools have to consider the particular syntax and grammar of a language, and the terminology to be dealt with also has to be language-specific. Furthermore, regional particularities such as punctuation have to be accounted for. Examples are the decimal point in English, as opposed to the decimal comma in German and Dutch, or different units of measurement used for the same laboratory observations.
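The decimal separator issue mentioned above can be illustrated with a minimal Java sketch. It assumes that the language of each document is known in advance; the class and method names are hypothetical and not part of the SEMCARE design. It relies only on the standard java.text.NumberFormat class.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the SEMCARE specification.
class LocaleAwareNumberSketch {

    // Parses a numeric literal using the number conventions of the document's locale.
    // Rejects input that is only partially numeric instead of silently truncating it.
    static double parseValue(String text, Locale documentLocale) {
        String trimmed = text.trim();
        NumberFormat format = NumberFormat.getNumberInstance(documentLocale);
        ParsePosition position = new ParsePosition(0);
        Number parsed = format.parse(trimmed, position);
        if (parsed == null || position.getIndex() != trimmed.length()) {
            throw new IllegalArgumentException(
                "Not a number for locale " + documentLocale + ": " + text);
        }
        return parsed.doubleValue();
    }

    public static void main(String[] args) {
        // The same laboratory value written with German, Dutch and English conventions.
        System.out.println(parseValue("3,5", Locale.GERMAN));
        System.out.println(parseValue("3,5", Locale.forLanguageTag("nl-NL")));
        System.out.println(parseValue("3.5", Locale.ENGLISH));
    }
}
```

The sketch covers the separator only. Differences in units of measurement would need a separate conversion step, which is not shown here.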
2.2.2. Non-Functional Requirements

The non-functional requirements elaborate the performance characteristics of the SEMCARE system.

Requirement: Intuitive user interface
Implementation: The handling of the SEMCARE graphical user interface (GUI) should be easy and intuitive.

Requirement: Transparent ranking
Implementation: The ranking of the results after submission of a user query should be transparent and traceable. Users should be able to understand how they can refine their query in order to get better results.

Requirement: Compatibility of GUI for browsers in use
Implementation: The web-based GUI of the system must be compatible with the browser versions used in the hospitals.

Requirement: Low response time
Implementation: The response times while using the SEMCARE platform should be short in order to provide a user-friendly service. The performance of the system depends on several parameters such as:
a) update phase
b) size of the index and main storage
c) number of parallel requests
d) strategy of authorization

Requirement: Platform independent
Implementation: Each component of the SEMCARE architecture is platform independent as Java will be used for the implementation.

Requirement: Security / privacy
Implementation: It must be guaranteed that only authorized people can access the clinical data.

3. Architecture

3.1. Overview

3.1.1. Involved Systems

Figure 1 shows an overview of the different systems involved in the SEMCARE architecture and how data is transferred from one system to another.

Figure 1: Systems involved in the SEMCARE architecture

Each of the systems is briefly described below.

Production data system
The production data system contains the hospital production data, which may be structured or unstructured and is spread over different sources. Possible components of the system are:
• Databases
• File storage
• HL7v2 messages
• Multiple components that constitute a HIS (hospital information system)

Staging data system
The staging data system is a copy of the hospital production data used for feeding the SEMCARE staging system. The hospital data is copied because operating directly on the live data is usually not allowed. By operating on a copy of the data, potential damage to the live system is avoided. The staging data system has the same components as the production system:
• Databases
• File storage
• HL7v2 messages
• HIS (hospital information system)

SEMCARE staging system
The SEMCARE staging system reads the data from the hospital staging system. This is done via an ETL process that aggregates data from different data sources into one data store. A common tool for such an ETL process is Talend Open Studio.
Once the data is loaded, patient data of interest is analysed, and the resulting data populates the SEMCARE staging databases as well as the Solr index. The different components are:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Database importer process: An ETL process that loads data from the staging data system into the SEMCARE staging system.
• Solr server and index: Indexes documents and searches indexes.
• Graph database: Stores concept hierarchies and relations between documents and concepts. For now, this is an experimental extension to the system. Whether it adds value to the SEMCARE platform will be evaluated further.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for testing: Provides capability for configuring and testing the staging system.

SEMCARE production system
The SEMCARE production system contains the structured data exported from the SEMCARE staging system. It is the system that end users use to perform search queries and view reports. The system contains the following components:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Solr server and index: Indexes documents and searches indexes.
• Graph database: Stores concept hierarchies and relations between documents and concepts. For now, this is an experimental extension to the system. Whether it adds value to the SEMCARE platform will be evaluated further.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for end users: The portal for building queries and searching the system.

3.1.2. Architecture sketch

Figure 2 shows an overview of the complete architecture planned for the semantic analysis platform SEMCARE. The individual components have been described in section 3.1.1 above.
Furthermore, the figure shows that it will be possible to apply third party tools to the data store of the SEMCARE production system in order to perform further analytics such as visualisation or statistics. This will be enabled by providing a common data model (the i2b2 star schema) that can easily be used by third party applications (e.g. tranSMART, QlikView, Rapidminer). As a consequence, hospitals can install third party tools if they want to use them on the SEMCARE data.

Figure 2: Architecture sketch

3.1.3. Architecture Layering

The SEMCARE system can be divided into three layers, which are described in the following paragraphs from bottom to top and shown graphically in Figure 3 below.

The bottom layer contains the data sources, which consist of the different types of patient data arising in a hospital, for example unstructured data such as discharge summaries or findings reports, and structured data such as lab results or other routine data acquired and structured for health care, research and quality assessment. Coded data, which is mainly used for reimbursement, may also be available. The data is scattered over different databases or stored in files of different formats (e.g. Word, XML, Text, and PDF). Data may also be available as messages, generally in HL7v2 format as a universal health care messaging standard.

The second layer is the semantic middleware. First, it contains tools for information extraction, ETL and text mining as an interface to the data sources.
The loaded and analysed data is then stored in a unifying semantic database. This layer also includes terminologies and texts stored in a graph database and Solr index. The third part of the middleware is the communication between the SEMCARE data store and the topmost layer, which is the presentation layer.

The presentation layer is the highest level and represents the interface to the user, who could be a researcher, clinician or administrator. Possible components of the presentation layer are:
• the terminology editor
• a search interface including a query generator
• dashboards and analytics
• study management tools

Figure 3: Architecture layering (from top to bottom: Presentation Layer for researchers, clinicians and administration, with Terminology Editor, Search Interface, Dashboards and Analytics, Study Management; Communication Layer; Semantic Middleware with the Unifying Semantic Database containing data, terminologies and texts in a relational DB, search index and triple store / NoSQL DB, fed by Information Extraction, Text Mining and ETL; Data Sources such as reimbursement data, lab results, discharge summaries and other hospital data)

3.2. Interfaces

The following interfaces between components of the SEMCARE system have been identified:
• Staging data to data importer: Imports data from the hospital information system as documents or messages. Formats to be expected are XML, HL7, plain text, and possibly also JPEG or other formats for scanned documents, or DICOM.
• Data importer to staging Solr: The SolrJ (Solr Java client) API is used for sending patient record information to Solr to be analysed.
• SEMCARE staging portal to staging Solr: The Averbis search REST API will be used for the communication between the two components. This API uses JSON messages to communicate with Solr. Example message definitions are shown in Figure 4.
• SEMCARE production portal to production Solr: The Averbis search REST API will be used for the communication between the two components. Example message definitions are shown in Figure 4.

    public class Request {
        private String query;
        private Integer rows;
        private Integer start;
        private String highlightQuery;
        private List<SortField> sortFields;
        private Boolean facetHighlighting;
        private Integer facetLimit;
        private String facetPrefix;
        private String facetSort;
        private List<Facet> facets;
        private List<Field> fields;
        private List<Param> params;
        private User user;
    }

    public class Result {
        private String query;
        private Integer start;
        private String highlightQuery;
        private Integer numFound;
        private List<Facet> facets;
        private List<Document> documents;
        private String didYouMean;
    }

Figure 4: Averbis search REST API

3.3. Data Models

SEMCARE employs a number of different data formats and systems. These include unstructured input data, relational databases, terminologies, and Solr indexes.

3.3.1. Input Data

The input data for the SEMCARE project may vary with regard to the data source and the data format. For each treatment episode, several sources are of interest:
• documents, either original ones (e.g. findings reports) or aggregated ones (discharge letters)
• messages, e.g. HL7v2 messages
• raw data, e.g. images, measurement data (e.g. ECG)
• database entries

The input data may exhibit different degrees of structure, such as
• unstructured, e.g. free text, images
• semi-structured, e.g. free text with standardized organizing patterns (e.g. headings)
• structured, e.g. tables of lab values
• coded, e.g. LOINC-coded lab values, ICD-10 coded diseases

The SEMCARE system will import these different formats from the various data sources with an ETL process.

3.3.2. Terminologies

Medical terminologies provide meaning identifiers (codes) for terms or groups of synonymous terms, the latter generally referred to as concepts. In SEMCARE, terminologies will enrich the search process with knowledge about the meaning of domain terms, their grouping into concepts, and certain relations between concepts such as broader / narrower. In addition, SEMCARE will enable users to add new concepts and terms to the existing terminology where needed, e.g. when an important synonym is missing. As some terminologies support several languages, they will also allow for multilingual text analysis by grouping terms from different languages into the same concept. The continual process for refining terminologies is described in this section.
Figure 5: Refinement process for criteria

As shown in Figure 5 above, the SEMCARE platform will provide a term browser and a dictionary creator for users to view and edit their terminology. The term browser will be able to import standard terminologies such as SNOMED CT, ICD-10 or MeSH and store them in a relational database (RDB). The users can then build their own terminology by enhancing, merging, or modifying existing terminologies. The most important medical terminologies are contained in the UMLS Metathesaurus, which is a rich source of synonyms in different languages and also groups concepts into top-level categories via the UMLS Semantic Network. We will make use of this by enhancing the user interface of the term browser so that non-English terms can also be used to search for concepts.

In all stages of the terminology creation process, the terminology can be exported to the AEP analysis pipeline. The terminology can then be used to index and search documents via the SEMCARE search interface. When users build their search query, they may find that their terminology needs to be modified in order to produce better search results. They can then go back to the term browser to make changes to the terminology. This refinement process is crucial for optimizing the SEMCARE platform. Users should be able to quickly see how terminology changes affect search facets and results.
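The grouping of synonymous and multilingual terms into concepts, and the addition of a user-defined synonym during refinement, can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the class name, the case-insensitive matching and the concept identifier C_SYNCOPE are hypothetical placeholders, not identifiers from UMLS or from the SEMCARE term browser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory terminology for illustration; the real store would be the RDB.
class TerminologySketch {

    // Maps a normalized term (any language) to the identifier of its concept.
    private final Map<String, String> termToConcept = new HashMap<>();

    // Registers a term as a synonym of the given concept.
    void addTerm(String conceptId, String term) {
        termToConcept.put(normalize(term), conceptId);
    }

    // Returns the concept for a term, or empty if the terminology does not cover it.
    Optional<String> lookup(String term) {
        return Optional.ofNullable(termToConcept.get(normalize(term)));
    }

    // Assumption: matching ignores case and surrounding whitespace.
    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        TerminologySketch terminology = new TerminologySketch();
        // English and German terms grouped under one placeholder concept.
        terminology.addTerm("C_SYNCOPE", "syncope");
        terminology.addTerm("C_SYNCOPE", "loss of consciousness");
        terminology.addTerm("C_SYNCOPE", "Ohnmacht");

        // A term the user misses during query building ...
        System.out.println(terminology.lookup("blackout").isPresent());
        // ... is added as a custom synonym and is then resolved to the same concept.
        terminology.addTerm("C_SYNCOPE", "blackout");
        System.out.println(terminology.lookup("blackout").orElse("not covered"));
    }
}
```

Because every term resolves to one concept identifier, a query phrased in one language can match documents written in another, which is the multilingual effect described in section 3.3.2.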
Although there is a certain preference for SNOMED CT, ICD-10, and MeSH, a final decision on which terminologies to use for annotation will have to be made at the start of the work in WP2. Another decision to be made is how the known vocabulary gap for Dutch and German will be filled. One possible strategy is the use of machine translation, together with human review of the terms generated by this method. Manual additions to the terminologies, mainly driven by the use case, will be the option of choice wherever queries have to be fine-tuned.

3.3.3. i2b2 Star Schema

In order to use a standard schema for data storage and to provide a common data model that is also widely used by third party providers (e.g. tranSMART), the i2b2 star schema will be used in SEMCARE to store the data. In the i2b2 star schema, observations or, more precisely, factoid (fact-like) statements are stored in the observation_fact table and linked to four so-called “dimension” tables for patient, visit, provider and concept details. These dimension tables contain descriptive information about the factoid statements. Figure 6 below shows an overview of the i2b2 star schema.

Figure 6: i2b2 star schema

i2b2 also uses metadata tables to define terminologies. SEMCARE terminologies can be stored in the i2b2 custom_meta table (Figure 7). This table stores hierarchical terminologies that are used to build queries in the i2b2 query and analysis tool. The c_fullname column is used to store the full path of each term, with the '\' character delimiting the hierarchical levels.
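As a minimal sketch of the path convention just described, the following Java helper joins hierarchy levels into a c_fullname value. Several details are assumptions rather than statements from this document: the leading and trailing delimiter, the zero-based hierarchy level for c_hlevel, and the example term names. The length limit of 700 characters is taken from the column type listed in Figure 7.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; names and conventions are assumptions.
class FullNameSketch {

    static final char DELIMITER = '\\';
    // c_fullname is declared as character varying(700) in Figure 7.
    static final int MAX_FULLNAME_LENGTH = 700;

    // Joins the hierarchy levels into one path, delimited by backslashes.
    // Assumption: the path starts and ends with the delimiter.
    static String fullName(List<String> levels) {
        StringBuilder path = new StringBuilder().append(DELIMITER);
        for (String level : levels) {
            if (level.indexOf(DELIMITER) >= 0) {
                throw new IllegalArgumentException("Level contains the delimiter: " + level);
            }
            path.append(level).append(DELIMITER);
        }
        if (path.length() > MAX_FULLNAME_LENGTH) {
            throw new IllegalArgumentException("Path exceeds " + MAX_FULLNAME_LENGTH + " characters");
        }
        return path.toString();
    }

    // Assumption: the root term has hierarchy level 0.
    static int hierarchyLevel(List<String> levels) {
        return levels.size() - 1;
    }

    public static void main(String[] args) {
        List<String> levels = Arrays.asList("SEMCARE", "Syncope", "Exertional");
        System.out.println(fullName(levels));       // path for the c_fullname column
        System.out.println(hierarchyLevel(levels)); // candidate value for c_hlevel
    }
}
```

Rejecting level names that contain the delimiter keeps the path unambiguous when it is later split back into levels.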
After the custom_meta table has been filled with SEMCARE terminologies via an import process, concept_dimension entries can be created that link to the custom_meta terms.

custom_meta
c_hlevel              integer
c_fullname            character varying(700)
c_name                character varying(2000)
c_synonym_cd          character(1)
c_visualattributes    character(3)
c_totalnum            integer
c_basecode            character varying(50)
c_metadataxml         text
c_facttablecolumn     character varying(50)
c_tablename           character varying(50)
c_columnname          character varying(50)
c_columndatatype      character varying(50)
c_operator            character varying(10)
c_dimcode             character varying(700)
c_comment             text
c_tooltip             character varying(900)
m_applied_path        character varying(700)
update_date           timestamp without time zone

Figure 7: i2b2 custom_meta table

An example of how a custom terminology may look in the i2b2 term navigator is shown in Figure 8.

Figure 8: Custom metadata in i2b2 term navigator

In addition to the standard i2b2 tables, a new table will be created to map i2b2 records to Solr documents. This table will contain the encounter_num key, the original unstructured record, the Solr document and ID, and a copy of the CAS (Common Analysis System) object from the text analysis. Figure 9 shows this additional SEMCARE record table and its relation to the existing i2b2 tables.

Figure 9: Solr to i2b2 mapping

3.3.4. SEMCARE Patient Record Solr Document

Solr documents will be used to store patient record information for text search.
Each Solr document will contain IDs that map the Solr document to corresponding records in the i2b2 database (see also Figure 10 below). With this linkage, only data required for search indexing will be stored in the Solr document, and additional information can be pulled from the i2b2 database if needed. Dynamic fields can be used in the Solr document to store multiple concepts.

References to terminology codes or concepts are stored in Solr as CUIs (concept unique identifiers) to enable multilingual searches. Preferred terms and synonyms will not be stored in Solr because all documents and queries will be processed by the AEP, which replaces synonyms and preferred terms with CUIs before the query is sent to Solr.

The Solr system will provide a faceted search, which means that the search results are organized according to a faceted classification system, thus allowing the user to explore a collection of information by applying multiple filters. Facets correspond to properties of the search result.

Solr will store multiple dynamic fields for each concept:
• a list of all the types of concepts used for faceting (Note that this field is the set of all concept types in the document and has no linkage to the relational database. Only individual concepts are linked to the database.)
• a value for searching
• an ID to link to the relational database
• a path for hierarchical faceting

For example, for a medication with the CUI a1234, Solr would store the following fields:
• concept_medications=“a1234,b5678,c2313”
• concept_med_val_a1234=50
• concept_med_id_a1234=123456
• concept_med_path_a1234=/c1000/b1023/a1234

Figure 10: Mapping of Solr documents to i2b2 database

3.3.5. SEMCARE Data Loading Flow

The data loading flow begins when the data importer ETL process loads unstructured data. The unstructured data is stored in the relational database and then sent to Solr to be analysed and indexed. The Solr process and the text analysis pipeline store data in a graph database, e.g. Neo4j [9], and build the Solr index. Finally, the structured data from the analysis is added to the relational database to enhance the unstructured data. A diagram of the data import flow is shown in Figure 11.

[9] http://www.neo4j.org/

Figure 11: SEMCARE data loading flow

The data importer process could be created with an ETL tool such as Talend Open Studio. Figure 12 below shows an example Talend job that reads a directory of plain text files and commits them to Solr and a PostgreSQL database.
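The order of the loading steps in section 3.3.5 can be sketched in plain Java. This is a toy illustration, not the SEMCARE implementation: the relational database, the Solr index and the graph database are replaced by in-memory collections, the text analysis is a keyword placeholder instead of the AEP, and all class, field and concept names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the loading flow; every store here is an in-memory placeholder.
class LoadingFlowSketch {

    final Map<String, String> rawRecords = new LinkedHashMap<>();               // RDB: unstructured text
    final Map<String, List<String>> structuredFacts = new LinkedHashMap<>();    // RDB: analysis results
    final Map<String, Map<String, String>> searchIndex = new LinkedHashMap<>(); // stand-in for the Solr index
    final List<String> graphEdges = new ArrayList<>();                          // stand-in for the graph database

    // Placeholder for the text analysis pipeline: detects two hard-coded example concepts.
    static List<String> analyse(String text) {
        List<String> concepts = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        if (lower.contains("syncope")) concepts.add("c_syncope");
        if (lower.contains("furosemide")) concepts.add("c_furosemide");
        return concepts;
    }

    void load(String recordId, String text) {
        // Step 1: the importer stores the unstructured record in the relational database.
        rawRecords.put(recordId, text);
        // Step 2: the record is sent on for analysis.
        List<String> concepts = analyse(text);
        // Step 3: document-to-concept relations go to the graph store, and the index is built.
        for (String concept : concepts) {
            graphEdges.add(recordId + "->" + concept);
        }
        Map<String, String> document = new LinkedHashMap<>();
        document.put("id", recordId);
        document.put("concepts", String.join(",", concepts));
        searchIndex.put(recordId, document);
        // Step 4: the structured results are written back to the relational database.
        structuredFacts.put(recordId, concepts);
    }

    public static void main(String[] args) {
        LoadingFlowSketch flow = new LoadingFlowSketch();
        flow.load("rec-1", "Syncope on exertion; is on Furosemide 40 mg.");
        System.out.println(flow.searchIndex.get("rec-1"));
        System.out.println(flow.graphEdges);
    }
}
```

Keeping the raw record and the derived facts under the same record ID mirrors the linkage between Solr documents and i2b2 records described in section 3.3.4.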
Figure 12: Talend Open Studio data importer

3.4. Modules & Functional View

The SEMCARE system contains the following modules and components, as shown in Figure 13 below and described in this section.

Figure 13: SEMCARE components

3.4.1. SEMCARE Data Importer

The SEMCARE data importer is the entry point for health care data in the SEMCARE system. It could be an ETL process defined by a tool such as Talend Open Studio, or a custom coded software process. When it receives data, the data importer will write the unstructured data to the database and then send the unstructured data to Solr for analysis.

3.4.2. Solr

Solr is an open source search platform from the Apache Lucene project. In the SEMCARE project it is used to index and search patient record data. Solr will use the Averbis text analytic tools to create structured data from unstructured text. After the text is analysed, Solr will write the structured data to the database.

3.4.3. Averbis Text Analytics (AEP)

The Averbis Extraction Platform (AEP) is a text analysis tool that can readily be applied to arbitrary information extraction scenarios. It provides solutions to extract individual information units, such as facts and relations, that have the highest relevance for a user from unstructured text.
The AEP consists of a number of modular text analysis components, so-called Analysis Engines (AEs), which are combined within the Apache UIMA framework [10] to build an overall solution for different use cases. Depending on the requirements, rule-based methods, statistical methods or a combination of both are used to extract the semantics of the content. Annotations are exchanged between AEs using an object named Common Analysis System (CAS). The CAS is UIMA’s object-based data structure that allows memory-based storage and exchange of annotations with respect to pre-defined type systems of hierarchically organized annotations. This data structure provides a common basis for analysing unstructured text.

[10] http://opennlp.apache.org

3.4.4. SEMCARE Portal Web Application

The SEMCARE portal provides a graphical user interface that allows users to build queries on the clinical data and to manage the system. Users will get immediate feedback from a search, which helps them to decide how to refine their query in order to get better results. The portal will also provide users with an interface for defining and refining terminologies. More specific requirements and details about the user interface will be provided in D3.2.

3.4.5. i2b2 Applications

i2b2 tools and components such as the i2b2 query and analysis tool can be installed in the system if needed. i2b2 runs on the JBoss application server.

3.4.6. Third Party Tools and Applications

Third party tools can also be installed in the system as required.
These tools could interface with the i2b2 database or the Solr server, but because of the varying requirements and functionality of third party applications, they are not shown in Figure 13 or described in detail here.

3.4.7. Scalability

All of the components in the SEMCARE system can be deployed across multiple machines to support the processing of large data sets if needed. Multiple data importer processes can be launched to read input data. SolrCloud can be used to distribute Solr indexes and search processing across multiple machines. The Averbis text analysis pipeline can also be deployed as a distributed system. By adding more machines and distributing SEMCARE components, the SEMCARE system can scale to meet the processing requirements of large data sets.

3.5. Users & Roles

In the context of the SEMCARE project, different types of users can be distinguished. Their roles are briefly described below:

Production Database Administrator
The production database administrator manages the copying of production patient data into the staging data system. He/she also manages the subsequent export to the SEMCARE staging system via the ETL process. The production database administrator is located at the hospital site.

SEMCARE Administrator
The SEMCARE administrator is responsible for managing the SEMCARE databases, the Solr configuration and the SEMCARE portal. He/she configures the terminology and the text analysis. The administrator manages the copying of data from the SEMCARE staging environment to the SEMCARE production environment and creates custom dashboards, scripts and third party integrations.

SEMCARE User
Typically, SEMCARE users will be researchers and clinicians who use the SEMCARE portal for search and analytics.
The role concept will be further verified during the project and refined if needed. Furthermore, it must be guaranteed that all roles have access rights, at the level of the hospital information system, to the data to be analysed.

3.6. Open points

A few points that are still open and need further clarification in the course of the project are listed below. More specific details about these points will be given in deliverable D3.2.
• One challenge for the SEMCARE platform is the search for constellations of symptoms that are spread over several documents. A strategy will be developed in order to cover this requirement.
• As the SEMCARE system will be installed within the hospital, a further analysis of the IT landscape within the hospitals will have to be performed. The interfaces need to be defined and the interchange formats specified.
• Another point to consider is a possible weighting of criteria for a specific use case. For example, it should be possible to define mandatory and optional criteria when creating a search query.

4. Data Privacy / Technical and organizational security procedures

4.1. Data processing

The data processing within the scope of the SEMCARE project takes place entirely within each participating hospital.
The project integrates into the existing IT landscape of the hospital with regard to admission (physical access), computer access, and data access control for the IT components used (servers and network components). This also covers the security of the particularly sensitive health care data arising in a hospital. The architectural design of the SEMCARE platform permits data processing and storage on separate hard drives where the involvement of different departments and their respective user rights requires it.

4.2. Data transfer and data location

In the scope of the SEMCARE project, patient data will not leave the hospitals at any time. Patient data may, however, be shared between different departments of each hospital. In these cases, the (pseudo-)anonymisation processes already in place will be applied. The de-identification procedures for each of the three participating hospitals are explained in detail in deliverable D1.1.

Regarding test data, SGUL will prepare anonymised data to be used by Averbis GmbH for the development of algorithms, interfaces and the final product. The legal basis for the transfer of such test data is section 251 of the NHS Act 2006. Transferred test data will be encrypted either at rest or in transit. The hospitals EMC and MUG will not provide any data to Averbis or to any other clinical partner.

Both data processing and the operation of the data platform will be performed within a dedicated server infrastructure in the hospital. It will be ensured that no project-related data is stored in locations to which unauthorised persons have access. Furthermore, additional encryption of the data stored, e.g., in the Solr index is possible by using TrueCrypt (http://www.truecrypt.org/).
4.3. Role concept

A role concept will be applied to ensure that only authorised users can access the data related to the SEMCARE project.

What                             Who
Data Upload, Query generation    SEMCARE User
Data Deletion                    SEMCARE Administrator
Create, Edit, Delete Users       SEMCARE Administrator
System maintenance               Local system administrator

A connection to the local LDAP (Lightweight Directory Access Protocol) directory can be implemented in order to take over existing access rights. Activities will be logged so that it can be examined whether personal data has been entered, changed or deleted, and by whom. Only designated personnel will have access to the system components and applications of the SEMCARE platform.

4.4. Availability control

Measures will be considered to protect personal data against accidental destruction or loss. For example, the SEMCARE systems will not work directly on the hospital live data but on a copy (staging system) to ensure that no real patient data is affected in any way. High availability of the SEMCARE platform is not a priority, as the application is not crucial for patient care.

4.5. Data separation control

It must be ensured that data from different scenarios or different departments are separated from each other. The SEMCARE architecture allows this separation if needed, e.g. different Solr indexes can be used. The SEMCARE systems will only be run locally, and queries will only be performed on relevant patient data. Other information that is not relevant for the defined use case will not be extracted from the hospital systems. A development system and a production system will be provided separately.
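The role table in Section 4.3, together with the requirement to log who did what, can be illustrated with a minimal Python sketch. It is not part of the SEMCARE implementation: the action identifiers, the `is_allowed` and `perform` helpers, and the in-memory `AuditLog` are hypothetical names chosen for illustration, and the table is read strictly (a role may only perform the actions listed for it).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Mapping derived from the role table in Section 4.3. The action
# identifiers are hypothetical names invented for this sketch.
PERMISSIONS = {
    "SEMCARE User": {"data_upload", "query_generation"},
    "SEMCARE Administrator": {
        "data_deletion", "create_user", "edit_user", "delete_user",
    },
    "Local system administrator": {"system_maintenance"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())


@dataclass
class AuditLog:
    """In-memory stand-in for the activity log mentioned in Section 4.3."""
    entries: list = field(default_factory=list)

    def record(self, user: str, role: str, action: str, allowed: bool) -> None:
        # Store who attempted what, when, and whether it was permitted.
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "allowed": allowed,
        })


def perform(log: AuditLog, user: str, role: str, action: str) -> bool:
    """Check the permission and log the attempt, whether or not it is allowed."""
    allowed = is_allowed(role, action)
    log.record(user, role, action, allowed)
    return allowed
```

In a real deployment the role of a user would presumably come from the local LDAP directory rather than being passed in by the caller, and the log would be written to persistent storage; both are left out here to keep the sketch self-contained.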