Semantic Data Platform for Healthcare

ICT-611388
Lead beneficiary: Averbis
D1.1 Ethics Guidelines and Procedures
Date: 04/06/2014
WP1 – Scientific Coordination
Nature: Report
V3.0 Final
Dissemination level: PU (Public)
© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union’s
Seventh Framework Programme for research, technological development and demonstration under grant agreement No 611388.
D1.1 – Ethics guidelines and procedures
WP1: Scientific Coordination
Dissemination level: Public
Author: SEMCARE Consortium
Version: 3.0
TABLE OF CONTENTS
DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
EXECUTIVE SUMMARY
KEY WORDS (WORDLE STYLE)
1. INTRODUCTION
2. PRIVACY RULES REGARDING PROCESSING OF HEALTHCARE DATA
  2.1. SCOPE AND DEFINITIONS OF THE DIRECTIVE 95/46/EC (‘DIRECTIVE’)
  2.2. PRINCIPLES OF THE DIRECTIVE
  2.3. SUPERVISORY AUTHORITY
  2.4. SUMMARIZING THE IMPORTANT ASPECTS OF THE DIRECTIVES FOR THE SEMCARE PROJECT
3. REPORTS OF THE RESPECTIVE SEMCARE-PROJECT HEALTHCARE DATABASE PARTICIPANTS
  3.1. EMC DATA BASE
  3.2. MUG DATA BASE
  3.3. SGUL DATA BASE
4. GENERAL PROCEDURES IN THE SEMCARE PROJECT FOR PROCESSING HEALTHCARE DATA
  4.1. OVERALL STRATEGY
    4.1.1. Overall strategy
    4.1.2. Data mining / data extraction
    4.1.3. Definition of the project architecture
    4.1.4. Data anonymisation
    4.1.5. Data leaving the hospitals to test the software
    4.1.6. SEMCARE ‘global’ use case
  4.2. EMC
    4.2.1. Protocol for de-identification
    4.2.2. Data management for research
    4.2.3. Adaptation of the SEMCARE software
    4.2.4. Installation of a foreign code in the hospitals
    4.2.5. Clinical connectors
  4.3. MUG
    4.3.1. Protocol for de-identification
    4.3.2. Data management for research (can include the generation of dummy data)
    4.3.3. Adaptation of the SEMCARE software
    4.3.4. Installation of a foreign code in the hospitals
    4.3.5. Clinical connectors
  4.4. SGUL
    4.4.1. Protocol for De-identification
    4.4.2. Data Management for Research
    4.4.3. Adaptation of the SEMCARE Software
    4.4.4. Installation of a Foreign Code in the Hospital
    4.4.5. Clinical Connectors
    4.4.6. Risks during Data Processing and Possible Solutions
5. OVERVIEW/CONCLUSION
ANNEX I. MUG'S PROCESSING OF DOCUMENTS USING THE AVERBIS SOFTWARE
DOCUMENT INFORMATION
Grant Agreement
  Number: ICT-611388
  Acronym: SEMCARE
  Full title: Semantic Data Platform for Healthcare
  Project URL: www.semcare.eu
  EU Project officer: Saila Rinne ([email protected])
Deliverable
  Number: 1.1
  Title: Ethics guidelines and procedures
Work package
  Number: 1
  Title: Scientific Coordination
Delivery date
  Contractual: 31/03/2014
  Actual: 04/06/2014
Status: Version V3.0, Final
Nature: Report
Dissemination Level: Public
Authors (Partner): AVERBIS, SYNAPSE, EMC, MUG, SGUL
Responsible Author
  Name: Philipp Daumke
  Email: [email protected]
  Partner: AVERBIS
  Phone: (not stated)
DOCUMENT HISTORY
NAME                  DATE      VERSION  DESCRIPTION
All partners          23/05/14  1.0      First draft
A. Honrado (SYNAPSE)  26/05/14  2.0      Internal review
D. Kalra              31/05/14  2.1      Ethics review
All partners          02/06/14  2.2      Feedback review and changes
A. Honrado (SYNAPSE)  04/06/14  3.0      Final version
DEFINITIONS

Partners (also named as beneficiaries) of the SEMCARE Consortium are referred to herein
according to the following codes:
AVERBIS - Averbis GmbH (Germany) Coordinator
EMC - Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
MUG - Medical University of Graz (Austria) – Beneficiary
SGUL - Saint George's University of London (UK) – Beneficiary
SYNAPSE - Synapse Research Management Partners S.L. (Spain) – Beneficiary

Anonymisation: process of de-identification of data by suppressing or generalising values of
attributes that identify a person; no retracing to the real person is possible.

Encryption: process of encoding information in such a way that only authorised parties can read
it.

ETL: extract – transform – load; Process in data warehousing that is often used to integrate data
from multiple sources. A common ETL tool is Talend Open Studio.

HTTPS: Hypertext Transfer Protocol Secure; communications protocol for secure communication
over a computer network using the SSL/TLS protocol.

LDAP: Lightweight Directory Access Protocol; standard application protocol for accessing and
maintaining distributed directory services over a network. Directory services allow the sharing of
information about users, systems, networks, services, and applications throughout the network.

NFS: Network File System; distributed file system protocol allowing a user on a client computer to
access files over a network.

Project: The sum of all activities carried out in the framework of the Grant Agreement.

Risk: Uncertainty that may have a significant impact on the execution or outcome of the project,
and whose effect may be negative (a threat risk) or positive (an opportunity risk).

SSL/TLS: cryptographic protocols which are designed to provide communication security over the
Internet.

TrueCrypt: freeware application used for on-the-fly encryption.

Virtual Machine (VM): software-based emulation of a computer.

Pseudo-anonymisation: process of de-identification of data by suppressing or generalising
values of attributes that identify a person; a retracing to the real person is possible for authorised
people by using a mapping table that provides the personal data for a specific identifier.
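
The difference between the two de-identification definitions above can be illustrated with a minimal Python sketch. The field names, the record layout and the use of a random token as identifier are assumptions made for illustration only; they are not taken from the SEMCARE software or from any partner's de-identification protocol.

```python
import secrets

# Hypothetical attributes that identify a person directly and are suppressed.
SUPPRESSED = ("name", "patient_id")


def anonymise(record):
    """Suppress direct identifiers and generalise the date of birth to a year.

    No link to the real person is kept, so retracing is not possible.
    """
    out = {k: v for k, v in record.items() if k not in SUPPRESSED}
    if "date_of_birth" in out:
        # Generalisation: keep only the year (assumes an ISO date string).
        out["birth_year"] = out.pop("date_of_birth")[:4]
    return out


def pseudo_anonymise(record, mapping_table):
    """De-identify as above, but keep a random identifier in the result.

    The original identifying values are stored in mapping_table under that
    identifier, so authorised people holding the table can retrace the person.
    """
    identifier = secrets.token_hex(8)
    identifying = SUPPRESSED + ("date_of_birth",)
    mapping_table[identifier] = {k: record[k] for k in identifying if k in record}
    out = anonymise(record)
    out["identifier"] = identifier
    return out


record = {
    "name": "Jane Doe",
    "patient_id": "12345",
    "date_of_birth": "1970-01-01",
    "diagnosis": "I10",
}
mapping_table = {}
anonymised = anonymise(record)
pseudonymised = pseudo_anonymise(record, mapping_table)
```

In this sketch the anonymised record carries no key back to the patient, whereas the pseudo-anonymised record can be resolved by whoever is authorised to read the mapping table, which would therefore need to be stored separately and protected.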
EXECUTIVE SUMMARY
This document describes the general ethical and privacy procedures that have been applied and are
proposed during the SEMCARE project.
SEMCARE addresses ethical issues by ensuring that the re-use of identifiable patient data is appropriate
and complies with legal requirements on data privacy. The current EU-level legislation applicable to the
clinical scenarios, as well as the transposition of the EU legal texts into national law, which has resulted
in some degree of variability, will be taken into account.
At this point (M5) all ethical approvals for secondary use of data have been requested by the clinical
partners, and their respective ethical committees have raised no objection to the processing. It is
expected that the ethical approvals will be available and sent to the Commission Services in the short term
and included in the deliverable D1.3 Report on ethical framework and procedures in the project.
KEY WORDS (Wordle style)
[Word cloud figure, generated with Wordle: http://www.wordle.net/]
1. Introduction
Healthcare is a data intensive enterprise – a wide range of teams and institutions in a variety of settings
need to access patient data in order to provide safe and appropriate care to a patient. In all Member
States of the European Union, and indeed most countries in the world, the collection, processing,
sharing and storage of identifiable patient data is regulated by a framework of legislation, ethical
requirements and professional regulations. These frameworks require that healthcare professionals and
their support staff observe strict rules of patient consent, privacy and confidentiality when data are
collected, processed or shared for the purposes of providing care.
However, the legislative framework of data protection is not the only set of rules which impacts on the
handling of patient data. The ethical principles, based on patient autonomy, and working through
consent and confidentiality, are more familiar, and are also professionally relevant to healthcare
professionals, who need to balance patients' medical needs with perceived needs of privacy. Thus, the
level of consent ethically required for the sharing of patient data is influenced by the purpose for which
data are to be used. For immediate and on-going care, “implied” consent is generally regarded as
sufficient, while for “secondary” purposes, such as financial and clinical audit, or when data are used
for research purposes, informed - usually written - consent is needed. If patient data are fully
anonymised, then consent is not required for secondary use, although it is good practice for patients to
be made aware of this use of their information, even if they cannot be identified.
Although at the EU level there is a sound legal basis for Data Protection, Privacy and Human Rights, it
is important to note that the transposition of the EU legal texts into national law has resulted in some
degree of variability, which has in turn created some uncertainty both at national and EU level. The
upcoming review of the EU Data Protection Directive (Directive 95/46/EC), to give rise to a new Regulation,
offers a unique opportunity to investigate the impact of the current legislation on eHealth services and to
propose new interpretations and refinements which will allow EU legislation to serve as a facilitator of
eHealth, rather than as the hurdle it is often portrayed as today.
All SEMCARE partners comply with the Charter of Fundamental Rights of the European Union as
well as the relevant international directives and declarations on ethical issues as detailed by the EU for
FP7. Care is taken that all experiments are sanctioned by government officials, including local and
national ethics committees. All experiments conform to national and EU Directives on the confidentiality
of personal data. Where patient material is used, informed consent is obtained and patient data are
secured by anonymisation (data protection). The partner institutions respect and enforce all relevant
international codes of practice, such as:

the ethical standards of the 7th Framework Programme (to mention only Article 3: all research
will be carried out in compliance with fundamental ethical principles)

the Charter of Fundamental Rights of the European Union, signed in Nice, 7 December 2000

the Convention on Human Rights and Biomedicine – Oviedo, 4/4/1997 – Council of Europe

the Declaration of Helsinki (Fortaleza, October 2013) – World Medical Association

the principle that the ethical review within the European Commission's evaluation procedure does NOT
replace local ethics committee or local authority approval
This document aims to address the ethical issues in the project, which mainly concern the use of
anonymised clinical data (WP6) related to the use case that has been selected for the project.
Furthermore, the consortium follows the advice of an Ethics Advisor, Prof. Dipak Kalra, who acts as an
external and independent expert that monitors all SEMCARE’s ethical issues, ensuring that there is
compliance with national and EU legislation governing data privacy as well as any other ethical
considerations that may arise in the course of the project.
2. Privacy rules regarding processing of healthcare data
Personal privacy is a highly respected principle in the European Union (EU). All member states are
signatories of the European Convention on Human Rights (ECHR) from the Council of Europe.
According to article 8 of the ECHR people have, subject to certain restrictions, a fundamental right to
respect for one's "private and family life, his home and his correspondence". This right is embedded
into national legislation of most Member States. Data privacy laws, however, vary widely across Europe.
The European Commission (EC) realized that this diversity of national legislation impedes uniform data
protection and the free flow of data within the EU. Therefore, in 1995 the EU adopted Directive
95/46/EC on the protection of individuals with regard to the processing of personal data and on the free
movement of such data (‘Directive’) to harmonize data protection regulation within the EU. The Directive
regulates the processing of personal data and the free movement of such data and had to be
implemented into national law by the end of 1998. Currently all Member States have implemented it
within their own national data protection legislation. The Directive is not a ‘closed regulatory system’
and leaves open a certain scope for policy making at national level. However, certain minimum
requirements must be complied with. These also apply to the processing of personal data for scientific
purposes, such as the processing of data in and for the SEMCARE project.
The SEMCARE platform will be installed in the departments of hospitals and connected to the local
access control system within each site, making data only accessible to physicians that are allowed to
see the data according to the local hospital policies. This way, physicians will be able to analyse the
patient records of their department, and thus to conduct retrospective analyses.
Due to the sensitive nature of personal health data, it is important for SEMCARE to be fully aware
of ethical and regulatory aspects and to implement all reasonable measures to ensure compliance with
ethical and regulatory requirements on privacy.
Although all EU participants have different national legislations regarding privacy (and informed
consent), the aforementioned Directive applies to all of them and can function as a base and point of
reference in this document. Section 2.2 summarizes aspects and articles of the Directive most relevant
to the SEMCARE project.
In Section 3 SEMCARE’s clinical partners have briefly set out how they individually deal with these
(ethical and regulatory) aspects at the national level and which procedures are implemented in their
organisation/country to comply with the applicable regulations.
Furthermore, any ethical issues regarding the use of electronic health care databases encountered during
the project shall be reported by the parties encountering them. They shall report these issues to the
scientific coordinator in writing, together with a description of how they have been solved. These reports
shall be included either in the deliverable D1.3 Report on ethical framework and procedures in the
project (due in month 12) or in the deliverable D1.4 Final report on ethical issues and data privacy (due
in month 24).
2.1. Scope and definitions of the directive 95/46/EC (‘Directive’)
In the Directive (art. 2 sub a) ‘personal data’ are defined as "any information relating to an identified or
identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly
or indirectly, in particular by reference to an identification number or to one or more factors specific to
his physical, physiological, mental, economic, cultural or social identity". This definition is very broad.
Data are ‘personal data’ when someone is able to link the information to a person, even if the person
holding the data cannot make this link.
For the purpose of the Directive (art. 2 sub b) ‘processing’ means "any operation or set of operations
which is performed upon personal data, whether or not by automatic means, such as collection,
recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by
transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure
or destruction”.
The "controller" (art. 2 sub d), meaning “the natural or legal person, public authority, agency or any
other body which alone or jointly with others determines the purposes and means of the processing of
personal data” is the one responsible for compliance (art. 6 sub 2).
According to article 4 of the Directive the data protection rules are not only applicable when the controller
is established within the EU, but whenever the controller uses equipment situated within the EU in order
to process data. Controllers from outside the EU, processing data in the EU, will have to follow the EU
data protection regulation.
2.2. Principles of the directive
Personal data may not be processed, except when certain conditions are met. These conditions fall into
three categories: legitimate purpose, transparency and proportionality.
Legitimate purpose
Article 6 sub 1(b) states that personal data can only be processed for specified, explicit and legitimate
purposes and may not be processed further in a way incompatible with those purposes. Further
processing of data for historical, statistical or scientific purposes shall not be considered as
incompatible provided that Member States provide appropriate safeguards.
Transparency
Personal data may be processed only under the following circumstances (art. 7):

when the data subject has given his/her consent; or

when the processing is necessary for the performance of a contract to which the data subject
is party or in order to take steps at the request of the data subject prior to entering into a
contract; or

when processing is necessary for compliance with a legal obligation to which the controller is
subject; or

when processing is necessary in order to protect the vital interests of the data subject; or

when processing is necessary for the performance of a task carried out in the public
interest or in the exercise of official authority vested in the controller or in a third party
to whom the data are disclosed; or

when processing is necessary for the purposes of the legitimate interests pursued by the
controller or by the third party or parties to whom the data are disclosed, except where such
interests are overridden by the interests for fundamental rights and freedoms of the data
subject.
The Directive states that the data subject has the right to be informed when his/her personal data are
being processed and that the controller must provide his/her name and address, the purpose of
processing, the recipients of the data and all other information required to ensure the processing is fair.
If any personal data is collected the controller or his/her representative must provide a data subject from
whom data are collected with at least the following information, except where he already has it (art. 10
and 11):

the identity of the controller and of his/her representative, if any;

the purposes of the processing for which the data are intended;

any further information such as:

    the recipients or categories of recipients of the data,

    whether replies to the questions are obligatory or voluntary, as well as the possible
    consequences of failure to reply,

    the existence of the right of access to and the right to rectify the data concerning him.
The foregoing obligation on providing information to the data subject does not apply where the
data have not been obtained from the data subject himself (e.g. from general practitioners or from
claims) and where, in particular for processing for statistical purposes or for the purposes of historical
or scientific research, the provision of such information proves impossible or would involve a
disproportionate effort or if recording or disclosure is expressly laid down by law. In these cases Member
States shall provide appropriate safeguards.
According to article 12 the data subject has the right to access all data processed relating to him. The
data subject even has the right to demand the rectification, deletion or blocking of data that is
incomplete, inaccurate or is not being processed in compliance with the data protection rules.
Proportionality
According to article 6:

Personal data must be processed fairly and lawfully;

Personal data must be collected for specific and legitimate purposes and not further processed
in a way incompatible with those purposes. Further processing of data for historical,
statistical or scientific purposes shall not be considered as incompatible provided that
Member States provide appropriate safeguards;

Personal data must be adequate, relevant and not excessive in relation to the purposes for
which they are collected and/or further processed;

Personal data must be accurate and, where necessary, kept up to date; every reasonable step
must be taken to ensure that data which are inaccurate or incomplete, having regard to the
purposes for which they were collected or for which they are further processed, are erased or
rectified;

Personal data must be kept in a form which permits identification of data subjects for no longer
than is necessary for the purposes for which the data were collected or for which they are further
processed. Member states shall lay down appropriate safeguards for personal data
stored for longer periods for historical, statistical or scientific use.
The processing of personal data revealing racial or ethnic origin, political opinions, religious or
philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life
is prohibited (art. 8 sub 1), except where (art. 8 sub 2):

the data subject has given his/her explicit consent to the processing of those data, except where
the laws of the Member State provide that the prohibition referred to in paragraph 1 may not be
lifted by the data subject's giving his/her consent; or

processing is necessary for the purposes of carrying out the obligations and specific rights of
the controller in the field of employment law in so far as it is authorized by national law providing
for adequate safeguards; or

processing is necessary to protect the vital interests of the data subject or of another person
where the data subject is physically or legally incapable of giving his/her consent; or

processing is carried out in the course of its legitimate activities with appropriate guarantees by
a foundation, association or any other non-profit-seeking body with a political, philosophical,
religious or trade-union aim and on condition that the processing relates solely to the members
of the body or to persons who have regular contact with it in connection with its purposes and
that the data are not disclosed to a third party without the consent of the data subjects; or

processing relates to data which are manifestly made public by the data subject or is necessary
for the establishment, exercise or defense of legal claims.
Furthermore an exception to this prohibition of article 8 sub 1 is set out in article 8 sub 3: Article 8 sub
1 shall not apply where processing of the data is required for the purposes of preventive medicine,
medical diagnosis, the provision of care or treatment or the management of health-care services, and
where those data are processed by a health professional subject under national law or rules established
by national competent bodies to the obligation of professional secrecy or by another person also subject
to an equivalent obligation of secrecy.
2.3. Supervisory authority
According to article 28 each Member State must set up a supervisory authority, which is an independent
body that:

monitors the data protection level in that Member State;

gives advice to the government about administrative measures and regulations; and

starts legal proceedings when data protection regulation has been violated.

Individuals may lodge complaints about violations with this supervisory authority or in a court of
law.
The controller must notify the supervisory authority before he starts to process data. The notification
contains at least the following information (art. 19):

the name and address of the controller and of his/her representative, if any;

the purpose or purposes of the processing;

a description of the category or categories of data subject and of the data or categories of data
relating to them;

the recipients or categories of recipient to whom the data might be disclosed;

proposed transfers of data to third countries;

a general description of the measures taken to ensure security of processing.
This information is kept in a public register.
2.4. Summarizing the important aspects of the Directive for the SEMCARE project

Data processing of personal data is legitimate for scientific purposes if adequate safeguards
are provided and followed. Data owners need to explain what their safeguards are.

The data subject has the right to access the data processed about him/her; there must therefore be
a registry of the purposes for which, and by whom, the data are processed. Database owners need to
specify how they keep such a registry, and within the project the data warehouse will have to be
designed so that it keeps logs of data processing.

Consent is not necessary where the data have not been obtained from the data subject himself
(as is the case in SEMCARE for all clinical databases) and where, in particular for processing
for statistical purposes or for the purposes of historical or scientific research, the provision of
such information proves impossible (e.g. because data have been de-identified) or would
involve a disproportionate effort or if recording or disclosure is expressly laid down by law. In
these cases Member States shall provide appropriate safeguards.
In the case of SEMCARE, all clinical partners will have in place a data de-identification process
that makes it impossible to identify the subjects, except for authorised people through a
specific procedure.

If patients need to be contacted for collection of additional information relevant to the study,
consent will be requested according to the procedures specified by each of the databases.

Processing of health data may be done by a health professional subject under national law or
rules established by national competent bodies to the obligation of professional secrecy or by
another person also subject to an equivalent obligation of secrecy. Database owners must
provide their conditions for use and processing of the data. The SEMCARE system will be
installed in each clinical centre participating in the project and data will not leave the clinical
environment. Only anonymised data will be transferred outside the clinical environment for the
purpose of testing the system performance. Each clinical partner will be responsible for
managing and maintaining their own data warehouse and will provide passwords only for
persons who are approved to analyse data for the project purposes (see section 4.4.1).

Personal data should be processed adequately and correctly and kept up to date and not stored
longer than necessary. In SEMCARE a careful definition of the storage requirements and
conditions is needed.
3. Reports of the respective SEMCARE-project healthcare database participants
3.1. EMC data base
An extensive description of all issues concerning the use of patient data for research purposes in The
Netherlands can be found on the website of the CCMO (Centrale Commissie Mensgebonden
Onderzoek (“Central Committee on Research involving Human Subjects”), http://www.ccmo.nl/en/).
Comprehensive information on the legal framework can be found at http://www.ccmo.nl/en/legalframework.
Briefly, if a study falls under the scope of the WMO (Wet medisch-wetenschappelijk onderzoek met
mensen, “Medical Research Involving Human Subjects Act”), then it must undergo a
prior review by an accredited medical ethics committee (MEC). There are 24 accredited MECs in the Netherlands that review
medical/scientific research proposals (http://www.ccmo.nl/en/accredited-mrecs). The majority are
linked to an institution such as a hospital or an academic medical centre, including the Erasmus MC.
Research falls under the WMO if the following criteria are met:
(1) it concerns medical/scientific research,
(2) participants are subject to procedures or are required to follow rules of behaviour
(http://www.ccmo.nl/en/your-research-does-it-fall-under-the-wmo).
Importantly, retrospective research/research with patient records does not fall under the WMO: the data
are not gathered for the sake of the research, and participating subjects are not required to change their
behaviour for the sake of the research. In these cases, the MEC is often still asked to formally confirm
that a research proposal does not fall under the WMO. We have also followed this approach in the
Erasmus MC for the SEMCARE global use case scenario. Use of identifiable patient data in research
that does not fall under the WMO still needs informed consent from the data subjects, but if the data
are de-identified informed consent is not required.
3.2. MUG data base
Relevant national law and its relationship to European law can be found by accessing
http://www.ethikkommissionen.at/ (Austrian online platform of the ethics board).
The SEMCARE platform will work on pseudo-anonymised patient data. The ETL workflow and
the pseudo-anonymisation process are described in Appendix MUG, sections “Provision of documents”
and “Document takeover”. A sketch of the local environment with the embedded search platform is
given in Figure 1: System sketch.
3.3. SGUL data base
SGUL has to pay more attention to the issues concerning data protection than other participants
because two types of data are involved in the SEMCARE project: the anonymised data for development
and testing and the data loaded in the SEMCARE system.
For the data used for development and testing purposes, the Data Protection Act 1998 (footnote 2) requires
data to be anonymised before it is transferred to a third party. The national standard on anonymisation is
"Anonymisation: Managing Data Protection Risk Code of Practice" (footnote 3), developed by the UK Information
Commissioner's Office. An anonymising system will be developed according to this standard and
the characteristics of clinical texts. It will feature text mining techniques to automatically identify uniquely
identifiable information and then merge documents based on such information.
Data loaded in the prototype system comes from the same department where the system will be
installed and there is no requirement of anonymisation. The only two restrictions come from Trust policy
and the Data Protection Act 1998 and these will be adhered to:
1. Data will be held in the secure demilitarized zone, an area of the network that is isolated from the
internal network.
2. Identifiable data cannot be transferred outside of the Trust.
2 http://www.legislation.gov.uk/ukpga/1998/29/contents
3 http://ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf
4. General procedures in the SEMCARE project for processing healthcare data
4.1. Overall strategy
4.1.1. Overall strategy
This section gives an overview of the processes and mechanisms applied to protect data privacy in the
SEMCARE project. More detailed information on the implementations that are in place
locally at the clinical sites is provided in sections 4.2 to 4.4 by the clinical partners.
The local authorities in the hospitals are responsible for the definition of security needs within the clinic.
The SEMCARE project assumes that corresponding measures for privacy protection are already in place in the
data-providing clinics through the clinical systems already in operation. The locally installed SEMCARE
services will follow these existing rules.
Furthermore, additional measures are taken to protect the processed data, especially with regard
to the project-related infrastructure, e.g. rules for data deletion and additional protection of
system access and data transfer. Additionally, patient data is de-identified at the
sites in order to ensure data privacy.
Legal basis
During the time of system development and testing the legal basis for the transfer of test data is that
data will be fully anonymised according to existing local procedure at the clinical sites to ensure that the
patients cannot be identified.
Once the SEMCARE system goes into production and no longer needs to run on de-identified data, the
legal basis for data handling and processing is the national privacy legislation applicable to each
clinical site, together with the EU data protection regulations that apply to all project partners. The
SEMCARE system may then be considered as an addition to the core capability of the hospital
information system, supporting patient safety for individuals.
Details about the regulations are described in sections 2 and 3 of this document.
Responsible authorities
When considering the whole process of the SEMCARE system development, installation and execution
a number of authorities are involved. The following table shows the different authorities that are involved
in the SEMCARE project and provides a short description of their corresponding responsibilities.
Local Authority | Description
Research Institution | The medical informatics departments at MUG and EMC are research institutions involved in the SEMCARE project. They provide input about new technologies and methods in the field of text analysis and terminologies.
Data Providing Institution | Hospital site providing test data for system development. The test data provided to AVERBIS will be anonymised or artificially generated. Real patient data will not leave the hospital site and will only be used during usage of the final productive system at the site. Patient data will not be used during system development.
Local IT | IT departments at the three different hospital sites. They are responsible for approving the software and for the installation and integration of the system into the existing local IT landscape. Only dedicated personnel will be able to access the SEMCARE systems.
Data Privacy Protector / Ethics Advisor | Local privacy protectors are available at each hospital site. Overall privacy protector and ethics advisor for the project and the committee is Dipak Kalra (footnote 4).
System Provider | AVERBIS develops the software system externally and provides it to the hospital sites for local installation. No real patient data will be provided to AVERBIS.
Physical access control
At the clinical sites, corresponding measures are already in place to deny unauthorised persons physical
access to machines working on sensitive personal data. The computer rooms are
adequately protected. Only authorised administrative personnel will be able to enter the server rooms
where the SEMCARE systems are installed locally.
Computer access control
Appropriate measures are in place at the sites to prevent unauthorised people from logging into
computers containing personal clinical data. Only authorised administration personnel will get access
to the SEMCARE systems according to their allocated tasks and roles.
The following users will have access to the SEMCARE systems according to their tasks:

Local system administrator. The local system administrator is an employee of the local IT of
the clinical site. He/she enables the installation of the local SEMCARE server and has access
to the systems for maintenance reasons.
4
Clinical Professor of Health Informatics, Director, Centre for Health Informatics and Multiprofessional Education, University
College London
In case of problems with the SEMCARE services, the local system administrator has a detailed
description of all processes, which allows him/her to identify failing processes and to restart
failed processes.

SEMCARE delegate. The SEMCARE delegate is responsible for the control and regulation on
application level. He/she serves as contact person in case of problems or questions with
regards to the software or the usage of the SEMCARE services. These can be questions by
end users or by the local system administrator (e.g. in case of a software update).
Software updates that cannot be performed by the local system administrator alone are
performed jointly with the local system administrator, in confidence.

End user. The end user has access to the SEMCARE application according to the role and
authorisation concept. This can either be implemented by connecting the SEMCARE system to
the local rights management system or by creating corresponding user accounts that have
different access rights on the computer hardware.
It must be ensured that people who are authorised for machine access can only access data for
which they have the corresponding rights. Personal data must not be read, copied, changed or deleted
by unauthorised people during usage, processing or storage. End users will get access to the
application according to the role and authorisation concept and will not have access to the system
components. Access logging can be activated for traceability.
Furthermore, it will be ensured that no project-related data is written to or stored in locations to which
unauthorised persons have access.
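The role and authorisation concept with optional access logging can be illustrated as follows. This is only a minimal sketch under stated assumptions: the role names follow the three user types listed above, but the permission names (`maintain_system`, `support_users`, `query_application`) and the `AccessLog` class are hypothetical and do not describe the actual SEMCARE implementation, which may instead delegate these checks to the local rights management system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping (assumption, for illustration only).
ROLE_PERMISSIONS = {
    "local_system_administrator": {"maintain_system"},
    "semcare_delegate": {"maintain_system", "support_users"},
    "end_user": {"query_application"},
}


@dataclass
class AccessLog:
    """In-memory access log recording who attempted which action."""
    entries: list = field(default_factory=list)

    def record(self, user, action, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "allowed": allowed,
        })


def is_allowed(role, action):
    """Unknown roles get no permissions at all (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())


def request_access(user, role, action, log):
    """Check the permission and log the attempt, whether granted or denied."""
    allowed = is_allowed(role, action)
    log.record(user, action, allowed)
    return allowed


# Example usage with made-up users.
log = AccessLog()
granted = request_access("alice", "end_user", "query_application", log)
denied = request_access("alice", "end_user", "maintain_system", log)
unknown = request_access("bob", "visitor", "query_application", log)
```

Denied attempts are logged as well, so that a later review can see not only who accessed the system but also who tried to.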
Hardening of the system
The SEMCARE system will be installed on a local server in the hospitals by the local IT department in
close collaboration with the SEMCARE project team. All components that are not needed should be
removed from the server and all remaining components should be tested to be secure.
Furthermore, it has to be checked which components have to be installed additionally in order to
improve the security of the systems. This could be components like firewalls, intrusion detection
systems, virus protection and similar programs.
Additionally, the user rights on the server have to be checked and restricted if needed.
Update Management
The SEMCARE application will be distributed to the three clinical partners as a virtual machine (VM) for
local installation. A plan for the update process of the VMware template has to be developed, defining
who is responsible for the master image of SEMCARE and how changes on the master image are
transferred to the different clinics. This affects updates of operating system components as well as
updates for the SEMCARE services.
Penetration / Scaling / Load Tests
Within the course of the project and the technical development, penetration, scaling and load tests could
be performed in order to check the individual components of the system. Such technical
analyses could identify vulnerabilities that a third party might use to enter the SEMCARE system,
so that corresponding measures could be taken to improve the security of the system. The
need for such an analysis and the checks to be performed will be determined during the project, involving
all project partners.
4.1.2. Data mining / data extraction
Within the scope of the SEMCARE project unstructured clinical data is extracted from various hospital
information systems and files and transferred to the SEMCARE system for data storage and analysis.
Thereby, the clinical data and the application services will be physically separated. Data storage is done
on NAS (Network Attached Storage) servers and the application is running on a dedicated server as a
VM. The advantage of separate data storage is that the physical storage locations are known and can
be easily cut from the system in case of need. Furthermore, a backup of the data could be created
separately from the system logic at regular intervals.
The consolidation of the clinical data and transfer into the SEMCARE system is done by the local IT
department using the tool Talend Open Studio (footnote 5) or another ETL (extract, transform, load) tool. Talend
pushes the data into the PostgreSQL data store and to the Solr and text analysis component. By means
of the text analysis relevant information like diagnosis, symptoms, and therapy of patients is obtained
from the unstructured data sets. During the analysis no personal data is modified. The analysis results
are written to the PostgreSQL data store and the Lucene/Solr search engine index on the NAS. All
communication and data transfer between the NAS and the VM is done via NFS 4.0 (Network File
System).
The end user can then perform queries on the (un)structured data and the system returns relevant
results. The whole process of data transfers and data storage can be seen in Figure 1.
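The text analysis step described above can be illustrated with a toy example. The following is only a minimal sketch, assuming a hypothetical hand-written term list (`TERMINOLOGY`) and simple case-insensitive string matching; it is not the text analysis component used in SEMCARE. It shows the one property stated above: annotations are produced alongside the source text, which itself is left unmodified.

```python
import re

# Hypothetical mini-terminology (assumption, for illustration only).
TERMINOLOGY = {
    "diagnosis": ["atrial fibrillation", "syncope"],
    "symptom": ["dizziness", "palpitations"],
    "therapy": ["beta blocker", "pacemaker"],
}


def annotate(text):
    """Return annotations with character offsets; the input text is not changed."""
    annotations = []
    for category, terms in TERMINOLOGY.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append({
                    "category": category,
                    "term": term,
                    "start": match.start(),
                    "end": match.end(),
                })
    # Sort by position so results follow the order of the document.
    return sorted(annotations, key=lambda a: a["start"])


# Example usage on an invented sentence.
report = "Patient reports dizziness and palpitations; syncope suspected. Started beta blocker."
found = annotate(report)
found_terms = [a["term"] for a in found]
found_categories = [a["category"] for a in found]
```

Storing offsets rather than rewriting the text keeps the analysis results separable from the source documents, in line with the separate storage of results described above.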
5
http://talend.com/products/talend-open-studio
Figure 1: SEMCARE data flow and data storage
The following paragraphs describe some fundamental actions that can be taken for the data extraction
and text mining procedures in order to ensure data privacy on personal clinical data.
Data economy
The research project SEMCARE utilises hospital data related to specific use cases. At present we
focus on one use case in cardiology, which is described in more detail in the previous
deliverable D3.1 (Sketch of system architecture specification). Only data necessary to answer the
defined use case will be extracted and processed by the application. Patient data that is available in
the hospital system but unrelated to the project use cases will not be considered.
The clinical data that is needed for the current use case will be loaded into the SEMCARE system by
the local IT department using the ETL software Talend Open Studio or another ETL software. Data will
only be loaded via the ETL software; there is no other way of data input into the system.
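The data economy principle can be expressed as a filter applied before loading. The sketch below is illustrative only and assumes hypothetical field names (`record_id`, `department`, `diagnosis`, `symptoms`, `therapy`) and a hypothetical department label; in the project this selection would be configured in the ETL tool rather than written as standalone code.

```python
# Hypothetical definition of what the current use case needs (assumption).
USE_CASE_DEPARTMENTS = {"cardiology"}
USE_CASE_FIELDS = {"record_id", "diagnosis", "symptoms", "therapy"}


def minimise(record):
    """Drop records outside the use case and strip fields it does not need."""
    if record.get("department") not in USE_CASE_DEPARTMENTS:
        return None
    return {key: value for key, value in record.items() if key in USE_CASE_FIELDS}


def select_for_load(records):
    """Return only the minimised records that belong to the use case."""
    minimised = (minimise(record) for record in records)
    return [record for record in minimised if record is not None]


# Example usage with invented records.
source_records = [
    {"record_id": "r1", "department": "cardiology", "diagnosis": "syncope",
     "symptoms": "dizziness", "insurance_number": "n/a"},
    {"record_id": "r2", "department": "dermatology", "diagnosis": "eczema",
     "symptoms": "itching", "insurance_number": "n/a"},
]
to_load = select_for_load(source_records)
```

Using an allow-list of fields means that newly added columns in the hospital system are excluded by default until they are explicitly approved for the use case.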
Logging
Only allocated and defined personnel will have access to the system components of the SEMCARE
application. To ensure traceability, the data exchange between the different SEMCARE
components will be logged. Such logs can be used for the purpose of fault
analysis but also for subsequent examination of whether personal data has been entered, changed or deleted
and by whom. To what extent logging is needed and the exact data and actions to be logged will be
discussed and defined in future steps within the project.
Secure Data Transfer / Encryption Procedures
The communication between the different SEMCARE components within the intranet of the clinic should
be secure, as should the data storage. The minimum requirements for the cryptographic procedure (e.g.
minimum key length and cipher suites) and the choice among the available procedures will be
refined in the course of the project.
For example, corresponding actions could be performed to ensure that the communication between the
different components of the SEMCARE system (e.g. client computer and server) passes over a secure
data line. A possible method for encryption of the data transmission is the transfer via HTTPS, thus
adding the security capabilities of the SSL/TLS protocol to the standard HTTP communication.
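As an illustration of how a client could enforce such requirements for HTTPS connections, the following minimal sketch uses Python's standard `ssl` module. The choice of TLS 1.2 as the minimum version is an assumption made for this example only; the text above leaves the actual minimum requirements to be defined during the project.

```python
import ssl


def make_client_context(min_version=ssl.TLSVersion.TLSv1_2):
    """Build a client-side TLS context with certificate and host name checks."""
    # create_default_context() enables certificate verification and
    # host name checking for client connections.
    context = ssl.create_default_context()
    # Refuse protocol versions older than the chosen minimum (assumed value).
    context.minimum_version = min_version
    return context


# Example usage: the context can be passed to an HTTPS client, for instance
# urllib.request.urlopen(url, context=context).
context = make_client_context()
```

Centralising these settings in one function makes it easier to tighten them later, for example once the project has agreed on concrete cipher suites.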
Additional encryption of the stored data, e.g. the data held in the Solr index, is possible with the
software TrueCrypt (footnote 6) in order to protect it from unauthorised access. However, this is not advisable in
a search application, as the query response time will degrade dramatically.
Furthermore, hardware-based encryption of the storage media should be used to ensure that no confidential
information can be accessed if a data storage medium is stolen.
Data Deletion / Time Periods
Once the tool Talend Open Studio loads data from the clinical data pool onto the SEMCARE server, the
system begins to store patient data. This data has to be deleted as soon as it is no longer needed.
All data that is created, provided or used during the project outside the participating hospitals will be
deleted upon finalisation of the project. This includes the anonymised data for development and testing
of the application by the system provider AVERBIS (Averbis GmbH).
The expected end of the project is 31st December 2015.
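The two deletion rules above can be written down as a simple check. This is a hedged sketch only: the record fields `location` and `still_needed` are hypothetical, and the rule that external data becomes due on the day after the expected project end is an interpretation of "upon finalisation of the project", not a statement from the project plan.

```python
from datetime import date

# Expected end of the project as stated in the text.
PROJECT_END = date(2015, 12, 31)


def due_for_deletion(record, today):
    """Apply the two rules: not needed any more, or external data after project end."""
    if not record["still_needed"]:
        return True
    return record["location"] == "external" and today > PROJECT_END


# Example usage with invented data sets.
data_sets = [
    {"id": "test-set-1", "location": "external", "still_needed": True},
    {"id": "staged-7", "location": "hospital", "still_needed": False},
    {"id": "staged-8", "location": "hospital", "still_needed": True},
]
due_during_project = [d["id"] for d in data_sets if due_for_deletion(d, date(2015, 6, 1))]
due_after_project = [d["id"] for d in data_sets if due_for_deletion(d, date(2016, 1, 1))]
```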
4.1.3. Definition of the project architecture
The SEMCARE architecture has been designed in a way that the system does not directly work on the
production system of the hospital but only on a copy of the data, i.e. staged data that will be provided
by the hospital. That way, the real patient data is protected against accidental destruction or loss caused
by SEMCARE. Additionally no load to the clinical production system will be generated. The patient data
6
http://www.truecrypt.org/
that is used for patient care in the hospital will at no time be accessed by the SEMCARE systems.
Figure 2 below shows a broad overview of the system architecture, which is explained in detail
in deliverable D3.1.
Figure 2: Systems involved in the SEMCARE architecture
The SEMCARE systems (staging and production) will be installed locally at the hospital site. After
loading the clinical data into the SEMCARE systems it will be processed and evaluated by appropriate
algorithms completely without leaving the hospital. The project integrates into the existing IT landscape
of the hospital with regards to physical access, computer access, and data access control to the used
IT components (servers and network components). This also protects the security of particularly
sensitive health care data arising in a hospital.
Furthermore, the architectural design of the SEMCARE platform permits data processing and storage
on separate hard drives if needed. It could be necessary to use separate hard drives if different
departments are involved as data providers or if there is data on which users may have differing rights.
A connection to the local rights management system, such as LDAP, can be implemented in order to
replicate existing access rights. Such integration can be provided individually for each of our clinical
partners.
A role concept can be applied in order to assure that only authorised users can access the data related
to the SEMCARE project.
Furthermore, all activities happening in the context of the SEMCARE system will be logged in order to be
able to examine whether patient data has been entered, changed or deleted, and by whom. Only allocated
and defined personnel will have access to the system components and applications of
SEMCARE.
Availability control
The local VM in the clinic does not require any hardware backups or recovery mechanisms, because
data loss can be overcome by reloading and reprocessing the data from the staged instance if
necessary.
In case of a complete and permanent outage of the VM in the clinic and if a disaster recovery is not
possible, the local VM can be replaced by a new copy of the SEMCARE VM that could be provided by
AVERBIS. All data for the current use cases will then have to be re-imported from the clinical staging
system into the SEMCARE services.
As the SEMCARE application is not crucial for patient care, high availability of the data platform is not
required and a fully automated outage concept will not be provided.
Data Segregation
Within the SEMCARE project the system applications will be provided and run on a virtual machine
(VM). If a clinical partner wants to use the SEMCARE technologies, the clinic gets a copy of this VM,
which is pre-configured for the corresponding clinic. This copy will then be integrated into the clinical
network by the local IT department. In this way, each clinical partner will have its own individually configured
and dedicated system.
To enable a separation of the clinical data and the application services provided by AVERBIS, we plan
data storage on a separate machine, a so-called NAS (network attached storage). The NAS can be
mounted on the VM and data can easily and securely be transferred between the NAS and the
SEMCARE application on the VM via NFS 4.0. Moreover, as the data is located on a different machine,
the access of the SEMCARE application can easily be stopped by physically disconnecting the data
storage machine if needed.
In the beginning, only one use case within the cardiologic field will be relevant for the project, thus no
further data separation mechanisms for data storage are needed. In case of more use cases in the
future it must be assured that data from different scenarios or different departments are separated from
each other according to existing access rights. For this purpose, distributed Solr indexes per use case
could be used, which the generic architecture design supports. In this way, further data separation
could be performed within the SEMCARE application in a clinic if needed.
All the SEMCARE systems will only be run locally and queries will only be performed on relevant patient
data. Other information that is not relevant for the defined use case will not be extracted from the
hospital systems. A system for software development and testing and a production system will be
provided separately.
4.1.4. Data anonymisation
Introduction to Anonymisation
Anonymisation is defined as “a process that removes the association between the identifying data and the
data subject”, according to ISO Technical Specification ISO/TS 25237 (Health informatics –
Pseudonymisation).
There are two types of anonymisation techniques: masking and de-identification. They deal with
different fields in a data set.
Masking
Masking is typically applied to direct identifiers such as names and IDs like the NHS number. It involves significant
distortion of the data. Suppression, i.e. removing a whole field, is the most commonly used
masking technique. Another masking technique involves replacing actual values with random values
selected from a large database. The only standard for masking is ISO Technical Specification 25237,
which focuses on the different ways that pseudonyms can be created, but does not specify the
techniques to use.
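The two masking techniques mentioned above, suppression and replacement with random values, can be sketched as follows. The field names and the small list of surrogate names are hypothetical and stand in for the "large database" of replacement values; this is an illustration, not the masking procedure of any clinical partner.

```python
import random

# Hypothetical pool of replacement values (assumption, for illustration only).
SURROGATE_NAMES = ["Patient A", "Patient B", "Patient C"]


def suppress(record, fields):
    """Suppression: return a copy of the record without the given fields."""
    return {key: value for key, value in record.items() if key not in fields}


def replace_with_random(record, field, pool, rng):
    """Random replacement: return a copy with the field set to a value from the pool."""
    masked = dict(record)
    masked[field] = rng.choice(pool)
    return masked


# Example usage on an invented record.
original = {"name": "Jane Example", "nhs_number": "0000000000", "diagnosis": "syncope"}
without_id = suppress(original, {"nhs_number"})
masked = replace_with_random(without_id, "name", SURROGATE_NAMES, random.Random(0))
```

Both functions return new dictionaries, so the original record is never altered in place.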
De-identification
De-identification involves protecting fields like demographics and individual information, such as age,
home and work address, income, number of children and race. De-identification minimally distorts the
data so that meaningful analytics can still be performed on it, while still protecting privacy.
There are three standard approaches to de-identification: lists, heuristics and risk-based methodology.

Lists specify the data elements that need to be removed or generalised. A good example is the Safe
Harbor standard in the HIPAA Privacy Rule, in which 18 data elements are listed. However,
this method has been significantly criticised because it does not provide real assurance that
there is a low risk of re-identification.

Heuristics are rules of thumb that have been developed and applied over years. For example, never
release dates of birth, but allow the release of treatment and visit dates. The drawbacks of
heuristics are: 1) the existing rules may not cover all circumstances, especially for rare
diseases; 2) it is difficult to technically prove the effectiveness of the privacy protection of such
methods, which makes them unsuitable for data providers that want to manage their re-identification risk in data release; 3) experts or judges are required to determine the rules.

Risk-based methodology applies mathematical algorithms to automatically change the sensitive
contents in healthcare data, while managing the trade-off between the re-identification risk in
the data and the utility of the data. It is consistent with several standards from regulators and governments,
such as “Anonymisation: Managing Data Protection Risk Code of Practice” by the UK Information
Commissioner’s Office.
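To make the trade-off between re-identification risk and data utility more concrete, the sketch below uses one common way of quantifying risk: records that share the same quasi-identifier values form a group, and the worst-case risk is taken as one divided by the size of the smallest group. This measure, the ten-year age bands and the field names are assumptions chosen for illustration; they are not prescribed by the Code of Practice and do not describe the method used in SEMCARE.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact age by a band such as '30-39' (less detail, lower risk)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group sharing the same values."""
    if not records:
        raise ValueError("at least one record is required")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1 / min(groups.values())


# Example usage with invented records.
patients = [
    {"age": 31, "sex": "F"}, {"age": 34, "sex": "F"}, {"age": 38, "sex": "F"},
    {"age": 52, "sex": "F"}, {"age": 57, "sex": "F"},
]
risk_exact = max_reidentification_risk(patients, ["age", "sex"])

generalised = [{"age": generalise_age(p["age"]), "sex": p["sex"]} for p in patients]
risk_generalised = max_reidentification_risk(generalised, ["age", "sex"])
```

In this example every exact age is unique, so the worst-case risk is 1.0; after generalisation the smallest group contains two patients and the risk drops to 0.5, at the cost of losing the exact ages.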
4.1.5. Data leaving the hospitals to test the software
AVERBIS will develop the underlying software of the SEMCARE platform. In order to test the
performance of the software and to improve the algorithms it is essential to have representative test
data that can be used for this purpose. The test data can be fake or anonymised data but should have
similar structure to the real patient data that will finally be used in the productive system.
Within the SEMCARE project the clinical partner SGUL (Saint George's University of London) will
provide anonymised clinical test data to AVERBIS for the development of the algorithms and interfaces
and for testing and improvement purposes. The legal basis that allows the transfer of such test data by
SGUL is section 251 of the NHS Act 2006. This section says that confidential patient information can
be transferred to a third-party applicant for the purpose of supporting health service improvements,
bypassing the common law duty of confidentiality. However, the handling of confidential patient information must still
comply with all other relevant obligations, e.g. the Data Protection Act 1998, which requires data
anonymisation. SGUL is responsible for the data disclosure. Transferred test data will be encrypted
either at rest or in transit.
The clinical partners EMC (Erasmus Universitair Medisch Centrum Rotterdam) and MUG (Medical
University of Graz) will not provide any data to AVERBIS or to any other clinical partner.
In the scope of the SEMCARE project real patient data will never leave the hospitals in order to meet
the privacy regulations. Data processing of the final system and the operation of the data platform will
be performed within a dedicated server infrastructure in the hospital. Patient data may, however, be
shared between different departments within each hospital. For these data transfers, the
(pseudo-)anonymisation processes that already exist at the clinical sites will be applied. The de-identification
procedures that are locally used in each of the three participating hospitals are explained in detail in the
following hospital-specific sections.
For any data transfer of test data to AVERBIS or transfer of patient data between different departments
within the hospital it will be ensured that personal data cannot be read, copied, changed or deleted by
unauthorised people during electronic transmission or during transport or storage on a data storage
medium. As mentioned before, in the SEMCARE project all patient data will remain within the hospitals.
No real clinical data will be transferred outside the hospital, neither by employees, nor by systems. The
data processing of all productive systems will be performed locally.
4.1.6. SEMCARE ‘global’ use case
The starting point of data processing within the SEMCARE project is the clinical data that is available
in the various information systems within the participating hospitals. The patient data necessary for
analysis is information about diagnosis, symptoms, laboratory data, therapies, and medications that are
contained in discharge letters and medical reports, databases or other clinical information systems.
The SEMCARE systems have to be able to handle huge amounts of structured but also unstructured
data extracted from the aforementioned sources, to be analysed by specific text mining procedures.
The individually required data set is defined by the corresponding use case. The department providing
the data will be identified accordingly during the course of the project. In a first ‘global’ use case called
‘Risk Stratification and Differential Diagnosis of Patients suffering from transient loss of consciousness’
we concentrate on cardiological disorders.
Three possible use cases are targeted during the term of the SEMCARE project:
a) Diagnosis support
There exist many diseases which are very rare and only affect a small number of people. If a
patient suffering from such a rare disease is seeing a doctor and describing his/her symptoms,
in many cases the doctor is not able to determine the diagnosis as he/she is not aware of the
disease.
The SEMCARE system can help the doctor with the diagnosis of rare diseases. For example,
if the doctor hears of a previously unknown disease he/she could use the provided tools to search
all the patient data available in the clinic for patients showing the symptoms that characterise
the rare disease. In this way, patients who could suffer from the rare disease and were
previously without diagnosis or even with an incorrect diagnosis can be identified and re-seen by
the doctor. The SEMCARE project thus offers an enormous advantage for patients suffering
from rare diseases, allowing correct diagnosis and a specific and early therapy.
The physician performing the search on specific symptoms only has access to personal patient
data according to his/her user rights. He/she will not be able to identify a patient for which
he/she does not have the rights to see the data as the patient was e.g. treated in another
department. This ensures that no unauthorised access to the personal data occurs.
The patient data is only used in order to identify possible candidates for rare diseases and invite
the patients for another visit to re-check the symptoms and possibly define the diagnosis.
b) Protocol feasibility
When planning a new clinical trial and working on the protocol design, there are several points
that need to be considered in advance. One of the main aspects that have to be investigated is
the feasibility with regards to the patient enrolment. It has to be defined in the protocol how
many patients are planned and needed to participate in the study in order to get an adequate
amount of data that can be evaluated subsequently. This implies that the patient population
fulfilling the inclusion criteria defined in the protocol has to be determined. This gives the
investigator an estimate of how many patients match the criteria and in which
timeframe it would be possible to reach the planned patient recruitment number. Depending on
the number of potential participants, it can also be evaluated how many clinical sites would
need to participate in the study in order to recruit the necessary number of patients.
In this regard the SEMCARE data platform can be of great help in order to evaluate the patient
population for protocol feasibility assessment. Patient data that is already available in hospital
information systems can be analysed for the inclusion criteria, thus getting an overview of the
potential study participants within a specific hospital. The investigator can thus see whether
enough patients are available at the clinical site to perform the study, or whether other sites
would have to be included to reach the planned number of participants. It also becomes easy
to evaluate which timeframe is realistic for completing patient recruitment.
Such an assessment raises no ethical problem, as the personal data analysed is used only to
get an overview of the patient population. No individual data is published, and a reference to
patient-identifying data (such as name or exact date of birth) is not necessary for this use
case. SEMCARE is thus a useful tool that supports the investigator during protocol
preparation and shortens the time until protocol start.
c) Patient recruitment
As described in the previous section, each clinical trial has a defined number of patients that
need to be enrolled in order to fulfil the protocol requirements. Sometimes it is hard to reach
the specified number of study participants within the timeframe. In these cases SEMCARE can
serve as a tool to identify patients within a given clinic who fulfil the protocol
inclusion criteria.
Again, the investigator will only be able to search data in his/her hospital for which he/she
has the corresponding access rights. Once potential study participants have been identified,
they will be asked to sign an informed consent for study inclusion, thus allowing the use of
their personal data for data analysis and statistical evaluations.
4.2. EMC
4.2.1. Protocol for de-identification
The patient records that will be used for the use case in the Erasmus MC will be de-identified by applying
a locally developed automated method that has been tested and approved by the Erasmus MC privacy
officer. Briefly, we follow a two-step de-identification process. Each patient record contains a header
with a unique hospital identifier from the hospital database. First, for each patient a unique identifier is
generated which replaces the hospital id in the header. The same unique id is used for all other
occurrences of the hospital id. A mapping between the hospital id and the newly generated unique id is
locally stored on the computer where the patient records reside. In a second step, all information related
to names (including those of patients, clinicians and technicians), cities, streets, postal codes, and
hospitals is removed. Any identified words are replaced with a corresponding category name, e.g., the
name J. Doe is replaced by its category #Name#. The de-identification process is based on category
lists and category-specific rules. During the development stage of the system, only de-identified patient
records will be used.
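The two-step process described above can be illustrated with a minimal Python sketch. It is not the method developed at Erasmus MC: the category lists, the postal-code rule, the function names, and the use of a random hexadecimal identifier are all assumptions made for illustration only.

```python
import re
import uuid

# Hypothetical category lists and rules, for illustration only; the actual
# lists and rules used at Erasmus MC are not given in this document.
CATEGORY_LISTS = {
    "Name": ["J. Doe"],
    "City": ["Rotterdam"],
    "Hospital": ["Erasmus MC"],
}
CATEGORY_RULES = {
    # Assumed rule: Dutch postal codes such as "3015 GD".
    "PostalCode": re.compile(r"\b\d{4}\s?[A-Z]{2}\b"),
}


def replace_hospital_id(record, hospital_id, id_map):
    """Step 1: replace every occurrence of the hospital id by a generated id.

    id_map holds the mapping from hospital id to generated id, which stays
    on the computer where the patient records reside.
    """
    new_id = id_map.setdefault(hospital_id, uuid.uuid4().hex)
    return record.replace(hospital_id, new_id)


def replace_categories(record):
    """Step 2: replace identifying words by their category name, e.g. #Name#."""
    for category, pattern in CATEGORY_RULES.items():
        record = pattern.sub(f"#{category}#", record)
    for category, terms in CATEGORY_LISTS.items():
        for term in terms:
            record = re.sub(re.escape(term), f"#{category}#", record,
                            flags=re.IGNORECASE)
    return record


def deidentify(record, hospital_id, id_map):
    """Apply both steps to one patient record."""
    return replace_categories(replace_hospital_id(record, hospital_id, id_map))
```

Keeping the identifier mapping in a separate structure reflects the requirement that the mapping is stored locally and is not part of the de-identified records.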
4.2.2. Data management for research
After de-identification, data are stored on an encrypted removable drive and transferred to the
SEMCARE workstation. There will not be data transfer to other applications. The workstation and data
will only be accessible by authorized persons from the Departments of Medical Informatics and
Cardiology. The workstation will run on a virtual machine on a server, which is in a dedicated server
room.
4.2.3. Adaptation of the SEMCARE software
A Dutch version of the SEMCARE platform will be tested and evaluated locally in the Erasmus MC.
Methods and techniques that are developed in WP4 and WP5 will be adjusted and refined locally by
the Department of Medical Informatics of the Erasmus MC. Patient data will not leave the hospital.
4.2.4. Installation of a foreign code in the hospitals
The deployment of the SEMCARE platform in the Cardiology Department of the Erasmus MC will be as
a stand-alone system, fed with de-identified patient data. This will not require any assurances other
than adherence to the standard restrictions (mainly with respect to the firewall and remote access) that
apply to all computer systems in the Erasmus MC.
4.2.5. Clinical connectors
Data will be extracted from a clinical data warehouse at Erasmus MC, for which no specific clinical
connectors need to be developed. At a later stage, a more direct connection to data in the hospital
information system may be foreseen, but the Erasmus MC is currently in the process of migrating to a
new hospital information system and it is not clear yet what clinical connectors and assurances will be
required in the new situation.
4.3. MUG
4.3.1. Protocol for de-identification
A detailed description of the pseudo-anonymisation process can be found in Appendix MUG “Location
of the exported document corpus”.
4.3.2. Data management for research (can include the generation of dummy data)
The whole data management process as well as access rights and location of the pseudo-anonymized
data are described in Appendix MUG. If needed, dummy data can be generated by making manual
substantive changes to documents; this is administered by Stefan Schulz.
4.3.3. Adaptation of the SEMCARE software
Explained in Annex I MUG “Processing documents using the Averbis Software”.
4.3.4. Installation of a foreign code in the hospitals
Explained in Annex I MUG “Processing documents using the Averbis Software”.
4.3.5. Clinical connectors
The clinical connectors used are described in Appendix MUG “Provision of documents”. The one-time
access to the hospital information system (HIS), openMEDOCS (SAP IS-H * MED), is mapped within a
Talend-Data-Integration job on the internal server for data integration (see Figure 1: System Sketch),
which also serves as documentation of the complete ETL workflow. The SEMCARE platform will have no
direct access to the HIS via connectors but will work on the pseudo-anonymized data set generated
by the initial ETL workflow. A connector to the pseudo-anonymized data set has to be adapted for the
SEMCARE platform.
4.4. SGUL
4.4.1. Protocol for De-identification
Data Types
We mention data types here because different types of data require different processing protocols and
strategies. From a business point of view, data can be classified as clinical, administrative, and survey
data. From a computer science perspective, data can be grouped into numeric, categorical, textual, and
binary data.
In the SEMCARE project, the main challenge in data anonymisation is how to de-identify free-form text,
which is widespread in EMRs. Unlike the text of news reports or articles, text in EMRs is
characterised by many typos, shorthand, incomplete sentences, spelling errors, and poor grammar.
Therefore, existing NLP tools may not satisfy the requirements. Moreover, it is not enough for a
system to catch 90% of the sensitive content in the data; it has to catch all of it. This means that the
standards for text anonymisation have to be much higher than for other kinds of anonymisation.
Basic Principles
• The risk of re-identification can be quantified.
• Data privacy and utility should be balanced.
• The risk of re-identification should be very small.
• Anonymisation involves a mix of technical, contractual, and other measures.
Steps
1. Selecting unique, quasi-identifiable, and sensitive attributes. Unique attributes are
attributes that can be used directly to uniquely identify individuals, such as hospital number,
patient’s name and address. A quasi-identification attribute set is a group of attributes that
can be used in combination to identify individuals. For example, the combination of a patient’s
birthday and post code could be used by an adversary to re-identify the patient. Sensitive
attributes are items of information that may be published but must not be linked to a particular
individual. Diagnosis and symptoms are examples.
2. Setting the risk threshold. The risk threshold represents the maximum acceptable risk for
sharing the data. When setting this threshold, the following two factors should be considered:
• Is the data going to be in the public domain?
• What is the extent of the invasion of privacy when this data is shared?
3. Examining plausible attacks. Four plausible attacks can be conducted on the data set:
• T1: The data recipient deliberately attempts to re-identify the data.
• T2: The data recipient inadvertently re-identifies the data.
• T3: There is a data breach at the data recipient side.
• T4: An adversary launches a demonstration attack on the data.
In the SEMCARE project, the data is used only for training and testing purposes at Averbis,
i.e., the data is disclosed to Averbis. The first three attacks should therefore be considered,
taking into account the following factors: 1) whether Averbis has the motivation, resources,
and techniques to re-identify the data set; 2) which security and privacy controls Averbis
needs to apply to the data.
4. Identifying and re-organising the sensitive contents. Text mining techniques such as term
extraction, text classification, and text clustering will be used to extract unique,
quasi-identifiable, and sensitive attributes. Documents relating to the same patient will be
merged and linked, based on the extractions.
5. Anonymising the data
Three types of techniques:
• Suppression: Replacing a value in a data set with a NULL or missing value.
• Generalisation: Reducing the precision of a field.
• Subsampling: Releasing only a simple random sample of the data set rather than the
whole data set.
Principles:
k-anonymity, l-diversity, k^m-anonymity, (h, k, p)-coherence and ρ-uncertainty
The k-anonymity and l-diversity principles are widely accepted; k-anonymity only constrains the
quasi-identification attribute set, while l-diversity restricts the sensitive attributes.
Algorithms:
Partition, Apriori, LRA, VPA, Greedy, and SuppressControl
In the SEMCARE project, a partition-based anonymity algorithm will be implemented, consistent
with both k-anonymity and l-diversity principles.
6. Documenting the process
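To make the k-anonymity and l-diversity principles from step 5 more concrete, the following Python sketch combines generalisation and suppression. It is deliberately simpler than the partition-based algorithm planned for SEMCARE and should not be read as that algorithm. The field names, the chosen generalisation (birth year and first two postcode characters), and the use of distinct l-diversity are assumptions for illustration.

```python
from collections import defaultdict


def generalise(record):
    """Generalisation: reduce precision of the quasi-identifiers.

    Keeps only the birth year and the first two characters of the postcode
    (assumed field formats: 'YYYY-MM-DD' and a UK-style postcode).
    """
    return (record["birth_date"][:4], record["postcode"][:2])


def anonymise(records, k, l):
    """Release only groups that satisfy k-anonymity and distinct l-diversity.

    A group is the set of records sharing the same generalised
    quasi-identifiers. Groups with fewer than k records, or with fewer than
    l distinct diagnoses, are suppressed entirely.
    """
    groups = defaultdict(list)
    for record in records:
        groups[generalise(record)].append(record)

    released = []
    for (birth_year, postcode_area), members in groups.items():
        distinct_diagnoses = {m["diagnosis"] for m in members}
        if len(members) < k or len(distinct_diagnoses) < l:
            continue  # suppression of the whole group
        for m in members:
            released.append({
                "birth_year": birth_year,
                "postcode_area": postcode_area,
                "diagnosis": m["diagnosis"],
            })
    return released
```

In this sketch the quasi-identification attribute set is constrained by k, while the sensitive attribute (the diagnosis) is constrained by l, mirroring the division of roles between the two principles.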
Measuring risk under attacks
The re-identification risk needs to be measured with respect to the plausible attacks mentioned
above. The metric is defined by the probability of an attack and the conditional probability of
such an attack under a given circumstance. Concrete equations are omitted.
Measuring re-identification risk
1. Probability metrics: maximum risk, average risk
When data is released publicly, the acceptable maximum risk lies between 0.05 and 0.09. If the
data set is not intended for public release, the acceptable average risk ranges from 0.05 to 0.1.
The actual value will be determined in the invasion-of-privacy assessment.
2. Information loss metrics: entropy, missingness
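Since the concrete equations are omitted here, the following Python sketch uses one common definition of the two probability metrics as an assumption: each record's risk is the reciprocal of the size of its equivalence class (the records sharing the same quasi-identifier values), maximum risk is the largest such value, and average risk is the mean over all records. The function name is hypothetical.

```python
from collections import Counter


def reidentification_risk(quasi_identifiers):
    """Return (maximum risk, average risk) for a list of quasi-identifier tuples.

    Assumed definition: a record in an equivalence class of size f has
    risk 1/f. The maximum risk is 1 / (smallest class size); the average
    risk over all n records equals (number of classes) / n.
    """
    class_sizes = Counter(quasi_identifiers)
    if not class_sizes:
        raise ValueError("at least one record is required")
    n = sum(class_sizes.values())
    max_risk = 1 / min(class_sizes.values())
    avg_risk = len(class_sizes) / n
    return max_risk, avg_risk
```

Under this definition, a data set meets a maximum-risk threshold of 0.05 only if every equivalence class contains at least 20 records.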
Text Anonymisation
1. General Approaches:
• Model-based methods: statistical or machine learning methods
• Rule-based methods
• Hybrid methods: combination of machine learning methods with rules
In the SEMCARE project, our method relies on statistical and text mining approaches.
2. Ways to anonymise text
1) Determine the unique, quasi-identifiable, and sensitive attributes. Usually, first and middle
name, last name, birthday, street, email, phone numbers, and IDs are unique attributes;
postal code, visit date, health care facility, city and state, and country are
quasi-identification attributes.
2) Extract the text for these three types of attributes.
3) Apply anonymisation methods:
• Redaction: for example, “John” is replaced with “****”.
• Tagging with information type: for example, “John” is replaced with
“<FIRST_NAME:1>”.
• Randomisation within information type: for example, “John” is replaced with a
randomly selected name from a name database.
• Generalisation into a less specific type: a generalised value is used to replace the
original value in the text, which works the same way as with structured data.
3. Evaluation
The challenge in text anonymisation is detection, i.e., finding elements of personal
information in the text. Text anonymisation systems are therefore evaluated on how well they
detect the various elements of personal information.
Evaluation method:
1) Randomly collect a set of documents (any record with free text is called a
document) as a corpus.
2) Split the corpus into a training set and a test set.
3) Use the training set to build the personal information detection system.
4) Evaluate the detection performance using the “precision” (P), “recall” (R), and “F-score”
(F) metrics.
5) Verify the result:
• For example, separate the corpus into several equal-sized parts, say ten; in each
experiment, select nine parts as the training set and the remaining one as the test set;
average the Ps, Rs, and Fs. Or
• Use a verification set, obtained when originally splitting the corpus, to
calculate P, R, and F.
4. Approaches to detecting personal information
• Open-source text mining tools
• Term extraction approach
• Rule-based methods
• Averbis text mining tools
5. Risk assessment
Calculate the risk level of T1, T2, and T3 attacks using the recall of the anonymisation.
6. Case study: Informatics for Integrating Biology and the Bedside (i2b2)
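The redaction and tagging methods and the precision, recall, and F-score evaluation listed above can be sketched in Python as follows. The detector is a toy name list standing in for a trained detection system. The name list, the function names, and the reading of the number in “<FIRST_NAME:1>” as an index per distinct name are assumptions for illustration.

```python
import re

# Hypothetical detector: a tiny name list standing in for a trained system.
KNOWN_FIRST_NAMES = ["John", "Mary"]
NAME_PATTERN = re.compile(r"\b(?:" + "|".join(KNOWN_FIRST_NAMES) + r")\b")


def detect(text):
    """Return the set of (start, end) spans detected as personal information."""
    return {m.span() for m in NAME_PATTERN.finditer(text)}


def redact(text):
    """Redaction: replace each detected name by asterisks of the same length."""
    return NAME_PATTERN.sub(lambda m: "*" * len(m.group()), text)


def tag(text):
    """Tagging: replace each detected name by <FIRST_NAME:i>.

    Assumption: i numbers the distinct names in order of first appearance,
    so repeated mentions of one name share the same tag.
    """
    indices = {}

    def replacement(match):
        index = indices.setdefault(match.group(), len(indices) + 1)
        return f"<FIRST_NAME:{index}>"

    return NAME_PATTERN.sub(replacement, text)


def precision_recall_f(predicted, gold):
    """Compute P, R, and F-score from predicted and gold-standard span sets."""
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Recall is the figure that matters most for the risk assessment in item 5, because every missed element of personal information remains in the released text.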
4.4.2. Data Management for Research
1. Data management planning
• Complete a checklist for data needs
• Complete and maintain a data management plan document
• Choose a data management planning tool, such as DMP Online
2. Documenting data
• Various documents: reports, notebooks, cookbooks, README files
• Metadata: descriptive, administrative, and structural metadata
3. Data storage and backup
• Where to store the data: networked drives, personal computers or laptops, external storage
devices
• Backup scheme: remote or online backup services, backup strategies
4. Data security, protection and confidentiality
5. Data sharing
• Password-protected archive with secured FTP service, or
• Secured RESTful/SOAP-based web service
In the SEMCARE project, 100 documents will be extracted initially for the purpose of developing the
anonymisation system. The extracted data will be stored on a local workstation and must not
be copied to unauthorised computers or portable disks. Production data will never be stored on a local
workstation; the anonymisation system will be deployed to the Trust IT department. We will develop a
web service-based user interface for data management tasks such as data generation, backup,
and sharing.
4.4.3. Adaptation of the SEMCARE Software

• Ability to integrate new text mining or language processing tools with the SEMCARE software
• Ability to remove the existing text mining or language processing tools
• Ability to replace the existing text mining or language processing tools with third-party
counterparts
• The same abilities as those mentioned above for biomedical terms
• Ability to manage clinical connectors
4.4.4. Installation of a Foreign Code in the Hospital
At St. George’s Hospital, data loaded into the SEMCARE system comes from the same department in
which the SEMCARE platform is installed, the Informatics Department. It will be held in the secure
demilitarised zone (an area of the network that is isolated from the internal network), where some
essential services such as web servers are deployed, while all other services on the local area network
run behind a firewall connected to a public network. Consistent with Trust policy and the Data
Protection Act 1998 and its annotations, there will be no transfer of identifiable data outside of the Trust.
4.4.5. Clinical Connectors

• Universal interface to extract text from multiple types of sources: TXT, XML, RDF, RTF, DOC,
XSL, PDF, JSON, HL7 CDA, etc. Each type has a corresponding software module, which can be
integrated as a plugin.
• Extraction tools: Several open-source tools can perform this task, but there is no perfect
solution. We may need to develop our own plugins based on open-source ones.
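The idea of a universal extraction interface with one plugin per source type can be sketched in Python as below. This is an illustration under stated assumptions, not the SEMCARE connector design: the registry, the decorator, and the function names are hypothetical, and only three formats that the Python standard library can parse are covered.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical plugin registry: file extension -> text extraction function.
EXTRACTORS = {}


def register(*extensions):
    """Decorator that registers an extractor plugin for the given extensions."""
    def decorator(func):
        for extension in extensions:
            EXTRACTORS[extension.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_txt(raw):
    return raw.decode("utf-8")


@register(".xml")
def extract_xml(raw):
    # Concatenate all text nodes, ignoring the markup.
    root = ET.fromstring(raw)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


@register(".json")
def extract_json(raw):
    # Collect every string value found anywhere in the JSON document.
    def walk(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for value in node:
                yield from walk(value)
    return " ".join(walk(json.loads(raw)))


def extract_text(filename, raw):
    """Universal interface: dispatch to the plugin registered for the file type."""
    extension = Path(filename).suffix.lower()
    if extension not in EXTRACTORS:
        raise ValueError(f"no connector plugin registered for {extension!r}")
    return EXTRACTORS[extension](raw)
```

Supporting a further source type such as PDF or HL7 CDA would then mean registering one more function, for example one that wraps an open-source extraction tool, without changing the callers of `extract_text`.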
4.4.6. Risks during Data Processing and Possible Solutions
1. Incomplete business requirements may lead to a technical solution that does not deliver the
anticipated added value and user interest.
Solution: refine the business requirements, re-collect training data and then re-build
the system. From an engineering perspective, this calls for a quick training and construction
procedure, so we may need to consider agile development methods when developing the system.
2. Professional or technical resistance from medical doctors to using the tool.
Solution: In most cases, this is caused by poor user interface design. To avoid this, medical
staff must be involved in the GUI design.
5. OVERVIEW/CONCLUSION
SEMCARE’s initial inventory shows that the project has to deal with various ethical issues related to
extraction and processing of healthcare data for secondary purposes. This deliverable describes the
current status.
Exchange of best practices on the approaches for de-identification will be encouraged between the
clinical partners as much as possible, always considering national regulation and internal SOPs within
each institution.
SEMCARE’s ethics advisor (Prof Dipak Kalra) will perform an annual review of all ethical aspects of
SEMCARE that will be submitted to the European Commission as part of the Periodic Reports.
Additionally, Prof Kalra will inform partners of any update to the ethical legislation that might be
relevant for the Project.
ANNEX I. MUG's processing of documents using the Averbis Software
Institut für Medizinische Informatik, Statistik und
Dokumentation
Auenbruggerplatz 2, 8036 Graz
Averbis GmbH
Tennenbacher Straße 11
D-79106 Freiburg
Fon: +49 (0) 761 - 203 97690
Fax: +49 (0) 761 - 203 97694
http://www.averbis.com
Univ.-Prof. DI Dr. Andrea Berghold
Institutsvorstand
[email protected]
Tel +43 316 385 13201
Fax +43 316 385 13590
Sachbearbeiterin: Katharina Fink
Tel +43 316 385 83201
[email protected]
Graz, 10.03.2014
SEMCARE
Context
Use of pseudonymized pre-selected clinical documents created by a clinical partner for the
purpose of a use case oriented semantic search.
Provision of documents
Data source:
Hospital information system (HIS): openMEDOCS (SAP IS-H * MED)
Selection criteria for the corpus of documents:
Use case specified by SEMCARE or MUG (Medical University of Graz)
Interface used for the Extract Transform Load (ETL) workflow:
Export through export wizard from this HIS. Structured data fields are transferred pre-pseudonymized.
Location of the exported document corpus:
On the internal server for data extraction (ISDA) the extracted data is first stored in an MS Access
database. For the pseudonymization of free texts, patient names and doctor names are
additionally required. These data are available for the pseudonymization process outlined below.
• First, tables corresponding to the document structure are inverted in order to obtain the
data structure required for text analysis (e.g. XML). Attributes such as the document
number, internal patient ID, and the content reference are stored within this structure.
• In order to eliminate the patient's name from the free texts, the patient name is searched
for (case-insensitively, with resolution of vowels and name fragments) and
replaced by the pseudonym names.
Medizinische Universität Graz, Auenbruggerplatz 2, 8036 Graz, www.medunigraz.at
Rechtsform: Juristische Person öffentlichen Rechts gem. Universitätsgesetz 2002. Information: Mitteilungsblatt der Universität und www.medunigraz.at. DVR-Nr. 210 9494.
UID: ATU 575 111 79. Bankverbindung: UniCredit Bank Austria AG IBAN: AT931200050094840004, BIC: BKAUATWW
Raiffeisen Landesbank Steiermark IBAN: AT443800000000049510, BIC: RZSTAT2G
• In the third step, all free texts are scanned for name fragments of all known doctor names
(business partner - phys.Person), which are replaced by the pseudonym [-Arztname-].
• This is followed by a manual inspection.
This entire process is executed on the ISDA in the corresponding project folder. All text changes
are archived in journal files in order to improve the process and so that the original text can be
consulted at any time when substantive questions arise.
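The name replacement and journalling described above can be illustrated with a short Python sketch. It is not the Talend job used at MUG: the function names, the patient pseudonym string, and the journal layout are assumptions, and the vowel resolution step is left out because its details are not specified here.

```python
import re


def name_pattern(full_name):
    """Case-insensitive pattern matching any whole-word fragment of a name."""
    fragments = [re.escape(part) for part in full_name.split()]
    if not fragments:
        raise ValueError("name must not be empty")
    return re.compile(r"\b(?:" + "|".join(fragments) + r")\b", re.IGNORECASE)


def pseudonymise(text, patient_name, patient_pseudonym, doctor_names, journal):
    """Replace patient and doctor name fragments and record every change.

    Each journal entry is (position, original text, replacement), where the
    position refers to the text as it was when that replacement pass ran.
    """
    def substitute(pattern, replacement, current):
        def log_and_replace(match):
            journal.append((match.start(), match.group(), replacement))
            return replacement
        return pattern.sub(log_and_replace, current)

    text = substitute(name_pattern(patient_name), patient_pseudonym, text)
    for doctor_name in doctor_names:
        text = substitute(name_pattern(doctor_name), "[-Arztname-]", text)
    return text
```

The journal plays the role of the journal files mentioned above: it allows each replacement to be reviewed during the manual inspection and traced back to the original wording.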
Location of the pseudonymized corpus of documents:
After manual checking of the extracted data, the file is copied to a project-specific file share on the
ISDA which can be accessed via the SEMCARE server.
Access rights to the pseudonymized corpus of documents:
On the ISDA, qualified and authorized staff and system administrators of the IMI (Institute for
Medical Informatics, Statistics and Documentation) and the KAGES (Steiermärkische
Krankenanstaltengesellschaft m.b.H.) have unrestricted access.
Internal documentation of the "Provision of documents" subsection:
The whole extraction process is mapped within a Talend-Data-Integration job on the ISDA, which
serves as documentation of the complete ETL workflow.
Document takeover
Location of the acquired pseudonymized corpus of documents:
SEMCARE server, IMI server room, 5th floor, locked.
Access rights to the SEMCARE server/data folder:
The computer is integrated into the hospital network.
Access is therefore strictly limited to:
• System administrators of the KAGES
• System administrators of the IMI
• IMI SEMCARE project members
Authorized access to the data folder is allowed only from within the hospital network.
Internal documentation of the item entitled "Document takeover":
The transfer of the documents will be confirmed and documented by Markus Kreuzthaler and
Stefan Schulz during the course of the project.
Place of data processing:
The data processing of the acquired pseudonymized data set is allowed only on the SEMCARE
server. No data is transferred to external interfaces belonging to the project partner Averbis
(written confirmation by Averbis).
Processing of documents using the Averbis software
Location of software development:
Software development in combination with the processing of the pseudonymized data set is only
allowed on the SEMCARE server.
Software development using synthetically created documents, word lists, etc., and other publicly
available corpora can be performed on other PCs and laptops, however.
Software development IDE:
For software development, the Eclipse IDE with a Maven build is used. The build process requires
access to the external Averbis Maven repository via SSH. Averbis supplies
the preconfigured Eclipse project. The provided NLP blocks (sentence detection, date detection,
POS, negation detection, etc.) are parameterized according to the use case for the
pseudonymized data set.
Update of the Averbis-Software/SEMCARE Eclipse project:
The Averbis software on PCs and laptops that are not part of the hospital network (e.g.,
those using the MUG-User network) can be updated with the help of remote maintenance
software under the four-eyes principle. An IMI project member is present throughout and
both opens and closes the session. Updates are therefore performed manually by Averbis together
with an IMI project staff member.
The use of TeamViewer or comparable remote maintenance products on the SEMCARE server is
not allowed (the SEMCARE server is integrated in the hospital network). A WebSVN server,
hosted and managed by Averbis (https://aspirin.averbis.uni-freiburg.de/svn/graz/semcare), is used
for updates of the SEMCARE Eclipse project. Software and data can therefore be strictly
separated. Later in the project timeline, fast update processes in line with the use cases
could be supported.
The following data may be provided to Averbis:
• Trained models based on the pseudonymized data set.
• Synthetically created documents (made by manual substantive changes and administered
by Stefan Schulz).
• Word and term lists based on the pseudonymized data set.
• Word and term statistics based on the pseudonymized data set.
The release and verification of these data must be authorized and documented by Stefan Schulz.
Figure 1: System Sketch.
Legend:
HIS: Hospital Information System
ETL: Extract Transform Load
ISDA: Internal Server for Data Extraction
Data PRE-PSA: Data Pre-Pseudonymized (structured field contents)
Data PSA: Data Pseudonymized (free text contents)
SVN-Repository: Subversion Repository (hosting the Eclipse SEMCARE project)