Semantic Data Platform for Healthcare
ICT-611388

D1.1 Ethics Guidelines and Procedures

Lead beneficiary: Averbis
Date: 04/06/2014
WP1 – Scientific Coordination
Nature: Report
Version: V3.0 Final
Dissemination level: PU (Public)

© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement No 611388.

D1.1 – Ethics guidelines and procedures
WP1: Scientific Coordination
Dissemination level: Public
Author: SEMCARE Consortium
Version: 3.0

TABLE OF CONTENTS

DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
EXECUTIVE SUMMARY
KEY WORDS (WORDLE STYLE)
1. INTRODUCTION
2. PRIVACY RULES REGARDING PROCESSING OF HEALTHCARE DATA
  2.1. SCOPE AND DEFINITIONS OF THE DIRECTIVE 95/46/EC (‘DIRECTIVE’)
  2.2. PRINCIPLES OF THE DIRECTIVE
  2.3. SUPERVISORY AUTHORITY
  2.4. SUMMARIZING THE IMPORTANT ASPECTS OF THE DIRECTIVES FOR THE SEMCARE PROJECT
3. REPORTS OF THE RESPECTIVE SEMCARE-PROJECT HEALTHCARE DATABASE PARTICIPANTS
  3.1. EMC DATA BASE
  3.2. MUG DATA BASE
  3.3. SGUL DATA BASE
4. GENERAL PROCEDURES IN THE SEMCARE PROJECT FOR PROCESSING HEALTHCARE DATA
  4.1. OVERALL STRATEGY
    4.1.1. Overall strategy
    4.1.2. Data mining / data extraction
    4.1.3. Definition of the project architecture
    4.1.4. Data anonymisation
    4.1.5. Data leaving the hospitals to test the software
    4.1.6. SEMCARE ‘global’ use case
  4.2. EMC
    4.2.1. Protocol for de-identification
    4.2.2. Data management for research
    4.2.3. Adaptation of the SEMCARE software
    4.2.4. Installation of a foreign code in the hospitals
    4.2.5. Clinical connectors
  4.3. MUG
    4.3.1. Protocol for de-identification
    4.3.2. Data management for research (can include the generation of dummy data)
    4.3.3. Adaptation of the SEMCARE software
    4.3.4. Installation of a foreign code in the hospitals
    4.3.5. Clinical connectors
  4.4. SGUL
    4.4.1. Protocol for De-identification
    4.4.2. Data Management for Research
    4.4.3. Adaptation of the SEMCARE Software
    4.4.4. Installation of a Foreign Code in the Hospital
    4.4.5. Clinical Connectors
    4.4.6. Risks during Data Processing and Possible Solutions
5. OVERVIEW/CONCLUSION
ANNEX I. MUG'S PROCESSING OF DOCUMENTS USING THE AVERBIS SOFTWARE

DOCUMENT INFORMATION

Grant Agreement Number: ICT-611388
Acronym: SEMCARE
Full title: Semantic Data Platform for Healthcare
Project URL: www.semcare.eu
EU Project officer: Saila Rinne ([email protected])
Deliverable: Number 1.1, Title: Ethics guidelines and procedures
Work package: Number 1, Title: Scientific Coordination
Delivery date: Contractual 31/03/2014, Actual 04/06/2014
Status: Version V3.0, Final
Nature: Report
Dissemination Level: Public
Authors (Partner): AVERBIS, SYNAPSE, EMC, MUG, SGUL
Responsible Author: Philipp Daumke, Email: [email protected], Partner: AVERBIS

DOCUMENT HISTORY

NAME                  DATE      VERSION  DESCRIPTION
All partners          23/05/14  1.0      First draft
A. Honrado (SYNAPSE)  26/05/14  2.0      Internal review
D. Kalra              31/05/14  2.1      Ethics review
All partners          02/06/14  2.2      Feedback review and changes
A. Honrado (SYNAPSE)  04/06/14  3.0      Final version

DEFINITIONS

Partners (also named as beneficiaries) of the SEMCARE Consortium are referred to herein according to the following codes:
- AVERBIS - Averbis GmbH (Germany) – Coordinator
- EMC - Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
- MUG - Medical University of Graz (Austria) – Beneficiary
- SGUL - Saint George's University of London (UK) – Beneficiary
- SYNAPSE - Synapse Research Management Partners S.L. (Spain) – Beneficiary

Anonymisation: process of de-identification of data by suppressing or generalising values of attributes that identify a person; no retracing to the real person is possible.
Encryption: process of encoding information in such a way that only authorised parties can read it.
ETL: extract, transform, load; process in data warehousing that is often used to integrate data from multiple sources. A common ETL tool is Talend Open Studio.
HTTPS: Hypertext Transfer Protocol Secure; communications protocol for secure communication over a computer network using the SSL/TLS protocol.
LDAP: Lightweight Directory Access Protocol; standard application protocol for accessing and maintaining distributed directory services over a network. Directory services allow the sharing of information about users, systems, networks, services, and applications throughout the network.
NFS: Network File System; distributed file system protocol allowing a user on a client computer to access files over a network.
Project: the sum of all activities carried out in the framework of the Grant Agreement.
Risk: uncertainty that may have a significant impact on the execution or outcome of the project, and whose effect may be negative (a threat risk) or positive (an opportunity risk).
SSL/TLS: cryptographic protocols designed to provide communication security over the Internet.
TrueCrypt: freeware application used for on-the-fly encryption.
Virtual Machine (VM): software-based emulation of a computer.
Pseudo-anonymisation: process of de-identification of data by suppressing or generalising values of attributes that identify a person; retracing to the real person is possible for authorised people by using a mapping table that provides the personal data for a specific identifier.

EXECUTIVE SUMMARY

This document describes the general ethical and privacy guidelines and procedures that have been followed and are proposed during the SEMCARE project. SEMCARE addresses ethical issues by ensuring that the re-use of identifiable patient data complies with legal requirements on data privacy. The current EU-level legislation applicable to the clinical scenarios, as well as the transposition of the EU legal texts into national law, which has resulted in some degree of variability, will be taken into account.

At this point (M5) all ethical approvals for secondary use of data have been requested by the clinical partners, and their respective ethical committees have shown no objection to their processing. It is expected that ethical approvals will be available and sent to the Commission Services in the short term and included in the deliverable D1.3 Report on ethical framework and procedures in the project.

KEY WORDS (Wordle style) 1

1 http://www.wordle.net/.

1. Introduction

Healthcare is a data intensive enterprise – a wide range of teams and institutions in a variety of settings need to access patient data in order to provide safe and appropriate care to a patient. In all Member States of the European Union, and indeed most countries in the world, the collection, processing, sharing and storage of identifiable patient data is regulated by a framework of legislation, ethical requirements and professional regulations. These frameworks require that healthcare professionals and their support staff observe strict rules of patient consent, privacy and confidentiality when data are collected, processed or shared for the purposes of providing care.

However, the legislative framework of data protection is not the only set of rules which impacts on the handling of patient data. The ethical principles, based on patient autonomy, and working through consent and confidentiality, are more familiar, and are also professionally relevant to healthcare professionals, who need to balance patients' medical needs with perceived needs of privacy. Thus, the level of consent ethically required for the sharing of patient data is influenced by the purpose for which data are to be used. For immediate and on-going care, “implied” consent is generally regarded as sufficient, while for “secondary” purposes, such as financial and clinical audit, or when data are used for research purposes, informed (usually written) consent is needed. If patient data are fully anonymised, then consent is not required for secondary use, although it is good practice for patients to be made aware of this use of their information, even if they cannot be identified.

Although at the EU level there is a sound legal basis for Data Protection, Privacy and Human Rights, it is important to note that the transposition of the EU legal texts into national law has resulted in some degree of variability, which has in turn created some uncertainty both at national and EU level.
The upcoming review of the EU Data Protection Directive (Directive 95/46/EC), to give rise to a new Regulation, offers a unique opportunity to investigate the impact of the current legislation on eHealth services and to propose new interpretations and refinements which will allow EU legislation to serve as a facilitator of eHealth, rather than as the hurdle it is often portrayed as today.

All SEMCARE partners are complying with the Charter of Fundamental Rights of the European Union as well as the relevant international directives and declarations on ethical issues as detailed by the EU for FP7. Care is taken that all experiments are sanctioned by government officials including local and national ethics committees. All experiments conform to national and EU Directives on the confidentiality of personal data. Where patient material is used, informed consent is obtained and patient data are secured by anonymisation (data protection). The partner institutions respect and enforce all relevant international codes of practice, such as:
- the ethical standards of the 7th Framework Programme (only to mention Article 3: all research will be carried out in compliance with fundamental ethical principles)
- the Charter of Fundamental Rights of the European Union, signed in Nice, 7 December 2000
- the Convention on Human Rights and Biomedicine – Oviedo, 4/4/1997 – Council of Europe
- the Helsinki Declaration (Fortaleza, October 2013) – World Medical Association
- the principle that the ethical review within the European Commission's evaluation procedure does NOT replace local ethics committee or local authority approval

This document aims to address the ethical issues in the project, which mainly concern the use of anonymised clinical data (WP6) related to the use case that has been selected for the project.
Furthermore, the consortium follows the advice of an Ethics Advisor, Prof. Dipak Kalra, who acts as an external and independent expert monitoring all of SEMCARE’s ethical issues, ensuring that there is compliance with national and EU legislation governing data privacy as well as any other ethical considerations that may arise in the course of the project.

2. Privacy rules regarding processing of healthcare data

Personal privacy is a highly respected principle in the European Union (EU). All member states are signatories of the European Convention on Human Rights (ECHR) from the Council of Europe. According to article 8 of the ECHR people have, subject to certain restrictions, a fundamental right to respect for one's "private and family life, his home and his correspondence". This right is embedded into national legislation of most Member States.

Data privacy laws, however, vary widely across Europe. The European Commission (EC) realized that this diversity of national legislation impedes uniform data protection and the free flow of data within the EU zone. Therefore in 1995 the EC published Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data (‘Directive’) to harmonize data protection regulation within the EU. The Directive regulates the processing of personal data and the free movement of such data and had to be implemented into national law by the end of 1998. Currently all Member States have implemented it within their own national data protection legislation.

The Directive is not a ‘closed regulatory system’ and leaves open a certain scope for policy making at national level. However, certain minimum requirements must be complied with. This includes the processing of personal data for scientific purposes, such as the processing of data in and for the SEMCARE project.
The SEMCARE platform will be installed in the departments of hospitals and connected to the local access control system within each site, making data only accessible to physicians that are allowed to see the data according to the local hospital policies. This way, physicians will be able to analyse the patient records of their department, and thus to conduct retrospective analyses.

Due to the sensitive nature of the personal health data it is important for SEMCARE to be fully aware of ethical and regulatory aspects and to implement all reasonable measures to ensure compliance with ethical and regulatory issues on privacy. Although all EU participants have different national legislations regarding privacy (and informed consent), the aforementioned Directive applies to all of them and can function as a base and point of reference in this document.

Section 2.2 summarizes aspects and articles of the Directive most relevant to the SEMCARE project. In Section 3 SEMCARE’s clinical partners have briefly set out how they individually deal with these (ethical and regulatory) aspects at the national level and which procedures are implemented in their organisation/country to comply with the applicable regulations.

Furthermore, any ethical issues regarding use of electronic health care databases encountered during the project shall be reported by parties encountering them. They shall report these issues to the scientific coordinator in writing, together with a description of how they have been solved. These reports shall be included either in the deliverable D1.3 Report on ethical framework and procedures in the project (due in month 12) or in the deliverable D1.4 Final report on ethical issues and data privacy (due in month 24).

2.1. Scope and definitions of the directive 95/46/EC (‘Directive’)

In the Directive (art. 2 sub a) ‘personal data’ are defined as "any information relating to an identified or identifiable natural person (‘data subject’); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity". This definition is very broad. Data are ‘personal data’ when someone is able to link the information to a person, even if the person holding the data cannot make this link.

For the purpose of the Directive (art. 2 sub b) ‘processing’ means "any operation or set of operations which is performed upon personal data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction".

The "controller" (art. 2 sub d), meaning “the natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data”, is the one responsible for compliance (art. 6 sub 2).

According to article 4 of the Directive the data protection rules are not only applicable when the controller is established within the EU, but whenever the controller uses equipment situated within the EU in order to process data. Controllers from outside the EU, processing data in the EU, will have to follow the EU data protection regulation.

2.2. Principles of the directive

Personal data may not be processed, except when certain conditions are met. These conditions fall into three categories: legitimate purpose, transparency and proportionality.
Legitimate purpose

Article 6 sub b states that personal data can only be processed for specified, explicit and legitimate purposes and may not be processed further in a way incompatible with those purposes. Further processing of data for historical, statistical or scientific purposes shall not be considered as incompatible provided that Member States provide appropriate safeguards.

Transparency

Personal data may be processed only under the following circumstances (art. 7):
- when the data subject has given his/her consent; or
- when the processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract; or
- when processing is necessary for compliance with a legal obligation to which the controller is subject; or
- when processing is necessary in order to protect the vital interests of the data subject; or
- when processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller or in a third party to whom the data are disclosed; or
- when processing is necessary for the purposes of the legitimate interests pursued by the controller or by the third party or parties to whom the data are disclosed, except where such interests are overridden by the interests for fundamental rights and freedoms of the data subject.

The Directive states that the data subject has the right to be informed when his/her personal data are being processed and that the controller must provide his/her name and address, the purpose of processing, the recipients of the data and all other information required to ensure the processing is fair.
If any personal data is collected, the controller or his/her representative must provide a data subject from whom data are collected with at least the following information, except where he already has it (art. 10 and 11):
- the identity of the controller and of his/her representative, if any;
- the purposes of the processing for which the data are intended;
- any further information such as: the recipients or categories of recipients of the data, whether replies to the questions are obligatory or voluntary, as well as the possible consequences of failure to reply, the existence of the right of access to and the right to rectify the data concerning him.

The foregoing obligation on providing information to the data subject does not apply where the data have not been obtained from the data subject himself (e.g. from general practitioners or from claims) and where, in particular for processing for statistical purposes or for the purposes of historical or scientific research, the provision of such information proves impossible or would involve a disproportionate effort or if recording or disclosure is expressly laid down by law. In these cases Member States shall provide appropriate safeguards.

According to article 12 the data subject has the right to access all data processed relating to him. The data subject even has the right to demand the rectification, deletion or blocking of data that is incomplete, inaccurate or is not being processed in compliance with the data protection rules.

Proportionality

According to article 6:
- Personal data must be processed fairly and lawfully;
- Personal data must be collected for specific and legitimate purposes and not further processed in a way incompatible with those purposes. Further processing of data for historical, statistical or scientific purposes shall not be considered as incompatible provided that Member States provide appropriate safeguards;
- Personal data must be adequate, relevant and not excessive in relation to the purposes for which they are collected and/or further processed;
- Personal data must be accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that data which are inaccurate or incomplete, having regard to the purposes for which they were collected or for which they are further processed, are erased or rectified;
- Personal data must be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed. Member States shall lay down appropriate safeguards for personal data stored for longer periods for historical, statistical or scientific use.

The processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life is prohibited (art. 8 sub 1) except where (art. 8 sub 2):
- the data subject has given his/her explicit consent to the processing of those data, except where the laws of the Member State provide that the prohibition referred to in paragraph 1 may not be lifted by the data subject's giving his/her consent; or
- processing is necessary for the purposes of carrying out the obligations and specific rights of the controller in the field of employment law in so far as it is authorized by national law providing for adequate safeguards; or
- processing is necessary to protect the vital interests of the data subject or of another person where the data subject is physically or legally incapable of giving his/her consent; or
- processing is carried out in the course of its legitimate activities with appropriate guarantees by a foundation, association or any other non-profit-seeking body with a political, philosophical, religious or trade-union aim and on condition that the processing relates solely to the members of the body or to persons who have regular contact with it in connection with its purposes and that the data are not disclosed to a third party without the consent of the data subjects; or
- processing relates to data which are manifestly made public by the data subject or is necessary for the establishment, exercise or defense of legal claims.

Furthermore an exception to this prohibition of article 8 sub 1 is set out in article 8 sub 3: Article 8 sub 1 shall not apply where processing of the data is required for the purposes of preventive medicine, medical diagnosis, the provision of care or treatment or the management of health-care services, and where those data are processed by a health professional subject under national law or rules established by national competent bodies to the obligation of professional secrecy or by another person also subject to an equivalent obligation of secrecy.

2.3. Supervisory authority

According to article 28 each Member State must set up a supervisory authority, which is an independent body that:
- monitors the data protection level in that Member State;
- gives advice to the government about administrative measures and regulations; and
- starts legal proceedings when data protection regulation has been violated.

Individuals may lodge complaints about violations to this supervisory authority or in a court of law.

The controller must notify the supervisory authority before he starts to process data. The notification contains at least the following information (art. 19):
- the name and address of the controller and of his/her representative, if any;
- the purpose or purposes of the processing;
- a description of the category or categories of data subject and of the data or categories of data relating to them;
- the recipients or categories of recipient to whom the data might be disclosed;
- proposed transfers of data to third countries;
- a general description of the measures taken to ensure security of processing.

This information is kept in a public register.

2.4. Summarizing the important aspects of the directives for the SEMCARE project

Processing of personal data is legitimate for scientific purposes if adequate safeguards are provided and followed. Data owners need to explain what their safeguards are.

The data subject has the right to access the data processed about him/her; thus there must be a registry recording for which purposes and by whom the data are processed. Database owners need to specify how they keep such a registry, and in the project the data warehouse will have to be built so that it keeps logs of the processing of data.
Consent is not necessary where the data have not been obtained from the data subject himself (as is the case in SEMCARE for all clinical databases) and where, in particular for processing for statistical purposes or for the purposes of historical or scientific research, the provision of such information proves impossible (e.g. because data have been de-identified) or would involve a disproportionate effort or if recording or disclosure is expressly laid down by law. In these cases Member States shall provide appropriate safeguards. In the case of SEMCARE, all clinical partners will have in place a data de-identification process that makes it impossible to identify the subjects, except for authorised people through a specific procedure. If patients need to be contacted for collection of additional information relevant to the study, consent will be requested according to the procedures specified by each of the databases.

Processing of health data may be done by a health professional subject under national law or rules established by national competent bodies to the obligation of professional secrecy or by another person also subject to an equivalent obligation of secrecy. Database owners must provide their conditions for use and processing of the data. The SEMCARE system will be installed in each clinical centre participating in the project and data will not leave the clinical environment. Only anonymised data will be transferred outside the clinical environment for the purpose of testing the system performance. Each clinical partner will be responsible for managing and maintaining their own data warehouse and will provide passwords only to persons who are approved to analyse data for the project purposes (see section 4.4.1).
Personal data should be processed adequately and correctly, kept up to date and not stored longer than necessary. In SEMCARE a careful definition of the storage requirements and conditions is needed.

3. Reports of the respective SEMCARE-project healthcare database participants

3.1. EMC data base

An extensive description of all issues concerning the use of patient data for research purposes in The Netherlands can be found on the website of the CCMO (Centrale Commissie Mensgebonden Onderzoek, “Central Committee on Research involving Human Subjects”, http://www.ccmo.nl/en/). Comprehensive information on the legal framework can be found on http://www.ccmo.nl/en/legalframework.

Briefly, if a study falls under the scope of the WMO (Wet medisch-wetenschappelijk onderzoek met mensen, “Medical Research Involving Human Subjects Act”), then it must undergo a prior review by an accredited medical ethics committee (MEC). There are 24 accredited MECs in the Netherlands that review medical/scientific research proposals (http://www.ccmo.nl/en/accredited-mrecs). The majority are linked to an institution such as a hospital or an academic medical centre, including the Erasmus MC.

Research falls under the WMO if the following criteria are met: (1) it concerns medical/scientific research, (2) participants are subject to procedures or are required to follow rules of behaviour (http://www.ccmo.nl/en/your-research-does-it-fall-under-the-wmo). Importantly, retrospective research/research with patient records does not fall under the WMO: the data are not gathered for the sake of the research, and participating subjects are not required to change their behaviour for the sake of the research. In these cases, the MEC is often still asked to formally confirm that a research proposal does not fall under the WMO. We have also followed this approach in the Erasmus MC for the SEMCARE global use case scenario.
Use of identifiable patient data in research that does not fall under the WMO still needs informed consent from the data subjects, but if the data are de-identified informed consent is not required.
3.2. MUG data base
Relevant national law and its relationship to European law can be found at http://www.ethikkommissionen.at/ (Austrian online platform of the ethics board). The SEMCARE platform will work on pseudo-anonymized patient data. The ETL workflow as well as the pseudo-anonymisation process is described in Appendix MUG, sections “Provision of documents” and “Document takeover”. A sketch of the local environment with the embedded search platform is given in Figure 1: System sketch.
3.3. SGUL data base
SGUL has to pay more attention to data protection issues than other participants because two types of data are involved in the SEMCARE project: the anonymised data for development and testing, and the data loaded in the SEMCARE system.
For the data used for development and testing purposes, the Data Protection Act 1998² requires the data to be anonymised before they are transferred to a third party. The national standard on data anonymity is "Anonymisation: Managing Data Protection Risk Code of Practice"³ developed by the UK Information Commissioner's Office. An anonymising system will be developed according to this standard and to the characteristics of clinical texts, using text mining techniques to automatically identify uniquely identifying information and then merging documents based on such information.
Data loaded in the prototype system comes from the same department where the system will be installed, and there is no requirement of anonymisation. The only two restrictions come from Trust policy and the Data Protection Act 1998, and these will be adhered to:
1. Data will be held in the secure demilitarized zone, an area of the network that is isolated from the internal network.
2. Identifiable data cannot be transferred outside of the Trust.
² http://www.legislation.gov.uk/ukpga/1998/29/contents
³ http://ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf
4. General procedures in the SEMCARE project for processing healthcare data
4.1. Overall strategy
4.1.1. Overall strategy
This section gives an overview of the processes and mechanisms applied to protect data privacy in the SEMCARE project. More detailed information on the implementations that are in place locally at the clinical sites is provided by the clinical partners in sections 4.2 to 4.4.
The local authorities in the hospitals are responsible for defining the security needs within the clinic. The SEMCARE project assumes that corresponding measures for privacy protection are already in place at the data-providing clinics through the clinical systems that are already running. The locally installed SEMCARE services will follow these existing rules. Additional measures are taken to protect the processed data, especially with regard to the project-related infrastructure, e.g. rules for data deletion or additional protection of system access and data transfer. In addition, the patient data is de-identified at the sites in order to assure data privacy.
Legal basis
During system development and testing, the legal basis for the transfer of test data is that data will be fully anonymised according to existing local procedure at the clinical sites to ensure that the patients cannot be identified.
Once the SEMCARE system goes into production and no longer needs to run on de-identified data, the legal basis for the data handling and processing is the national privacy legislation applicable to each clinical site, together with the EU data protection regulations that apply to all project partners. The SEMCARE system may then be considered an addition to the core capability of the hospital information system, supporting patient safety for individuals. Details about the regulations are described in sections 2 and 3 of this document.
Responsible authorities
A number of authorities are involved across the whole process of SEMCARE system development, installation and execution. The following table lists the authorities involved in the SEMCARE project and gives a short description of their responsibilities.
Local Authority | Description
Research Institution | The medical informatics departments at MUG and EMC are research institutions involved in the SEMCARE project. They provide input about new technologies and methods in the field of text analysis and terminologies.
Data Providing Institution | Hospital site providing test data for system development. The test data provided to AVERBIS will be anonymised or artificially generated. Real patient data will not leave the hospital site and will only be used during usage of the final productive system at the site. Patient data will not be used during system development.
Local IT | IT departments at the three different hospital sites. They are responsible for approving the software and for the installation and integration of the system into the existing local IT landscape. Only dedicated personnel will be able to access the SEMCARE systems.
Data Privacy Protector / Ethics Advisor | Local privacy protectors are available at each hospital site. Overall privacy protector and ethics advisor for the project and the committee is Dipak Kalra⁴.
System Provider | AVERBIS develops the software system externally and provides it to the hospital sites for local installation. No real patient data will be provided to AVERBIS.
⁴ Clinical Professor of Health Informatics, Director, Centre for Health Informatics and Multiprofessional Education, University College London
Physical access control
At the clinical sites, measures are already in place to deny unauthorised persons physical access to machines working on sensitive personal data. The computer rooms are protected adequately. Only authorised administrative personnel will be able to enter the server rooms where the SEMCARE systems are installed locally.
Computer access control
Appropriate measures are in place at the sites to prevent unauthorised people from logging into computers containing personal clinical data. Only authorised administration personnel will get access to the SEMCARE systems according to their allocated tasks and roles. The following users will have access to the SEMCARE systems according to their tasks:
Local system administrator. The local system administrator is an employee of the local IT department of the clinical site. He/she enables the installation of the local SEMCARE server and has access to the systems for maintenance reasons. The local system administrator holds a detailed description of all processes, which allows him/her to identify failing processes and to restart failed processes in case of problems with the SEMCARE services.
SEMCARE delegate. The SEMCARE delegate is responsible for the control and regulation on application level.
He/she serves as contact person in case of problems or questions regarding the software or the usage of the SEMCARE services. These can be questions from end users or from the local system administrator (e.g. in case of a software update). Software updates that cannot be performed by the local system administrator alone are performed by the SEMCARE delegate together with the local system administrator, in confidence.
End user. The end user has access to the SEMCARE application according to the role and authorisation concept. This can be implemented either by connecting the SEMCARE system to the local rights management system or by creating corresponding user accounts that have different access rights on the computer hardware.
It must be ensured that people who are authorised for machine access can only access data for which they have the corresponding rights. Personal data must not be read, copied, changed or deleted by unauthorised people during usage, processing or storage. End users will get access to the application according to the role and authorisation concept and will not have access to the system components. Access logging can be activated for traceability. Furthermore, it will be ensured that no project-related data is written to or stored in locations to which unauthorised persons have access.
Hardening of the system
The SEMCARE system will be installed on a local server in the hospitals by the local IT department in close collaboration with the SEMCARE project team. All components that are not needed should be removed from the server and all remaining components should be tested to be secure. Furthermore, it has to be checked which components have to be installed additionally in order to improve the security of the systems. These could be components like firewalls, intrusion detection systems, virus protection and similar programs. Additionally, the user rights on the server have to be checked and restricted if needed.
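The role and authorisation concept with optional access logging described above can be illustrated with a minimal sketch. It is not the SEMCARE implementation: the role names, the permission names, the `User` class, the `may_query` function and the logger name are all hypothetical, and a real deployment would more likely delegate these checks to the local rights management system.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
access_log = logging.getLogger("semcare.access")  # hypothetical logger name

# Hypothetical roles and permissions, loosely following the three user types
# named in the text. Only end users may query patient data.
ROLE_PERMISSIONS = {
    "local_system_administrator": {"maintain_system"},
    "semcare_delegate": {"support_application"},
    "end_user": {"query_data"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str
    departments: frozenset = frozenset()  # departments whose data the user may see


def may_query(user: User, record_department: str) -> bool:
    """Return True only if the role allows querying and the department matches."""
    permissions = ROLE_PERMISSIONS.get(user.role, set())
    allowed = "query_data" in permissions and record_department in user.departments
    # Every decision is logged so that access can be traced afterwards.
    access_log.info(
        "user=%s role=%s department=%s allowed=%s",
        user.name, user.role, record_department, allowed,
    )
    return allowed
```

In this sketch a cardiology end user is allowed to query cardiology records and refused for any other department, while an administrator account is refused for patient data altogether.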
Update Management
The SEMCARE application will be distributed to the three clinical partners as a virtual machine (VM) for local installation. A plan for the update process of the VMware template has to be developed, defining who is responsible for the master image of SEMCARE and how changes to the master image are transferred to the different clinics. This affects updates of operating system components as well as updates of the SEMCARE services.
Penetration / Scaling / Load Tests
In the course of the project and the technical development, penetration, scaling and load tests could be performed in order to check the particular components of the system. With these kinds of technical analyses, possible weaknesses could be identified that a third party might use to enter the SEMCARE system. As a result, corresponding measures could be taken to improve the security of the system. The need for such an analysis and the checks to be performed will be determined during the project, involving all project partners.
4.1.2. Data mining / data extraction
Within the scope of the SEMCARE project, unstructured clinical data is extracted from various hospital information systems and files and transferred to the SEMCARE system for data storage and analysis. The clinical data and the application services will be physically separated. Data storage is done on NAS (Network Attached Storage) servers and the application runs on a dedicated server as a VM. The advantage of separate data storage is that the physical storage locations are known and can easily be disconnected from the system if needed. Furthermore, a backup of the data could be created separately from the system logic at regular intervals.
The consolidation of the clinical data and its transfer into the SEMCARE system is done by the local IT department using the tool Talend Open Studio⁵ or another ETL (extract, transform, load) tool. Talend pushes the data into the PostgreSQL data store and to the Solr and text analysis component. By means of the text analysis, relevant information such as diagnosis, symptoms, and therapy of patients is obtained from the unstructured data sets. During the analysis no personal data is modified. The analysis results are written to the PostgreSQL data store and the Lucene/Solr search engine index on the NAS. All communication and data transfer between the NAS and the VM is done via NFS 4.0 (Network File System). The end user can then perform queries on the (un)structured data and the system returns relevant results. The whole process of data transfers and data storage can be seen in Figure 1.
⁵ http://talend.com/products/talend-open-studio
Figure 1: SEMCARE data flow and data storage
The following paragraphs describe some fundamental measures that can be taken for the data extraction and text mining procedures in order to ensure data privacy on personal clinical data.
Data economy
The research project SEMCARE utilises hospital data related to specific use cases. At present we focus on one use case in the cardiologic area that has been described in more detail in the previous deliverable D3.1 (Sketch of system architecture specification). Only data necessary to answer the defined use case will be extracted and processed by the application. Patient data in the hospital system that is unrelated to the project use cases will not be considered.
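As an illustration of the data economy principle, the following sketch filters source records down to one use case and to the fields that use case needs before anything is loaded. It is only a sketch under stated assumptions: the record layout, the field names, the department label and the function `select_for_use_case` are hypothetical and do not describe the Talend job used in the project.

```python
# Hypothetical use-case definition: which department and which fields are needed.
USE_CASE_DEPARTMENT = "cardiology"
USE_CASE_FIELDS = ("record_id", "diagnosis", "symptoms", "therapy")


def select_for_use_case(records, department=USE_CASE_DEPARTMENT, fields=USE_CASE_FIELDS):
    """Keep only records of the given department, reduced to the needed fields."""
    selected = []
    for record in records:
        if record.get("department") != department:
            continue  # unrelated to the use case: never extracted
        # Copy only the whitelisted fields; everything else is dropped.
        selected.append({key: record[key] for key in fields if key in record})
    return selected
```

Working from a whitelist of fields, rather than a list of fields to remove, means that a newly added source column is excluded by default.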
The clinical data that is needed for the current use case will be loaded into the SEMCARE system by the local IT department using the ETL software Talend Open Studio or another ETL software. Data will only be loaded via the ETL software; there is no other way of data input into the system.
Logging
Only allocated and defined personnel will have access to the system components of the SEMCARE application. To ensure traceability, the data exchange between the different SEMCARE components will be logged. Such logging can be used for fault analysis but also for subsequently examining whether personal data has been entered, changed or deleted, and by whom. To what extent logging is needed and the exact data and actions to be logged will be discussed and defined in future steps within the project.
Secure Data Transfer / Encryption Procedures
The communication between the different SEMCARE components within the intranet of the clinic should be secure, as should the data storage. The minimal requirements of the cryptographic procedure (e.g. minimal key length and cipher suites) and the choice among the available procedures will be refined in the course of the project. For example, measures could be taken to ensure that the communication between the different components of the SEMCARE system (e.g. client computer and server) passes over a secure data line. A possible method for encrypting the data transmission is transfer via HTTPS, which adds the security capabilities of the SSL/TLS protocol to standard HTTP communication. Additional encryption of stored data, e.g. the data stored in the Solr index, is possible by using the software TrueCrypt⁶ in order to protect it from non-authorised access. However, this is not advisable in a search application as the query response time will degrade dramatically. Furthermore, hardware encryption should be applied to ensure that no confidential information can be accessed when a data storage medium is stolen.
⁶ http://www.truecrypt.org/
Data Deletion / Time Periods
Once Talend Open Studio loads data from the clinical data pool onto the SEMCARE server, the system begins to store patient data. This data has to be deleted as soon as it is no longer needed. All data that is created, provided or used during the project outside the participating hospitals will be deleted upon finalisation of the project. This includes the anonymised data for development and testing of the application by the system provider AVERBIS (Averbis GmbH). The expected end of the project is 31st December 2015.
4.1.3. Definition of the project architecture
The SEMCARE architecture has been designed so that the system does not work directly on the production system of the hospital but only on a copy of the data, i.e. staged data that will be provided by the hospital. That way, the real patient data is protected against accidental destruction or loss caused by SEMCARE. Additionally, no load will be generated on the clinical production system. The patient data that is used for patient care in the hospital will at no time be accessed by the SEMCARE systems. Figure 2 below shows the broad overview of the system architecture that has been explained in detail in deliverable D3.1.
Figure 2: Systems involved in the SEMCARE architecture
The SEMCARE systems (staging and production) will be installed locally at the hospital site. After the clinical data has been loaded into the SEMCARE systems, it will be processed and evaluated by appropriate algorithms without ever leaving the hospital.
The project integrates into the existing IT landscape of the hospital with regard to physical access, computer access, and data access control for the IT components used (servers and network components). This also protects the security of particularly sensitive health care data arising in a hospital. Furthermore, the architectural design of the SEMCARE platform permits data processing and storage on separate hard drives if needed. Separate hard drives could be necessary if different departments are involved as data providers or if there is data on which users have differing rights.
A connection to the local rights management system, such as LDAP, can be implemented in order to replicate existing access rights. Such integration can be provided individually for each of our clinical partners. A role concept can be applied in order to assure that only authorised users can access the data related to the SEMCARE project. Furthermore, all activities in the context of the SEMCARE system will be logged so that it can be examined whether patient data has been entered, changed or deleted, and by whom. Only allocated and defined personnel will have access to the system components and applications of the SEMCARE system.
Availability control
The local VM in the clinic does not require any hardware backups or recovery mechanisms because data loss can be overcome by reloading and reprocessing the data from the staged instance if necessary. In case of a complete and permanent outage of the VM in the clinic, and if disaster recovery is not possible, the local VM can be replaced by a new copy of the SEMCARE VM that could be provided by AVERBIS. All data for the current use cases will then have to be re-imported from the clinical staging system into the SEMCARE services.
As the SEMCARE application is not crucial for patient care, high availability of the data platform is not required and a fully automated outage concept will not be provided.
Data Segregation
Within the SEMCARE project the system applications will be provided and run on a virtual machine (VM). If a clinical partner wants to use the SEMCARE technologies, the clinic gets a copy of this VM, which is pre-configured for the corresponding clinic. This copy will then be integrated into the clinical network by the local IT department. In this way, all clinical partners will have their individually configured and dedicated system.
To enable a separation of the clinical data and the application services provided by AVERBIS, we plan data storage on a separate machine, a so-called NAS (network attached storage). The NAS can be mounted on the VM and data can easily and securely be transferred between the NAS and the SEMCARE application on the VM via NFS 4.0. However, as the data is located on a different machine, the access of the SEMCARE application can also easily be stopped by taking the data storage machine offline if needed.
In the beginning, only one use case within the cardiologic field will be relevant for the project, thus no further data separation mechanisms for data storage are needed. If more use cases are added in the future, it must be assured that data from different scenarios or different departments are separated from each other according to existing access rights. For this purpose, distributed Solr indexes per use case could be used, which is supported by the generic architecture design. In this way, a further data separation could be performed within the SEMCARE application in a clinic if needed.
All the SEMCARE systems will only be run locally and queries will only be performed on relevant patient data. Other information that is not relevant for the defined use case will not be extracted from the hospital systems.
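The idea of keeping one index per use case, and searching only the indexes a user is entitled to, can be sketched as follows. This is an in-memory stand-in written for illustration only; it does not use the Solr API, and the class `SegregatedStore`, its methods and the use-case labels are hypothetical.

```python
from collections import defaultdict


class SegregatedStore:
    """Toy document store that keeps a separate index per use case."""

    def __init__(self):
        self._indexes = defaultdict(list)  # use case -> list of documents

    def add(self, use_case, document):
        self._indexes[use_case].append(document)

    def search(self, term, permitted_use_cases):
        """Search only the indexes named in permitted_use_cases."""
        hits = []
        for use_case in sorted(permitted_use_cases):
            # .get() avoids creating empty indexes for unknown use cases.
            for document in self._indexes.get(use_case, []):
                if term.lower() in document.lower():
                    hits.append((use_case, document))
        return hits
```

Because the permitted use cases are passed to every search, a document stored for another scenario or department is never scanned, regardless of the search term.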
A system for software development and testing and a production system will be provided separately.
4.1.4. Data anonymisation
Introduction to Anonymisation
Anonymisation is defined as “a process that removes the association between the identifying data and the data subject”, according to ISO Technical Specification ISO/TS 25237 (Health informatics – Pseudonymisation).
There are two types of anonymisation techniques: masking and de-identification. They deal with different fields in a data set.
Masking
Masking typically protects direct identifiers such as names and various IDs, for example the NHS number. It involves significant distortion of the data. Suppression, which removes a whole field, is the most commonly used masking technique. Another masking technique involves replacing actual values with random values selected from a large database. The only standard for masking is ISO Technical Specification 25237, which focuses on the different ways that pseudonyms can be created, but does not specify the techniques to use.
De-identification
De-identification involves protecting fields like demographics and individual information, such as age, home and work address, income, number of children and race. De-identification minimally distorts the data so that meaningful analytics can still be performed on it, while still protecting privacy. There are three standard approaches to de-identification: lists, heuristics and risk-based methodology.
Lists specify the data elements that need to be removed or generalized. A good example is the Safe Harbor standard in the HIPAA Privacy Rule, in which 18 data elements are listed. However, this method has been significantly criticised because it does not provide real assurance that there is a low risk of re-identification.
Heuristics are rules of thumb that have been developed and applied over years. For example, never release dates of birth, but allow the release of treatment and visit dates. The drawbacks of heuristics are: 1) the existing rules may not cover all circumstances, especially for rare diseases; 2) it is difficult to prove technically how effectively such methods protect privacy, which makes them unsuitable for data providers that want to manage their re-identification risk in data release; 3) experts or judges are required to determine the rules.
Risk-based methodology applies mathematical algorithms to automatically change the sensitive contents in healthcare data, while maintaining the trade-off between the re-identification risk in the data and the utility of the data. It is consistent with several standards from regulators and governments, such as “Anonymisation: Managing Data Protection Risk Code of Practice” by the UK Information Commissioner’s Office.
4.1.5. Data leaving the hospitals to test the software
AVERBIS will develop the underlying software of the SEMCARE platform. In order to test the performance of the software and to improve the algorithms it is essential to have representative test data that can be used for this purpose. The test data can be fake or anonymised data but should have a similar structure to the real patient data that will finally be used in the productive system.
Within the SEMCARE project the clinical partner SGUL (Saint George's University of London) will provide anonymised clinical test data to AVERBIS for the development of the algorithms and interfaces and for testing and improvement purposes. The legal basis that allows the transfer of such test data by SGUL is section 251 of the NHS Act 2006.
This section says that confidential patient information can be transferred to a third-party applicant for the purpose of supporting health service improvements, bypassing the common law duty of confidentiality. However, the handling of confidential patient information must still comply with all the other relevant obligations, e.g. the Data Protection Act 1998, which enforces the data anonymisation. SGUL is responsible for the data disclosure. Transferred test data will be encrypted either at rest or in transit.
The clinical partners EMC (Erasmus Universitair Medisch Centrum Rotterdam) and MUG (Medical University of Graz) will not provide any data to AVERBIS or to any other clinical partner.
In the scope of the SEMCARE project, real patient data will never leave the hospitals, in order to meet the privacy regulations. Data processing of the final system and the operation of the data platform will be performed within a dedicated server infrastructure in the hospital. Patient data may, however, be shared between different departments within each hospital. For these data transfers, the (pseudo-)anonymisation processes that already exist at the clinical sites will be applied. The de-identification procedures that are used locally in each of the three participating hospitals are explained in detail in the following hospital-specific sections.
For any transfer of test data to AVERBIS or transfer of patient data between different departments within the hospital, it will be ensured that personal data cannot be read, copied, changed or deleted by unauthorised people during electronic transmission or during transport or storage on a data storage medium.
As mentioned before, in the SEMCARE project all patient data will remain within the hospitals. No real clinical data will be transferred outside the hospital, neither by employees nor by systems. The data processing of all productive systems will be performed locally.
4.1.6.
SEMCARE ‘global’ use case
The starting point of data processing within the SEMCARE project is the clinical data that is available in the various information systems within the participating hospitals. The patient data necessary for analysis is information about diagnosis, symptoms, laboratory data, therapies, and medications that is contained in discharge letters and medical reports, databases or other clinical information systems. The SEMCARE systems have to be able to handle huge amounts of structured as well as unstructured data extracted from the aforementioned sources, to be analysed by specific text mining procedures. The individually required data set is defined by the corresponding use case. The department providing the data will be identified accordingly during the course of the project. In a first ‘global’ use case called ‘Risk Stratification and Differential Diagnosis of Patients suffering from transient loss of consciousness’ we concentrate on cardiologic disorders.
Three possible use cases are targeted during the term of the SEMCARE project:
a) Diagnosis support
Many diseases are very rare and affect only a small number of people. If a patient suffering from such a rare disease sees a doctor and describes his/her symptoms, in many cases the doctor is not able to determine the diagnosis because he/she is not aware of the disease. The SEMCARE system can help the doctor with the diagnosis of rare diseases. For example, if the doctor hears of a previously unknown disease, he/she could use the provided tools to search all the patient data available in the clinic for patients with the symptoms that characterise the rare disease. In this way, patients who could suffer from the rare disease and were previously without diagnosis, or even with an incorrect diagnosis, can be identified and seen again by the doctor. The SEMCARE project thus offers an enormous advantage for patients suffering from rare diseases, allowing correct diagnosis and a specific and early therapy.
The physician performing the search on specific symptoms only has access to personal patient data according to his/her user rights. He/she will not be able to identify a patient whose data he/she does not have the rights to see, e.g. because the patient was treated in another department. In this way it is assured that no unauthorised access to the personal data takes place. The patient data is only used to identify possible candidates for rare diseases and to invite the patients for another visit to re-check the symptoms and possibly establish the diagnosis.
b) Protocol feasibility
When planning a new clinical trial and working on the protocol design, several points need to be considered in advance. One of the main aspects to be investigated is feasibility with regard to patient enrolment. The protocol has to define how many patients are planned and needed to participate in the study in order to obtain an adequate amount of data that can be evaluated subsequently. This implies that the patient population fulfilling the inclusion criteria defined in the protocol has to be determined. In this way the investigator gets an estimate of how many patients match the criteria and in which timeframe it would be possible to reach the planned patient recruitment number. Depending on the number of potential participants, it can also be evaluated how many clinical sites would need to participate in the study in order to recruit the necessary number of patients.
In this regard the SEMCARE data platform can be of great help in evaluating the patient population for protocol feasibility assessment. Patient data that is already available in hospital information systems can be analysed against the inclusion criteria, giving an overview of the potential study participants within a specific hospital. In this way the investigator can see whether enough patients are available at the clinical site to perform the study or whether other sites would have to be included to reach the planned number of participants. It can also easily be evaluated which timeframe is realistic for achieving complete patient recruitment.
With regard to the ethical aspects of such an assessment there is no problem, as the personal data which will be analysed is only used to get an overview of the patient population. No individual data is published, and a reference to the patient-identifying data (like name or exact date of birth) is not necessary for this use case. SEMCARE is thus a useful tool that supports the investigator during protocol preparations and speeds up the process until protocol start.
c) Patient recruitment
As described in the previous section, each clinical trial has a defined number of patients that need to be enrolled in order to fulfil the protocol requirements. Sometimes it is hard to achieve the specified number of study participants within the timeframe. In these cases SEMCARE can assist as a tool to identify patients within a dedicated clinic who fulfil the protocol inclusion criteria. Again, the investigator will only be able to search data in his/her hospital for which he/she has the corresponding access rights. After identification, potential study participants will be asked to sign an informed consent for study inclusion, thus allowing the usage of their personal data for data analysis and statistical evaluations.
4.2. EMC
4.2.1.
Protocol for de-identification The patient records that will be used for the use case in the Erasmus MC will be de-identified by applying a locally developed automated method that has been tested and approved by the Erasmus MC privacy officer. Briefly, we follow a two-step de-identification process. Each patient record contains a header with a unique hospital identifier from the hospital database. First, for each patient a unique identifier is generated which replaces the hospital id in the header. The same unique id is used for all other occurrences of the hospital id. A mapping between the hospital id and the newly generated unique id is locally stored on the computer where the patient records reside. In a second step, all information related to names (including those of patients, clinicians and technicians), cities, streets, postal codes, and hospitals is removed. Any identified words are replaced with a corresponding category name, e.g., the name J. Doe is replaced by its category #Name#. The de-identification process is based on category lists and category-specific rules. During the development stage of the system, only de-identified patient records will be used. 28 D1.1 – Ethics guidelines and procedures WP1: Scientific Coordination Dissemination level: Public Author: SEMCARE Consortium Version: 3.0 4.2.2. Data management for research After de-identification, data are stored on an encrypted removable drive and transferred to the SEMCARE workstation. There will not be data transfer to other applications. The workstation and data will only be accessible by authorized persons from the Departments of Medical Informatics and Cardiology. The workstation will run on a virtual machine on a server, which is in a dedicated server room. 4.2.3. Adaptation of the SEMCARE software A Dutch version of the SEMCARE platform will be tested and evaluated locally in the Erasmus MC. 
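For illustration, the two-step de-identification described in Section 4.2.1 (identifier replacement followed by category tagging) could be sketched as follows. This is a minimal sketch under stated assumptions, not the method approved at the Erasmus MC: the category lists, the function names, and the use of random UUIDs as replacement identifiers are hypothetical, and the category-specific rules mentioned in the text are not modelled.

```python
import re
import uuid

# Hypothetical category lists; the real lists and rules are maintained locally.
CATEGORY_TERMS = {
    "Name": ["J. Doe", "Doe"],
    "City": ["Rotterdam"],
    "Hospital": ["Erasmus MC"],
}


def build_pattern(category_terms):
    """Compile one case-insensitive pattern with a named group per category."""
    groups = []
    for category, terms in category_terms.items():
        # Longest terms first so that "J. Doe" is preferred over "Doe".
        longest_first = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in longest_first)
        groups.append("(?P<" + category + ">" + alternatives + ")")
    # Only match whole terms, not fragments inside longer words.
    return re.compile(r"(?<!\w)(?:" + "|".join(groups) + r")(?!\w)", re.IGNORECASE)


def pseudonymise_id(hospital_id, mapping):
    """Step 1: return a stable, newly generated id for a hospital id.

    `mapping` plays the role of the locally stored id mapping.
    """
    if hospital_id not in mapping:
        mapping[hospital_id] = uuid.uuid4().hex
    return mapping[hospital_id]


def deidentify(record, hospital_id, mapping, pattern=None):
    """Apply both steps to one patient record given as plain text."""
    pattern = pattern or build_pattern(CATEGORY_TERMS)
    # Step 1: replace every occurrence of the hospital id.
    text = record.replace(hospital_id, pseudonymise_id(hospital_id, mapping))
    # Step 2: replace listed terms with their category name, e.g. #Name#.
    return pattern.sub(lambda m: "#" + m.lastgroup + "#", text)
```

Called on a record such as "ID 1234567: J. Doe was seen at Erasmus MC in Rotterdam.", the sketch replaces the identifier and tags the name, hospital and city. A list-based approach like this only finds terms that are on the lists, which is why the text also refers to category-specific rules.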
Methods and techniques that are developed in WP4 and WP5 will be adjusted and refined locally by the Department of Medical Informatics of the Erasmus MC. Patient data will not leave the hospital.

4.2.4. Installation of a foreign code in the hospitals
The SEMCARE platform will be deployed in the Cardiology Department of the Erasmus MC as a stand-alone system, fed with de-identified patient data. This will not require any assurances other than adherence to the standard restrictions (mainly with respect to the firewall and remote access) that apply to all computer systems in the Erasmus MC.

4.2.5. Clinical connectors
Data will be extracted from a clinical data warehouse at Erasmus MC, for which no specific clinical connectors need to be developed. At a later stage, a more direct connection to data in the hospital information system may be foreseen, but the Erasmus MC is currently migrating to a new hospital information system and it is not yet clear what clinical connectors and assurances will be required in the new situation.

4.3. MUG

4.3.1. Protocol for de-identification
A detailed description of the pseudo-anonymisation process can be found in Appendix MUG, “Location of the exported document corpus”.

4.3.2. Data management for research (can include the generation of dummy data)
The whole data management process, as well as the access rights to and location of the pseudo-anonymized data, are described in Appendix MUG. If needed, dummy data can be generated by manual substantive changes; this is administered by Stefan Schulz.

4.3.3. Adaptation of the SEMCARE software
Explained in Annex I MUG, “Processing documents using the Averbis Software”.

4.3.4. Installation of a foreign code in the hospitals
Explained in Annex I MUG, “Processing documents using the Averbis Software”.

4.3.5. Clinical connectors
The clinical connectors used are described in Appendix MUG, “Provision of documents”. The one-time access to the hospital information system (HIS), openMEDOCS (SAP IS-H * MED), is mapped within a Talend-Data-Integration job on the internal server for data integration (see Figure 1: System Sketch), which also serves as documentation of the complete ETL workflow. The SEMCARE platform will have no direct access to the HIS via connectors but will work on the pseudo-anonymized data set generated by the initial ETL workflow. A connector to the pseudo-anonymized data set has to be adapted for the SEMCARE platform.

4.4. SGUL

4.4.1. Protocol for De-identification

Data Types
We mention data types here because different types of data require different processing protocols and strategies. From the business view, data can be classified as clinical, administrative, and survey data. From the perspective of computer science, data can be grouped as numeric, categorical, textual, and binary data. In the SEMCARE project, the main challenge in data anonymisation is how to de-identify the free-form text that is widespread in EMRs. Unlike text in news reports or articles, the text in EMRs is characterised by many typos, shorthand, incomplete sentences, spelling errors, and poor grammar. Therefore, the existing NLP tools may not satisfy the requirement. Moreover, it is not enough for a system to catch 90% of the sensitive contents in the data; it has to catch all of them. This means that the standards for text anonymisation have to be much higher than for other kinds of anonymisation.

Basic Principles
• The risk of re-identification can be quantified.
• Data privacy and utility should be balanced.
• The risk of re-identification should be very small.
• Anonymisation involves a mix of technical, contractual, and other measures.

Steps

1. Selecting unique, quasi-identifiable, and sensitive attributes.
Unique attributes are attributes that can be used directly to uniquely identify individuals, such as hospital number, patient’s name and address. A quasi-identification attribute set is a group of attributes that can be used in combination to identify individuals. For example, the combination of a patient’s birthday and post code could be used by an adversary to re-identify the patient. Sensitive attributes are information that is allowed to be published but not allowed to be linked to a particular individual. Diagnosis and symptoms are examples.

2. Setting the risk threshold.
The risk threshold represents the maximum acceptable risk for sharing the data. When setting this threshold, the following two factors should be considered:
• Is the data going to be in the public domain?
• What is the extent of the invasion of privacy when this data is shared?

3. Examining plausible attacks.
Four plausible attacks can be conducted on the data set:
• T1: The data recipient deliberately attempts to re-identify the data.
• T2: The data recipient inadvertently re-identifies the data.
• T3: There is a data breach at the data recipient side.
• T4: An adversary launches a demonstration attack on the data.
In the SEMCARE project, since the data is used only for training and testing purposes at Averbis, i.e., the data is disclosed to Averbis, the first three attacks should be considered with respect to the following factors: 1) whether Averbis has the motivation, resources, and techniques to re-identify the data set; 2) which security and privacy controls Averbis needs to apply to the data.

4. Identifying and re-organising the sensitive contents.
Text mining techniques such as term extraction, text classification, and text clustering will be used to extract unique, quasi-identifiable, and sensitive attributes. Documents relating to the same patient will be merged and linked, based on the extractions.

5. Anonymising the data
Three types of techniques:
• Suppression: Replacing a value in a data set with a NULL or missing value.
• Generalisation: Reducing the precision of a field.
• Subsampling: Releasing only a simple random sample of the data set rather than the whole data set.
Principles: k-anonymity, l-diversity, k^m-anonymity, (h, k, p)-coherence and ρ-uncertainty. The k-anonymity and l-diversity principles are widely accepted; k-anonymity only constrains the quasi-identification attribute set, while l-diversity constrains the sensitive attributes.
Algorithms: Partition, Apriori, LRA, VPA, Greedy, and SuppressControl.
In the SEMCARE project, a partition-based anonymity algorithm will be implemented, consistent with both the k-anonymity and l-diversity principles.

6. Documenting the process

Measuring risk under attacks
We need to measure the re-identification risk in terms of the above-mentioned plausible attacks. The metric is defined by the probability of an attack and the conditional probability of re-identification given such an attack under a certain circumstance. Concrete equations are omitted.

Measuring re-identification risk
1. Probability metrics: maximum risk, average risk
When data is released publicly, the maximum-risk threshold lies between 0.09 and 0.05. If the data set is not intended for public release, the average-risk threshold ranges from 0.1 to 0.05. The actual value will be determined in the invasion of privacy assessment.
2. Information loss metrics: entropy, missingness

Text Anonymisation

1. General Approaches:
• Model-based methods: statistical or machine learning methods
• Rule-based methods
• Hybrid method: combination of machine learning methods with rules
In the SEMCARE project, our method depends on statistical and text mining approaches.

2. Ways to anonymise text
1) To determine unique, quasi-identifiable, and sensitive attributes.
Usually, first and middle name, last name, birthday, street, email, phone numbers, and IDs are unique attributes; postal code, visit date, health care facility, city and state, and country are quasi-identification attributes.
2) To extract text for these three types of attributes.
3) To apply anonymisation methods:
• Redaction: for example, “John” is replaced with “****”.
• Tagging with information type: for example, “John” is replaced with “<FIRST_NAME:1>”.
• Randomisation within information type: for example, “John” is replaced with a randomly selected name from a name database.
• Generalisation into a less-specific type: a generalised value is used to replace the original value in the text, which works the same way as with structured data.

3. Evaluation
The challenge with text anonymisation is detection, i.e., finding elements of personal information in the text. Text anonymisation methods are evaluated based on how well they detect the various elements of personal information.
Evaluation method:
1) Randomly collect a set of documents as a corpus (any record with free text is called a document).
2) Split the corpus into a training set and a test set.
3) Use the training set to build the personal information detection system.
4) Evaluate the detection performance using the “precision” (P), “recall” (R), and “F-score” (F) metrics.
5) Verify the result. For example, separate the corpus into several equal-sized parts, say ten; in each experiment, select nine parts as the training set and the remaining one as the test set; then average the Ps, Rs, and Fs. Alternatively, use a verification set, set aside when originally splitting the corpus, to calculate P, R, and F.

4. Approaches to detecting personal information
• Open-source text mining tools
• Term extraction approach
• Rule-based methods
• Averbis text mining tools

5. Risk assessment
The risk level of T1, T2, and T3 attacks is calculated using the recall of the anonymisation.

6. Case study: Informatics for Integrating Biology and the Bedside (i2b2)

4.4.2. Data Management for Research
1. Data management planning
• Complete a checklist for data needs
• Complete and maintain a data management plan document
• Choose a data management planning tool, such as DMP Online
2. Documenting data
• Various documents: reports, notebooks, cookbooks, README files
• Metadata: descriptive, administrative, and structural metadata
3. Data storage and backup
• Where to store the data: networked drives, personal computers or laptops, external storage devices
• Backup scheme: remote or online backup services, backup strategies
4. Data security, protection and confidentiality
5. Data sharing
• Password-protected archive with secured FTP service, or
• Secured RESTful/SOAP-based web service
In the SEMCARE project, 100 documents will be extracted initially for the purpose of developing the anonymisation system. The extracted data will be stored on a local workstation and must not be copied to unauthorized computers or portable disks. The production data will never be stored on a local workstation; the anonymisation system will be deployed to the Trust IT department. We will develop a web service-based user interface for data management tasks such as data generation, backup, and sharing.

4.4.3. Adaptation of the SEMCARE Software
• Ability to integrate new text mining or language processing tools with the SEMCARE project
• Ability to remove existing text mining or language processing tools
• Ability to replace existing text mining or language processing tools with third-party counterparts
• The same abilities as those mentioned above for biomedical terms
• Ability to manage clinical connectors

4.4.4. Installation of a Foreign Code in the Hospital
At St. George’s Hospital, data loaded into the SEMCARE system comes from the same department in which the SEMCARE platform is installed, the Informatics Department. The data will be held in the secure demilitarised zone (an area of the network that is isolated from the internal network), where some essential services such as web services are deployed, while all other services on the local area network run behind a firewall connected to a public network. Consistent with Trust policy and the Data Protection Act 1998 and its annotations, there will be no transfer of identifiable data outside of the Trust.

4.4.5. Clinical Connectors
• Universal interface to extract text from multiple types of sources: TXT, XML, RDF, RTF, DOC, XSL, PDF, JSON, HL7 CDA, etc. Each type has a corresponding software module, which can be integrated as a plugin.
• Extraction tools: Several open-source tools can perform this task, but there is no perfect solution. We may need to develop our own plugins based on open-source ones.

4.4.6. Risks during Data Processing and Possible Solutions
1. Incomplete business requirements may lead to a technical solution that does not deliver the anticipated added value and user interest.
Solution: refine the business requirements, re-collect training data and then re-construct the system. From an engineering perspective, this calls for a quick training and construction procedure, so we may need to consider agile development techniques when developing the system.
2. Professional or technical resistance from medical doctors to using the tool.
Solution: In most cases, this is caused by poor user interface design. To avoid this, medical staff must be involved in GUI design.

5. OVERVIEW/CONCLUSION
SEMCARE’s initial inventory shows that the project has to deal with various ethical issues related to the extraction and processing of healthcare data for secondary purposes. This deliverable describes the current status. Exchange of best practices on approaches to de-identification will be encouraged between the clinical partners as much as possible, always considering national regulation and the internal SOPs of each institution. SEMCARE’s ethics advisor (Prof Dipak Kalra) will perform an annual review of all ethical aspects of SEMCARE, which will be submitted to the European Commission as part of the Periodic Reports. Additionally, Prof Kalra will inform partners of any update to the ethical legislation that might be relevant for the Project.

ANNEX I. MUG's processing of documents using the Averbis Software

Institut für Medizinische Informatik, Statistik und Dokumentation
Auenbruggerplatz 2, 8036 Graz
Univ.-Prof. DI Dr. Andrea Berghold, Head of Institute
[email protected]
Tel +43 316 385 13201
Fax +43 316 385 13590
Contact person: Katharina Fink
Tel +43 316 385 83201
[email protected]

Averbis GmbH
Tennenbacher Straße 11
D-79106 Freiburg
Phone: +49 (0) 761 - 203 97690
Fax: +49 (0) 761 - 203 97694
http://www.averbis.com

Graz, 10.03.2014

SEMCARE

Context
Use of pseudonymized, pre-selected clinical documents created by a clinical partner for the purpose of a use case oriented semantic search.

Provision of documents

Data source: Hospital information system (HIS): openMEDOCS (SAP IS-H * MED)

Selection criteria for the corpus of documents: Use case specified by SEMCARE or MUG (Medical University of Graz)

Interface used for the Extract Transform Load (ETL) workflow: Export through the export wizard of this HIS. Structured data fields are transferred pre-pseudonymized.
Location of the exported document corpus: On the internal server for data extraction (ISDA), the extracted data is first stored in an MS Access database. For the pseudonymization of free texts, patient names and doctor names are additionally required. These data are available for the pseudonymization process outlined below.
• First, tables corresponding to the document structure are inverted in order to achieve the data structure required for text analysis (e.g. XML). Attributes such as the document number, internal patient ID, and the content reference are stored within this structure.
• In order to eliminate the patient's name from the free texts, the patient name is searched for (in a case-insensitive manner), resolving vowel variants and name fragments, and replaced by the pseudonym names.
• In the third step, all free texts are scanned for name fragments of all known doctor names (business partner - phys.Person), which are replaced by the pseudonym [-Arztname-]. This is followed by a manual inspection.
This entire process is executed on the ISDA in the corresponding project folder. All text changes are archived in journal files in order to improve the process and to be able to check substantive questions against the original text at any time.

Medizinische Universität Graz, Auenbruggerplatz 2, 8036 Graz, www.medunigraz.at. Legal form: legal entity under public law pursuant to the Universities Act 2002 (Universitätsgesetz 2002). Information: university bulletin (Mitteilungsblatt der Universität) and www.medunigraz.at. DVR no. 210 9494. UID: ATU 575 111 79. Bank details: UniCredit Bank Austria AG, IBAN: AT931200050094840004, BIC: BKAUATWW; Raiffeisen Landesbank Steiermark, IBAN: AT443800000000049510, BIC: RZSTAT2G.

Location of the pseudonymized corpus of documents: After manual control of the extracted data, the file is copied to a project-specific file share on the ISDA which can be accessed via the SEMCARE server.
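The name-replacement steps of the free-text pseudonymization described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the process actually run on the ISDA: the source does not say what "resolution of vowels" means, so the sketch assumes it refers to matching umlaut spellings and their transliterations (e.g. "ü" and "ue"); the function names, the mapping from name fragments to pseudonym names, and the in-memory journal are hypothetical. The manual inspection that follows the automated step is not replaced by this code.

```python
import re

# Assumption: vowel resolution means treating umlauts and their
# transliterations as equivalent spellings.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
TRANSLITERATIONS = {v: k for k, v in UMLAUTS.items()}

DOCTOR_PSEUDONYM = "[-Arztname-]"


def fragment_pattern(fragment):
    """Build a regex source that matches both spellings of a name fragment."""
    low = fragment.lower()
    parts, i = [], 0
    while i < len(low):
        two = low[i:i + 2]
        if low[i] in UMLAUTS:
            parts.append("(?:" + low[i] + "|" + UMLAUTS[low[i]] + ")")
            i += 1
        elif two in TRANSLITERATIONS:
            parts.append("(?:" + two + "|" + TRANSLITERATIONS[two] + ")")
            i += 2
        else:
            parts.append(re.escape(low[i]))
            i += 1
    return "".join(parts)


def _replace(text, fragment, replacement, journal):
    """Replace whole-word, case-insensitive matches and journal each change."""
    pattern = re.compile(r"(?<!\w)" + fragment_pattern(fragment) + r"(?!\w)",
                         re.IGNORECASE)

    def substitute(match):
        journal.append((match.group(0), replacement))
        return replacement

    return pattern.sub(substitute, text)


def pseudonymise_text(text, patient_map, doctor_fragments):
    """Return the pseudonymized text and a journal of all replacements.

    patient_map maps patient name fragments to hypothetical pseudonym names.
    """
    journal = []
    # Patient name fragments first, longest first.
    for fragment in sorted(patient_map, key=len, reverse=True):
        text = _replace(text, fragment, patient_map[fragment], journal)
    # Then all known doctor name fragments.
    for fragment in sorted(doctor_fragments, key=len, reverse=True):
        text = _replace(text, fragment, DOCTOR_PSEUDONYM, journal)
    return text, journal
```

For example, with the patient fragments "Müller" and "Hans" mapped to "Muster" and "Max" and the doctor fragment "Gruber", the sentence "Patient MÜLLER, Hans was examined by Dr. Gruber." becomes "Patient Muster, Max was examined by Dr. [-Arztname-].", and the journal records each original string with its replacement so that changes can be traced back to the original text.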
Access rights to the pseudonymized corpus of documents: On the ISDA, qualified and authorized staff and system administrators of the IMI (Institute for Medical Informatics, Statistics and Documentation) and the KAGES (Steiermärkische Krankenanstaltengesellschaft m.b.H.) have unrestricted access.

Internal documentation of the "Provision of documents" subsection: The whole extraction process is mapped within a Talend-Data-Integration job on the ISDA, which also serves as documentation of the complete ETL workflow.

Document takeover

Location of the acquired pseudonymized corpus of documents: SEMCARE server, IMI server room, 5th floor, locked.

Access rights to the SEMCARE server/data folder: The computer is integrated into the hospital network. Access is therefore strictly limited to:
• System administrators of the KAGES
• System administrators of the IMI
• IMI SEMCARE project members
Authorized access to the data folder is allowed only from within the hospital network.

Internal documentation of the item entitled "Document takeover": The transfer of the documents will be confirmed and documented by Markus Kreuzthaler and Stefan Schulz during the course of the project.

Place of data processing: The acquired pseudonymized data set may only be processed on the SEMCARE server. No data is transferred to external interfaces belonging to the project partner Averbis (written confirmation by Averbis).

Processing of documents using the Averbis software

Location of software development: Software development in combination with the processing of the pseudonymized data set is only allowed on the SEMCARE server. However, software development using synthetically created documents, word lists, etc., and other publicly available corpora can be performed on other PCs and laptops.

Software development IDE: For software development, the Eclipse IDE with a Maven build is used. For this it is necessary to access the external Averbis Maven repository via SSH during the build process. Averbis supplies the preconfigured Eclipse project. The provided NLP blocks (sentence detection, date detection, POS, negation detection, etc.) are parameterized according to the use case with regard to the pseudonymized data set.

Update of the Averbis software/SEMCARE Eclipse project: The Averbis software on PCs and laptops that are not part of the hospital network (e.g., those using the MUG-User network) can be updated with the help of remote maintenance software under the four-eyes principle. An IMI project member is present throughout, and this person both opens and closes the session. Updates are therefore performed manually by Averbis together with an IMI project staff member. The use of TeamViewer or comparable remote maintenance products on the SEMCARE server is not allowed (the SEMCARE server is integrated in the hospital network). A WebSVN server, hosted and managed by Averbis (https://aspirin.averbis.uni-freiburg.de/svn/graz/semcare), is used for updates of the SEMCARE Eclipse project. Software and data can therefore be strictly separated. Later in the project timeline, fast update processes in accordance with the use case could be supported.

The following data could be provided to Averbis:
• Trained models based on the pseudonymized data set.
• Synthetically created documents (made by manual substantive changes and administered by Stefan Schulz).
• Word and term lists based on the pseudonymized data set.
• Word and term statistics on the basis of the pseudonymized data set.
The release and verification of these data must be authorized and documented by Stefan Schulz.

Figure 1: System Sketch.
Legend:
HIS: Hospital Information System
ETL: Extract Transform Load
ISDA: Internal Server for Data Extraction
Data PRE-PSA: Data Pre-Pseudonymized (structured field contents)
Data PSA: Data Pseudonymized (free text contents)
SVN-Repository: Subversion Repository (hosting the Eclipse SEMCARE project)