Last updated: May 15, 2018

Making yourself data capable

Genomic research involves transferring, analysing and storing large amounts of complex data securely and confidentially, which can be expensive and time consuming. A data management plan (DMP) can improve a project’s efficiency and prevent costly problems, such as losing data or violating participants’ privacy. A DMP should be developed before starting a genomics project.

A data management plan should consider:
  • What and how much data will be created?
  • What policies will apply to the data?
  • What are the participants’ rights around data security, privacy and confidentiality?
  • Who will own and have access to the data?
  • What data must be stored, and for how long?
  • What are the computing and staff requirements and costs?  
  • What data management practices will be used?
  • Who will be responsible for each of these activities?

Data considerations

Metadata, genomics data and results

Genomics research generates or uses hundreds of files in various formats and sizes. This information is often linked to participants’ data (non- or re-identifiable), information on how the data was sequenced and analysed, and the outputs from other analysis pipelines.

Participants’ data (participant information, medical history and/or clinicopathological data), as well as information about the study (study investigators, sequencing information, methodologies, tools and versions), are often referred to as metadata. Metadata is key to managing data and to validating, interoperating, reproducing and revisiting results, and is often required for publication. Where possible, metadata should use a consistent or controlled vocabulary. Participants’ personal information should be stored separately from genomic data to reduce the risk of re-identification.
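
As an illustration of a controlled vocabulary, the sketch below checks a metadata record against a small set of permitted terms. The field names and permitted values are hypothetical and chosen only for illustration; a real study would take them from an agreed standard or ontology.

```python
# Hypothetical controlled vocabulary: field name -> permitted terms.
# These fields and terms are illustrative assumptions, not a standard.
CONTROLLED_VOCABULARY = {
    "sex": {"female", "male", "unknown"},
    "sample_type": {"blood", "saliva", "tumour"},
    "sequencing_type": {"wgs", "wes", "panel"},
}


def validate_metadata(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARY.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif value.strip().lower() not in allowed:
            # Comparison ignores case and surrounding whitespace.
            problems.append(f"{field}: '{value}' is not a permitted term")
    return problems


if __name__ == "__main__":
    record = {"sex": "F", "sample_type": "Blood", "sequencing_type": "wgs"}
    for problem in validate_metadata(record):
        print(problem)
```

Here "F" is flagged because only the full term "female" is permitted, while "Blood" passes because the comparison ignores case.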

Data legislation, guidelines and best practice

Australian legislation, guidelines and international best practices around clinical genomics data are in constant flux as the field advances quickly. It is recommended that you consult your local genomics facility, data scientist and/or bioinformatician before starting a project. Government legislation, funders and publishers impose rules, guidelines and requirements on the data, some of which are listed below.

Legislation and policies

  • Privacy Act 1988 (Commonwealth)
  • Privacy and Personal Information Protection Act 1998 (NSW)
  • Health Records and Information Privacy Act 2002 (NSW)
  • Australian Human Rights Commission Act 1986 (Commonwealth)
  • Government Information (Public Access) Act 2009 (NSW)
  • Financial Services Council Genetic Testing Policy and Family History Policy (2002, 2005, 2016)

Funders’ statements, guidelines and requirements

  • National Health and Medical Research Council Statement on Data Sharing (2015)
  • National Statement on Ethical Conduct in Human Research (2015)
  • Australian Code for the Responsible Conduct of Research (2007)
  • Australian Research Council (ARC) Funding Rules (2016)

International funders such as the Wellcome Trust (UK), Medical Research Council (UK), National Institutes of Health (NIH) (USA), Bill and Melinda Gates Foundation (USA) and National Science Foundation (USA) also have specific requirements and guidelines around data sharing.

The legislation, guidelines and best practices discussed in this Resource are subject to change and the Centre for Genetics Education will endeavour to update them where possible. Consulting a specialist or expert in the field is always recommended. Please contact the Centre for Genetics Education with any comments.

Data security, privacy and confidentiality

Genomic data cannot be truly “anonymised”, as it is conceivably possible to re-identify a participant using genomic data, but ensuring the privacy and confidentiality of participants remains paramount. It is also critical to inform participants of the risks to the security of their data and to gain their consent.

Participants’ identity and personal information should be stored and managed separately from their genomic data. Participants may be allocated an identifier that is used to track or identify them at appropriate times. The National Statement on Ethical Conduct in Human Research does not use the term “de-identify” as it is ambiguous [1]. The term “identifiable” is used instead.

There are three mutually exclusive forms:

  • Individually identifiable data, where the identity of a specific individual can reasonably be ascertained, for example by the individual’s name, image or date of birth.
  • Re-identifiable data, from which identifiers have been removed and replaced by a code, but where it remains possible to re-identify a specific individual, for example by using the code or linking different data sets.
  • Non-identifiable data, which has never been labelled with individual identifiers or from which identifiers have been permanently removed, and by means of which no specific individual can be identified. A subset of “non-identifiable data” is data that is linked to other datasets or metadata, though the participant’s identity remains unknown.
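
The separation of identity from study data described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name, the "P-" prefix and the choice of identifying fields are all assumptions. Because a linkage key is kept, the coded records correspond to re-identifiable data in the terms used above.

```python
import secrets


def allocate_study_ids(participants):
    """Split participant records into a linkage key and coded records.

    `participants` is a list of dicts. The fields treated as identifying
    ("name", "date_of_birth") are an assumption made for this sketch.
    """
    identifying_fields = {"name", "date_of_birth"}
    linkage_key = {}    # study ID -> identifying details; store separately
    coded_records = []  # coded records kept alongside the genomic data
    for person in participants:
        study_id = "P-" + secrets.token_hex(4)  # random, carries no meaning
        while study_id in linkage_key:  # regenerate on the rare collision
            study_id = "P-" + secrets.token_hex(4)
        linkage_key[study_id] = {
            k: v for k, v in person.items() if k in identifying_fields
        }
        coded = {k: v for k, v in person.items() if k not in identifying_fields}
        coded["study_id"] = study_id
        coded_records.append(coded)
    return linkage_key, coded_records


if __name__ == "__main__":
    people = [{"name": "Jane Citizen", "date_of_birth": "1980-01-01",
               "phenotype": "cardiomyopathy"}]
    key, coded = allocate_study_ids(people)
    print(coded)  # contains the study ID and phenotype, but no name
```

In practice the linkage key would be held in a separate, access-restricted location, and permanently destroying it is one step towards making the coded records non-identifiable, subject to the caveat that genomic data itself can conceivably identify a person.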

Strict standards must be implemented to ensure the security of data that is stored and transported, whether physically (local servers, hard drives and USBs) or electronically. All data must be appropriately encrypted and password protected. Local servers, hard drives and USBs should be password protected and stored and transported securely. Cloud services (online storage and transfers), such as Amazon Web Services, Google Docs, Google Cloud, Nectar Cloud and Dropbox, must also have adequate protections in place. If you are required to store and analyse human genomic data in Australia, you must account for the physical location of the computers used for Cloud computing.

To manage data security and privacy, researchers are advised to consult with a data scientist and/or bioinformatician before embarking on a genomic study.

Data access

There is a range of options for managing and restricting access to participants’ data, from open, through mediated or controlled, to closed access. Access control measures should be study specific, be outlined in the EDP and have the participants’ consent. Controls and restrictions that can be put in place include but are not limited to:

  • Specific authorisation required from the data owner to access data
  • Access only given to approved researchers
  • Removing data identifiers unless consent has been given for information to be shared
  • Embargoes imposed on sensitive data until the sensitivity is no longer pertinent
  • Access through a secure online environment that lets researchers analyse the data remotely but not download it
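
To show how controls like these might combine, the sketch below models the three access levels and a simple decision rule. The names and the rule itself are assumptions made for illustration; each study would define its own policy in line with participants’ consent.

```python
from enum import Enum


class AccessLevel(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"  # also called mediated access
    CLOSED = "closed"


def may_access(level, approved_researcher=False, owner_authorised=False,
               under_embargo=False):
    """Decide whether a request is allowed under a hypothetical policy."""
    if under_embargo:
        return False  # embargoed data is unavailable at every level
    if level is AccessLevel.OPEN:
        return True
    if level is AccessLevel.CONTROLLED:
        # Assumed rule: both researcher approval and owner authorisation.
        return approved_researcher and owner_authorised
    return False  # CLOSED: no external access


if __name__ == "__main__":
    print(may_access(AccessLevel.CONTROLLED, approved_researcher=True))
    print(may_access(AccessLevel.CONTROLLED, approved_researcher=True,
                     owner_authorised=True))
```

The first call prints False because the data owner has not authorised access; the second prints True.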

Publisher and public repository requirements

Genomic research often utilises publicly available data repositories. Participants’ genomes and variants are compared against data in these repositories to reveal novel and functional variants. When submitting data to a public repository the participants’ personal information should not be included. It is also important to advise participants that consent cannot be withdrawn after the data has been placed in a public repository.

Most peer-reviewed publications, funders and local policies require that data is shared or made available, as this enables further research and replication of findings. Publishers and funders may also require that “raw” or unprocessed data is stored for a period of time after publication or data generation. These requirements should be considered when developing a DMP and EDP.

Publishers have specific requirements including:

  • Requiring that all data underlying a publication be made available with no or minimal restrictions
  • Requiring a statement on the authors’ willingness to share the data

For more information, see Publications and submitting to public repositories.

Computation and staff costs

It is important to budget for computation and staffing costs relating to the analysis, management and storage of the data, often well beyond the research study period. These costs include but are not limited to:

  • Data scientists’ and bioinformaticians’ time
  • Time and financial costs associated with data transfers (especially for larger datasets)
  • High-performance computing (HPC) time for analysis
  • Data storage costs beyond the research study period
  • Commercial software and databases
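
A rough storage budget can be worked out with simple arithmetic, as in the sketch below. Every figure in the example (file size per sample, number of copies, price per terabyte per year, retention period) is a placeholder assumption; substitute quotes from your own provider and the retention period your funder or publisher requires.

```python
def estimate_storage_cost(samples, gb_per_sample, copies,
                          cost_per_tb_year, years):
    """Return (total terabytes, total cost) for long-term storage."""
    total_tb = samples * gb_per_sample * copies / 1000  # 1 TB = 1000 GB
    total_cost = total_tb * cost_per_tb_year * years
    return total_tb, total_cost


if __name__ == "__main__":
    # Placeholder inputs, not real prices or typical file sizes.
    total_tb, total_cost = estimate_storage_cost(
        samples=200,          # participants sequenced
        gb_per_sample=100,    # raw plus processed files per participant
        copies=2,             # primary copy plus one backup
        cost_per_tb_year=50,  # hypothetical price per TB per year
        years=7,              # hypothetical retention period
    )
    print(f"{total_tb:.1f} TB, ${total_cost:,.0f}")
```

With these placeholder inputs the script prints "40.0 TB, $14,000". Transfer, compute and staff time are not included and would need to be budgeted separately.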

Each study will have different requirements and it is, therefore, important to consult with your local genomics provider, data scientist and/or bioinformatician.

Have you thought about?

  • How will you ensure participants’ privacy and confidentiality are maintained?
  • What resources will you need to complete your analyses?
  • Will you be required to store the data beyond the study period? If so, how will it be maintained?
  • Have you considered consulting with or outsourcing to a provider of data storage and management or centralised repositories, such as the National Computational Infrastructure or CSIRO?
  • Would generating multiple data sets (genomic, proteomic, functional studies) to compare and combine increase the power of your study?
  • Will this study generate intellectual property? If so, how might the results be translated or commercialised?

[1] The National Statement on Ethical Conduct in Human Research https://www.nhmrc.gov.au/book/chapter-3-5-human-genetics