BMC Bioinformatics 2015, Volume 16 Suppl 2
http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S2
MEETING ABSTRACTS
Open Access
Highlights from the Tenth International Society
for Computational Biology (ISCB) Student Council
Symposium 2014
Boston, MA, USA. 11 July 2014
Published: 28 January 2015
These abstracts are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S2
INTRODUCTION
A1
Highlights from the tenth ISCB Student Council Symposium 2014
Farzana Rahman1, Katie Wilkins2, Annika Jacobsen3, Alexander Junge4,
Esmeralda Vicedo5, Dan DeBlasio6, Anupama Jigisha7, Tomás Di Domenico8*
1Genomics and Computational Biology Research Group, University of South
Wales, UK; 2Computational Biology Department, Cornell University, USA;
3Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam,
Netherlands; 4Center for non-coding RNA in Technology and Health,
University of Copenhagen, Denmark; 5Department of Bioinformatics and
Computational Biology, Fakultät für Informatik, Germany; 6Department of
Computer Science, University of Arizona, USA; 7University of Geneva,
Switzerland; 8Wellcome Trust/Cancer Research UK Gurdon Institute,
University of Cambridge, UK
E-mail: [email protected]
BMC Bioinformatics 2015, 16(Suppl 2):A1
About the Student Council and the symposium: The Student Council
(SC), part of the International Society for Computational Biology (ISCB),
aims at nurturing and assisting the next generation of computational
biologists. Our membership and leadership are composed of volunteer
students and post-docs in computational biology and related fields. The
main goal of our organisation is to offer networking and soft skill
development opportunities to our members.
The Student Council Symposium (SCS) takes place every year, directly
preceding the ISMB/ECCB conferences. SCS 2014 marked the tenth
consecutive edition of the event, after the success of previous years’
editions [1-7].
Meeting format: The Student Council Symposium is a one-day event.
Following the success of previous years, SCS 2014 kicked off with a scientific
speed dating session. During this session our delegates have to find a
partner to introduce themselves to, and they discuss their scientific
backgrounds and interests. After five minutes they must switch partners, and
this goes on until the allotted time runs out. The traditional scientific
component of the meeting consisted of two keynote presentations by senior
researchers, twelve oral presentations by delegates, and a poster session.
To celebrate the 10th edition of the Student Council Symposium, Dr. Manuel
Corpas, Dr. Jeroen DeRidder, Dr. Nils Gehlenborg and Dr. Geoff Macintyre,
former members of the Student Council, delivered a welcome address and
an overview of the Student Council’s history.
Dr. David Bartel (HHMI/MIT/Whitehead Institute, US) and Dr. Ashlee Earl (The
Broad Institute of MIT & Harvard, US) generously agreed to deliver the
keynote addresses at SCS 2014. In addition, Abhishek Pratap, Senior
Research Scientist in Bioinformatics at our institutional partner Sage
Bionetworks, gave a short presentation about Enabling Collaborative and
Reproducible Research through the Synapse software.
SCS 2014 received 76 submissions from students, which were peer-reviewed
by 23 independent reviewers. More than 50 abstracts were accepted for
poster presentations, and 12 abstracts were invited to deliver an oral
presentation. Extended abstracts of oral presentations are included in this
report. All abstracts are available online in the SCS 2014 booklet
http://scs2014.iscbsc.org/booklet-2014.
Welcome address: 10 years of Student Council: The commemorative
welcome address opened the day, and Drs. Corpas, DeRidder, Gehlenborg
and Macintyre provided our delegates with their points of view on the
evolution of the Student Council during its first 10 years. Having now
become young group leaders and senior postdocs, they offered an
interesting perspective on the impact the Student Council has had on the
development of their careers.
Keynotes: Dr. David Bartel’s keynote followed the welcome address. In his
talk, Dr. Bartel gave an overview of the current understanding of microRNAs,
the progress in predicting their targets, and how measurements of their
regulatory effects have revealed an unexpected developmental switch in the
nature of mRNA translational control.
Dr. Ashlee Earl gave us an overview of her work on tackling longstanding
and emerging challenges in infectious disease by taking advantage of new
sequencing technologies. In particular, she described her group’s work on
tackling the emergence of multi-drug resistant strains of pathogens through
the development of approaches and tools to examine the drug resistant
Mycobacterium tuberculosis.
Student presentations: The first oral presentation was delivered by
Yassine Souilmi, who introduced the COSMOS software for cloud-enabled
next-generation sequencing analysis [8]. COSMOS is a scalable workflow
management framework that aims to reduce the cost of whole
genome data analysis, placing it within a reimbursable cost point
and in clinical time.
When performing multiple sequence alignments, most users tend to rely on
the default parameters of the algorithm. A different parameter setting may
however have great impact on the quality of the output alignment.
Parameter advising is the task of selecting good parameters for a given set
of input sequences to be aligned. Dan DeBlasio presented his work on
constructing improved advisors for multiple sequence alignment [9].
Lin-Yang Cheng described his efforts to enhance quantitative protein-level
conclusions in experiments with data-independent spectral acquisition by
the statistical elimination of spectral features with large between-run
variation. His results show that his approach achieves an accuracy that
exceeds the standard approach of using three spectral features with the
highest intensity between runs [10].
Microsatellites are short, tandem-repeated DNA sequences which make up
approximately 3% of the human genome. The expansion of these
microsatellite repeats has been linked to many neurological and
developmental disorders. Harriet Dashnow presented her work on
developing a microsatellite genotyping algorithm that addresses several
issues regarding the length determination of microsatellites from
next-generation sequencing data, and provides a highly accurate and more
detailed analysis of microsatellite loci [11].
© 2015 various authors, licensee BioMed Central Ltd. All articles published in this supplement are distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Yi Zhong reported the development of novel computational tools to gain
biological insight from Ribo-seq and RNA-seq data in a fast and accurate
way. By using transcriptome-scale ribosome footprinting data from
leukemia cell lines, he identified drug-sensitive genes showing both
decrease of translational efficiency and accumulation of ribosome
occupancy at 5’UTRs. These genes constitute potential therapeutic targets
for cancer [12].
The usage of tetranucleotides for genome analysis is a promising
approach for the evaluation of host-parasite coevolution and gene
exchange within the mycobacteriophage population. Benjamin Siranosian
talked about computationally inexpensive methods, based on the usage
of tetranucleotides, that are also independent of gene annotation, and
their usefulness for phage clustering and the analysis of evolutionary
relationships [13].
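To illustrate what an alignment-free, annotation-independent comparison based on tetranucleotide usage can look like, the following minimal Python sketch computes the relative frequency of all 256 tetranucleotides in a genome and compares two frequency profiles. The function names and the choice of Euclidean distance are assumptions made for illustration; the summary above does not specify the distance measure or clustering method used in [13].

```python
from collections import Counter
from itertools import product
import math

# All 256 tetranucleotides in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each tetranucleotide (sliding window)."""
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= set("ACGT")  # skip windows with N or other symbols
    )
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def euclidean_distance(u, v):
    """Distance between two frequency profiles (an illustrative choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


profile_a = tetranucleotide_frequencies("ACGTACGT")
profile_b = tetranucleotide_frequencies("ACGTACGT")
distance = euclidean_distance(profile_a, profile_b)
```

A pairwise distance matrix built this way could then be passed to any standard clustering method, without requiring gene annotation or a sequence alignment.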
Heewook Lee presented his work on the detection of structural variants
involving insertion sequence elements in mutation accumulation lines of
Escherichia coli. By extending an A-Bruijn graph based structural variant
detection framework he was able to tackle the challenge of obtaining direct
estimates on insertion, deletion and recombination event rates.
Using a Random Forest machine learning approach, Russell Sutherland
reported results on the discrimination between cancer differentiation
subtypes. By applying the algorithm to exome sequencing data from
tumour and normal tissue samples from 1798 patients, they were able to
discriminate between 5 cancer types with high accuracy.
Alex Salazar presented Emu, an algorithm that resolves alternate
representations of larger sequence variants (LSVs) by comparing variants
across genomes. Emu improves the analysis of LSVs in bacterial genomes by
reducing cross-sample noise resulting from per-sample variant calls [14].
Vikas Pejaver presented MutPred2, a method for the prediction of
pathogenicity of missense variants and their molecular effects. The software
can be used to guide downstream experiments for elucidating the
molecular basis of disease, and to assist in the development of therapeutic
strategies.
Sarah Keasey presented her work on systematically identifying and
analysing thousands of direct binary protein interactions within Y. pestis.
The resulting benchmark dataset can be highly useful for the analysis of
protein interaction networks functioning within an important human
pathogen [15].
On behalf of Amin Ardeshirdavani, Prof. Yves Moreau presented
NGS-Logistics, a platform to analyse NGS data in a distributed way, while
guaranteeing privacy and security [16]. The framework aims to reduce the
effort and time needed to evaluate the significance of mutations based on
full genome and full exome sequencing.
Award winners: Thanks to the generous contribution of the Swiss
Institute of Bioinformatics, two travel fellowships were awarded to Sarah
Keasey and Vikas Pejaver to attend SCS 2014.
Based on the votes of the SCS delegates, a judging committee awarded
one best oral presentation award and two best poster presentation awards.
The best oral presentation award went to Harriet Dashnow for her work
entitled “Genotyping Microsatellites in Next-Generation Sequencing Data”.
The first place in the best poster presentation awards went to Alex Salazar
for his work, “Investigating large sequence variants in drug resistant
Mycobacterium tuberculosis”. The second place in the poster presentation
awards went to Sarah Keasey for her work, “The Road To Linking Genomics
And Proteomics Of Pathogenic Bacteria: From Binary Protein Complexes To
Interaction Pathways”.
In addition to the aforementioned awards, Russell Sutherland and Dilmi
Perera received F1000 awards for their poster presentations at SCS 2014.
Conclusions: This year’s number of submissions and participants saw a
slight decline in comparison with the previous edition. Visa issues and the
general lack of funding seem to be the main reasons according to our
surveys. All these issues notwithstanding, the quality of the keynote
presentations, the 12 oral presentations and the poster session once again
made the Student Council Symposium a great success.
Preparations are already ongoing for the 11th edition of SCS to be held in
Dublin, Ireland, preceding ISMB/ECCB 2015. For further information
regarding the Student Council, its events, internships and community,
please visit http://www.iscbsc.org.
Acknowledgements: Because of space constraints we are unable to
mention in this publication all the volunteers whose contributions make
the Student Council Symposium a reality every year. Our recognition and
appreciation go out to all of them, since without their support the
organisation of such an event would simply not be possible.
We would like to thank ISCB Executive Director Diane Kovats, ISCB
Conferences Director Steven Leard, ISMB Conference Administrator Pat
Rodenburg, ISMB programmer Jeremy Henning, and ISCB Administrative
Support Suzi Smith for their logistical support and invaluable advice.
Furthermore, we thank the ISCB Board of Directors for their continued
support of the ISCB Student Council in general and the Student Council
Symposium in particular.
We are greatly indebted to ISMB 2014 conference chairs Prof. Bonnie
Berger and Dr. Janet Kelso for giving us the opportunity to organise the
Student Council Symposium 2014 in Boston.
The Student Council would also like to thank our keynote speakers
Dr. David Bartel and Dr. Ashlee Earl who generously donated their
valuable time by delivering keynote addresses.
The Symposium would not be possible without the financial support of
our generous sponsors. We would like to thank BioMed Central, Oxford
University Press, Sage Bionetworks, IMGT, the Swiss Institute of
Bioinformatics, and F1000 for their contributions.
We are very grateful to all the volunteer reviewers for their work on
ensuring the quality of the scientific program, and to the program and
travel fellowship committees for coordinating the reviewing effort.
Finally, we thank all Student Council members who have spent countless
hours organising all aspects of SCS 2014 to ensure its success.
References
1. Di Domenico T, Prudence C, Vicedo E, Guney E, Jigisha A, Shanmugam A:
Highlights from the ISCB Student Council Symposium 2013. BMC
Bioinformatics 2014, 15(Suppl 3):A1.
2. Goncearenco A, Grynberg P, Botvinnik OB, Macintyre G, Abeel T: Highlights
from the Eighth International Society for Computational Biology (ISCB)
Student Council Symposium 2012. BMC Bioinformatics 2012, 13(Suppl 18):A1.
3. Grynberg P, Abeel T, Lopes P, Macintyre G, Rubino LP: Highlights from the
Student Council Symposium 2011 at the International Conference on
Intelligent Systems for Molecular Biology and European Conference on
Computational Biology. BMC Bioinformatics 2011, 12(11).
4. Klijn C, Michaut M, Abeel T: Highlights from the 6th International Society
for Computational Biology Student Council Symposium at the 18th
Annual International Conference on Intelligent Systems for Molecular
Biology. BMC Bioinformatics 2010, 11(Suppl 10):4.
5. Abeel T, de Ridder J, Peixoto L: Highlights from the 5(th) International
Society for Computational Biology Student Council Symposium at the
17(th) Annual International Conference on Intelligent Systems for
Molecular Biology and the 8(th) European Conference on Computational
Biology. BMC Bioinformatics 2009, 10(13).
6. Peixoto L, Gehlenborg N, Janga SC: Highlights from the Fourth
International Society for Computational Biology Student Council
Symposium at the Sixteenth Annual International Conference on
Intelligent Systems for Molecular Biology. BMC Bioinformatics 2008, 9(10).
7. Gehlenborg N, Corpas M, Janga SC: Highlights from the Third
International Society for Computational Biology Student Council
Symposium at the Fifteenth Annual International Conference on
Intelligent Systems for Molecular Biology. BMC Bioinformatics 2007, 8(8).
8. Souilmi Yassine, Jung Jae-Yoon, Lancaster Alex, Gafni Erik, Amzazi Saaid,
Ghazal Hassan, Wall Dennis, Tonellato Peter: COSMOS: Cloud Enabled NGS
Analysis. BMC Bioinformatics 2015, 16(Suppl 2):A2.
9. DeBlasio Dan, Kececioglu John: Parameter Advising for Multiple Sequence
Alignment. BMC Bioinformatics 2015, 16(Suppl 2):A3.
10. Cheng Lin-Yang, Liu Yansheng, Chang Ching-Yun, Röst Hannes,
Aebersold Ruedi, Vitek Olga: Statistical elimination of spectral features
with large between-run variation enhances quantitative protein-level
conclusions in experiments with data-independent spectral acquisition.
BMC Bioinformatics 2015, 16(Suppl 2):A4.
11. Dashnow Harriet, Tan Susan, Das Debjani, Easteal Simon, Oshlack Alicia:
Genotyping microsatellites in next-generation sequencing data. BMC
Bioinformatics 2015, 16(Suppl 2):A5.
12. Zhong Y, Drewe P, Wolfe AL, Singh K, Wendel H, Rätsch G: Protein
translational control and its contribution to oncogenesis revealed by
computational methods. BMC Bioinformatics 2015, 16(Suppl 2):A6.
13. Ye Chen, Siranosian Benjamin, Herold Emma, Kwon Minjae,
Perera Sudheesha, Williams Edward, Taylor Sarah, deGraffenried Christopher:
Tetranucleotide usage in mycobacteriophage genomes: alignment-free
methods to cluster phage and infer evolutionary relationships. BMC
Bioinformatics 2015, 16(Suppl 2):A7.
14. Salazar Alex, Earl Ashlee, Desjardins Christopher, Abeel Thomas:
Normalizing alternate representations of large sequence variants across
multiple bacterial genomes. BMC Bioinformatics 2015, 16(Suppl 2):A8.
15. Keasey Sarah L, Natesan Mohan, Pugh Christine, Kamata Teddy,
Wuchty Stefan, Ulrich Robert G: The road to linking genomics and
proteomics of pathogenic bacteria: From binary protein complexes to
interaction pathways. BMC Bioinformatics 2015, 16(Suppl 2):A9.
16. Ardeshirdavani Amin, Souche Erika, Dehaspe Luc, Van Houdt Jeroen,
Vermeesch Robert Joris, Moreau Yves: NGS-Logistics: Data infrastructure
for efficient analysis of NGS sequence variants across multiple centers.
BMC Bioinformatics 2015, 16(Suppl 2):A10.
MEETING ABSTRACTS
A2
COSMOS: cloud enabled NGS analysis
Yassine Souilmi1,2, Jae-Yoon Jung2, Alex Lancaster2, Erik Gafni3, Saaid Amzazi1,
Hassan Ghazal4, Dennis Wall2,5, Peter Tonellato2*
1Department of Biology, Faculty of Sciences of Rabat, Morocco; 2Center for
Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA;
3INVITAE, San Francisco, CA 94107, USA; 4Department of Biology, Mohamed
First University, Oujda/Nador, Morocco; 5Department of Pediatrics, Division of
Systems Medicine, Stanford University, Stanford, CA 94305, USA
BMC Bioinformatics 2015, 16(Suppl 2):A2
Background: The dramatic fall of next generation sequencing (NGS) cost in
recent years positions the price in range of typical medical testing, and thus
whole genome analysis (WGA) may be a viable clinical diagnostic tool.
Modern sequencing platforms routinely generate petabyte-scale data. The current
challenge lies in calling and analyzing this large-scale data, which has
become the new rate-limiting step in both time and cost.
Methods: To address the computational limitations and optimize the
cost, we have developed COSMOS (http://cosmos.hms.harvard.edu), a
scalable, parallelizable workflow management system running on clouds
(e.g., Amazon Web Services or Google Cloud). Using COSMOS [1], we
have constructed an NGS analysis pipeline implementing the Genome
Analysis Toolkit - GATK v3.1 - best practice protocol [2,3], a widely
accepted industry standard developed by the Broad Institute. COSMOS
performs a thorough sequence analysis, including quality control,
alignment, variant calling and an unprecedented level of annotation
using a custom extension of ANNOVAR. COSMOS takes advantage of
parallelization and the resources of a high-performance compute cluster,
either local or in the cloud, to process datasets of up to the petabyte
scale, which is becoming standard in NGS.
Conclusion: This approach enables the timely and cost-effective
implementation of NGS analysis, allowing it to be used in clinical
settings and translational medicine. With COSMOS we reduced the whole
genome data analysis cost to under the $100 barrier, placing it within a
reimbursable cost point and in clinical time. This significantly changes
the landscape of genomic analysis and cements the utility of the cloud
environment as a resource for petabyte-scale genomic research.
References
1. Gafni E, Luquette LJ, Lancaster AK, Hawkins JB, Jung J-Y, Souilmi Y, Wall DP,
Tonellato PJ: COSMOS: Python library for massively parallel workflows.
Bioinformatics 2014, btu385.
2. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C,
Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ,
Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A
framework for variation discovery and genotyping using next-generation
DNA sequencing data. Nat Genet 2011, 43:491-498.
3. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G,
Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E,
Garimella KV, Altshuler D, Gabriel S, DePristo MA: From FastQ Data to
High-Confidence Variant Calls: The Genome Analysis Toolkit Best
Practices Pipeline. 2013.
A3
Parameter advising for multiple sequence alignment
Dan DeBlasio*, John Kececioglu
Department of Computer Science, University of Arizona, Tucson, AZ, 85721,
USA
E-mail: [email protected]
BMC Bioinformatics 2015, 16(Suppl 2):A3
Background: While the multiple sequence alignment output by an aligner
strongly depends on the parameter values used for its alignment scoring
function (i.e. choice of gap penalties and substitution scores), most users
rely on the single default parameter setting. A different parameter setting,
however, might yield a much higher-quality alignment for a specific set of
input sequences. The problem of picking a good choice of parameter
values for a given set of input sequences is called parameter advising.
A parameter advisor has two ingredients: (i) a set of parameter choices to
select from, and (ii) an estimator that estimates the accuracy of a
computed alignment; the parameter advisor then picks the parameter
choice from the set whose resulting alignment has highest estimated
accuracy.
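The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Opal or Facet implementation: `advise`, `toy_align` and `toy_estimator` are hypothetical names, and the toy aligner and estimator are deliberately trivial stand-ins for a real aligner and a real accuracy estimator.

```python
def advise(sequences, parameter_choices, align, estimate_accuracy):
    """Align under each parameter choice and keep the alignment whose
    estimated accuracy is highest. Returns (parameters, alignment, score)."""
    best = None
    for params in parameter_choices:
        alignment = align(sequences, params)
        score = estimate_accuracy(alignment)
        if best is None or score > best[2]:
            best = (params, alignment, score)
    return best


def toy_align(sequences, params):
    """Hypothetical stand-in for an aligner: right-pads with gap characters."""
    width = max(len(s) for s in sequences) + params["extra_gaps"]
    return [s.ljust(width, "-") for s in sequences]


def toy_estimator(alignment):
    """Hypothetical stand-in for an estimator: fraction of non-gap cells."""
    cells = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return 1.0 - gaps / cells


choices = [{"extra_gaps": 2}, {"extra_gaps": 0}, {"extra_gaps": 1}]
best_params, best_alignment, best_score = advise(
    ["ACG", "AC"], choices, toy_align, toy_estimator
)
```

Note that the advisor never sees the true accuracy; its quality therefore depends entirely on how well the estimator ranks the candidate alignments.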
Our estimator Facet (Feature-based Accuracy Estimator) is a linear
combination of real-valued feature functions of an alignment. We assume
the feature functions are given as well as the universe of parameter choices
from which the advisor’s set is drawn. For this scenario we define the
problem of learning an optimal advisor by finding the best possible
parameter set for a collection of training data of reference alignments.
Learning optimal advisor sets is NP-complete [1]. For the advisor set
problem, we develop a greedy ℓ/k-approximation algorithm that finds
near-optimal sets of size at most k given an optimal solution of size ℓ<k. For the
advisor estimator problem, we have an efficient method for finding the
coefficients for the estimator that performs well in practice [2,3].
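One way to picture a greedy construction of an advisor set is sketched below in Python. It is a simplified illustration under stated assumptions, not the algorithm analysed in [1]: it starts from the best single parameter choice, adds one choice at a time, and stops early when no addition improves training accuracy (the early stop is our own simplification). The names `advising_accuracy` and `greedy_advisor_set` and the toy accuracy tables are hypothetical.

```python
def advising_accuracy(param_set, estimated, true):
    """Average true accuracy obtained when, for each benchmark, the advisor
    picks the parameter choice in param_set with highest estimated accuracy."""
    total = 0.0
    for est_row, true_row in zip(estimated, true):
        chosen = max(param_set, key=lambda p: est_row[p])
        total += true_row[chosen]
    return total / len(true)


def greedy_advisor_set(universe, estimated, true, k):
    """Greedily grow a set of at most k parameter choices."""
    chosen = []
    remaining = list(universe)
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda p: advising_accuracy(chosen + [p], estimated, true),
        )
        gain = advising_accuracy(chosen + [best], estimated, true)
        if chosen and gain <= advising_accuracy(chosen, estimated, true):
            break  # assumption: stop when no candidate improves accuracy
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy training data (hypothetical): rows are benchmarks, columns are
# parameter choices 0..2; values are estimated and true accuracies.
estimated = [[0.9, 0.5, 0.1], [0.2, 0.8, 0.3], [0.4, 0.3, 0.7]]
true = [[0.8, 0.4, 0.2], [0.3, 0.9, 0.1], [0.5, 0.2, 0.6]]
advisor_set = greedy_advisor_set([0, 1, 2], estimated, true, k=2)
```

Because the set is scored by the true accuracy of what the estimator would pick, a parameter choice is only worth adding if the estimator can recognise the benchmarks on which it helps.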
Results: Parameter advising: We apply parameter advising to boost the
true accuracy of the Opal aligner [4,5], where the advisor uses
parameter sets found by the greedy approximation algorithm. Figure 1 shows
the accuracy of the advisor for a parameter set of size k = 10, where the
benchmarks are assigned to bins based on their accuracy using a default
parameter choice; the figure also shows the accuracies when using a single
default parameter choice, and an oracle. The number of benchmarks per
bin is indicated above the columns. An oracle is an advisor that knows the
true accuracy of an alignment; its accuracy is shown by the dotted line,
which gives the performance of a perfect advisor. Notice that in many
cases the performance of the estimator is close to the oracle. This is most
clear on the bin which has lowest average accuracy, where advising
increases the average accuracy by almost 20% compared to using a single
default parameter.
Figure 2 shows the average advising accuracy for parameter sets of
various cardinalities using as the estimator Facet [3], TCS [6], MOS [7], and
PredSP [8], where in the average, benchmark bins contribute equally. The
vertical axis is advising accuracy on the testing data, averaged over all
benchmarks and all folds using 12-fold cross-validation. The horizontal
axis is the cardinality k of the greedy advisor set. Greedy advisor sets
found by the approximation algorithm are augmented from the exact set
of cardinality ℓ = 1 (namely, the best single parameter choice). Notice
that Facet (the topmost curve in the plot) continues to increase in
advising accuracy up to cardinality k = 6. Notice also that while all of the
advisors reach a plateau, for Facet this occurs at a greater cardinality and
accuracy than for other estimators.
Figure 1 (abstract A3) Advising accuracy of Facet within benchmark bins
Figure 2 (abstract A3) Average advising accuracy of estimators on sets of varying cardinality
Accuracy estimation: Our tool Facet (Feature-based Accuracy Estimator)
[9] is an easy-to-use, open-source utility for estimating the accuracy of a
protein multiple sequence alignment. Facet evaluates the estimated
accuracy of a computed alignment as a linear combination of real-valued
feature functions. We considered 12 features of which we found an
optimal subset of 5 that provide the best performance for alignment
advising. Many of the most useful features utilize information about
protein secondary structure. We find coefficients by fitting the difference in
estimator values to the difference in true accuracy for pairs of examples
where the correct alignment is known. This “difference fitting” approach is
computationally efficient and yields an estimator that works well for
advising.
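The idea of difference fitting can be illustrated with a small Python sketch. It fits the coefficients of a linear estimator so that, over all pairs of training examples, the difference in estimated accuracy matches the difference in true accuracy. The least-squares objective and the plain gradient descent used here are assumptions chosen for brevity, not the optimisation method used by Facet; the function names and the toy data are hypothetical.

```python
def estimate(coefficients, features):
    """Linear accuracy estimate: a weighted sum of feature values."""
    return sum(c * f for c, f in zip(coefficients, features))


def fit_by_differences(examples, steps=2000, rate=0.5):
    """Fit coefficients so that differences in the estimate track
    differences in true accuracy over all pairs of examples.

    examples: list of (feature_vector, true_accuracy) tuples.
    """
    n = len(examples[0][0])
    coefficients = [0.0] * n
    pairs = [(a, b) for i, a in enumerate(examples) for b in examples[i + 1:]]
    for _ in range(steps):
        gradient = [0.0] * n
        for (feat_a, acc_a), (feat_b, acc_b) in pairs:
            delta = [x - y for x, y in zip(feat_a, feat_b)]
            residual = estimate(coefficients, delta) - (acc_a - acc_b)
            for j in range(n):
                gradient[j] += 2.0 * residual * delta[j]
        # Gradient step on the mean squared pairwise residual.
        coefficients = [
            c - rate * g / len(pairs) for c, g in zip(coefficients, gradient)
        ]
    return coefficients


# Toy data (hypothetical): accuracies generated from coefficients (0.7, 0.3).
examples = [
    ([0.2, 0.9], 0.41),
    ([0.8, 0.1], 0.59),
    ([0.5, 0.5], 0.50),
    ([0.1, 0.3], 0.16),
]
fitted = fit_by_differences(examples)
```

Fitting differences rather than absolute values matches how an advisor uses the estimator: only the ranking of candidate alignments matters, not the absolute estimate.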
Facet is open-source software that allows users to estimate accuracy as
either (1) a stand-alone tool, or (2) a software library that can be integrated
into a pre-existing Java application. The implementation provides
optimized default coefficients and features. These coefficients may also be
specified manually and new features can also be added. Figure 3 shows a
simple example of using Facet within a Java application to choose
between two alignments of the same set of sequences. The secondary
structure predictions are computed on the unaligned sequences and can
be reused between the two alignments.
The Facet website provides parameter sets that can be used with the
Opal aligner (namely substitution matrices and affine gap penalties), as
well as scripts for structure prediction.
Conclusion: While the new problem of learning optimal parameter sets for
an advisor is NP-complete, in practice our greedy approximation algorithm
efficiently learns parameter sets that are remarkably close to optimal.
Moreover, these parameter sets significantly boost the accuracy of an
aligner compared to a single default parameter choice, when advising using
the best accuracy estimators from the literature.
Figure 3 (abstract A3) Example of invoking Facet in Java
References
1. DeBlasio DF, Kececioglu JD: Learning Parameter Sets for Alignment
Advising. Proceedings of the 5th ACM Conference on Bioinformatics,
Computational Biology and Health Informatics (ACM-BCB) 2014.
2. DeBlasio DF, Wheeler TJ, Kececioglu JD: Estimating the accuracy of
multiple alignments and its use in parameter advising. Proceedings of the
16th Conference on Research in Computational Molecular Biology (RECOMB)
2012, 45-59.
3. Kececioglu JD, DeBlasio DF: Accuracy Estimation and Parameter Advising
for Protein Multiple Sequence Alignment. Journal of Computational
Biology 2013, 20(4):259-279.
4. Wheeler TJ, Kececioglu JD: Multiple alignment by aligning alignments.
Bioinformatics 2007, 23(13):559-68.
5. Wheeler TJ, Kececioglu JD: Opal: multiple sequence alignment software,
Version 2.1.0. 2012 [http://opal.cs.arizona.edu].
6. Chang JM, Tommaso PD, Notredame C: TCS: A new multiple sequence
alignment reliability measure to estimate alignment accuracy and
improve phylogenetic tree reconstruction. Molecular Biology and Evolution
2014.
7. Lassmann T, Sonnhammer ELL: Automatic assessment of alignment
quality. Nucleic Acids Research 2005, 33(22):7120-7128.
8. Ahola V, Aittokallio T, Vihinen M, Uusipaikka E: Model-based prediction of
sequence alignment quality. Bioinformatics 2008, 24(19):2165-2171.
9. DeBlasio DF, Kececioglu JD: Facet: software for accuracy estimation of
protein multiple sequence alignments, Version 1.1. 2014
[http://facet.cs.arizona.edu].
A4
Statistical elimination of spectral features with large between-run
variation enhances quantitative protein-level conclusions in
experiments with data-independent spectral acquisition
Lin-Yang Cheng1*, Yansheng Liu2, Ching-Yun Chang1, Hannes Röst2,
Ruedi Aebersold2,3, Olga Vitek4
1Department of Statistics, Purdue University, West Lafayette IN, USA;
2Department of Biology, Institute of Molecular Systems Biology, ETH Zurich,
8093 Zurich, Switzerland; 3Faculty of Science, University of Zurich, 8057
Zurich, Switzerland; 4Department of Computer Science, Purdue University,
West Lafayette IN, USA
E-mail: [email protected]
BMC Bioinformatics 2015, 16(Suppl 2):A4
Background: Many proteomic investigations summarize the quantitative
information across multiple spectral features into protein-level conclusions.
Data-independent spectral acquisition (DIA) now generates a lot of interest,
as it allows us to quantify many spectral features in a single run. However,
the disadvantage of DIA experiments as compared, e.g., to Selected
Reaction Monitoring (SRM) is that the features are subject to interferences
and noise. We argue that between-run variation provides additional
insight for distinguishing good-quality from noisy DIA features. To
use the quantitative between-run variation appropriately, it is important to
account for the properties of the experimental design, and to distinguish random
artifacts from biological changes. We have previously proposed a
method (Chang et al., ASMS 2013) that accounts for the experimental design
to eliminate features with low information content.
Results: In this project we further emphasize that
regularization helps us avoid exploring every subset of features exhaustively,
and allows us to conduct hypothesis tests afterwards so that we are able
to control the false discovery rate of the feature selection process.
We evaluated our proposed approach using three datasets that have
some notion of ground truth: an extensive simulation study, a controlled
mixture where proteins were spiked into a complex background in known
concentrations, and a study of 232 plasma samples, where 18 proteins were
quantified in both SWATH and SRM mode in the presence of heavy-labeled
reference peptides. We examined (1) protein-level estimates of fold
changes between conditions, (2) sensitivity and specificity of detecting
changes in protein abundance, and (3) accuracy of relative quantification of
protein abundance in individual biological samples. A family of linear mixed
models similar to that in MSstats (http://www.msstats.org) was fit to all the
datasets. We then applied the regularization and hypothesis tests to
control the false discovery rate of the selection.
Conclusion: The results demonstrated that our proposed feature selection
approach enhanced sensitivity and specificity of the conclusions, was
robust to the amount of noisy fragments, and increased the correlation of
subject quantification between SRM and DIA workflows. Importantly, the
performance exceeded that of the frequently used ‘top 3’ approach, which
consists of using three spectral features with the highest average intensity
between runs. Furthermore, we showed that our proposed approach
outperforms using correlation to select the informative features.
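For reference, the 'top 3' baseline mentioned above is simple enough to sketch in Python. The data layout (a dictionary mapping each spectral feature of a protein to its intensities across runs), the function names, and the use of a per-run sum as the protein-level summary are assumptions made for illustration; the feature identifiers and intensities are invented toy values.

```python
def top3_features(intensities):
    """Return the three features with the highest average intensity across runs.

    intensities: dict mapping feature id -> list of intensities, one per run.
    """
    mean = {f: sum(v) / len(v) for f, v in intensities.items()}
    return sorted(mean, key=mean.get, reverse=True)[:3]


def protein_summary_per_run(intensities, selected):
    """Summarise a protein in each run by summing the selected features
    (the sum is an illustrative choice of summary)."""
    n_runs = len(next(iter(intensities.values())))
    return [sum(intensities[f][r] for f in selected) for r in range(n_runs)]


# Toy intensities (hypothetical) for five features measured in two runs.
toy = {
    "f1": [10, 12],
    "f2": [100, 90],
    "f3": [50, 55],
    "f4": [5, 4],
    "f5": [70, 80],
}
selected = top3_features(toy)
summary = protein_summary_per_run(toy, selected)
```

This baseline ranks features by intensity alone, so a bright feature affected by interference is still selected; that is the weakness that selection based on between-run variation is meant to address.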
References
1. Clough T, Thaminy S, Ragg S, Aebersold R, Vitek O: Statistical protein
quantification and significance analysis in label-free LC-MS experiments
with complex designs. BMC Bioinformatics 2012, 13:S16.
2. Chang CY, Picotti P, Hüttenhain R, Heinzelmann-Schwarz V, Jovanovic M,
Aebersold R, Vitek O: Protein significance analysis in selected reaction
monitoring (SRM) measurements. Molecular and Cellular Proteomics 2012,
11, Article M111.014662.
3. Choi M, Chang CY, Clough T, Broudy D, Killeen T, MacLean B, Vitek O:
MSstats: an R package for statistical analysis of quantitative mass
spectrometry-based proteomic experiments. Bioinformatics 2014.
4. Lockhart R, Taylor J, Tibshirani R, Tibshirani R: A significance test for the
lasso. The Annals of Statistics 2014, 42.
A5
Genotyping microsatellites in next-generation sequencing data
Harriet Dashnow1,2,3*, Susan Tan4, Debjani Das4, Simon Easteal4,
Alicia Oshlack2,3
1Life Science Computation Centre, Victorian Life Sciences Computation
Initiative, Carlton, VIC, Australia; 2The University of Melbourne, Parkville, VIC,
Australia; 3Murdoch Childrens Research Institute, Parkville, VIC, Australia;
4John Curtin School of Medical Research - Australian National University,
Canberra, ACT, Australia
BMC Bioinformatics 2015, 16(Suppl 2):A5
Background: Microsatellites are short (2-6bp) DNA sequences repeated in
tandem, which make up approximately 3% of the human genome [1].
These loci are prone to frequent mutations and high polymorphism, with
estimated mutation rates of 10^-2 to 10^-6 events per locus per
generation, orders of magnitude higher than other parts of the genome
[2]. Dozens of neurological and developmental disorders have been
attributed to microsatellite expansions [3]. Microsatellites have also been
implicated in a range of functions such as DNA replication and repair,
chromatin organisation and regulation of gene expression [4].
Traditionally, microsatellite variation has been measured using capillary
gel electrophoresis [5]. In addition to being time-consuming and
expensive, this method fails to reveal the full complexity at these loci
because it does not directly sequence the fragment but only measures the
number of bases in the repeat.
Next-generation sequencing has the potential to address these problems.
However, determining microsatellite lengths using next-generation
sequencing data is difficult. In particular, polymerase slippage during PCR
amplification introduces stutter noise. A small number of software tools
have been written to genotype simple microsatellites in next-generation
sequencing data [6-8]; however, they fail to address the issues of SNPs
and compound repeats, and in some cases provide only approximate
genotypes.
We have begun to develop a microsatellite genotyping algorithm that
addresses these issues, providing high accuracy as well as more detailed
analysis of microsatellite loci. We have validated it using high depth
amplicon sequencing data of microsatellites near the AVPR1A gene.
Table 1(abstract A5) Concordance of microsatellite variant calls with three
validation methods: electrophoresis, manual inspection and Mendelian inheritance
Validation method       Concordant #   Concordant %
Electrophoresis         9/9            100%
Manual inspection       17/18          ~95%
Mendelian inheritance   18/18          100%
Results: We found high concordance between our algorithm and repeat
lengths obtained by electrophoresis, manual inspection and Mendelian
inheritance (Table 1). By subsampling the reads, we found that our model
is accurate to within one repeat unit down to coverages that we would
expect in standard exome sequencing (Figure 1). Additionally, we detected
polymorphic single nucleotide changes within some microsatellites.
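The subsampling experiment can be illustrated with a short Python sketch: reads are drawn with replacement at a chosen depth, a genotype is called on each draw, and the fraction of calls that match the truth (exactly, or to within one repeat unit) is recorded. The abstract does not describe the genotyping model itself, so `call_genotype` below is a hypothetical majority-vote stand-in, the locus is simplified to a single allele, and the read counts are invented for the example.

```python
import random
from collections import Counter


def call_genotype(repeat_counts):
    """Toy caller: report the most frequent repeat count among the reads."""
    return Counter(repeat_counts).most_common(1)[0][0]


def subsample_accuracy(repeat_counts, truth, depth, n_trials=1000, tolerance=0, seed=0):
    """Fraction of resampled read sets whose call is within `tolerance` units of `truth`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        sample = rng.choices(repeat_counts, k=depth)  # sampling with replacement
        if abs(call_genotype(sample) - truth) <= tolerance:
            hits += 1
    return hits / n_trials


# Hypothetical reads at a locus whose true allele has 12 repeat units;
# stutter contributes reads one unit shorter or longer.
reads = [12] * 70 + [11] * 20 + [13] * 10
exact = subsample_accuracy(reads, truth=12, depth=20)
within_one = subsample_accuracy(reads, truth=12, depth=20, tolerance=1)
```

Using the same seed for both calls means the two accuracies are computed on identical resamples, so the within-one-unit accuracy can never be lower than the exact accuracy.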
Conclusions: The algorithm was approximately 95% correct at calling the
exact same genotype on high depth sequencing data. When it did call a
genotype incorrectly, the genotype was only one repeat unit different. The
algorithm can perform at approximately 90% accuracy to within one repeat
unit with as few as 20 informative reads and reaches almost 100% accuracy
to within one repeat unit with 100 or more informative reads.
Future work will include expanding the algorithm to genotype compound
microsatellites, as well as further validation and comparison with other
algorithms on whole genome data sets.
References
1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K,
Dewar K, Doyle M, FitzHugh W: Initial sequencing and analysis of the
human genome. Nature 2001, 409(6822):860-921.
2. Gemayel R, Vinces MD, Legendre M, Verstrepen KJ: Variable tandem
repeats accelerate evolution of coding and regulatory sequences. Annual
review of genetics 2010, 44:445-477.
3. Gatchel JR, Zoghbi HY: Diseases of unstable repeat expansion:
mechanisms and common principles. Nature Reviews Genetics 2005,
6(10):743-755.
4. Li YC, Korol AB, Fahima T, Beiles A, Nevo E: Microsatellites: genomic
distribution, putative functions and mutational mechanisms: a review.
Molecular ecology 2002, 11(12):2453-2465.
5. Guichoux E, Lagache L, Wagner S, Chaumeil P, Léger P, Lepais O,
Lepoittevin C, Malausa T, Revardel E, Salin F, et al: Current trends in
microsatellite genotyping. Molecular Ecology Resources 2011, 11(4):591-611.
6. Gymrek M, Golan D, Rosset S, Erlich Y: lobSTR: A short tandem repeat
profiler for personal genomes. Genome Research 2012.
7. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D:
Accurate human microsatellite genotypes from high-throughput
resequencing data using informed error profiles. Nucleic acids research
2012, gks981.
8. Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S,
Balasubramanian S, Bodén M: Inferring short tandem repeat variation
from paired-end short reads. Nucleic acids research 2014, 42(3):e16-e16.
A6
Protein translational control and its contribution to oncogenesis
revealed by computational methods
Yi Zhong*, Phillip Drewe, Andrew L Wolfe, Kamini Singh, Hans-Guido Wendel,
Gunnar Rätsch
Memorial Sloan Kettering Cancer Center, 1275 York avenue, New York,
NY 10065, USA
BMC Bioinformatics 2015, 16(Suppl 2):A6
Background: Protein translation is a fundamental biochemical process
and the regulation of this process in response to a variety of changes has
been demonstrated to play a key role in cellular functional activity.
Recently, the translational control of oncogenes has been implicated in many
cancers [1].
Figure 1(abstract A5) Genotyping accuracy at the (AC)n promoter locus as a function of the number of reads spanning the microsatellite. 20 to 3000 reads were sampled with replacement from those spanning the microsatellite. This was done 1000 times for each depth. A shows the proportion of genotypes that were exactly correct, B shows the proportion of genotypes that were correct to within one repeat unit
Results: We recently reported a translation initiation factor eIF4A RNA
helicase-dependent mechanism of translational control that contributes to
oncogenesis and underlies the anticancer effects of the drug silvestrol [2].
Inhibition of eIF4A with silvestrol has powerful therapeutic effects in vitro
and in vivo. In this study, we developed novel computational tools,
specifically tailored to study high throughput ribosome footprint data
(Ribo-seq) [3], to identify genes showing either of two
changes between two experimental conditions: 1) translational efficiency
(TE), and 2) ribosome occupancy distribution profile (ROD) on mRNA.
In the parametric test of TE, we take RNA abundance and ribosome
occupancy density into account in order to expeditiously identify
differential translation efficiency. The non-parametric test of ROD [4],
in contrast, aims to identify differential occupancy profiles, such as ribosome
stalling at specific sites, even if overall translation efficiency remains
unchanged. Using transcriptome-scale ribosome footprinting data from a
leukemia cell line, we defined drug-sensitive genes showing both a
decrease in translational efficiency (Figure 1A) and an accumulation of
ribosome occupancy at the 5’UTR (Figure 1B). Among the most eIF4A-dependent
transcripts are a number of oncogenes, super-enhancer
associated transcription factors and epigenetic regulators.
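As a rough illustration of the quantity being compared, translational efficiency is commonly computed as the ratio of ribosome footprint reads to RNA-seq reads for a gene, and a change between conditions as the log2 ratio of the two TE values. The sketch below uses that common definition only; it is not the parametric test developed in this work, it omits library-size normalisation, and the function names, pseudocount and read counts are assumptions made for the example.

```python
import math


def translational_efficiency(ribo_count, rna_count, pseudocount=1.0):
    """Ratio of ribosome footprint reads to RNA-seq reads for one gene."""
    return (ribo_count + pseudocount) / (rna_count + pseudocount)


def log2_te_change(control, treated):
    """log2 fold change in TE, treated versus control; each argument is (ribo, rna)."""
    te_control = translational_efficiency(*control)
    te_treated = translational_efficiency(*treated)
    return math.log2(te_treated / te_control)


# Hypothetical counts: footprints drop under treatment while mRNA level is unchanged.
change = log2_te_change(control=(399, 99), treated=(99, 99))
```

A negative value, as in this toy case, corresponds to a gene whose translation decreases without a matching change in transcript abundance.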
Conclusions: Computational and statistical methodologies facilitate the
discovery of the hallmark of eIF4A-dependent transcripts, namely that their 5’UTR
sequences harbor the 12-mer guanine quartet (CGG)4 motif associated with
RNA G-quadruplex (GQ) structures (Figure 1C). Our novel computational
tools provide a fast, accurate solution to gain biological insights from Ribo-seq
and RNA-seq data.
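A simple way to look for the 12-mer (CGG)4 motif in a 5’UTR is an exact-match scan, sketched below in Python. This is a minimal illustration and not the motif discovery procedure used in the study; the function name and the example sequence are hypothetical, and real analyses would typically allow for degenerate matches.

```python
import re

# (CGG)4 written as a lookahead so that overlapping occurrences are all reported.
CGG4 = re.compile(r"(?=(?:CGG){4})")


def cgg4_positions(utr_sequence):
    """Return 0-based start positions of every exact (CGG)4 match in a 5'UTR."""
    dna = utr_sequence.upper().replace("U", "T")  # accept RNA or DNA alphabets
    return [match.start() for match in CGG4.finditer(dna)]


# Hypothetical 5'UTR fragment containing five consecutive CGG units.
positions = cgg4_positions("AACGGCGGCGGCGGCGGTT")
```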
References
1. Hay N, Sonenberg N: Upstream and downstream of mTOR. Genes Dev
2004, 18(16):1926-45.
2. Wolfe AL, Singh K, Zhong Y, Drewe P, Rajasekhar VK, Sanghvi VR,
Mavrakis KJ, Jiang M, Roderick JE, Van der Meulen J, Schatz JH, Rodrigo CM,
Zhao C, Rondou P, de Stanchina E, Teruya-Feldstein J, Kelliher MA,
Speleman F, Porco JA Jr, Pelletier J, Rätsch G, Wendel HG: RNA
G-quadruplexes cause eIF4A-dependent oncogene translation in cancer.
Nature 2014, 513(7516):65-70.
3. Ingolia NT, Brar GA, Rouskin S, McGeachy AM, Weissman JS: The ribosome
profiling strategy for monitoring translation in vivo by deep sequencing
of ribosome-protected mRNA fragments. Nat Protoc 2012, 7(8):1534-50.
4. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A,
Borgwardt K, Rätsch G: Accurate detection of differential RNA
processing. Nucleic Acids Res 2013, 41(10):5189-98.
A7
Tetranucleotide usage in mycobacteriophage genomes: alignment-free
methods to cluster phage and infer evolutionary relationships
Benjamin Siranosian1,2*, Emma Herold2, Edward Williams2, Chen Ye2,
Christopher de Graffenried3
1Center for Computational Molecular Biology, Brown University, Providence,
RI, USA; 2Division of Biology and Medicine, Brown University, Providence, RI,
USA; 3Department of Molecular Microbiology and Immunology, Brown
University, Providence, RI, USA
BMC Bioinformatics 2015, 16(Suppl 2):A7
Background: The genomic sequences of phages isolated on mycobacterial
hosts are diverse, mosaic and often share little nucleotide similarity. However,
about 30 unique types have been isolated, allowing most phage to be
grouped into clusters and further into subclusters [1]. Many tools for the
analysis of mycobacteriophage genomes depend on sequence alignment or
knowledge of gene content. These methods are computationally expensive,
can require significant manual input (for example, gene annotation) and can
be ineffective for significantly diverged sequences [2]. We evaluated
tetranucleotide usage in mycobacteriophages as an alternative to
alignment-based methods for genome analysis.
Description: We computed tetranucleotide usage deviation, the ratio of
observed counts of 4-mers in a genome to the expected count under a null
model [3]. Tetranucleotide usage deviation is comparable for members of the
same phage subcluster and distinct between subclusters. Neighbor joining
phylogenetic trees were constructed on pairwise Euclidean distances between
all genomes in the mycobacteriophage database. In almost every case, phage
were placed in a monophyletic clade with members of the same subcluster.
With few exceptions, trees computed from tetranucleotide usage deviation
accurately reconstruct trees based on gene content for a subset of the
mycobacteriophage population (Figure 1). We also evaluated the possibility of
assigning clusters to unknown phage based on tetranucleotide usage
deviation. Under a simple nearest neighbor classifier, cluster assignments were
recovered at a frequency greater than 98%. In addition, we looked for
evidence of horizontal gene transfer by using tetranucleotide difference index,
a measure of the deviation in tetranucleotide usage from the genomic mean
in a sliding window across the genome [3]. Tetranucleotide difference index
plots showed a strong spike at the end of cluster L mycobacteriophages,
which could indicate horizontal gene transfer in the region.
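The core computation described above can be sketched in Python as follows: tetranucleotide usage deviation as observed over expected 4-mer counts, Euclidean distance between the resulting 256-element vectors, and a nearest neighbor cluster assignment. The null model used here (expected counts from independent single-base frequencies) is an assumption for illustration, as are the function names and the toy sequences; the authors' own code is available at the repository cited in the Conclusions. Sequences are assumed to be at least four bases long.

```python
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def usage_deviation(genome):
    """Observed/expected count for each of the 256 tetranucleotides."""
    genome = genome.upper()
    n_windows = len(genome) - 3
    observed = dict.fromkeys(TETRAMERS, 0)
    for i in range(n_windows):
        kmer = genome[i:i + 4]
        if kmer in observed:  # skip windows containing N or other symbols
            observed[kmer] += 1
    base_freq = {base: genome.count(base) / len(genome) for base in "ACGT"}
    deviation = []
    for kmer in TETRAMERS:
        expected = n_windows
        for base in kmer:
            expected *= base_freq[base]  # assumed null: independent bases
        deviation.append(observed[kmer] / expected if expected > 0 else 0.0)
    return deviation


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_cluster(query, labelled):
    """Return the cluster label of the labelled genome closest to `query`."""
    q = usage_deviation(query)
    closest = min(labelled, key=lambda item: euclidean(q, usage_deviation(item[0])))
    return closest[1]


# Toy sequences standing in for genomes of two hypothetical clusters.
labelled = [("ACGTACGT" * 25, "X"), ("GATCGATC" * 25, "Y")]
assigned = nearest_cluster("GATC" * 40, labelled)
```

The same distance function can be applied to all genome pairs to build the distance matrix that a neighbor joining implementation takes as input.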
Conclusions: Genome analysis based on tetranucleotide usage shows
promise for evaluating host-parasite coevolution and gene exchange within
the mycobacteriophage population. These methods are computationally
inexpensive and independent of gene annotation, making them optimal
candidates for further research aimed at clustering phage and determining
evolutionary relationships. Code for genome analysis and data used in this
project are freely available at https://github.com/bsiranosian/tango_final.
References
1. Hatfull GF: Mycobacteriophages: Windows into Tuberculosis. PLoS Pathog
2014, 10:e1003953.
2. Vinga S, Almeida J: Alignment-free sequence comparison-a review.
Bioinformatics 2003, 19:513-523.
Figure 1(abstract A6) Computational methods revealed drug effects on protein translation. A Histogram of the ratio of TE in control and drug-treated samples. Red: genes with significant TE down and up regulation were identified based on the read count of Ribo-seq data. Blue and green: TE up
and down genes defined by |Z score| > 2 as often used in other analyses. B Averaged distribution profile of ribosome occupancy of 62 drug-sensitive
genes. Ribosome footprint coverages and transcript lengths were normalized. C Twelve-nucleotide motif that is highly enriched in 5’ UTR of TE down and
ROD positive genes. We suggested that the GQ structure is responsible for ribosome stalling in the 5’ UTR [2]
Figure 1(abstract A7) A) Neighbor joining tree constructed from tetranucleotide usage deviation distances and B) tree from [4] constructed from
predicted protein products in a subset of sequenced mycobacteriophages. Our method accurately places phage in a monophyletic clade with members
of the same subcluster and often reconstructs relationships between subclusters. In some cases, a subcluster is not placed with other members of the cluster
because of significant and conserved differences in tetranucleotide usage, such as overrepresentation of the 4-mer ‘GATC’ in cluster B3 genomes
3. Pride DT, Wassenaar TM, Ghose C, Blaser MJ: Evidence of host-virus
co-evolution in tetranucleotide usage patterns of bacteriophages and
eukaryotic viruses. BMC Genomics 2006, 7:8.
4. Hatfull GF, Jacobs-Sera D, Lawrence JG, Pope WH, Russell DA, Ko C-C,
Weber RJ, Patel MC, Germane KL, Edgar RH, Hoyte NN, Bowman CA,
Tantoco AT, Paladin EC, Myers MS, Smith AL, Grace MS, Pham TT,
O’Brien MB, Vogelsberger AM, Hryckowian AJ, Wynalek JL, Donis-Keller H,
Bogel MW, Peebles CL, Cresawn SG, Hendrix RW: Comparative Genomic
Analysis of 60 Mycobacteriophage Genomes: Genome Clustering, Gene
Acquisition, and Gene Size. Journal of Molecular Biology 2010, 397:119-143.
A8
Normalizing alternate representations of large sequence variants
across multiple bacterial genomes
Alex Salazar1,2, Ashlee Earl1, Christopher Desjardins1, Thomas Abeel1,3*
1Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA;
2University of California, Santa Cruz, California, USA; 3Delft Bioinformatics Lab,
Delft University of Technology, Delft, Netherlands
BMC Bioinformatics 2015, 16(Suppl 2):A8
Background and description: Variant-focused comparative genomics
enables researchers to study the evolution of distinct genetic characteristics
in bacterial populations, while avoiding the difficulties of whole-genome
assembly and alignment. A major challenge in using this method is that
many variant detecting tools are largely limited to predicting single
nucleotide variants (SNVs) and small indels. This is a challenge because
bacterial organisms do not only possess SNVs but also harbor much larger
sequence variants (LSVs), such as large indels and substitutions (>25 nt),
when compared to a reference genome. LSVs have been shown to play a
role in shaping important biological aspects such as virulence and drug
resistance as well as reporting on population structure [1-3]. Recent variant
callers, such as Pilon (http://www.broadinstitute.org/software/pilon/), can
identify LSVs with single nucleotide accuracy in microbial genomes.
However, one remaining challenge is that identical LSVs can be represented
non-identically by a single variant detecting tool; this generally results from
similarity in the flanking sequence of the variant and variability of the read
quality and alignment information in that region across the different strains.
As a result, alternate representations of large variants make it difficult to
perform downstream analyses - such as association studies - that depend on
consistent representations of variants.
We present Emu, an algorithm that resolves alternate representations of
LSVs by comparing variant calls across genomes.
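To see why one variant can have several valid representations, the Python sketch below applies each reported call to the reference and groups calls that yield the same alternate sequence. It is an illustration of the problem only and not Emu's algorithm, which the abstract does not specify. The function names, the reference string and the calls are hypothetical, and a practical tool would compare local windows instead of rebuilding the whole sequence.

```python
from collections import defaultdict


def apply_variant(reference, pos, ref_allele, alt_allele):
    """Replace `ref_allele` at 0-based `pos` with `alt_allele` and return the result."""
    if reference[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("ref allele does not match the reference at this position")
    return reference[:pos] + alt_allele + reference[pos + len(ref_allele):]


def group_equivalent(reference, variants):
    """Group variant calls that produce an identical alternate sequence."""
    groups = defaultdict(list)
    for variant in variants:
        groups[apply_variant(reference, *variant)].append(variant)
    return list(groups.values())


reference = "GGTCACACACATTG"
calls = [
    (3, "CA", ""),    # deletion reported at the left edge of the CA repeat
    (5, "CA", ""),    # the same deletion reported one repeat unit to the right
    (4, "ACA", "A"),  # the same deletion written with an anchoring base
    (1, "G", "T"),    # an unrelated substitution
]
groups = group_equivalent(reference, calls)
```

The three deletion records differ in position and alleles but describe the same sequence change, which is possible here because the flanking sequence is repetitive.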
Results: To evaluate Emu’s ability to resolve alternate representations of
LSVs, we introduced 179 simulated LSVs into the H37Rv genome, a carefully
curated and finished reference genome for Mycobacterium tuberculosis
(Mtb). We then used Pilon to identify variants in a set of 146 clinical samples
of Mtb that were collected in China, using the modified H37Rv genome as a
reference [4]. We identified a total of 10,001 unique variant representations.
The average number of non-identical representations of each simulated LSV
was 56 (in the range of 1 to 145). We then applied Emu to identify the
non-identical representations across the genomes of the 146 clinical samples and
canonicalize them to a single form. Emu reduced the total number of
non-identical representations to 676 LSVs, bringing the average number of
non-identical representations at each LSV to 4, with 15 LSVs reduced to a single
representation and no LSV having more than 25 representations.
We then investigated how Emu’s ability to resolve alternate representations
might impact association analyses, e.g., associating LSVs with population
structure. We ran Pilon again on the set of 161 clinical samples from China,
but used the unmodified H37Rv genome. Pilon identified a total of 20,512
distinct LSVs when compared to the unmodified H37Rv genome. By
applying Emu, the number of distinct LSVs decreased by almost 50% to
10,936 LSVs. Emu also increased the power of association tests on the LSVs.
While we initially identified a total of 69 LSVs that were significantly
associated (p < 0.01) with membership in a specific clade, after processing
with Emu that number increased to 94.
Conclusion: Emu enables comprehensive analysis of LSVs in bacterial
genomes by reducing the cross-sample noise that results from per-sample
variant calls. By normalizing our variant calls with Emu, we increased our
power to use LSVs in association tests. Pilon and Emu are open source
tools that can also be applied to identify and normalize variants in other
organisms.
References
1. Alland D, Lacher DW, Hazbón MH, Motiwala AS, Qi W, Fleischmann RD,
Whittam TS: Role of large sequence polymorphisms (LSPs) in generating
genomic diversity among clinical isolates of Mycobacterium tuberculosis
and the utility of LSPs in phylogenetic analysis. J Clin Microbiol 2007, 45:39-46.
2. Maurelli AT, Fernández RE, Bloch CA, Rode CK, Fasano A: “Black holes” and
bacterial pathogenicity: a large genomic deletion that enhances the
virulence of Shigella spp. and enteroinvasive Escherichia coli. Proc Natl
Acad Sci USA 1998, 95:3943-3948.
3. Mutreja A, Kim DW, Thomson NR, Connor TR, Lee JH, Kariuki S, Croucher NJ,
Choi SY, Harris SR, Lebens M, Niyogi SK, Kim EJ, Ramamurthy T, Chun J,
Wood JLN, Clemens JD, Czerkinsky C, Nair GB, Holmgren J, Parkhill J,
Dougan G: Evidence for several waves of global transmission in the
seventh cholera pandemic. Nature 2011, 477:462-5.
4. Zhang H, Li D, Zhao L, Fleming J, Lin N, Wang T, Liu Z, Li C, Galwey N,
Deng J, Zhou Y, Zhu Y, Gao Y, Wang T, Wang S, Huang Y, Wang M,
Zhong Q, Zhou L, Chen T, Zhou J, Yang R, Zhu G, Hang H, Zhang J, Li F,
Wan K, Wang J, Zhang X-E, Bi L: Genome sequencing of 161
Mycobacterium tuberculosis isolates from China identifies genes and
intergenic regions associated with drug resistance. Nat Genet 2013,
September: 1-8.
A9
The road to linking genomics and proteomics of pathogenic bacteria:
from binary protein complexes to interaction pathways
Sarah L Keasey1*, Mohan Natesan1, Christine Pugh1, Teddy Kamata1,
Stefan Wuchty2, Robert G Ulrich1
1Molecular and Translational Sciences Division, U.S. Army Medical Research
Institute of Infectious Diseases, Frederick, MD 21702, USA; 2National Center
for Biotechnology Information, National Institutes of Health, Bethesda, MD
20892, USA
BMC Bioinformatics 2015, 16(Suppl 2):A9
Background: The availability of fully sequenced genomes of many bacterial
organisms has enabled mapping networks of binary protein interactions
that form the basic building blocks of molecular pathways and dynamic
assemblies defining all cellular activities. Few proteome-scale studies have
been reported for pathogenic bacteria though, suggesting that a systems-wide
network analysis of binary interaction partners could reveal groups of
proteins that coordinate to achieve specific biological tasks important to
pathogenesis and provide a functional map useful to the discovery of new
antibiotics, vaccines, and diagnostic tools.
Results: We performed a comprehensive proteomics analysis of the
pathogenic bacterium Yersinia pestis and analytically identified more than
74,000 binary interactions. Using a library of biotinylated recombinant
proteins to probe a planar microarray comprised of immobilized proteins
that represented approximately 85% (3,552 proteins) of the Y. pestis
proteome, we measured protein-protein interactions by fluorescence
intensity of the laser-scanned microarrays. We obtained kinetic interaction
data for >1,600 binary complexes by microarray-based, surface plasmon
resonance imaging, and identified several high-affinity (K_D ~ nM)
interactions. We applied a machine learning algorithm that used previously
reported experimental protein-protein interactions from Escherichia coli as a
training set in order to extract E. coli-like interactions from the Y. pestis
dataset. The node degree distribution of the resulting network, comprised of
2344 interactions between 314 proteins, approximates a power-law
distribution typical of scale-free networks. Functional annotation clustering
of proteins within the network revealed statistically enriched complexes and
pathways involved in diverse biological processes. Among the more notable
protein assemblies identified were components of the RNA polymerase
enzyme and ribosomes. Small modules of proteins related to various
metabolic pathways, as well as previously reported interactions involved in
homologous recombination and fatty acid biosynthesis, were also present in
the network. Two highly interconnected network sub-regions contained a
large percentage of proteins with functions linked to transcription and
translation.
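The node degree distribution mentioned above can be computed directly from a list of binary interactions, as in the Python sketch below. The slope of a least-squares line on log-log axes is included only as a crude indicator of power-law-like decay; it is not a rigorous test and is not necessarily how the distribution was assessed in this study. The function names and the small edge list are hypothetical, and the slope is undefined when all proteins share one degree.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of proteins with exactly k partners."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(n) for n in distribution.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical interactions among five proteins, with P1 acting as a hub.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"), ("P2", "P3")]
distribution = degree_distribution(edges)
slope = loglog_slope(distribution)
```

A negative slope reflects the pattern of many low-degree proteins and few hubs that is typical of scale-free networks.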
Conclusions: We have systematically identified and analyzed thousands
of direct binary protein interactions within Y. pestis. This new benchmark
data set will serve as a critical tool for the analysis of protein interaction
networks functioning within an important human pathogen.
A10
NGS-Logistics: data infrastructure for efficient analysis of NGS sequence
variants across multiple centers
Amin Ardeshirdavani1,2,4*, Erika Souche3,4, Luc Dehaspe3,4, Jeroen Van Houdt3,4,
Joris Robert Vermeesch3,4, Yves Moreau1,2,4
1KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for
Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium;
2iMinds Medical IT Department, Kasteelpark Arenberg 10, Box 2446, 3001
Leuven, Belgium; 3KU Leuven, Center of Human Genetics Gasthuisberg, O&N I
Herestraat 49 - box 602, 3000 Leuven, Belgium; 4KU Leuven Department of
Human Genetics Gasthuisberg, O&N I Herestraat 49 - box 602, 3000 Leuven,
Belgium
BMC Bioinformatics 2015, 16(Suppl 2):A10
Background: Next-Generation Sequencing (NGS) is a key tool in genomics,
in particular in research and diagnostics of human Mendelian, oligogenic,
and complex disorders [1]. Multiple projects now aim at mapping
human genetic variation on a large scale, such as the 1,000 Genomes
Project and the UK 100k Genome Project. Meanwhile, with the dramatic
decrease in price and turnaround time, large amounts of human
sequencing data have been generated over the past decade [2]. As of
January 2014, about 2,555 sequencers were spread over 920 centers across
the world [3]. As a result, about 100,000 human exomes have been
sequenced so far [4]. Crucially, the speed at which NGS data is produced
greatly surpasses Moore’s law [5] and challenges our ability to conveniently
store, exchange, and analyze this data. Data pre-processing is needed to
extract reliable information from sequencing data and it can be divided into
two major steps: primary analysis (image analysis and base calling) and
secondary analysis. When looking for variation in the human genome,
secondary analysis consists of aligning/mapping the reads against the
reference genome and scanning the alignment for variation. Both raw data
and mapped reads are large files occupying significant disk storage space.
The collection of files resulting from the analysis of a single whole genome
study can take up to 50Gb of disk space. This raises significant issues in
terms of computing and data storage and transfer, with off-site data transfer
currently being a key bottleneck. Moreover, the analysis of NGS data also
raises the major challenge of how to reconcile federated analysis of personal
genomic data and confidentiality of data to protect privacy. In many
situations, the analysis of data from a single study alone will be much less
powerful than if it can be correlated with other studies. In particular, when
investigating a mutation of interest, it is extremely useful to obtain data
about other patients or controls sharing similar mutations. However,
personal genome data (whole genome, exome, transcriptome data, etc.) is
sensitive personal data. Confidentiality of this data must be guaranteed at all
times and only duly authorized researchers should access such personal
data.
Methods: To address all challenges described above, we developed a data
structure, NGS-Logistics, which fulfills all requirements of a successful
application that can process data inclusively and comprehensively from
multiple sources while guaranteeing privacy and security. NGS-Logistics is a
web-based application providing a data structure to analyze NGS data in a
distributed way. The data can be located in any data center, anywhere in
the world. NGS-Logistics provides an environment in which researchers do
not need to worry about the physical location of the data (Figure 1). With
respect to users’ rights, queries will be sent to each remote server. The host
will process the request and return the results to the main server,
where all privacy limitations on the data are controlled. Once the results
are ready, the end user can see the desired information. Depending on the
type of query, results will be divided into two parts. The first part is related
to the samples to which the user has authorized access, and for which the
user can see all details. The second part contains results for the whole
population, for which the user has access only to some aggregate statistics
without details. An example of such a query would be to review the
mutations present at a single genomic position in each individual patient
from a set of patients to which the user has authorized access (1st part) and
to contrast these results with the background frequency of mutation in the
reference populations (2nd part) (Figure 2).
Results: The pilot version of NGS-Logistics has been installed and is
currently being beta-tested by users at the Center for Human Genetics of
the University of Leuven. Currently we have two installations of the
system, the first one at the Leuven University Hospitals and the second
one at the Flemish Supercomputing Center (VSC). The development of
NGS-Logistics has significantly reduced the effort and time needed to
evaluate the significance of mutations from full genome sequencing and
exome sequencing, in a safe and confidential environment. This platform
provides more opportunities for operators who are interested in
expanding their queries and further analysis.
Figure 1(abstract A10) NGS-Logistics components. Users pass their queries from the NGS-Logistics web interface to the clients. Requests are stored and
scheduled in the main database. Each center has one database, being the only way of communication between centers and the main system. Centers
and their databases are connected through a secured connection, to which only valid and trusted IPs are allowed to connect. The query manager is
responsible for tracking and running the requests, as well as collecting and returning the results to the main system
Figure 2(abstract A10) Single Point Query result page (Statistic section) for chr9:2115841. The query of chr9:2115841 shows that only one sample
is polymorphic at this position. All samples that can be genotyped at this position from the active data set, control data set and whole are homozygous
reference. The MAF of this variant in each data set is thus very low
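The two-part query result described in the Methods of abstract A10 (full details for samples the user is authorized to see, aggregate statistics only for the whole population) can be illustrated with a small Python sketch. This is a hypothetical model of the idea and not the NGS-Logistics implementation; the function name, the data layout, the sample identifiers and the assumption of diploid genotypes are all made up for the example.

```python
def single_point_query(genotypes, authorized_samples):
    """Split a single-position query result into detailed and aggregate parts.

    `genotypes` maps a sample ID to its alternate-allele count (0, 1 or 2)
    at the queried position; diploid samples are assumed.
    """
    # Part 1: per-sample details, restricted to samples the user may access.
    details = {s: g for s, g in genotypes.items() if s in authorized_samples}
    # Part 2: population-level summary that exposes no individual genotype.
    n_alleles = 2 * len(genotypes)
    aggregate = {
        "n_samples": len(genotypes),
        "alt_allele_frequency": sum(genotypes.values()) / n_alleles if n_alleles else 0.0,
    }
    return details, aggregate


# Hypothetical genotypes at one position across four samples.
genotypes = {"S1": 0, "S2": 1, "S3": 0, "S4": 0}
details, aggregate = single_point_query(genotypes, authorized_samples={"S1", "S2"})
```

Separating the two parts at the point where results are assembled means a user never receives per-sample data outside their authorization, while still seeing the background frequency.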
References
1. Voelkerding KV, Dames SA, Durtschi JD: Next-generation sequencing: from
basic research to diagnostics. Clin Chem 2009, 55(4):641-658.
2. National Human Genome Research Institute: DNA Sequencing Costs 2013.
3. Next Generation Genomics: World Map of High-throughput Sequencers.
[http://omicsmaps.com/].
4. Human genome: Genomes by the thousand. Nature 2010,
467(7319):1026-1027.
5. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing
Program (GSP). [http://www.genome.gov/sequencingcosts/].
Cite abstracts in this supplement using the relevant abstract number,
e.g.: Ardeshirdavani et al.: NGS-Logistics: data infrastructure for efficient
analysis of NGS sequence variants across multiple centers. BMC
Bioinformatics 2015, 16(Suppl 2):A10