Final Report
Some of the most challenging and influential opportunities for computer
science arise in developing and applying information technology to understand
the principles of life, and in particular, the molecular mechanics of
the cell. The pace of extraordinary advances in molecular biology has
accelerated in the past decade due to discoveries coming from genome
projects on human and model organisms. The breadth and density of genome
data that must be stored, accessed, and analyzed, indicates that computation
will play a central role in the post-genomic phase. In fact, one can say
that the arrival of these enormous amounts of data already turned molecular
biology into a computationally intensive discipline. And indeed, the results
of recent work show, that many fundamental algorithmic methods became very
useful (and many others seem to be very promising) in dealing with new
challenging problems in computational molecular biology.
·
Realizing Shapes From the viewpoint of molecular evolution
theory, over the past decade P. Schuster and co-workers Stadler, Reidys,
Hofacker, Bronberg-Bauer, Flamm, and others, see
[3] have studied
properties of neural networks within the context of protein and
especially RNA (and in particular 2D) structures. For example, it is
known [3,
12,
13,
14] that for any two secondary structures S, S' there exists
an RNA sequence compatible with both, and that this result does not
extend to more than two secondary structures. Indeed, a simple formula
for the number of RNA sequences compatible with secondary structures
S, S' plays a role in the algorithms of Flamm et al.
[7] and of
Abfalter et al.
[1] to design an RNA switch.
Here we show that a natural extension of this problem is NP-complete.
Unless P=NP, there is no polynomial time algorithm, which when given
secondary structures S_1,...,S_k, for k ≥ 4, determines the
least number of positions, such that after removal of all base pairs
incident to these positions there exists an RNA nucleotide sequence
compatible with the given secondary structures. We also considered
a restricted version of this problem with a "fixed maximum" number
of possible stars and show that it has a simple polynomial time solution.
·
Locating Conserved Gene Pairs Given the genomes of two related species, one important task
is to uncover and locate the conserved genes, i.e., genes
sharing similar functions. This task is non-trivial as most
parts of a genome are non-coding areas and the information of
genes of each genome is often not available. Delcher et al.
[6]
had given a very useful observation: A conserved gene rarely
contains the same entire sequence in the two genomes, but it
shares a lot of short common substrings that are unique to this
conserved gene (MUM pair). However, not every MUM pair corresponds
to conserved genes. The key step is how to select the right pairs.
·
Efficient Probe SelectionThe DNA micro-array technology
[9],
originally developed to measure the level of gene expression, becomes
one of the most widely used tools in genomic study. Micro-arrays have been
proved to benefit areas including gene discovery, disease diagnosis,
and multi-virus discovery.
The crux of micro-array design lies in how to select a unique probe
[10]
that distinguishes a given genomic sequence from other sequences. However,
in cases that the existence of a unique probe is unlikely, e.g., in
the context of a large family of closely homologous genes, the use of
a limited number of non-unique probes is still desirable. Because of its
significance, probe selection attracts a lot of attention. Various probe
selection algorithms have been developed in recent years. Good probe
selection algorithms should produce a small number of candidate probes
as possible. Efficiency is also crucial because the data involved is
usually huge. Most existing algorithms usually select probes by filtering,
which is usually not selective enough and quite a large number of probes
are returned. We propose a new direction to tackle the problem and give
an efficient algorithm to select (randomly) a small set of probes and
demonstrate that such a small set of probes is sufficient to distinguish
each sequence from all the other sequences. Based on the algorithm, we
have developed a probe selection software RandP S, which runs efficiently
and effectively in practice. A number of experiments have been carried
out and the results will be discussed.
·
Micro-array Comparison ToolThere have been discussions on
the possibility of designing and implementing a tool that would allow the
comparison of micro-array experiments (experiments involving mouse from
various public repositories such as NCBI GEO and array express). The aim of the tool
is to allow the user to search the repositories for
experiments (based on choosing some subset of genes) and to be able to
compare the base experimental results with the results of others. This
would then (hopefully) guide the user to similar genes which show
similar expression patterns (i.e. more (or less) protein is produced if
the mouse is exposed to some infection). The tool would involve data mining
techniques in order to trawl through
the vast amount of experimental data and obtain the relevant data to be
compared. It is hoped that from this that it would be possible to go
some way towards identifying the genes responsible for making a
particular strain resistant (or equivalently, susceptible) to sleeping
sickness.
In short, comparative genomics can be defined as the analysis and
comparison of genomes from different species. The core research is primarily
focused on the study of human genetics by comparisons with model organisms
such as mice, the fruit fly, and the bacterium E. coli. The purpose of
comparative genomics is to gain better understanding of how species
have evolved in time and how to determine the function and regulations of genes
and non-coding regions of the genome. So far, researchers have learned a
great deal about the function of human genes by examining their counterparts
in simpler model organisms such as the mouse. Genome researchers look
at many different features when comparing genomes, including: sequence
similarity, gene location, the length and number of coding regions (called
exons) within genes, the amount of non-coding DNA in each genome, and highly
conserved regions maintained in organisms of diverse complexity.
Comparative genomics has also many biological applications, including the
examination of diseases caused by genome rearrangements, the location of
homologous genes in genomes that have not yet been sequenced, the estimation
of phylogenetic relationships, and the investigation of mechanisms of genome
evolution. All this can be done only by efficient and robust procedures
detecting meaningful patterns in nucleotide point mutations or gene order
that may reveal molecular evolutionary mechanisms and phylogenetic
relationships. And in this context we may find useful recent developments
in discrete mathematics and algorithmics, including: combinatorial group
testing (screening library of clones), time/space efficient searching
methods (discovering specific patterns and regularities in DNA sequences),
compact representation (indexing) of massive data, data mining tools
(reasoning based on association rules), and many others.
In our explorations, we have undertaken an interdisciplinary research effort
focusing on integration of efficient algorithmic methodologies with
challenging combinatorial problems arising in comparative genomics.
In particular we investigated the following research problems:
This is a joint work with P. Clote
(Boston College, USA), L.A. Gasieniec
(Liverpool University, UK), R. Kolpakov
(Moscow State University, Russia),
E. Kranakis
(Carleton University, Canada) and D. Krizanc
(Wesleyan University, USA). The results of this work were published in Journal
of Theoretical Biology, see
[5].
Different approaches have been proposed to select the right
MUM pairs. Earlier work uses longest common subsequence (LCS)
approach. Some other work models the problem as a combinatorial
clustering problem. We propose a new structural optimization
problem called the Mutated Subsequence Problem, which gives
consideration to possible mutations between two species (in the form
of reversals and transpositions) when comparing the genomes.
A practical algorithm called mutated subsequence algorithm (MSS)
is devised to solve this optimization problem, and it has been
evaluated using different pairs of human and mouse chromosomes,
and different pairs of virus genomes of Baculoviridae. MSS is found
to be effective and efficient; in particular, MSS can reveal >90%
of the conserved genes of human and mouse that have been reported
in the literature. When compared with existing softwares MUMmer
[11] and MaxMinCluster
[15],
MSS uncovers 14 and 7% more genes on average,
respectively. Furthermore, a hybrid approach is devised
to integrate MUMmer or MaxMinCluster with MSS, which has better
performance and reliability.
This is a joint work H.L. Chan,
T.W. Lam,
S.M. Yiu and
X. Fan
(University of Hong Kong, China ),
W.K. Sung (
National University
of Singapore ), and P.W.H. Wong (
Liverpool University, UK ).
The results of this work were published in Bioinformatics, see
[4].
This is a joint work of L.A. Gasieniec,
C.Y. Li, and P.W.H. Wong
(University of Liverpool, UK)
and P. Sant
(currently University of Luton, UK).
The results of this work were presented at DIMACS Workshop on Detecting and Processing Regularities in High
Throughput Biological Data, June 20 - 22,
2005 DIMACS Center, at Rutgers University (pic). The paper
[8] is currently submitted
to one of the journals in the field.
There are several challenges involved, one of which is how to normalize
the data (currently we just have a list of values which are somewhat
dependent on experimental conditions). What we are currently thinking of
is a world in which gene expression levels are either up, down or
neutral. We believe that what we are proposing is useful and we aim to
begin developing a prototype tool which will hopefully lead to a more
advanced tool in the near future.
This is a joint work of A. Broadhead, I. Morton (Biosciences, University
of Liverpool, UK) and P. Sant
(University of Luton, UK). The work is in
the early stage of research explorations.
Project Assessment
We believe that research explorations on selected
computational problems in comparative genomics performed within the scope
of this project were successful and of very high standard.
The results of our collaborative work were published in top journals in
the field: Journal of Theoretical Biology and Bioinformatics,
and presented at high quality research events: DIMACS Workshop on Detecting and Processing Regularities in High
Throughput Biological Data and
IEEE 4th Symposium on Bioinformatics and Bioengineering, BIBE'2004.
The funding provided within the grant allowed to employ research associate
Dr P. Sant
(formerly King's College London)who, apart from lightening
the academic load (teaching) of the principal investigator, he contributed to
the main themes of this project. This resulted in publications and development
of the probe selection software.
We were very glad to learn that Dr Sant's stay in Liverpool helped him to
find permanent position as a lecturer at the
Department of Computer Science, the University of Luton, from September 1st, 2005.
We were also delighted to welcome other academics to participate in this
project and in particular Dr Prudence Wong
(lecturer in CTAG) who currently
co-supervises (together with Prof Gasieniec) a PhD student
Cindy Ying Li
who specializes in combinatorics of micro-arrays, probe selection and more
recently de novosequencing methods.
Perhaps the greatest advantage of this project was unforgettable opportunity
to interact with top researchers in the field via meetings in Liverpool,
research visits
(Universitaet in Berlin, Germany;
Ottawa University and
Carlton University in Ottawa, Canada;
and
University of California in Riverside,
USA) and workshop attendance (
2nd Haifa Annual International
Stringologyin April 2005
and DIMACS Workshop on Detecting and Processing Regularities in High
Throughput Biological Data in June 2005.
In this context special thanks are to
Prof Gary Benson
(Boston University, USA),
Prof Andrew Brass
(University of Manchester, UK),
Prof Marek Chrobak
(University of California, USA),
Prof Peter Clote
(Boston College, USA),
Prof Evangelos Kranakis
(Carleton University, Canada),
Prof Gad M. Landau
(Haifa University, Israel),
Prof David W. Mount
(University of Arizona, USA),
Prof Laxmi Parida
(IBM Watson, USA),
Prof Knut Reinert
(Frei Universitaet, Berlin),
Prof Cenk Sahinalp
(Simon Fraser University, Canada),
David Sankoff
(Ottawa University, Canada).
We strongly believe that work performed within this project is just a starting
point in further research explorations (on possibly larger scale) in
comparative genomics and other related topics in computational molecular
biology. This includes collaboration with both academic and industrial partners.
In this context we are in the process of initial discussions with Prof Chris
Mundy who is currently at
MerseyBio, Liverpool. We agreed that there is scope
for collaboration on algorithm development for InTeraSeq's novel sequencing
method, and that in particular we should focus on a grant application to include
staff costs for at least 1 PhD studentship or post-doc (a good candidate would
be Ms
Cindy Ying Li,
my PhD student, who is currently doing a relevant work).
Chris Mundy made it clear that InTeraSeq is still in the process of formation,
and it won't begin operations until the end of January. Though we agreed to
talk again then about how an application could be configured. We were both
to think about the best sources of funds. Furthermore, last summer we took part
in preparation of two BBSRC Postgraduate Studentship applications:
(1) Prediction of calcium-binding DxDxDG motifs in protein sequences,
Dr Daniel Rigden (PI)
in Biological Sciences and
Dr Prudence Wong
in Computer Science; and
(2) Prediction of gene networks using gene expression and transcription factor
binding sites data,
Prof Steve Kemp (PI)
and
Dr Harry Noyes
in Biological Sciences
and
Prof Leszek A. Gasieniec
in Computer Science.
We expect to learn more about the outcome of both applications in December.
Finally, as planned, we bought two personal computers and a digital
projector which assisted us in all regular research activities
forming parts of this project. The purchased equipment will stay in the
Algorithms Group at
the Department of Computer Science.
In conclusion, I would like to thank
MRC for funding this project and my two
main co-investigators
Prof Steve Kemp
and
Prof Paul H. Leng for their constant
support and encouragement.
Prof Leszek A Gasieniec
Liverpool, 21/11/2005.
Literature:
[1] I.C. Abfalter, C. Flamm, and P.F. Stadler, Design of multi-stable nucleic acid sequences, In Proc. German Conference on Bioinformatics, 2003
[2] A. De Bonis, L. Gasieniec, and U. Vaccaro, Generalized Framework for Selectors with Applications in Optimal Group Testing, In Proc. 30th International Colloquium on Automata, Languages and Programming, ICALP'2003, pp 81-96.
[3] E. Bornberg-Bauer, Random structures and evolution of biopolymers: A computational case study on RNA secondary structures, Pharmaceutica Acta Helvetiae, 71:79-85, 1996.
[4] H.L. Chan, T.W. Lam, W.K. Sung, P.W.H. Wong, S.M. Yiu and X. Fan, The Mutated Subsequence Problem and Locating Conserved Genes, Bioinformatics, 21(10):2271--2278, 2005, also in: A Mutation-Sensitive Approach for Locating Conserved Gene Pairs between Related Species, Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE), 2004, pp. 545--552.
[5] P. Clote, L. Gasieniec, R.M. Kolpakov, E. Kranakis, and D. Krizanc, On realizing shapes in the theory of RNA neutral networks, Journal of Theoretical Biology, 236(2) 2005, pp. 216-227.
[6] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White and S.L. Salzberg, Alignment of whole genomes, Nucleic Acids Res., 27, 2369-2376, 1999.
[7] C. Flamm, I.L. Hofacker, S. Mauer-Stroh, P.F. Stadler, and M. Zehl, Design of multistable RNA molecules, RNA, 7, 2001, pp. 254-265.
[8] L.A. Gasieniec, C.Y. Li, P. Sant, and P.W.H. Wong, Efficient Probe Selection in Microarray Design, submitted.
[9] D. Gerhold, T. Rushmore, and C.T. Caskey, DNA chips: promising toys have become powerful tools, Trends Biochem. Sci., 44:168-173, 1999.
[10] L. Kaderali and A. Schliep, Selecting signature oligonucleotides to identify organisms using DNA arrays, Bioinformatics, 18:1340-1349, 2002.
[11] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu and S.L. Salzberg, Versatile and open software for comparing large genomes, Genome Biology, 5, R12, 2004.
[12] C. Reidys, P.F. Stadler, and P. Schuster, Generic properties of combinatory maps: neural networks of RNA secondary structures, Bulletin of Biology, 59(2), 1997, pp. 339-397.
[13] P. Schuster, Molecular insights into evolution of phenotypes, In J.P. Critchfield and P. Schuster, Eds., Evolutionary Dynamics - Exploring the Interplay of Accident, Selection, Nautrality, and Function. Oxford University Press, New York, 2000.
[14] P. Schuster, W. Fontana, P.F. Stadler, and I.L. Hofacker, From sequences to shapes and back: A case study in RNA secodnary structures, Royal Society London,, Series B, 255, 1994, pp. 279-284.
[15] P.W.H. Wong, T.W. Lam, N. Lu, H.F. Ting and S.M. Yiu, An efficient algorithm for optimizing whole genome alignment with noise, Bioinformatics, 2004, 20(16):2676-2684.