COGNAC: Comparative Genomics -- Algorithms and Complexity



Prof Leszek A. Gasieniec, Prof Paul H. Leng
Department of Computer Science, University of Liverpool
Prof Steve Kemp
School of Biological Scicence, University of Liverpool

Final Report Some of the most challenging and influential opportunities for computer science arise in developing and applying information technology to understand the principles of life, and in particular, the molecular mechanics of the cell. The pace of extraordinary advances in molecular biology has accelerated in the past decade due to discoveries coming from genome projects on human and model organisms. The breadth and density of genome data that must be stored, accessed, and analyzed, indicates that computation will play a central role in the post-genomic phase. In fact, one can say that the arrival of these enormous amounts of data already turned molecular biology into a computationally intensive discipline. And indeed, the results of recent work show, that many fundamental algorithmic methods became very useful (and many others seem to be very promising) in dealing with new challenging problems in computational molecular biology.

In short, comparative genomics can be defined as the analysis and comparison of genomes from different species. The core research is primarily focused on the study of human genetics by comparisons with model organisms such as mice, the fruit fly, and the bacterium E. coli. The purpose of comparative genomics is to gain better understanding of how species have evolved in time and how to determine the function and regulations of genes and non-coding regions of the genome. So far, researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes, including: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of non-coding DNA in each genome, and highly conserved regions maintained in organisms of diverse complexity. Comparative genomics has also many biological applications, including the examination of diseases caused by genome rearrangements, the location of homologous genes in genomes that have not yet been sequenced, the estimation of phylogenetic relationships, and the investigation of mechanisms of genome evolution. All this can be done only by efficient and robust procedures detecting meaningful patterns in nucleotide point mutations or gene order that may reveal molecular evolutionary mechanisms and phylogenetic relationships. And in this context we may find useful recent developments in discrete mathematics and algorithmics, including: combinatorial group testing (screening library of clones), time/space efficient searching methods (discovering specific patterns and regularities in DNA sequences), compact representation (indexing) of massive data, data mining tools (reasoning based on association rules), and many others.

In our explorations, we have undertaken an interdisciplinary research effort focusing on integration of efficient algorithmic methodologies with challenging combinatorial problems arising in comparative genomics. In particular we investigated the following research problems:

·    Realizing Shapes From the viewpoint of molecular evolution theory, over the past decade P. Schuster and co-workers Stadler, Reidys, Hofacker, Bronberg-Bauer, Flamm, and others, see [3] have studied properties of neural networks within the context of protein and especially RNA (and in particular 2D) structures. For example, it is known [3, 12, 13, 14] that for any two secondary structures S, S' there exists an RNA sequence compatible with both, and that this result does not extend to more than two secondary structures. Indeed, a simple formula for the number of RNA sequences compatible with secondary structures S, S' plays a role in the algorithms of Flamm et al. [7] and of Abfalter et al. [1] to design an RNA switch. Here we show that a natural extension of this problem is NP-complete. Unless P=NP, there is no polynomial time algorithm, which when given secondary structures S_1,...,S_k, for k ≥ 4, determines the least number of positions, such that after removal of all base pairs incident to these positions there exists an RNA nucleotide sequence compatible with the given secondary structures. We also considered a restricted version of this problem with a "fixed maximum" number of possible stars and show that it has a simple polynomial time solution.
This is a joint work with P. Clote (Boston College, USA), L.A. Gasieniec (Liverpool University, UK), R. Kolpakov (Moscow State University, Russia), E. Kranakis (Carleton University, Canada) and D. Krizanc (Wesleyan University, USA). The results of this work were published in Journal of Theoretical Biology, see [5].


·    Locating Conserved Gene Pairs Given the genomes of two related species, one important task is to uncover and locate the conserved genes, i.e., genes sharing similar functions. This task is non-trivial as most parts of a genome are non-coding areas and the information of genes of each genome is often not available. Delcher et al. [6] had given a very useful observation: A conserved gene rarely contains the same entire sequence in the two genomes, but it shares a lot of short common substrings that are unique to this conserved gene (MUM pair). However, not every MUM pair corresponds to conserved genes. The key step is how to select the right pairs.
Different approaches have been proposed to select the right MUM pairs. Earlier work uses longest common subsequence (LCS) approach. Some other work models the problem as a combinatorial clustering problem. We propose a new structural optimization problem called the Mutated Subsequence Problem, which gives consideration to possible mutations between two species (in the form of reversals and transpositions) when comparing the genomes. A practical algorithm called mutated subsequence algorithm (MSS) is devised to solve this optimization problem, and it has been evaluated using different pairs of human and mouse chromosomes, and different pairs of virus genomes of Baculoviridae. MSS is found to be effective and efficient; in particular, MSS can reveal >90% of the conserved genes of human and mouse that have been reported in the literature. When compared with existing softwares MUMmer [11] and MaxMinCluster [15], MSS uncovers 14 and 7% more genes on average, respectively. Furthermore, a hybrid approach is devised to integrate MUMmer or MaxMinCluster with MSS, which has better performance and reliability.
This is a joint work H.L. Chan, T.W. Lam, S.M. Yiu and X. Fan (University of Hong Kong, China ), W.K. Sung ( National University of Singapore ), and P.W.H. Wong ( Liverpool University, UK ). The results of this work were published in Bioinformatics, see [4].


·    Efficient Probe SelectionThe DNA micro-array technology [9], originally developed to measure the level of gene expression, becomes one of the most widely used tools in genomic study. Micro-arrays have been proved to benefit areas including gene discovery, disease diagnosis, and multi-virus discovery. The crux of micro-array design lies in how to select a unique probe [10] that distinguishes a given genomic sequence from other sequences. However, in cases that the existence of a unique probe is unlikely, e.g., in the context of a large family of closely homologous genes, the use of a limited number of non-unique probes is still desirable. Because of its significance, probe selection attracts a lot of attention. Various probe selection algorithms have been developed in recent years. Good probe selection algorithms should produce a small number of candidate probes as possible. Efficiency is also crucial because the data involved is usually huge. Most existing algorithms usually select probes by filtering, which is usually not selective enough and quite a large number of probes are returned. We propose a new direction to tackle the problem and give an efficient algorithm to select (randomly) a small set of probes and demonstrate that such a small set of probes is sufficient to distinguish each sequence from all the other sequences. Based on the algorithm, we have developed a probe selection software RandP S, which runs efficiently and effectively in practice. A number of experiments have been carried out and the results will be discussed.
This is a joint work of L.A. Gasieniec, C.Y. Li, and P.W.H. Wong (University of Liverpool, UK) and P. Sant (currently University of Luton, UK). The results of this work were presented at DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data, June 20 - 22, 2005 DIMACS Center, at Rutgers University (pic). The paper [8] is currently submitted to one of the journals in the field.


·    Micro-array Comparison ToolThere have been discussions on the possibility of designing and implementing a tool that would allow the comparison of micro-array experiments (experiments involving mouse from various public repositories such as NCBI GEO and array express). The aim of the tool is to allow the user to search the repositories for experiments (based on choosing some subset of genes) and to be able to compare the base experimental results with the results of others. This would then (hopefully) guide the user to similar genes which show similar expression patterns (i.e. more (or less) protein is produced if the mouse is exposed to some infection). The tool would involve data mining techniques in order to trawl through the vast amount of experimental data and obtain the relevant data to be compared. It is hoped that from this that it would be possible to go some way towards identifying the genes responsible for making a particular strain resistant (or equivalently, susceptible) to sleeping sickness.
There are several challenges involved, one of which is how to normalize the data (currently we just have a list of values which are somewhat dependent on experimental conditions). What we are currently thinking of is a world in which gene expression levels are either up, down or neutral. We believe that what we are proposing is useful and we aim to begin developing a prototype tool which will hopefully lead to a more advanced tool in the near future.
This is a joint work of A. Broadhead, I. Morton (Biosciences, University of Liverpool, UK) and P. Sant (University of Luton, UK). The work is in the early stage of research explorations.

Project Assessment We believe that research explorations on selected computational problems in comparative genomics performed within the scope of this project were successful and of very high standard. The results of our collaborative work were published in top journals in the field: Journal of Theoretical Biology and Bioinformatics, and presented at high quality research events: DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data and IEEE 4th Symposium on Bioinformatics and Bioengineering, BIBE'2004.

The funding provided within the grant allowed to employ research associate Dr P. Sant (formerly King's College London)who, apart from lightening the academic load (teaching) of the principal investigator, he contributed to the main themes of this project. This resulted in publications and development of the probe selection software. We were very glad to learn that Dr Sant's stay in Liverpool helped him to find permanent position as a lecturer at the Department of Computer Science, the University of Luton, from September 1st, 2005. We were also delighted to welcome other academics to participate in this project and in particular Dr Prudence Wong (lecturer in CTAG) who currently co-supervises (together with Prof Gasieniec) a PhD student Cindy Ying Li who specializes in combinatorics of micro-arrays, probe selection and more recently de novosequencing methods.

Perhaps the greatest advantage of this project was unforgettable opportunity to interact with top researchers in the field via meetings in Liverpool, research visits (Universitaet in Berlin, Germany; Ottawa University and Carlton University in Ottawa, Canada; and University of California in Riverside, USA) and workshop attendance ( 2nd Haifa Annual International Stringologyin April 2005 and DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data in June 2005. In this context special thanks are to Prof Gary Benson (Boston University, USA), Prof Andrew Brass (University of Manchester, UK), Prof Marek Chrobak (University of California, USA), Prof Peter Clote (Boston College, USA), Prof Evangelos Kranakis (Carleton University, Canada), Prof Gad M. Landau (Haifa University, Israel), Prof David W. Mount (University of Arizona, USA), Prof Laxmi Parida (IBM Watson, USA), Prof Knut Reinert (Frei Universitaet, Berlin), Prof Cenk Sahinalp (Simon Fraser University, Canada), David Sankoff (Ottawa University, Canada).

We strongly believe that work performed within this project is just a starting point in further research explorations (on possibly larger scale) in comparative genomics and other related topics in computational molecular biology. This includes collaboration with both academic and industrial partners. In this context we are in the process of initial discussions with Prof Chris Mundy who is currently at MerseyBio, Liverpool. We agreed that there is scope for collaboration on algorithm development for InTeraSeq's novel sequencing method, and that in particular we should focus on a grant application to include staff costs for at least 1 PhD studentship or post-doc (a good candidate would be Ms Cindy Ying Li, my PhD student, who is currently doing a relevant work). Chris Mundy made it clear that InTeraSeq is still in the process of formation, and it won't begin operations until the end of January. Though we agreed to talk again then about how an application could be configured. We were both to think about the best sources of funds. Furthermore, last summer we took part in preparation of two BBSRC Postgraduate Studentship applications: (1) Prediction of calcium-binding DxDxDG motifs in protein sequences, Dr Daniel Rigden (PI) in Biological Sciences and Dr Prudence Wong in Computer Science; and (2) Prediction of gene networks using gene expression and transcription factor binding sites data, Prof Steve Kemp (PI) and Dr Harry Noyes in Biological Sciences and Prof Leszek A. Gasieniec in Computer Science. We expect to learn more about the outcome of both applications in December.

Finally, as planned, we bought two personal computers and a digital projector which assisted us in all regular research activities forming parts of this project. The purchased equipment will stay in the Algorithms Group at the Department of Computer Science.

In conclusion, I would like to thank MRC for funding this project and my two main co-investigators Prof Steve Kemp and Prof Paul H. Leng for their constant support and encouragement.

Prof Leszek A Gasieniec
Liverpool, 21/11/2005.

Literature:

[1] I.C. Abfalter, C. Flamm, and P.F. Stadler, Design of multi-stable nucleic acid sequences, In Proc. German Conference on Bioinformatics, 2003

[2] A. De Bonis, L. Gasieniec, and U. Vaccaro, Generalized Framework for Selectors with Applications in Optimal Group Testing, In Proc. 30th International Colloquium on Automata, Languages and Programming, ICALP'2003, pp 81-96.

[3] E. Bornberg-Bauer, Random structures and evolution of biopolymers: A computational case study on RNA secondary structures, Pharmaceutica Acta Helvetiae, 71:79-85, 1996.

[4] H.L. Chan, T.W. Lam, W.K. Sung, P.W.H. Wong, S.M. Yiu and X. Fan, The Mutated Subsequence Problem and Locating Conserved Genes, Bioinformatics, 21(10):2271--2278, 2005, also in: A Mutation-Sensitive Approach for Locating Conserved Gene Pairs between Related Species, Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE), 2004, pp. 545--552.

[5] P. Clote, L. Gasieniec, R.M. Kolpakov, E. Kranakis, and D. Krizanc, On realizing shapes in the theory of RNA neutral networks, Journal of Theoretical Biology, 236(2) 2005, pp. 216-227.

[6] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White and S.L. Salzberg, Alignment of whole genomes, Nucleic Acids Res., 27, 2369-2376, 1999.

[7] C. Flamm, I.L. Hofacker, S. Mauer-Stroh, P.F. Stadler, and M. Zehl, Design of multistable RNA molecules, RNA, 7, 2001, pp. 254-265.

[8] L.A. Gasieniec, C.Y. Li, P. Sant, and P.W.H. Wong, Efficient Probe Selection in Microarray Design, submitted.

[9] D. Gerhold, T. Rushmore, and C.T. Caskey, DNA chips: promising toys have become powerful tools, Trends Biochem. Sci., 44:168-173, 1999.

[10] L. Kaderali and A. Schliep, Selecting signature oligonucleotides to identify organisms using DNA arrays, Bioinformatics, 18:1340-1349, 2002.

[11] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu and S.L. Salzberg, Versatile and open software for comparing large genomes, Genome Biology, 5, R12, 2004.

[12] C. Reidys, P.F. Stadler, and P. Schuster, Generic properties of combinatory maps: neural networks of RNA secondary structures, Bulletin of Biology, 59(2), 1997, pp. 339-397.

[13] P. Schuster, Molecular insights into evolution of phenotypes, In J.P. Critchfield and P. Schuster, Eds., Evolutionary Dynamics - Exploring the Interplay of Accident, Selection, Nautrality, and Function. Oxford University Press, New York, 2000.

[14] P. Schuster, W. Fontana, P.F. Stadler, and I.L. Hofacker, From sequences to shapes and back: A case study in RNA secodnary structures, Royal Society London,, Series B, 255, 1994, pp. 279-284.

[15] P.W.H. Wong, T.W. Lam, N. Lu, H.F. Ting and S.M. Yiu, An efficient algorithm for optimizing whole genome alignment with noise, Bioinformatics, 2004, 20(16):2676-2684.