COGNAC

Final Report Some of the most challenging and influential opportunities for computer science arise in developing and applying information technology to understand the principles of life, and in particular, the molecular mechanics of the cell. The pace of extraordinary advances in molecular biology has accelerated in the past decade due to discoveries coming from genome projects on human and model organisms. The breadth and density of genome data that must be stored, accessed, and analyzed, indicates that computation will play a central role in the post-genomic phase. In fact, one can say that the arrival of these enormous amounts of data already turned molecular biology into a computationally intensive discipline. And indeed, the results of recent work show, that many fundamental algorithmic methods became very useful (and many others seem to be very promising) in dealing with new challenging problems in computational molecular biology.

In short, comparative genomics can be defined as the analysis and comparison of genomes from different species. The core research is primarily focused on the study of human genetics by comparisons with model organisms such as mice, the fruit fly, and the bacterium E. coli. The purpose of comparative genomics is to gain better understanding of how species have evolved in time and how to determine the function and regulations of genes and non-coding regions of the genome. So far, researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes, including: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of non-coding DNA in each genome, and highly conserved regions maintained in organisms of diverse complexity. Comparative genomics has also many biological applications, including the examination of diseases caused by genome rearrangements, the location of homologous genes in genomes that have not yet been sequenced, the estimation of phylogenetic relationships, and the investigation of mechanisms of genome evolution. All this can be done only by efficient and robust procedures detecting meaningful patterns in nucleotide point mutations or gene order that may reveal molecular evolutionary mechanisms and phylogenetic relationships. And in this context we may find useful recent developments in discrete mathematics and algorithmics, including: combinatorial group testing (screening library of clones), time/space efficient searching methods (discovering specific patterns and regularities in DNA sequences), compact representation (indexing) of massive data, data mining tools (reasoning based on association rules), and many others.

In our explorations, we have undertaken an interdisciplinary research effort focusing on integration of efficient algorithmic methodologies with challenging combinatorial problems arising in comparative genomics. In particular we investigated the following research problems:

· Realizing Shapes From the viewpoint of molecular evolution theory, over the past decade P. Schuster and co-workers Stadler, Reidys, Hofacker, Bronberg-Bauer, Flamm, and others, see [3] have studied properties of neural networks within the context of protein and especially RNA (and in particular 2D) structures. For example, it is known [3, 12, 13, 14] that for any two secondary structures S, S' there exists an RNA sequence compatible with both, and that this result does not extend to more than two secondary structures. Indeed, a simple formula for the number of RNA sequences compatible with secondary structures S, S' plays a role in the algorithms of Flamm et al. [7] and of Abfalter et al. [1] to design an RNA switch. Here we show that a natural extension of this problem is NP-complete. Unless P=NP, there is no polynomial time algorithm, which when given secondary structures S_1,...,S_k, for k ≥ 4, determines the least number of positions, such that after removal of all base pairs incident to these positions there exists an RNA nucleotide sequence compatible with the given secondary structures. We also considered a restricted version of this problem with a "fixed maximum" number of possible stars and show that it has a simple polynomial time solution.
This is a joint work with P. Clote (Boston College, USA), L.A. Gasieniec (Liverpool University, UK), R. Kolpakov (Moscow State University, Russia), E. Kranakis (Carleton University, Canada) and D. Krizanc (Wesleyan University, USA). The results of this work were published in Journal of Theoretical Biology, see [5].

· Locating Conserved Gene Pairs Given the genomes of two related species, one important task is to uncover and locate the conserved genes, i.e., genes sharing similar functions. This task is non-trivial as most parts of a genome are non-coding areas and the information of genes of each genome is often not available. Delcher et al. [6] had given a very useful observation: A conserved gene rarely contains the same entire sequence in the two genomes, but it shares a lot of short common substrings that are unique to this conserved gene (MUM pair). However, not every MUM pair corresponds to conserved genes. The key step is how to select the right pairs.
Different approaches have been proposed to select the right MUM pairs. Earlier work uses longest common subsequence (LCS) approach. Some other work models the problem as a combinatorial clustering problem. We propose a new structural optimization problem called the Mutated Subsequence Problem, which gives consideration to possible mutations between two species (in the form of reversals and transpositions) when comparing the genomes. A practical algorithm called mutated subsequence algorithm (MSS) is devised to solve this optimization problem, and it has been evaluated using different pairs of human and mouse chromosomes, and different pairs of virus genomes of Baculoviridae. MSS is found to be effective and efficient; in particular, MSS can reveal >90% of the conserved genes of human and mouse that have been reported in the literature. When compared with existing softwares MUMmer [11] and MaxMinCluster [15], MSS uncovers 14 and 7% more genes on average, respectively. Furthermore, a hybrid approach is devised to integrate MUMmer or MaxMinCluster with MSS, which has better performance and reliability.
This is a joint work H.L. Chan, T.W. Lam, S.M. Yiu and X. Fan (University of Hong Kong, China ), W.K. Sung ( National University of Singapore ), and P.W.H. Wong ( Liverpool University, UK ). The results of this work were published in Bioinformatics, see [4].

· Efficient Probe SelectionThe DNA micro-array technology [9], originally developed to measure the level of gene expression, becomes one of the most widely used tools in genomic study. Micro-arrays have been proved to benefit areas including gene discovery, disease diagnosis, and multi-virus discovery. The crux of micro-array design lies in how to select a unique probe [10] that distinguishes a given genomic sequence from other sequences. However, in cases that the existence of a unique probe is unlikely, e.g., in the context of a large family of closely homologous genes, the use of a limited number of non-unique probes is still desirable. Because of its significance, probe selection attracts a lot of attention. Various probe selection algorithms have been developed in recent years. Good probe selection algorithms should produce a small number of candidate probes as possible. Efficiency is also crucial because the data involved is usually huge. Most existing algorithms usually select probes by filtering, which is usually not selective enough and quite a large number of probes are returned. We propose a new direction to tackle the problem and give an efficient algorithm to select (randomly) a small set of probes and demonstrate that such a small set of probes is sufficient to distinguish each sequence from all the other sequences. Based on the algorithm, we have developed a probe selection software RandP S, which runs efficiently and effectively in practice. A number of experiments have been carried out and the results will be discussed.
This is a joint work of L.A. Gasieniec, C.Y. Li, and P.W.H. Wong (University of Liverpool, UK) and P. Sant (currently University of Luton, UK). The results of this work were presented at DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data, June 20 - 22, 2005 DIMACS Center, at Rutgers University (pic). The paper [8] is currently submitted to one of the journals in the field.

· Micro-array Comparison ToolThere have been discussions on the possibility of designing and implementing a tool that would allow the comparison of micro-array experiments (experiments involving mouse from various public repositories such as NCBI GEO and array express). The aim of the tool is to allow the user to search the repositories for experiments (based on choosing some subset of genes) and to be able to compare the base experimental results with the results of others. This would then (hopefully) guide the user to similar genes which show similar expression patterns (i.e. more (or less) protein is produced if the mouse is exposed to some infection). The tool would involve data mining techniques in order to trawl through the vast amount of experimental data and obtain the relevant data to be compared. It is hoped that from this that it would be possible to go some way towards identifying the genes responsible for making a particular strain resistant (or equivalently, susceptible) to sleeping sickness.
There are several challenges involved, one of which is how to normalize the data (currently we just have a list of values which are somewhat dependent on experimental conditions). What we are currently thinking of is a world in which gene expression levels are either up, down or neutral. We believe that what we are proposing is useful and we aim to begin developing a prototype tool which will hopefully lead to a more advanced tool in the near future.
This is a joint work of A. Broadhead, I. Morton (Biosciences, University of Liverpool, UK) and P. Sant (University of Luton, UK). The work is in the early stage of research explorations.

[1] I.C. Abfalter, C. Flamm, and P.F. Stadler, Design of multi-stable nucleic acid sequences, In Proc. German Conference on Bioinformatics, 2003

[2] A. De Bonis, L. Gasieniec, and U. Vaccaro, Generalized Framework for Selectors with Applications in Optimal Group Testing, In Proc. 30th International Colloquium on Automata, Languages and Programming, ICALP'2003, pp 81-96.

[3] E. Bornberg-Bauer, Random structures and evolution of biopolymers: A computational case study on RNA secondary structures, Pharmaceutica Acta Helvetiae, 71:79-85, 1996.

[4] H.L. Chan, T.W. Lam, W.K. Sung, P.W.H. Wong, S.M. Yiu and X. Fan, The Mutated Subsequence Problem and Locating Conserved Genes, Bioinformatics, 21(10):2271--2278, 2005, also in: A Mutation-Sensitive Approach for Locating Conserved Gene Pairs between Related Species, Proceedings of IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE), 2004, pp. 545--552.

[5] P. Clote, L. Gasieniec, R.M. Kolpakov, E. Kranakis, and D. Krizanc, On realizing shapes in the theory of RNA neutral networks, Journal of Theoretical Biology, 236(2) 2005, pp. 216-227.

[6] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White and S.L. Salzberg, Alignment of whole genomes, Nucleic Acids Res., 27, 2369-2376, 1999.

[7] C. Flamm, I.L. Hofacker, S. Mauer-Stroh, P.F. Stadler, and M. Zehl, Design of multistable RNA molecules, RNA, 7, 2001, pp. 254-265.

[8] L.A. Gasieniec, C.Y. Li, P. Sant, and P.W.H. Wong, Efficient Probe Selection in Microarray Design, submitted.

[9] D. Gerhold, T. Rushmore, and C.T. Caskey, DNA chips: promising toys have become powerful tools, Trends Biochem. Sci., 44:168-173, 1999.

[10] L. Kaderali and A. Schliep, Selecting signature oligonucleotides to identify organisms using DNA arrays, Bioinformatics, 18:1340-1349, 2002.

[11] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu and S.L. Salzberg, Versatile and open software for comparing large genomes, Genome Biology, 5, R12, 2004.

[12] C. Reidys, P.F. Stadler, and P. Schuster, Generic properties of combinatory maps: neural networks of RNA secondary structures, Bulletin of Biology, 59(2), 1997, pp. 339-397.

[13] P. Schuster, Molecular insights into evolution of phenotypes, In J.P. Critchfield and P. Schuster, Eds., Evolutionary Dynamics - Exploring the Interplay of Accident, Selection, Nautrality, and Function. Oxford University Press, New York, 2000.

[14] P. Schuster, W. Fontana, P.F. Stadler, and I.L. Hofacker, From sequences to shapes and back: A case study in RNA secodnary structures, Royal Society London,, Series B, 255, 1994, pp. 279-284.

[15] P.W.H. Wong, T.W. Lam, N. Lu, H.F. Ting and S.M. Yiu, An efficient algorithm for optimizing whole genome alignment with noise, Bioinformatics, 2004, 20(16):2676-2684.