A Short Primer on Bioinformatics


Introduction

With its implied double dose of biology and computer science, bioinformatics can be an intimidating topic. The word itself is a somewhat fuzzy recent coinage that covers a whole group of rapidly expanding, loosely coupled disciplines. Many of the people active in bioinformatics prefer the more descriptive ( and older ) term computational microbiology, but as the use of computers spreads into other areas of biology, it's likely the more general word will take pride of place.

The goal of this writeup is to provide a quick overview of major topics in bioinformatics, and to give the non-technical reader a perspective on how bioinformatics relates to more traditional biology and computer science.


Major Topics in Bioinformatics

Bioinformatics is an umbrella term for a variety of diverse areas of interest. What follows is a survey of some of the most active areas of current research.


Text and Data Archiving

Much of the important work being done in bioinformatics parallels the work taking place in every major discipline - making journal articles, papers, and information accessible over the Internet in electronic form. This challenge has embroiled biologists in the usual battles over standards, metadata, markup and copyright familiar to us from the humanities. While everyone agrees on the laudable goal of providing universal access to print archives, actually making this happen requires time and hard work.

The hard sciences are in a more difficult position than many other disciplines, since they also face the challenge of making available prodigious quantities of raw experimental data. Biologists and computer scientists are actively working on standards to represent gene sequences, protein sequences, molecular structure, medical images, and all the other types of complex data they need to exchange as part of their research, but this work is slow going, made difficult by the existence of mutually incompatible, established formats.

Historically, researchers and research groups have stored their data in ad-hoc fashion, based on the needs of a particular experiment. Everybody recognizes that this variety of incompatible formats has become an obstacle to progress and cooperation, and so a good deal of work is going into setting standards for data compatibility.

One useful tool in this area has been Perl, a scripting language first created by Larry Wall in 1987. Originally designed as a language for text processing, Perl has matured into a full-featured computer language that lets programmers interconvert data between many formats with minimal hassle. Recognizing the value of Perl as a tool for processing data, the bioinformatics community has taken steps to standardize on a set of modules for handling sequence data. This initiative, which goes by the name BioPerl, is a direct analogue to initiatives like OKI for courseware - an attempt at creating a common set of tools for sharing complex data. Similar efforts are being undertaken in other programming languages, with the goal of eventual interoperability across all environments.


Ensuring Accessibility

To be useful, it's not enough for data to be in the right format - it also has to be made available on the public Internet, and in such a way that both human beings and computer programs running automated searches can find it. The situation is somewhat delicate given the proprietary nature of much of the research data, particularly in areas of interest to the pharmaceutical industry ( see the section on Bioethics ).

Lincoln Stein has likened the current state of affairs in bioinformatics to that of Italy in the time of the city states - a collection of fragmented, divided, and inward-looking principalities, incapable of sustained cooperation. Much of what is truly revolutionary about computational biology won't get off the ground until there is a seamless framework of data that researchers can combine, sift through, or rearrange without having to worry about its provenance. All of these efforts require time, goodwill, and hard work on the part of people who would probably much rather be doing new research.


Gene Sequencing

The most visible and active branch of bioinformatics is the art of gene sequencing, sometimes called "genomics", including the much-publicized Human Genome Project and its many offspring. While scientists have long known that DNA molecules carry hereditary information, only in the 1990s did advances in sequencing technology make it feasible to sequence the entire genome of anything more complex than a bacterium.

Understanding the significance of gene sequencing ( as well as its limitations ) requires a little background.

A Very Short DNA Primer

DNA molecules consist of many hundreds of thousands of nucleotides, which form complementary pairs down the length of a DNA strand. There are four of these nucleotides in all ( called G, A, C, and T ), and the specific sequence of nucleotides in a DNA strand encodes information that tells cells how to build particular proteins.

Machinery within the cell reads DNA in groups of three letters, or codons. There are 64 possible codons, and together they code for 20 different amino acids ( the building blocks of protein molecules ), as well as processing instructions to the cell, such as "cut here" or "start reading at this point". The instructions for building a protein are read sequentially by ribosomes that build a protein as they go, adding the appropriate amino acid for each chunk of information they read.
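
The reading process described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it works directly on DNA letters ( real cells first copy the DNA into RNA ), it assumes the standard genetic code, and the names CODON_TABLE and translate are made up for this example rather than taken from any library.

```python
from itertools import product

# The standard genetic code, written as one letter per codon.
# Codons are enumerated with the bases in the order T, C, A, G;
# '*' marks the three "stop reading" codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read uppercase DNA three letters at a time, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))        # prints 64
print(translate("ATGGCCTAA"))  # prints MA
```

Because 64 codons map onto only 20 amino acids plus the stop signal, most amino acids can be spelled more than one way, which is why the table string contains runs of repeated letters.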

For reasons that are not well understood, much of the DNA in a genome never gets read by the cell at all, and doesn't seem to code for anything useful. The parts that do get read are called coding regions, and much of the effort in sequencing goes towards identifying and locating these coding regions on a chromosome.

Very few genes in a cell are actually active in protein production at any given time. Different sections of DNA may be dormant or active over the life of a cell, their expression triggered by other genes and changes in the cell's internal environment. How genes interact, why they express at certain times and not others, and how the mechanisms of gene suppression and activation work are all topics of intense interest in microbiology.

The Mechanics of Sequencing

The goal of genome sequencing projects is to record all of the genetic information contained in a given organism - that is, create a sequential list of the base pairs comprising the DNA of a particular plant or animal. Since chromosomes consist of long, unbroken strands of DNA, a very convenient way to sequence a genome would be to unravel each chromosome and read off the base pairs like punch tape.

Unfortunately, there is no machine available that can read a single strand of DNA ( ribosomes are very good at it, but nature has not seen fit to provide them with an output jack ). Instead, scientists have to use a cruder, shotgun technique that first chops the DNA into short pieces ( which we know how to identify ) and then tries to reassemble the original sequence based on how the short fragments overlap.

To illustrate the difficulty of the task, imagine that someone gave you a bin containing ten shredded copies of the United States tax code, and asked you to reconstruct the original. The only way you could do it would be to hunt through the shredded bits, look for other shreds with overlapping regions, and assemble larger and larger pieces at a time until you had at least one copy of the whole thing.

Much of the work involved in DNA sequencing is exactly this kind of painstaking labor, with the added caveat that the short fragments invariably contain errors, and that reliably sequencing a single stretch of DNA might involve combining many dozens of duplicate data sets to arrive at an acceptable level of fidelity.

All of this work is done using computers that sift through sequencing data and apply various alignment algorithms to look for overlaps between short DNA fragments. Since DNA in its computer representation is just a long string of letters, these algorithms are close cousins of text analysis techniques that have been used for years on electronic documents, creating a curious overlap between genetics and natural language processing.
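
As a rough sketch of what "looking for overlaps" means, the Python below repeatedly merges the two fragments that share the longest overlap. It is a toy under generous assumptions: the fragments are error-free, all read in the same direction, and few enough that comparing every pair is affordable. The function names and the min_len threshold are invented for this example; real assemblers are far more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


reads = ["CGTTAGC", "ATGCGTA", "CGTACGTT"]
print(greedy_assemble(reads))  # prints ['ATGCGTACGTTAGC']
```

Even this tiny example hints at the real difficulty: repeated stretches of sequence can make two unrelated fragments look as if they overlap, and a greedy strategy has no way to recover from a wrong merge.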

The work of assembling a sequence requires that many data sets be combined, and common errors ( dropped sequences, duplications, backwards sequences ) detected and eliminated from the final result. Much of the work taking place right now on data from the Human Genome Project is just this kind of careful computerized analysis and correction.

Finding the Genes

Once a reliable DNA sequence has been established, there still remains the task of finding the actual genes ( coding regions ) embedded within the DNA strand. Since a large proportion of the DNA in a genome is non-coding ( and presumably unused ), it is important to find and delimit the coding regions as a first step to figuring out what they do.

This search for coding regions is also done with the help of computer algorithms, and once again there is a good amount of borrowing from areas like signal processing, cryptography, and natural language processing - all techniques that are good at distinguishing information from random noise.
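
One of the simplest signals a program can look for is an open reading frame: a stretch that begins with the start codon ATG and runs, three letters at a time, to one of the stop codons TAA, TAG or TGA. The Python sketch below scans for such stretches. It is a deliberately naive stand-in for the statistical methods mentioned above, it looks at only one strand of the DNA, and find_orfs and min_codons are names chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of open reading frames.

    Each of the three possible reading frames is scanned separately.
    The end index is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, start codon included.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


dna = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # prints 2 17 ATGAAATTTGGGTAA
```

The min_codons cutoff is the "signal versus noise" part of the idea: in random sequence a stop codon turns up fairly often by chance, so long uninterrupted frames are more likely to be meaningful than short ones.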

Finding the coding regions is an important step in genome analysis, but it is not the end of the road. An ongoing debate in genetics has been the question of how much information is actually encoded in the genome itself, and how much is carried by the complex system surrounding gene expression.

A useful analogy here ( borrowed from Douglas Hofstadter ) is that of the ant colony. While a single ant has only a small repertoire of behaviors, and responds predictably to a handful of chemical signals, a colony of many thousands of ants will show very subtle and sophisticated behavior in how it forages for food, raises its young, manages resources and deals with external threats. No amount of study of an individual ant can reveal anything about the behavior of a colony, because the properties of the colony don't reside in any single ant - they arise spontaneously out of the interactions between many thousands of ants, all obeying simple rules. This phenomenon of emergent behavior is known to play a part in gene expression, but nobody knows to what extent.

Emergent behavior is very hard to simulate, because there is no way to infer the simple rules from the complex behavior - we are forced to proceed by trial and error. Still, computers give us a way to try many different rule sets and test hypotheses about gene interaction that would take decades to work out with pencil and paper.

Unexplored Territory

Surprisingly enough, for all the intense research taking place in genomics, only a very few organisms have had their genomes fully sequenced ( we are in the proud company of the puffer fish, the fruit fly, brewer's yeast, and several kinds of bacteria ). Many technical and computational challenges remain before sequencing becomes an automatic process - some species are still very difficult for us to sequence, and much remains to be learned about the role and origin of all that non-coding DNA. Progress will require both advances in the lab and advances in our ability to analyze and process the genome data with computers.


Molecular Structure and Function

Closely related to gene sequencing is a second major field of interest in bioinformatics - the search for a mapping between the chemical structure of a protein and its function within the cell.

We noted above that proteins are assembled from amino acids based on instructions encoded in an organism's DNA. Even though all proteins are made out of the same building blocks, they come in a bewildering variety of forms and serve many functions within the cell. Some proteins are purely structural, others have special receptors that fit other molecules, and still others serve as signals or messages that can pass information from cell to cell. The role a protein plays depends solely on its shape, and the shape of a protein depends on the sequence of amino acids that compose it.

Amino acid chains start out as linear molecules, but as more amino acids are added, the chain begins to fold up. This spontaneous folding results in a complex, three-dimensional structure unique to the particular protein being generated. The pattern of folding is automatic and reproducible, so that a given amino acid sequence will always create a protein with a certain configuration.

The ability to predict the final shape of a protein from its amino acid composition is the Holy Grail of pharmacology. If we could design protein molecules from scratch, it would become possible to create potent new drugs and therapies, tailored to the individual. Finding treatments for a disease would be as simple as creating a protein to fit around the business end of an infectious agent and render it harmless.

The protein folding problem, as it is called, is computationally very difficult. It may even be an intractable problem. An enormous amount of effort continues to go into finding a mapping between the amino acid sequence of a molecule, its ultimate configuration, and how its structure affects its function.

Distributed computing plays a critical role in studying protein folding. As computing power increases, it will become possible to test more sophisticated models of folding behavior, and more accurately estimate the intramolecular forces within individual proteins to understand why they fold the way they do.

A better understanding of protein folding is also critical to finding therapies for prion-based diseases like bovine spongiform encephalopathy ( BSE, or mad cow disease ), which appears to be caused by a 'rogue' configuration of a common protein that in turn reconfigures other proteins it comes into contact with. Prion diseases are poorly understood and present a grave risk, since one protein molecule is presumably enough to infect an entire organism, and common sterilization techniques are not sufficient to destroy infectious particles.


Molecular Evolution

A third application of bioinformatics, closely related to genomics and protein analysis, is the study of molecular evolution.

Molecular evolution is a statistical science that uses changes in genes and proteins as a kind of evolutionary clock. Over time, any species will accumulate minor mutations in its genetic code, due to inevitable transcription errors and environmental factors like background radiation. Some of these mutations prove fatal; a tiny fraction prove beneficial, but the vast majority have no noticeable effect on the organism.

Because this trickle of small changes is slow, cumulative, and essentially random, it can serve as a useful indicator of how long two species have been out of genetic contact - that is, when they last shared a common ancestor. By comparing molecules across many species, scientists can determine the relative genetic distance between them, and learn more about their evolutionary history.
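
The arithmetic behind the clock idea can be sketched briefly. Assuming two sequences that have already been lined up letter for letter, we count the fraction of positions that differ; if mutations pile up at a steady rate in both lineages, the time since the common ancestor is roughly that fraction divided by twice the rate. The rate used below is a made-up number for illustration, not a measured value, and the function names are hypothetical. Real analyses also correct for positions that have mutated more than once.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def divergence_time(distance, rate_per_site_per_year):
    """Years since the common ancestor under a constant-rate clock.

    The factor of 2 is there because both lineages have been
    accumulating changes independently since they split.
    """
    return distance / (2 * rate_per_site_per_year)


species_a = "ACGTACGTAC"
species_b = "ACGTTCGTAA"
d = p_distance(species_a, species_b)
print(d)  # prints 0.2

# Hypothetical rate: one change per site every 100 million years.
print(divergence_time(d, 1e-8))
```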

The study of molecular markers in evolution builds on the same sequencing and analysis techniques discussed in the section on DNA. Because it relies on tiny changes to the genome, this kind of statistical analysis requires very precise sequencing data and sophisticated computational models, including advanced GIS ( geographic information systems ) software to correlate a species' genetic history with its historical range.


Modeling and Simulations

The final field we will cover, computer modeling of biological systems, is the most closely linked to traditional computer science, and in many ways the most accessible.

Computer modeling

Ever since Conway's Game of Life in 1970, computer scientists have been creating simulations of complex systems that generate intricate behavior from a simple set of underlying rules. Research in modeling is active in two directions - simulating living systems to gain a better understanding of the rules underpinning their behavior, and applying models observed in nature to help solve abstract problems in unrelated fields.
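
To show just how simple the underlying rules can be, here is one generation of the Game of Life in Python. A live cell survives if it has two or three live neighbors, and an empty cell comes to life if it has exactly three. Representing the board as a set of ( x, y ) coordinates is a choice made for this sketch, as is the name life_step.

```python
from collections import Counter


def life_step(live_cells):
    """Compute the next generation from a set of live (x, y) cells."""
    # Count, for every cell on the board, how many live neighbors it has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for x, y in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live_cells)
    }


# A "blinker": three cells in a row that flip between
# horizontal and vertical forever.
blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(life_step(blinker)))  # prints [(1, 0), (1, 1), (1, 2)]
```

Nothing in those two rules mentions blinkers, gliders, or any of the other patterns the game is famous for; they appear only when the rules are applied to many cells at once.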

The first approach has been used to better understand forests, predator-prey relationships, and other complex ecosystems, and has taken its place as an invaluable teaching tool at all levels of biology. Students can use computer simulations to observe firsthand the effects that different constraints and starting conditions can have on a biological system.
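
A classroom-style predator-prey simulation might look like the sketch below. It assumes the classic Lotka-Volterra equations, which the text above does not name, and steps them forward with the crudest possible method ( small fixed time steps ). All parameter names and values are invented for illustration.

```python
def simulate(prey, predators, steps, dt=0.01,
             birth=1.0, predation=0.5, death=1.0, conversion=0.25):
    """Step a Lotka-Volterra predator-prey model forward in time.

    birth      - growth rate of prey when no predators are around
    predation  - rate at which predators remove prey
    death      - rate at which predators die off without food
    conversion - rate at which eaten prey become new predators
    Returns a list of (prey, predators) pairs, one per time step.
    """
    history = [(prey, predators)]
    for _ in range(steps):
        prey_change = prey * (birth - predation * predators)
        predator_change = predators * (conversion * prey - death)
        prey += prey_change * dt
        predators += predator_change * dt
        history.append((prey, predators))
    return history


# With these parameters, 4 prey and 2 predators are in perfect balance...
print(simulate(4.0, 2.0, steps=100)[-1])  # prints (4.0, 2.0)

# ...while a different starting condition sets the populations moving.
for prey, predators in simulate(6.0, 2.0, steps=1000)[::250]:
    print(round(prey, 2), round(predators, 2))
```

Changing a single starting value or rate and rerunning the simulation is exactly the kind of experiment a student can do in seconds on a computer and not at all in the field.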

Since biological factors are known to have a critical effect on climate, much of the ongoing research on global warming and long-range weather patterns also relies on sophisticated models of plant and ocean ecology.

The second approach to modeling - stealing clever algorithms from nature - has led to some interesting and surprising solutions to thorny problems. For example, researchers in ant behavior have been able to use simulated 'ants' and 'pheromones' running around a shipping map to find more efficient routings for companies like UPS and Federal Express, saving those companies millions of dollars a year in operating expenses.
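
The ant idea can be sketched in miniature. In the Python below, simulated ants build delivery routes one stop at a time, preferring short hops and hops that earlier ants have marked with "pheromone"; shorter routes get more pheromone, and old pheromone evaporates. This is a bare-bones version of what is usually called ant colony optimization, written for illustration only. It is not the method used by any particular company, and every name and parameter value here is an assumption.

```python
import math
import random


def distance_matrix(points):
    """Straight-line distances between every pair of (x, y) points."""
    return [[math.hypot(ax - bx, ay - by) for bx, by in points]
            for ax, ay in points]


def tour_length(tour, dist):
    """Total length of a round trip visiting the stops in order."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))


def ant_colony_tour(dist, n_ants=10, n_rounds=50, evaporation=0.5, seed=0):
    """Search for a short round trip; distinct stops must be > 0 apart."""
    rng = random.Random(seed)
    n = len(dist)
    pheromone = [[1.0] * n for _ in range(n)]
    best_tour, best_length = None, math.inf
    for _ in range(n_rounds):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            while len(tour) < n:
                here = tour[-1]
                choices = [c for c in range(n) if c not in tour]
                # Favor strong pheromone trails and short hops.
                weights = [pheromone[here][c] / dist[here][c]
                           for c in choices]
                tour.append(rng.choices(choices, weights=weights)[0])
            tours.append(tour)
        # Old trails fade...
        for row in pheromone:
            for j in range(n):
                row[j] *= 1 - evaporation
        # ...and each ant reinforces its route, shorter routes more so.
        for tour in tours:
            length = tour_length(tour, dist)
            if length < best_length:
                best_tour, best_length = tour, length
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                pheromone[a][b] += 1.0 / length
                pheromone[b][a] += 1.0 / length
    return best_tour, best_length


stops = [(0, 0), (0, 1), (1, 1), (1, 0)]  # four corners of a square
tour, length = ant_colony_tour(distance_matrix(stops))
print(tour, length)
```

No individual ant knows anything about the map as a whole; good routes emerge because many ants, each following the same simple rule, gradually reinforce one another's better choices.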

Biophysics

An interesting and little-mentioned side branch of computer modeling is the study of biophysics. Biophysics deals with plants and animals from an engineering perspective, trying to figure out what holds them up, how they get from place to place, and what ideas we can steal from nature to improve our own building skills. Biophysics also helps scientists make inferences about long-extinct animals based on fossil evidence - figuring out how fast a dinosaur could run, or what kind of environment was necessary for giant insects to flourish.

While not as well-funded or highly publicized as efforts in microbiology, biophysics still represents an innovative application of computer science to biology, and has important applications in an eclectic set of fields: paleontology, structural engineering, hydrodynamics, and sports medicine.


Bioethics and the Scientific Commons

It's impossible to talk about bioinformatics without discussing the charged legal and ethical context surrounding the field. The legal issues clouding research into microbiology, especially in the area of genetics, are severe and likely to get worse.

Biopatents and the research community

There is a real tension between, on the one hand, the scientific tradition of peer review and open sharing of information, and on the other hand, the very high value of some of this information to business. Private companies have successfully been able to acquire patents on individual genes, and in some cases entire organisms. This unprecedented extension of patent law into the realm of living creatures is not the result of explicit policy decisions by elected representatives, but of administrative decisions made out of the public eye.

Given the enormously lucrative nature of patents and pharmacological research data, many public universities are facing an unpleasant conflict of interest between their need for income and their mission as educational institutions charged with serving the public. More and more often, a university's interest in maintaining a scientific commons gets weighed against the potential monetary value of patents and trade secrets.

Genetically modified foods

Advances in gene sequencing and manipulation techniques have rapidly been put to use by multinational agribusiness concerns. Much of the food supply in this country now comes from genetically modified crops, whose safety has never been demonstrated, and whose potential for causing permanent changes to the ecosystem is poorly understood.

The introduction of bioengineered plants and animals into the wild is a massive, ongoing uncontrolled experiment. Given our marginal understanding of even the most basic aspects of gene expression and cross-species gene transfer, the lack of a public debate on the subject of transgenic crops and animals has been somewhat shocking.

Privacy and public health

Any increase in our understanding of hereditary and congenital traits in the human genome has implications for privacy, individual choice, and public health policy. As screening for potentially fatal diseases becomes available, sometimes years before any symptoms appear, we will all have to make difficult personal and collective choices about the kind of health care system we want, and the degree to which we are ready to sacrifice our privacy in the interests of public health.