I spent the past few days reading about ENCODE, the ENCyclopedia Of DNA Elements, which is generating a lot of buzz right now. So why does reading about it give me a headache? What is ENCODE? This is a great chance to talk about this “big science” project, and to learn how the communication of scientific results can become a mess…
The genome is the collection of genetic code from which an organism (like us) gets its traits and features. These traits come from many processes within the cell: the code is transcribed and translated into chains of amino acids, which are then modified to become proteins, which are then transported to where they need to be and essentially become the building blocks of the organism. Now, after the Human Genome Project, we have an idea of what the long sequence of code looks like: 3164.7 million chemical nucleotide bases, each represented by one of the letters A, T, C, or G. This is massive! If we were to print it out letter by letter, we could apparently fill two hundred 500-page telephone directories. The ENCODE project (420 scientists, 32 labs around the world) aims to go a step further. It says on its project website:
The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
This is an important step, because simply knowing the code does not tell us what it really does. But how do you start building a comprehensive list when you have 3164.7 million nucleotide bases to go through? In general, the ENCODE approach is this: imagine that you are shopping online at eBay, which has lots and lots (and lots!) of products. Some are useful, working products, and some are not. You want to get a clock, but checking the listings one by one would simply take too long, so instead you look for specific “features”: something with gears, a circular face with the numbers 1 to 12 on it, hour and minute hands, and so on.
This is actually a pretty smart approach. In their 2012 paper, the ENCODE team looked for 4 features: regions of transcription, transcription factor association, chromatin structure, and histone modification, because these are elements that likely matter if we are to search for something specific (like, a clock) later on.
So what’s the problem? It mostly comes down to one word – “function.”
In ENCODE’s news release, they stated that
[...], researchers linked more than 80 percent of the human genome sequence to a specific biological function and mapped more than 4 million regulatory regions where proteins specifically interact with the DNA.
The news release further stated that “most of the human genome is involved in the complex molecular choreography required for converting genetic information into living cells and organisms.”
This sent a shock wave through much of the science community. From what we have learnt about DNA and the human genome so far, we know that a large proportion of the sequence is not “functional”: it doesn’t code for a protein and doesn’t seem to have a specific purpose in the cell. It is what has been called “junk DNA” (a terrible term, because not having an immediate function doesn’t mean that it should be thrown out, which is why many scientists avoid it). 80% is much, much higher than what most scientists expected. This discovery by ENCODE was picked up immediately by the media and billed as “an overturn of the junk DNA theory” *cringe*. A new breakthrough in the field! Or is it?
You might have figured out what doesn’t seem quite right here. What ENCODE identified were “functional elements”: elements that suggest the possibility of biological function. Just as not every product with gears is actually “functional” (it could be a clock, a broken watch, some “as seen on TV” product, a bag of random mechanical parts, or a craft project glued together by your 4-year-old nephew), identifying functional elements does not equate to demonstrating actual biological functions in your cell. And having functional elements does not confirm involvement in critical cellular pathways or association with important functions in the cell.
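To make the point concrete, here is a minimal sketch of how a headline percentage like this can arise when you count every base touched by at least one feature. The genome length, the feature names, and all the coordinates are made up for illustration. They are not ENCODE data, and this is not ENCODE's actual pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical toy genome of 1000 bases with invented feature coordinates.
GENOME_LENGTH = 1000
features = {
    "transcription": [(0, 300), (500, 700)],
    "tf_binding": [(250, 320), (690, 720)],
    "chromatin": [(300, 500)],
    "histone_mod": [(700, 800), (100, 150)],
}

# Coverage of each feature type on its own.
for name, intervals in features.items():
    print(name, covered_fraction(intervals, GENOME_LENGTH))

# Coverage by "any feature at all" is the union of everything.
all_intervals = [iv for ivs in features.values() for iv in ivs]
fraction = covered_fraction(all_intervals, GENOME_LENGTH)
print("any feature", fraction)
```

In this toy example no single feature type covers more than half of the sequence, yet the union covers 80% of it. The number tells you how much of the sequence shows some kind of signal. It does not tell you how much of it is doing something that matters to the cell.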
After the immediate media hype, other scientists expressed concerns (to put it lightly), but it was too late (read “A Genome-Sized Media Failure” by Michael White). This also led to a very recent, rather aggressive paper by Graur et al. refuting the claims made by the ENCODE project. The whole thing is now very messy (that’s why I was having a headache) :( I won’t elaborate much further, but if you want to know more about this, see Further Reading.
The funny thing is, ENCODE could have been more specific and could have chosen a less controversial term, like “specific biochemical activity” as suggested by PZ Myers, or perhaps “ability to bind to cellular factors.” If they had not overreached with their claim, the focus would have remained on the amazingly huge amount of information that ENCODE provides, which scientists around the world can now analyze to learn more about our genome and how it works.
The ENCODE experience is probably good for science (I didn’t say it is going to be a pleasant one…). We now have this enthusiasm/obsession about big science, and there is so much pressure to get the “next breakthrough” out, to create the next hype. But we should really come back to the objectives of these big science projects (for ENCODE, building an informative genome database for scientists) and to disseminating well-supported information to the public and the media with adequate explanations.
And, scientists or not, we should remain curious yet skeptical about “breakthrough” discoveries in the future :)
Postscript 1: This reminds me of the OPERA discovery about neutrinos travelling faster than the speed of light, which was later found to be the result of equipment/calculation errors. Even though in the end this did not go down so nicely, at least they stated outright that they were not sure what was going on, and invited everyone to help figure out whether this was a true discovery or an error (in fact, this sparked a lot of good public discussion about particle physics, which was awesome). I give the OPERA team kudos for that.
Postscript 2: While I am a little sympathetic about the situation ENCODE is in, I don’t have much good to say about ENCODE’s public promo video. Neither the Human Genome Project nor ENCODE is a shortcut to drug discoveries and treatments for rare diseases. They are, however, critical steps toward understanding how our genome works. It will take a lot more effort in the future to tease out the specifics, and the video seems to convey the message that ENCODE is much closer than the Human Genome Project to finding cures for diseases (it isn’t… we don’t even know where the end is…)
Further Reading:
- So I take it you aren’t happy with ENCODE… by Josh Witten
- ENCODE says what? by Sean Eddy
- Coverage by the Guardian: Scientists attacked over claim that ‘junk DNA’ is vital to life
- ENCODE gets a public reaming, and the ENCODE Delusion by PZ Myers
- Ewan Birney (ENCODE lead coordinator) and Chris Ponting from Oxford on BBC radio
- Ed Yong’s exhaustive summary: ENCODE: the rough guide to the human genome
- Nature Journal’s ENCODE Project Explorer
- PLoS paper: A User’s Guide to the Encyclopedia of DNA Elements (ENCODE)