January 23, 2009

Q&A: Harvard's Tonellato Offers a Guided Trip to the Cloud

Newsletter:
BioInform

In early February Harvard Medical School senior research scientist Peter Tonellato will begin leading an eight-week seminar called "Clouded Translational Science" that will take place at Harvard's Countway Library and will also be webcast to research groups around the world.

According to its description, the seminar will address the "accelerating barrage of new associations between phenotypes of disease and molecular signals" for which no baseline of methodology exists to translate such results into practical clinical use. Tonellato said he wants to let participants explore new methods and tools and apply cloud computing to examine and integrate data and methods that will propel the translation from discovery to practice.

Tonellato runs the Laboratory for Personalized Medicine at the Center for Biomedical Informatics at Harvard and at Children's Hospital of Boston, where he is a senior research scientist. Last year he deployed an Oracle database in the Amazon Elastic Compute Cloud environment to run simulations to predict the clinical utility of genetic testing in patients [BioInform Dec. 12, 2008].

Rather than use actual patient data and build a data center, he told BioInform, he decided to apply cloud computing to manipulate up to 100 million different clinical avatars in the course of the simulation.

Tonellato, an applied mathematician by training, also co-founded PointOne Systems of Wauwatosa, Wisc., a company devoted to helping physicians sort through their patients' clinical and genomic data. The company was sold at the end of last year to an undisclosed buyer. What follows is an edited version of the conversation.

How is the cloud computing seminar organized?

We're launching a series of collaborations with groups that have different discoveries, and for each one of them there is a spectrum of nuances [to find] the most important data elements and then to understand how best to model those data elements. It's a step-by-step process.

This is an introduction to the scientific basics and technical basics of being able to do this in a replicable manner. For example, we have a few groups from Harvard, one of whom, Dennis Wall [director of Harvard Medical School's Computational Biology Initiative], does systems biology and wants to do a more in-depth study across genetic variants of autism and do gene-gene interaction prediction.

What we want to do then is take his understanding of the gene-gene interaction on that variant information and then identify how to build clinical avatars that can be used to do various kinds of simulations in autism, such as for treatment options, genetic tests, early diagnosis.

The seminar has five or six projects, only one of which is building clinical avatars. I have spoken to all of the people involved because I am curious whether we can build a parallel or adjacent objective based on this idea of clinical avatars. They enthusiastically embraced that idea, and I will be pursuing it with them.

Their [own] objective in these projects may be related to other things. For example, George Church's group [at Harvard] is participating and they want to use cloud computing to see if they can more rapidly analyze next-generation sequencing data from the Personal Genome Project in order to identify variants. We are working with them to see how quickly we get those variants detected in an individual and hybridize that to clinical information that's relevant to that individual's ongoing or future health situation.

The way I am couching this seminar for my own group and the groups I am working with is to find new ways to do validation and verification through predictions and simulations and use new technologies, which is where the Amazon Web Services [Elastic Compute] Cloud comes in.

How do participants have to be equipped?

Most of the groups are very sophisticated on the computer side. I have asked them to identify three objectives: a scientific objective, a computational objective, and a third objective, which is that we want to do something on the cloud to see if that scientific and computational objective [can] be more rapidly attained by using the cloud computing infrastructure, as opposed to the usual data center situation.

I have asked them to think of a proof-of-concept project that fits in some way with their scientific and computational objective; something they have already done before.

We don't want to take on everything as brand new; we want to do it in bite sizes. We are going to be learning how to do things on the cloud. In particular, let's do something we have already done. It's pretty complex as it is; we're not learning how to do the actual thing, we're learning how to do it on the cloud.

For the autism work, are they placing the literature on the cloud and doing data mining there?

They have done some natural language processing of text to identify gene-gene interactions. So we say, 'Let's replicate that preliminary result they have by using the cloud infrastructure.'

That research group will be doing an exercise to get the abstracts and publications on the cloud, install the software on the cloud, and figure out how to link everything together on the cloud. If it requires a computer cluster, then let's figure out how to invoke that cluster on the cloud, let's run it, and compare the results we got on the cloud with what we got in the data center. They should be identical.
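A minimal sketch of that final comparison step, written in Python with the boto3 AWS SDK (which postdates this interview), might look like the following; the bucket and file names are placeholders rather than the seminar's actual resources.

```python
# Minimal sketch: stage inputs on the cloud, then verify that the cloud run
# reproduces the data-center run. Bucket and file names are placeholders.
import hashlib

import boto3  # AWS SDK for Python; assumes credentials are already configured


def sha256_of(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


s3 = boto3.client("s3")

# 1. Stage the input corpus (abstracts, publications) on the cloud.
s3.upload_file("abstracts.tar.gz", "example-nlp-bucket", "inputs/abstracts.tar.gz")

# 2. ... run the existing NLP pipeline on EC2 exactly as it ran in the data center ...

# 3. Fetch the cloud result and confirm it matches the data-center result.
s3.download_file("example-nlp-bucket", "outputs/interactions.tsv", "cloud_interactions.tsv")
assert sha256_of("cloud_interactions.tsv") == sha256_of("datacenter_interactions.tsv"), \
    "cloud and data-center runs should produce identical results"
```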

If you have too many new factors going at the same time in the proof of concept and something doesn't turn out the way you think it should, you don't know whether the problem is the way you invoked something on the cloud or something else entirely.

Course participants will also think about their next computational projects. I am setting up a virtual data center for each project group in advance of the seminar. Amazon will be sending one of their key people, a guru of cloud computing kind of person, who will give a lecture on the first seminar day, walk everybody through the basic steps, and then make themselves available to answer questions.

What kind of cloud space is there for your participants?

They will get an allocation of virtual machines to select from. In the Amazon environment there are literally thousands of different kinds of virtual machines — Windows machines, all the flavors of Linux you can imagine, with Oracle or without, with MySQL or without. All these combinations have already been set up. I don't want to be the salesperson for Amazon, but what I really like is that you go in through this console, pick which things you want to light up, click on them, and they light up.

In my mind they have really reached an elegant simplification that allows any R01 lab, or any lab in the world, in fact, the ability, with a few hundred bucks a month, to have a fairly nice data center that they have full control over.
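As a rough illustration of that "pick it and it lights up" step done programmatically rather than through the console, here is a minimal Python sketch using the boto3 SDK, which postdates this interview; the image ID and instance type are placeholders.

```python
# Minimal sketch: launch one prebuilt machine image. The AMI ID and instance
# type below are placeholders, not specific images mentioned in the seminar.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # a prebuilt image, e.g. Linux with MySQL preinstalled
    InstanceType="m1.small",  # size of the virtual machine
    MinCount=1,
    MaxCount=1,
)
print("launched", response["Instances"][0]["InstanceId"])
```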

What is the financial scope for users? Costs accrue depending on how much data is moved on and off the cloud, but what is an average lab going to need to spend?

When we started, the first four, five, six months, we were probably running under $1,000 a month. That included virtual engines running 24/7. When we did the 100 million avatar studies, we probably had to launch five times more engines, and that was included. We might have been up to 15 machines or so, and we ran them for a day or two so we could do the analysis. Then the beauty of the cloud is that you shrink them back down to the two or three that keep things going. That's what everyone has got to remember: you spin them up and then you want to spin them down.
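A minimal sketch of that spin-up, spin-down pattern, again in Python with boto3; the burst size, image ID, and baseline are illustrative, not the lab's actual numbers.

```python
# Minimal sketch: add worker instances for a large run, then terminate them
# so only the small baseline keeps accruing charges. Values are illustrative.
import boto3

ec2 = ec2_client = boto3.client("ec2")

# Spin up extra workers for the large simulation run.
burst = ec2.run_instances(
    ImageId="ami-xxxxxxxx", InstanceType="m1.large", MinCount=12, MaxCount=12
)
burst_ids = [inst["InstanceId"] for inst in burst["Instances"]]

# ... run the analysis for a day or two ...

# Spin back down so the meter stops running on the extra machines.
ec2.terminate_instances(InstanceIds=burst_ids)
```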

If you keep a medium-sized cluster running — let's say three different instances of virtual machines, with 8 to 16 dual cores each — you're talking about $100 to $150 a month to run. Each of those has space allocations, maybe a terabyte each. If you want to allocate a large data structure to be mounted to each one of those, there's a small additional charge.
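As a back-of-the-envelope illustration of how such a monthly figure is composed, here is a small Python sketch; the hourly and storage rates are placeholders, not quoted Amazon prices, so plug in the current rate card for a real estimate.

```python
# Minimal sketch: monthly cost = always-on compute + attached storage.
# All rates below are placeholders, not actual AWS prices.
HOURS_PER_MONTH = 24 * 30


def monthly_cost(n_instances, hourly_rate, storage_gb, storage_rate_per_gb):
    compute = n_instances * hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * storage_rate_per_gb
    return compute + storage


# e.g. three always-on instances plus attached storage volumes
print(monthly_cost(n_instances=3, hourly_rate=0.05,
                   storage_gb=1000, storage_rate_per_gb=0.02))
```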

Is cloud computing attractive for scientists who don't want to have to set up a data center of their own?

One of the keys for companies to do this, as opposed to researchers, is that they don't have the capital costs; they can expense these costs instead. That makes a huge difference to them.

It looks like we are going to have a healthcare economist examining each of these course projects as they are implemented on the cloud, to analyze what the costs are: starting from scratch, ramping it up on the cloud, running it for a couple of months, bringing it down. … He is going to do a comparison between the current traditional way of doing things and the cloud.

I like to set things up in baby steps so everybody is successful and then they are on their own. … Where they go thereafter, the sky is the limit. It will probably be breaking new ground.

When you mention the cloud as part of this seminar, does that always mean the Amazon Cloud or are you including other companies that offer virtualized computing?

I would never try to lay down plusses and minuses between Amazon and other groups; that's not my objective. I did do a review last year about this time, dived down into who had what, what were the costs, and for a number of different reasons I decided to go with Amazon. I am happy I made that decision, including the stability they have now with half a million developers or entities or so operating on the cloud. I suspect that is a lot more than other environments.

There are dozens and dozens [of others]; I don't remember all the names. There is a company called Rackspace with a group called Mosso [Rackspace started Mosso as its cloud hosting division]. There are many startups. There were two groups launching in China and one group in India. These were typical data centers that had latched onto this idea of virtual computing and cloud computing and were launching a secondary business. All three of them were using the open source virtual computing [platform] environment.

How has cloud computing played out for you practically?

My lab uploaded all five [human genomes] that exist today: the current NCBI [Human Genome] Build 36, [Jim] Watson's genome, [Craig] Venter's genome, the Chinese genome, and now there's an African genome. We are putting those in our cloud environment, and what's surprising is that there is still latency in invoking an ftp command to move a huge dataset from one physical location, in our case Boston, to another physical location; I quite frankly don't know where the AWS servers are located. Surprisingly, these data transfers still fail periodically.
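A minimal sketch of the kind of retry wrapper that such flaky bulk transfers invite, in Python with boto3; the file name, bucket, and retry policy are illustrative, and for genome-scale files a multipart or resumable transfer tool would be preferable.

```python
# Minimal sketch: retry a large upload a few times before giving up.
# Bucket and file names are placeholders.
import time

import boto3


def upload_with_retry(local_path, bucket, key, attempts=5):
    s3 = boto3.client("s3")
    for attempt in range(1, attempts + 1):
        try:
            s3.upload_file(local_path, bucket, key)
            return
        except Exception:  # network hiccups, timeouts, dropped connections
            if attempt == attempts:
                raise
            time.sleep(30 * attempt)  # back off, then try again


upload_with_retry("ncbi_build36.fa.gz", "example-genomes-bucket",
                  "genomes/ncbi_build36.fa.gz")
```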

One of the problems we have generally is that you need the data near the CPU cycles. If they are not near the CPU cycles you have latency between the algorithm or computational process accessing the data and executing whatever analysis it's doing.

Wherever our CPU cycles are located, Timbuktu, or underneath our desks, or across the country, ideally you want the data physically next door. If you don't, you have this tremendous issue about migrating the data back and forth.

Amazon has put their huge datasets next to the CPU cycles. I think that reflects to a certain degree their idea of making things easier for researchers. And they are not going to make a ton of money off of researchers.

They are working with us to help us figure out how to do our projects. I hope they are learning about things that work well and others that don't, because that will be better for us in the long run.

Amazon Web Services published a case study about your clinical avatar project, which stated that it took you only 10 days to get set up. Your course participants might think that is how it will work for them, but that sounds pretty quick.

I knew everything about our project, the resources required and so on. Actually, that is why I am doing some work up front: to understand their projects and try to define virtual data centers that properly support them, including what kind of infrastructure and how many CPU cycles they think they are going to use.
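A minimal sketch of what such an up-front virtual data center specification might look like, with illustrative field names and values rather than any actual course project's requirements:

```python
# Minimal sketch: a per-project "virtual data center" spec drafted before the
# seminar. Every field name and value here is illustrative.
PROJECT_SPEC = {
    "project": "gene-gene-interaction-nlp",
    "instances": {"type": "m1.large", "count": 4},
    "storage_gb": 500,
    "software": ["mysql", "nlp-pipeline"],
    "estimated_cpu_hours": 2000,
}
```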

Suppose they know the background and infrastructure needs and it is all laid out so it is clear to them. My objective is that they quickly learn how to do it. My group and I have worked on these kinds of issues for years and years and we happen to be able to do it pretty quickly. Part of the goal of the seminar is to give other people the same level of experience, so they can do something quickly and that [knowledge] propagates around after the usual learning curve.

You had 100 million avatars in your simulation. That sounds like a clinical trial dreamscape. How did you accomplish that?

Clinical avatars are created from a mathematical model, a stochastic model, all of which is based upon an analysis of real data. That [data] came from patient populations or various studies related to … the statistical distributions of the important variables of that population and the correlations between those variables, such as age, percent smokers, genetic variant distribution.

I want to be clear on this: These are predicted values … not actual patient data. Consequently I don't have [Health Insurance Portability and Accountability Act] concerns. The limiting side is important to recognize, too: it is not real data. I can do wonderful things with 100 million avatars. I can do very valuable comparative analyses between, for example, the many algorithms that predict therapeutic warfarin dosing.
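A minimal sketch of the clinical-avatar idea in Python with NumPy: synthetic patients are drawn from population-level distributions, and competing dosing algorithms can then be compared on the same cohort. The distribution parameters are illustrative, and the toy dosing function is hypothetical, not any published warfarin algorithm.

```python
# Minimal sketch: sample synthetic "avatars" from assumed population
# distributions, then score them with a toy dosing function. All parameters
# are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # scale toward 100 million on a larger cluster

avatars = {
    "age": rng.normal(loc=55, scale=12, size=N).clip(18, 95),
    "smoker": rng.random(N) < 0.20,                   # assume 20% smokers
    "variant": rng.choice([0, 1, 2], size=N,          # copies of a risk allele
                          p=[0.36, 0.48, 0.16]),
}


def toy_dose(age, smoker, variant):
    """Hypothetical dosing rule, standing in for a published algorithm."""
    return 5.0 - 0.02 * (age - 50) + 0.5 * smoker - 1.0 * variant


doses = toy_dose(avatars["age"], avatars["smoker"], avatars["variant"])
print("mean predicted dose (mg/day):", float(doses.mean()))
```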

At the same time, like any mathematical model you can look back at it and say, 'Oops, I forgot xyz factor,' so the whole thing is moot. You have to cautiously interpret those results.

I would never suggest that clinical avatars would replace clinical trials on real people. The positive side is you reduce some barriers; you can rapidly move forward on different predictions and simulations.

Are you still running PointOne?

PointOne was privately acquired. That was completed at the end of last year. I have been advising them but I am out of the PointOne business now. I am very happy how things are developing at Harvard and am glad to be able to focus my attention on that.

I still spend half my time in Wisconsin [at the Medical College] and half my time in Boston. A number of universities in Wisconsin, and I am not going to name names, are going to participate in the seminar. That's very exciting. There are a couple of groups in Japan that are going to participate and a couple in Massachusetts and one small group in India.

Because I alternate between Wisconsin and Boston, we're going to do everything [through] WebEx. We're not going to try to do multiple video feeds but will do the presentations, the audio lectures, and a discussion group via WebEx. We will post everything on the Web and people are more than welcome to look at the webcast. It's a project-driven, multidisciplinary seminar that will run throughout the world.

My objective is to use the project to bind the groups together in the different locations and use the lectures to make sure everybody gets to the same levels of understanding and insight into how they might approach these. That is why I call it an experiment in translational science, because the entire seminar is an experiment.

We will hopefully have some successful use cases that come out of this. I suspect there will be some rough edges but I do intend to replicate the seminar going forward. We think of this as version 0.9.
