Interview: Fighting Dementia with Bits & Bytes
The analysis of genetic data could lead to new approaches for the prevention and treatment of Alzheimer’s disease. However, evaluation of such data requires enormous computing power. With that in mind, the DZNE and the information technology company Hewlett Packard Enterprise (HPE) are investigating the potential of “Memory-Driven Computing”. The DZNE is the first institute worldwide to use this technology in biomedical research. Memory-Driven Computing is based on a new type of computer architecture that HPE has developed within the framework of its research project termed “The Machine”. In contrast to conventional computer architectures, which are processor-centric, Memory-Driven Computing is memory-centric, and emphasizes ultra-fast data access. We spoke with Prof. Joachim Schultze, a genome researcher and research group leader at the DZNE, about this project and the perspectives it offers.
Mr. Schultze, what promise does genetic data analysis hold for Alzheimer’s research?
Alzheimer’s is a complex disease. Although some mutations are known to cause Alzheimer’s, in the great majority of cases various factors seem to interact. Therein lies its complexity. We can think of Alzheimer’s as an enormous puzzle, and our task as that of finding the right pieces and fitting them together. Lifestyle seems to play a role, but genes also contribute, to a greater or lesser degree, to whether one stays healthy or falls ill. In fact, some genetic features increase the risk of developing Alzheimer’s – but they do not necessarily trigger the disease. Other gene variants appear to protect against Alzheimer’s. Statistically, people with such genes are less likely to develop the disease. In genetic research, we are attempting to identify such risk and protective factors.
What could be the benefits?
We are hoping to improve the understanding of the causes and mechanisms of Alzheimer’s. This could provide a basis for new therapies. Prevention might also benefit from this research.
Why is that?
A better understanding of the involvement of genes in Alzheimer’s could make it possible to generate risk profiles for each individual – subject to their consent, of course. Certainly, this would only make sense and be ethically justifiable if the therapeutic options were better than they are today – that is, if interventions were available to reduce the risk of disease. While such interventions are still quite a way off, they are nevertheless a realistic goal.
Would this involve medications?
We don’t know yet what effective prevention against Alzheimer’s looks like. One element could be drugs that at least delay the onset of the disease, even if they cannot prevent it. However, prevention could involve other aspects as well, such as lifestyle. Studies suggest that regular exercise has benefits beyond training the cardiovascular system: it is also good for the brain. As a result, people who exercise are less susceptible to Alzheimer’s. For now, however, only general lifestyle recommendations can be derived from these findings. We hope that someday we will be able to be more specific and provide recommendations tailored to each individual and their particular risk profile. That would be a step towards personalized disease prevention.
How do you go about identifying protective genes and risk genes?
We look for striking features in the genome. This requires comparing the genomes of as many healthy people and as many Alzheimer’s patients as possible. If certain gene variants are more common in patients, this indicates that they could play a role in the disease. However, for conclusive evidence of a causal relationship, we have to look more closely. For example, we have to determine which proteins they encode, and study how those proteins are implicated in metabolic processes or in the function of nerve cells or immune cells in the brain. Such work is by no means trivial. In sum, while genetic analysis provides important indications, it is really only a starting point.
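To make this kind of comparison concrete, here is a minimal Python sketch with invented numbers – not real study data – that contrasts how often a variant is carried by patients and by healthy controls and expresses the difference as an odds ratio:

# Invented numbers for illustration only – not real study data.
carriers_patients, patients = 300, 1000     # variant carriers among Alzheimer's patients
carriers_controls, controls = 180, 1000     # variant carriers among healthy controls

freq_patients = carriers_patients / patients
freq_controls = carriers_controls / controls

odds_patients = carriers_patients / (patients - carriers_patients)
odds_controls = carriers_controls / (controls - carriers_controls)
odds_ratio = odds_patients / odds_controls

print(f"carrier frequency, patients: {freq_patients:.2f}")
print(f"carrier frequency, controls: {freq_controls:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")      # values well above 1 hint at a possible risk variant

A single comparison like this only flags a candidate; as noted above, establishing a causal role requires follow-up work on the encoded proteins and their function.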
What kinds of data quantities have to be processed in such analysis?
The human genome comprises some three billion base pairs. For each base pair there are four options, meaning that each base pair can be encoded in two bits of data. For one genome, this translates into a data volume of about 700 megabytes. However, the data sets generated by gene sequencers are usually much larger. In fact, the data set of one genome typically comprises about 180 gigabytes. This is because such data sets contain parts of the genome not just once but several times. When a sequencer analyzes a genome, it reads through it letter by letter, so to speak. But that does not happen in one run. Instead, the sequencer gradually generates overlapping segments of the genome, which later have to be put together. Furthermore, in addition to the gene sequence, such data sets contain a range of additional information, such as information about data quality. That is why the data set for one genome can grow to 180 gigabytes – an amount of data that would fill more than 35 standard DVDs.
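As a quick plausibility check, the arithmetic behind these figures can be written out in a few lines of Python; the DVD capacity of 4.7 gigabytes is an assumption, not a figure from the interview:

# Back-of-the-envelope check of the figures above.
base_pairs = 3_000_000_000          # roughly three billion base pairs per human genome
bits_per_base = 2                   # four possible bases -> two bits each
raw_megabytes = base_pairs * bits_per_base / 8 / 1e6
print(f"raw genome: about {raw_megabytes:.0f} MB")   # ~750 MB, i.e. the "about 700 megabytes" cited above

sequencer_output_gb = 180           # typical per-genome data set, as stated in the interview
dvd_capacity_gb = 4.7               # assumed capacity of a standard single-layer DVD
print(f"standard DVDs needed: about {sequencer_output_gb / dvd_capacity_gb:.0f}")   # more than 35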
Do you need to look at entire genomes?
Not necessarily. Some questions can be studied by looking at segments. In any case, one needs to have genetic data of as many people as possible – ideally, from several thousand. This adds up to data volumes of several hundred terabytes.
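A rough calculation along the same lines, with an assumed cohort of 3,000 genomes purely for illustration:

genomes = 3_000                     # assumed cohort size (illustrative only)
per_genome_gb = 180                 # sequencer data set per genome, as above
print(f"total: about {genomes * per_genome_gb / 1000:.0f} TB")   # several hundred terabytes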
How long do computers take to process such amounts of data?
It can take days or even weeks. That is because, for each processing step, large quantities of data have to be loaded from external storage – usually hard disks – into the computer’s working memory. The working memory is comparatively small, so shifting the data around takes a lot of time. As a result, the bottleneck in genetic analysis is not so much the processor’s computing speed. Processors keep getting faster and faster, even though we are gradually coming up against limits there. The real bottleneck is data access. The task is somewhat comparable to assembling a puzzle. Imagine that the puzzle pieces are spread over many boxes. In that case you would have to open one box after another to find the matching pieces, which takes a lot of time. It's much faster and easier to put the puzzle together when all of the pieces are spread out on a table in front of you. Applying this analogy to computers: ideally, there should be a huge working memory that contains all the data, so that the processor can access it directly. That is the core principle of Memory-Driven Computing.
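The difference between the two access patterns can be sketched in a few lines of Python. The file name and chunk size below are made-up parameters, and the sketch only shows the shape of the two code paths rather than serving as a benchmark:

# Illustrative sketch, not DZNE/HPE code: the same aggregation done by streaming
# chunks from storage versus a single pass over data already held in memory.
import os
import time

PATH = "reads.bin"                  # hypothetical data file (assumption)
CHUNK = 64 * 1024 * 1024            # assumed 64 MB working-memory budget per load

def process_from_disk(path):
    # Processor-centric pattern: every processing step re-streams chunks from storage.
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += sum(chunk)     # stand-in for the real analysis work
    return total

def process_in_memory(data):
    # Memory-centric pattern: the data is already resident, no reload per step.
    return sum(data)

if __name__ == "__main__":
    if os.path.exists(PATH):
        t0 = time.perf_counter()
        process_from_disk(PATH)
        print(f"disk-bound pass: {time.perf_counter() - t0:.2f} s")

        data = open(PATH, "rb").read()   # load once, keep in memory
        t0 = time.perf_counter()
        process_in_memory(data)
        print(f"in-memory pass: {time.perf_counter() - t0:.2f} s")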
How is the DZNE using this technology?
We are currently working on adapting the gene analysis algorithms to the new computer architecture. In other words, we first have to do some development work. You have to consider that the analysis of genetic data normally entails several processing steps, and each of these requires its own specific algorithms. That is why we began with an especially important and computationally intensive processing step. Put simply, the job is to link the gene fragments generated by the sequencer into a continuous sequence. We wanted to determine the extent to which this task could be accelerated. To that end, we adapted a well-established algorithm to the specific needs of Memory-Driven Computing. The result is encouraging: we were able to reduce the computing time for a small set of test data from 22 minutes to 13 seconds. It is important to note that the acceleration is maintained even for much larger data volumes, such as those that occur in large population studies. In other words, the benefits of the new computer architecture scale up.
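The actual pipeline relies on established, highly optimized tools, but the basic operation – joining overlapping fragments into one continuous sequence – can be illustrated with a naive, greedy Python sketch on toy reads (this is not the algorithm that was adapted):

# Toy illustration only: naive greedy merging of short reads by their overlaps.
def overlap(a, b, min_len=3):
    # Length of the longest suffix of `a` that matches a prefix of `b`.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def merge_reads(reads):
    # Repeatedly merge the pair of reads with the largest overlap.
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break                                   # no overlaps left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(merge_reads(["GATTACAGG", "ACAGGTTCA", "TTCAGCGT"]))
# -> ['GATTACAGGTTCAGCGT']

With millions of real reads, the cost of such pairwise comparisons is dominated by how quickly the data can be reached, which is exactly where a very large working memory pays off.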
What computers are you using?
Currently, HPE has one prototype specifically designed for Memory-Driven Computing. That computer is located in the U.S. It has a working memory of 160 terabytes, the largest of any computer worldwide. So far, we have carried out one test run on this machine; we were the first cooperation partners of HPE worldwide to be granted this access. Otherwise, we work with HPE computers that run a software emulation of the prototype’s capabilities. Such emulations are not as fast as the prototype, but they replicate its essential aspects. Recently, we got our own computer for such tests in our computing center in Bonn: an HPE Superdome Flex. Some aspects of its hardware are designed according to Memory-Driven Computing principles. In particular, the computer supports extremely fast data exchange, and it has 1.5 terabytes of memory that can be scaled up to 48 terabytes.
What comes next?
Thus far, we have been concentrating mainly on one algorithm. For gene analysis, however, additional software tools are required. We plan to gradually adapt these to Memory-Driven Computing. Once the tools are ready, we intend to begin evaluating current research data. I think that by mid-2019 we could be ready to get started. And the development will continue. Our long-term goal is to compare the genetic data of thousands of people within a few minutes.
And then?
I expect that this new technology will profoundly change how research is done. It will enable us to compile and analyze information to an extent that was previously inconceivable. And its applications are not limited to genetic data. The DZNE's different areas of research, including fundamental research, clinical research and population studies, all generate enormous amounts of data. We want to interlink all of this information. As I indicated, we can think of Alzheimer’s as an enormous puzzle. If we just look at individual pieces of that puzzle, we'll never see the big picture. And yet the big picture is precisely what we're trying to see. For example, one can ask how disease-relevant gene variants affect the brain. It would thus make sense to link genetic data with data from brain imaging – for example, with data from MRI scans. We can expand this concept and also integrate lab data and other data from clinical studies, and do that for hundreds or even thousands of individuals. That truly takes us into the realm of big data. To analyze such large quantities of data, we need special tools. That is why we have high hopes for this new computer technology.
The interview was conducted by Marcus Neitzert.