The evolution of computational genomics: solving bioinformatics challenges

Here Dr. Martin Hemberg, Assistant Professor of Neurology, Brigham & Women's Hospital and Member of the Faculty, Harvard Medical School, envisions a future in which bioinformatics challenges are continually solved to keep pace with the ever-increasing volume of data being produced.

Dr. Hemberg’s field of research is computational genomics. He develops methods and models to help understand gene regulation, and he has a long history of collaborating with experimental groups, mainly in neurobiology, helping them to analyze and interpret their data.

Dr. Hemberg started out studying math and physics, but towards the end of his undergraduate studies he decided that biology was going to be his route into research. He went on to carry out his postgraduate and PhD work in theoretical systems biology at Imperial College London under the supervision of Professor Mauricio Barahona, and was then a postdoc in the Kreiman lab at Boston Children’s Hospital, working on analyzing ChIP-seq and RNA-seq data. In 2014, Dr. Hemberg moved to Cambridge, UK, to set up a research group at the Wellcome Sanger Institute with the goal of developing methods for analyzing single-cell RNA-seq data.

Following relocation to the USA in 2021 to establish a new research group in Boston, Dr. Hemberg is now embarking on the next stage of his career.

How did you first get involved in analyzing sequencing data?

With my background in theoretical systems biology, it was interesting to come across a team of ‘experimentalists’ in the lab next door to mine at Boston Children’s Hospital. It was just around the time when second-generation sequencing technology was becoming available, and my neighbors were optimizing their ChIP-seq and RNA-seq protocols but had no idea how they were going to analyze the data. So I got involved, and worked on this for the duration of my postdoc contract.

Then, thinking back to my experience with single-cell analysis at Imperial – partly mathematical, not genome-wide, just looking at individual cells – it seemed clear to me that these two fields were going to converge. For my first Principal Investigator post (at the Wellcome Sanger Institute) I pitched ideas around systems biology, mathematical biology, and biophysical approaches using single-cell RNA-seq data. However, as we started to do this, it quickly became clear that we did not really understand the data, so we needed to switch to method development to enable us to process the data and separate the signal from the noise. This became the focus of the lab: developing novel methods to analyze single-cell data sets.

How did this develop, and what is driving your work now?

It has been an interesting journey. We have two types of projects. First, there are what I would call ‘in-house’ projects, where we work on methods and analyze public data sets independently, to improve analysis or come up with novel ways of approaching the data. The second group of projects is ‘applied’: we collaborate with external groups and help them analyze their data. There is a really nice synergy here. Often, I will be talking with someone about their problems and realize that we have a new method that can offer a solution. It works the other way around, too, when we realize that a specific issue faced by one group is actually a generic problem, and we can then work to find a solution for everyone.

This work has now transferred to Boston, where I am setting up a new group working to the same template – the same approach. I moved during 2021, and I can definitely advise colleagues that it is not a good idea to move continents with your family, change jobs and try to set up a new team during a pandemic!

What we will eventually be doing is defining some of the key problems – from an informatics point of view, from a technology point of view and, just as importantly, from a biological point of view – looking at where there are currently no good solutions, and working to develop methods and tools to solve these challenges and deal with the ever-increasing volume of data being produced. It is a very exciting prospect. Clearly, as data set sizes increase, you need to rely on more sophisticated software and data processing tools that can cope with the scale.

One of your most cited papers presents guidelines for the computational analysis of single-cell data. Why has it been so important?

Initially, we were thinking only about ourselves and our own research but, as we were working out how best to approach our data, I thought we should write it down and put it into a form where it might help others. In addition to the publication, we also formatted the information as an online course. The feedback has been great: every time I am at a conference, people mention how they learned to handle their data using our principles.

What plans do you have to update this work?

It’s definitely a work in progress. We just hit 10,000 visitors per month on the course portal, so I’m absolutely committed to keeping it as current as possible. In the past, we updated it only occasionally, as time allowed, but now we have a dedicated person who will be looking after it as part of their day-to-day responsibilities.

What is your view regarding the best ways to organize the data?

I always tell my students, many of whom come from a biology background rather than mathematics or informatics, that computing courses usually include a module called ‘algorithms and data structures’. There is a very good reason you put these two things together: if you have the right data structure, the algorithm becomes trivial, whereas if you have the wrong data structure, the algorithm becomes a ton of work!
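To make the point concrete, here is a minimal sketch in Python (an invented illustration, not code from Dr. Hemberg's lab; the gene and cell names are made up): the same query requires a full scan when the records sit in a flat list, but becomes a single lookup once they are indexed in a hash map.

```python
# Toy example: the same task with two data structures.
# Task: how many cells express a given gene, from a list of
# (cell_id, gene) observations.

from collections import defaultdict

observations = [
    ("cell_1", "GeneA"), ("cell_2", "GeneA"),
    ("cell_2", "GeneB"), ("cell_3", "GeneC"),
]

# "Wrong" structure: a flat list forces a scan per query -> O(n) per gene.
def count_with_list(gene):
    return sum(1 for _, g in observations if g == gene)

# "Right" structure: index once into a hash map -> O(1) per query.
counts = defaultdict(int)
for _, gene in observations:
    counts[gene] += 1

print(count_with_list("GeneA"))  # 2, after scanning every record
print(counts["GeneA"])           # 2, via a single dictionary lookup
```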

In addition, a more practical consideration is data set size. If you have 10 MB, you can store the data on a hard drive and do a linear search to get what you want, but if you have 10 GB, you need much more efficient ways of accessing and querying your data. There is absolutely no doubt that this is one of the keys as you scale up to very large data sets – which is what is really driving the field, so it’s critical.
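As a hypothetical illustration of that scaling argument, again in Python and not tied to any particular tool: a linear scan touches up to every element, while keeping the data sorted lets a binary search answer the same query in about twenty steps for a million records.

```python
# Toy example: linear scan vs. binary search on a sorted index.

import bisect
import random

# One million sorted positions standing in for a large, indexed data set.
positions = sorted(random.sample(range(10**8), 10**6))
target = positions[500_000]

# Linear scan: O(n) -- touches elements one by one until the target is found.
idx_linear = next(i for i, p in enumerate(positions) if p == target)

# Binary search on the sorted data: O(log n) -- roughly 20 comparisons here.
idx_binary = bisect.bisect_left(positions, target)

assert idx_linear == idx_binary == 500_000
```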

What has been the impact of biobanks – and the expanded nature of data types being collected, computed and queried?

I see biobanks as simply increasing the number of opportunities available to us. On the one hand, the first papers to come out in a new field, or after a major development, most often leave potential for subsequent re-analysis and re-working of the data sets to cover angles that were not looked at by the original authors.

On the other hand, and perhaps more exciting still, is the opportunity to combine data sets and data types and see more than you could see with just one modality. That is really important.

Moreover, you can use these expanded data sets to validate any number of new hypotheses by going back and analyzing the biobank data without having to go out and do the experiments yourself. 

Can you summarize your view of the ‘grand challenges’ in bioinformatics today?

That’s a tough one. I’ll answer in two ways. At a high level, genomics is about understanding the ‘genotype to phenotype’ mapping: how is information encoded and stored in the genome, and how is it read out? That will take a while to sort out.

In terms of less ambitious goals, I’m very interested in the regulatory code, and I think we can make significant progress on this. It’s a case of understanding how the information for adjusting the expression levels of different genes is encoded within the genome through promoters and enhancers.

Specifically, in terms of single-cell analysis, new technology development is really driving the field. I have to keep my ear to the ground to understand these new technologies and how they can help us solve the day-to-day problems we have. For example, when the costs of cell isolation and RNA sequencing started to fall, researchers took advantage: new protocols emerged where pooled samples were sequenced ‘in bulk’, and the results were deconvoluted to identify individuals. The amount and complexity of the data increased dramatically. In 2016, the problem did not exist; then, around 2018, the first publications relating to these new experiments came out, and we started looking at enhancing analysis methodology. What has become obvious over the years is that a difficult computational challenge can be totally solved by a better assay – or, conversely, a new assay can throw up a new, interesting and challenging bioinformatics and computational problem. The moral is clear: you need to be nimble.
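To sketch the kind of deconvolution described above: genotype-based demultiplexing tools (demuxlet, published in 2018, was among the first) assign each cell in a pooled run to the donor whose known genotype best matches the alleles observed in that cell's reads. The toy Python below shows only the core matching idea, with invented donors and SNPs; real methods use proper likelihood models, sequencing-error rates and doublet detection.

```python
# Hypothetical sketch of genotype-based demultiplexing of pooled cells.
# 0 = reference allele, 1 = alternate allele; donors and SNPs are invented.

genotypes = {
    "donor_A": {"snp1": 0, "snp2": 1, "snp3": 0},
    "donor_B": {"snp1": 1, "snp2": 0, "snp3": 0},
    "donor_C": {"snp1": 1, "snp2": 1, "snp3": 1},
}

# Alleles observed in one cell's reads, per SNP site.
cell_observations = {"snp1": [1, 1], "snp2": [0], "snp3": [0, 0, 0]}

def assign_cell(observed, genotypes):
    """Assign the cell to the donor matching the most observed alleles."""
    def score(donor_alleles):
        return sum(
            allele == donor_alleles[site]
            for site, alleles in observed.items()
            for allele in alleles
            if site in donor_alleles
        )
    return max(genotypes, key=lambda donor: score(genotypes[donor]))

print(assign_cell(cell_observations, genotypes))  # -> "donor_B"
```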

Martin, thank you for your time, and for giving us a most interesting perspective on single-cell data analysis.

For more information on how our technology can help to transform your data analysis, as well as information on our REVEAL: SingleCell app, contact lifesciences@paradigm4.com.

Author bio

Martin Hemberg, Ph.D., is Assistant Professor of Neurology at Brigham & Women’s Hospital and Member of the Faculty at Harvard Medical School, based at The Evergrande Center for Immunologic Diseases (a joint program of Brigham and Women’s Hospital and Harvard Medical School). Prior to this role, he was a CDF Group Leader at the Wellcome Sanger Institute.

Dr. Matt Brauer - How to grow a biomed startup

Biobanks offer a rich source of data for biomed startups, but the route to deriving maximum benefit from them is not always straightforward. We talk to Matt Brauer, Vice President of Data Science at Maze Therapeutics, about his experience with the UK Biobank, FinnGen and other consortia, and the role that making data-management processes open-source has played in delivering business success.

Dr. Lygia Pereira - Thirty years in genetics

The course of academic research rarely runs smoothly, and over her 30-year career in genetics, Professor Lygia V. Pereira has certainly seen plenty of challenges – but also lots of successes. We talk to her about the importance of genetic diversity, both to better serve the health needs of countries like Brazil and to improve our understanding of disease and health across all populations, and about why we need to make bioinformatics tools truly user-friendly.

Dr. Ahmed Hamed - Network science

In the latest interview in our cutting-edge series, Ahmed Abdeen Hamed, Assistant Professor of Data Science and Artificial Intelligence at Norwich University, explains the power of network science to solve intractable data analysis problems, and why this approach is proving valuable in the medical sciences.