Jesse Johnson: Bringing Together AI and Medical Research

Today, we speak with Jesse Johnson, Ph.D., a prominent data scientist with over ten years of experience in AI. After a number of years leading data teams at biotech startups, Jesse recently founded a company called Merelogic, whose goal is to help biotech organizations turn their machine learning proof-of-concept projects into tangible impact.

Jesse Johnson has transitioned from academia to biotechnology, applying his deep expertise in mathematics and data engineering to practical research and analysis of genomic structures and medications.

He began his career as a lecturer and researcher at Yale University, where he focused on the topology and geometry of abstract three-dimensional spaces. After years in academia, he was invited to work on real-life projects as a software engineer at Google.

Two years later, he moved on to biotech, consistently building his career in this field and ultimately gaining a top position at Dewpoint Therapeutics before starting his own consulting group. In this interview, we talked about his career, the global challenges of biotech, and whether there is a “silver bullet” solution for achieving better results in bioengineering.

Interview with Jesse Johnson

Your career path is truly remarkable: first academia, then a job at Google, now biotech. What inspired you to make the transition to biotechnology? What challenges did you hope to address in this field?

At Google, I was working on data analysis for hotel search, with one project related to verifying the prices we listed and another pertaining to entity resolution. It turns out that if two different sources list similar-sounding hotels, it’s surprisingly hard to tell whether they actually refer to the same place. It was a great introduction to the technology Google was building, their approach to engineering, and the many kinds of data quality concerns I’ve since seen in every other place I’ve worked.

Most of the teams were excited about those technical challenges, and so was I. It was interesting to solve real-world business problems by applying advanced AI technologies. But after a couple of years, I began to feel that I wanted something with a greater impact. That’s why I transitioned into the healthcare space and then into biotech. Leaving a comfortable, well-paying job at Google to join a large, century-old pharmaceutical company may have seemed like a risky move, but it turned out to be the right decision in the long run.

What is the biggest challenge for you as a data scientist in the biomedical field?

The main challenge is to establish a productive collaboration between the data science team, biologists, and chemists, ensuring that everyone has a clear understanding of the experimental constraints, data collection, and model-building possibilities. No one has a complete understanding of all the different moving parts, so identifying the most promising opportunities takes a lot of back and forth.

I have read your blog “Scaling Biotech”, in which you discuss bridging the gap between the mental models of biologists, or wet lab researchers, and data scientists. What are the key differences between their approaches?

A biologist is well-versed in the various data sources available in the lab, including their reliability and the time and financial costs associated with acquiring them. When conducting a cost/benefit analysis, they often concentrate on immediate and specific use cases. These cases usually have narrow scopes, so they may underestimate the benefits while overemphasizing the costs.

In contrast, a typical data scientist may have a limited understanding of the available data sources and focus on what has already been or is being collected rather than considering what could be gathered with the proper inquiry. However, their cost/benefit analysis will likely consider potential future use cases, overemphasizing the potential benefits.

What are the specific features of biological research as compared to other natural sciences?

Since its inception, biology has existed in an environment where data analysis is not easily accessible. Unlike physics and chemistry, where a limited number of experiments can produce universal insights, biological data is typically collected within a particular context due to the field’s heterogeneity. It can be challenging to determine which aspects of the data are broadly applicable and which are only relevant to the particular context in which they were gathered.

What is the nature of the work done by Merelogic?

In my own work in biotech and from my conversations with other leaders and practitioners in the area, the most frustrating problem that tends to come up is when promising and potentially impactful projects get stuck at the proof-of-concept phase. You may get some really exciting results in a Jupyter notebook and get good feedback on internal presentations, but actual adoption into the scientific workflows and pipeline is often much more elusive. In my experience, there are some common reasons for this, all of which can be addressed with varying amounts of effort. So with Merelogic, I’m exploring different ways to help biotech organizations make this happen.

I am curious about the name of your company – where does it come from?

Well, I wanted to pick a more or less unique name, which is hard to do these days. What I ended up with is derived from “mereology,” which is the study of how individual parts come together to form a whole. You have different parts of a biotech organization, including the data and wet lab teams, but you really need to make them function as a single whole.

What kinds of projects do you typically work on?

In biotech, anything you figure out needs to be confirmed with an experiment in the lab, so you can frame any analysis in terms of selecting or designing the next experiment. In other words, you want to take the data from the first N experiments to determine the best experiment N + 1. This might be identifying a molecule that should be tested more or deciding how the experimental conditions should be tweaked. In any case, the first hard part is to make that connection from the model prediction to a decision about experiment N + 1. The second hard part is to convince the team doing the experiment to actually do it that way.
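Editor's note: the idea of using the first N experiments to choose experiment N + 1 can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not Jesse's or Merelogic's method. The function names, the inverse-distance "model", and the `explore` weight are all hypothetical stand-ins; a real project would plug in its own trained model and its own notion of which conditions are worth testing.

```python
import math

# Hypothetical data layout: each past experiment is (conditions, outcome),
# where conditions is a tuple of numbers and outcome is a measured score.

def predict(candidate, history, eps=1e-9):
    """Stand-in model: inverse-distance-weighted average of past outcomes."""
    weights = [1.0 / (math.dist(candidate, x) + eps) for x, _ in history]
    total = sum(w * y for w, (_, y) in zip(weights, history))
    return total / sum(weights)

def novelty(candidate, history):
    """Distance to the closest condition that has already been tested."""
    return min(math.dist(candidate, x) for x, _ in history)

def select_next_experiment(history, candidates, explore=0.5):
    """Pick the candidate with the best predicted outcome plus a novelty bonus.

    explore=0 trusts the model completely; larger values favor untested regions.
    """
    def score(c):
        return predict(c, history) + explore * novelty(c, history)
    return max(candidates, key=score)

# Two completed experiments and three candidate conditions (made-up numbers).
history = [((0.0,), 0.1), ((1.0,), 0.9)]
candidates = [(0.1,), (0.9,), (2.0,)]
next_experiment = select_next_experiment(history, candidates)
```

The score is where the model prediction turns into a decision: changing `explore` changes which experiment gets proposed, and that trade-off is exactly the kind of thing the data team and the bench team would need to agree on.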

What are the most impressive results you’ve seen?

I’ll give you two examples that might sound simple but ended up being very impactful. In the first, we picked drug candidates by looking at digital microscopy images, and the lab used fairly simple formulas based on segmenting the images. The ML team had built a classification model trained on “positive controls”; in other words, these were examples of the behavior that the lab was looking for. This model turned out to be able to detect much more subtle differences, but there were two problems. First, it took so long to get the data from the lab to the ML environment that the bench teams had moved on by the time the predictions were available. Second, no one had actually talked to the bench teams about using these ML-based predictions. Once we got those problems ironed out, we had our first candidates identified through ML.

For the second example, we wanted to predict the most effective protein sequences for a multi-step process. We had readings from different steps in the process, but some were quite noisy and incomplete. So we had to be very deliberate about which parts of the problem we modeled and how we accounted for the other parts. In the end, we were able to create a classification model that could tell us which proteins to try next. As we are discussing this, those proteins are being created for the next experiment.

Could you describe your preferred technology stack? Which languages/frameworks/libraries do you tend to use most often?

I mostly write Python because it’s general-purpose and flexible. I don’t have a preferred ML framework, but for APIs, I’ve been pretty happy with Django lately. For data storage, most of the data in early-stage biotech research is either very large raw data that is best stored as flat files in S3 or relatively small structured data that you can just throw into Postgres. Beyond that, there are a lot of custom tools for specific data types, like sequencing data or digital microscope images, where I tend to use open-source options.

What would you name as the most impressive trend in AI recently?

It’s not really a trend, but I’ve been impressed with how image analysis tools originally designed for tasks like distinguishing cats from dogs can be adapted for segmenting complex cell images. To my eye, those images look completely different, but a number of groups have found that you can get a boost from models pre-trained on these large image corpora.

What, in your opinion, is the most ridiculous AI invention of the century?

I don’t know about the most ridiculous invention overall, but at some point, I asked ChatGPT to suggest articles on a very specific topic and realized it was just making up plausible titles. This happens even though machine learning models have achieved incredible accuracy and there is a huge knowledge base to support their predictions. I’ve found some great ways to use ChatGPT effectively, but it’s kind of ridiculous that it sounds 100% confident whether it’s telling the truth or lying, so you can’t tell which.

Today, there’s often an over-emphasis on optimizing model architecture over improving data quality. I’ve seen plenty of projects where you hit a wall with how accurately you can predict things, no matter how much you tweak the model. Often people don’t think about how they can collect more accurate data. I hope that’s widely recognized as an embarrassing mistake five years from now.

Any takeaways you would like to share with our readers?

My main takeaway is that AI and machine learning have enormous potential to revolutionize the field of medical science. But the kinds of problems you’re going to face are very different from the technical problems you may be used to from traditional tech. It’s more about understanding healthcare and biomedical science, collaborating with the lab teams, and figuring out how to get better and more relevant data.

That’s exciting. We’ll be following the developments at your company and overall in biotech. Thank you for your thoughts!


