Machine Learning in Biotech: Interview with Quantori

Article by Inna Logunova

November 29th, 2022

9 min read

In this interview, we speak with two representatives from Quantori, a data science and digital transformation company specializing in research and development in the biopharma industry.

Yuriy Gankin is a co-founder and chief scientific officer at Quantori. He’s a life science researcher, serial entrepreneur, and inventor and holds a Ph.D. in analytical chemistry from Tufts University and an MBA from MIT Sloan.

Maksim Kazanskii is Quantori’s engineering manager in machine learning. With a background in theoretical physics and computer science, he has significant experience in designing computer vision algorithms. Last year, he switched to developing ML and data engineering algorithms for health informatics.

Read the interview to learn about the cutting-edge frameworks being used in biotech today.

Please tell us about your company, your team, and your role there.

Yuriy Gankin: Quantori, headquartered in Boston/Cambridge, focuses on designing and developing intelligent IT and data science solutions for life sciences, pharmaceutical, and healthcare companies. Our offering includes advanced analytics, scientific knowledge management, and software engineering. We believe that we can shape the future by innovating drug discovery and improving patient care and want to transform life sciences through the power of digital IT.

Maksim Kazanskii: I work in Quantori’s Machine Learning and Cloud Engineering department, where I manage technical processes. The projects I’ve been involved in are both client- and research-oriented and relate to machine learning and data engineering.

What projects are you working on now?

Yuriy: I am currently focused on developing a new platform that accelerates time to market for life-changing technologies such as diagnostics and precision medicine, as well as novel biologic and small molecule drugs. Quantori’s platform for drug discovery and development aims to build new data integration and high-performance computational environments for global and early-stage biopharma companies. We have just published a paper in the Journal of Medicinal Chemistry. Using Quantori’s in silico drug design approach and technologies, we identified potent and selective anti-cancer compounds and experimentally verified the results in the lab. We were able to reduce the lead optimization effort roughly threefold and decrease the number of compounds to be tested by 70%.

Maksim: One of the projects I had a chance to work on was predicting the age of samples. The goal of the project is twofold. First, to provide a pipeline for preprocessing ancient DNA – as it turns out, there is a demand for such tools. The second goal is to create a new method for predicting the age of samples. Of course, there are several existing tools for age estimation. Still, a more advanced one is always welcome in science as some age estimation methods do not apply to certain conditions of the remains.

You may wonder what this has to do with machine learning. In fact, changes in DNA, such as degradation over time, have statistical patterns that algorithms can analyze. Deep learning provides a good start, enabling automatic feature engineering. As a result of this research, we plan to deliver a pipeline as open-source code and publish an article.
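One well-documented example of such a statistical pattern is the excess of C-to-T mismatches near the ends of ancient DNA reads, caused by cytosine deamination. The interview does not say which features Quantori's method uses, so the following is only a minimal sketch of how a damage profile could be computed. The function name `c_to_t_rate_by_position`, the gap-free pre-aligned input, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def c_to_t_rate_by_position(pairs, n_positions=5):
    """Return the C->T mismatch rate at each of the first n_positions of the reads.

    pairs: iterable of (reference, read) strings, already aligned and
    gap-free (a simplifying assumption for this sketch).
    """
    c_total = Counter()  # times the reference has a C at position i
    c_to_t = Counter()   # times that C was read as a T at position i
    for ref, read in pairs:
        for i, (r, q) in enumerate(zip(ref[:n_positions], read[:n_positions])):
            if r == "C":
                c_total[i] += 1
                if q == "T":
                    c_to_t[i] += 1
    # Positions with no reference C get a rate of 0.0
    return [c_to_t[i] / c_total[i] if c_total[i] else 0.0
            for i in range(n_positions)]


# Hypothetical toy data: (reference, read) pairs
toy_pairs = [
    ("CCGTA", "TCGTA"),
    ("CAGCC", "TAGCC"),
    ("CCTGA", "CCTGA"),
    ("GCCAT", "GTCAT"),
]
profile = c_to_t_rate_by_position(toy_pairs, n_positions=3)
print(profile)
```

A per-position profile like this could serve as a hand-crafted input feature, or as a baseline against which features learned by a deep model are compared.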

What were your biggest challenges while you were working on these projects?

Yuriy: My biggest challenge was designing a comprehensive yet easy-to-use data engineering and data analysis platform that would allow researchers to focus on science instead of IT or software development issues. In life sciences, we have to deal with complex objects such as molecular structures, genomic and proteomic data, medical images, clinical data, real-world evidence, electronic medical records, and other unstructured or poorly structured textual information. We need to collect, store, analyze, model, and visualize data from all those heterogeneous sources in a secure and regulated environment. In my opinion, integrative informatics is the key to success in in silico drug development and repurposing, patient stratification for clinical trials, and, ultimately, personalized medicine.

Maksim: The biggest challenge was probably that we underestimated the time required to process data. In machine learning, data processing typically involves finding a suitable data set, cleaning it, and transforming it. And only after these steps are complete can we move on to ML algorithms.
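The cleaning and transforming steps described above can be sketched in a few lines. This is a generic illustration using only the standard library, not the project's actual pipeline. The field names `sample` and `coverage` and the choice of min-max scaling are assumptions.

```python
def clean(records):
    """Drop rows with missing values and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in records:
        if any(v is None or v == "" for v in row.values()):
            continue  # missing value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # duplicate of an earlier row
        seen.add(key)
        cleaned.append(row)
    return cleaned


def transform(records, field):
    """Min-max scale one numeric field into the range [0, 1]."""
    values = [float(r[field]) for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (float(r[field]) - lo) / span} for r in records]


# Hypothetical raw records with one missing value and one duplicate
raw = [
    {"sample": "a", "coverage": "10"},
    {"sample": "b", "coverage": ""},
    {"sample": "a", "coverage": "10"},
    {"sample": "c", "coverage": "30"},
    {"sample": "d", "coverage": "20"},
]
prepared = transform(clean(raw), "coverage")
print(prepared)
```

Even in this toy case, two of the five rows are discarded before any modeling can start, which hints at why data preparation tends to take longer than planned.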

Can you describe your technology stack? What languages, frameworks, and libraries do you use most often?

Yuriy: Our cloud platform offering enables effortless ingestion and exploitation of semi-structured data, schema inference or schema enforcement, data source versioning, and real-time ingestion. It’s built on top of highly efficient metadata storage that supports the most widely used schema formats (Avro, Parquet, ORC, BigQuery, relational databases, Spark data types), as well as schema conversion and lineage. It enables data pipelines to be deployed into data lakes in minutes, without a “schema first” approach and without delay or loss of data. In Q-Flow, we bring the functionality of the best data engineering tools (Airflow, Prefect) to the ML domain by adding tracking, data lineage, and feature store functionality.

Apart from basic data engineering tools, only ZenML offers some, limited, support for collaboration features, and it doesn’t offer anything specifically for life sciences. Q-Flow’s in-progress features include a domain-specific language for defining pipelines, which removes the problems of specifying graphs manually, and a set of plugins for setting up local instances of resources such as a cache and a tracker and then spinning them up in the cloud (for example, on AWS or GCP). It will also enable setting up life-sciences-specific services, such as MRI and genomic processors, and provide advanced data lineage and feature store support.
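The difference between schema inference and schema enforcement mentioned above can be shown with a small sketch. It is not Quantori's implementation or the Q-Flow API. The helper names `infer_schema` and `violations`, and the example records, are hypothetical.

```python
def infer_schema(records):
    """Map each field to the set of Python type names observed for it."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # A field that is absent from some records is treated as nullable
    for field, types in schema.items():
        if any(field not in rec for rec in records):
            types.add("NoneType")
    return schema


def violations(record, schema):
    """Return the fields of record that are unknown or have an unexpected type."""
    return [f for f, v in record.items()
            if f not in schema or type(v).__name__ not in schema[f]]


# Inference: learn the schema from data that has already arrived
records = [
    {"id": 1, "gene": "TP53"},
    {"id": 2, "gene": "BRCA1", "score": 0.9},
]
schema = infer_schema(records)

# Enforcement: check new records against the learned schema
bad_type = violations({"id": "3", "gene": "EGFR"}, schema)
unknown_field = violations({"id": 3, "tissue": "lung"}, schema)
print(schema, bad_type, unknown_field)
```

Inferring the schema from incoming data is what makes ingestion possible without a "schema first" step, while enforcement protects downstream consumers once a schema has been agreed on.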

Maksim: Of course, my technology stack depends on the project I work on. I believe it is better to be task-oriented and adapt your stack to the project’s specific needs. However, I do have some preferences. For deep learning problems, I tend to use PyTorch. I believe that PyTorch is a more flexible and “easy-to-code” framework than TensorFlow or Keras. For research tasks, PyTorch has been gaining more and more popularity in the industry. In the long term, PyTorch may become even more prevalent than TensorFlow. What helps TensorFlow right now is how readily it can be used in production.

So let me explain the difference between research projects and production projects.

In research, you commonly have a fixed data set, and the goal is to create the model that performs best on that static data. Suppose, however, you want to use your working model in production. In this case, you need not only to deploy the model to a server but also to attach data and model monitoring tools to it, since the data changes dynamically. So, if the task is to bring a model to a production environment, I prefer to use TensorFlow and the TensorFlow Extended (TFX) framework. Sometimes we need to build a tool in the cloud, and for that I prefer the Amazon Web Services (AWS) stack. I am not being original here. Alongside that, I use frameworks and libraries based on Python, since it is the most flexible language, with an enormous number of libraries and frameworks for data analysis and machine learning.
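To illustrate why production models need data monitoring, here is a minimal drift check that compares live inputs with the training data. It is a simplified stand-in for dedicated tooling such as TFX, not an example of the TFX API. The function names and the threshold of 2.0 standard deviations are assumptions.

```python
from statistics import mean, stdev


def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    sd = stdev(train_values)
    if sd == 0:
        # Constant training feature: any change at all counts as unbounded drift
        return 0.0 if mean(live_values) == mean(train_values) else float("inf")
    return abs(mean(live_values) - mean(train_values)) / sd


def has_drifted(train_values, live_values, threshold=2.0):
    """Flag a feature whose live mean moved more than `threshold` deviations."""
    return drift_score(train_values, live_values) > threshold


# Hypothetical values of one input feature
train = [10, 11, 9, 10, 12, 8]
live_same = [10, 9, 11, 10]
live_shifted = [15, 16, 14, 15]

print(has_drifted(train, live_same))
print(has_drifted(train, live_shifted))
```

A research model is evaluated once on static data, so a check like this is unnecessary there. In production it would run continuously, and a raised flag would typically prompt investigation or retraining.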

From your point of view, what has been the most impressive trend in AI recently?

Yuriy: To me, the most impressive AI trends in the life sciences are the de novo design of drug candidates, for example, using the Variational AutoEncoder Generative Adversarial Network (VAE-GAN) and ML in predictive genomics. Scientists have extensively used machine learning to select the best drug candidates or prioritize tests. Today, with the help of ML, we can develop entirely new synthesizable chemical entities, some of which have already entered clinical trials. Whole genome sequencing is reaching a price point that could justify its inclusion in every patient’s medical record. The genome contains multidimensional information that can be used for diagnosis and, in some cases, therapy in multiple ways that we can’t even imagine at this point. Combining scientific breakthroughs and ML is needed to address many unresolved issues in interpreting complex variants.

Maksim: I think the most impressive current and future trend in AI is its application in healthcare and medicine. Perhaps this is why I work at Quantori. In my opinion, numerous areas can benefit from AI. In medical imaging, AI can not only perform on par with radiologists but, in some cases, surpass them. Just recently, Quantori published an article in Scientific Reports, a Nature Portfolio journal, describing a system that evaluates the severity of COVID cases based on X-ray images. AI has also laid a new foundation for drug discovery. In the next 10-15 years, we will see many new drugs emerge as a result of the AI revolution. Unlike traditional drug discovery methods, which rely more on wet lab experiments, AI and big data significantly lower development costs and move many experiments out of the lab and onto servers.

In addition, sophisticated algorithms can extract valuable new insights from the data. Take AlphaFold as an example. The algorithm, developed by DeepMind, a Google sister company under Alphabet, made it possible to predict the structure of proteins with high accuracy based solely on the amino acid sequence. Proteins are complex organic molecules that play an essential role in cells. They consist of long chains of amino acids, and knowing the spatial structure of such molecules opens doors for more precise medicine. Before AlphaFold, scientists used a variety of rather expensive lab-based approaches, such as mass spectrometry, to identify the structure of a protein. They required complicated experiments with single proteins. AlphaFold can predict the structure of a protein in a fraction of the time such experiments take.

What is the craziest thing you have done in your career?

Yuriy: Going back to school at the age of 54 to get my MBA.

Maksim: I think the craziest part of my career was starting a new degree in Machine Learning at Georgia Tech. I stopped working for an entire year to focus on my studies. However, I believe that my time investment will definitely pay off in the long run, so maybe that’s not the craziest thing!?

What programming languages/frameworks will be the most popular in a decade?

Maksim: Here I would talk about the languages and frameworks that are used in big data and data science. First of all, I believe that Python, for which so many libraries have been created, will remain one of the most popular and widely used languages. As data science and machine learning will have many more applications and much more impact in the future, Python will remain the king among data science languages.

Regarding deep learning frameworks, I believe PyTorch will become the leader that will push back TensorFlow and Keras. In my opinion, PyTorch is more user-friendly. It is easier to troubleshoot the programs written in PyTorch than in TensorFlow.

Also, I expect Scala to claim a large market share thanks to its scalability. In addition, the code written in Scala is very concise. We can already observe this trend, as Scala is one of the fastest-growing languages.

What AI trends will we be embarrassed about when we look back on them five years from now?

Yuriy: Black-box ML models in life sciences, especially those based on under-curated or erroneous data. Medical professionals and scientists need ways to understand the basis for the insights provided by AI models. The lack of understanding leads to mistrust and low uptake of ML modeling in medicine beyond scientific publications. There are several approaches to address this. We recently published a paper in Nature Portfolio’s Scientific Reports that offers a way to ease the ongoing burden on the global healthcare system and, most importantly, enables timely and personalized treatment of patients with COVID-19 and other lung diseases.

In this research study, we present a novel two-stage disease scoring workflow based on image segmentation and multi-task learning for segmentation and assessment of COVID-19 infections. Our approach provides results that radiologists can interpret, as areas of infection are visually identified. This eliminates the “black box” issue associated with most deep learning models. The “black box” is the reason why doctors have been reluctant to use deep learning, which provides an answer but gives little to no information about how it was obtained.
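A score derived from segmentation masks is easier to interpret because every number can be traced back to visible regions of the image. The interview does not describe the paper's actual scoring formula, so the sketch below simply assumes the score is the fraction of lung pixels marked as infected. The function name `infected_fraction` and the toy masks are hypothetical.

```python
def infected_fraction(lung_mask, infection_mask):
    """Share of lung pixels that are also marked as infected.

    Both masks are equally sized 2D lists of 0/1 values, as a
    segmentation stage might produce them.
    """
    lung = 0
    infected = 0
    for lung_row, infection_row in zip(lung_mask, infection_mask):
        for in_lung, is_infected in zip(lung_row, infection_row):
            if in_lung:
                lung += 1
                if is_infected:
                    infected += 1
    return infected / lung if lung else 0.0


# Hypothetical 3x4 masks; the infection pixel outside the lung is ignored
lung_mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
infection_mask = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
score = infected_fraction(lung_mask, infection_mask)
print(score)
```

Because the masks can be overlaid on the X-ray, a radiologist can check which regions produced the score, which is not possible with a model that outputs only a severity number.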

Maksim: I think natural language processing, especially chatbots, is an overrated trend. Don’t get me wrong: the current level of chatbots is impressive. Indeed, algorithms like GPT-3 can produce excellent texts or be a perfect companion. Some data scientists at Google have even suggested that an NLP bot has become sentient. But in general, a chatbot is just a statistical algorithm that does an excellent job of juggling words without giving them meaning. NLP algorithms may one day become sophisticated enough to replicate human consciousness. But I believe that the creation of AGI (Artificial General Intelligence) is still a long way off, perhaps a few decades away.

We hope that you enjoyed our interview with Yuriy Gankin and Maksim Kazanskii from Quantori.
