Machine learning is a field of computer science concerned with algorithms that learn from data instead of following explicitly programmed rules. It has developed over the past few decades as an approach to artificial intelligence and computational modeling, and it can perform tasks such as image recognition and language translation. Companies in many industries, including finance, manufacturing, and healthcare, are already applying machine learning technologies.
The question before us is whether machine learning can be used in biotechnology. Based on current evidence, the answer seems to be yes: machine learning techniques have already been applied successfully in diverse areas of bioinformatics such as genomics, proteomics, and drug discovery.
What Is Machine Learning In Biotechnology
Biotechnology is a diverse field which uses biological material and biological processes in the development of industrial and health applications. Machine learning in biotechnology is the use of machine learning techniques to generate knowledge and understanding from data, which can then be used to develop new products, improve existing ones or provide better insights into biology.
The accuracy of structure prediction has grown from 70% to more than 80% thanks to the use of ML.
Machine learning in biotechnology involves the use of computational algorithms and statistical models to analyze large data sets on genome sequences and other biomolecules.
It enables analyzing complex patterns in the data to predict how individual genes and proteins are related, how they interact with each other, or whether they cause diseases. It also helps discover previously unknown relationships among specific gene changes that predispose people to disease.
Benefits Of Machine Learning In Biotechnology
Deep learning is a subset of machine learning that uses neural networks to learn from data. Neural networks are models loosely inspired by the brain that learn to perform tasks by example. Deep learning is used for both business and scientific functions, such as data science and image recognition.
A survey of pharma and life sciences specialists found that 44% were employing artificial intelligence in their research and development efforts. PricewaterhouseCoopers, meanwhile, predicts that by 2030 artificial intelligence will contribute USD 15.7 trillion to global economic output.
The life sciences sector is booming, thanks in large part to advances in biotechnology. Machine learning software is making it possible for data scientists to process vast amounts of data more efficiently, and the biotechnology industry is reaping the benefits.
1. Big Data
Biotechnology generates a lot of data: biological, clinical, proteomic, and biomedical datasets. With the increased availability of bioinformatics tools, these datasets have grown rapidly over the past decade. The number of bioinformaticians has also risen sharply in the last couple of years.
Many of these datasets are highly complex and require advanced statistical analysis. This is exactly where machine learning can be applied to help make sense out of the data by using predictive methods to extract patterns from large volumes of data.
2. Identify Unknown Genes And Proteins
Genes are governed by complex regulatory networks, and much of that regulation is carried out by proteins that are themselves regulated. Machine learning can identify unknown proteins and genes by detecting meaningful patterns in biological data. Machine learning algorithms can then be used to predict the functions of new genes and proteins, identify their relationships with other molecules, or detect molecular interactions that have previously gone unnoticed.
Each business, BioNTech and Moderna, received around $1 billion in outside investment for the development of the vaccine.
In other words, if you have a dataset that contains a list of the 25 most abundant proteins in your sample, you might wonder how those proteins interact with each other. Machine learning techniques can help answer this by finding patterns among known and unknown genes, protein interactions, or any other biological phenomena represented in the dataset.
3. Data Mining
Data mining refers to the process of extracting patterns from massive data sets and discovering hidden relationships within them. It is a key component of machine learning, which can be used to derive new insights from large datasets.
A typical example would be analyzing a class of genes, such as transcription factors, using publicly available gene expression datasets (e.g., from the Gene Expression Omnibus, GEO, on the NCBI website).
Machine learning algorithms identify patterns across the datasets retrieved from these studies, such as the same gene being expressed at different levels in certain cell types or under a specific condition. When enough patterns have been collected, researchers can use them to generate hypotheses that can be tested in further experiments.
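As a rough illustration of the kind of comparison described above, the following Python sketch scores each gene by how differently it is expressed between two conditions, using a log2 fold change and a Welch t-statistic. The gene names and expression values are invented for this example, and a real analysis of GEO data would normally rely on dedicated packages and multiple-testing correction instead of this minimal version.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def welch_t(control, treated):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical data: gene -> (control replicates, treated replicates)
expression = {
    "TF_A": ([5.1, 4.9, 5.3], [10.2, 9.8, 10.5]),
    "TF_B": ([7.0, 7.4, 6.8], [7.1, 6.9, 7.3]),
    "TF_C": ([12.0, 11.5, 12.4], [2.9, 3.3, 3.1]),
}

# Rank genes by the magnitude of their expression change, largest first.
ranked = sorted(
    expression,
    key=lambda gene: abs(log2_fold_change(*expression[gene])),
    reverse=True,
)

for gene in ranked:
    control, treated = expression[gene]
    print(
        gene,
        round(log2_fold_change(control, treated), 2),
        round(welch_t(control, treated), 2),
    )
```

Genes at the top of the ranking are candidates for follow-up, not conclusions; the hypotheses they suggest still need to be tested experimentally.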
4. Targeted Therapy
Targeted therapy involves acting on a specific pathogenic molecule that causes or exacerbates a disease. These targets are usually small biological molecules, but they can also be large classes of proteins or even entire pathways. Targeting these molecules and pathways is crucial for the therapeutic development process in biotechnology.
Machine learning can identify specific and relevant molecules by discovering patterns in a dataset and using them to predict which molecules are promising new targets. Algorithms can identify these molecular targets from experiments that have already been conducted, e.g. from gene expression data.
5. Discovery Of Genetic Risk Factors For Various Diseases
Understanding the interactions among genes and proteins that predispose people to disease is a big challenge in biotechnology. Machine learning is able to discover meaningful patterns in large sets of gene expression studies or protein interaction datasets and then predict which specific genes are associated with a certain disease or which proteins are associated with different diseases.
The computational methods behind machine learning algorithms enable them to uncover molecular changes that predispose people to disease and that could potentially be addressed by targeted therapy.
6. New Computational Methods
Machine learning gives us opportunities to develop new computational methods and algorithms that can find patterns in biological datasets. Such methods have already been developed and applied in biotechnology, including:
A medicine that is wildly successful and brings in at least $1 billion in annual sales is considered a blockbuster.
Cluster analysis, which finds overlapping patterns and groups similar objects that belong together, is a commonly used method to compare biological datasets. In the context of biology, this means comparing gene expression data from related cell types or different conditions in a dataset (e.g., conditions of different endpoint genes). This can be used to create hypotheses on how a certain pathway is turned on or off in a certain cell type under certain conditions.
These computational methods have been applied to discover biological and biochemical networks, which are reconstructed from gene expression data, protein interaction data, and other biological knowledge.
Using computational methods like clustering and association analysis, machine learning algorithms are able to derive new insights into the relationships between different genes and proteins. Using these methods also helps identify new pathways that could become therapeutic targets.
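To make the idea of clustering expression profiles more concrete, here is a minimal Python sketch of k-means, one common clustering technique. The article does not name a specific algorithm, so k-means is used here only as an example. The gene names, the three conditions, and the expression values are all hypothetical, and the initialisation is deliberately naive.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns (labels, centroids).

    Centroids start at the first k points, a naive deterministic choice
    made for this sketch; real tools use smarter initialisation.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression of four genes across three conditions.
profiles = {
    "geneA": [9.0, 8.5, 1.0],
    "geneB": [1.2, 0.8, 7.9],
    "geneC": [8.8, 9.1, 1.3],
    "geneD": [0.9, 1.1, 8.2],
}

genes = list(profiles)
labels, _ = kmeans([profiles[g] for g in genes], k=2)

clusters = {}
for gene, label in zip(genes, labels):
    clusters.setdefault(label, []).append(gene)
print(clusters)
```

Genes that land in the same cluster share a similar expression pattern across conditions, which is the kind of grouping that can suggest a shared pathway.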
7. Drug Development
Drug discovery is an important part of biotechnology, as it enables treating diseases that are currently incurable. The purpose of a drug is to target a specific molecule or pathway that causes or exacerbates a disease.
Identifying a potential drug target is only the first step; candidate drugs are not immediately approved for clinical use. The pharmaceutical industry has to go through a stringent process to prove that a drug is effective and safe before it can be used in the clinic.
With an AUC (“area under the curve”) of roughly 70%, the number of false alarms dropped by 73%.
Machine learning can be used to aid drug discovery and development by identifying potential new therapeutic targets, finding compounds that can interact with these targets, or even finding substitutes for existing drugs.
Machine learning methods can be applied to find patterns in gene expression data or protein-protein interaction datasets that identify new potential therapeutic targets for existing drugs or even molecules that have previously been overlooked as a therapeutic target.
8. Genome-Wide Association Studies
We know that certain genetic variations can influence a person’s risk of having a disease, but we often don’t know which specific variant or genetic difference is responsible. Finding the responsible variants is challenging because the genome contains millions of variants, most of which are unrelated to the disease in question. Genome-wide association studies address this by comparing variant frequencies between people with and without the disease.
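A simplified sketch of the per-variant test behind an association study is shown below. It assumes a 2x2 table of allele counts (risk allele versus other allele, in cases versus controls) and applies a chi-square test with one degree of freedom. The variant names and counts are made up, and the 5e-8 cutoff is the commonly cited genome-wide significance threshold, included here as an assumption about how results would be filtered.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # commonly cited cutoff; an assumption for this sketch


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]].

    Assumes no row or column total is zero.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


def p_value_1df(chi_square):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical allele counts:
# (risk allele in cases, other allele in cases,
#  risk allele in controls, other allele in controls)
variants = {
    "variant_1": (650, 350, 500, 500),
    "variant_2": (505, 495, 500, 500),
}

for name, counts in variants.items():
    statistic = chi_square_2x2(*counts)
    p_value = p_value_1df(statistic)
    print(name, round(statistic, 2), p_value, p_value < GENOME_WIDE_THRESHOLD)
```

Because a real study repeats this test for a very large number of variants, a strict threshold is needed to keep chance findings under control.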
Risks Of Machine Learning In Biotechnology
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As a data scientist, you will likely encounter real world data that is unstructured and requires NLP methods to make it manageable. Some key deep learning methods for NLP are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs).
1. Biased Data
Machine learning algorithms can be biased, for example when a model systematically misclassifies positive data samples as negative. The algorithm then makes mistakes when deciding whether a sample belongs to a certain category. Because such errors are easy to overlook, predictions need to be checked carefully against samples whose true category is known.
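One simple way to see whether a model tends to label positive samples as negative is to count its errors by type. The sketch below, in Python, tallies a confusion matrix and derives sensitivity and specificity from it. The labels are invented for illustration, with 1 standing for a positive sample and 0 for a negative one.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            counts["tp"] += 1
        elif truth == 0 and predicted == 1:
            counts["fp"] += 1
        elif truth == 0 and predicted == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # positive sample misclassified as negative
    return counts


def sensitivity(counts):
    """Share of truly positive samples that the model recognised."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    """Share of truly negative samples that the model recognised."""
    return counts["tn"] / (counts["tn"] + counts["fp"])


# Hypothetical labels: the model misses half of the positive samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts, sensitivity(counts), specificity(counts))
```

A sensitivity that is much lower than the specificity is a sign that the model is biased toward the negative class.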
In biotechnology, machine learning can not only get us closer to discovering new molecular targets and drug candidates, but can also help us identify inappropriate samples (e.g., those with potential off-target effects) that should not be used for further experiments in biotechnology animal studies and clinical trials.
2. Biased Data Quality Control
At least 10% of people worldwide are thought to suffer from a mental illness.
Data is crucial to data-driven research, but, as we have seen in the previous article, with the pressure to publish and focus on high-impact publications in biotechnology, it can also be challenging to ensure that samples are what they appear to be, and are not misclassified.
This problem has been addressed by an approach commonly used in academic publishing called “data curation”. However, this approach still needs improvement if it is to remain useful when applied more widely in biotechnology practice.
In machine learning algorithms like those used for drug discovery and discovery of disease mechanisms (e.g. association analysis), it is crucial that data quality control (preprocessing, cleaning, and normalization) is performed correctly to ensure that the algorithm can properly identify relevant samples.
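The cleaning and normalization steps mentioned above could look roughly like the following Python sketch. It drops samples with missing measurements and then rescales each feature to zero mean and unit variance (z-scores). The sample identifiers and values are hypothetical, and real pipelines usually add further steps such as outlier checks and batch correction.

```python
from statistics import mean, pstdev


def drop_incomplete(samples):
    """Cleaning: keep only samples without missing (None) measurements."""
    return {sid: values for sid, values in samples.items() if None not in values}


def zscore_columns(samples):
    """Normalization: rescale every feature to mean 0 and standard deviation 1."""
    ids = list(samples)
    columns = list(zip(*(samples[sid] for sid in ids)))
    normalized_columns = []
    for column in columns:
        mu, sigma = mean(column), pstdev(column)
        # A constant feature carries no information; map it to 0.0.
        normalized_columns.append(
            [(value - mu) / sigma if sigma else 0.0 for value in column]
        )
    rows = zip(*normalized_columns)
    return {sid: list(row) for sid, row in zip(ids, rows)}


# Hypothetical measurements: two features per sample, one value missing.
raw = {
    "s1": [1.0, 10.0],
    "s2": [2.0, None],
    "s3": [3.0, 30.0],
    "s4": [5.0, 20.0],
}

clean = drop_incomplete(raw)
normalized = zscore_columns(clean)
print(normalized)
```

Normalizing after cleaning matters here: if the incomplete sample were kept, its values would distort the mean and spread used for the remaining samples.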
3. Data Overfitting
Overfitting means that a model has fit its training data so closely, including the noise in it, that it no longer performs well on new data samples that were not used to train the model (e.g., data from different cell types or patients with a different disease).
To limit this problem in biotechnology applications, models should be evaluated on data held out from training and re-tuned on a regular basis as new datasets become available or are uploaded into a database, so that they identify important relationships (e.g., genes involved in disease) that hold in the newly available data and not only in old datasets.
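The gap between performance on training data and on unseen data can be shown with a very small Python sketch. It uses a 1-nearest-neighbour classifier, chosen here only because it memorises its training set, on a made-up one-dimensional dataset in which two training labels are deliberately noisy.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def accuracy(train, data):
    """Fraction of points in `data` whose label the model predicts correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in data)
    return correct / len(data)


# Hypothetical measurements (value, label). The underlying rule is
# "label 1 when the value is above 3.5", but the points at 3.0 and 4.0
# carry noisy labels that the model memorises.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]
held_out = [(1.5, 0), (2.9, 0), (4.1, 1), (5.5, 1)]

train_accuracy = accuracy(train, train)
held_out_accuracy = accuracy(train, held_out)
print("training accuracy:", train_accuracy)
print("held-out accuracy:", held_out_accuracy)
```

The model scores perfectly on the data it has seen and noticeably worse on the held-out samples, which is the signature of overfitting that a held-out evaluation is meant to expose.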
4. Data Misinterpretation
Machine learning models can also misinterpret data, as was shown in a famous 2011 study performed by O’Reilly and others on fine wine. In the study, machine learning models were trained on a dataset of the chemical compounds that give wine its specific taste and smell.
When asked to predict the quality of a specific wine based on the chemical analysis of that wine, their model predicted that the wine would be rated 99 points out of 100 by an expert panel.
However, when the machine was presented with data from a different wine (of equal quality), it generated an entirely different sensory profile that was not rated nearly as high. Biotechnology applications of machine learning models need to be analyzed to determine the possible sources of bias and errors.
5. Statistical Errors
Statistical errors arise from error-prone practices such as inappropriate data preprocessing or the use of unsuitable algorithms for classification, regression, or clustering. If a researcher does not follow standard practices for data preprocessing (e.g., handling outliers, adjusting for confounding variables), incorrect conclusions may be drawn from the data alone without any other supporting evidence (e.g. from animal or clinical trials).
6. Invalid Statistical Sets
Data sets can also be invalid, because the data was collected in a way that is not scientifically sound. For example, two data points collected at different times may actually represent the same sample. If a dataset includes such duplicate samples, it will lead to incorrect conclusions and biased results (e.g., when trying to find disease genes).
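A basic check for the duplicate-sample problem described above is to group samples by their measurements and flag any group with more than one member. The Python sketch below does this for hypothetical sample identifiers and values. It only catches duplicates whose measurements match after rounding, so it is a first filter and not a complete solution.

```python
from collections import defaultdict


def find_duplicates(samples, decimals=6):
    """Group sample IDs whose measurements are identical after rounding."""
    groups = defaultdict(list)
    for sample_id, values in samples.items():
        fingerprint = tuple(round(value, decimals) for value in values)
        groups[fingerprint].append(sample_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical samples: S3 repeats the measurements recorded for S1.
samples = {
    "S1": [1.0, 2.0],
    "S2": [3.0, 4.0],
    "S3": [1.0, 2.0],
}

duplicates = find_duplicates(samples)
print(duplicates)
```

Flagged groups should be reviewed before analysis, since keeping both copies would count the same biological sample twice.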
7. Inappropriate Data Sampling
There are many cases of data being collected in an inappropriate manner at the beginning of biotechnology research projects. This can result in errors when trying to identify relationships between genes and diseases, and improper results from association studies (e.g. when the wrong data is used for analysis).
8. Improper Data Analysis
Data analysis errors can be committed when applying the wrong statistical methods or algorithms to evaluate a dataset and identify important associations. For example, applying a parametric test to data that violate its distributional assumptions can produce misleading results, while using a non-parametric method where a parametric one is justified sacrifices statistical power.
To avoid this problem, researchers need to understand the importance of selecting the right method for their problem (e.g., when preparing a manuscript for publication) so that they can properly analyze data and draw appropriate conclusions.
Machine learning will become increasingly important in biotechnology and medicine, as the amount of data generated continues to increase. The ability to identify relationships between biological samples, diseases, drugs and genetic factors using machine learning algorithms will allow researchers to develop more accurate hypotheses.
With more accurate hypotheses, researchers will be able to design trials that more closely reflect future clinical applications (e.g., trials better tailored to a specific disease).