
A guide to machine learning for biologists

REVIEWS

Joe G. Greener1,2, Shaun M. Kandathil1,2, Lewis Moffat1 and David T. Jones1

Abstract | The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks.




We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.

1. Department of Computer Science, University College London, London, UK.
2. These authors contributed equally: Joe G. Greener, Shaun M. Kandathil.
e-mail:
s41580-021-00407-0
Nature Reviews | Molecular Cell Biology

Humans make sense of the world around them by observing it, and learning to predict what might happen next. Consider a child learning to catch a ball: the child (usually) knows nothing about the physical laws that govern the motion of a thrown ball; however, by a process of observation, trial and error, the child adjusts his or her understanding of the ball's motion, and how to move his or her body, until he or she is able to catch it reliably. In other words, the child has learned how to catch the ball by building a sufficiently accurate and useful 'model' of the process, by repeatedly testing this model against the data and by making corrections to the model to make it better.

'Machine learning' refers broadly to the process of fitting predictive models to data or of identifying informative groupings within data. The field of machine learning essentially attempts to approximate or imitate humans' ability to recognize patterns, albeit in an objective manner, using computation. Machine learning is particularly useful when the dataset one wishes to analyse is too large (many individual data points) or too complex (contains a large number of features) for human analysis and/or when it is desired to automate the process of data analysis to establish a reproducible and time-efficient pipeline. Data from biological experiments frequently possess these properties; biological datasets have grown enormously in both size and complexity in the past few decades, and it is becoming increasingly important not only to have some practical means of making sense of this data abundance but also to have a sound understanding of the techniques that are used. Machine learning has been used in biology for a number of decades, but it has steadily grown in importance to the point where it is used in nearly every field of biology. However, only in the past few years has the field taken a more critical look at the available strategies and begun to assess which methods are most appropriate in different scenarios, or even whether they are appropriate at all.

This Review aims to inform biologists on how they can start to understand and use machine learning techniques. We do not intend to present a thorough literature review of articles using machine learning for biological problems1, or to describe the detailed mathematics of various machine learning methods2,3. Instead, we focus on linking particular techniques to different types of biological data (similar reviews are available for specific biological disciplines; see, for example, refs 4–11). We also attempt to distil some best practices of how to practically go about the process of training and improving a model. The complexity of biological data presents pitfalls as well as opportunities for their analysis using machine learning techniques. To address these, we discuss the widespread issues that affect the validity of studies, with guidance on how to avoid them. The bulk of the Review is devoted to the description of a number of machine learning techniques, and in each case we provide examples of the appropriate application of the method and how to interpret the results. The methods discussed include traditional machine learning methods, as these are still the best choices in many cases, and deep learning with artificial neural networks, which are emerging as the most effective methods for many tasks. We finish by describing what the future holds for incorporating machine learning in data analysis pipelines in biology.

Deep learning: Machine learning methods based on neural networks. The adjective 'deep' refers to the use of many hidden layers in the network, two hidden layers as a minimum but usually many more than that. Deep learning is a subset of machine learning, and hence of artificial intelligence more broadly.

Artificial neural networks: A collection of connected nodes loosely representing neuron connectivity in a biological brain. Each node is part of a layer and represents a number calculated from the previous layer. The connections, or edges, allow a signal to flow from the input layer to the output layer via hidden layers.

There are two goals when one is using machine learning in biology. The first is to make accurate predictions where experimental data are lacking, and use these predictions to guide future research efforts. However, as scientists we seek to understand the world, and so the second goal is to use machine learning to further our understanding of biology. Throughout this guide we discuss how these two goals often come into conflict in machine learning, and how to extract understanding from models that are often treated as 'black boxes' because their inner workings are difficult to understand12.

Key concepts

We first introduce a number of key concepts in machine learning. Where possible, we illustrate these concepts with examples taken from biological literature.

General terms. A dataset comprises a number of data points or instances, each of which can be thought of [...]

Ground truth: The true value that the output of a machine learning model is compared with to train the model and test performance.

[...] large amounts of unlabelled data. This can improve performance in cases where labelled data are costly to obtain.

Classification, regression and clustering problems. When a problem involves assigning data points to a set of discrete categories (for example, 'cancerous' or 'not cancerous'), the problem is called a 'classification problem', and any algorithm that performs such classification can be said to be a classifier. By contrast, regression models output a continuous set of values, such as predicting the free energy change of folding after mutating a residue in a protein17. Continuous values can be thresholded or otherwise discretized, meaning that it is often possible to reformulate regression problems as classification problems. For example, the free energy change mentioned above can be binned into ranges of values that are favourable or unfavourable for protein stability.
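The reformulation of a regression output as a classification label by binning can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and not taken from the Review: the function name classify_ddg, the example values, the single threshold at 0.0 kcal/mol, and the sign convention that a negative free energy change is favourable for stability.

```python
from bisect import bisect_right

# Assumed convention (not from the Review): ddG in kcal/mol,
# negative values are favourable for protein stability.
BIN_EDGES = [0.0]                        # hypothetical threshold
LABELS = ["favourable", "unfavourable"]  # one label per bin


def classify_ddg(ddg, edges=BIN_EDGES, labels=LABELS):
    """Map a continuous free energy change to a discrete class.

    `edges` must be sorted in ascending order and `labels` must contain
    exactly one more entry than `edges`. A value equal to an edge is
    placed in the bin above that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly len(edges) + 1 labels")
    return labels[bisect_right(edges, ddg)]


# Hypothetical outputs of a regression model, one value per mutation.
predicted_ddg = [-1.8, -0.2, 0.1, 0.9, 2.4]
classes = [classify_ddg(value) for value in predicted_ddg]

for value, label in zip(predicted_ddg, classes):
    print(f"{value:+.1f} kcal/mol -> {label}")
```

Passing more edges and labels, for example two edges and a middle 'neutral' label, gives a multi-class version of the same idea. Where the thresholds are placed is a modelling decision that changes how the classes are balanced.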

