Machine learning in cell biology – teaching computers to recognize phenotypes

Summary
Recent advances in microscope automation provide new opportunities for high-throughput cell biology, such as image-based screening. Highly complex image analysis tasks often make the implementation of static and predefined processing rules a cumbersome effort. Machine-learning methods, instead, seek to use intrinsic data structure, as well as the expert annotations of biologists, to infer models that can be used to solve versatile data analysis tasks. Here, we explain how machine-learning methods work and what needs to be considered for their successful application in cell biology. We outline how microscopy images can be converted into a data representation suitable for machine learning, and then introduce various state-of-the-art machine-learning algorithms, highlighting recent applications in image-based screening. Our Commentary aims to provide the biologist with a guide to the application of machine learning to microscopy assays and we therefore include extensive discussion on how to optimize experimental workflow as well as the data analysis pipeline.


Introduction
Commercially available motorized microscopes can yield data at a throughput of >10^5 images per day, raising a strong need for automated data analysis (Conrad and Gerlich, 2010; Lock and Strömblad, 2010). Computational data analysis not only reduces the workload for the experimentalist, but also ensures objectivity and consistency in the annotation of large data sets (Danuser, 2011). The complexity and diversity of microscopic image data, however, pose challenges for developing suitable data analysis workflows.
Bioimage informatics methods offer powerful solutions for specific image analysis tasks, such as object detection, motion analysis or measurements of morphometric features (Danuser, 2011;Murphy, 2011;Eliceiri et al., 2012;Myers, 2012). Most image analysis algorithms, however, have been developed for specific biological assays. The application of the respective algorithms to other markers or cell types then often requires parameter tuning or even re-programming of the software. Manual software adaptations, however, are tedious and provide major obstacles for most cell biological laboratories, owing to the limited knowledge about the mathematics behind the image analysis algorithms and a lack of expertise in software engineering.
Machine learning aims to provide a general solution to this problem by learning processing rules from examples rather than relying on manual adjustments of parameters or pre-defined processing steps (Hastie et al., 2005;Bishop, 2006;Domingos, 2012). Machine learning is particularly superior to conventional image processing programs when it comes to solving complex multi-dimensional data analysis tasks such as discriminating morphologies that are not easily described by a few parameters (Boland and Murphy, 2001;Conrad et al., 2004;Neumann et al., 2010).
Machine learning generally proceeds in two phases (Hastie et al., 2005;Bishop, 2006). In the training phase, a collection of data samples is used to build or improve a computer system by learning from inherent structure and relationships within this data. This computer system is then applied to new data samples to predict certain properties of these data samples. Thus, the overall goal of any machine-learning method is to generalize from a few training examples to make accurate predictions on large sets of data samples that were not observed during training (Hastie et al., 2005;Bishop, 2006;Tarca et al., 2007;de Ridder et al., 2013).
A common machine-learning discipline is classification. In this approach, the user generates a training data set by annotating some representative examples according to predefined classes. The machine-learning algorithm automatically infers the rules to discriminate the classes, which can then be applied to the full data set. This type of learning is termed 'supervised' machine learning, and its principal goal is to infer general properties of the data distribution from a few annotated examples (Hastie et al., 2005;Bishop, 2006;Tarca et al., 2007;de Ridder et al., 2013). Supervised machine learning has been successfully applied in diverse biological disciplines, such as high-content screening (Kittler et al., 2004;Lansing Taylor et al., 2007;Doil et al., 2009;Collinet et al., 2010;Fuchs et al., 2010;Neumann et al., 2010;Schmitz et al., 2010;Mercer et al., 2012), drug development (Perlman et al., 2004;Slack et al., 2008;Loo et al., 2009;Castoreno et al., 2010;Murphy, 2011), DNA sequence analysis (Castelo and Guigó, 2004;Ben-Hur et al., 2008) and proteomics (Yang and Chou, 2004;Datta and Pihur, 2010;Reiter et al., 2011), as well as in many other fields outside of biology, such as speech (Rabiner, 1989) and face recognition (Viola and Jones, 2004), and prediction of stock market trends (Kim, 2003).
A second type of machine learning extracts information from the data completely independently of user annotations. The goal of 'unsupervised' machine learning is to group data points into clusters on the basis of a similarity measure or to facilitate data mining by reducing the complexity of the data (Hastie et al., 2005;Bishop, 2006;Tarca et al., 2007;de Ridder et al., 2013). Unlike supervised approaches, unsupervised methods enable the exploration of unknown phenotypes (Wang et al., 2008;Lin et al., 2010) and have been successfully used for phenotypic profiling of drug effects (Perlman et al., 2004).
A number of recent reviews and textbooks provide extensive theoretical background on different machine-learning algorithms (Hastie et al., 2005;Bishop, 2006;Larrañaga et al., 2006;Tarca et al., 2007;Danuser, 2011;de Ridder et al., 2013). Successful application of machine learning, however, also needs to take into account many practical considerations and it requires knowledge about the specific data type and analysis goals. This Commentary aims to provide a guide for the cell biologist to establish an efficient machine-learning pipeline for the analysis of microscopic images. We first discuss how image data are converted into units that serve as input for machine-learning methods. We then provide background on state-of-the-art supervised machine-learning methods and discuss what needs to be taken into account to optimize their performance. We also introduce the basic concepts of unsupervised machine learning and highlight some recent applications in cell biology.
The machine-learning pipeline for cell phenotyping
Machine learning is widely used in image-based screening to classify cell morphologies that are traced by fluorescent markers. The principal objective of the screening is to determine whether an experimental perturbation (e.g. treatment with a chemical compound, small interfering RNA or genetic manipulation) leads to a cellular phenotype (e.g. change in cell morphology, protein expression level or anything that can be probed by imaging biosensors). The most commonly used machine-learning method, classification, is based on the definition of phenotypes by representative examples (Hastie et al., 2005; Bishop, 2006; Tarca et al., 2007; de Ridder et al., 2013). Thus, before a screen can be conducted, examples need to be recorded for unperturbed negative controls as well as for expected classes of phenotypes.
If representative examples for phenotypes are not available and cannot be obtained, supervised machine learning is not applicable and unsupervised methods need to be used instead (see below).
The actual machine-learning algorithm is typically embedded into a processing pipeline that converts original raw data into units that are suitable as input for the respective machine-learning algorithm (Tarca et al., 2007; de Ridder et al., 2013). The principal input for any learning algorithm is a set of objects, each of which is described by quantitative features. For cell biological applications based on microscopy data, the typical processing pipeline comprises image pre-processing, object detection and feature extraction (Fig. 1).
Image pre-processing
The first step of the machine-learning pipeline, image pre-processing, aims to remove artifacts produced by the microscope or camera. For instance, uneven illumination of the microscope field of view should be compensated for by image flat-field correction (Buchser et al., 2004). This normalizes the cellular signal intensity levels, as these should not change with the position inside the imaging field. Pixel noise resulting from low light exposure, particularly in live-cell imaging applications, should also be removed by smoothing filters (Lindblad et al., 2004). In time-lapse movies, subsequent images might not be in register owing to a random or systematic drift of the microscope stage position. Image registration techniques find optimal image transformations to correct for such artifacts (Thévenaz et al., 1998; Oliveira and Tavares, 2012).
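As a minimal sketch of the first two corrections described above, the following Python code divides an image by a reference illumination profile (flat-field correction) and applies a 3x3 mean filter for smoothing. It assumes that images are nested lists of pixel intensities and that a reference image of the empty field (`flat_field`) is available; the function names are illustrative and are not taken from any of the cited software packages.

```python
def flat_field_correct(image, flat_field):
    """Divide each pixel by the illumination profile, rescaled to mean 1.

    `flat_field` is an assumed reference image of the empty field of view.
    """
    n_pixels = sum(len(row) for row in flat_field)
    mean_flat = sum(sum(row) for row in flat_field) / n_pixels
    corrected = []
    for image_row, flat_row in zip(image, flat_field):
        corrected.append([
            pixel * mean_flat / flat if flat > 0 else 0.0
            for pixel, flat in zip(image_row, flat_row)
        ])
    return corrected


def mean_filter_3x3(image):
    """Reduce pixel noise by averaging each pixel with its neighbors.

    Pixels at the image border are averaged over the neighbors that exist.
    """
    height, width = len(image), len(image[0])
    smoothed = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            smoothed[y][x] = sum(neighbors) / len(neighbors)
    return smoothed
```

A mean filter is only one of several possible smoothing filters; it is used here because it is the simplest to write down, not because the cited work prescribes it.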

Object detection
Next, the objects of interest, which form the basis for classification, need to be defined. Most machine-learning pipelines separate objects of interest (e.g. cells) from image background, yet classification can also be performed at the level of image pixels (Kaynig et al., 2010; Sommer et al., 2011) or whole unsegmented images (Huang and Murphy, 2004; Shamir et al., 2008; Weber et al., 2013) (Fig. 2). Object detection is either based on region properties (e.g. bright regions can be segmented from background by intensity thresholding), or based on contours (e.g. edges can be detected based on the local image gradient). No single method, however, is suitable to solve all possible segmentation problems in cell-based screening, and it is therefore inherently difficult to generalize the image segmentation method. The segmentation of the image can also be facilitated by machine learning: pixel classifiers that work on local pixel neighborhoods aim to learn to separate foreground (e.g. cells) from background by classifying whether a pixel belongs to an object (Tu and Bai, 2010; Sommer et al., 2011).
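The region-based approach mentioned above can be sketched as follows: pixels brighter than a fixed threshold are marked as foreground, and touching foreground pixels are then grouped into objects. The grouping step (4-connected component labeling) and the fixed threshold are assumptions made for this illustration; real pipelines usually choose the threshold adaptively.

```python
from collections import deque


def threshold(image, level):
    """Return a binary mask that is True where a pixel is brighter than `level`."""
    return [[pixel > level for pixel in row] for row in image]


def label_objects(mask):
    """Group 4-connected foreground pixels into objects numbered 1, 2, ..."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    n_objects = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or labels[y][x]:
                continue
            # Start a new object and flood-fill it breadth-first.
            n_objects += 1
            labels[y][x] = n_objects
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = n_objects
                        queue.append((ny, nx))
    return labels, n_objects
```

The resulting label image assigns every pixel to the background (0) or to one object, which is the form of input assumed by per-object feature extraction.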
To ease the image segmentation task, many image-based screening projects use reference markers such as fluorescent chromatin or DNA labels (Kittler et al., 2007; Collinet et al., 2010; Neumann et al., 2010; Schmitz et al., 2010; Mercer et al., 2012). On the basis of the primary segmentation marker, secondary object regions can be derived in order to probe diverse secondary markers without the need to adapt the program code for segmentation of the secondary image channel.
When analysis on a single-cell level is not required, it is possible to apply machine learning on unsegmented images (Fig. 2C) by classifying image features that do not require object segmentation (Huang and Murphy, 2004) or by learning phenotypic distances based on rectangular image patches (Hamilton et al., 2009; Rajaram et al., 2012a).

Fig. 2. (A) Pixel classification (Sommer et al., 2011). Pixels of cells and background regions are annotated interactively by brush strokes according to pre-defined classes. Features of the labeled pixels and their local neighborhood are then used to learn a pixel classifier. Afterwards, this classifier is used to predict new images in a pixel-wise fashion to obtain a partitioning of the image into the phenotype classes. (B) Object classification for analysis of cellular phenotypes with CellCognition. Each segmented cell is user-labeled according to its cell cycle state in order to learn a classifier, which is then applied to unseen data to predict cell morphology classes. Total accuracies of >95% can be achieved by this approach, such as in the discrimination of eight different cell cycle stages based on a chromatin marker. Similar approaches have been used to screen for DNA damage response signaling (Doil et al., 2009) and to classify subcellular protein localization (Boland and Murphy, 2001). (C) Segmentation-free image classification by Wndchrm software (Shamir et al., 2008). Image features characterize the image as a whole and classification outputs a class membership per image. Segmentation-free approaches are applied in cases in which segmentation of objects is difficult or impossible owing to high cell densities (cells are touching) or when dealing with complex cellular structures, such as dendrites of neuronal cells (Weber et al., 2013).

Feature extraction
Following segmentation, each object needs to be described by quantitative features that form the basis to distinguish them by a classifier algorithm. The performance of a machine-learning pipeline relies substantially on an appropriate collection of relevant features (Hastie et al., 2005;Bishop, 2006;Tarca et al., 2007;de Ridder et al., 2013). The raw image pixel intensities are not well suited as features, because they withhold information on spatial and spectral patterns and can contain undesirable information such as the absolute orientation of cell objects (Huh et al., 2009). Thus, descriptive features need to be derived from the pixel intensities that enrich information relevant for classification. Two types of features are widely used to describe cell objects in microscopic images. Texture features quantify the distribution of pixel intensities within each object. Simple examples are mean intensity and standard deviation. More advanced texture features measure the granularity at different scales (Chen et al., 1995;Chebira et al., 2007) or pixel-pixel co-occurrence patterns (Haralick, 1979). A second class of feature describes the contour on the basis of the segmentation mask, for example, the contour roughness or circularity (Liu et al., 2011). Many powerful morphometric features are abstract representations of images and therefore difficult to intuitively relate to visual inspection of the cell image. Relevant features that relate to a phenotype can be automatically determined by the learning algorithm, and will vary with the specific biological marker and assay (Fig. 3). To avoid tedious manual adaptations of feature sets for each specific application, multi-purpose feature libraries have been developed, and these cover the needs for most cell biological assays (Jones et al., 2008;Held et al., 2010;Shariff et al., 2010).
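To make the two feature types concrete, the sketch below computes a small feature vector for one segmented object: two simple texture features (mean and standard deviation of intensity) and two contour features (area and circularity). It assumes a label image in which each object carries an integer identifier, and it estimates the perimeter by counting exposed pixel edges, which is a crude approximation chosen for brevity. The function name and the choice of features are illustrative only.

```python
import math


def object_features(image, labels, object_id):
    """Return [area, mean intensity, intensity std, circularity] for one object."""
    height, width = len(labels), len(labels[0])
    intensities = []
    perimeter = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x] != object_id:
                continue
            intensities.append(image[y][x])
            # Count pixel edges that face the background or the image border.
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < height and 0 <= nx < width
                if not inside or labels[ny][nx] != object_id:
                    perimeter += 1
    area = len(intensities)
    if area == 0:
        raise ValueError("object_id not found in label image")
    mean = sum(intensities) / area
    std = math.sqrt(sum((v - mean) ** 2 for v in intensities) / area)
    # Circularity is 1 for an ideal circle and smaller for rough contours.
    circularity = 4.0 * math.pi * area / perimeter ** 2
    return [area, mean, std, circularity]
```

Applying this function to every object yields one feature vector per cell, that is, one point in a four-dimensional feature space.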
Even though a versatile applicability of a machine-learning pipeline requires comprehensive feature sets, gathering more features does not always improve performance. This is because the increase in dimensionality with each feature renders the classification task exponentially more complex. This is referred to as the 'curse of dimensionality' (Hastie et al., 2005) and can be addressed by algorithms that reduce dimensionality, for example, by selecting the most informative features (Loo et al., 2007;Saeys et al., 2007). Engineering the right set of features is often key to the success of a machine-learning project, and at least as important as using the right learning algorithm (Fig. 4).
In summary, the processing pipeline yields a set of objects (typically representing cells), each of which is associated with an ordered list of feature values called the feature vector. Objects are thus represented in a multi-dimensional feature space, where the number of features defines the dimensionality. The challenging task of supervised machine learning is then to infer rules for how to discriminate different classes of objects in this multi-dimensional feature space.
How does a machine learn?
As discussed above, there are two different types of machine learning, supervised and unsupervised learning. Supervised learning is guided by user training with the goal of subsequently applying a learned program to a similar task on independent large data sets (Hastie et al., 2005; Bishop, 2006; Tarca et al., 2007; Domingos, 2012; de Ridder et al., 2013). Unsupervised learning, by contrast, is fully independent of user interaction and aims to recognize patterns in the data to facilitate the interpretation of complex multi-dimensional data (Hastie et al., 2005; Bishop, 2006; Tarca et al., 2007; de Ridder et al., 2013). As supervised machine learning has been used much more widely in cell biology, we focus our Commentary on this approach and only outline general aspects of unsupervised methods at the end of this section.

Supervised machine learning: learning from user-defined examples
In supervised machine learning, a human expert first defines the processing task by annotating a small subset of objects from the original data set, for example, by phenotype labels according to cell morphology ( Fig. 1). This training data serves to automatically infer internal parameters of a learning model (the learner), which is then applied to discriminate between the different classes of objects in the full data set. Thus, the overall task of supervised machine learning is to generalize from a few selected examples. The supervised learning process is guided by an objective function, which evaluates how well the learner adapted to the training data (Hastie et al., 2005;Bishop, 2006;Domingos, 2012). On the basis of the objective function, an optimization procedure seeks parameters that yield the best learner. Importantly, the overall goal is to obtain a learner that generalizes: the learner needs to perform well on data that was not used for training. It is therefore essential to withhold a fraction of the training data to test this. If the learning performance were only evaluated based on the data used for learning, a simple memorization of the examples might perform best, which is likely to yield poor results on independent data. Various strategies have been developed for optimal splitting of training data into fractions that serve learning and testing, respectively (see below).
Supervised machine learning has been an important backbone for analysis pipelines in many high-content screening projects (Kittler et al., 2007; Fuchs et al., 2010; Neumann et al., 2010; Schmitz et al., 2010; Mercer et al., 2012). The strengths of supervised machine learning are intuitive assay development based on examples, the versatility and applicability to diverse assays, and efficient and robust computation of large datasets. This approach, however, depends on phenotype examples, which precludes searching for novel and unexpected phenotypes in screens.
The models underlying machine-learning algorithms
How is the learning process implemented in a computer algorithm? There are two principally different types of learning models: generative approaches, which model the distribution of data points, and discriminant approaches, which model decision boundaries between different classes (Hastie et al., 2005; Bishop, 2006; for details on specific algorithms, see Box 1).
Generative methods model statistical distributions underlying the data objects. This can be based on certain probability distributions (e.g. Gaussian distributions), whose parameters are estimated from the training data (parametric models). Decision boundaries that separate data points according to their class membership are formed implicitly. Generative models can be used to synthesize new data points, which might be useful in some specialized applications [e.g. simulation of cell morphology (Buck et al., 2012; Rajaram et al., 2012b)]. Generative models have also been successfully applied to correct misclassifications of cell cycle stages, aided by temporal information in time-lapse movies, and to discover new biologically active peptide hormones by searching for sequence features in protein sequences (Mirabeau et al., 2007) using hidden Markov models (Rabiner, 1989).
Discriminant approaches, by contrast, directly model the decision boundary between different classes rather than the distribution of data points. The simplest implementation is a linear decision boundary (or a hyperplane in high-dimensional feature space). Linear discriminant methods are very robust towards noise in the data, yet their decision boundaries cannot accurately discriminate objects of different classes if they are distributed in complex patterns, such as typically observed for cell morphologies (Meyer et al., 2003;Loo et al., 2007;Fuchs et al., 2010;Held et al., 2010;Neumann et al., 2010). Most discriminant methods used in cell biological applications, therefore, use non-linear classifiers, which can express more complex decision boundaries.
The complexity of non-linear decision boundaries can range from smoothly bent functions to arbitrary rugged and unconnected boundaries (Fig. 3A-C). The more complex a decision boundary, the better it can separate complex distributions of data points. By contrast, complex decision boundaries are more likely to represent details that are specific to the sampled training data or noise and therefore might not apply to the general distribution of other data points.
These characteristics of classifiers are referred to as bias and variance (Hastie et al., 2005; Bishop, 2006; Domingos, 2012). A high bias means a strong preference of the learner to follow its internal model assumptions, even if this does not match well to the training data. A linear classifier will therefore always yield a linear classification boundary even if this leads to severe misclassifications on non-linear data distributions. A low bias, by contrast, indicates that a classifier has no strong internal model assumptions and is able to adapt to arbitrary cluttered training data. A learner with the lowest bias, however, is not necessarily the optimal solution, because the ability to generalize from training data is also assessed by a second parameter termed variance.
The variance of a classifier indicates its stability when repeatedly applied to subsets of training data points drawn independently from the same underlying data source (e.g. the same biological experiment). Classifiers with a low variance produce similar decision boundaries when applied to different training sets, whereas high variance classifiers are prone to adapt to noise and particularities of that very instance of training data. A major design goal for machine-learning algorithms is to optimize the trade-off between bias and variance. In many implementations, this can be controlled by parameters whose optimal values depend on the specific experimental data.

Box 1. Supervised classification algorithms
State-of-the-art supervised classification methods have been optimized towards classification accuracy, computational performance, learning from as few training objects as possible and versatility in their application. Widely used algorithms are described below.

Support vector machines
Support vector machines (SVMs) aim to find a decision hyperplane that separates data points of different classes with a maximal margin (i.e. maximal distance to the nearest training data points). Because data points of different classes might not always be completely separable by a hyperplane, most SVM implementations are based on a soft margin, which allows misclassifications at a certain cost value. SVMs themselves are linear classifiers, but they can generate non-linear decision boundaries if the data points are transformed beforehand to higher dimensions using a mapping function, such as a Gaussian kernel (Vapnik, 2000). SVMs are relatively robust towards noisy features and are computationally efficient, and implementations are available in diverse bioimaging software packages (Conrad et al., 2011; Horvath et al., 2011).

Adaptive boosting
Adaptive boosting (AdaBoost) combines several 'weak' learners to form a 'strong' classifier by iteratively adding and reweighting simple classifiers such as thresholds (Freund and Schapire, 1995). Owing to its iterative nature, boosting is particularly suitable for interactive online learning (Jones et al., 2008). However, AdaBoost is relatively sensitive towards noisy data and outliers (Kanamori et al., 2007). A widely used implementation, GentleBoost (Friedman et al., 2000), is available in the bioimaging software package CellProfiler Analyst (Jones et al., 2008).
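The iterative adding and reweighting can be sketched in a few lines. The code below is a minimal illustration of the classic discrete AdaBoost scheme with single-feature thresholds as weak learners; it is not the GentleBoost variant and does not reproduce the CellProfiler Analyst implementation. Class labels are assumed to be +1 and -1, and all function names are hypothetical.

```python
import math


def stump_predict(stump, sample):
    """A weak learner: compare one feature against one threshold."""
    feature, threshold, polarity = stump
    return polarity if sample[feature] > threshold else -polarity


def train_stump(samples, labels, weights):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best_error, best_stump = None, None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for s, y, w in zip(samples, labels, weights)
                            if stump_predict(stump, s) != y)
                if best_error is None or error < best_error:
                    best_error, best_stump = error, stump
    return best_error, best_stump


def adaboost_train(samples, labels, n_rounds=10):
    """Return a list of (weight, stump) pairs forming the strong classifier."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, stump = train_stump(samples, labels, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified objects, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, s))
                   for s, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, sample):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, sample) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

Because every round increases the weight of previously misclassified objects, a few mislabeled or atypical objects can dominate later rounds, which is one way to see the sensitivity to noise and outliers noted above.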

Random forest
Random forests (RFs) (Breiman, 2001) train an ensemble of decision trees (Breiman et al., 1983) under random influence to average their outcome. Averaging the prediction of an ensemble reduces the overall variance while maintaining the low bias typical for decision trees. RFs are robust in high dimensions, because of an implicit feature selection, and are computationally efficient and easily parallelizable. An RF implementation widely used in cell biological applications is available (Kaynig et al., 2010; Sommer et al., 2011).

In light of the diversity of supervised machine-learning methods, how can we identify the best algorithm? Important requirements are maximal accuracy and versatile application to diverse cell biology assays without the need to adapt software. Whether generative or discriminative classification approaches are better suited to solve a machine-learning task depends on how well internal model assumptions are met in the data (Ng and Jordan, 2002). For instance, support vector machines (discriminative approach) are widely used in cell biology (Meyer et al., 2003; Loo et al., 2007; Fuchs et al., 2010; Held et al., 2010; Neumann et al., 2010) owing to their good average performance among benchmark data sets (Meyer et al., 2003) and applicability to different data structures (Hastie et al., 2005). However, generative approaches, such as linear discriminant analysis, might be favorable in other cases, such as classifying the phenotypes of the actin cytoskeleton in Drosophila melanogaster cells (Wang et al., 2008).
Other considerations can be taken into account depending on the specific application. For example, methods are preferred if they require only small numbers of training objects for good performance. Some applications might require a human to interpret the decision rules of the classifier. Other applications might need a particularly fast computing performance. Some methods that have been found to be particularly versatile and powerful for cell biological applications are specified in Box 1 and software implementations are listed in Box 2.
How to measure and optimize the performance of machine learning?
The most widely used performance metric for a learner is total error, that is, the number of incorrect classifications divided by the total number of objects. Depending on the learning task, it can be useful to decompose the total error into false-positive and false-negative errors, which enables specific optimization strategies. For instance, if an RNA interference screen yields a long candidate gene list that cannot be completely validated by secondary assays, it could be useful to minimize false-positive prediction of phenotypes, taking into account that some potential phenotypes might be missed. If the most important goal of a screen is comprehensiveness and it is feasible to validate all candidates by secondary analysis, then it might be preferred to minimize false-negative classifications (e.g. misclassification of a phenotype as a negative control morphology) at the cost of an increased false-positive error rate.
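The decomposition of the total error can be illustrated for the two-class case (phenotype versus negative control). The sketch below assumes that false-positive and false-negative rates are normalized by the number of true negatives and true positives, respectively, which is one common convention; the function name and the dictionary keys are illustrative.

```python
def error_rates(true_labels, predicted_labels, positive_class):
    """Return total error and false-positive / false-negative rates.

    Every label other than `positive_class` is treated as a negative control.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    false_pos = false_neg = n_pos = n_neg = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive_class:
            n_pos += 1
            if predicted != positive_class:
                false_neg += 1  # phenotype missed
        else:
            n_neg += 1
            if predicted == positive_class:
                false_pos += 1  # control called a phenotype
    return {
        "total_error": (false_pos + false_neg) / len(true_labels),
        "false_positive_rate": false_pos / n_neg if n_neg else 0.0,
        "false_negative_rate": false_neg / n_pos if n_pos else 0.0,
    }
```

Two classifiers with the same total error can differ strongly in these two rates, which is why the choice between them depends on whether secondary validation of all candidates is feasible.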
Accurate evaluation of the performance of a machine-learning method needs a comprehensive and representative data set for the specific goal. In light of the diversity of data types and analysis tasks in cell biology, it is often difficult to estimate the performance of published learning methods based on the specific proof-of-concept data used in the respective study. For objective benchmarking of learning methods in high-content screening, several annotated reference data sets have been published (Ljosa et al., 2012;Rajaram et al., 2012b).
Box 2. Machine-learning software for cell biologists
Machine-learning methods have been implemented in a number of open-source software projects dedicated to high-content screening data (Shamir et al., 2010; Eliceiri et al., 2012).
CellProfiler and CellProfiler Analyst (Carpenter et al., 2006; Jones et al., 2008; Kamentsky et al., 2011) (http://www.cellprofiler.org). A particular strength of these software packages is a modular workflow design, which enables rapid development of analysis assays. CellProfiler Analyst provides a multi-class active learning interface based on boosting. CellProfiler runs on all major operating systems and supports computing on clusters for large-scale screening.
CellCognition (http://www.cellcognition.org/) has been optimized for time-resolved imaging applications. It comprises a complete machine-learning pipeline from cell segmentation and feature extraction to supervised and unsupervised learning. CellCognition runs on all major operating systems and supports computing on clusters for large-scale screening.
ilastik (Sommer et al., 2011) (http://www.ilastik.org/) is an interactive segmentation tool based on pixel classification, which facilitates more complex image-segmentation tasks and provides real-time feedback.
Data format standards for high-content screening such as CellH5 (Sommer et al., 2013) and SDCubes (Millard et al., 2011) aim at facilitating inter-operability between different software packages by storing multi-dimensional original image data together with processing parameters and intermediate processing results. CellH5 has interfaces to R Bioconductor (Gentleman et al., 2004) and CellCognition, and can be natively accessed from all major programming languages; SDCubes has been implemented for ImageRails (Millard et al., 2011).

How many data objects are required to train a good learner? Unfortunately, there is no general rule, because this depends on the method and the variability within the specific data set. In practice, some applications can yield satisfying results by training with ten objects per class, although most applications will require substantially more. Discriminative methods typically need more training objects to achieve a satisfactory performance than do generative models (Ng and Jordan, 2002). Irrespective of the learning algorithm, an increase in the number of features generally requires more training examples (Hastie et al., 2005). The most important evaluation criterion for a learner is its ability to generalize (Hastie et al., 2005; Bishop, 2006; Tarca et al., 2007; Domingos, 2012; de Ridder et al., 2013). To measure this, the available annotated reference data needs to be split into three subsets. The first fraction of objects is used for the initial learning. A second fraction of objects serves to improve the parameter settings of the learner. Finally, the performance of the learner is evaluated against the third fraction, the independent test data. This procedure prevents overfitting and allows for a good generalization (Hastie et al., 2005; Bishop, 2006; Tarca et al., 2007; Domingos, 2012; de Ridder et al., 2013).
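A three-way split of annotated reference data into fractions for learning, parameter tuning and independent testing might look as follows. The 60/20/20 proportions and the fixed random seed are arbitrary choices made for this sketch, not recommendations from the literature cited here.

```python
import random


def three_way_split(objects, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle annotated objects and split them into three disjoint subsets.

    Returns (training, validation, test); the test set receives the remainder.
    """
    if train_fraction + validation_fraction >= 1.0:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(objects)
    random.Random(seed).shuffle(shuffled)  # avoid ordering effects, e.g. by well
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test
```

The test subset should be used only once, after all parameters have been fixed, so that the reported performance reflects data the learner has never seen.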
To make most efficient use of a limited number of training objects, a procedure termed k-fold cross-validation has been developed (Kohavi, 1995;Ambroise and McLachlan, 2002). The training data set is partitioned into a user-defined number of k subsets, of which all but one are used for initial training of the learner. The remaining fraction serves to measure the performance of the learner and optimize its parameters. This is repeated for all fractions of data, typically five or ten times.
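The partitioning described above can be sketched as a generator that yields, for each of the k rounds, the indices used for training and the indices held out. The helper `cross_validate` accepts any pair of user-supplied training and prediction functions; both helpers and their parameters are hypothetical names used only for illustration.

```python
import random


def k_fold_splits(n_objects, k=5, seed=0):
    """Yield (training_indices, held_out_indices) for each of the k folds."""
    if not 2 <= k <= n_objects:
        raise ValueError("k must be between 2 and the number of objects")
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def cross_validate(samples, labels, train_fn, predict_fn, k=5, seed=0):
    """Return the mean held-out error over k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(samples), k, seed):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        wrong = sum(predict_fn(model, samples[i]) != labels[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k
```

Every annotated object is held out exactly once, so all of the labeled data contributes to both training and evaluation.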
When a specific class is highly overrepresented in the data, an optimization towards total accuracy might yield a learner that performs poorly on predicting the less-abundant classes. This problem can be tackled either by sub-sampling only a fraction of training objects from the abundant classes while preserving all training objects from the less-abundant classes, or by specialized learning algorithms (Kotsiantis et al., 2006).
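The sub-sampling strategy can be sketched as follows: every class is reduced to the size of the least-abundant class by random selection, so that all objects of the rarest class are preserved. Balancing to exactly equal class sizes is an assumption of this sketch; in practice a milder reduction of the abundant classes may be preferable.

```python
import random
from collections import defaultdict


def subsample_abundant_classes(samples, labels, seed=0):
    """Randomly reduce each class to the size of the least-abundant class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced_samples, balanced_labels = [], []
    for label, members in by_class.items():
        # Classes already at the target size are kept completely.
        chosen = members if len(members) == target else rng.sample(members, target)
        balanced_samples.extend(chosen)
        balanced_labels.extend([label] * target)
    return balanced_samples, balanced_labels
```

Sub-sampling should be applied to the training fraction only, so that the test data still reflects the class frequencies of the real experiment.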
What overall accuracy can we expect from machine learning in a typical cell biological experiment? This is difficult to express in absolute numbers because it depends on many different parameters and the quality of the data. Many cell biological applications have achieved total accuracies of >90%, often within the range of object labeling inconsistencies between different human annotators.
Unsupervised machine learning – learning from intrinsic data structure
In some biological applications it is difficult or impossible to define a training data set, which precludes the use of supervised machine-learning methods. For example, an image-based screen might be aimed at the discovery of a hypothetical morphological deviation that has not been observed before. In such cases, unsupervised machine-learning methods can be used to detect individual outlier objects or clusters of objects that differ from the control group in a dataset (Fig. 3D-F). The overall goal of unsupervised machine learning is the identification of structures in the input data without prior user definition of the output.
In the absence of annotated training data, the definition of an objective function becomes more difficult, as it cannot make use of classification error rates. Instead, objective functions in unsupervised learning are typically based on distances in the feature space. For instance, clustering methods aim to group objects into clusters by minimizing the distance between objects within each cluster and maximizing the distance between different cluster centers (Bishop, 2006; Box 3).
Another widely used unsupervised method is dimensionality reduction (Van der Maaten et al., 2009), which aims to find a less redundant and lower-dimensional representation of the data points, keeping as much information as possible from the original high-dimensional feature space (Fig. 3D,E). Dimensionality reduction enables better visualization of the data points and thereby facilitates data mining by visual inspection.
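One linear method of this kind is principal component analysis (PCA), which rotates the data onto uncorrelated axes ranked by the variance they capture. The sketch below is restricted to two features so that the eigen-decomposition of the covariance matrix can be written in closed form; the function name and the toy measurements are hypothetical, and the data are assumed to have non-zero variance.

```python
import math

def pca_2d(points):
    """PCA for two features.

    Returns the unit direction of the first principal component, the fraction
    of total variance it captures, and the projection of each centered point.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the symmetric 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0:
        direction = (b, largest - a)  # satisfies (a - largest) * v1 + b * v2 = 0
    else:
        # Features are already uncorrelated: pick the axis with larger variance.
        direction = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*direction)
    direction = (direction[0] / norm, direction[1] / norm)
    explained = largest / (a + c)
    projections = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, explained, projections

# Toy data: two strongly correlated, made-up features per object.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, explained, projections = pca_2d(points)
```

Because the two toy features are almost perfectly correlated, nearly all of the variance falls on the first component, so the second could be dropped with little loss of information.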
Despite the advantage of fully automated data analysis without user training, unsupervised learning has not yet been widely used in cell biological applications. The biggest problems are the relatively poor performance on noisy data and the unpredictable output, which limits interpretation, particularly when the cluster differences relate to complex combinations of multiple features. To overcome these limitations, some applications of unsupervised learning have incorporated additional knowledge about the data, such as temporal constraints on morphological transitions (Zhong et al., 2012) or non-negativity constraints on gene expression data (Devarajan, 2008).

Box 3. Unsupervised machine-learning algorithms
The main disciplines of unsupervised learning are clustering and dimensionality reduction. Clustering aims at assigning categorical class labels to data points without prior training. Widely used clustering methods are described below.

k-means clustering
k-means clustering finds a user-defined number (k) of clusters by an iterative procedure. The cluster centers are initialized randomly and each data point is first assigned to the closest cluster center. Then, each cluster center is recalculated as the mean of all data points assigned to it. These two steps are repeated until convergence (i.e. until the cluster centers no longer change beyond a defined threshold in the update step).
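The iteration can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and initialization from randomly chosen data points; the function name, the `tolerance` and `max_iterations` parameters and the toy points are hypothetical choices.

```python
import math
import random

def k_means(points, k, tolerance=1e-6, max_iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns the cluster centers and the cluster index assigned to each point.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest center.
        assignments = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                       for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centers.append(tuple(sum(coords) / len(members)
                                         for coords in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster in place
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:  # convergence: centers no longer move
            break
    return centers, assignments

# Toy data: two compact groups in a two-dimensional feature space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, assignments = k_means(points, k=2)
```

On less clearly separated data the result depends on the random initialization, so in practice the procedure is often run several times and the solution with the smallest within-cluster distances is kept.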

Gaussian mixture model
The Gaussian mixture model (GMM) extends k-means clustering by accounting for more complex data distributions. In addition to estimating cluster centers (means), each cluster center is associated with parameters that describe a Gaussian distribution. The estimation of a variance per cluster enables the modeling of data clusters with elliptical data spread.

Hierarchical clustering
In contrast to k-means and GMM clustering, hierarchical clustering is directly based on distances between the data points. In the first step, all data points are defined as single clusters. Then clusters are merged according to a linkage criterion based on small distances. This process is recursively applied, yielding a hierarchical cluster tree termed a dendrogram. Hierarchical clustering has been widely used to visualize similarities between complex phenotypes and is implemented in, for example, Bioconductor (Gentleman et al., 2004).
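The merging procedure can be illustrated with a short sketch. It assumes single linkage (the distance between two clusters is the smallest distance between any pair of their members) and Euclidean distance, which are only one of several possible choices; the function name and toy points are hypothetical, and the brute-force search is intended for small examples only.

```python
import math

def single_linkage(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Toy data: two pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merges = single_linkage(points)
```

Reading the merge history from first to last entry corresponds to moving from the leaves of the dendrogram to its root; cutting the tree at a chosen distance yields a flat clustering.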

Dimensionality reduction
Dimensionality reduction is used to facilitate visual inspection of high-dimensional data. This is necessary because data points are very sparsely distributed in the high-dimensional feature space, the volume of which grows exponentially with the number of dimensions (Hastie et al., 2005; Bishop, 2006; Domingos, 2012). Dimensionality reduction also enables a more compact and less redundant visualization of the data owing to the smaller number of features. Widely used methods for dimensionality reduction include:
• Principal component analysis (PCA), which maps original data points by a linear transformation (rotation) to a new feature space, where all transformed features are mutually uncorrelated. The resulting dimensions (principal components, PCs) are ranked by the amount of variance they cover in the data. The highest-ranked PCs thus enrich relevant information, and low-ranked PCs can be removed for further data analysis (Fig. 3). Owing to its wide applicability and effectiveness, PCA is often used for visualization and as a preprocessing step in classification and clustering.

Active learning – computer assists the user in data annotation
A major bottleneck in supervised learning is the generation of user-annotated labels. Human experts might introduce bias and subjective variability into the training data set if information about the true object state is unattainable (Zhong et al., 2012). In addition, it is difficult, and in many cases impossible, to anticipate the gain in learning achieved by selecting and annotating a particular data point. The annotation of rare and extreme phenotypic responses might be more informative than repeatedly adding samples to an already well-annotated class, yet the user might lack the expertise to identify the best training sample sets. This limitation is addressed by active learning methods. The learning algorithm selects data points autonomously and presents them to the human expert for labeling.
Data points are selected by the learning algorithm in order to maximize the learning progress, and hence, minimize the overall annotation effort (Jones et al., 2009). The criterion for selecting and proposing objects for annotation is typically based on uncertainty measures, whereby the most uncertain objects (from the perspective of the learner) are selected first. Similarly, interactive learning aims to shorten the feedback loop in the annotation process. Directly applying the learning result to other yet-unlabeled data samples allows the expert to inspect the current power of the learner visually and thus helps to identify cases with wrong predictions.
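A simple uncertainty criterion can be sketched as follows. The example assumes that the current learner outputs a class probability vector for each unlabeled object and uses the "least confident" measure (lowest probability of the predicted class) as the uncertainty score; this is one of several possible measures, and the function name and probability values are hypothetical.

```python
def least_confident(probabilities, n_queries):
    """Return indices of the n_queries objects the learner is least sure about.

    probabilities: one list of class probabilities per unlabeled object.
    """
    # Confidence is the probability of the most likely class for each object.
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:n_queries]

# Made-up predictions of a two-class learner for four unlabeled objects.
predicted = [
    [0.98, 0.02],  # confidently assigned to the first class
    [0.55, 0.45],  # uncertain
    [0.70, 0.30],
    [0.51, 0.49],  # most uncertain
]
to_annotate = least_confident(predicted, n_queries=2)
```

Here the objects at indices 3 and 1 would be presented to the expert first; after they are labeled, the learner is retrained and the ranking is recomputed.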
The prioritization of computer-selected data points can indeed improve the learning rate (Tomanek and Olsson, 2009) by guiding the human expert in establishing a comprehensive training data set (Fig. 4). Interactive learning requires fast algorithms and efficient software implementations and thus might not always be applicable.

Some experimental design guidelines
Reproducibility of the image-recording procedure is of utmost importance for the successful application of machine learning. Machine learning is designed to generalize from examples, but it will only generalize from variability that was present in the training data. For example, slight changes in the image focal plane, which might not even be noticed by a human observer, can introduce variability into the data that leads to systematic misclassifications. It is therefore strongly advisable to use autofocus devices to maximize reproducibility of image recording.
Similarly, the illumination intensity should be kept absolutely constant. Variable illumination intensities result in different noise levels, which can bias the classification. Conventional mercury or xenon light sources have variable illumination intensities depending on their lifetime and the heat-up time, for which compensation is required. New light sources, such as LEDs or solid-state lasers, yield a more stable output and are therefore preferable for machine-learning applications. Variable cell densities or differences in low-level image features owing to the experimental setup (such as microscope settings or different imaging media or incubation temperatures) that are not related to a biological phenotype can severely compromise the reliability of machine-learning methods (Shamir, 2011). An experimentalist should therefore keep environmental conditions as constant as possible. Data quality and reproducibility can be assessed by automated quality control (Zeder et al., 2010) and by incorporating control treatments in the assay. Differences in image features resulting from experimental variations are unlikely to become obvious in the evaluation of the machine-learning method itself and thus have to be avoided early on in data acquisition and sample preparation.
Feature design has a great impact on the overall performance, as the learner can only learn what it has 'seen' in terms of features. The design and selection of optimal features can be difficult; however, general-purpose feature sets work well for most morphology-based assays (Hu and Murphy, 2004; Carpenter et al., 2006; Jones et al., 2008; Held et al., 2010). Engineering of specialized features might be necessary for specific biological assays, but should be envisioned only after unsuccessful application of general-purpose feature sets (Fig. 4).

Machine learning in cell biology – conclusions and outlook
Machine learning has tremendous power in the analysis of large-scale microscopic image data. Some representative examples for typical machine-learning applications are screens for mitotic regulators (Kittler et al., 2004; Neumann et al., 2010; Schmitz et al., 2010; Wurzenberger et al., 2012), control of cellular stress responses (Wippich et al., 2013), factors involved in ribosome biogenesis (Wild et al., 2010) and cellular host factors involved in virus infection (Mercer et al., 2012). Unsupervised machine learning has been used, for example, to study the heterogeneity of cell responses to diverse drugs (Loo et al., 2009; Singh et al., 2010), to construct genetic interaction profiles (Horn et al., 2011) and for automatic staging of mitotic progression (Zhong et al., 2012).
Current implementations of machine-learning software for cell biology have been optimized for the needs of large-scale screens. However, most cell biological studies are hypothesis driven and require frequent adaptations of the assay for testing small sets of candidate experimental perturbations. In such an experimental framework, many biologists still visually inspect data and develop quantification methods based on specific rule sets that are implemented manually as macros or software plug-ins. This approach is tedious and the data analysis often still requires some level of user interaction. By further improving the usability of software interfaces, machine learning could eventually replace most manually programmed analysis pipelines to facilitate assay development and increase processing throughput, accuracy and objectivity.
The power of machine learning can be further leveraged by a seamless integration into the image-acquisition process (Conrad et al., 2011). As state-of-the-art microscopes support full motorization and specimen interaction (e.g. by photobleaching at defined image areas or compound dispensing), automatic online recognition of phenotypes enables intelligent imaging workflows with highly sophisticated biological assays.