Common types of ensembles
Bayes optimal classifier
The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it.[16] The naive Bayes classifier is a version of this that assumes the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training data set would be sampled from a system if that hypothesis were true. To accommodate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:
$$y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)$$
where $y$ is the predicted class, $C$ is the set of all possible classes, $H$ is the hypothesis space, $P$ refers to a probability, and $T$ is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in $H$. However, the hypothesis represented by the Bayes optimal classifier is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in $H$).
This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:
$$P(h_i \mid T) \propto P(T \mid h_i)\, P(h_i)$$
hence,
$$y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(h_i \mid T)$$
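As a small illustration of the posterior-weighted vote, the following Python sketch evaluates the sum over hypotheses for a toy problem. The three hypotheses, their posterior probabilities, and the function name `bayes_optimal_class` are hypothetical values chosen for illustration, not taken from the cited sources.

```python
def bayes_optimal_class(classes, hypotheses):
    """Return the class with the largest posterior-weighted vote.

    hypotheses: list of (posterior P(h|T), dict mapping class -> P(c|h)).
    """
    scores = {
        c: sum(posterior * predicts[c] for posterior, predicts in hypotheses)
        for c in classes
    }
    return max(scores, key=scores.get), scores


# Hypothetical posteriors: the single most probable hypothesis says "pos",
# but the two remaining hypotheses together outweigh it.
hypotheses = [
    (0.4, {"pos": 1.0, "neg": 0.0}),
    (0.3, {"pos": 0.0, "neg": 1.0}),
    (0.3, {"pos": 0.0, "neg": 1.0}),
]
label, scores = bayes_optimal_class(["pos", "neg"], hypotheses)
print(label, scores)
```

In this toy case the ensemble's answer differs from that of the most probable single hypothesis, which is the sense in which the ensemble can represent a hypothesis outside the hypothesis space.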
Bootstrap aggregation (bagging)
Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is created by selecting from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features they use (for example, at the nodes of a decision tree), to encourage exploration of diverse features.[17] The variance of local information in the bootstrap sets and the feature restrictions promote diversity in the ensemble, and can strengthen it.[18] To reduce overfitting, a member can be validated using its out-of-bag set (the examples that are not in its bootstrap set).[19]
Inference is done by voting on the predictions of the ensemble members, which is called aggregation. As an illustration, consider an ensemble of four decision trees. Each tree classifies the query example. Since three of the four predict the positive class, the overall classification of the ensemble is positive. Random forests are a common application of bagging.
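The sampling and voting steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the base learner is a hypothetical one-feature threshold rule (`fit_stump`) rather than a full decision tree, and the toy data set and ensemble size of 25 are arbitrary.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement; also return the out-of-bag examples."""
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(idx)
    in_bag = [data[i] for i in idx]
    out_of_bag = [data[i] for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def fit_stump(sample):
    """Hypothetical base learner: threshold t on one feature, predicting 1 when x > t."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum((x > t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_predict(thresholds, x):
    """Aggregate the members' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(x, int(x >= 5)) for x in range(10)]  # toy data: label is 1 when x >= 5
thresholds = [fit_stump(bootstrap_sample(data, rng)[0]) for _ in range(25)]
print(bagging_predict(thresholds, 9), bagging_predict(thresholds, 0))
```

The out-of-bag list returned by `bootstrap_sample` is the set a member could be validated on, since that member never saw those examples during training.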
Boosting
Boosting involves training successive models by emphasizing training data misclassified by previously learned models. Initially, all data (D1) have equal weight and are used to learn a base model M1. The examples misclassified by M1 are assigned a higher weight than those correctly classified. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting.
In some cases, boosting has yielded better accuracy than bagging, but it tends to overfit more. The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.
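The reweighting loop can be sketched in Python in the style of AdaBoost. This is an illustrative sketch, not a reference implementation: the base model is a hypothetical one-feature threshold rule, labels are assumed to be +1 or -1, and the toy data and the choice of three rounds are arbitrary.

```python
import math


def fit_weighted_stump(X, y, w):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for t in X:
        for sign in (1, -1):
            pred = [sign if x > t else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n  # D1: every example starts with equal weight
    model = []
    for _ in range(rounds):
        err, t, sign = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this model
        model.append((alpha, t, sign))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * (sign if x > t else -sign))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Weighted vote of all base models."""
    score = sum(a * (s if x > t else -s) for a, t, s in model)
    return 1 if score > 0 else -1


X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # no single threshold separates these labels
model = adaboost(X, y, rounds=3)
print([predict(model, x) for x in X])
```

No single threshold rule fits this toy data, but the weighted vote of three of them does, which illustrates how emphasizing earlier mistakes lets weak models complement each other.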
Bayesian model averaging
Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data.[20] BMA is known to generally give better answers than a single model obtained, for example, by stepwise regression, especially where very different models have nearly identical performance on the training set but may otherwise perform quite differently.
The question with any use of Bayes' theorem is the prior, that is, the (perhaps subjective) probability that each model is the best one to use for a given purpose. Conceptually, BMA can be used with any prior. The R packages ensembleBMA[21] and BMA[22] use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995).[23] The R package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models, as well as priors over the coefficients.[24]
The difference between the BIC and the AIC is the strength of the preference for parsimony. BIC's penalty for model complexity is $\ln(n)\,k$, while AIC's is $2k$, where $n$ is the sample size and $k$ is the number of estimated parameters. Large-sample asymptotic theory establishes that if a best model exists, then with increasing sample sizes the BIC is strongly consistent, that is, it will almost surely find it, while the AIC may not, because the AIC may continue to place excessive posterior probability on models that are more complicated than necessary. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., they achieve minimum mean square prediction error), while BIC is not.[25]
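To make the two penalties concrete, the following Python sketch computes BIC and AIC from a maximized log-likelihood and turns BIC values into approximate model weights proportional to exp(-BIC/2), the usual BIC approximation under equal prior model probabilities. The three fitted models, their log-likelihoods, parameter counts, and predictions are hypothetical numbers chosen for illustration.

```python
import math


def bic(log_likelihood, k, n):
    """BIC with complexity penalty ln(n) * k."""
    return k * math.log(n) - 2.0 * log_likelihood


def aic(log_likelihood, k):
    """AIC with complexity penalty 2 * k."""
    return 2.0 * k - 2.0 * log_likelihood


def bma_weights(bics):
    """Approximate posterior model probabilities (equal model priors assumed)."""
    best = min(bics)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def bma_predict(weights, predictions):
    """Average the models' predictions, weighted by posterior probability."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical fitted models: (maximized log-likelihood, parameter count, prediction)
n = 100
models = [(-120.0, 2, 3.1), (-118.5, 4, 3.6), (-118.0, 7, 4.0)]
bics = [bic(ll, k, n) for ll, k, _ in models]
weights = bma_weights(bics)
prediction = bma_predict(weights, [p for _, _, p in models])
print(weights, prediction)
```

With these numbers the larger models fit slightly better but are penalized more heavily by the BIC, so most of the weight goes to the smallest model.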
Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier.[26] Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology.[27] The availability of software, including other free open-source packages for R beyond those mentioned above, helped make the methods accessible to a wider audience.[28]
Bayesian model combination
Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging.[29]
Using Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data was generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model space, but this is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection.
The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.
The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
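The approximation just described, selecting the best weighting from a random sample of points on the simplex, can be sketched in Python. This is only a sketch of that approximation, not the full BMC algorithm: the held-out predictions, the squared-error criterion, and the choice to include the simplex vertices among the candidates are assumptions made for illustration.

```python
import random


def sample_dirichlet(k, rng, alpha=1.0):
    """Draw one weight vector from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def combined_error(weights, model_preds, targets):
    """Mean squared error of the weighted combination on held-out targets."""
    err = 0.0
    for i, t in enumerate(targets):
        p = sum(w * preds[i] for w, preds in zip(weights, model_preds))
        err += (p - t) ** 2
    return err / len(targets)


def select_combination(model_preds, targets, n_samples, rng):
    k = len(model_preds)
    # The vertices correspond to picking a single model (plain model selection).
    candidates = [[1.0 if j == i else 0.0 for j in range(k)] for i in range(k)]
    candidates += [sample_dirichlet(k, rng) for _ in range(n_samples)]
    return min(candidates, key=lambda w: combined_error(w, model_preds, targets))


rng = random.Random(0)
targets = [1.0, 2.0, 3.0, 4.0]
model_preds = [
    [1.5, 2.5, 3.5, 4.5],  # hypothetical model A: biased high on held-out data
    [0.5, 1.5, 2.5, 3.5],  # hypothetical model B: biased low on held-out data
]
best_w = select_combination(model_preds, targets, 500, rng)
print(best_w, combined_error(best_w, model_preds, targets))
```

Because the two toy models err in opposite directions, the selected weighting lies near the middle of the simplex rather than at a vertex, which mirrors the contrast between combination and selection drawn above.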
Bucket of models
A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any single model in the set.
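The most common approach used for model selection is cross-validation selection (sometimes called a "bake-off contest"). For each model in the bucket, the training data set is repeatedly divided at random into two sets, A and B; the model is trained on A and tested on B. The model with the highest average score is then selected.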
Cross-validation selection can be summed up as: "try them all with the training set, and pick the one that works best".[30]
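Cross-validation selection can be sketched in Python as follows. The two candidate models (`fit_mean`, `fit_line`), the scoring by negative mean squared error, the even split, and the toy data are assumptions made for illustration.

```python
import random


def fit_mean(train):
    """Hypothetical model 1: always predicts the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Hypothetical model 2: ordinary least-squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def score(predictor, test):
    """Negative mean squared error, so that higher is better."""
    return -sum((predictor(x) - y) ** 2 for x, y in test) / len(test)


def select_model(bucket, data, c, rng):
    averages = {}
    for name, fit in bucket.items():
        total = 0.0
        for _ in range(c):  # repeat c times
            shuffled = data[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            a, b = shuffled[:half], shuffled[half:]  # random split into A and B
            total += score(fit(a), b)  # train on A, test on B
        averages[name] = total / c
    return max(averages, key=averages.get), averages


rng = random.Random(0)
data = [(x, 2.0 * x + 1.0) for x in range(20)]  # toy data lying on a line
bucket = {"mean": fit_mean, "line": fit_line}
best, averages = select_model(bucket, data, c=5, rng=rng)
print(best, averages)
```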
Gating is a generalization of cross-validation selection. It involves training another learning model to decide which of the models in the bucket is best suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.
When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.[31]
Stacking
Stacking (sometimes called stacked generalization) involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm (final estimator) is trained to make a final prediction using either all the predictions of the other algorithms (base estimators) as additional inputs or cross-validated predictions from the base estimators, which can prevent overfitting.[32] If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner.
Stacking typically yields better performance than any single one of the trained models.[33] It has been used successfully on both supervised learning tasks (regression,[34] classification, and distance learning[35]) and unsupervised learning (density estimation).[36] It has also been used to estimate the error rate of bagging.[3][37] It has been reported to outperform Bayesian model averaging.[38] The two top performers in the Netflix competition used blending, which may be considered a form of stacking.[39]
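The variant that trains the combiner on cross-validated predictions can be sketched in Python for a small regression problem. Everything specific here is an assumption for illustration: the two base estimators (`fit_slope`, `fit_constant`), the two-fold split by index parity, and the least-squares combiner (used in place of logistic regression because the toy task is regression).

```python
def fit_slope(train):
    """Hypothetical base estimator: least-squares line through the origin."""
    s = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: s * x


def fit_constant(train):
    """Hypothetical base estimator: always predicts the training mean."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m


def fit_combiner(rows, targets):
    """Least-squares weights for two base predictions (2x2 normal equations)."""
    a = sum(p * p for p, _ in rows)
    b = sum(p * q for p, q in rows)
    d = sum(q * q for _, q in rows)
    e = sum(p * t for (p, _), t in zip(rows, targets))
    f = sum(q * t for (_, q), t in zip(rows, targets))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det


def fit_stack(data, folds=2):
    rows, targets = [], []
    for k in range(folds):
        train = [ex for i, ex in enumerate(data) if i % folds != k]
        held_out = [ex for i, ex in enumerate(data) if i % folds == k]
        base = [fit_slope(train), fit_constant(train)]
        # Cross-validated predictions: each row comes from models that never saw it.
        rows += [(base[0](x), base[1](x)) for x, _ in held_out]
        targets += [y for _, y in held_out]
    w1, w2 = fit_combiner(rows, targets)
    final = [fit_slope(data), fit_constant(data)]  # refit base estimators on all data
    return lambda x: w1 * final[0](x) + w2 * final[1](x)


def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


data = [(float(x), 2.0 * x + 1.0) for x in range(20)]  # toy data: y = 2x + 1
stacked = fit_stack(data)
print(mse(stacked, data), mse(fit_slope(data), data), mse(fit_constant(data), data))
```

Neither base estimator can represent a line with an intercept on its own, while a weighted sum of the two can, so the combiner has something useful to learn from their predictions.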
Voting
Voting is another form of ensembling. See, for example, the weighted majority algorithm (machine learning).
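A minimal Python sketch of the weighted majority algorithm is given below, assuming binary predictions, initial weights of 1, and a penalty factor `beta` of 0.5 applied to every expert that errs. The expert predictions and outcomes are made-up values for illustration.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """expert_predictions[t][i] is expert i's 0/1 prediction at round t."""
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0  # weighted vote
        mistakes += int(guess != outcome)
        # Multiply the weight of every expert that was wrong by beta.
        weights = [w * beta if p != outcome else w for w, p in zip(weights, preds)]
    return weights, mistakes


rounds = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]]  # three hypothetical experts
outcomes = [1, 1, 0, 1]
weights, mistakes = weighted_majority(rounds, outcomes)
print(weights, mistakes)
```

In this toy run the first expert is always right and keeps its full weight, so after the first round its vote dominates the other two.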