SEARCH WITHIN CONTENT
Citation Information : International Journal of Advanced Network, Monitoring and Controls. VOLUME 3 , ISSUE 1 , ISSN (Online) 2470-8038, DOI: 10.21307/ijanmc-2018-021, May 2018 © 2018.
License : (CC-BY-NC-ND-4.0)
Published Online: 07-May-2018
Ensemble learning can improve the accuracy of the classification algorithm and it has been widely used. Traditional ensemble learning methods include bagging, boosting and other methods, both of which are ensemble learning methods based on homogenous base classifiers, and obtain a diversity of base classifiers only through sample perturbation. However, heterogenous base classifiers tend to be more diverse, and multi-angle disturbances tend to obtain a variety of base classifiers. This paper presents a text classification ensemble learning method based on multi-angle perturbation heterogeneous base classifier, and validates the effectiveness of the algorithm through experiments.
The main idea of ensemble learning is to generate multiple learners through certain rules and then adopt some integrated strategy to make the final decision. In general, multiple learners in the so-called ensemble learning are all homogenous “weak learners”. Based on these weak learners, multiple learners are generated through sample set perturbation, and a strong learner is obtained after integration. With the deepening of integrated learning, its broad definition gradually accepted by scholars. It refers to a collection of multiple classifiers using learning methods, without distinction between the nature of the classifier. However, the research of ensemble learning with homogenous classifiers is still the most common, and it is usually only perturbed by a single angle such as algorithm training set. The random forest algorithm adds the perturbation of the classification attribute to the traditional bagging algorithm, and thus obtains a better classification effect. This shows that the multi-angle perturbation can produce a larger difference base learner, and the ensemble learning model has higher classification accuracy. In addition, the research shows that the diversity of base learners based on the heterogeneous base classifier is stronger, so the classification model has stronger classification accuracy and generalization performance. Therefore, this paper combines the above two factors and designs a text classification ensemble learning method based on multi-angle perturbation heterogeneous base classifier.
“Weak Learning Is Equivalent to Strong Learning” is a theoretical issue raised by Kearns and Valiant in 1989. The Boosting algorithm arises from the proof of this issue. Then the Boosting algorithm derived a number of variants, including Gradient Boosting, LPBoosting and so on. Because of the characteristics of boosting that training classifiers serially, the training process takes up more resources and has lower efficiency. Therefore, whether it is possible to use a few classifiers and obtain the same performance is a matter of concern to researchers. Zhou Zhihua and others on the “selective ensemble” of boosting algorithm helped to overcome this problem. “Selective ensemble” only used the classifier with has good classification results to integrate the classifiers. This idea can finish the construction of ensembled model more efficiently without changing the original algorithm that training base classifiers. In recent years, a method of selective integration based on clustering, selection, optimization and other methods has also been developed.
The theoretical basis of ensemble learning shows that strongly learner and weak learner are equivalent, so we can find ways to convert weaker learners into strongly learners, without having to look for hard-to-find Learner. Currently there is a representative ensemble learning method boosting, bagging. The traditional Bagging algorithm and Boosting algorithm as well as many derived algorithms of the two algorithms are ensemble learning based on homogenous base classifier. And diversity is only obtained through sample disturbances, while multi-angle disturbances and heterogeneous classifiers can improve model classification accuracy. This paper first trains and integrates homogenous base classifiers, compares and analyzes changes in the accuracy of base classifiers and integrated models, and then integrates k-nearest neighbor classifiers, Bayesian classifiers, and logistic regression classifiers in text classifiers. The integration model of the heterogeneous base classifier compares the diversity with the base classifier homogenous Bagging algorithm to measure the KW value and accuracy.
In order to obtain an integrated learning model with higher accuracy, more base classifiers with more diversity and good classification results should be obtained as much as possible. From the perspective of diversity, we can try to select a combination of many “attributes” from the variable factors in the classification process. Here, “attribute” refers to everything that causes the change of the algorithm classification result. From the general process of text classification analysis, feature selection, feature dimension, classifier selection and classifier parameters can be used as a basis for the diversity of the classifier.
For each classification model, its algorithm parameters, feature selection algorithm, feature dimension are disturbed. In this paper, many kinds of classifiers are integrated, and an integrated learning model based on multi-angle perturbation heterogeneous basis classifiers is designed. Inputs in the process of model training are feature selection algorithm set S, feature dimension set N, classifier set C, adjustable parameter set A and parameter optional value set (dictionary) V. Training steps are as follows:
Step 1: Pre-process the sample set.
Step 2: Select an algorithm for each feature, make a feature selection for each feature dimension, and add the feature selection result to the feature selection result list L.
Step 3: Perform Step 4 for each classifier.
Step 4: Train and save to the classifier list C-output for each parameter of the classifier in combination with each result in the L list.
The output of the model is the classifier list C-output. The testing process of the model is as follows: After the pre-processing and the vectorization of the sample to be tested, a series of classification models are used to predict the samples to obtain a plurality of classification results. The majority of voting integration strategies lead to the final classification result.
The feature selection algorithm, feature dimension, and classifier all serve as a source for the diversity of the base classifiers. In this paper, feature selection algorithm can use chi-square statistics, information gain and mutual information algorithm. Classifier perturbation can be trained by Bayesian classifier, k-nearest neighbor classifier and logistic regression classifier. Since the parameters of the classifier are also variables, they can also be used as disturbance variables.
The experiment uses Sogou Labs’ entire network news dataset, and randomly selects 600 news documents from five categories of financial, education, automotive, entertainment and women, and uses the body part and its category markers as the experimental text data Set (balanced data set). The experiment will use 80% data as the training set and the rest as test sets.
With the increase of feature dimensions, the accuracy of each model is on the rise. When the number of features is small, the accuracy of the integrated model is only lower than that of the information gain algorithm. When the number of features exceeds 300, the integrated model performs best. It can be seen that the classification effect of the integrated model is not always better than that of a single classifier. When the feature dimension is small, the accuracy of the integrated model is lower than that of the information gain algorithm model. In the experimental results obtained from experimental data in this paper, when the feature dimension exceeds 400 dimensions, the accuracy of the model tends to be stable, and the accuracy of ensemble learning model is always higher than that of a single classifier model.
It can be seen from the experimental results that under the same conditions, the classification results of multiple classifiers combined with multiple feature selection algorithms are quite different. That is to say, the diversity between the base classifiers obtained by the perturbation feature selection algorithm is strong. Therefore, a variety of feature selection algorithms can be used as one of the sources of the base classifiers.
As can be seen from Table 1, when the feature selection algorithm is chi-square statistics, information gain or mutual information algorithm, the classification accuracy of the single classifier is lower, and the accuracy of the integrated classifier classification is higher than that of any single classifier. The disturbance of classifier makes the algorithm vary greatly in accuracy, so the disturbance of classifier can also be used as one of the sources of the diversity of classifier.
Due to the different settings of the base classifier parameters will lead to some differences between the training model, this paper designed experiments to further examine the accuracy of the basic learning model in the disturbance of classifier parameters. The experimental results are shown in Table 2-4.
Compared with the above three groups of experiments, the K-nearest neighbor classifier has a strong diversity among the classifiers in the selection of “K value” and the Bayesian classifier perturbation of the “classifier type” parameters. Therefore, Base classifiers with higher classification accuracy are candidates. However, the logistic regression classifier is insensitive to the two parameters of “classification method” and “loss function optimization method”. The accuracy of the base classifier is almost constant and the diversity is lower. In the multi-angle perturbation integrated model, only one of the classifiers can be selected.
Through the above three groups of experiments, we have screened the selected parameters of the base classifier with strong diversity. From the experimental data obtained from the above three experiments, the KW diversity measure between homogeneity classifiers that make up each classifier can be calculated as shown in Table 5.
The range of KW values is [0,1]. When KW is 0 or 1, the base classifiers are the same, and there is no diversity among base classifiers. When KW is 0.25, the base classifier has the highest diversity. As can be seen from table 5, the integrated models with the most diversity of base classifiers in table 5 are all based on heterogeneous base classifiers. The KW value of this model is better than that based on the rest of the integrated learning models.
Using the integrated method, the above feature selection algorithm, feature dimension, classifier and its parameters are taken as input to integrate all the base classifiers, and an integrated model based on multi-angle perturbation heterogeneous base classifiers is obtained. The multi-angle disturbance integrated learning model parameters are summarized in Table 6.
The parameters shown in Table 6 are used as inputs to the model to train the integrated learning model designed in this paper. Compare this model with the Bagging text classification model with only sample perturbation. The experimental results are shown in Table 7.
The experimental results show that the Bagging algorithm based on K-nearest neighbor classifier has higher KW value, that is to say, the classifier has strong diversity but low accuracy. Bagging algorithm based on Bayesian classifier and logistic regression classifier has low KW value and accuracy, that is, the base classifier has less diversity and low accuracy. The integrated learning model based on multi-angle disturbance heterogeneous basis classifier designed in this paper has the highest classification accuracy and the strong diversity of base classifiers.
This paper analyzes the algorithmic process of Bagging and Boosting, and finds that both of them are integrated learning strategies based on homogeneity classifier. At present, the research on heterogeneous base classifier integrated learning is less. In this paper, we design a learning model of multi-angle perturbation heterogeneous basis classifier. Multi-angle perturbation of heterogeneous classifiers, and try to integrate them. The experimental results show that the integrated learning model based on multi-angle perturbation-based heterogeneous base classifiers proposed and designed in this paper has higher classification accuracy and rich base classifier diversity. This will provide an important basis for further research on heterogeneous classifier integration.