Data balancing methods by fuzzy rough sets

Abstract: The robustness of rough sets theory in data cleansing have been proved in

many studies. Recently, fuzzy rough set also make a deal with imbalanced data by two

approaches. The first is a combination of fuzzy rough instance selection and balancing

methods. The second tries to use different criteria to clean majorities and minorities

classes of imbalanced data. This work is an extension of the second method which was

presented in [16]. The paper depicts complete study about the second method with some

proposed algorithms. It focuses mainly on binary classification with kNN and SVM for

imbalanced data. Experiments and comparisons among related methods will confirm pros

and coin of each method with respect to performance accuracy and time consumption.

Data balancing methods by fuzzy rough sets trang 1

Trang 1

Data balancing methods by fuzzy rough sets trang 2

Trang 2

Data balancing methods by fuzzy rough sets trang 3

Trang 3

Data balancing methods by fuzzy rough sets trang 4

Trang 4

Data balancing methods by fuzzy rough sets trang 5

Trang 5

Data balancing methods by fuzzy rough sets trang 6

Trang 6

Data balancing methods by fuzzy rough sets trang 7

Trang 7

Data balancing methods by fuzzy rough sets trang 8

Trang 8

Data balancing methods by fuzzy rough sets trang 9

Trang 9

Data balancing methods by fuzzy rough sets trang 10

Trang 10

Tải về để xem bản đầy đủ

pdf 20 trang baonam 8960
Bạn đang xem 10 trang mẫu của tài liệu "Data balancing methods by fuzzy rough sets", để tải tài liệu gốc về máy hãy click vào nút Download ở trên

Tóm tắt nội dung tài liệu: Data balancing methods by fuzzy rough sets

Data balancing methods by fuzzy rough sets
Công nghệ thông tin 
 40 T. T. Huyen, L. B. Dung, M. Dinh, “Data balancing methods by fuzzy rough sets.” 
DATA BALANCING METHODS BY FUZZY ROUGH SETS 
Tran Thanh Huyen1,2*, Le Ba Dung3, Mai Dinh4 
Abstract: The robustness of rough sets theory in data cleansing have been proved in 
many studies. Recently, fuzzy rough set also make a deal with imbalanced data by two 
approaches. The first is a combination of fuzzy rough instance selection and balancing 
methods. The second tries to use different criteria to clean majorities and minorities 
classes of imbalanced data. This work is an extension of the second method which was 
presented in [16]. The paper depicts complete study about the second method with some 
proposed algorithms. It focuses mainly on binary classification with kNN and SVM for 
imbalanced data. Experiments and comparisons among related methods will confirm pros 
and coin of each method with respect to performance accuracy and time consumption. 
Keywords: Rough Set theory; Fuzzy-rough sets; Granular computing; Imbalanced data; Instance selection. 
1. INTRODUCTION 
Rough set theory was first introduced by Pawlak [21, 22] in early 1980s as a machine 
learning methods for knowledge acquisition from data. It provides mathematical 
approach to deal with inconsistency between items in datasets. It consequently can be 
used for pattern extraction, decision rule generation, feature selection and data reduction. 
In case of data reduction/selection, main ideas are extracting fuzzy memberships of 
positive regions [23] and choosing the instances which have large membership degrees 
(degrees have to larger than thresholds) for training phases [3, 13, 28]. It thus can reduce 
noise by defining and eliminating low quality instances in balance datasets. 
Other issue that may cause an inconsistency problem is imbalanced data [15]. A 
dataset is imbalanced when the numbers of instances in some classes are much larger 
than in others. Such classes are called majority classes. The classes with small 
cardinality are referred to as minority classes. 
There are two possible approaches in using fuzzy rough set [23] to select instance 
from imbalanced datasets. The first is combination of balancing methods and a rough set 
based noise removed technique [24, 25, 29]. In these approaches, fuzzy rough sets are 
first used to remove low quality instances. Then, a well-known balancing technique 
called “Synthetic Minority Oversampling Technique (SMOTE)” [4] is employed to form 
a candidate set. Finally, fuzzy-rough sets are used again to select quality instances from 
the candidate sets. 
The second approach is using different criteria in defining different thresholds for 
majority and minority classes [16, 27]. By using small threshold for minority class, more 
items from minority classes can be selected. Besides, the research in [16] introduces other 
method to deal with highly imbalanced data by keeping all instance in minority class while 
remove/change label in majority class. The experimental results in those studies show 
considerable improvement in classification performance. However, the absence of 
generating candidates and optimizing thresholds methods limits application in real. 
There are also some study in using fuzzy-rough sets as classification methods for 
imbalanced data [26, 30]. However, making a classifier is not in scope of this paper. 
This research will show complete study of using fuzzy rough set to select instance in 
Nghiên cứu khoa học công nghệ 
 Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2020 41 
imbalanced dataset without any balancing techniques. The paper is structured as follows: 
Section 2 reviews the original rough set theory and an extension called fuzzy-rough set; 
In section 3, fuzzy rough instance selection approach to deal with inconsistency is 
discussed along with their issues; The proposed algorithms of this research with methods 
to choose parameters are introduced in section 4; Experiments with results and 
comparisons are discussed in section 5; Finally, section 6 concludes the study. 
2. THEORETICAL BACKGROUND 
2.1. Information Systems and Tolerance Relation 
An information system is represented as a data table. Each row of this table represents 
an instance of an object such as people, things, etc. Information about every object is 
described by object attribute (feature) values. 
An information system in the rough sets study is formally defined as a pair ( ),I U A= 
where U is a non-empty finite set of objects called the universe and A is a non-empty 
finite set of attributes such that :af U A→ for every a A [21, 22]. The non-empty 
discrete value set Va is called the domain of a. The original rough set theory deals with 
complete information systems in which ( ), , ax U a A f x is a precise value. 
If U contains at least an unknown value object, then I is called an incomplete 
information system, ... 5.46 0.7463 0.5583 0.7258 0.7889 0.6825 0.7239 
abalone9-18 16.40 0.5119 0.5060 0.5978 0.6153 0.5256 0.5256 
yeast-1-4-5-8 vs 7 22.10 0.5000 0.5000 0.5848 0.6197 0.5735 0.5282 
glass5 22.78 0.5000 0.5000 0.6711 0.7014 0.6931 0.7123 
kr-vs-k-
one_vs_fifteen 
27.77 0.9647 0.9647 0.9799 0.9647 0.9679 0.9808 
yeast4 28.10 0.5333 0.5294 0.7618 0.7643 0.7539 0.8020 
winequality-red-4 29.17 0.5000 0.5000 0.5498 0.5861 0.6081 0.6043 
yeast-1-2-8-9 vs 7 30.57 0.5000 0.5000 0.5684 0.6764 0.5959 0.6458 
ozone one hr 31.42 0.5000 0.5000 0.7231 0.6696 0.7411 0.7413 
yeast5 32.73 0.7044 0.7103 0.8774 0.9391 0.8755 0.8888 
winequality-red-8 vs 
6.csv 
35.44 0.5000 0.5000 0.5632 0.5618 0.6616 0.6609 
ecoli-0-1-3-7_vs_2-
6 
39.00 0.6667 0.5000 0.5000 0.6667 0.6740 0.6479 
yeast6 41.40 0.6440 0.6599 0.8247 0.8644 0.7649 0.8821 
winequality-red-8 vs 
6-7 
46.50 0.5000 0.5000 0.4754 0.5320 0.6465 0.6465 
abalone-19_vs_10-
11-12-13 
49.69 0.5000 0.5000 NA 0.5431 0.5674 0.5716 
shuttle-2 vs 5 66.67 0.9841 0.9486 0.9524 0.9789 0.9841 0.9841 
winequality-red-3 vs 
5 
68.10 0.5000 NA NA 0.5412 0.5725 0.5718 
poker-8-9 vs 5.csv 82.00 0.5000 0.5000 NA 0.5086 0.6188 0.6189 
Average AUC in Table 3 also shows that MFRIS1 archives better performance for 
most lower IR imbalanced datasets. On the other hand, MFRIS3 records better 
performance with high IR. MFRIS2 is also a solution for some datasets. In these cases, 
maybe the number of positive instances is very small so it is impossible to remove some 
or there is little different among qualities of positive instances. 
In order to show the statistical significance of the performance of the proposed 
method, we perform Friedman test [10] to compare the AUC to each of the state-of-the-
art algorithms. The Friedman rankings and test statistics are given in Table 4 and Table 5. 
Nghiên cứu khoa học công nghệ 
 Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2020 55 
Table 4. Friedman Ranking w.r.t. AUC. 
Algorithm Ranking w.r.t AUC 
FRIS 1.8919 
FRPSS 3.4595 
MFRIS2 3.6892 
MFRIS3 4.5541 
MFRIS1 4.6486 
Table 5. Friedman statistic. 
N 37 
Chi-square 49.7955 
df 4 
Asymp. Sig. 3.9837e-10 
Let’s choose the level of significant as 0.05. Because the significance level (Asymp. 
Sig.) is less than 0.05, it is plausible that there is an overall significant difference 
between the mean ranks of these algorithms. Next, we perform the Wilcoxon post hoc 
test [31] to compare the proposed algorithms with each of the other algorithms. The 
adjusted p-values with respect to MFRIS1, MFRIS2 and MFRIS3 are given in Table 6, 7 
and 8. 
Table 6. Comparing MFRIS1 with others. 
MFRIS1 vs p-values 
FRIS 4.0463e-07 
FRPSS 4.3322e-04 
MFRIS2 4.2049e-02 
MFRIS3 0.3292 
Table 7. Comparing MFRIS2 with others. 
MFRIS2 vs p-values 
FRIS 1.6983e-04 
FRPSS 0.5330 
MFRIS1 0.9579 
MFRIS3 0.9916 
Table 8. Comparing MFRIS3 with others. 
MFRIS3 vs p-values 
FRIS 1.4046e-06 
FRPSS 5.9737e-03 
MFRIS1 0.6708 
MFRIS3 8.4063e-03 
From these tables, we can see that MFRIS2 performs better than the FRIS algorithm, 
but is not so significant in comparison with FRPSS. On the other hand, MFRIS1 and 
MFRIS3 relatively give the same performance, and they both significantly outperform 
Công nghệ thông tin 
 56 T. T. Huyen, L. B. Dung, M. Dinh, “Data balancing methods by fuzzy rough sets.” 
the other algorithms in terms of AUC. 
To analyze the benefit of the algorithms, average and increasing AUC are depicted in 
Figure 4. The left plot shows performance of kNN and SVM with and without instance 
selection methods. The right plot illustrates some increasing to compare with 
performance without cleansing data. From the figure, it is noticed that the algorithms 
assist increasing in classification performance in overall. SVM probably receives more 
support from instance selection methods. 
Figure 4. Comparing average of AUC (on all datasets) and increase on average of AUC 
using MFRISs for different classifiers. 
To compare time consumption among fuzzy rough based instance selection methods, 
Table 9 shows that the modification of fuzzy rough instance selection with granularity 
1.0 = for imbalanced datasets consumes far less than the conventional methods. On the 
other hands, the proposed algorithms have to spend more time for tuning granularity. In 
this step, time consuming to calculate approximation of the package is most concern. In 
comparison among MFRISs, time consumption increases from the first to the third 
method. 
Table 9. Prepossessing time (in minutes) for each instance selection method 
(Datasets with more than 1000 instances). 
Describe FRIS FRPSS MFRIS1 MFRIS2 MFRIS3 Granularity 
segment0 4.2682 6.6287 1.9824 2.0654 2.6732 1.00 
yeast-0-2-5-6_vs_3-7-
8-9 
0.7493 0.7433 0.4502 0.4283 0.5663 1.00 
yeast-0-2-5-7-9_vs_3-
6-8 
0.7405 0.8229 0.4604 0.4393 0.5682 1.00 
ozone_eight_hr 0.1625 0.2711 5.9860 6.1786 7.3651 0.15 
shuttle-c0-vs-c4 2.6479 2.7236 1.0280 1.0577 1.3117 1.00 
kr-vs-k-
one_vs_fifteen 
0.2367 0.3219 3.1913 3.1134 3.2165 0.05 
yeast4 1.6571 1.6874 0.7349 0.7291 0.9285 1.00 
winequality-red-4 1.8274 1.8734 0.9359 0.9348 1.1819 1.00 
ozone_one_hr 0.1558 0.1667 5.9767 6.2840 7.4906 0.15 
yeast5 1.6490 1.6955 0.7117 0.7337 0.8980 1.00 
Nghiên cứu khoa học công nghệ 
 Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2020 57 
yeast6 1.6440 1.6672 0.7383 0.7215 0.9212 1.00 
abalone-19_vs_10-11-
12-13 
2.1847 NA 0.9316 0.9337 1.1880 1.00 
shuttle-2_vs_5 9.3936 9.4679 2.7952 3.0382 3.6771 1.00 
poker-8-9_vs_5 0.4975 NA 1.0288 1.3063 1.4366 1.00 
6. CONCLUSION AND FUTURE WORK 
In this paper, we make an extension of the earlier study with three modification of 
fuzzy rough instance selection for imbalanced data. By using different criteria in 
selecting positive and negative classes, it possible to obtain higher classification 
performance with kNN and SVM. The proposed approach is discussed with 
mathematical explanation, algorithm presentation and method to optimize parameters. 
The experiments demo the improvement in performance of the proposed methods. 
They can also be competitive with state-of-the-art methods while eliminating using other 
balancing technique such as SMOTE. 
This paper confirms the robustness of using multiple criteria in fuzzy-rough sets 
based instance selection approach. It solves problems of automatically choosing 
parameters that exist in [16]. Further work would be studies on multiple-class 
imbalanced data selection methods and fuzzy-rough sets based classifier which can deal 
with imbalanced data. 
REFERENCES 
[1]. Jesus Alcala-Fdez, Alberto Fernandez, Julian Luengo, Joaquin Derrac, and Salvador Garcia. 
Keel data-mining software tool: “Data set repository, integration of algorithms and 
experimental analysis framework.” Multiple-Valued Logic and Soft Computing, 17(2-
3):255–287, 2011. 
[2]. K. Bache and M. Lichman. “UCI Machine Learning Repository”, 2013. 
[3]. Yaile Caballero, Rafael Bello, Delia Alvarez, Maria M. Gareia, and Yaimara Pizano. 
“Improving the k-nn method: Rough set in edit training set”. In John Debenham, editor, 
Professional Practice in Artificial Intelligence, volume 218 of IFIP International Federation 
for Information Processing, pages 21–30. Springer US, 2006. 
[4]. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 
“Smote: Synthetic minority over-sampling technique”. Journal of Artificial Intelligence 
Research, 16:321–357, 2002. 
[5]. Chris Cornelis, Nele Verbiest, and Richard Jensen. “Ordered weighted average based fuzzy 
rough sets”. In Rough Set and Knowledge Technology - 5th International Conference, 
RSKT 2010, Beijing, China, October 15- 17, 2010. Proceedings, pages 78–85, 2010. 
[6]. Corinna Cortes and Vladimir Vapnik. “Support-vector networks”. Machine Learning, 
20(3):273–297, 1995. 
[7]. T. Cover and P. Hart. “Nearest neighbor pattern classification”. IEEE Trans. Inf. Theor., 
13(1):21–27, September 1967. 
[8]. Didier Dubois and Henri Prade. “Rough fuzzy sets and fuzzy rough sets”. In International 
Journal of General Systems, volume 17, pages 191–209. 1990. 
[9]. Didier Dubois and Henri Prade. “Putting rough sets and fuzzy sets together. In Roman 
Slowinski”, editor, Intelligent Decision Support, volume 11 of Theory and Decision 
Library, pages 203–232. Springer Netherlands, 1992. 
[10]. Friedman, M. “The use of ranks to avoid the assumption of normality implicit in the analysis 
Công nghệ thông tin 
 58 T. T. Huyen, L. B. Dung, M. Dinh, “Data balancing methods by fuzzy rough sets.” 
of variance”. Journal of the American Statistical Association, 32(200):675–701, 1973. 
[11]. Grzymala-Busse, J. W., Clark, P. G., and Kuehnhausen, M. “Generalized probabilistic 
approximations of incomplete data”. International Journal of Approximate Reasoning, 
55(1, Part 2):180 – 196. Special issue on Decision-Theoretic Rough Sets, 2014. 
[12]. Jin Huang and C.X. Ling. “Using auc and accuracy in evaluating learning algorithms”. 
Knowledge and Data Engineering, IEEE Transactions on, 17(3):299 310, March 2005. 
[13]. R. Jensen and C. Cornelis. “Fuzzy-rough instance selection”. In Fuzzy Systems (FUZZ), 
2010 IEEE International Conference on, pages 1–7, July 2010. 
[14]. Marzena Kryszkiewicz. “Rough set approach to incomplete information systems”. Inf. Sci., 
112(1-4):39–49, December 1998. 
[15]. Victoria Lopez, Alberto Fernandez, Salvador Garcia, Vasile Palade, and Francisco Herrera. 
“An insight into classification with imbalanced data: Empirical results and current trends 
on using data intrinsic characteristics”. Information Sciences, 250(0):113 – 141, 2013. 
[16]. Do Van Nguyen, Keisuke Ogawa, Kazunori Matsumoto, and Masayuki Hashimoto. 
“Editing training sets from imbalanced data using fuzzyrough sets”. In IFIP Advances in 
Information and Communication Technology, volume 458, pages 115–129, France, 2015. 
[17]. Do Van Nguyen, Koichi Yamada, and Muneyuki Unehara. “Extended tolerance relation to 
define a new rough set model in incomplete information systems”. Advances in Fuzzy 
Systems, 2013. Article ID 372091. 
[18]. Do Van Nguyen, Koichi Yamada, and Muneyuki Unehara. “On probability of matching in 
probability based rough set definitions”. In IEEESMC2013, pages 449–454, Manchester, 
The UK, 2013. 
[19]. Nguyen, D. V., Yamada, K., and Unehara, M. “Rough set approach with imperfect data 
based on dempster-shafer theory”. Journal of Advanced Computational Intelligence and 
Intelligent Informatics, 18(3):280–288, 2014. 
[20]. Nguyen, H. S. “Discretization problem for rough sets methods”. In Proceedings of the First 
International Conference on Rough Sets and Current Trends in Computing, RSCTC ’98, 
pages 545–552, London, UK, UK. Springer-Verlag, 1998. 
[21]. Zdzislaw Pawlak. “Rough sets”. International Journal of Computer and Information 
Sciences, 11:341–356, 1982. 
[22]. Zdzislaw Pawlak. “Rough Sets”. Theoretical Aspects of Reasoning about Data. Kluwer 
Acad., 1991. 
[23]. Anna Maria Radzikowska and Etienne E. Kerre. “A comparative study of fuzzy rough sets”. 
Fuzzy Sets Syst., 126(2):137–155, March 2002. 
[24]. Enislay Ramentol, Yaile Caballero, Rafael Bello, and Francisco Herrera. SMOTE-RSB *: 
“A hybrid preprocessing approach based on oversampling and undersampling for high 
imbalanced data-sets using SMOTE and rough sets theory”. Knowl. Inf. Syst., 33(2):245–
265, 2011. 
[25]. Enislay Ramentol, Nele Verbiest, Rafael Bello, Yaille Caballero, Chris Cornelis, and 
Francisco Herrera. Smote-frst: “A new resampling method using fuzzy rough set theory”. In 
Cengiz Kahraman, Etienne Kerre, and Faik Tunc Bozbura, editors, World Scientific 
Proceedings Series on Computer Engineering and Decision Making, volume 7, pages 800–
805. World Scientific, 2012. 
[26]. Enislay Ramentol, Sarah Vluymans, Nele Verbiest, Yaille Caballero, Rafael Bello, Chris 
Cornelis, and Francisco Herrera. Ifrowann: “Imbalanced fuzzy rough ordered weighted average 
nearest neighbor classification”. In IEEE Transaction on Fuzzy System, volume 23, 2012. 
[27]. Verbiest, N. Multi threshold frps: “A new approach to fuzzy rough set prototype selection”. 
In RSCTC 2014, LNAI, volume 8536, pages 83–91, 2014. 
[28]. Nele Verbiest, Chris Cornelis, and Francisco Herrera. Frps: “A fuzzy rough prototype 
Nghiên cứu khoa học công nghệ 
 Tạp chí Nghiên cứu KH&CN quân sự, Số Đặc san CNTT, 12 - 2020 59 
selection method”. Pattern Recognition, 46(10):2770 – 2782, 2013. 
[29]. Nele Verbiest, Enislay Ramentol, Chris Cornelis, and Francisco Herrera. “Preprocessing 
noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection”. 
Appl. Soft Comput., 22:511–517, 2014. 
[30]. Sarah Vluymans, Danel Sanchez Tarrago, Yvan Saeys, Chris Cornelis, and Francisco 
Herrera. “Fuzzy rough classifier for class imbalanced multi-instance data”. Pattern 
Recognition, 53:36–45, 2016. 
[31]. Wilcoxon, F. “Individual comparisons by ranking methods”. Biometrics Bulletin, 1(6):80–
83, 1945. 
[32]. Ronald R. Yager. “On ordered weighted averaging aggregation operators in multicriteria 
decisionmaking”. IEEE Trans. Syst. Man Cybern., 18(1):183–190, January 1988. 
[33]. Y. Y. Yao. “Combination of rough and fuzzy sets based on -level sets”. In Rough sets and 
data mining: Analysis for imprecise data, pages 301– 321. Kluwer Academic, 1997. 
[34]. Hans-Jurgen Zimmermann. “Fuzzy Set Theory and its Applications”. Springer, 2001. 
TÓM TẮT 
PHƯƠNG PHÁP CÂN BẰNG DỮ LIỆU BẰNG TẬP HỢP THÔ MỜ 
Sự hiệu quả của lý thuyết tập hợp thô trong việc làm sạch dữ liệu đã được chứng minh 
trong nhiều nghiên cứu. Gần đây, tập thô mờ cũng thực hiện xử lý dữ liệu mất cân bằng 
bằng hai cách tiếp cận. Đầu tiên là sự kết hợp của các phương pháp cân bằng và trích 
chọn đối tượng thô mờ. Phương pháp thứ hai cố gắng sử dụng các tiêu chí khác nhau để 
làm sạch các lớp đa số và thiểu số trong dữ liệu mất cân bằng. Công việc này là một phần 
mở rộng của phương pháp thứ hai đã được trình bày trong [15]. Bài báo mô tả nghiên 
cứu đầy đủ về phương pháp thứ hai với một số thuật toán được đề xuất. Nó tập trung chủ 
yếu vào phân loại nhị phân với kNN và SVM cho dữ liệu không cân bằng. Các thử nghiệm 
và so sánh giữa các phương pháp có liên quan sẽ xác nhận ưu điểm và xu hướng của từng 
phương pháp về độ chính xác của hiệu suất và mức tiêu thụ thời gian. 
Từ khóa: Lý thuyết tập hợp thô; Tập thô mờ; Tính toán hạt; Trích chọn đối tượng. 
Received 21st October 2020 
Revised 10th December 2020 
Accepted 15th December 2020 
Author affiliations: 
1UET, Vietnam National University, Hanoi, Vietnam; 
2Institute of Information Technology, Hanoi, Vietnam; 
3IOIT, Vietnam Academy of Science and Technology; 
4Military College of Tank-Armour Officers. 
*Corresponding author: huyenmit82@gmail.com. 

File đính kèm:

  • pdfdata_balancing_methods_by_fuzzy_rough_sets.pdf