Data balancing methods by fuzzy rough sets
Abstract: The robustness of rough set theory in data cleansing has been demonstrated in many studies. Recently, fuzzy rough sets have also been applied to imbalanced data through two approaches. The first is a combination of fuzzy rough instance selection and balancing methods. The second uses different criteria to clean the majority and minority classes of imbalanced data. This work is an extension of the second approach, which was presented in [16]. The paper presents a complete study of the second approach together with some proposed algorithms. It focuses mainly on binary classification with kNN and SVM for imbalanced data. Experiments and comparisons among related methods confirm the pros and cons of each method with respect to classification accuracy and time consumption.
Tran Thanh Huyen (1,2*), Le Ba Dung (3), Mai Dinh (4)

Keywords: Rough set theory; Fuzzy-rough sets; Granular computing; Imbalanced data; Instance selection.

1. INTRODUCTION

Rough set theory was first introduced by Pawlak [21, 22] in the early 1980s as a machine learning method for knowledge acquisition from data. It provides a mathematical approach to dealing with inconsistency between items in datasets. Consequently, it can be used for pattern extraction, decision rule generation, feature selection and data reduction. In the case of data reduction/selection, the main idea is to compute the fuzzy membership degrees of instances to the positive region [23] and to select for the training phase those instances whose membership degrees are larger than a given threshold [3, 13, 28]. It can thus reduce noise by identifying and eliminating low-quality instances in balanced datasets.

Another issue that may cause an inconsistency problem is imbalanced data [15]. A dataset is imbalanced when the numbers of instances in some classes are much larger than in others. Such classes are called majority classes; the classes with small cardinality are referred to as minority classes. There are two possible approaches to using fuzzy rough sets [23] to select instances from imbalanced datasets. The first is a combination of balancing methods and a rough-set-based noise removal technique [24, 25, 29]. In these approaches, fuzzy rough sets are first used to remove low-quality instances. Then, a well-known balancing technique called the Synthetic Minority Oversampling Technique (SMOTE) [4] is employed to form a candidate set. Finally, fuzzy rough sets are used again to select quality instances from the candidate set. The second approach uses different criteria to define different thresholds for the majority and minority classes [16, 27]. By using a small threshold for the minority class, more of its instances can be selected. In addition, the research in [16] introduces another method to deal with highly imbalanced data by keeping all instances of the minority class while removing or relabeling instances of the majority class. The experimental results in those studies show considerable improvement in classification performance. However, the absence of methods for generating candidates and optimizing thresholds limits their application in practice. There are also some studies on using fuzzy rough sets as classification methods for imbalanced data [26, 30]; however, building a classifier is not within the scope of this paper.
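To make these ideas concrete, the following minimal sketch selects instances by their fuzzy-rough positive-region membership with class-specific thresholds, in the spirit of the second approach. The Gaussian similarity relation, the Lukasiewicz implicator and all parameter values are illustrative assumptions, not the actual choices of [16] or of this paper.

```python
import numpy as np

def pos_membership(X, y, gamma=1.0):
    # Fuzzy indiscernibility relation R(x, z): Gaussian similarity on features.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    R = np.exp(-gamma * d2)
    # Indicator of "z belongs to the class of x".
    same = (y[:, None] == y[None, :]).astype(float)
    # Lower-approximation membership with the Lukasiewicz implicator
    # I(a, b) = min(1, 1 - a + b): POS(x) = inf over z of I(R(x, z), same(x, z)).
    return np.min(np.minimum(1.0, 1.0 - R + same), axis=1)

def select_instances(X, y, t_major=0.8, t_minor=0.4, minority_label=1):
    # Class-specific thresholds: a looser threshold for the minority class
    # keeps more minority instances, as in the second approach above.
    mu = pos_membership(X, y)
    thr = np.where(y == minority_label, t_minor, t_major)
    return mu >= thr
```

A call such as select_instances(X, y) returns a boolean mask of instances to keep for training; lowering t_minor retains more of the minority class.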
This research presents a complete study of using fuzzy rough sets to select instances from imbalanced datasets without any balancing techniques. The paper is structured as follows: Section 2 reviews the original rough set theory and an extension called fuzzy-rough sets; in Section 3, the fuzzy rough instance selection approach to dealing with inconsistency is discussed along with its issues; the proposed algorithms of this research, with methods to choose their parameters, are introduced in Section 4; experiments with results and comparisons are discussed in Section 5; finally, Section 6 concludes the study.

2. THEORETICAL BACKGROUND

2.1. Information Systems and Tolerance Relation

An information system is represented as a data table. Each row of this table represents an instance of an object such as people, things, etc. Information about every object is described by attribute (feature) values. An information system in rough set studies is formally defined as a pair I = (U, A), where U is a non-empty finite set of objects called the universe and A is a non-empty finite set of attributes such that f_a : U → V_a for every a ∈ A [21, 22]. The non-empty discrete value set V_a is called the domain of a. The original rough set theory deals with complete information systems, in which f_a(x) is a precise value for every x ∈ U and a ∈ A. If U contains at least one object with an unknown value, then I is called an incomplete information system.

...

Table 3 (fragment). AUC per dataset (columns: dataset, imbalance ratio IR, AUC of the compared methods).
Dataset                      IR      AUC of the compared methods
…                             5.46   0.7463  0.5583  0.7258  0.7889  0.6825  0.7239
abalone9-18                  16.40   0.5119  0.5060  0.5978  0.6153  0.5256  0.5256
yeast-1-4-5-8_vs_7           22.10   0.5000  0.5000  0.5848  0.6197  0.5735  0.5282
glass5                       22.78   0.5000  0.5000  0.6711  0.7014  0.6931  0.7123
kr-vs-k-one_vs_fifteen       27.77   0.9647  0.9647  0.9799  0.9647  0.9679  0.9808
yeast4                       28.10   0.5333  0.5294  0.7618  0.7643  0.7539  0.8020
winequality-red-4            29.17   0.5000  0.5000  0.5498  0.5861  0.6081  0.6043
yeast-1-2-8-9_vs_7           30.57   0.5000  0.5000  0.5684  0.6764  0.5959  0.6458
ozone_one_hr                 31.42   0.5000  0.5000  0.7231  0.6696  0.7411  0.7413
yeast5                       32.73   0.7044  0.7103  0.8774  0.9391  0.8755  0.8888
winequality-red-8_vs_6       35.44   0.5000  0.5000  0.5632  0.5618  0.6616  0.6609
ecoli-0-1-3-7_vs_2-6         39.00   0.6667  0.5000  0.5000  0.6667  0.6740  0.6479
yeast6                       41.40   0.6440  0.6599  0.8247  0.8644  0.7649  0.8821
winequality-red-8_vs_6-7     46.50   0.5000  0.5000  0.4754  0.5320  0.6465  0.6465
abalone-19_vs_10-11-12-13    49.69   0.5000  0.5000  NA      0.5431  0.5674  0.5716
shuttle-2_vs_5               66.67   0.9841  0.9486  0.9524  0.9789  0.9841  0.9841
winequality-red-3_vs_5       68.10   0.5000  NA      NA      0.5412  0.5725  0.5718
poker-8-9_vs_5               82.00   0.5000  0.5000  NA      0.5086  0.6188  0.6189

The average AUC in Table 3 also shows that MFRIS1 achieves better performance on most lower-IR imbalanced datasets, whereas MFRIS3 records better performance at high IR. MFRIS2 is also a solution for some datasets; in these cases, possibly the number of positive instances is so small that removing any of them is impossible, or there is little difference among the qualities of the positive instances.

In order to show the statistical significance of the performance of the proposed methods, we perform the Friedman test [10] to compare the AUC with each of the state-of-the-art algorithms. The Friedman rankings and test statistics are given in Table 4 and Table 5.

Table 4. Friedman ranking w.r.t. AUC.
Algorithm   Ranking w.r.t. AUC
FRIS        1.8919
FRPSS       3.4595
MFRIS2      3.6892
MFRIS3      4.5541
MFRIS1      4.6486
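As an aside, the ranking and test statistic reported in Tables 4 and 5 can be reproduced with scipy.stats; the following is a minimal sketch with random placeholder AUC values (the paper uses the real per-dataset AUCs of Table 3 over N = 37 datasets).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, wilcoxon

algorithms = ["FRIS", "FRPSS", "MFRIS1", "MFRIS2", "MFRIS3"]
# auc[i, j]: AUC of algorithm j on dataset i (placeholder values).
auc = np.random.default_rng(0).uniform(0.5, 1.0, size=(37, 5))

# Friedman test over the per-dataset rankings of the five algorithms.
chi2, p = friedmanchisquare(*auc.T)
print(f"Chi-square = {chi2:.4f}, Asymp. Sig. = {p:.4e}")

# Average Friedman ranks: rankdata assigns rank 1 to the lowest AUC,
# so a higher average rank means better performance (consistent with
# Table 4, where MFRIS1 ranks highest).
for name, r in zip(algorithms, rankdata(auc, axis=1).mean(axis=0)):
    print(f"{name}: {r:.4f}")

# Wilcoxon signed-rank post hoc test for one pair (MFRIS1 vs FRIS);
# note that the paper reports adjusted p-values.
_, p_pair = wilcoxon(auc[:, 2], auc[:, 0])
print(f"MFRIS1 vs FRIS: p = {p_pair:.4e}")
```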
Table 5. Friedman statistic.
N             37
Chi-square    49.7955
df            4
Asymp. Sig.   3.9837e-10

Let us choose the significance level as 0.05. Because the significance (Asymp. Sig.) is less than 0.05, it is plausible that there is an overall significant difference between the mean ranks of these algorithms. Next, we perform the Wilcoxon post hoc test [31] to compare the proposed algorithms with each of the other algorithms. The adjusted p-values with respect to MFRIS1, MFRIS2 and MFRIS3 are given in Tables 6, 7 and 8.

Table 6. Comparing MFRIS1 with others.
MFRIS1 vs   p-value
FRIS        4.0463e-07
FRPSS       4.3322e-04
MFRIS2      4.2049e-02
MFRIS3      0.3292

Table 7. Comparing MFRIS2 with others.
MFRIS2 vs   p-value
FRIS        1.6983e-04
FRPSS       0.5330
MFRIS1      0.9579
MFRIS3      0.9916

Table 8. Comparing MFRIS3 with others.
MFRIS3 vs   p-value
FRIS        1.4046e-06
FRPSS       5.9737e-03
MFRIS1      0.6708
MFRIS2      8.4063e-03

From these tables, we can see that MFRIS2 performs better than the FRIS algorithm, but its difference from FRPSS is not significant. On the other hand, MFRIS1 and MFRIS3 give roughly the same performance, and both significantly outperform the other algorithms in terms of AUC.

To analyze the benefit of the algorithms, the average AUC and the increase in AUC are depicted in Figure 4. The left plot shows the performance of kNN and SVM with and without instance selection methods; the right plot illustrates the increase compared with the performance without data cleansing. From the figure, it can be seen that the algorithms help increase classification performance overall, and SVM probably receives more support from instance selection methods.

Figure 4. Comparison of the average AUC (over all datasets) and the increase in average AUC using the MFRIS methods for different classifiers.

To compare time consumption among fuzzy-rough-based instance selection methods, Table 9 shows that the modification of fuzzy rough instance selection with granularity 1.0 for imbalanced datasets consumes far less time than the conventional methods. On the other hand, the proposed algorithms have to spend extra time tuning the granularity; in this step, the time needed to calculate the approximations is the main concern. Among the MFRIS methods, time consumption increases from the first to the third method.

Table 9. Preprocessing time (in minutes) for each instance selection method (datasets with more than 1000 instances).
Dataset                       FRIS     FRPSS    MFRIS1   MFRIS2   MFRIS3   Granularity
segment0                      4.2682   6.6287   1.9824   2.0654   2.6732   1.00
yeast-0-2-5-6_vs_3-7-8-9      0.7493   0.7433   0.4502   0.4283   0.5663   1.00
yeast-0-2-5-7-9_vs_3-6-8      0.7405   0.8229   0.4604   0.4393   0.5682   1.00
ozone_eight_hr                0.1625   0.2711   5.9860   6.1786   7.3651   0.15
shuttle-c0-vs-c4              2.6479   2.7236   1.0280   1.0577   1.3117   1.00
kr-vs-k-one_vs_fifteen        0.2367   0.3219   3.1913   3.1134   3.2165   0.05
yeast4                        1.6571   1.6874   0.7349   0.7291   0.9285   1.00
winequality-red-4             1.8274   1.8734   0.9359   0.9348   1.1819   1.00
ozone_one_hr                  0.1558   0.1667   5.9767   6.2840   7.4906   0.15
yeast5                        1.6490   1.6955   0.7117   0.7337   0.8980   1.00
yeast6                        1.6440   1.6672   0.7383   0.7215   0.9212   1.00
abalone-19_vs_10-11-12-13     2.1847   NA       0.9316   0.9337   1.1880   1.00
shuttle-2_vs_5                9.3936   9.4679   2.7952   3.0382   3.6771   1.00
poker-8-9_vs_5                0.4975   NA       1.0288   1.3063   1.4366   1.00
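For completeness, the kind of evaluation behind Figure 4 (AUC of kNN and SVM with and without instance selection) can be sketched as follows. This is a simplified single-split version with illustrative parameters (k = 3, a 30% test split), not the paper's exact experimental protocol; the select argument could be, for example, the select_instances function sketched in Section 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def auc_with_selection(X, y, select=None, seed=0):
    # Stratified split; instance selection is applied to the training
    # split only, so the test split stays untouched.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    if select is not None:
        keep = select(X_tr, y_tr)            # boolean mask of kept instances
        X_tr, y_tr = X_tr[keep], y_tr[keep]
    scores = {}
    for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=3)),
                      ("SVM", SVC(kernel="rbf", probability=True))]:
        clf.fit(X_tr, y_tr)
        scores[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return scores
```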
6. CONCLUSION AND FUTURE WORK

In this paper, we extend the earlier study with three modifications of fuzzy rough instance selection for imbalanced data. By using different criteria in selecting instances from the positive and negative classes, it is possible to obtain higher classification performance with kNN and SVM. The proposed approach is presented with a mathematical explanation, algorithm descriptions and a method to optimize parameters. The experiments demonstrate the improvement in performance of the proposed methods; they are also competitive with state-of-the-art methods while eliminating the use of additional balancing techniques such as SMOTE. This paper confirms the robustness of using multiple criteria in the fuzzy-rough-sets-based instance selection approach, and it solves the problem of automatically choosing parameters that exists in [16]. Further work would be studies on instance selection methods for multi-class imbalanced data and on fuzzy-rough-sets-based classifiers that can deal with imbalanced data.

REFERENCES
[1]. Jesus Alcala-Fdez, Alberto Fernandez, Julian Luengo, Joaquin Derrac, and Salvador Garcia. "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework". Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2011.
[2]. K. Bache and M. Lichman. "UCI Machine Learning Repository", 2013.
[3]. Yaile Caballero, Rafael Bello, Delia Alvarez, Maria M. Garcia, and Yaimara Pizano. "Improving the k-NN method: Rough set in edit training set". In John Debenham, editor, Professional Practice in Artificial Intelligence, volume 218 of IFIP International Federation for Information Processing, pages 21–30. Springer US, 2006.
[4]. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. "SMOTE: Synthetic minority over-sampling technique". Journal of Artificial Intelligence Research, 16:321–357, 2002.
[5]. Chris Cornelis, Nele Verbiest, and Richard Jensen. "Ordered weighted average based fuzzy rough sets". In Rough Set and Knowledge Technology - 5th International Conference, RSKT 2010, Beijing, China, October 15-17, 2010, Proceedings, pages 78–85, 2010.
[6]. Corinna Cortes and Vladimir Vapnik. "Support-vector networks". Machine Learning, 20(3):273–297, 1995.
[7]. T. Cover and P. Hart. "Nearest neighbor pattern classification". IEEE Transactions on Information Theory, 13(1):21–27, September 1967.
[8]. Didier Dubois and Henri Prade. "Rough fuzzy sets and fuzzy rough sets". International Journal of General Systems, 17:191–209, 1990.
[9]. Didier Dubois and Henri Prade. "Putting rough sets and fuzzy sets together". In Roman Slowinski, editor, Intelligent Decision Support, volume 11 of Theory and Decision Library, pages 203–232. Springer Netherlands, 1992.
[10]. M. Friedman. "The use of ranks to avoid the assumption of normality implicit in the analysis of variance". Journal of the American Statistical Association, 32(200):675–701, 1937.
[11]. J. W. Grzymala-Busse, P. G. Clark, and M. Kuehnhausen. "Generalized probabilistic approximations of incomplete data". International Journal of Approximate Reasoning, 55(1, Part 2):180–196, special issue on Decision-Theoretic Rough Sets, 2014.
[12]. Jin Huang and C. X. Ling. "Using AUC and accuracy in evaluating learning algorithms". IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310, March 2005.
[13]. R. Jensen and C. Cornelis. "Fuzzy-rough instance selection".
In 2010 IEEE International Conference on Fuzzy Systems (FUZZ), pages 1–7, July 2010.
[14]. Marzena Kryszkiewicz. "Rough set approach to incomplete information systems". Information Sciences, 112(1-4):39–49, December 1998.
[15]. Victoria Lopez, Alberto Fernandez, Salvador Garcia, Vasile Palade, and Francisco Herrera. "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics". Information Sciences, 250:113–141, 2013.
[16]. Do Van Nguyen, Keisuke Ogawa, Kazunori Matsumoto, and Masayuki Hashimoto. "Editing training sets from imbalanced data using fuzzy-rough sets". In IFIP Advances in Information and Communication Technology, volume 458, pages 115–129, France, 2015.
[17]. Do Van Nguyen, Koichi Yamada, and Muneyuki Unehara. "Extended tolerance relation to define a new rough set model in incomplete information systems". Advances in Fuzzy Systems, 2013. Article ID 372091.
[18]. Do Van Nguyen, Koichi Yamada, and Muneyuki Unehara. "On probability of matching in probability based rough set definitions". In IEEE SMC 2013, pages 449–454, Manchester, UK, 2013.
[19]. D. V. Nguyen, K. Yamada, and M. Unehara. "Rough set approach with imperfect data based on Dempster-Shafer theory". Journal of Advanced Computational Intelligence and Intelligent Informatics, 18(3):280–288, 2014.
[20]. H. S. Nguyen. "Discretization problem for rough sets methods". In Proceedings of the First International Conference on Rough Sets and Current Trends in Computing, RSCTC '98, pages 545–552, London, UK. Springer-Verlag, 1998.
[21]. Zdzislaw Pawlak. "Rough sets". International Journal of Computer and Information Sciences, 11:341–356, 1982.
[22]. Zdzislaw Pawlak. "Rough Sets: Theoretical Aspects of Reasoning about Data". Kluwer Academic, 1991.
[23]. Anna Maria Radzikowska and Etienne E. Kerre. "A comparative study of fuzzy rough sets". Fuzzy Sets and Systems, 126(2):137–155, March 2002.
[24]. Enislay Ramentol, Yaile Caballero, Rafael Bello, and Francisco Herrera. "SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory". Knowledge and Information Systems, 33(2):245–265, 2011.
[25]. Enislay Ramentol, Nele Verbiest, Rafael Bello, Yaille Caballero, Chris Cornelis, and Francisco Herrera. "SMOTE-FRST: A new resampling method using fuzzy rough set theory". In Cengiz Kahraman, Etienne Kerre, and Faik Tunc Bozbura, editors, World Scientific Proceedings Series on Computer Engineering and Decision Making, volume 7, pages 800–805. World Scientific, 2012.
[26]. Enislay Ramentol, Sarah Vluymans, Nele Verbiest, Yaille Caballero, Rafael Bello, Chris Cornelis, and Francisco Herrera. "IFROWANN: Imbalanced fuzzy rough ordered weighted average nearest neighbor classification". IEEE Transactions on Fuzzy Systems, volume 23, 2015.
[27]. N. Verbiest. "Multi-threshold FRPS: A new approach to fuzzy rough set prototype selection". In RSCTC 2014, LNAI, volume 8536, pages 83–91, 2014.
[28]. Nele Verbiest, Chris Cornelis, and Francisco Herrera. "FRPS: A fuzzy rough prototype selection method". Pattern Recognition, 46(10):2770–2782, 2013.
[29]. Nele Verbiest, Enislay Ramentol, Chris Cornelis, and Francisco Herrera. "Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection". Applied Soft Computing, 22:511–517, 2014.
[30].
Sarah Vluymans, Danel Sanchez Tarrago, Yvan Saeys, Chris Cornelis, and Francisco Herrera. "Fuzzy rough classifiers for class imbalanced multi-instance data". Pattern Recognition, 53:36–45, 2016.
[31]. F. Wilcoxon. "Individual comparisons by ranking methods". Biometrics Bulletin, 1(6):80–83, 1945.
[32]. Ronald R. Yager. "On ordered weighted averaging aggregation operators in multicriteria decisionmaking". IEEE Transactions on Systems, Man, and Cybernetics, 18(1):183–190, January 1988.
[33]. Y. Y. Yao. "Combination of rough and fuzzy sets based on α-level sets". In Rough Sets and Data Mining: Analysis for Imprecise Data, pages 301–321. Kluwer Academic, 1997.
[34]. Hans-Jurgen Zimmermann. "Fuzzy Set Theory and its Applications". Springer, 2001.

Received 21st October 2020
Revised 10th December 2020
Accepted 15th December 2020

Author affiliations:
1 UET, Vietnam National University, Hanoi, Vietnam;
2 Institute of Information Technology, Hanoi, Vietnam;
3 IOIT, Vietnam Academy of Science and Technology;
4 Military College of Tank-Armour Officers.
*Corresponding author: huyenmit82@gmail.com.