Inverted Heuristics in Subgroup Discovery

Anita Valmarska, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, Ljubljana, Slovenia, anita.valmarska@ijs.si
Marko Robnik-Šikonja, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia, marko.robnik@fri.uni-lj.si
Nada Lavrač, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, Ljubljana, Slovenia, nada.lavrac@ijs.si

ABSTRACT

In rule learning, rules are typically induced in two phases, rule refinement and rule selection. It was recently argued that using a separate heuristic for each phase, in particular the so-called inverted heuristic in the refinement phase, produces longer rules with comparable classification accuracy. In this paper we test the utility of inverted heuristics in the context of subgroup discovery. For this purpose we developed a DoubleBeam subgroup discovery algorithm that allows combining various heuristics for rule refinement and selection. The algorithm was experimentally evaluated on 20 UCI datasets using 10-fold double-loop cross validation. The experimental results suggest that a variant of the DoubleBeam algorithm using a specific combination of refinement and selection heuristics generates longer rules without compromising rule quality. However, the DoubleBeam algorithm using inverted heuristics does not outperform the standard CN2-SD and SD algorithms.

1. INTRODUCTION

Rule learning is one of the earliest machine learning techniques and has been used in numerous applications [5]. It is a symbolic data analysis technique whose aim is to discover comprehensible patterns or models of data [10]. The key advantage of rule learning compared to other statistical learning techniques is its inherent simplicity and its human-comprehensible output models and patterns.

Symbolic data analysis techniques can be divided into two categories. Techniques for predictive induction produce models, typically induced from class-labeled data, which are used to predict the class of previously unseen examples. The second category consists of techniques for descriptive induction, where the aim is to find comprehensible patterns, typically induced from unlabeled data. There are also descriptive induction techniques that discover patterns in the form of rules from labeled data; these are referred to as supervised descriptive rule discovery approaches [10]. Typical representatives of such techniques are contrast set mining (CSM) [2], emerging pattern mining (EPM) [4], and subgroup discovery (SD) [9, 16].

The task of subgroup discovery is to find interesting subgroups in a population, i.e., subgroups that have a significantly different class distribution than the entire population. The result of subgroup discovery is a set of individual rules, where the rule consequent is a class value. The main difference between learning of classification rules and subgroup discovery is that the latter induces only individual rules of interest, revealing interesting properties of groups of instances, and does not necessarily form a rule set covering the entire problem space, which is required for classification. An important characteristic of the subgroup discovery task is that it combines predictive and descriptive induction: it provides short and understandable descriptions of subgroups with respect to the property of interest. This feature of subgroup discovery has inspired many researchers to investigate new methods for finding interesting patterns in the data more effectively.
Most subgroup discovery approaches build either on classification rule learning algorithms, e.g., EXPLORA [9], MIDOS [16], SD [6], CN2-SD [13], and RSD [14], or on algorithms for association rule learning, e.g., APRIORI-SD [8], SD-Map [1], and Merge-SD [7].

In rule learning, rules are constructed by adding conditions that optimize a certain heuristic. Typically, heuristics are used in two different phases of the process: (i) to evaluate rule refinements, i.e., to select which of the refinements of the current rule will be further explored, and (ii) for rule selection, i.e., to decide which of the explored refinements is added to the rule set. Stecher et al. [15] proposed using separate heuristics for each of the two rule construction phases. In the rule refinement phase they proposed to use inverted heuristics, i.e., heuristics whose isometrics are rotated around the base rule. These heuristics evaluate the relative gain obtained by refining the current rule.

In this paper we test the utility of inverted heuristics in the context of subgroup discovery. For this purpose we developed a DoubleBeam subgroup discovery algorithm that allows combining various heuristics for rule refinement and selection. The algorithm was experimentally evaluated on 20 UCI datasets using 10-fold double-loop cross validation.

This paper is organized as follows. In Section 2 we present the findings of Stecher et al. about the use of inverted heuristics in the rule learning process. Section 3 presents the DoubleBeam subgroup discovery algorithm. In Section 4 we describe the data sets used, followed by the empirical evaluation and the obtained results. Finally, in Section 5 we present our conclusions and ideas for further work.

2. INVERTED HEURISTICS

Rule learning algorithms rely on heuristic measures to determine the quality of induced rules. Stecher et al. [15] propose a distinction between rule refinement and rule selection heuristics in inductive rule learning. They argue that the nature of separate-and-conquer rule learning algorithms opens up the possibility of using two different heuristics in the two fundamental steps of the rule learning process, rule refinement and rule selection. They show in the coverage space why it is beneficial to separate the evaluation of candidates for rule refinement from the selection of rules for the final theory. The rule refinement step in a top-down search requires inverted heuristics, which, in principle, results in better rules. Such heuristics evaluate rules from the point of view of the current base rule instead of the empty rule. Stecher et al. adapt three standard heuristics with slightly different but related properties:

Precision:     h_prec(p, n) = p / (p + n)                              (1)
Laplace:       h_lap(p, n) = (p + 1) / (p + n + 2)                     (2)
m-estimate:    h_mest(p, n, m) = (p + m * P / (P + N)) / (p + n + m)   (3)

Parameters p and n denote the number of positive and negative examples, with respect to the target class of interest, covered by a potential subgroup. For the purpose of rule refinement an inverted heuristic is used. Isometrics of inverted heuristics do not rotate around the origin, but around the base rule. Representations of the inverted heuristics in the coverage space reveal the following relationship with the basic heuristics:

h'(p, n) = h(N - n, P - p)                                             (4)

where parameters P and N denote the number of positive and negative examples with respect to the target class, and depend on the predecessor (base) rule. Consequently, the inverted heuristics have the following forms:

Inverted precision:    h'_prec(p, n) = (N - n) / ((P + N) - (p + n))                               (5)
Inverted Laplace:      h'_lap(p, n) = (N - n + 1) / ((P + N) - (p + n) + 2)                        (6)
Inverted m-estimate:   h'_mest(p, n, m) = (N - n + m * P / (P + N)) / ((P + N) - (p + n) + m)      (7)
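To make Eqs. (1)-(7) concrete, the following Python sketch implements the three selection heuristics and a generic inversion wrapper corresponding to Eq. (4). This is an illustration only, not the authors' implementation; the function names, the default value of m, and the example counts are assumptions.

def precision(p, n, P, N):
    return p / (p + n) if p + n > 0 else 0.0           # Eq. (1)

def laplace(p, n, P, N):
    return (p + 1) / (p + n + 2)                        # Eq. (2)

def m_estimate(p, n, P, N, m=2.0):                      # default m is arbitrary
    return (p + m * P / (P + N)) / (p + n + m)          # Eq. (3)

def inverted(h):
    """Turn a heuristic h(p, n, P, N) into its inverted variant, Eq. (4):
    the refinement is evaluated from the point of view of the base rule,
    whose coverage supplies P (positives) and N (negatives)."""
    def h_inv(p, n, P, N, **kwargs):
        return h(N - n, P - p, P, N, **kwargs)
    return h_inv

# Example: a base rule covering P=50 positives and N=40 negatives is refined
# into a rule covering p=30 positives and n=5 negatives.
print(laplace(30, 5, 50, 40))            # selection quality: 31/37
print(inverted(laplace)(30, 5, 50, 40))  # refinement quality (inverted Laplace): 36/57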
Overall, in [15] the combination of the Laplace heuristic h_lap in the rule selection step and the inverted Laplace heuristic h'_lap in the rule refinement step outperformed the other combinations in terms of average classification accuracy. An interesting side conclusion from [15] is that the usage of inverted heuristics in rule refinement produces on average longer rules. The tendency of inverted heuristics to find longer descriptions, together with the fact that they require no additional parameters, makes the separation of rule refinement and rule selection an appealing research direction in the domain of subgroup discovery. We therefore investigated the use of inverted heuristics in subgroup discovery. For that purpose we implemented a new DoubleBeam algorithm for subgroup discovery, which combines separate refinement and selection heuristics with beam search.

3. DOUBLEBEAM ALGORITHM

The DoubleBeam algorithm maintains two beams, a refinement beam and a selection beam. Upon initialization, each beam is filled with the best features according to their refinement and selection quality, respectively. The algorithm then enters a loop in which rules from the refinement beam are refined by adding features from the dataset to existing rules. Newly produced rules are added to the refinement beam if their refinement quality exceeds the refinement quality of rules already in the refinement beam. Newly produced rules are then evaluated according to their selection quality, and the selection beam is updated with newly induced rules whose selection quality is better than the selection quality of rules already in that beam. The algorithm exits the loop and stops when there are no changes in the selection beam. The DoubleBeam algorithm is outlined in Algorithm 1.

Algorithm 1: DoubleBeam algorithm
Input: E = P ∪ N (E is the training set, |E| is the training set size, P are the positive (target class) examples, N are the negative (non-target) examples), TargetClass
Output: subgroups
Parameters: min_support, refinementBeamWidth, selectionBeamWidth, refinement heuristic, selection heuristic

  CandidateList ← all feature values or intervals
  for each candidate in CandidateList do
      evaluate candidate with refinement quality
      evaluate candidate with selection quality
  end
  sort CandidateList according to the refinement quality
  for i = 0 to refinementBeamWidth do
      RefinementBeam(i) ← CandidateList(i)
  end
  sort CandidateList according to the selection quality
  for i = 0 to selectionBeamWidth do
      SelectionBeam(i) ← CandidateList(i)
  end
  do
      RefinementCandidates ← refine RefinementBeam with CandidateList
      update RefinementBeam with RefinementCandidates using refinement quality
      update SelectionBeam with RefinementCandidates using selection quality
  while there are changes in SelectionBeam
  return SelectionBeam
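The following Python sketch illustrates the double-beam search of Algorithm 1 under simplifying assumptions; it is not the authors' ClowdFlows implementation. The rule representation (conjunctions of attribute-value pairs), the coverage-based minimum-support check, and all names are illustrative, and the heuristics are passed in as functions h(p, n, P, N) such as those in the earlier sketch.

def covers(rule, example):
    return all(example.get(att) == val for att, val in rule)

def counts(rule, examples, labels):
    """Positive and negative coverage of a rule on the training set."""
    p = sum(1 for e, y in zip(examples, labels) if y and covers(rule, e))
    n = sum(1 for e, y in zip(examples, labels) if not y and covers(rule, e))
    return p, n

def double_beam(examples, labels, features, refine_quality, select_quality,
                refine_width=5, select_width=5, min_support=0.01):
    P, N = sum(labels), len(labels) - sum(labels)

    def rq(rule):
        # Refinement quality is evaluated relative to the predecessor rule,
        # whose coverage supplies P and N for an inverted heuristic.
        p, n = counts(rule, examples, labels)
        base_p, base_n = counts(rule[:-1], examples, labels)
        return refine_quality(p, n, base_p, base_n)

    def sq(rule):
        # Selection quality is evaluated relative to the whole training set.
        p, n = counts(rule, examples, labels)
        return select_quality(p, n, P, N)

    def top(rules, quality, width):
        # Deterministic beam update: keep the `width` best rules.
        return sorted(set(rules), key=lambda r: (quality(r), str(r)),
                      reverse=True)[:width]

    candidates = [(f,) for f in features]
    refinement_beam = top(candidates, rq, refine_width)
    selection_beam = top(candidates, sq, select_width)

    while True:
        # Refine every rule in the refinement beam with every candidate feature,
        # keeping only refinements with sufficient coverage (assumed support check).
        refined = [rule + (f,) for rule in refinement_beam
                   for f in features if f not in rule]
        refined = [r for r in refined
                   if sum(counts(r, examples, labels)) >= min_support * len(labels)]
        refinement_beam = top(refinement_beam + refined, rq, refine_width)
        new_selection = top(selection_beam + refined, sq, select_width)
        if new_selection == selection_beam:   # stop when the selection beam is stable
            return selection_beam
        selection_beam = new_selection

Under these assumptions, the DB-ILL variant evaluated below would correspond to refine_quality=inverted(laplace) and select_quality=laplace from the earlier sketch.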
4. EXPERIMENTAL RESULTS

The DoubleBeam algorithm was implemented in the ClowdFlows platform [12]. For the purpose of our evaluation, we used the following combinations of refinement and selection heuristics: (h'_lap, h_lap), (h'_prec, h_prec), (h'_mest, h_mest), (h'_g, h_g), and (h_g, h_g), named DB-ILL (DoubleBeam-Inverted Laplace, Laplace), DB-IPP (DoubleBeam-Inverted precision, precision), DB-IMM (DoubleBeam-Inverted m-estimate, m-estimate), DB-IGG (DoubleBeam-Inverted generalization quotient, generalization quotient), and DB-GG (DoubleBeam-generalization quotient, generalization quotient), respectively. The h_g heuristic is the generalization quotient proposed in [6], while h'_g is its inverted variant. The generalization quotient is the heuristic used in the SD algorithm. The SD algorithm and the algorithms CN2-SD and APRIORI-SD were already implemented in this platform.

4.1 Experimental setting

We use the same 20 UCI classification data sets as [15] to compare three state-of-the-art subgroup discovery algorithms (SD, CN2-SD, and APRIORI-SD) and the DoubleBeam algorithm with five combinations of refinement and selection heuristics (DB-ILL, DB-IPP, DB-IMM, DB-IGG, and DB-GG). The comparison is performed with 10-fold double-loop cross validation on each dataset. For each algorithm, a grid of possible parameter values was set beforehand, and the value of min_support was set to 0.01. Each of the 10 learning sets was additionally split into training and test data. For each algorithm, models were built on the training data with each parameter setting from the grid, and the parameter values maximizing the unusualness of the produced subgroups were then used to build the model on the whole learning set. Unusualness is a measure introduced in [13] and defined as:

WRAcc(Class ← Cond) = p(Cond) * (p(Class | Cond) - p(Class))    (8)

We use the subgroup discovery evaluation function implemented in Orange by Kralj et al. [11]. The function calculates the following measures: coverage, support, size, complexity, significance, unusualness (i.e., WRAcc), classification accuracy, and AUC.

4.2 Results

The WRAcc values obtained in the experiments are shown in Table 1. These values are averaged over all the classes of each particular dataset. The values for the APRIORI-SD algorithm on the horse-colic dataset are missing because the algorithm did not converge within a period of over 5 days. For the datasets where the WRAcc value of the APRIORI-SD algorithm is 0.000, the algorithm returned over 10,000,000 itemsets and did not finish properly. According to the obtained results, the CN2-SD and SD algorithms have the best average ranks, and the APRIORI-SD algorithm performs worst.

For the comparison of the methods we use the methodology proposed by Demšar [3]. We operate under the null hypothesis that all the algorithms are equivalent. Two algorithms differ significantly if the difference between their average ranks is larger than the critical difference. The results of the Nemenyi test on the average WRAcc values are shown in Figure 1; average ranks of the algorithms are written in parentheses, and the critical difference is 2.35. It is evident that the CN2-SD algorithm produces the most interesting subgroups, which are statistically significantly more unusual than the ones produced by the DoubleBeam algorithm with the combinations (h'_prec, h_prec), (h'_lap, h_lap), and (h'_mest, h_mest), and by the APRIORI-SD algorithm. There are no statistically significant differences between the CN2-SD algorithm, the SD algorithm, and the DoubleBeam algorithm with the combinations (h_g, h_g) and (h'_g, h_g).
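The critical difference used in this comparison follows from the Nemenyi test formula in Demšar [3]. The sketch below reproduces the reported value of 2.35 for eight algorithms compared on 20 data sets; the constant q_alpha = 3.031 is the tabulated critical value for alpha = 0.05 and k = 8 algorithms, taken from the standard table rather than from the paper itself.

from math import sqrt

def nemenyi_cd(k, num_datasets, q_alpha=3.031):
    """Two algorithms differ significantly if their average ranks differ
    by more than this critical difference."""
    return q_alpha * sqrt(k * (k + 1) / (6.0 * num_datasets))

print(round(nemenyi_cd(k=8, num_datasets=20), 2))  # 2.35

For example, the difference between the average ranks of CN2-SD (2.27) and DB-IPP (6.03) in Figure 1 exceeds 2.35, while the difference between CN2-SD and DB-IGG (4.38) does not, which matches the significance statements above.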
Figure 1: Nemenyi test on WRAcc values with a significance level of 0.05. [Average ranks: CN2-SD (2.27), SD (2.30), DB-GG (2.45), DB-IGG (4.38), DB-ILL (4.80), DB-IPP (6.03), DB-IMM (6.65), APRIORI-SD (7.12); CD = 2.35]

The results of the Nemenyi test on the average rule sizes are shown in Figure 2. The DoubleBeam algorithm with the combination (h'_prec, h_prec) produces subgroups which are on average described by the longest rules. However, the DB-IPP algorithm generates subgroups described by rules that are statistically significantly longer only than the ones produced by the DB-IGG algorithm; for the other evaluated algorithms there is no statistical evidence that the DB-IPP algorithm produces longer rules. These results do not confirm that the DoubleBeam algorithm with an inverted refinement heuristic produces statistically significantly longer rules than other subgroup discovery algorithms.

Figure 2: Nemenyi test on rankings of average rule sizes (note that longer rules produce lower rankings) with a significance level of 0.05. [Average ranks: DB-IPP (3.67), APRIORI-SD (3.73), DB-GG (3.75), DB-IMM (4.05), DB-ILL (4.28), CN2-SD (4.75), SD (4.80), DB-IGG (6.97); CD = 2.35]
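Each entry in Table 1 below is the unusualness from Eq. (8), averaged over the classes of a dataset. The following sketch shows that computation for a single rule; the function name and the example counts are illustrative, and this is not the Orange evaluation function used in the experiments.

def wracc(p, n, P, N):
    """Weighted relative accuracy (unusualness, Eq. 8) of a rule covering
    p positive and n negative examples in a data set with P positive and
    N negative examples."""
    covered, total = p + n, P + N
    if covered == 0:
        return 0.0
    return (covered / total) * (p / covered - P / total)

# Example: a subgroup covering 30 of 50 positives and 10 of 150 negatives.
print(wracc(30, 10, 50, 150))  # 0.2 * (0.75 - 0.25) = 0.1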
Table 1: Ten-fold double-loop cross validation WRAcc results for subgroup discovery. Best values are written in bold.
Datasets: breast-cancer, car, contact-lenses, futebol, glass, hepatitis, horse-colic, hypothyroid, idh, ionosphere, iris, labor, lymphography, monk3, mushroom, primary-tumor, soybean, tic-tac-toe, vote, zoo.
Algorithms: SD, APRIORI-SD, CN2-SD, DB-ILL, DB-IPP, DB-IMM, DB-GG, DB-IGG.
WRAcc values: 0.045 0.029 0.080 0.017 0.050 0.046 0.131 0.025 0.088 0.115 0.162 0.074 0.058 0.068 0.128 0.018 0.033 0.053 0.184 0.083 0.003 0.006 0.031 0.004 0.017 0.000 0.000 0.053 0.000 0.119 0.018 0.015 0.014 0.000 0.004 0.000 0.007 0.052 0.000 0.024 0.031 0.066 0.009 0.047 0.061 0.071 0.031 0.094 0.111 0.197 0.106 0.046 0.065 0.163 0.009 0.037 0.021 0.201 0.096 0.015 0.021 0.036 0.003 0.029 0.041 0.045 0.019 0.078 0.084 0.148 0.082 0.045 0.041 0.147 0.007 0.023 0.017 0.157 0.041 0.010 0.021 0.023 0.002 0.026 0.016 0.022 0.009 0.068 0.049 0.098 0.021 0.042 0.024 0.025 0.009 0.023 0.011 0.076 0.020 0.011 0.003 0.000 0.000 0.019 0.031 0.054 0.007 0.000 0.062 0.045 0.029 0.038 0.007 0.023 0.004 0.009 0.007 0.120 0.025 0.045 0.028 0.081 0.001 0.045 0.060 0.131 0.032 0.081 0.120 0.165 0.090 0.056 0.068 0.146 0.011 0.053 0.188 0.086 0.041 0.028 0.081 0.000 0.038 0.043 0.084 0.024 0.088 0.101 0.169 0.063 0.036 0.058 0.122 0.008 0.028 0.053 0.164 0.058

5. CONCLUSIONS

The experiments indicate that the subgroups described by rules created using the inverted heuristics of [15] as rule refinement heuristics in subgroup discovery are significantly less interesting than the subgroups induced by the CN2-SD algorithm, the SD algorithm, and the DB-GG algorithm. There is no significant difference in the unusualness of the subgroups induced by the CN2-SD algorithm, the SD algorithm, the DB-GG algorithm, and the DB-IGG algorithm. However, it has to be mentioned that the CN2-SD algorithm uses WRAcc as its heuristic for building subgroups.

The results also suggest that when the combination (h'_prec, h_prec) of heuristics is used, the obtained rules tend to have more conditions than the rules built by the other state-of-the-art algorithms for subgroup discovery; however, this difference is not statistically significant. The longer rules created by the algorithms using the inverted heuristics of [15] are more specific, so the corresponding subgroups contain fewer examples, which decreases their unusualness.

Considering the evaluation results, we can conclude that the DoubleBeam algorithm which uses the combination (h_g, h_g) as refinement and selection heuristics can be a good choice for subgroup discovery. It induces subgroups that are comparable to the subgroups induced by the CN2-SD algorithm and the SD algorithm in terms of their unusualness, and the subgroups induced by the DB-GG algorithm are on average described by longer rules (see Figure 2). The fact that inverted heuristics require no additional parameters, together with the obtained results regarding the average rule length, makes the proposed approach an interesting research direction. In future work we want to focus on the reasons why the rules induced by the DB-ILL, DB-IPP, and DB-IMM algorithms are less interesting than the ones produced by the standard subgroup discovery algorithms, and to implement an approach that addresses this issue. We also want to investigate the influence of inverted heuristics in other state-of-the-art subgroup discovery algorithms.

6. REFERENCES

[1] M. Atzmüller and F. Puppe. SD-Map: A fast algorithm for exhaustive subgroup discovery. In Proceedings of Knowledge Discovery in Databases, PKDD 2006, pages 6-17, 2006.
[2] S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213-246, 2001.
[3] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.
[4] G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43-52, 1999.
[5] J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer, 2012.
[6] D. Gamberger and N. Lavrač. Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research, 17:501-527, 2002.
[7] H. Grosskreutz and S. Rüping. On subgroup discovery in numerical domains. Data Mining and Knowledge Discovery, 19(2):210-226, 2009.
[8] B. Kavšek and N. Lavrač. APRIORI-SD: Adapting association rule learning to subgroup discovery. Applied Artificial Intelligence, 20(7):543-583, 2006.
[9] W. Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249-271, 1996.
[10] P. Kralj Novak, N. Lavrač, and G. I. Webb.
Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10:377-403, 2009.
[11] P. Kralj Novak, N. Lavrač, B. Zupan, and D. Gamberger. Experimental comparison of three subgroup discovery algorithms: Analysing brain ischemia data. In Proceedings of the 8th International Multiconference Information Society, pages 220-223, 2005.
[12] J. Kranjc, V. Podpečan, and N. Lavrač. ClowdFlows: A cloud based scientific workflow platform. In Proceedings of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2012, pages 816-819, 2012.
[13] N. Lavrač, B. Kavšek, P. A. Flach, and L. Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153-188, 2004.
[14] N. Lavrač, F. Železný, and P. A. Flach. RSD: Relational subgroup discovery through first-order feature construction. In Proceedings of the 12th International Conference on Inductive Logic Programming, pages 149-165, 2002.
[15] J. Stecher, F. Janssen, and J. Fürnkranz. Separating rule refinement and rule selection heuristics in inductive rule learning. In Proceedings of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2014, pages 114-129, 2014.
[16] S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD 1997, pages 78-87, 1997.