K-Fold Cross Validation for Selection of Cardiovascular Disease Diagnosis Features by Applying Rule-Based Datamining

Coronary heart disease occurs when atherosclerosis restricts blood flow to the heart muscle through the coronary arteries. It is a frequent cause of death. The diagnostic method to which doctors most often refer patients is coronary angiography, but it is invasive, expensive

electrocardiography (ECG), echocardiogram, stress tests, nuclear imaging and coronary angiography [5]. This series of follow-up tests demands care and accuracy from the cardiologist, poses a risk to the patient, and is expensive. The difficulty of diagnosing coronary artery disease (CAD) is exacerbated by the small number of cardiologists in Indonesia. With the development of information technology, computer-aided methods for diagnosing CAD have been developed.
Research using the variable precision rough set (VPRS) method for coronary heart disease diagnosis has been carried out to find data patterns in the form of rule-based classification [6]; it produces fewer rules than the rough set method. The rules produced by VPRS are easier to understand, but if the rules are reduced the accuracy decreases [7]. Research on the diagnosis of coronary heart disease with the VPRS classification method achieved an accuracy of 75.22% [8] [9]. The trial was carried out by randomizing the data 30 times, but the accuracy of each individual rule is unknown. Rule-based feature selection with VPRS has also been carried out to select the best features for diagnosing coronary heart disease [10]. That study compared medical expert-based feature selection (MFS) with computer-based feature selection using the VPRS method, and found that selecting features with VPRS increased accuracy compared with diagnosis without feature selection in previous studies [8][9] [11]. The feature selection method combining VPRS and MFS produces fewer rules than MFS alone, while VPRS and the combination of VPRS and MFS reach the same accuracy, 84.84% [10]. However, in [10] the dataset was only split once, with 2/3 of the data used for training and the remaining 1/3 for testing. This leaves the evaluation highly exposed to noise in the data, because the classification method is tested only on that one-third of the records and not on the whole dataset.
K-fold cross-validation (CV) is a statistical method for evaluating the performance of a model, method, or algorithm, in which the data is divided into two subsets: training data for the learning process and testing data for validation or evaluation. The CV scheme can be selected based on the size of the dataset. K-fold is usually used to reduce computing time while maintaining the accuracy of the estimate [12].

Dataset
The dataset used in this study is the Cleveland Heart Disease dataset from the UCI machine learning repository. It contains 303 records, 7 of which have missing values; these records are deleted so that they do not affect the classification results. The dataset has 14 attributes and 2 classes, sick and not sick. Table 1 describes the attributes of the Cleveland heart disease dataset [13].

Method
The research method consists of the following main processes: pre-processing, discretization, feature selection, randomization with k-fold cross validation, rule generation, and performance evaluation, as shown in the research flow diagram in Figure 1.

3.1.Preprocessing

3.3.Feature Selection
Computer-based feature selection is done to reduce the Cleveland data and to choose the features that are relevant to the diagnosis of coronary heart disease. This research applies two feature selection methods: a computer-based method, VPRS, and medical expert-based or motivated feature selection (MFS). Feature selection with the VPRS method uses the ROSE2 software.

3.4.Medical expert based feature selection
Medical expert-based feature selection, or motivated feature selection (MFS), is based on the knowledge of medical experts. For coronary heart disease, medical experts identify eight medically significant factors that influence the diagnosis process: age, chest pain type (angina, abnang, notang, asympt), resting blood pressure, cholesterol, fasting blood sugar, resting heart rate (normal, abnormal, ventricular hypertrophy), maximum heart rate, and exercise-induced angina [11] [16]. These eight factors are used as the result of medical expert-based feature selection.

3.5.Variable Precision Rough Set (VPRS)
Computer-based feature selection uses the VPRS method. This research uses VPRS both for feature selection and as the classification method.
VPRS is an extension of the classical rough set model. In this research, it is used to analyze and identify data patterns that represent functional statistical trends [17]. VPRS deals with partial classification, controlled by a precision parameter β. Ziarko defines β as the admissible misclassification rate, with values in the range 0 ≤ β < 0.5. The VPRS modelling procedure has four steps [18]. VPRS is an approach to data analysis that relies on two basic concepts, the β-lower and β-upper approximations of a set, given in Equation (1) and Equation (2) respectively. These are the lower and upper approximations of D with precision level β, where E(P) denotes the set of equivalence classes (condition classes) based on a subset of attributes P. According to [17], the quality of classification for the VPRS model is defined by Equation (4) for a given β value.
The value of Equation (4) measures the proportion of objects in the universe U that can be classified with respect to the decision attribute D at the given β value.
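For reference, the β-lower and β-upper approximations and the quality of classification can be written in the standard misclassification-based form of Ziarko's model. This is a sketch in assumed notation: the symbols c(X, D), D_j and γ^β are introduced here for illustration and may differ from the notation of Equations (1), (2) and (4).

```latex
% Relative misclassification of an equivalence class X with respect to a set D
c(X, D) = 1 - \frac{|X \cap D|}{|X|}, \qquad X \neq \emptyset

% beta-lower and beta-upper approximations of D, with 0 \le \beta < 0.5
\underline{C}^{P}_{\beta}(D) = \bigcup \{\, X \in E(P) : c(X, D) \le \beta \,\}
\overline{C}^{P}_{\beta}(D)  = \bigcup \{\, X \in E(P) : c(X, D) < 1 - \beta \,\}

% Quality of classification: share of the universe U that falls in the
% beta-lower approximation of some decision class D_j of the decision attribute
\gamma^{\beta}(P, D) =
  \frac{\left| \bigcup_{D_j \in E(D)} \underline{C}^{P}_{\beta}(D_j) \right|}{|U|}
```

With β = 0 these reduce to the classical rough set lower and upper approximations.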
The procedure for producing decision rules from an information system consists of two major steps:
Step 1: Selection of the best smallest set of attributes (e.g., β-reduct selection).
Step 2: Simplification of the information system, achieved by dropping specific attribute values.
Ziarko [17] indicates that each smallest set of attributes can be regarded as an alternative group of attributes that substitutes for all available attributes in case-based decision making.
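The approximations described in this section can be illustrated with a minimal Python sketch. It assumes the misclassification-based definition with 0 ≤ β < 0.5; the function names, the attribute names cp and exang, and the ten toy records are hypothetical and are not taken from the Cleveland data or from ROSE2.

```python
from collections import defaultdict

def equivalence_classes(records, attrs):
    """Group record indices sharing the same values on attrs, i.e. E(P)."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[tuple(rec[a] for a in attrs)].append(i)
    return list(groups.values())

def beta_approximations(records, attrs, decision, target, beta):
    """Return (beta_lower, beta_upper) index sets for the class decision == target."""
    if not 0 <= beta < 0.5:
        raise ValueError("beta must satisfy 0 <= beta < 0.5")
    lower, upper = set(), set()
    for cls in equivalence_classes(records, attrs):
        hits = sum(1 for i in cls if records[i][decision] == target)
        error = 1 - hits / len(cls)  # relative misclassification of this class
        if error <= beta:
            lower.update(cls)        # class is (almost) entirely inside the target
        if error < 1 - beta:
            upper.update(cls)        # class overlaps the target enough to keep
    return lower, upper

# Hypothetical toy data: three condition classes of sizes 5, 3 and 2.
records = (
    [{"cp": "asympt", "exang": "yes", "class": "sick"}] * 4
    + [{"cp": "asympt", "exang": "yes", "class": "healthy"}]
    + [{"cp": "notang", "exang": "no", "class": "healthy"}] * 3
    + [{"cp": "angina", "exang": "no", "class": "sick"},
       {"cp": "angina", "exang": "no", "class": "healthy"}]
)

for beta in (0.15, 0.25):
    lower, upper = beta_approximations(records, ["cp", "exang"], "class", "sick", beta)
    print(beta, sorted(lower), sorted(upper))
```

In this toy case the first condition class has a misclassification rate of 0.2, so it enters the β-lower approximation of "sick" at β = 0.25 but not at β = 0.15, which shows how β relaxes the classical requirement of a perfect match.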

3.6.K-Fold Cross Validation
The next process is randomization of the dataset using k-fold cross validation, which provides the test data for evaluating the performance of the VPRS method. The dataset is divided into k subsets of approximately equal size. This research uses 10-fold, 5-fold and 3-fold cross validation. For 10-fold, the data is divided into 10 folds of approximately the same size, giving 10 data subsets. In each of the 10 rounds, 9 folds are used for training and 1 fold for testing, as illustrated in Figure 2. The same procedure is applied for 5-fold and 3-fold.
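The splitting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name k_fold_indices and the fixed seed are hypothetical, and the study itself does not state which tool performed the randomization.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # randomize the record order once
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 296 records, as used in this study, split for 10-fold cross validation.
for round_no, (train_idx, test_idx) in enumerate(k_fold_indices(296, 10), start=1):
    print(round_no, len(train_idx), len(test_idx))
```

Each record appears in exactly one test fold and in the training set of every other round, which is the property that distinguishes k-fold from a single train/test split.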

3.7.Generate Rules IF-THEN
This research uses the VPRS method to generate rules, using the ROSE2 software.

3.8.Performance Evaluation
Medical performance evaluation is done through classification. The evaluation analyzes the confusion matrix [19], from which accuracy, sensitivity and specificity are derived. The confusion matrix is shown in Table 2. Accuracy is the success rate of classification, measured as the number of correct classifications divided by the total number of classifications. Sensitivity is the probability that a patient who suffers from coronary heart disease is diagnosed as positive (sick), while specificity is the probability that a patient without the disease is diagnosed as negative (healthy).
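The three measures described above have standard definitions in terms of the confusion matrix entries. The sketch below assumes that "sick" is treated as the positive class, so that TP and FN count sick patients and TN and FP count healthy patients; the layout of Table 2 may label the cells differently.

```latex
\text{Accuracy}    = \frac{TP + TN}{TP + TN + FP + FN}

\text{Sensitivity} = \frac{TP}{TP + FN}

\text{Specificity} = \frac{TN}{TN + FP}
```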

Results and Discussion
This study uses 296 records taken from the Cleveland Heart Disease dataset. Data discretization is carried out with the ROSETTA software. Feature selection is done to choose the features that are relevant to the diagnosis of coronary heart disease. The feature selection and rule generation processes for the VPRS method are carried out with the ROSE2 software. The classification results are calculated manually using Microsoft Excel.

4.1.Preprocessing
The first step in preprocessing is data cleaning. Cleaning removes the records with missing values from the Cleveland dataset and then converts the multiclass dataset into a binary-class dataset, under the assumption that the positive class is healthy (0) and the negative class is sick (1).
The second step is data discretization. Discretization changes the data type of attributes from numeric to discrete. The numeric attributes in the Cleveland dataset are age, trestbps, chol, thalach, oldpeak and ca; they are transformed into discrete attributes using the Entropy/MDL algorithm. Table 3 shows the result of the discretization.
The last step in preprocessing is splitting the data into two parts. Three-quarters of the data is used for training and the remainder for testing. The training dataset is used to find rules and knowledge for diagnosing coronary heart disease, while the testing dataset is used to check the class predicted by the rules against the actual class in the dataset.
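The cleaning step at the start of this section can be sketched as follows. The sketch assumes that missing values are marked with "?" and that the target attribute is named num with values 0 to 4, as in the commonly distributed UCI file; the function name and the three sample rows are hypothetical and are not real patient records.

```python
def clean_and_binarize(rows, target="num", missing="?"):
    """Drop rows containing a missing marker and collapse the target to 0/1."""
    cleaned = []
    for row in rows:
        if any(value == missing for value in row.values()):
            continue  # discard records with any missing attribute value
        record = dict(row)
        # 0 stays 0 (healthy); any value from 1 to 4 becomes 1 (sick)
        record[target] = 0 if int(record[target]) == 0 else 1
        cleaned.append(record)
    return cleaned

# Illustrative rows only.
sample = [
    {"age": "63", "ca": "0", "thal": "6", "num": "0"},
    {"age": "67", "ca": "?", "thal": "3", "num": "2"},
    {"age": "41", "ca": "0", "thal": "3", "num": "3"},
]
print(clean_and_binarize(sample))
```

The second sample row is dropped because of its missing ca value, and the third is relabelled from 3 to 1.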

4.2.Feature Selection
The first step in the feature selection process uses the training dataset to reduce the data, which is then tested according to the features that have been selected. Table IV shows the result of feature selection with the MFS and VPRS methods [13]. This process is the same as in previous studies.

4.3.K-Fold Cross-Validation
The next step is randomizing the dataset with k-fold cross validation, which provides the test data for evaluating the performance of the VPRS method. The dataset is divided into k subsets of approximately equal size. This research uses 10-fold, 5-fold and 3-fold subsets.
For K = 10, the data is divided into 10 folds of approximately the same size, giving 10 data subsets. In each of the 10 rounds, 9 folds are used for training and 1 fold for testing, as shown in Figure 3. For K = 5, the data is divided into 5 folds of approximately the same size, giving 5 data subsets. In each of the 5 rounds, 4 folds are used for training and 1 fold for testing, as shown in Figure 4. For K = 3, the data is divided into 3 folds of approximately the same size, giving 3 data subsets. In each of the 3 rounds, 2 folds are used for training and 1 fold for testing, as shown in Figure 5.
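Because 296 is not divisible by 10, 5 or 3, the folds cannot all be exactly the same size. The short calculation below shows the resulting fold sizes, assuming the remainder is spread one record at a time over the first folds; fold_sizes is a hypothetical helper, and the exact distribution used in the study is not stated.

```python
def fold_sizes(n_records, k):
    """Sizes of k folds that differ by at most one record."""
    base, extra = divmod(n_records, k)
    return [base + 1] * extra + [base] * (k - extra)

for k in (10, 5, 3):
    sizes = fold_sizes(296, k)
    print(f"k={k}: test fold sizes {sizes}")

# k=10: six folds of 30 records and four of 29
# k=5:  one fold of 60 records and four of 59
# k=3:  two folds of 99 records and one of 98
```

In every round the training set is the remaining 296 minus the test fold, for example 266 or 267 records when k = 10.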

4.4.Generate Rule IF-THEN
The next step is generating IF-THEN rules from the datasets resulting from feature selection with MFS and VPRS, which have been randomized with k-fold cross validation.

4.5.Variable Precision Rough Set (VPRS)
To obtain IF-THEN rules (decision rules) with the VPRS method, the value β = 0.15 is used in the ROSE2 software [19]. In this work, 3 datasets resulting from the feature selection process are used. Each dataset produces different rules and a different number of rules. Tables VIII, IX, X, XI, and XII show the number of rules resulting from each dataset.

4.6.Classification
Classification is then performed based on VPRS. The resulting rules are tested on a test dataset that has been randomized by the k-fold cross validation method. The test is applied to the test dataset of each feature-selected dataset, so that a confusion matrix is obtained for each dataset.
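The testing step above can be sketched as matching each test record against a list of IF-THEN rules and tallying the confusion matrix. Everything in this sketch is an assumption made for illustration: the rule representation, the first-match strategy, the handling of records that no rule covers, and the two sample rules are hypothetical and do not reproduce the ROSE2 output or the Excel calculation used in the study.

```python
def classify(record, rules, default=None):
    """Return the decision of the first rule whose conditions all match."""
    for conditions, decision in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return decision
    return default  # no rule covers this record

def evaluate(records, rules, positive="sick", label="class"):
    """Tally TP/FN/FP/TN, counting uncovered records separately (an assumption)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0, "unmatched": 0}
    for rec in records:
        predicted = classify(rec, rules)
        if predicted is None:
            counts["unmatched"] += 1
            continue
        actual_pos = rec[label] == positive
        pred_pos = predicted == positive
        if actual_pos:
            counts["TP" if pred_pos else "FN"] += 1
        else:
            counts["FP" if pred_pos else "TN"] += 1
    return counts

def ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")

# Hypothetical rules: IF cp = asympt AND exang = yes THEN sick, and so on.
rules = [
    ({"cp": "asympt", "exang": "yes"}, "sick"),
    ({"cp": "notang"}, "healthy"),
]
test_records = [
    {"cp": "asympt", "exang": "yes", "class": "sick"},
    {"cp": "asympt", "exang": "yes", "class": "healthy"},
    {"cp": "notang", "exang": "no", "class": "healthy"},
    {"cp": "notang", "exang": "yes", "class": "sick"},
    {"cp": "angina", "exang": "no", "class": "sick"},
]

c = evaluate(test_records, rules)
accuracy = ratio(c["TP"] + c["TN"], c["TP"] + c["TN"] + c["FP"] + c["FN"])
sensitivity = ratio(c["TP"], c["TP"] + c["FN"])
specificity = ratio(c["TN"], c["TN"] + c["FP"])
print(c, accuracy, sensitivity, specificity)
```

Under k-fold cross validation this evaluation would be repeated once per fold, with the rules regenerated from the training folds each time.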

4.7.Performance Evaluation
In the medical context there are only two classes, "sick" and "healthy", of which "sick" is the more important. The purpose of medical diagnosis is to improve the accuracy on the "sick" class (sensitivity) while maintaining the accuracy on the "healthy" class (specificity). The accuracy, sensitivity and specificity values can be calculated from the confusion matrix for each method by using Equation (5). Table 5 to Table 10 show the performance evaluation results for the datasets from MFS and VPRS feature selection. The tables show that, with k-fold applied, classification performance with VPRS feature selection is better than with MFS. VPRS feature selection evaluated with k-fold cross validation produces its highest accuracy, 76.34%, at k = 5. In the k-fold randomization phase every record in the dataset serves as both testing data and training data, so all data plays a role in both rule generation and performance evaluation, although the randomization remains structured by the chosen number of folds. As a result, the diagnosis of coronary heart disease using k-fold cross validation with VPRS feature selection shows lower accuracy than diagnosis with feature selection without k-fold [11].

Conclusion
The feature selection method with VPRS, evaluated with k-fold cross validation, produces its highest accuracy of 76.34% at k = 5. This happens because in the k-fold randomization phase every record in the dataset serves as both testing and training data, so all data plays a role in rule generation and in performance evaluation, although the randomization remains structured by the chosen number of folds.
With k-fold applied, classification performance with VPRS feature selection is better than with MFS. Applying k-fold to the diagnosis of heart disease with the VPRS method still results in lower accuracy, since only 10-fold, 5-fold, and 3-fold subsets were used. The comparison from the testing process shows that the diagnosis of coronary heart disease using k-fold cross validation with VPRS feature selection has lower accuracy than diagnosis with feature selection without k-fold.