This section describes the methodology of this work, including the steps used for feature selection, model training, and model evaluation. Figure 1 depicts the proposed method.

### Datasets

The datasets leveraged in this study are summarized in Table 1. The activities performed on the datasets are discussed in the data pre-processing subsection below.

#### Dataset visualization before feature selection

For each of the five datasets, four visualization techniques were employed to gain insight into the data's structure prior to feature selection.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was used to identify and visualize outliers, which aids in understanding the data's distribution and in identifying anomalous observations. Correlation heatmaps were used to examine the relationships between features.

Boxplots were used to visualize feature distributions. Histograms and density plots give insight into the target variable's distribution across different classes or values.

In the DBSCAN plot, the visualizations reveal a lack of clear, separable clusters, which may also indicate a need to adjust the algorithm's parameters for a more meaningful clustering. The correlation heatmap provides insight into potential multicollinearity, which could influence feature selection and model performance. The boxplots, on the other hand, reveal the presence of outliers and the spread of the data. The target distribution plot shows a fairly balanced dataset. The other datasets show similar patterns, underpinning the need for the preprocessing tasks and feature selection. The visualizations for the datasets are shown in Figs. 2, 3, 4, 5, 6, 7 and 8, representing the combined, Z-AlizadehSani, Framingham, South African and Cleveland UCI datasets, respectively.

### Data pre-processing

Data pre-processing is an essential activity in most ML pipelines. It includes tasks such as cleaning, transforming, and organizing data before passing it to a model. This step improves data quality and makes the data suitable for building accurate and efficient models.

In this work, the target values indicating the existence of HD are coded as 0 or 1 (absent and present, respectively). Other categorical data fields, such as "famhist" in the SA heart dataset with values "absent" and "present", are coded as 0 for absent and 1 for present. "Male" and "Female" values for all datasets are encoded as 1 and 0, respectively. All other fields with similar values are coded with numerical values.
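The encoding described above can be sketched in a few lines. This is only an illustration: the records, the `sex` and `age` column names, and the helper `encode_binary` are hypothetical and are not taken from the study's code.

```python
# Mapping from categorical labels to the 0/1 codes described in the text.
BINARY_CODES = {
    "absent": 0, "present": 1,   # e.g. the "famhist" field
    "female": 0, "male": 1,      # sex field
}

def encode_binary(value):
    """Map a categorical label to 0/1; pass non-string values through unchanged."""
    if isinstance(value, str):
        return BINARY_CODES[value.strip().lower()]
    return value

# Hypothetical records, for illustration only.
records = [
    {"famhist": "Present", "sex": "Male", "age": 52},
    {"famhist": "Absent", "sex": "Female", "age": 47},
]
encoded = [{k: encode_binary(v) for k, v in row.items()} for row in records]
print(encoded)
```

A label that is missing from the mapping raises a `KeyError`, which makes unexpected category values visible instead of silently coding them.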

#### Data standardization and regularization

Features were standardized using the StandardScaler, which transforms each feature to have a mean of zero (0) and a standard deviation of one (1). Standardizing the features with StandardScaler is reported to yield a lower error than MinMaxScaler because it scales every feature to zero mean and unit variance, placing the features on a comparable bell-curve scale, and because it tolerates missing values, which helps to prevent overfitting^{50,51}.

A Logistic Regression model with an L2 regularization penalty is then fitted to the standardized data. The parameter "C", which controls the trade-off between fitting the data well and keeping the model simple (parsimonious), is also set in the logistic regression model. The classifier is fitted to the scaled training data and the corresponding target. The model's coefficients, which indicate each feature's significance in making predictions, are stored in a variable and applied to the training data to obtain a new, regularized training dataset, which is used for model training. Afterwards, the stored coefficients are applied to the test data to regularize it in the same way. Logistic regression with L2 regularization has proven useful in prior studies to promote sparse solutions^{52,53}. The regularized training data is passed to the whale optimization algorithm (WOA) to select the optimal features for model training and testing. The WOA is discussed in the next section.
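As a minimal illustration of the standardization step only, the sketch below computes the z-score transform in plain Python. The helper names and the sample values are hypothetical, and the statistics are assumed to be estimated on the training split and reused for the test split. The logistic-regression step is not reproduced here.

```python
import math

def fit_standardizer(column):
    """Return (mean, std) of a training column, using the population std."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(var)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std; assumes std > 0 (a non-constant feature)."""
    return [(x - mean) / std for x in column]

# Hypothetical feature values, for illustration only.
train = [120.0, 130.0, 140.0, 150.0, 160.0]
test = [135.0, 170.0]

mu, sigma = fit_standardizer(train)        # statistics come from the training data
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)      # the same statistics are reused for the test data
print(train_z, test_z)
```

Reusing the training mean and standard deviation for the test split keeps information from the test data out of the fitted transform.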

### The whale optimization algorithm (WOA)

WOA is a nature-inspired optimization algorithm, developed by Seyedali Mirjalili^{46}, that imitates the hunting techniques of humpback whales. The main components of whale optimization are position updating, encircling prey, and searching for prey. The WOA steps are discussed in this section, followed by an explanation of how they are modified for feature selection.

#### Encircling prey

Because the location of the optimal solution is initially unknown, encircling treats the current best solution as the target, or as being close to it. The remaining agents then update their locations guided by the location of the best agent. This behaviour is modelled by Eqs. (1) and (2).

$${\vec{\text{D}}} = \left| {{\vec{\text{C}}} \cdot {\vec{\text{X}}}^{*} ({\text{t}}) - {\vec{\text{X}}}({\text{t}})} \right|$$

(1)

$${\vec{\text{X}}}\left( {{\text{t}} + 1} \right) = {\vec{\text{X}}}^{*} ({\text{t}}) - {\vec{\text{A}}} \cdot {\vec{\text{D}}}$$

(2)

where \(t\) is the current iteration, \({\vec{\text{A}}}\) and \({\vec{\text{C}}}\) are coefficient vectors, \({\vec{\text{X}}}^{*}\) is the position vector of the best solution at the current iteration, \({\vec{\text{X}}}\) is the position vector of the agent being updated, and \(\left| \cdot \right|\) denotes the absolute value. \({\vec{\text{A}}}\) and \({\vec{\text{C}}}\) are determined using Eqs. (3) and (4).

$${\vec{\text{A}}} = 2{\vec{\text{a}}} \cdot {\vec{\text{r}}} - {\vec{\text{a}}}$$

(3)

$${\vec{\text{C}}} = 2 \cdot {\vec{\text{r}}}$$

(4)

where \({\vec{\text{a}}}\) decreases linearly from 2 to 0 over the iterations and \({\vec{\text{r}}}\) is a randomly generated vector with values in \([0, 1]\).
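The encircling update of Eqs. (1) to (4) can be sketched as below. This is a hedged illustration, not the study's implementation. The function names, the example positions, and the iteration schedule are hypothetical, and the random values in \({\vec{\text{A}}}\) and \({\vec{\text{C}}}\) are assumed to be drawn independently for each dimension.

```python
import random

def coefficients(a, dim, rng):
    """Eqs. (3) and (4): A = 2*a*r - a and C = 2*r, drawn per dimension."""
    A = [2 * a * rng.random() - a for _ in range(dim)]
    C = [2 * rng.random() for _ in range(dim)]
    return A, C

def encircle(x, x_best, A, C):
    """Eqs. (1) and (2): D = |C*X_best - X|, then X_new = X_best - A*D."""
    D = [abs(c * xb - xi) for c, xb, xi in zip(C, x_best, x)]
    return [xb - ai * d for xb, ai, d in zip(x_best, A, D)]

rng = random.Random(0)
a = 2 * (1 - 3 / 10)   # linear decay from 2 to 0; here iteration 3 of 10 (assumed schedule)
A, C = coefficients(a, dim=3, rng=rng)
x_new = encircle([0.5, -1.0, 2.0], [0.0, 0.0, 1.0], A, C)
print(x_new)
```

With every entry of `A` equal to zero, the agent lands exactly on the best position, which is the limiting behaviour as \({\vec{\text{a}}}\) reaches 0.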

#### Bubble-net attacking method (exploitation phase)

The bubble-net attack (exploitation) uses shrinking encircling and spiral updating to control the whales' movement in WOA. The shrinking encircling mechanism is governed by Eqs. (2) and (3): the range of vector \(A\) depends on vector \(a\) and can be computed from Eq. (3). Because \(a\) decreases from 2 to 0 over the iterations, the range of \(A\) shrinks accordingly. When \(A\) lies within [-1, 1], the agent's next location lies between its present position and the target position. The spiral position update begins by computing the distance between the whale's present position and the position of the target. Based on this distance, a spiral motion mimicking the humpback whale's swimming pattern is generated. The modelled pattern is given in Eq. (5).

$${\vec{\text{X}}}({\text{t}} + 1) = \overrightarrow {{{\text{D}}^{\prime } }} \cdot {\text{e}}^{{{\text{bl}}}} \cdot \cos (2\uppi {\text{l}}) + {\vec{\text{X}}}^{*} ({\text{t}})$$

(5)

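The constant *b* defines the shape of the logarithmic spiral, and the value of \(l\) is chosen at random from [-1, 1]. \(\overrightarrow {{{\text{D}}^{\prime } }} = \left| {{\vec{\text{X}}}^{*} ({\text{t}}) - {\vec{\text{X}}}({\text{t}})} \right|\) is the distance between the whale and the target. There is a 50% likelihood of using either the shrinking encircling mechanism or the spiral technique when attacking the prey. Equation (6) expresses this choice.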

$${\vec{\text{X}}}({\text{t}} + 1) = \left\{ {\begin{array}{*{20}l} {{\vec{\text{X}}}^{*} ({\text{t}}) - {\vec{\text{A}}} \cdot {\vec{\text{D}}}} \hfill & {{\text{if}}\,p < 0.5} \hfill \\ {\overrightarrow {{{\text{D}}^{\prime } }} \cdot {\text{e}}^{{{\text{bl}}}} \cdot \cos (2\uppi {\text{l}}) + {\vec{\text{X}}}^{*} ({\text{t}})} \hfill & {{\text{if}}\,p \ge 0.5} \hfill \\ \end{array} } \right.$$

(6)

where \(p\) is a random number between 0 and 1.
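A hedged sketch of Eqs. (5) and (6) follows. The function names are hypothetical. The spiral distance is taken as the element-wise distance between the whale and the best position, as in the standard WOA formulation, and the default `b = 1.0` is an assumption.

```python
import math
import random

def spiral_update(x, x_best, b, l):
    """Eq. (5): X_new = D' * e^(b*l) * cos(2*pi*l) + X_best, with D' = |X_best - X|."""
    factor = math.exp(b * l) * math.cos(2 * math.pi * l)
    return [abs(xb - xi) * factor + xb for xi, xb in zip(x, x_best)]

def shrink_update(x, x_best, A, C):
    """Shrinking encircling branch of Eq. (6): X_new = X_best - A * |C*X_best - X|."""
    return [xb - a_i * abs(c * xb - xi) for xi, xb, a_i, c in zip(x, x_best, A, C)]

def bubble_net_step(x, x_best, A, C, rng, b=1.0):
    """Eq. (6): pick shrinking encircling or the spiral with equal probability."""
    p = rng.random()
    if p < 0.5:
        return shrink_update(x, x_best, A, C)
    l = rng.uniform(-1.0, 1.0)
    return spiral_update(x, x_best, b, l)

rng = random.Random(1)
step = bubble_net_step([1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [1.0, 1.0], rng)
print(step)
```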

#### Search for prey

The last step of the WOA to be discussed is the search for the target. Humpback whales disperse and explore the search space at random to identify the target, a behaviour controlled by the vector \(A\) from Eq. (3). When the value of \(A\) exceeds \(1\) or falls below \(-1\), it prompts the whales to spread out and perform a random search. The goal of this phase is to integrate exploratory abilities into the WOA. During this stage, the next location of the agents is determined with respect to a randomly chosen agent, irrespective of the existing best solution. Equations (7) and (8) describe the search process.

$${\vec{\text{D}}} = \left| {{\vec{\text{C}}} \cdot {\vec{\text{X}}}_{\text{r}} - {\vec{\text{X}}}} \right|$$

(7)

$${\vec{\text{X}}}({\text{t}} + 1) = {\vec{\text{X}}}_{\text{r}} - {\vec{\text{A}}} \cdot {\vec{\text{D}}}$$

(8)

The vector \({\vec{\text{X}}}_{\text{r}}\) is the position of an agent drawn at random from the current population. WOA begins with randomly generated solutions and iteratively refines them, updating each agent's position with respect to either a randomly chosen search agent or the best-performing one. The technique is based on three key parameters: *a*, *A*, and *p*. The parameter *a*, which decreases from 2 to 0, maintains a suitable balance between exploration and exploitation. Vector *A* determines whether a random or the best search agent is used for position updating. If \(|A|\) is greater than one (1), a randomly chosen search agent is used to update the position; if it is less than one, the current best solution is used. This helps the WOA maintain a proper balance between exploring and exploiting. The parameter *p* lets the search agents switch between circular and spiral movement, increasing their adaptability.
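Putting the three phases together, the loop below is a minimal sketch of a continuous WOA on a toy objective. The sphere function, the search bounds, the default parameter values, and the use of one scalar \(A\) and \(C\) per whale are simplifying assumptions made for illustration. It is not the modified feature-selection variant used in this work.

```python
import math
import random

def sphere(x):
    """Toy objective to minimize (an assumption; not the paper's fitness)."""
    return sum(v * v for v in x)

def woa_minimize(objective, dim, n_agents=10, n_iter=50, b=1.0, seed=0):
    rng = random.Random(seed)
    whales = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_agents)]
    best = min(whales, key=objective)[:]
    for t in range(n_iter):
        a = 2 - 2 * t / n_iter                # decreases linearly from 2 towards 0
        for i, x in enumerate(whales):
            A = 2 * a * rng.random() - a      # Eq. (3), one scalar per whale for brevity
            C = 2 * rng.random()              # Eq. (4)
            p = rng.random()
            if p < 0.5:
                # |A| < 1: exploit around the best; otherwise explore around a random whale
                ref = best if abs(A) < 1 else rng.choice(whales)
                new = [r - A * abs(C * r - xi) for r, xi in zip(ref, x)]
            else:
                l = rng.uniform(-1, 1)        # spiral update, Eq. (5)
                s = math.exp(b * l) * math.cos(2 * math.pi * l)
                new = [abs(bv - xi) * s + bv for bv, xi in zip(best, x)]
            whales[i] = new
        candidate = min(whales, key=objective)
        if objective(candidate) < objective(best):
            best = candidate[:]               # keep the best position seen so far
    return best, objective(best)

best, value = woa_minimize(sphere, dim=3)
print(best, value)
```

Because the best position is replaced only when a strictly better candidate appears, the returned value is never worse than the best of the initial population.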

#### Feature selection using WOA

The proposed modified WOA determines the best features from the preprocessed data. Each whale represents a potential solution, with a binary vector encoding the presence or absence of each feature. The initial locations of the whales (indicating feature selection or non-selection) are generated randomly as binary values. The algorithm then updates each whale's position in every iteration, shrinking the search space as it proceeds. The new position is determined by combining the best position found so far with random numbers. If the new position has better fitness, it becomes the best position. The algorithm finally returns the indices of the selected features, which are considered the best features and are passed to the next step to aid the prediction process. Positions are updated based on a linearly decreasing search space coefficient and the best position over a predetermined number of iterations (10 to 100, in steps of 10) and agents (10 to 100, in steps of 10).

The WOA is first run with ten agents and ten iterations. This is repeated in increments of 10 for both agents and iterations. The fitness function computes the performance of a given subset of features by training a Logistic Regression classifier with an L1 regularization penalty on that subset. Each iteration updates the whale positions. If a new position improves fitness, the best position and fitness are updated accordingly. The feature subset that maximizes prediction performance on the validation data with the fewest features is considered optimal for that iteration. The best position, therefore, reflects the ideal subset of features found once the algorithm has run through the required number of iterations. The scaled training data and its target, the number of whales, and the number of iterations are the inputs used by WOA for feature selection. The entire process is captured in Algorithm 1.

##### Fitness function

Our study presents a novel approach to the fitness function in the whale optimization algorithm (WOA). Central to this method is integrating a logistic regression model with L1 regularization, a choice motivated by the model's inherent capacity for feature selection and sparsity^{52}.

The logistic regression model, known for its effectiveness in binary classification problems, is enhanced with L1 regularization. This regularization technique adds a penalty term proportional to the sum of the absolute values of the coefficients. The primary advantage of L1 regularization is its tendency to produce sparse solutions, inherently performing feature selection by driving the coefficients of less significant features to zero^{54,55}. This aspect is particularly beneficial in our study, where model simplicity and interpretability are paramount.

To evaluate fitness in the WOA, we adopt five-fold cross-validation. This partitions the data into five subsets, iteratively using one subset for validation and the rest for training. Such a technique validates the model on different data samples and mitigates the risk of overfitting, leading to a more reliable performance evaluation^{56}.
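The five-fold partitioning can be illustrated with the short sketch below. The function name is hypothetical, and contiguous folds without shuffling or stratification are assumed for simplicity. The splitting strategy actually used in the study is not stated.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(12, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```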

Also, while L1 regularization naturally promotes feature sparsity, our approach further penalizes the fitness score based on the number of features the logistic regression model selects. This penalty is designed to encourage the selection of the most relevant features and enhance interpretability.

By combining logistic regression with L1 regularization, k-fold cross-validation, and a feature selection penalty, the proposed fitness function serves as a robust component of the whale optimization algorithm. It adheres to the principles of parsimony and generalizability and aims to achieve high predictive accuracy while maintaining model simplicity. This approach has significant implications for applications in various domains, particularly those involving high-dimensional datasets where feature selection is critical to model performance.
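One way such a penalized fitness could be written is sketched below. The exact form of the penalty and its weight are not given in the text, so the linear penalty, the `penalty_weight` value, and the function name are all assumptions. The cross-validated scores are taken as given inputs instead of being computed here.

```python
def penalized_fitness(cv_scores, n_selected, n_total, penalty_weight=0.01):
    """Mean cross-validated score minus a penalty that grows with the number of selected features."""
    if n_selected == 0:
        return float("-inf")   # an empty feature subset cannot be scored
    mean_score = sum(cv_scores) / len(cv_scores)
    return mean_score - penalty_weight * (n_selected / n_total)

# Two hypothetical subsets with identical fold scores: the smaller one scores higher.
scores = [0.80, 0.82, 0.79, 0.81, 0.80]
small = penalized_fitness(scores, n_selected=5, n_total=13)
large = penalized_fitness(scores, n_selected=10, n_total=13)
print(small, large)
```

With equal predictive performance, the penalty breaks the tie in favour of the subset with fewer features, which is the behaviour the text describes.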

##### Transfer function

In this research, a sigmoid transfer function is defined for use in the WOA. The sigmoid function is given in Eq. (9).

$$f\left( {\text{x}} \right) = \frac{1}{{1 + {\text{e}}^{ - {\text{x}}} }}$$

(9)

The sigmoid function accepts any real-valued input and maps it into the range between 0 and 1. It converts the continuous output of the whale optimization algorithm into a binary format that indicates the inclusion or exclusion of features. By applying a threshold of 0.5 to the sigmoid output, each feature is either included, denoted as "1", or removed, denoted as "0".
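The thresholding step can be sketched as follows, using the standard sigmoid \(1/(1 + e^{-x})\). The function names and the example position are hypothetical, and treating an output of exactly 0.5 as "selected" is an assumption, since the text does not say how ties are handled.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(position, threshold=0.5):
    """Turn a continuous whale position into a 0/1 feature mask."""
    return [1 if sigmoid(v) >= threshold else 0 for v in position]

mask = binarize([-2.0, 0.3, 1.5, -0.1])
selected = [i for i, bit in enumerate(mask) if bit == 1]
print(mask, selected)
```

With a threshold of 0.5, a feature is kept exactly when its position component is non-negative, because the sigmoid crosses 0.5 at zero.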

##### Optimal features selected

WOA is a metaheuristic algorithm with stochastic characteristics; hence, it generates marginally different feature sets from run to run on the same dataset. WOA was therefore executed ten (10) times for every dataset. The selection frequency of each feature was determined by tallying its occurrences across the ten separate runs, indicating the prominence of each feature. The features that occur most often across the runs are considered potential optimal features due to their higher consistency and relevance. To determine the final set of optimal features, a threshold value of 80% was established. This threshold ensures that only those features that appear in at least 80% of the experimental runs (a minimum of 8 out of 10 runs) are selected. The optimal features chosen are then evaluated by executing the classifiers on them, and their performance is documented for further analysis. The findings offer valuable insight into the effects of feature selection on model performance. With this methodology, the experiment aims to mitigate the stochastic nature inherent in the whale optimization algorithm (WOA) and produce a resilient collection of optimal features for diagnosing heart disease risk. Table 2 outlines the various datasets and the optimal features identified and selected.
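The frequency-based selection across runs can be sketched as below. The feature names and the contents of each run are hypothetical placeholders, not results from the study. Only the 80% rule comes from the text.

```python
from collections import Counter

def stable_features(runs, min_fraction=0.8):
    """Keep features that were selected in at least `min_fraction` of the runs."""
    counts = Counter(f for run in runs for f in set(run))
    return sorted(f for f, c in counts.items() if c / len(runs) >= min_fraction)

# Ten hypothetical WOA runs: "age" is selected 10 times, "cp" 8, "thalach" 7, "chol" 3.
runs = []
for i in range(10):
    run = ["age"]
    if i < 8:
        run.append("cp")
    if i < 7:
        run.append("thalach")
    if i < 3:
        run.append("chol")
    runs.append(run)

print(stable_features(runs))
```

Here only the features selected in at least 8 of the 10 runs pass the threshold.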

##### Visualization after feature selection

The DBSCAN Outlier Detection scatter plots across the datasets illustrate the DBSCAN algorithm's ability to identify clusters and outliers. The clusters appear more distinct in the selected features compared to the broader spread seen in the full feature set. This suggests that feature selection has potentially removed noisy variables, allowing for more apparent patterns to emerge.

The Correlation Heatmaps provide a detailed view of the interdependencies between features. After feature selection, the heatmaps are generally less cluttered, with fewer variables exhibiting strong correlations. This reduction in multicollinearity can benefit many machine learning models, as it tends to enhance model interpretability and performance.

The boxplots also highlight the distribution and variance of each feature within the datasets. Post-feature selection, the plots are fewer but more focused, often showing a reduced number of extreme outliers. This indicates that feature selection has likely discarded features with extreme values that could skew the model's learning process.

The shapes of the target distributions are generally consistent before and after feature selection, signifying that the selected features maintain the original structure of the target variable. However, in some cases, the distribution appears more balanced post-feature selection, which may positively influence model performance, especially in classification tasks.

In all, the visualization suggests that feature selection has streamlined the datasets, potentially improving the efficiency and efficacy of subsequent analyses. By focusing on the most informative features, we expect that the chosen subsets will provide clearer, more relevant insights and facilitate the development of more robust predictive models. Figures 9, 10, 11, 12, 13 and 14 present a visual representation of the various features after performing feature selection.

### Training and testing models

After selecting the optimal features, the data is passed to the prediction models for training and testing. The models used are Support Vector Machine, Decision Tree, Random Forest, Multi-Layer Perceptron, Recurrent Neural Network, Adaptive Boosting, Long Short-Term Memory, Extreme Gradient Boosting, K-Nearest Neighbors, and Naïve Bayes classifiers. The prediction models span classical ML classifiers, ensemble classifiers, and deep learning classifiers to give a broader perspective on performance.

#### Model evaluation metrics

The models are evaluated using accuracy, precision, recall, AUC and F1 score. The evaluation metrics utilized and the reasons for the choice are discussed next.

##### Accuracy

Accuracy measures the frequency with which a model correctly predicts outcomes. This metric provides a simple and intuitive understanding of how well the model performs regarding correct classifications^{57,58}, making it a valuable choice for this work. The formula for calculating accuracy is provided in Eq. (10).

$${\text{Accuracy}} = \frac{{({\text{TP}} + {\text{TN}})}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$

(10)

##### Precision

Precision is the proportion of predicted positive cases that are truly positive. It primarily focuses on minimizing false positives to avoid unnecessary stress and medical interventions for patients^{59}. High precision indicates that the model accurately identifies heart disease cases, facilitating informed clinical decision-making^{60}. This is crucial in building trust in the model's diagnostic capabilities, hence its usage in our work. The precision formula is given in Eq. (11).

$${\text{precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$

(11)

##### Recall

Recall, also called sensitivity, is the ratio of correctly classified positive samples to all samples that actually belong to the positive class^{11}. It shows the percentage of positive samples that are correctly classified. Given that a high recall is achieved by missing as few positive instances as possible, this metric is also thought to be among the most crucial for medical research. It is given by Eq. (12).

$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$

(12)

##### AUC

The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) is a metric that measures the model's ability to distinguish between the negative and positive classes^{61}. It is calculated based on the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds^{62}. Computed using Eq. (13), a higher AUC value indicates better overall model performance, signifying a high rate of correctly identified positive cases (sensitivity) and a low rate of false positives.

$${\text{AUC}} = \frac{{\left( {1 + {\text{TPR}} - {\text{FPR}}} \right)}}{2}$$

(13)

##### F1 score

Simple accuracy can be misleading, as a model could appear highly accurate merely by predominantly predicting the majority class^{63,64}. To avoid this issue, the F1 score is used. The F1 score is the harmonic mean of precision and recall^{65} and is defined by Eq. (14).

$${\text{F1}} = 2 \cdot \frac{{\left( {{\text{Precision}}*{\text{Recall}}} \right)}}{{\left( {{\text{Precision}} + {\text{Recall}}} \right)}}$$

(14)
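As a worked illustration of Eqs. (10) to (14), the sketch below computes the five metrics from confusion-matrix counts. The counts are hypothetical, and the function assumes all denominators are non-zero. The AUC line follows the single-threshold form \((1 + \text{TPR} - \text{FPR})/2\), which equals the average of sensitivity and specificity at that threshold.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts (non-zero denominators assumed)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (10)
    precision = tp / (tp + fp)                           # Eq. (11)
    recall = tp / (tp + fn)                              # Eq. (12), also the TPR
    fpr = fp / (fp + tn)                                 # false positive rate
    auc = (1 + recall - fpr) / 2                         # Eq. (13), single operating point
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (14)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "auc": auc, "f1": f1}

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```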

#### Model hyperparameters

Table 3 lists the optimal hyperparameters for each of the ten models utilized in this work.

### Experimental setup

In this study, we conducted experiments on a Dell Latitude 5430 laptop with a 12th Gen Intel(R) Core(TM) i5-1245U CPU, 16 GB of RAM, and 12 logical processors running Windows 10.

### Ethical and informed consent

All ethical and informed consent for data use has been taken care of by the data providers.
