
The research establishes a framework for precise heart disease prediction by integrating advanced machine learning methods with feature selection and dimensionality reduction. The model uses ensemble deep learning and feature fusion to discern intricate patterns in patient data, supporting accurate and timely prognosis of cardiac disease and furnishing healthcare practitioners with a practical tool for improving patient treatment outcomes.

The hybrid system comprises three stages: data gathering, pre-processing, and classification, each refining the predictive model. During pre-processing, missing data are imputed: the ML-HDPM method estimates absent values within the database. Feature selection then identifies the attributes most pertinent to predictive modeling, using a hybrid technique that combines a genetic algorithm (GA) with the recursive feature elimination method (RFEM) to isolate the informative features needed for precise predictions.

Additionally, to give all features equal influence, a standard scaler rescales every feature to a mean of 0 and a standard deviation of 1, mitigating biases that stem from differences in feature scale. The database distinguishes two classes: class 0, the absence of heart disease, and class 1, its presence. Specifically, 164 cases belong to class 0 and 139 to class 1. To mitigate this class imbalance, the synthetic minority over-sampling technique (SMOTE) is deployed, ensuring equitable representation of both classes and strengthening the model's predictions across diverse scenarios.
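As an illustration, the zero-mean, unit-variance rescaling described above corresponds to standardization as implemented, for example, by scikit-learn's StandardScaler. The values and column meanings in this sketch are made up, not the study's data:

```python
# Minimal sketch: zero-mean, unit-variance scaling with scikit-learn.
# `X` is an illustrative feature matrix; the column meanings are hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[63.0, 233.0, 145.0],   # e.g. age, cholesterol, resting blood pressure
              [41.0, 204.0, 130.0],
              [57.0, 354.0, 140.0]])

scaler = StandardScaler()              # fits per-feature mean and standard deviation
X_scaled = scaler.fit_transform(X)     # each column now has mean 0, std 1

print(X_scaled.mean(axis=0))           # ~[0, 0, 0]
print(X_scaled.std(axis=0))            # ~[1, 1, 1]
```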

Within the research framework, the synthetic minority over-sampling technique (SMOTE) systematically redresses the class imbalance between class 0 (absence of heart disease) and class 1 (presence of heart disease). SMOTE generates synthetic samples for the minority class, augmenting its representation within the dataset: it selects individual minority instances and creates synthetic instances along the line segments connecting them to their nearest neighbors. By interpolating between existing instances, SMOTE increases the diversity of the minority class without introducing bias, producing a balanced representation of both classes and strengthening the model's performance across the scenarios encountered in clinical practice.
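A minimal sketch of this over-sampling step, assuming the class counts given earlier (164 versus 139) and using the imbalanced-learn implementation of SMOTE; the feature matrix here is a random stand-in:

```python
# Hedged sketch of SMOTE with imbalanced-learn, assuming 164 class-0 and
# 139 class-1 records as stated in the text. X is synthetic placeholder data.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))            # 303 illustrative records, 13 features
y = np.array([0] * 164 + [1] * 139)       # absence vs. presence of heart disease

smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

# Each synthetic minority point lies on the segment x + u * (neighbor - x), u in [0, 1].
print(np.bincount(y_res))                 # [164 164] -> balanced classes
```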

With both classes equitably represented, the model learns from a more diverse range of examples and captures the patterns associated with both the presence and absence of heart disease, improving its diagnostic precision and reliability. By removing the effects of class imbalance, SMOTE also improves the model's generalizability and robustness, supporting consistent performance across diverse patient populations and clinical settings. Overall, integrating SMOTE into the proposed model strengthens its predictive capabilities and gives healthcare practitioners a more effective tool for heart disease diagnosis and prognosis.

The methodology generates artificial instances of the underrepresented category, achieving a balanced distribution across both classes. Classification then applies various classifiers to the chosen characteristics: support vector machine (SVM) [21], principal component analysis (PCA) [22], linear discriminant analysis (LDA) [23], naïve Bayes (NB) [24], decision tree (DT) [25], and random forest (RF) [26]. Ultimately, the classifier predicts the presence or absence of heart disease in an individual. The mechanism of the suggested heart disease forecasting approach is shown in Fig. 1.
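For orientation, a hedged scikit-learn sketch of how such a classifier comparison might be run; the synthetic data and cross-validation setup are illustrative, not the study's protocol (PCA, listed above, is ordinarily a dimensionality-reduction step rather than a classifier, so it is omitted here):

```python
# Illustrative comparison of the classifiers named in the text using scikit-learn.
# The synthetic data stands in for the heart-disease features.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)

models = {
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy
    print(f"{name}: {scores.mean():.3f}")
```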

Figure 1

Workflow of the proposed method.

Pre-processing

This first phase is the primary step of the diagnostic procedure and consists of three stages: substituting absent values, eliminating duplications, and segregation. A characteristic's missing value is substituted after examining the patient's age category, cholesterol level, and blood pressure; if most of a patient's attribute values resemble another's, the corresponding value fills the same spot. The redundancy reduction step decreases the quantity of data by eliminating redundant or uninformative attributes. Patients are then categorized by the form of chest pain they exhibit: (1) classic angina, (2) atypical angina, (3) non-anginal pain, and (4) asymptomatic.
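A hedged pandas sketch of these three pre-processing stages; the file name and column names (age, chol, cp) follow the common UCI heart-disease convention and are assumptions, not the paper's schema:

```python
# Sketch of the pre-processing stages, under assumed column names.
import pandas as pd

df = pd.read_csv("heart.csv")   # hypothetical input file

# 1) Impute a missing value from patients with a similar profile: here,
#    the median cholesterol within the same 10-year age band.
df["age_band"] = (df["age"] // 10) * 10
df["chol"] = df.groupby("age_band")["chol"].transform(lambda s: s.fillna(s.median()))

# 2) Remove duplicated records and constant (uninformative) columns.
df = df.drop_duplicates()
df = df.loc[:, df.nunique() > 1]

# 3) Segregate patients by chest-pain type: 1 = classic angina,
#    2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic.
groups = dict(tuple(df.groupby("cp")))
```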

Feature selection

The GA is founded on the principles of natural selection. The algorithm produces many candidate solutions within a single generation; each solution is referred to as a genome, and the set of solutions in one generation is called the population. The algorithm iterates across generations to improve the result: in each iteration, genetic operators are applied to genomes from the previous generation to produce the next one. Selection, crossover, and mutation are the genetic operators most often used.

The selection agent chooses the fittest individuals from each generation, with a fitness function assessing each individual's fitness relative to the rest of the population. The genomes chosen by the selection agent enter the mating pool, contributing to the formation of the following generation. The crossover agent combines individuals from the mating pool to generate improved offspring; variants include single-point, two-point, and multipoint crossover. Without such mixing, individuals in the following generation would closely resemble the preceding population. Diversity is introduced by a mutation operator, which applies random alterations to individuals. The development process of the genetic algorithm is shown in Fig. 2.
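The following minimal sketch illustrates these operators on binary feature-selection genomes, using the population sizes and rates quoted after Fig. 2; the cross-validated-accuracy fitness function is an illustrative stand-in for the paper's mean-square-error objective:

```python
# Hedged GA sketch for feature selection: selection, single-point crossover,
# and mutation over binary feature masks. Fitness here is CV accuracy of a
# logistic regression, an assumption standing in for the paper's MSE objective.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of the feature subset encoded by `mask`."""
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

def evolve(X, y, n_features, pop=80, elites=40, gens=18, mut=0.05):
    genomes = rng.integers(0, 2, size=(pop, n_features))   # one genome per candidate subset
    for _ in range(gens):
        scores = np.array([fitness(g, X, y) for g in genomes])
        order = np.argsort(scores)[::-1]
        parents = genomes[order[:elites]]        # selection: fittest genomes enter the mating pool
        children = []
        while len(children) < pop - elites:
            i, j = rng.integers(elites, size=2)
            point = rng.integers(1, n_features)  # single-point crossover
            child = np.concatenate([parents[i][:point], parents[j][point:]])
            flips = rng.random(n_features) < mut # mutation introduces diversity
            child[flips] ^= 1
            children.append(child)
        genomes = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(g, X, y) for g in genomes])
    return genomes[np.argmax(scores)]            # best feature mask found
```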

Figure 2
Development process of the genetic algorithm.

A total of 18 generations were used, with 80 genomes in each. At each generation, the 40 fittest genomes were selected and transmitted to the next generation, and a further 40 genomes were picked at random. A crossover rate of 5% and a mutation rate of 0.05 were used. The user's data is negated. Genome fitness was determined using the mean square error of the genome as the objective function. The RFEM is an iterative process that recursively eliminates unnecessary characteristics: the classifier is trained on the provided training data, feature importance is then calculated, the least influential features are eliminated, and the model is retrained on the remaining subset. The procedure repeats until the required number of features is reached; that number is supplied to the method as a parameter. The functionality of the RFEM is shown in Fig. 3.
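The iterative retrain-and-drop loop just described matches scikit-learn's RFE; a brief sketch under assumed data and an assumed target feature count:

```python
# Sketch of recursive feature elimination with scikit-learn's RFE, mirroring
# the retrain-and-drop procedure described in the text. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# n_features_to_select is the "number of characteristics to be maintained".
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=8,    # illustrative target count
          step=1)                    # drop the least important feature each round
rfe.fit(X, y)
print(rfe.support_)                  # boolean mask of retained features
print(rfe.ranking_)                  # 1 = kept; larger values were eliminated earlier
```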

Figure 3
Functionality of the RFEM method.

Heart feature extraction

The investigation extracts diverse attributes from the medical data obtained via healthcare devices. The database comprises several physiological measurements, including heart rate, arterial pressure, and blood sugar level. Accurately determining the presence of cardiac disease requires extracting key characteristics, including statistical and temporal features. The equations below define the extracted features.

The maximum voltage, total harmonic distortion (THD), heart rate, zero-crossing rate (ZCR), entropy (Ent), energy (Eg), and standard deviation (SD) are expressed in Eqs. (1)–(7).

$${V}_{max}={\text{max}}\left\{V\right\}$$

(1)

$$THD=\frac{\sum_{i=0}^{n-1}{HC}_{i}}{{P}_{ff}}$$

(2)

$${H}_{r}=\frac{60}{{RR}_{t}}$$

(3)

$$ZCR=\frac{{S}_{c}}{{S}_{c}+{S}_{nc}}$$

(4)

$$Ent=\sum_{x=0}^{m-1}\sum_{y=0}^{n-1}\frac{{P}_{xy}}{{\text{log}}\left({P}_{xy}\right)}$$

(5)

$$Eg=\sum_{x=0}^{m-1}\sum_{y=0}^{n-1}{\left({P}_{xy}\right)}^{2}$$

(6)

$$SD=\frac{1}{M}\sqrt{\sum_{x=1}^{M-1}{\left({RR}_{x}-{RR}_{x-1}-k\right)}^{2}}$$

(7)

The voltage is denoted \(V\), the harmonic component \(HC\), and the probability of the frequency filter \({P}_{ff}\); the signal-change and no-change counts are expressed as \({S}_{c}\) and \({S}_{nc}\). The probability is expressed as \({P}_{xy}\), and the deviation is denoted \(k\). The collected data contain intervals referred to as RR. The variation is expressed in Eq. (8).

$$k=\frac{\sum_{x=1}^{M-1}\left({RR}_{x}-{RR}_{x-1}\right)}{M-1}$$

(8)

The variable M represents the total count of RR intervals in the database. The root mean square of the successive differences is expressed in Eq. (9).

$$R=\sqrt{\frac{\sum_{x=1}^{M-1}{\left({RR}_{x}-{RR}_{x-1}\right)}^{2}}{M-1}}$$

(9)

\({RR}_{x}\) refers to the integrated heartbeat, k represents the mean successive difference, and \({P}_{xy}\) signifies the probability value. These features are derived from the medical device data and retained as predictors of heart disease and of alterations in the heart pattern. The ensuing section explains the procedure for identifying heart disease.
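A hedged numpy sketch of Eqs. (1)–(9) on synthetic signal and RR-interval data; the entropy line uses the standard Shannon form as a stand-in for Eq. (5), and all inputs are placeholders:

```python
# Illustrative computation of the statistical/temporal features above.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)            # stand-in ECG voltage samples
rr = rng.normal(0.8, 0.05, size=100)      # stand-in RR intervals (seconds)

v_max = signal.max()                                        # Eq. (1): maximum voltage
heart_rate = 60.0 / rr.mean()                               # Eq. (3): beats per minute
zcr = np.sum(np.diff(np.sign(signal)) != 0) / len(signal)   # Eq. (4): zero-crossing rate

p, _ = np.histogram(signal, bins=32)
p = p[p > 0] / p.sum()                     # empirical probabilities P_xy
entropy = -np.sum(p * np.log(p))           # Shannon counterpart of Eq. (5)
energy = np.sum(p ** 2)                    # Eq. (6): energy of the distribution

diffs = np.diff(rr)                                    # successive RR differences
k = diffs.mean()                                       # Eq. (8): mean successive difference
sd = np.sqrt(np.sum((diffs - k) ** 2)) / len(rr)       # Eq. (7): standard deviation form
rmssd = np.sqrt(np.mean(diffs ** 2))                   # Eq. (9): RMSSD
```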

Cluster-based over-sampled method

A novel approach using resampling and clustering methodologies handles the unbalanced stroke data. Resampling includes both under-sampling and over-sampling: under-sampling selectively removes data points from the majority class, while over-sampling augments the representation of minority specimens. Clustering groups the data so that specimens within each group exhibit similar qualities.

Figure 4 illustrates the primary operations of the developed system. The method first under-samples the specimens with majority labels, and clustering is performed on the subset of minority labels within the overall data set. The specimen set produced by under-sampling is divided into distinct training, validation, and testing sets in an 8:1:1 ratio. The resulting training sets are merged, and the validation and testing sets are merged likewise. Next, the SMOTE algorithm over-samples the minority-label specimens within the merged training set, yielding the final training data [27]. In this final training set, the number of positive specimens is approximately equal to the number of negative specimens, and every category of positive specimen is represented. The training database thus offers abundant examples for training machine learning algorithms to extract characteristics, and the method yields balanced training, validation, and testing sets.

Figure 4

Cluster-based over-sampled method.

Figure 5 illustrates the UCOM process for heart disease prediction. The UCOM algorithm comprises three distinct parts. Because the number of individuals without heart attacks is much larger than the number with heart attacks, the set of individuals without heart attacks is under-sampled; random under-sampling selected 120 specimens, a count at which the prediction outcome performed best. A clustering operation is performed on the heart attack specimens so that distinct characteristics from each subgroup of nearby data are included in training. To balance the specimens with and without heart attacks during training, over-sampling is applied to the heart attack specimens.

  • (1) Under-sampling is applied to the majority class of negative specimens, namely stroke cases without heart attack. These specimens were picked randomly to create three subsets: a training subset, a validation subset, and a testing subset. The UCOM technique parameterizes this random specimen selection according to the characteristics of the data collection and of the minority-label specimen set. In the MIMIC-III database, the ratio of specimens with heart attack to specimens without is 2:3, which sets the under-sampling percentage [28]: two-thirds of the stroke specimens, excluding those with heart attacks, are selected at random.

  • (2) Clustering is performed on the specimen set with minority labels, namely the stroke-with-heart-attack data.

    Figure 5

    UCOM process for heart disease prediction.

    The research used a clustering algorithm to aggregate the minority specimens, namely stroke specimens with cardiac arrest, into several clusters. Each cluster is then divided into training, validation, and testing sets in an 8:1:1 ratio. The number of clusters K is chosen by examining the sum of squared errors (SSE), computed as in Eq. (10).

    $$E=\sum_{x=1}^{K}\sum_{i\in {C}_{x}}{\left|i-{m}_{x}\right|}^{2}$$

    (10)

    \({C}_{x}\) represents the x-th cluster, \(i\) denotes a specimen inside \({C}_{x}\), and \({m}_{x}\) refers to the centroid of \({C}_{x}\).

  • (3) The process of dividing and combining clusters.

    The under-sampled specimens and each cluster are divided into training, validation, and testing sets using an 8:1:1 ratio, and these sets are combined to create new training, validation, and testing sets.

  • (4) Class imbalance in the training set is then addressed by applying SMOTE to over-sample the pseudo training set; a sketch of the full pipeline appears after this list.

    SMOTE is a sampling technique designed to address imbalanced databases by generating synthetic specimens for the minority class, improving on the random over-sampling approach. The pseudo training set is fed into the SMOTE algorithm, which randomly selects minority-label specimens and identifies their closest neighbors, then produces additional minority-label specimens between each centering specimen and its neighbors. The algorithm continues until the count of minority-label samples equals the count of majority-label samples, at which point it produces the updated training set.
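A hedged end-to-end sketch of steps (1)–(4), under assumed set sizes, an assumed K of 3, and scikit-learn/imbalanced-learn in place of the paper's implementations:

```python
# Illustrative UCOM-style pipeline: under-sample the majority class, cluster
# the minority class, split 8:1:1 per set, merge, then SMOTE the training set.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(600, 10))   # majority: stroke without heart attack
X_min = rng.normal(1.0, 1.0, size=(90, 10))    # minority: stroke with heart attack

# (1) Randomly under-sample the majority class (keep two-thirds, as in the text).
keep = rng.choice(len(X_maj), size=len(X_maj) * 2 // 3, replace=False)
X_maj = X_maj[keep]

def split_811(X):
    """Shuffle one specimen set and split it 8:1:1 into train/validation/test."""
    idx = rng.permutation(len(X))
    n_tr, n_va = int(0.8 * len(X)), int(0.1 * len(X))
    return X[idx[:n_tr]], X[idx[n_tr:n_tr + n_va]], X[idx[n_tr + n_va:]]

# (2) Cluster the minority class so every subgroup reaches the training set.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_min)

# (3) Split the under-sampled majority set and each minority cluster, then merge.
parts = [split_811(X_maj)] + [split_811(X_min[labels == c]) for c in range(3)]
X_tr = np.vstack([p[0] for p in parts])
y_tr = np.concatenate([np.zeros(len(parts[0][0]), dtype=int)] +
                      [np.ones(len(p[0]), dtype=int) for p in parts[1:]])

# (4) SMOTE the merged (pseudo) training set until the classes balance.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print(np.bincount(y_bal))    # equal counts for both labels
```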

Classification method

The MLDCNN is a deep-learning neural network whose weight values are optimized with the AEHOM. After feature selection, the selected features are classified by the MLDCNN, which receives each chosen feature as input. Weights are initially assigned at random to each input. Each hidden node in the next hidden layer computes the sum of the input values multiplied by the weight vector of the input nodes connected to it. Random initial weights have been shown to help the backpropagation process reach the desired outcome, and the optimization proceeds in this manner. The activation function is then applied, and the layer's output is passed to the subsequent layer. Because the weights strongly influence the classifier's behavior, the classification process in the MLDCNN follows the algorithmic steps outlined below.

Step 1: Eqs. (11) and (12) represent the feature values and their corresponding weights.

$${F}_{x}=\left\{{F}_{1},{F}_{2},\cdots ,{F}_{m}\right\}$$

(11)

$${W}_{x}=\left\{{W}_{1},{W}_{2},\cdots ,{W}_{m}\right\}$$

(12)

The symbol \({F}_{x}\) represents the input, a selection of m features denoted \({F}_{1},{F}_{2},\cdots ,{F}_{m}\). Similarly, \({W}_{x}\) represents the weights of \({F}_{x}\), with the corresponding weights for the m features denoted \({W}_{1},{W}_{2},\cdots ,{W}_{m}\).

Step 2: The inputs are multiplied by weight vectors that have been randomly selected, and the resulting products are then summed using Eq. (13).

$$M={F}_{1}{W}_{1}+{F}_{2}{W}_{2}+\cdots +{F}_{m}{W}_{m}$$

(13)

The input and weight are denoted \(F\) and \(W\), and \(M\) represents the aggregate value.

Step 3: The activation function is expressed in Eq. (14) and the classification function in Eq. (15).

$${A}_{{f}_{x}}={C}_{x}\left\{\sum_{x=0}^{m-1}{F}_{x}{W}_{x}\right\}$$

(14)

$${C}_{x}={\text{exp}}\left(-{\left({F}_{x}\right)}^{2}\right)$$

(15)

The symbol \({A}_{{f}_{x}}\) represents the activation operation, whereas \({C}_{x}\) is the exponential of \(-{\left({F}_{x}\right)}^{2}\). The suggested system employs a Gaussian function as its activation function.

Step 4: The hidden-layer output is evaluated using Eq. (16).

$${Y}_{x}={B}_{x}+{C}_{1}{W}_{1}+{C}_{2}{W}_{2}+\cdots +{C}_{m}{W}_{m}$$

(16)

The bias value is denoted \({B}_{x}\), the weights between the input and hidden layers are denoted \({W}_{x}\), and the classification function is denoted \({C}_{x}\).

Step 5: The three steps above are executed for every layer of the MLDCNN. The output unit is assessed by aggregating the weighted signals of every input to the output layer, as expressed in Eq. (17).

$${R}_{x}={B}_{x}+{O}_{1}{W}_{1}+{O}_{2}{W}_{2}+\cdots +{O}_{m}{W}_{m}$$

(17)

The variable \({O}_{x}\) represents the values of the layer preceding the output layer, \({W}_{x}\) the hidden-layer weights, and \({R}_{x}\) the output value.

Step 6: The output generated by the network is compared with the desired target value; the disparity between the two, often called the error signal, is expressed in Eq. (18).

$${E}_{x}={D}_{x}-{R}_{x}$$

(18)

The error signal, denoted \({E}_{x}\), represents the discrepancy between the actual output and the desired output \({D}_{x}\). The output value is compared with the goal value and the associated error identified; the error at the output is then propagated back to all other units in the network by computing a value \({k}_{x}\) from this error, as denoted in Eq. (19).

$${k}_{x}={E}_{x}\left(f\left({R}_{x}\right)\right)$$

(19)

Here \({R}_{x}\) denotes the output value and \({E}_{x}\) the error signal.

Step 7: Weight adjustment is achieved using the backpropagation technique, via the relation in Eq. (20).

$${W}_{{c}_{x}}=\alpha {k}_{x}\left({F}_{x}\right)$$

(20)

The weight adjustment is denoted by \({W}_{{c}_{x}}\), the momentum is represented by \(\alpha\), and \({k}_{x}\) is the error propagated through the network. The weight values are adjusted using the AEHOM technique.
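A loose numpy sketch of Steps 1–7 (Eqs. (11)–(20)); the layer sizes, the target value, and the momentum factor \(\alpha\) are assumptions, and the AEHOM weight optimization described next is omitted:

```python
# Illustrative single forward pass and weight update following Eqs. (11)-(20).
# This is a sketch of the step equations, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def gaussian(z):
    return np.exp(-z ** 2)          # Eq. (15): Gaussian activation form

F = rng.normal(size=5)              # Eq. (11): selected feature values
W1 = rng.normal(size=(5, 4))        # Eq. (12): randomly initialized weights
W2 = rng.normal(size=4)             # hidden-to-output weights (assumed)
B = rng.normal(size=4)              # bias values

M = F @ W1                          # Eq. (13): weighted sums of the inputs
C = gaussian(M)                     # Eq. (14): activation of the hidden layer
Y = B + C * W2                      # Eq. (16): hidden-layer output with bias
R = Y.sum()                         # Eq. (17): aggregated output unit

D = 1.0                             # desired target value (assumed)
E = D - R                           # Eq. (18): error signal
k = E * gaussian(R)                 # Eq. (19): back-propagated error term
alpha = 0.01                        # momentum / learning factor (assumed)
W1 += alpha * k * F[:, None]        # Eq. (20)-style adjustment, W_c = alpha * k * F
```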

AEHOM

The AEHOM framework is based on the notion that the elephant population is divided into several clans, each with a distinct number of elephants. Typically, male elephants separate from their clans and live solitarily. Each clan is led by its oldest female elephant, the matriarch. In a herd of elephants, the matriarchs represent the most favorable solution, whereas the departing male elephants represent the least favorable alternative. Every individual v belonging to a particular clan \({E}_{n}\) is directed by the matriarch, which has the highest fitness value within the given generation. This procedure is represented by Eq. (21).

$${P}_{x+1,{E}_{n,v}}={P}_{{E}_{n,v}}+c\left({P}_{x,{E}_{n}}-{P}_{{E}_{n},v}\right){R}_{D}$$

(21)

\({P}_{x+1,{E}_{n,v}}\) denotes the updated position of elephant v in clan \({E}_{n}\); \({P}_{{E}_{n},v}\) is that elephant's current position, and \({P}_{x,{E}_{n}}\) is the matriarch's position. The variable c, which belongs to the interval [0, 1], determines the matriarch's influence in the procedure, and \({R}_{D}\) is a random number employed to enhance population diversity during the method's later phases. The position of the optimal elephant (the matriarch) of the clan is revised via Eq. (22).

$${P}_{x+1,{E}_{n,v}}=l\left({P}_{c,{E}_{n}}\right)$$

(22)

The symbol l represents a parameter within the range [0, 1] that determines how the clan centre \({P}_{c,{E}_{n}}\) influences the update; the centre is defined in Eq. (23).

$${P}_{c,{E}_{n}}=\frac{1}{{V}_{{E}_{n}}}\sum_{y=1}^{{V}_{{E}_{n}}}{b}_{{E}_{n},y,l}$$

(23)

The variable \({b}_{{E}_{n},y,l}\) represents the y-th elephant's position in dimension l, where \(1\le l\le L\) and L is the total dimensionality of the search space. \({V}_{{E}_{n}}\) symbolizes the number of elephants in clan \({E}_{n}\).

Male elephants that separate from their social group perform exploration. The subset of elephants with the lowest fitness values within each clan is assigned new locations, as Eq. (24) shows.

$${P}_{w,{E}_{n}}={P}_{min}+\left({P}_{max}-{P}_{min}\right)k$$

(24)

Here \({P}_{min}\) represents the lower limit of the search space, \({P}_{max}\) the upper bound, and k, constrained to the interval [0, 1], is a stochastic variable drawn from a uniform probability distribution.

After the elephants' placements are evaluated, mutation and crossover procedures are applied to enhance the optimization. A two-point crossover is used: two sites are selected on the parental genomes, and the genetic material between these two locations is exchanged between the parents to obtain the offspring's genome. These points are evaluated in Eqs. (25) and (26).

$${x}_{1}=\frac{{P}_{x+1,{E}_{n}}}{3}$$

(25)

$${x}_{2}={x}_{1}+\frac{{P}_{x+1,{E}_{n}}}{2}$$

(26)

The mutation process entails exchanging a certain number of genes from each genome \(({P}_{x+1,{E}_{n}})\) with novel genes, which are intentionally introduced into the genome without duplication. The procedure is iterated until a solution with higher fitness is achieved.
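A hedged numpy sketch of the herding updates in Eqs. (21)–(24) together with the two-point crossover of Eqs. (25)–(26); the toy objective, bounds, clan sizes, and coefficients are all illustrative assumptions:

```python
# Illustrative elephant-herding updates: clan-follow, matriarch, and
# worst-elephant rules, plus a two-point crossover on the fittest pair.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLANS, PER_CLAN = 8, 3, 10
P_MIN, P_MAX = -1.0, 1.0
c_factor, l_factor = 0.5, 0.9       # c and l from Eqs. (21)-(22), both in [0, 1]

def fitness(p):
    return -np.sum(p ** 2)          # toy objective: higher is fitter

pop = rng.uniform(P_MIN, P_MAX, size=(CLANS, PER_CLAN, DIM))

for _ in range(50):
    for n in range(CLANS):
        clan = pop[n]
        scores = np.array([fitness(p) for p in clan])
        best, worst = np.argmax(scores), np.argmin(scores)
        matriarch = clan[best].copy()
        center = clan.mean(axis=0)                            # Eq. (23): clan centre
        for v in range(PER_CLAN):
            r = rng.random(DIM)                               # R_D, diversity term
            clan[v] += c_factor * (matriarch - clan[v]) * r   # Eq. (21)
        clan[best] = l_factor * center                        # Eq. (22): matriarch update
        clan[worst] = P_MIN + (P_MAX - P_MIN) * rng.random(DIM)  # Eq. (24)

# Two-point crossover, Eqs. (25)-(26), between the two fittest elephants.
flat = pop.reshape(-1, DIM)
order = np.argsort([fitness(p) for p in flat])
a, b = flat[order[-1]], flat[order[-2]]
x1 = DIM // 3                        # first cut point, Eq. (25)
x2 = x1 + DIM // 2                   # second cut point, Eq. (26)
child = np.concatenate([a[:x1], b[x1:x2], a[x2:]])
```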

The presented technique predicts cardiac disease through a comprehensive pipeline of data preparation, feature selection, and machine learning. Patient data are extracted and processed to identify valuable features, which are then used in advanced machine learning methods such as ensemble deep learning and the classification methodologies described above. This combination aims to enhance precision and expedite the accurate prognosis of cardiac disease.

