Lung Cancer Detection with ML Algorithms
Age and lifestyle play significant roles in lung cancer prognosis because they influence both the prevalence and the type of lung cancer. Machine learning models used for detection and prognosis, such as those discussed in the study, found that smoking, a key lifestyle factor, is a primary cause of lung cancer. The analysis demonstrated that lung cancer prevalence increases with age: the 'Old age' group (50-80 years) showed higher incidence than the 'Youth' (0-25 years) and 'Middle Age' (25-50 years) groups, likely due to accumulated exposure to risk factors over time. Models that incorporate these variables can stratify risk more accurately by considering how age interacts with lifestyle risks such as smoking and occupational hazards, thereby influencing model predictions about the cancer's likely severity and type (low, medium, or high). These insights enable healthcare professionals to prioritize interventions by age group and tailor lifestyle modifications accordingly.
Bootstrapping and feature selection are integral components of Random Forest models that enhance their generalization ability. Bootstrapping creates multiple subsets of the original dataset through random sampling with replacement. This process builds diverse decision trees whose combined predictions tend to be more robust and less prone to overfitting: because no single tree sees the entire training set, the model's learning process is diversified. Feature selection further improves generalization by randomly choosing a subset of features at each split in each tree. This randomness reduces the model's reliance on any single feature, helping it generalize well to unseen data. It also helps identify the features that contribute most to the final prediction, keeping the model efficient and reducing noise from irrelevant or weak predictors. Together, these methods allow the Random Forest model to maintain high prediction accuracy and handle variability across different instances of lung cancer outcomes, as reflected in its reported 100% accuracy in the studies.
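The two mechanisms above map directly onto scikit-learn's `RandomForestClassifier` parameters. The sketch below uses synthetic data as a stand-in for the study's lung cancer dataset (which is not reproduced here), so the printed accuracy is illustrative only:

```python
# Sketch of bootstrapping and per-split feature subsampling in a Random Forest.
# The synthetic data is a hypothetical stand-in for the study's dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # 100 trees, each fit on its own bootstrap resample
    bootstrap=True,       # sample training rows with replacement per tree
    max_features="sqrt",  # consider a random feature subset at each split
    random_state=42,
).fit(X_train, y_train)

print(f"trees: {len(forest.estimators_)}")
print(f"test accuracy: {forest.score(X_test, y_test):.3f}")
```

The final prediction is a majority vote over all 100 trees, which is what averages out the individual trees' overfitting.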
Feature importance methods are pivotal in enhancing the accuracy of machine learning models used in lung cancer studies by identifying which features contribute most significantly to the prediction outcomes. This process assigns scores to input features based on their impact on a predictive model's accuracy, helping to prioritize the features that carry the most information. In lung cancer prediction, important features generally include factors such as smoking status, age, genetic risk, and symptoms like weight loss or shortness of breath. By identifying these key indicators, models can improve performance by focusing on the most relevant data, thereby enhancing the accuracy of predictions. Random Forests, for example, naturally rank features by importance as a byproduct of their decision-tree structure. This ranking allows models to weight each feature appropriately during training, leading to more precise predictions of cancer presence (low, medium, or high). Additionally, feature selection helps reduce overfitting, since models are less likely to learn noise from irrelevant data, further improving generalization and accuracy.
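A fitted Random Forest exposes these scores through its `feature_importances_` attribute. In the sketch below, the feature names are hypothetical stand-ins for the risk factors named above, and the data is synthetic, so the actual ranking produced is illustrative rather than the study's:

```python
# Illustrative feature-importance ranking from a fitted Random Forest.
# Feature names are hypothetical stand-ins for the study's risk factors.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["smoking", "age", "genetic_risk", "weight_loss",
                 "shortness_of_breath", "chest_pain"]
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1.0 across all features.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:>20}: {score:.3f}")
```

Low-scoring features are candidates for removal, which is how this ranking feeds back into the feature selection discussed above.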
Smoking is a critical factor influencing both the incidence of lung cancer and the performance of machine learning models designed to predict it. As a well-documented cause of lung cancer, smoking significantly heightens the likelihood of developing the disease, which every model needs to account for when predicting outcomes. In the models discussed, smoking status is treated as a major predictor variable, affecting prediction outcomes across the youth, working-age, and elderly groups. Including smoking as a feature enhances model accuracy by providing a crucial dimension of information that distinguishes between risk categories of lung cancer, such as benign (medium) or malignant (high). Age interacts with smoking habits: older populations may carry more accumulated damage from prolonged exposure to cigarette smoke, raising the likelihood of a malignant prediction. Thus, models that effectively incorporate smoking data and capture its compounding effects with age can deliver more reliable predictions. For instance, models like Random Forest and Decision Trees achieved high accuracy by appropriately weighting such influential factors, reflecting real-world statistics on lung cancer occurrence and contributing to precise predictions across all age groups.
Logistic regression and decision trees differ fundamentally in their approach to model building and decision-making. Logistic regression is a linear model used primarily for binary classification, predicting the probability of class membership by passing a linear combination of the features through a logistic function. This model excels when the relationships between predictors and outcome are linear and the dataset is relatively clean with independent features. Its performance can diminish, however, when the underlying relationships are non-linear or interactive, which is often the case in complex datasets like those used for lung cancer detection. Decision trees, on the other hand, offer a non-linear approach and can capture complex interactions between features, making them particularly suited to varied and intricate datasets. They make decisions through hierarchical partitioning of the data, splitting on feature values at internal nodes until a leaf is reached; each leaf represents a predicted target value. In terms of performance, decision trees tend to do well on complex datasets because they naturally handle both binary and multi-class problems, as demonstrated by their 100% accuracy in the test studies. In contrast, logistic regression, while useful for simpler linear cases, achieved a lower accuracy of 87.6% in the same studies. This suggests that decision trees are better suited to capturing the nuances of lung cancer data, especially when multiple factors and interactions come into play.
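The contrast can be sketched by fitting both models on the same data. The synthetic dataset below is a stand-in for the study's data, so the accuracies printed are illustrative and not the reported 87.6% and 100%:

```python
# Side-by-side fit of logistic regression (linear decision boundary) and a
# decision tree (non-linear, hierarchical splits) on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

# Linear model: one coefficient per feature, logistic link for probabilities.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Tree: recursive partitioning, each leaf holds a predicted class.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print(f"logistic regression accuracy: {log_reg.score(X_test, y_test):.3f}")
print(f"decision tree accuracy:       {tree.score(X_test, y_test):.3f}")
```

Which model wins depends on how non-linear the true feature-outcome relationships are, which is the crux of the comparison above.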
The K-Nearest Neighbors (KNN) algorithm is a simple yet effective machine learning model for classification tasks that bases its predictions on the closest training examples in feature space. For lung cancer prediction, KNN can be particularly useful in capturing similarities between patients based on symptoms and risk factors, classifying an individual's likelihood of cancer by comparison against historical cases. In terms of accuracy, KNN performed exceptionally well, reaching 99.2% on the dataset used for lung cancer detection, indicating strong predictive capability provided the data is adequately preprocessed and relevant features are selected. However, KNN's dependence on the structure of the data means it can incur high computational costs at prediction time and requires careful feature scaling to perform optimally. Compared with Random Forest and Decision Trees, which reached a perfect 100% accuracy, KNN is slightly less accurate; it nonetheless outperforms algorithms like SVM, which achieved 78% accuracy. This highlights KNN's effectiveness in cases where simple, proximity-based predictions are suitable.
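Because KNN compares raw distances, the feature-scaling requirement mentioned above is usually handled by putting a scaler and the classifier in one pipeline. A minimal sketch, again on synthetic stand-in data rather than the study's dataset:

```python
# KNN classifies by majority vote among the k nearest training points, so
# features must be on comparable scales; a Pipeline applies the scaler
# consistently at fit and predict time. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=7)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(f"KNN accuracy: {knn.score(X_test, y_test):.3f}")
```

Note that KNN does no real training: the cost is paid at prediction time, when distances to all stored examples are computed, which is the computational concern raised above.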
Both logistic regression and support vector machines (SVM) are designed for binary classification tasks, but each faces unique challenges when applied to lung cancer prediction. Logistic regression handles binary classification by predicting the probability of class membership from a linear combination of features. One challenge for logistic regression is its limitation to linear decision boundaries, making it less capable of capturing the complex non-linear relationships often present in medical data such as lung cancer symptoms and risk factors. This limitation can reduce accuracy, particularly when critical pathological interactions are non-linear. SVM is more flexible thanks to kernel techniques that implicitly map input data into high-dimensional spaces, but it requires careful parameter tuning and feature scaling, which can complicate its application to diverse, unstandardized datasets like those used in lung cancer studies. Its reliance on finding an optimal separating hyperplane can also be undermined by overlapping classes or non-separable data, leading to lower prediction accuracy than more adaptable models. Despite the potential of both models for accurate binary classification, these challenges highlight the need for careful preprocessing and possibly complementary algorithms, such as ensemble learning, to improve the accuracy and reliability of lung cancer predictions.
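The tuning and scaling burdens described above are visible in a typical SVM setup: the kernel, `C`, and `gamma` all have to be chosen, and scaling belongs in the same pipeline. The values below are illustrative defaults on synthetic stand-in data, not tuned settings from the study:

```python
# SVM with an RBF kernel: the kernel implicitly maps inputs into a
# high-dimensional space where a separating hyperplane is sought.
# C (margin softness) and gamma (kernel width) typically need tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=3)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print(f"SVM accuracy: {svm.score(X_test, y_test):.3f}")
```

In practice these hyperparameters would be tuned with cross-validation (e.g. a grid search over `C` and `gamma`), which is exactly the extra effort the paragraph above attributes to SVM.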
Evaluation metrics commonly employed to assess the performance of machine learning models in lung cancer prediction include accuracy, precision, recall, specificity, and F1-score. Each metric provides a different view of model performance. Accuracy measures the proportion of correct results (both true positives and true negatives) among all cases examined, offering a general overview. Precision indicates the proportion of predicted positives that are truly positive, highlighting the model's ability to avoid false positives. Recall, or sensitivity, measures the proportion of actual positives correctly identified, crucial for ensuring that cases of lung cancer are not missed; specificity is its counterpart, measuring the proportion of actual negatives correctly identified. The F1-score, the harmonic mean of precision and recall, provides a balanced measure for models facing uneven class distributions. In the studies analyzed, accuracy and F1-score were deemed most critical because together they capture both overall correctness and the balance between sensitivity and precision. These metrics were instrumental in evaluating models like Decision Trees and Random Forest, which achieved accuracy scores of 100%, confirming their strong performance in lung cancer prediction.
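The definitions above reduce to simple arithmetic on the confusion-matrix counts. A worked example on a small hypothetical set of predictions (1 = cancer present, 0 = absent):

```python
# Hand computation of the metrics defined above from hypothetical predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 2
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives: 4

accuracy = (tp + tn) / len(y_true)                   # (3+4)/10 = 0.7
precision = tp / (tp + fp)                           # 3/5 = 0.6
recall = tp / (tp + fn)                              # 3/4 = 0.75 (sensitivity)
specificity = tn / (tn + fp)                         # 4/6 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)   # 0.9/1.35 ≈ 0.667

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

Note how a missed cancer case (the false negative) lowers recall while false alarms lower precision; the F1-score penalizes an imbalance between the two, which is why it was weighted heavily alongside accuracy.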
Future directions for improving lung cancer prediction models involve integrating more sophisticated techniques such as deep learning and advanced imaging analysis. These approaches can enhance precision and early detection rates by leveraging the rich, nuanced data available from CT scans and other medical imaging modalities. Deep learning models, particularly convolutional neural networks (CNNs), are adept at handling image data; applying them can improve diagnostic accuracy by detecting subtle patterns indicative of early-stage cancer that traditional methods might miss. Another suggestion is to include larger and more varied datasets encompassing diverse demographic and lifestyle variables, strengthening model training and enabling robust cross-population comparisons. Furthermore, continuous integration of real-world clinical data could enhance the adaptability and applicability of these models in practice, leading toward personalized medicine approaches in which risk assessments and treatment options are tailored to individual patient profiles. Finally, fostering collaboration among interdisciplinary teams of data scientists, clinicians, and researchers will be crucial to driving innovation and optimizing algorithms for better lung cancer prediction and patient outcomes.
Random Forest and Support Vector Machine (SVM) are both supervised machine learning algorithms used for classification tasks, including differentiating types of lung cancer. Random Forest operates by building numerous decision trees and combining their results to improve accuracy and control overfitting. This ensemble approach excels at capturing complex patterns and interactions within data, which explains its 100% accuracy in the lung cancer predictions, particularly in distinguishing between the low (no cancer), medium (benign), and high (malignant) categories across age groups. SVM, on the other hand, works by finding the optimal hyperplane that separates the classes. It is effective for binary classification and handles high-dimensional spaces well, but it typically requires a careful choice of kernel and is sensitive to feature scaling. SVM achieved an accuracy of 78% on the same tasks, proving less effective than Random Forest at handling the intricate variation present in lung cancer datasets. Overall, while SVM is robust in scenarios with clear class separation and smaller datasets, Random Forest surpasses it on complex, multi-class problems, making it more suitable for accurately predicting lung cancer types.