Feature Selection Techniques in Machine Learning
In data science we often encounter datasets with a vast number of features, but not all of them contribute equally to prediction. This is where feature selection comes in: it helps us choose the important features while discarding the rest. In this article we will learn more about feature selection and its techniques.
Feature Selection Foundation
Feature selection is an important step in machine learning. It involves selecting a subset of relevant features from the original feature set to reduce the feature space, improving the model's performance while lowering computational cost. It is especially critical when dealing with high-dimensional data.
In real-world machine learning tasks, not all features in a dataset contribute equally to model performance. Some features may be redundant, irrelevant or even noisy. Feature selection helps remove these, improving the model's accuracy and interpretability compared to training on every feature indiscriminately.
The various algorithms used for feature selection are grouped into three main categories:
1. Filter Methods
2. Wrapper Methods
3. Embedded Methods
Each one has its own strengths and trade-offs depending on the use case.
1. Filter Methods
Filter methods evaluate each feature independently of any model, based on its statistical relationship with the target variable. Features that correlate strongly with the target are selected, since such a relationship suggests they can help in making predictions. These methods are applied in the preprocessing phase to remove irrelevant or redundant features using statistical tests (such as correlation) or other criteria.
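As a minimal sketch of a filter method (assuming scikit-learn and its bundled Iris dataset — illustrative choices, not anything mandated by the article):

```python
# Filter method: score each feature against the target with the
# chi-square test and keep only the top-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # 150 samples, 4 features
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best features
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)         # (150, 4) -> (150, 2)
```

Note that no model is trained here: the chi-square scores alone decide which features survive, which is what makes the method fast.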
Advantages:
Fast and inexpensive: Can quickly evaluate features without
training the model.
Good for removing redundant or correlated features.
Limitations: These methods don't consider feature interactions, so they may miss feature combinations that improve model performance.
Some techniques used are:
Information Gain – The amount of information a feature provides for identifying the target value, measured as the reduction in entropy. The information gain of each attribute is calculated with respect to the target values for feature selection.
Chi-square test – The chi-square (χ²) test is generally used to test the relationship between categorical variables. It compares the observed frequencies of attribute values in the dataset against their expected frequencies under independence:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is an observed frequency and Eᵢ the corresponding expected frequency.
Fisher's Score – Fisher's Score ranks each feature independently according to the Fisher criterion; because features are scored one at a time, this can lead to a suboptimal subset. The larger a feature's Fisher score, the better the feature.
Correlation Coefficient – Pearson's correlation coefficient quantifies the strength and direction of the linear association between two continuous variables, with values ranging from -1 to 1.
Variance Threshold – An approach that removes all features whose variance doesn't meet a specified threshold. By default, this method removes features with zero variance. The underlying assumption is that higher-variance features are likely to contain more information.
Mean Absolute Difference (MAD) – Similar to the variance threshold method, except that MAD uses absolute deviations from the mean rather than squared deviations. Features with a higher mean absolute difference are considered more informative.
Dispersion Ratio – The ratio of the arithmetic mean (AM) to the geometric mean (GM) of a given feature. Since AM ≥ GM, its value ranges from 1 to ∞. A higher dispersion ratio implies a more relevant feature.
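Several of these techniques have off-the-shelf implementations. A small variance-threshold sketch (the toy matrix below is invented purely for illustration):

```python
# Variance Threshold: drop features whose variance does not exceed a cutoff.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [0.0, 3.0, 1.0]])   # first column is constant (zero variance)
selector = VarianceThreshold(threshold=0.0)  # default: remove zero-variance features
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2) – the constant column is gone
```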
2. Wrapper Methods
Wrapper methods, often described as greedy algorithms, work by training a model on different combinations of features. They evaluate how well each feature subset predicts the target variable and, based on the results, add or remove features. The stopping criterion for selecting the best subset is usually pre-defined by the person training the model, for example when the model's performance starts to decrease or a specific number of features is reached.
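A hedged sketch of a wrapper method, using scikit-learn's forward sequential selection with a logistic-regression model (both choices are illustrative assumptions):

```python
# Wrapper method: forward selection wraps a real model and evaluates
# candidate feature subsets by cross-validated performance.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the 4 features
```

Because the model is retrained for every candidate subset and fold, this is noticeably slower than any filter method — the trade-off described above.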
Advantages:
Can lead to better model performance since they evaluate feature
subsets in the context of the model.
They can capture feature dependencies and interactions.
Limitations: They are computationally more expensive than filter methods, especially for large datasets.
Some techniques used are:
Forward selection – An iterative approach where we start with an empty set of features and, at each iteration, add the feature that most improves the model. The process stops when adding a new variable no longer improves performance.
Backward elimination – Also an iterative approach, but we start with all features and remove the least significant feature at each iteration. The process stops once removing further features no longer improves the model's performance.
Recursive feature elimination (RFE) – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. An estimator is trained on the initial set of features, and each feature's importance is obtained via the estimator's feature_importances_ (or coef_) attribute. The least important features are then removed from the current set until the required number of features remains.
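A minimal RFE sketch, assuming scikit-learn and a random-forest estimator as the source of importances (illustrative choices):

```python
# Recursive Feature Elimination: repeatedly fit the estimator and
# drop the least important feature until only 2 remain.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator, n_features_to_select=2, step=1)
rfe.fit(X, y)
print(rfe.ranking_)  # rank 1 marks the selected features
```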
3. Embedded Methods
Embedded methods perform feature selection during the model training process, combining the benefits of filter and wrapper methods. Feature selection is integrated into training, allowing the model to dynamically select the most relevant features as it learns.
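A minimal embedded-method sketch using Lasso on scikit-learn's diabetes data (the dataset and the penalty strength are illustrative assumptions):

```python
# Embedded method: L1 regularization zeroes out the coefficients of
# less useful features during training itself.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)     # 10 features
X = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=10.0).fit(X, y)       # fairly strong penalty, for illustration
kept = lasso.coef_ != 0
print(kept.sum(), "of", X.shape[1], "features kept")
```

No separate selection pass is needed: reading off the non-zero coefficients after training is the selection.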
Advantages:
More efficient than wrapper methods because the feature selection
process is embedded within model training.
Often more scalable than wrapper methods.
Limitations: Because feature selection is tied to a specific learning algorithm, the selected features may not work well with other models.
Some techniques used are:
L1 Regularization (Lasso): A regression method that applies L1
regularization to encourage sparsity in the model. Features with
non-zero coefficients are considered important.
Decision Trees and Random Forests: These algorithms naturally
perform feature selection by selecting the most important features
for splitting nodes based on criteria like Gini impurity or information
gain.
Gradient Boosting: Like random forests, gradient boosting models select important features while building trees, prioritizing features that reduce error the most.
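A short sketch of tree-based embedded selection, using scikit-learn's SelectFromModel as a convenience wrapper (an illustrative choice; reading feature_importances_ directly works too):

```python
# Tree-based embedded selection: importances fall out of training,
# and SelectFromModel keeps features scoring above the mean importance.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)            # one score per feature
sfm = SelectFromModel(forest, prefit=True, threshold="mean")
X_reduced = sfm.transform(X)
print(X.shape, "->", X_reduced.shape)
```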
Choosing the Right Feature Selection Method
The choice of feature selection method depends on several factors:
Dataset Size: Filter methods are often preferred for very large
datasets due to their speed.
Feature Interactions: Wrapper and embedded methods are better
for capturing complex feature interactions.
Model Type: Some methods, such as Lasso or tree-based importances, are tied to particular model families like linear or tree-based models.
For example, filter methods like correlation or variance threshold are excellent when we have many features and want to remove irrelevant ones quickly. However, if we want to maximize model performance and have the computational resources, we might explore wrapper methods like RFE or embedded methods like Lasso.
Feature selection is a critical step in building efficient and accurate
machine learning models. By choosing the right features we can improve
our model’s accuracy, reduce overfitting and make it more interpretable.
Each feature selection method has its strengths and weaknesses, and understanding them will help us choose the right approach for our dataset and task.