CST383 B
Overfitting occurs when a machine learning model captures noise and details from the training data to the extent that it negatively impacts the model's performance on new data. This happens when the model is excessively complex, having learned the idiosyncrasies of the training examples rather than the underlying data pattern. Underfitting, conversely, occurs when a model is too simple to capture the underlying relationship in the data, learning only the broadest trend. This typically results in high bias and low variance, and the model performs poorly on both the training and test sets.
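As a rough sketch of both failure modes (the noisy sine data and polynomial degrees are illustrative choices, not from the text): a degree-1 polynomial underfits, while a degree-9 polynomial threads every noisy training point and so drives training error toward zero.

```python
import numpy as np

# Synthetic toy data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

errors = {}
for degree in (1, 9):
    # Fit a polynomial of the given degree by least squares.
    # (np.polyfit may warn about conditioning at degree 9; harmless here.)
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)

# Degree 1 underfits (large error on both sets); degree 9 memorizes the
# training noise, so its training error is far lower than degree 1's.
```

Comparing `errors[1]` and `errors[9]` makes the bias-variance trade-off concrete: the complex model wins on the training set but that advantage does not carry over to fresh data.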
Hierarchical Clustering creates a tree (dendrogram) representing the nested grouping of samples based on distance or similarity metrics, without requiring a pre-defined number of clusters. It can be divisive (top-down) or agglomerative (bottom-up). It is useful for visualizing inherent data structure, such as phylogenetic trees in biological taxonomy. In contrast, K-means Clustering requires the user to specify the number of clusters (K) before clustering begins. It assigns each data point to the nearest cluster centroid, making it well suited to partitioning data into flat (non-nested) groups, as in market segmentation or image compression. K-means is computationally faster but cannot reveal any hierarchical structure present in the data.
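A minimal agglomerative sketch using SciPy (the six toy points and the choice of Ward linkage are assumptions for illustration): the tree is built first, and K is chosen only afterwards by cutting it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two visually obvious groups (illustrative data).
points = np.array([[1, 1], [1.5, 1], [1, 1.5],
                   [8, 8], [8.5, 8], [8, 8.5]])

# Agglomerative (bottom-up) clustering: repeatedly merge the two closest
# clusters; 'ward' merges the pair that least increases within-cluster variance.
Z = linkage(points, method="ward")

# Unlike K-means, the number of clusters is decided after the tree exists,
# here by cutting the dendrogram into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself, which is the visualization step the paragraph above refers to.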
The matrix method calculates regression coefficients by representing data as matrices, using formulas derived from the least squares method. Given sample data with X (input values) and y (output values), arrange them in matrix form, then calculate the weight matrix using the formula W = (X^T * X)^{-1} * X^T * y, where W contains the regression coefficients. Applying this in practice, consider students' degree marks and post-graduate marks: build the matrices from the marks and solve for W to obtain the coefficients that define the linear relationship between degree and post-grad marks, allowing performance in one to be predicted from the other.
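The computation above can be sketched directly in numpy (the five mark pairs are made-up numbers for illustration, not data from the text); note the column of ones added to X so that W contains an intercept as well as a slope.

```python
import numpy as np

# Hypothetical data: degree marks (x) and post-graduate marks (y).
x = np.array([65.0, 70.0, 75.0, 80.0, 85.0])
y = np.array([60.0, 66.0, 71.0, 77.0, 82.0])

# Design matrix X: a column of ones (for the intercept) next to the inputs.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: W = (X^T X)^{-1} X^T y
W = np.linalg.inv(X.T @ X) @ X.T @ y
intercept, slope = W

# Predict post-grad marks from a degree mark of 72 (illustrative input).
pred = intercept + slope * 72
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is preferred over forming the explicit inverse, but the version above mirrors the formula term by term.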
Unsupervised learning is used when the output labels are not known or available. This type of learning is ideal for exploratory data analysis to identify patterns or groupings within datasets. Examples include grouping students into different academic groups through clustering (e.g., K-means or hierarchical clustering) based on performance metrics or behavior without predefined categories. Another example is customer segmentation in marketing, where clusters of customers are created based on purchase history, demographics, and behavior, allowing businesses to tailor strategies specific to each segment.
K-means clustering partitions data into K clusters, where each data point belongs to the cluster with the nearest mean. The algorithm involves randomly initializing K centroids, assigning each point to the nearest centroid, and recalculating the centroids based on the points assigned to each cluster. This process repeats until convergence, typically when assignments no longer change. For example, categorizing data points (x, y): (1,1), (2,1), (2,3), (3,2), (4,3), (5,5), (7) into two clusters would result in two separate groups around two centroids, effectively dividing the data into distinct clusters based on proximity.
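The assign/update loop can be sketched in a few lines of numpy (using only the six fully specified points from the example; a fixed iteration count stands in for a convergence check, and the empty-cluster guard is a defensive assumption):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keeping the old centroid if a cluster ever becomes empty).
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

pts = np.array([(1, 1), (2, 1), (2, 3), (3, 2), (4, 3), (5, 5)], dtype=float)
labels, centroids = kmeans(pts, k=2)
```

A production loop would stop as soon as `labels` stops changing rather than running a fixed number of iterations.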
Decision Trees are interpretable models that split data on feature values, producing a tree-like model of decisions. Unlike SVM, which focuses on finding the optimal separating hyperplane, and Neural Networks, which build complex hierarchical feature representations through layers, Decision Trees offer simplicity and transparency in their decision-making. However, they tend to overfit, particularly when grown deep or when the data is noisy. Being far less computationally intensive than SVM or Neural Networks, Decision Trees are quick to train and perform well on smaller datasets or as the base learners in ensemble methods such as Random Forests.
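To make the "split on feature values" idea concrete, here is a minimal sketch of how a single tree node chooses a split on one feature by minimizing Gini impurity (the toy 1-D data and function names are illustrative assumptions; a full tree would apply this recursively to each child):

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    # Try each observed value as a threshold and keep the one that
    # minimizes the weighted Gini impurity of the two children.
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy 1-D data (illustrative): class 0 at small values, class 1 at large ones.
xs = [1, 2, 3, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)
```

Because the classes here are perfectly separable, the best threshold yields two pure children; on noisy data the minimum impurity stays above zero, which is exactly the situation where deep trees start memorizing noise.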
Linear Regression and Logistic Regression are both used for predictive modeling but serve different purposes. Linear Regression is used for predicting continuous outcomes, such as estimating a person's weight from height. It assumes a linear relationship between input and output variables. Logistic Regression, on the other hand, is used for binary classification problems, predicting discrete outcomes like spam vs. non-spam emails. It uses the logistic function to model probabilities between 0 and 1, enabling it to classify data into categories. A limitation of Linear Regression is its assumption of a linear relationship, which may not hold for complex datasets. Logistic Regression's limitation is its suitability for only binary or binary-like outcomes, though extensions like multinomial logistic regression allow for multiple categories.
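As a small sketch of the logistic function in the spam setting (the coefficients and the "suspicious word count" feature are hypothetical, not fitted values from the text):

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients for a spam classifier:
# b0 is the intercept, b1 the weight on a "suspicious word count" feature.
b0, b1 = -4.0, 1.5

def spam_probability(word_count):
    # Linear combination of inputs squashed through the logistic function.
    return sigmoid(b0 + b1 * word_count)

# Classify as spam whenever the modeled probability exceeds 0.5.
```

The 0.5 cutoff corresponds to the linear score b0 + b1·x crossing zero, which is why logistic regression still has a linear decision boundary despite its curved probability output.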
A confusion matrix is a tool for assessing the performance of a classification model by showing the count of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. For instance, in a binary classification task of identifying emails as spam (positive) or not spam (negative), if out of 100 emails, 30 are spam and 70 are not, a confusion matrix might show 25 TP (correctly identified spam), 5 FN (misclassified spam), 60 TN (correctly identified non-spam), and 10 FP (non-spam misclassified as spam). These counts allow metrics such as accuracy, precision, recall, and F1 score to be calculated, giving a comprehensive evaluation of model performance.
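Using the counts from the example above, the derived metrics can be computed directly:

```python
# Counts from the example: 30 actual spam, 70 actual non-spam emails.
TP, FN, TN, FP = 25, 5, 60, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)        # fraction correct overall
precision = TP / (TP + FP)                         # of predicted spam, how much was spam
recall    = TP / (TP + FN)                         # of actual spam, how much was caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Here accuracy is 0.85, precision 25/35 ≈ 0.714, recall 25/30 ≈ 0.833, and F1 ≈ 0.769, showing how a single model can score differently depending on which errors a metric penalizes.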
Backpropagation is a supervised learning algorithm used for training neural networks, involving two main steps: the feedforward pass to compute the output and the backward pass for error correction. Initially, inputs are passed through the network to compute the output and the loss function. During backpropagation, the error is calculated, and gradients of the loss with respect to each weight are computed using the chain rule, propagating backward from the output layer to the input layer. These gradients are then used to update the weights, typically with an optimization technique such as gradient descent, so as to minimize the loss. Repeating this process updates the weights in every layer, steadily reducing prediction error and fitting the model to the training data efficiently.
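Both passes can be seen in a minimal numpy sketch (the XOR data, layer sizes, learning rate, and loss are illustrative choices under stated assumptions, not a prescribed implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny illustrative dataset: the XOR function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units, one sigmoid output unit.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for _ in range(2000):
    # Feedforward pass: compute activations layer by layer, then the loss.
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(np.mean((p - y) ** 2))
    # Backward pass: chain rule from the output back toward the input.
    dp  = 2 * (p - y) / len(y)      # dLoss/dp
    dz2 = dp * p * (1 - p)          # through the output sigmoid
    dW2 = h.T @ dz2; db2 = dz2.sum(0)
    dh  = dz2 @ W2.T                # into the hidden layer
    dz1 = dh * h * (1 - h)          # through the hidden sigmoid
    dW1 = X.T @ dz1; db1 = dz1.sum(0)
    # Gradient descent step on every weight and bias.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Each loop iteration is one feedforward pass plus one backward pass; the recorded losses shrink as the weight updates reduce prediction error.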
A soft margin in SVM allows some misclassification of training points, adding flexibility to the decision boundary, which is crucial for balancing a wide margin against classification errors on non-linearly separable or noisy data. This approach improves the generalization of the model by permitting a few data points to lie on the wrong side of the margin or hyperplane. It helps avoid overfitting by not forcing a hard decision boundary on datasets with overlap or noise in feature space, ultimately leading to better performance on test data.
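This trade-off is visible in the soft-margin objective itself. As a sketch (the five toy points, the hand-picked boundary, and C = 1 are all assumptions for illustration), the hinge loss is zero for points safely beyond the margin and positive only for the one noisy point the soft margin tolerates:

```python
import numpy as np

# Tiny 2-D dataset (illustrative): separable except for one noisy -1
# point sitting right on the natural +1/-1 boundary.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0], [3.0, 1.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])

def soft_margin_objective(w, b, C):
    # ||w||^2 / 2 keeps the margin wide; C scales the hinge-loss penalty
    # for points inside the margin or on the wrong side of the boundary.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum(), hinge

# A hand-picked boundary (the line x2 = x1): only the noisy fifth point
# incurs any hinge loss; every other point clears the margin.
w, b = np.array([-1.0, 1.0]), 0.0
objective, hinge = soft_margin_objective(w, b, C=1.0)
```

A large C pushes the optimizer toward a hard margin (penalizing the noisy point heavily), while a small C accepts that violation in exchange for a wider margin, which is exactly the overfitting control described above.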