Supervised vs Unsupervised Learning
1. Supervised and Unsupervised Learning with suitable block diagram (Pg no. 187)

Supervised learning

Definition:

• Training data includes desired solutions called labels.

• The algorithm learns to predict outputs based on labeled examples.

Typical Tasks:

• Classification: Assigns a category or class to input data.


Example: A spam filter trained with emails labeled as spam or ham.

• Regression: Predicts a numeric target value.


Example: Estimating car prices based on features like mileage, age, brand, etc.

Shared Algorithms:

• Some algorithms can handle both classification and regression tasks.


Example: Logistic Regression is used for classification and predicts probabilities (e.g., a 20% chance that an email is spam).

Important Supervised Learning Algorithms:

• k-Nearest Neighbors (k-NN)

• Linear Regression

• Logistic Regression

• Support Vector Machines (SVMs)

• Decision Trees and Random Forests

• Neural Networks

Training Process:

• Requires many examples with both predictors (features) and labels (desired outputs).

• Example: Cars dataset with features like mileage, age, and prices (labels).
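The training process above can be sketched with scikit-learn. The cars dataset here is a hypothetical toy (mileage and age as features, price as label), not the book's data:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical labeled examples: features are [mileage, age],
# labels are prices -- the "desired solutions" supervised learning needs.
X = [[50_000, 3], [120_000, 8], [20_000, 1], [90_000, 6]]
y = [15_000, 6_000, 22_000, 9_000]

model = LinearRegression()
model.fit(X, y)                            # learn from labeled examples
price = model.predict([[70_000, 4]])[0]    # predict the price of an unseen car
```

The same fit/predict pattern applies to classification; only the estimator and the label type change.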
Unsupervised learning

Definition:

• Training data is unlabeled; the system learns without supervision.

Common Tasks:

• Clustering: Groups similar data points.


Example: Grouping blog visitors by preferences and behavior.

• Dimensionality Reduction: Simplifies data while retaining important information.


Example: Merging correlated features like mileage and age into one (feature extraction).

• Anomaly Detection: Identifies unusual instances.


Example: Detecting credit card fraud or manufacturing defects.

• Novelty Detection: Identifies new patterns in data.

• Association Rule Learning: Finds relationships between attributes.


Example: Discovering that people who buy barbecue sauce also buy steak.

Important Algorithms:

• Clustering:

o K-Means

o DBSCAN

o Hierarchical Cluster Analysis (HCA)

• Anomaly & Novelty Detection:

o One-Class SVM

o Isolation Forest

• Visualization & Dimensionality Reduction:

o Principal Component Analysis (PCA)

o Kernel PCA

o Locally-Linear Embedding (LLE)

o t-distributed Stochastic Neighbor Embedding (t-SNE)

• Association Rule Learning:

o Apriori

o Eclat

Use Cases:

• Clustering: Understand data organization (e.g., blog visitor behavior).

• Visualization: Represent complex data in 2D or 3D for pattern recognition.


• Dimensionality Reduction: Combine correlated features for simpler analysis.

• Anomaly Detection: Identify outliers or irregularities in datasets.

• Association Rules: Discover useful correlations (e.g., supermarket sales patterns).
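The clustering use case can be sketched with K-Means; the blog-visitor features and values below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical visitor data: [posts read per week, avg. minutes per visit].
visitors = np.array([[1, 2], [2, 3], [1, 1],          # casual readers
                     [20, 40], [22, 45], [19, 38]])   # heavy readers

# No labels are given -- K-Means groups similar visitors on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(visitors)
labels = kmeans.labels_   # cluster assignment for each visitor
```

The two obvious groups end up in separate clusters, which a marketer could then target differently.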


2. Batch and Online Learning with suitable block diagram (Pg no. 194)
3. Main challenges of Machine learning (Pg no. 203)

Batch Learning:

• The model is trained using the entire dataset at once.

• Once trained, it cannot be updated without retraining from scratch.

• Suitable for static datasets.

• Process:

1. Collect entire dataset.

2. Train the model offline.

3. Deploy the trained model.

Online Learning:

• The model is updated incrementally with new data as it arrives.

• Suitable for dynamic or large datasets.

• Process:

1. Receive a data instance.

2. Update the model in real-time.

Block Diagram:

• Batch Learning: Full Dataset → Model Training → Model Deployment.

• Online Learning: Data Stream → Incremental Updates → Continuous Model.
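The online-learning loop can be sketched with scikit-learn's SGDRegressor, whose partial_fit performs exactly this kind of incremental update; the data stream below is simulated:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=42)

# Simulated data stream: instances arrive one mini-batch at a time.
rng = np.random.default_rng(42)
for _ in range(100):
    X_batch = rng.uniform(0, 1, size=(10, 1))
    y_batch = 3 * X_batch.ravel() + 2        # underlying relation y = 3x + 2
    model.partial_fit(X_batch, y_batch)      # incremental update, no retraining

pred = model.predict([[0.5]])[0]             # true value would be 3.5
```

A batch learner would instead call fit once on all 1,000 instances and stay frozen until retrained from scratch.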


4. FIND-S: finding a maximally specific hypothesis with suitable diagram (Module 4 PPT no. 78, Textbook 3 Pg no. 38)
5. Candidate Elimination Algorithm (Module 4 PPT Pg no. 82, Textbook 3 Pg no. 44)

6. Explain data pipeline, with an example of "Machine learning pipeline for real estate investments"
(Page no. 218)

A data pipeline is a sequence of data processing steps that automates data handling, transformation, and model processing. Each stage processes data and passes the result to the next; components are often modular and independent.

For example, in a machine learning pipeline for real estate investments:

• Input: Raw data about districts, including population, median income, and other features.

• Processing: Transformations like filling missing values, scaling features, and encoding
categorical variables.

• Model Training: Train a regression model using the prepared data.

• Output: Predicted housing prices, used in investment decisions.
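The modular, stage-by-stage idea can be sketched in plain Python; the district fields, default values, and stand-in model below are all hypothetical:

```python
# Each stage is a small, independent component that processes the data
# and passes it to the next stage.
def fill_missing(records, key, default):
    return [{**r, key: r.get(key, default)} for r in records]

def scale(records, key, factor):
    return [{**r, key: r[key] / factor} for r in records]

def predict_price(record):
    # Stand-in for a trained regression model.
    return 50_000 + 100_000 * record["median_income"]

districts = [{"population": 3500, "median_income": 2.5},
             {"population": 8000}]                       # missing income

# Chain the stages: clean, then scale, then predict.
prepared = scale(fill_missing(districts, "median_income", 3.0),
                 "population", 1000)
prices = [predict_price(d) for d in prepared]
```

Because each stage is independent, one can be swapped or retuned without touching the others.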


7. How to Prepare the Data for Machine Learning Algorithms (Page no. 244)

Preparing data involves the following steps:

• Data Cleaning: Handle missing data by dropping rows/columns or imputing values (e.g.,
using median values with SimpleImputer).

• Feature Engineering: Create new features like ratios or combinations that may better
represent the data's relationships.

• Scaling and Encoding: Use tools like StandardScaler for numerical attributes and
OneHotEncoder for categorical attributes.

• Pipelines: Automate these transformations using libraries like ColumnTransformer, enabling consistency across datasets.
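The steps above can be combined in scikit-learn; the column names and values below are hypothetical stand-ins for the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical data: one numeric column with a missing value, one categorical.
df = pd.DataFrame({"median_income": [2.0, 4.0, np.nan, 6.0],
                   "ocean_proximity": ["INLAND", "NEAR BAY",
                                       "INLAND", "NEAR BAY"]})

# Numeric columns: impute missing values with the median, then standardize.
num_pipeline = Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())])

# Route each column type through the appropriate transformer.
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

prepared = preprocess.fit_transform(df)   # numeric scaled, categories one-hot
```

The same fitted preprocess object can later transform new data identically, which is what keeps training and serving consistent.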

8. How to train a Linear Regression model (Page no.256)

Data Preparation: Use a preprocessing pipeline to clean and transform data.

Model Training:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Evaluation: Test the model's predictions using metrics like Root Mean Squared Error (RMSE) to
understand performance.
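The evaluation step might look like the sketch below; housing_prepared and housing_labels are replaced by tiny stand-in arrays so the snippet is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-ins for the prepared housing data used in the text.
housing_prepared = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
housing_labels = np.array([100.0, 120.0, 200.0, 220.0])

lin_reg = LinearRegression().fit(housing_prepared, housing_labels)
predictions = lin_reg.predict(housing_prepared)

# RMSE: square root of the mean squared error, in the units of the labels.
rmse = np.sqrt(mean_squared_error(housing_labels, predictions))
```

In practice RMSE should be measured on held-out data, not the training set, to reflect real generalization performance.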
9. Grid Search to fine-tune your model (Page no. 257)

Grid Search is used to find the best hyperparameters by systematically testing combinations:

• Define a grid of hyperparameters to explore.

• Use GridSearchCV to automate the search with cross-validation.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

Results include the best parameters and model performance.
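Once the search finishes, the best hyperparameters and the refitted model can be read off the GridSearchCV object. The dataset and parameter grid below are small synthetic stand-ins so the snippet runs quickly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, size=60)

param_grid = {'n_estimators': [5, 10], 'max_depth': [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

best = grid_search.best_params_           # best combination found
best_model = grid_search.best_estimator_  # refit on the full data by default
```

Note that best_score_ is a negative MSE, because scikit-learn scorers follow a greater-is-better convention.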

Common questions
Batch learning involves training the model using the entire dataset at once, which is suitable for static datasets. Once trained, the model is not updated unless it is retrained from scratch, making it appropriate for situations where data does not change frequently. Online learning, on the other hand, incrementally updates the model with new data as it arrives, making it suitable for dynamic or large datasets. It is ideal in scenarios where data continuously flows, such as stock price prediction or real-time recommendation systems.

Grid search enhances model performance by systematically searching through a specified grid of hyperparameters to find the best model configuration. The key components include defining a parameter grid, such as various 'n_estimators' and 'max_features' values for a RandomForestRegressor, and using GridSearchCV to automate this exploration with cross-validation. By evaluating combinations with a scoring function such as 'neg_mean_squared_error', grid search identifies the hyperparameter set that minimizes error, optimizing model performance on the given data.

The FIND-S algorithm finds a maximally specific hypothesis that fits the given positive training examples, starting with the most specific hypothesis possible and generalizing it only when necessary to cover more positive examples. However, it considers only positive examples and ignores negative ones, which limits its ability to explore the hypothesis space effectively. In contrast, the Candidate Elimination algorithm considers both positive and negative examples, maintaining a general boundary and a specific boundary of the hypothesis space, thus providing a more comprehensive hypothesis search and potentially better generalization.
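A minimal sketch of FIND-S, assuming simple attribute-value hypotheses where 'ø' denotes the most specific value (matches nothing) and '?' the most general (matches anything):

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs; labels are 'yes'/'no'."""
    n = len(examples[0][0])
    h = ['ø'] * n                       # start with the most specific hypothesis
    for attrs, label in examples:
        if label != 'yes':              # FIND-S ignores negative examples
            continue
        for i, value in enumerate(attrs):
            if h[i] == 'ø':
                h[i] = value            # first positive example: copy its values
            elif h[i] != value:
                h[i] = '?'              # generalize only where necessary
    return h

# EnjoySport-style toy data (values abbreviated, purely illustrative).
data = [(('sunny', 'warm', 'normal', 'strong'), 'yes'),
        (('sunny', 'warm', 'high', 'strong'), 'yes'),
        (('rainy', 'cold', 'high', 'strong'), 'no')]
hypothesis = find_s(data)   # → ['sunny', 'warm', '?', 'strong']
```

The negative example leaves the hypothesis unchanged, which is exactly the limitation Candidate Elimination addresses with its second (general) boundary.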

Data preparation involves several crucial steps, including data cleaning, feature engineering, scaling, and encoding, each essential for improving model performance and accuracy. Data cleaning addresses missing values using techniques like dropping rows or imputing them with median values via SimpleImputer. Feature engineering creates new features that better capture relationships within the data. Scaling and encoding ensure numerical and categorical consistency using tools like StandardScaler and OneHotEncoder. Automating these transformations with libraries like ColumnTransformer ensures consistency across datasets, which in turn improves the model's ability to learn meaningful patterns.

Different learning algorithms are used for classification and regression because the tasks differ in nature: classification assigns input data to discrete categories, while regression predicts continuous values. Algorithms are designed with these task-specific requirements in mind to optimize performance. Logistic regression exemplifies a shared algorithm: although used for classification, it predicts class probabilities rather than hard labels, often expressed as percentages (e.g., a 20% chance of spam). Despite its classification focus, it operates like regression in that it fits a linear function of the features to form its decision boundary, demonstrating its dual-task character.
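The probability output can be illustrated with scikit-learn's LogisticRegression; the one-feature spam data below is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy spam-filter data (hypothetical): feature is a crude "spammy word count".
X = np.array([[0], [1], [2], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])        # 0 = ham, 1 = spam

clf = LogisticRegression().fit(X, y)

# predict_proba returns class probabilities, not hard labels.
proba_spam = clf.predict_proba([[3]])[0, 1]   # probability the email is spam
label = clf.predict([[9]])[0]                 # hard label: thresholded at 0.5
```

predict simply thresholds predict_proba at 0.5, which is why the same model serves both probability estimation and classification.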

In a machine learning pipeline for real estate investments, the processing stage includes several critical transformations to prepare raw data for model training. These transformations involve filling missing values to ensure dataset completeness, scaling features for consistency in units, and encoding categorical variables to numerical format for algorithm compatibility. Each of these transformations enables the subsequent model to effectively process input data, leading to more accurate predictions of housing prices, which are foundational for sound investment decisions.

The choice between k-Nearest Neighbors (k-NN) and Support Vector Machines (SVMs) involves several trade-offs. k-NN is a straightforward algorithm that classifies data points based on the majority class of their neighbors, making it simple to implement and requiring no explicit training phase. However, it can become computationally expensive on large datasets and is sensitive to feature scaling. SVMs, on the other hand, are powerful for clear margin separation in high-dimensional spaces and generally perform well when the number of features exceeds the number of samples. They are less prone to overfitting because they maximize the margin. Yet SVMs can be demanding in computation time and resources, particularly with non-linear kernels on large datasets. The choice depends largely on the dataset characteristics and problem requirements.
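A side-by-side sketch of the two classifiers on synthetic data (the accuracies here say nothing general about either algorithm):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, fairly separable data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# k-NN: "training" just stores the data; cost is paid at prediction time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# SVM: training finds a maximum-margin boundary; prediction is then cheap.
svm = SVC(kernel='rbf').fit(X_train, y_train)

knn_acc = knn.score(X_test, y_test)
svm_acc = svm.score(X_test, y_test)
```

Both benefit from feature scaling in practice, so either would normally sit at the end of a preprocessing pipeline.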

Integrating pipelines with machine learning algorithms streamlines and automates the entire data processing and model deployment workflow, significantly enhancing efficiency in dynamic environments. Pipelines ensure consistent data transformations, such as cleaning, scaling, and encoding, across varying datasets, which is critical when data characteristics are constantly changing. They also facilitate easy adaptation and updating of models with new data streams, as seen in online learning contexts, thereby maintaining high model performance over time. By automating these processes, pipelines reduce manual intervention, allow for rapid iteration, and ensure robustness in model deployment, essential for responsive and resource-efficient solutions in dynamic real-world applications.

Supervised learning requires labeled data, meaning that the training data includes the desired output or solution (labels), allowing the algorithm to learn and predict based on examples. Typical tasks include classification and regression, like a spam filter trained with labeled emails for classification and car price prediction for regression. Unsupervised learning, in contrast, works with unlabeled data, where the system learns without explicit supervision. It includes tasks like clustering, dimensionality reduction, anomaly detection, novelty detection, and association rule learning, such as grouping blog visitors by preferences or discovering purchasing patterns.

Clustering and association rule learning extract patterns and insights from unstructured, unlabeled data, offering significant real-world value. Clustering groups similar data points, enhancing understanding of data organization; for instance, segmenting blog visitors by behavior aids targeted marketing strategies. Association rule learning uncovers relationships between attributes, such as correlating purchases in retail, leading to effective cross-selling strategies, like promoting items frequently bought together. These tasks empower businesses to make data-driven decisions, optimize operations, and personalize customer interactions without requiring labeled datasets.
