Understanding AI: Machine Learning & Generative AI
ARTIFICIAL INTELLIGENCE
Introduction
Artificial intelligence (AI) is technology that enables computers and machines to simulate
human learning, comprehension, problem solving, decision making, creativity and autonomy.
Applications and devices equipped with AI can see and identify objects. They can understand and
respond to human language. They can learn from new information and experience. They can make
detailed recommendations to users and experts. They can act independently, replacing the need for
human intelligence or intervention (a classic example being a self-driving car). But in 2024, most AI
researchers, practitioners and most AI-related headlines are focused on breakthroughs in generative
AI (gen AI), a technology that can create original text, images, video and other content. To fully
understand generative AI, it's important to first understand the technologies on which generative AI
tools are built: machine learning and deep learning.
MACHINE LEARNING
A simple way to think about AI is as a series of nested or derivative concepts that
have emerged over more than 70 years: AI contains machine learning, which contains deep
learning, which in turn contains generative AI. Directly underneath AI, we have machine learning, which involves
creating models by training an algorithm to make predictions or decisions based on data. It
encompasses a broad range of techniques that enable computers to learn from and make inferences
based on data without being explicitly programmed for specific tasks. There are many types of
machine learning techniques or algorithms, including linear regression, logistic regression, decision
trees, random forest, support vector machines (SVMs), k-nearest neighbor (KNN), clustering and
more. Each of these approaches is suited to different kinds of problems and data. But one of the most
popular types of machine learning algorithm is called a neural network (or artificial neural
network). Neural networks are modeled after the human brain's structure and function. A neural
network consists of interconnected layers of nodes (analogous to neurons) that work together to
process and analyze complex data. Neural networks are well suited to tasks that involve identifying
complex patterns and relationships in large amounts of data. The simplest form of machine
learning is called supervised learning, which involves the use of labeled data sets to train algorithms
to classify data or predict outcomes accurately. In supervised learning, humans pair each training
example with an output label. The goal is for the model to learn the mapping between inputs and
outputs in the training data, so it can predict the labels of new, unseen data.
DEEP LEARNING
Deep learning is a subset of machine learning that uses multilayered neural networks,
called deep neural networks, that more closely simulate the complex decision-making power
of the human brain.
AI Page 1
Artificial Intelligence 2025
Deep neural networks include an input layer, at least three but usually
hundreds of hidden layers, and an output layer, unlike neural networks used in classic machine
learning models, which usually have only one or two hidden layers. These multiple layers
enable unsupervised learning: they can automate the extraction of features from large, unlabeled
and unstructured data sets, and make their own predictions about what the data represents.
Because deep learning doesn't require human intervention, it enables machine learning at a
tremendous scale. It is well suited to natural language processing (NLP), computer vision, and
other tasks that involve the fast, accurate identification of complex patterns and relationships in
large amounts of data. Some form of deep learning powers most of the artificial intelligence
(AI) applications in our lives today.
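The layered structure described above can be sketched in a few lines of NumPy. This is only an illustration of the forward pass through stacked hidden layers; the layer sizes and random weights are invented, and a real deep network would be trained rather than left at random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Pass input x through successive hidden layers, then a linear output layer."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)                  # each hidden layer transforms the previous layer's output
    return x @ weights[-1] + biases[-1]      # output layer

# Input layer of 4 features, three hidden layers of 8 nodes, one output unit
sizes = [4, 8, 8, 8, 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

y = forward(rng.standard_normal((5, 4)), weights, biases)  # a batch of 5 inputs
```

Each interconnected layer here feeds the next, which is what lets deep networks build up progressively more abstract features.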
Generative AI
Generative AI, sometimes called "gen AI", refers to deep learning models that can create complex
original content such as long-form text, high-quality images, realistic video or audio and more in
response to a user’s prompt or request.
In a deep neural network, multiple layers of nodes can extract meaning and relationships from large
volumes of unstructured, unlabeled data. At a high level, generative models encode a simplified
representation of their training data, and then draw from that representation to create new work that’s
similar, but not identical, to the original data. Generative models have been used for years in
statistics to analyze numerical data. But over the last decade, they evolved to analyze and generate
more complex data types. This evolution coincided with the emergence of three sophisticated deep
learning model types:
• Variational autoencoders (VAEs), which were introduced in 2013 and enabled models that could
generate multiple variations of content in response to a prompt or instruction.
• Diffusion models, first seen in 2014, which add "noise" to images until they are
unrecognizable, and then remove the noise to generate original images in response to prompts.
• Transformers (also called transformer models), which are trained on sequenced data to generate
extended sequences of content (such as words in sentences, shapes in an image, frames of a video or
commands in software code). Transformers are at the core of most of today’s headline-making
generative AI tools, including ChatGPT and GPT-4, Copilot, BERT, Bard and Midjourney.
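The "add noise, then remove it" idea behind diffusion models can be illustrated with the forward (noising) half of the process. The 500-step schedule with beta = 0.02 below is invented for illustration; real diffusion models train a network to reverse these steps and recover an image from noise:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(x0, betas):
    """Forward diffusion: repeatedly mix the signal with Gaussian noise.

    Each step keeps sqrt(1 - beta) of the current signal and adds sqrt(beta)
    of fresh noise, so after many steps the result is close to pure noise.
    """
    x = x0.copy()
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

image = rng.standard_normal((8, 8))            # stand-in for an 8x8 grayscale image
noised = add_noise(image, betas=[0.02] * 500)  # nearly unrecognizable after 500 steps
```

After enough steps almost no trace of the original image survives, which is exactly why the learned reverse process can start from random noise and still produce a coherent image.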
Training
Generative AI begins with a "foundation model": a deep learning model that serves as the basis
for multiple different types of generative AI applications. The most common foundation models
today are large language models (LLMs), created for text generation applications. But there are also
foundation models for image, video, sound or music generation, and multimodal foundation models
that support several kinds of content. To create a foundation model, practitioners train a deep
learning algorithm on huge volumes of relevant raw, unstructured, unlabeled data, such as
terabytes or petabytes of text, image or video data from the internet. The training yields a
neural network with billions of parameters: encoded representations of the entities, patterns and
relationships in the data that can generate content autonomously in response to prompts. This is
the foundation model. This training process is compute-intensive, time-consuming and expensive. It
requires thousands of clustered graphics processing units (GPUs) and weeks of processing, all of
which typically costs millions of dollars. Open source foundation model projects, such as Meta's
Llama-2, enable gen AI developers to avoid this step and its costs.
Tuning
Next, the model must be tuned to a specific content generation task. This can be done in
various ways, including:
• Fine-tuning, which involves feeding the model application-specific labeled data, questions or prompts the
application is likely to receive, and corresponding correct answers in the desired format.
• Reinforcement learning with human feedback (RLHF), in which human users evaluate the accuracy or
relevance of model outputs so that the model can improve itself. This can be as simple as having people type
or talk back corrections to a chatbot or virtual assistant.
Generation, evaluation and more tuning
Developers and users regularly assess the outputs of their generative AI apps, and further tune the
model even as often as once a week for greater accuracy or relevance. In contrast, the foundation
model itself is updated much less frequently, perhaps every year or 18 months. Another option for
improving a gen AI app's performance is retrieval augmented generation (RAG), a technique for
extending the foundation model to use relevant sources outside of the training data to refine the
parameters for greater accuracy or relevance.
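The retrieval step of RAG can be sketched with a toy example. The bag-of-words cosine similarity and the document snippets below are illustrative stand-ins; a production system would use learned embeddings, a vector database and an LLM to generate the final answer from the retrieved context:

```python
from collections import Counter
import math

def vectorize(text):
    """Crude bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

# Hypothetical knowledge base outside the model's training data
docs = [
    "The return policy allows refunds within 30 days of purchase.",
    "Our headquarters are located in Berlin.",
    "Support is available by phone on weekdays.",
]
context = retrieve("How many days do I have to get a refund?", docs)
prompt = f"Context: {context[0]}\nQuestion: How many days do I have to get a refund?"
```

The retrieved passage is prepended to the user's question, so the model answers from a source it was never trained on.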
Benefits of AI
AI offers numerous benefits across various industries and applications. Some of the most
commonly cited benefits include:
Automation of repetitive tasks
AI can automate routine, repetitive tasks such as data collection, entry and preprocessing, and
physical tasks such as warehouse stock-picking and manufacturing processes. This automation frees
employees to work on higher-value, more creative work.
Enhanced decision-making
Whether used for decision support or for fully automated decision-making, AI enables
faster, more accurate predictions and reliable, data-driven decisions. Combined with
automation, AI enables businesses to act on opportunities and respond to crises as they
emerge, in real time and without human intervention.
AI use cases
The real-world applications of AI are many. Here is just a small sampling of use cases across various
industries to illustrate its potential:
Fraud detection
Machine learning and deep learning algorithms can analyze transaction patterns and flag
anomalies, such as unusual spending or login locations, that indicate fraudulent transactions. This
enables organizations to respond more quickly to potential fraud and limit its impact, giving
themselves and customers greater peace of mind.
Personalized marketing
Retailers, banks and other customer-facing companies can use AI to create personalized customer
experiences and marketing campaigns that delight customers, improve sales and prevent churn.
Based on data from customer purchase history and behaviors, deep learning algorithms can recommend
products and services customers are likely to want, and even generate personalized copy and special
offers for individual customers in real time.
Predictive maintenance
Machine learning models can analyze data from sensors, Internet of Things (IoT) devices and
operational technology (OT) to forecast when maintenance will be required and predict equipment
failures before they occur. AI-powered predictive maintenance helps prevent downtime and enables
businesses to stay ahead of supply chain issues before they affect the bottom line.
Data risks
AI systems rely on data sets that might be vulnerable to data poisoning, data tampering, data
bias or cyberattacks that can lead to data breaches. Organizations can mitigate these risks by
protecting data integrity and implementing security and availability throughout the entire AI
lifecycle, from development through training, deployment and post-deployment.
Model risks
Threat actors can target AI models for theft, reverse engineering or unauthorized
manipulation. Attackers might compromise a model's integrity by tampering with its architecture,
weights or parameters: the core components that determine a model's behavior, accuracy and
performance.
Operational risks
Like all technologies, models are susceptible to operational risks such as model drift, bias and
breakdowns in the governance structure. Left unaddressed, these risks can lead to system failures and
cybersecurity vulnerabilities that threat actors can exploit.
Ethics and legal risks
If organizations don’t prioritize safety and ethics when developing and deploying AI systems,
they risk committing privacy violations and producing biased outcomes. For example, biased training
data used for hiring decisions might reinforce gender or racial stereotypes and create AI models that
favor certain demographic groups over others.
Weak AI: Also known as "narrow AI," weak AI defines AI systems designed to perform a specific task or
a set of tasks. Examples might include "smart" voice assistant apps, such as Amazon's Alexa, Apple's
Siri, a social media chatbot or the autonomous vehicles promised by Tesla. Strong AI: Also known as
"artificial general intelligence" (AGI) or "general AI," strong AI possesses the ability to understand,
learn and apply knowledge across a wide range of tasks at a level equal to or surpassing human
intelligence. This level of AI is currently theoretical, and no known AI systems approach this level of
sophistication. Researchers argue that if AGI is even possible, it would require major increases in
computing power.
Despite recent advances in AI development, self-aware AI systems of science fiction remain firmly
in that realm.
History of AI
The idea of "a machine that thinks" dates back to ancient Greece. But since the advent of electronic
computing (and relative to some of the topics discussed in this article), important events and milestones
in the evolution of AI include the following:
1950: Alan Turing publishes Computing Machinery and Intelligence. In this paper, Turing, famous for
breaking the German ENIGMA code during WWII and often referred to as the "father of computer science,"
asks the following question: "Can machines think?" From there, he offers a test, now famously known as
the "Turing Test," in which a human interrogator tries to distinguish between a computer and a human
text response. While this test has undergone much scrutiny since it was published, it remains an
important part of the history of AI and an ongoing concept within philosophy, as it draws on ideas
around linguistics.
1956: John McCarthy coins the term "artificial intelligence" at the first-ever AI conference at
Dartmouth College. (McCarthy went on to invent the Lisp language.) Later that year, Allen Newell, J.C.
Shaw and Herbert Simon create the Logic Theorist, the first-ever running AI computer program.
1967: Frank Rosenblatt builds the Mark 1 Perceptron, the first computer based on a neural network that
"learned" through trial and error. Just a year later, Marvin Minsky and Seymour Papert publish a book
titled Perceptrons, which becomes both the landmark work on neural networks and, at least for a while,
an argument against future neural network research initiatives.
1980: Neural networks that use a backpropagation algorithm to train themselves become widely used in
AI applications.
1995: Stuart Russell and Peter Norvig publish Artificial Intelligence: A Modern Approach, which becomes
one of the leading textbooks in the study of AI. In it, they delve into four potential goals or
definitions of AI, which differentiate computer systems based on rationality and on thinking versus
acting.
1997: IBM's Deep Blue beats then world chess champion Garry Kasparov in a chess match (and rematch).
2004: John McCarthy writes a paper, What Is Artificial Intelligence?, and proposes an often-cited
definition of AI. By this time, the era of big data and cloud computing is underway, enabling
organizations to manage ever-larger data estates, which will one day be used to train AI models.
2011: IBM Watson® beats champions Ken Jennings and Brad Rutter at Jeopardy! Also around this time,
data science begins to emerge as a popular discipline.
2015: Baidu's Minwa supercomputer uses a special deep neural network called a convolutional neural
network to identify and categorize images with a higher rate of accuracy than the average human.
2016: DeepMind's AlphaGo program, powered by a deep neural network, beats Lee Sedol, the world
champion Go player, in a five-game match. The victory is significant given the huge number of possible
moves as the game progresses (over 14.5 trillion after just four moves). Google had purchased DeepMind
in 2014 for a reported USD 400 million.
2022: A rise in large language models (LLMs), such as OpenAI's ChatGPT, creates an enormous change in
the performance of AI and its potential to drive enterprise value. With these new generative AI
practices, deep learning models can be pretrained on large amounts of data.
2024: The latest AI trends point to a continuing AI renaissance. Multimodal models that can take
multiple types of data as input are providing richer, more robust experiences. These models bring together computer vision image
recognition and NLP speech recognition capabilities. Smaller models are also making strides in an age of
diminishing returns from massive models with large parameter counts.
SUPERVISED LEARNING
Supervised learning is a machine learning technique that uses human-labeled input and output
datasets to train artificial intelligence models. The trained model learns the underlying relationships
between inputs and outputs, enabling it to predict correct outputs based on new, unlabeled real-
world input data. Labeled data consists of example data points along with the correct outputs or
answers. As input data is fed into the machine learning algorithm, it adjusts its weights until the
model has been fitted appropriately. Labeled training data explicitly teaches the model to identify
the relationships between features and data labels. Supervised machine learning helps organizations
solve various real-world problems at scale, such as classifying spam or predicting stock prices. It
can be used to build highly accurate machine learning models. Supervised learning uses a labeled
training dataset to understand the relationships between inputs and output data. Data scientists
manually create training datasets containing input data along with the corresponding labels.
Supervised learning trains the model to apply the correct outputs to new input data in real-world use
cases. During training, the model's algorithm processes large datasets to explore potential
correlations between inputs and outputs. Then, model performance is evaluated with test data to find
out whether it was trained successfully. Cross-validation is the process of testing a model using a
different portion of the dataset. The gradient descent family of algorithms, including stochastic
gradient descent (SGD), is the most commonly used set of optimization algorithms, or learning
algorithms, for training neural networks and other machine learning models. The model's
optimization algorithm assesses accuracy through the loss function: an equation that measures the
discrepancy between the model's predictions and actual values. The loss function's slope, or gradient,
is the primary metric of model performance. The optimization algorithm descends the gradient to
minimize its value. Throughout training, the optimization algorithm updates the model's
parameters (its operating rules or "settings") to optimize the model.
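The descent described above can be made concrete with a minimal example: stochastic gradient descent fitting a one-parameter linear model to synthetic labeled data. The data, learning rate and epoch count are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic labeled data: y = 3x plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + rng.normal(scale=0.1, size=200)

w = 0.0     # the model's single parameter ("setting")
lr = 0.1    # learning rate: how far each step descends the gradient

for epoch in range(50):
    for i in rng.permutation(len(X)):        # stochastic: one example at a time, in random order
        pred = w * X[i]
        grad = 2 * (pred - y[i]) * X[i]      # gradient of the squared-error loss for this example
        w -= lr * grad                       # step down the gradient

loss = np.mean((w * X - y) ** 2)             # final squared-error loss over the data
```

Each update nudges the parameter in the direction that reduces the loss, so after enough passes the learned w approaches the true slope of 3.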
1. Identify the type of training data to be used for training the model. This data should be similar to the
intended input data that the model will process when ready for use.
2. Assemble the training data and label it to create the labeled training dataset. The training data must
be free of data bias to avoid resultant algorithmic bias and other performance flaws.
3. Create three groups of data: training data, validation data and test data. Validation
assesses the training process for further tuning and adjustment, and testing evaluates the final
model.
7. Monitor model performance and maintain accuracy with regular updates.
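The data split from step 3 can be sketched as follows; the 70/15/15 proportions are a common convention, not a requirement:

```python
import numpy as np

rng = np.random.default_rng(7)

def split(data, val_frac=0.15, test_frac=0.15):
    """Shuffle the examples, then carve off validation and test portions."""
    idx = rng.permutation(len(data))
    n_val = int(len(data) * val_frac)
    n_test = int(len(data) * test_frac)
    val = [data[i] for i in idx[:n_val]]
    test = [data[i] for i in idx[n_val:n_val + n_test]]
    train = [data[i] for i in idx[n_val + n_test:]]
    return train, val, test

examples = list(range(100))    # stand-ins for labeled examples
train, val, test = split(examples)
```

Shuffling before splitting keeps each portion representative, and holding the test set aside until the end keeps the final evaluation honest.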
Consider, for example, an image classification model created to recognize images of vehicles and
determine which type of vehicle they are. Such a model can power the CAPTCHA tests many websites use to
detect spam bots. To train this model, data scientists prepare a labeled training dataset containing
numerous vehicle examples along with the corresponding vehicle type: car, motorcycle, truck, bicycle
and more. The model's algorithm attempts to identify the patterns in the training data that link each
input image to its correct label. The model's output is then measured against actual data values in a
test set to determine whether it has made accurate predictions. If not, the training cycle continues
until the model's performance has reached a satisfactory level of accuracy. The principle of
generalization refers to a model's ability to make appropriate predictions on new data from the same
distribution as its training data.
➢ Regression is used to understand the relationship between dependent and independent variables. In
regression problems, the output is a continuous value, and models attempt to predict the target
output. Regression tasks include projections for sales revenue or financial planning. Linear
regression, logistic regression and polynomial regression are three examples of regression
algorithms.
➢ Because large datasets typically contain many features, data scientists can simplify this complexity
through dimensionality reduction. This data science technique reduces the number of features to
those most crucial for predicting data labels, which preserves accuracy while increasing efficiency.
Supervised learning algorithms
➢ Optimization algorithms such as gradient descent train a wide range of machine learning
algorithms that excel in supervised learning tasks.
➢ Naive Bayes: Naive Bayes is a classification algorithm that adopts the principle of class conditional
independence from Bayes’ theorem. This means that the presence of one feature does not impact the
presence of another in the probability of an outcome, and each predictor has an equal effect on that
result.
➢ Naïve Bayes classifiers include multinomial, Bernoulli and Gaussian Naïve Bayes. This technique is
often used in text classification, spam identification and recommendation systems.
➢ Linear regression: Linear regression is used to identify the relationship between a continuous dependent
variable and one or more independent variables. It is typically used to make predictions about
future outcomes.
➢ Linear regression expresses the relationship between variables as a straight line. When there is one
independent variable and one dependent variable, it is known as simple linear regression. As the
number of independent variables increases, the technique is referred to as multiple linear
regression.
➢ Nonlinear regression: Sometimes, an output cannot be reproduced from linear inputs. In these cases,
outputs must be modeled with a nonlinear function. Nonlinear regression expresses a relationship
between variables through a nonlinear, or curved line. Nonlinear models can handle complex
relationships with many parameters.
➢ Logistic regression: Logistic regression handles categorical dependent variables—when they have binary
outputs, such as true or false or positive or negative. While linear and logistic regression models seek
to understand relationships between data inputs, logistic regression mainly solves binary
classification problems, such as spam identification.
➢ Polynomial regression: Similar to other regression models, polynomial regression models a relationship
between variables on a graph. The functions used in polynomial regression express this relationship
through an exponential degree. Polynomial regression is a subset of nonlinear regression.
➢ Support vector machine (SVM): A support vector machine is used for both data classification and
regression. That said, it usually handles classification problems. Here, SVM separates the classes of
data points with a decision boundary or hyperplane. The goal of the SVM algorithm is to plot the
hyperplane that maximizes the distance between the groups of data points.
➢ K-nearest neighbor: K-nearest neighbor (KNN) is a nonparametric algorithm that classifies data points
based on their proximity and association to other available data. This algorithm assumes that
similar data points can be found near each other when plotted mathematically.
➢ Its ease of use and low calculation time make KNN efficient for recommendation engines and
image recognition. But as the test dataset grows, the processing time lengthens, making it less
appealing for classification tasks.
➢ Random forest: Random forest is a flexible supervised machine learning algorithm used for both
classification and regression purposes. The "forest" references a collection of uncorrelated decision
trees, which are merged to reduce variance and increase accuracy.
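K-nearest neighbor, described above, is simple enough to sketch from scratch. The toy 2-D points and the choice of k = 3 are illustrative:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two visibly separated groups
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]

label = knn_predict(train, (0.5, 0.5))   # its nearest neighbors are all labeled "a"
```

Note that every prediction scans the whole training set, which is exactly why KNN slows down as the dataset grows.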
Supervised learning versus other learning methods
Supervised learning is not the only learning method for training machine learning
models. Other types of machine learning include:
➢ Unsupervised learning
➢ Semisupervised learning
➢ Self-supervised learning
➢ Reinforcement learning
Supervised versus unsupervised learning
The difference between supervised learning and unsupervised learning is that unsupervised
machine learning uses unlabeled data. The model is left to discover patterns and
relationships in the data on its own. Many generative AI models are initially trained with
unsupervised learning and later with supervised learning to increase domain expertise.
Unsupervised learning can help solve clustering or association problems in which common
properties within a dataset are uncertain. Common clustering algorithms are hierarchical, K-means
and Gaussian mixture models.
Supervised versus semi-supervised learning
Semi-supervised learning labels a portion of the input data. Because it can be time-
consuming and costly to rely on domain expertise to label data appropriately for supervised
learning, semi-supervised learning can be an appealing alternative.
Supervised versus self-supervised learning
Self-supervised learning (SSL) mimics supervised learning with unlabeled data. Rather than use
the manually created labels of supervised learning datasets, SSL tasks are configured so that the model
can generate implicit labels from unstructured data. Then, the model's loss function uses those labels
in place of actual labels to assess model performance. Self-supervised learning sees widespread use in
computer vision and natural language processing (NLP) tasks requiring large datasets that are
prohibitively expensive and time-consuming to label.
Supervised versus reinforcement learning
Reinforcement learning trains autonomous agents, such as robots and self-driving cars, to
make decisions through environmental interactions. Reinforcement learning does not use labeled data
and also differs from unsupervised learning in that it teaches by trial-and-error and reward, not by
identifying underlying patterns within datasets.
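The trial-and-error-and-reward loop of reinforcement learning can be sketched with tabular Q-learning on a toy "corridor" environment. The states, reward and hyperparameters below are invented for illustration; real agents such as robots or self-driving cars face vastly larger state spaces:

```python
import random

random.seed(0)

N_STATES = 5            # corridor cells 0..4; the reward waits at the right end
ACTIONS = [-1, +1]      # step left or step right
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # learned value of each action in each state

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Mostly act greedily, but explore a random action 20% of the time
        if random.random() < 0.2:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: learn from the reward plus the best value of the next state
        Q[s][a] += 0.5 * (reward + 0.9 * max(Q[s_next]) - Q[s][a])
        s = s_next

# Greedy policy after training: which action each state prefers
policy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(N_STATES)]
```

No state is ever labeled with a "correct" action; the agent discovers that moving right pays off purely through rewarded trial and error.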
Real-world supervised learning use cases
➢ Supervised learning models can build and advance business applications, including:
➢ Image and object recognition: Supervised learning algorithms can be used to locate, isolate and
categorize objects in videos or images, making them useful for computer vision and image
analysis tasks.
➢ Predictive analytics: Supervised learning models create predictive analytics systems to provide insights.
This allows enterprises to anticipate results based on an output variable and make data-driven
decisions, in turn helping business leaders justify their choices or pivot for the benefit of the
organization.
➢ Regression also allows healthcare providers to predict outcomes based on patient criteria and
historical data. A predictive model might assess a patient’s risk for a specific disease or condition
based on their biological and lifestyle data.
➢ Customer sentiment analysis: Organizations can extract and classify important pieces of information
from large volumes of data—including context, emotion and intent—with minimal human
intervention. Sentiment analysis gives a better understanding of customer interactions and can be
used to improve brand engagement efforts.
➢ Customer segmentation: Regression models can predict customer behavior based on various traits and
historical trends. Businesses can use predictive models to segment their customer base and create
buyer personas to improve marketing efforts and product development.
➢ Spam detection: Spam detection is another example of a supervised learning model. Using supervised
classification algorithms, organizations can train models to recognize patterns or anomalies in new
data and sort spam and non-spam correspondence effectively.
➢ Forecasting: Regression models excel at forecasting based on historical trends, making them suitable
for use in the financial industry. Enterprises can also use regression to predict inventory needs,
estimate employee salaries and avoid potential supply chain hiccups.
➢ Recommendation engines: With supervised learning models in play, content providers and online
marketplaces can analyze customer choices, preferences and purchases and build recommendation
engines that offer tailored recommendations more likely to convert.
UNSUPERVISED LEARNING
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
(ML) algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns
or data groupings without the need for human intervention. Unsupervised learning's ability to discover
similarities and differences in information makes it the ideal solution for exploratory data analysis,
cross-selling strategies, customer segmentation and image recognition.
Exclusive clustering is a form of grouping that stipulates a data point can exist only in one
cluster. This can also be referred to as "hard" clustering. K-means clustering is a common example of
an exclusive clustering method, where data points are assigned into K groups based on their distance
from each group's centroid, with K representing the number of clusters. The data points closest to a
given centroid are clustered under the same category. A larger K value indicates smaller groupings
with more granularity, whereas a smaller K value produces larger groupings with less
granularity. K-means clustering is commonly used in market segmentation, document clustering,
image segmentation and image compression. Overlapping clustering differs from exclusive clustering
in that it allows data points to belong to multiple clusters with separate degrees of membership.
"Soft" or fuzzy k-means clustering is an example of overlapping clustering.
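Exclusive ("hard") K-means assignment can be sketched directly in NumPy. The two synthetic blobs and the choice of K = 2 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans(points, k, iters=20):
    """Assign each point to its nearest centroid, then recompute centroids, repeatedly."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)              # hard assignment: exactly one cluster per point
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated synthetic blobs
points = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(5.0, 0.2, (20, 2))])
labels, centroids = kmeans(points, k=2)
```

Because the assignment is exclusive, each point carries exactly one label; a fuzzy variant would instead keep a degree of membership for every cluster.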
Hierarchical clustering
Hierarchical clustering can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative
clustering starts with each data point as its own cluster and repeatedly merges the most similar
clusters, using one of several linkage criteria:
1. Ward's linkage: This method is defined by the increase in the sum of squares within clusters after
two clusters are merged.
2. Average linkage: This method is defined by the mean distance between two points in each cluster.
3. Complete (or maximum) linkage: This method is defined by the maximum distance between two points in
each cluster.
4. Single (or minimum) linkage: This method is defined by the minimum distance between two points
in each cluster.
Euclidean distance is the most common metric used to calculate these distances; however, other
metrics, such as Manhattan distance, are also cited in clustering literature. Divisive clustering can be
defined as the opposite of agglomerative clustering; instead, it takes a "top-down" approach. In this
case, a single data cluster is divided based on the differences between data points. Divisive
clustering is not commonly used, but it is still worth noting in the context of hierarchical clustering.
These clustering processes are usually visualized using a dendrogram, a tree-like diagram that
documents the merging or splitting of data points at each iteration.
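The merging behavior of agglomerative clustering can be sketched with single linkage on one-dimensional data; the numbers and the stopping point of two clusters are illustrative:

```python
def single_linkage(a, b):
    """Cluster distance = minimum distance between any pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_clusters):
    """Start with every point in its own cluster; repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
    return clusters

clusters = agglomerative([1.0, 1.5, 10.0, 10.4], target_clusters=2)
```

Recording the order of these merges is precisely what a dendrogram visualizes; swapping `min` for `max` or a mean in `single_linkage` gives complete or average linkage instead.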
Probabilistic clustering
A Gaussian mixture model (GMM) is a common probabilistic clustering method. Mixture models
are made up of an unspecified number of probability distribution functions. GMMs are primarily
leveraged to determine which Gaussian, or normal, probability distribution a given data point belongs
to. If the mean or variance are known, then we can determine which distribution a given data point
belongs to. However, in GMMs, these variables are not known, so we assume that a latent, or hidden,
variable exists to cluster data points appropriately. While it is not required to use the
Expectation-Maximization (EM) algorithm, it is commonly used to estimate the assignment probabilities
for a given data point to a particular data cluster.
Association Rules
An association rule is a rule-based method for finding relationships between variables
in a given dataset. These methods are frequently used for market basket analysis, allowing companies
to better understand relationships between different products. Understanding consumption habits of
customers enables businesses to develop better cross-selling strategies and recommendation engines.
Examples of this can be seen in Amazon's "Customers Who Bought This Item Also Bought" or
Spotify’s "Discover Weekly" playlist. While there are a few different algorithms used to generate
association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm is most widely used.
Apriori algorithms
➢ Apriori algorithms have been popularized through market basket analyses, leading to
different recommendation engines for music platforms and online retailers. They are used within
transactional datasets to identify frequent itemsets, or collections of items, to identify the likelihood
of consuming a product given the consumption of another product. For example, if I play Black Sabbath's radio on Spotify, starting with their song "Orchid," one of the other songs on this channel will likely be a Led Zeppelin song, such as "Over the Hills and Far Away." This is based on my prior listening habits as well as those of others. Apriori algorithms use a hash tree to count itemsets,
navigating through the dataset in a breadth-first manner.
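The frequent-itemset step of Apriori can be sketched in plain Python. This is a toy implementation over an invented basket dataset; a production workload would use an optimized library, and the 0.6 support threshold is arbitrary.

```python
def apriori(transactions, min_support):
    """Return all itemsets (as frozensets) meeting min_support (toy sketch)."""
    n = len(transactions)
    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    result = set(freq)
    k = 2
    while freq:
        # Candidate generation: join frequent (k-1)-itemsets, then prune
        # by re-counting support over the transactions (breadth-first levels).
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        result |= freq
        k += 1
    return result

# Invented toy baskets for illustration.
baskets = [frozenset(t) for t in [
    {"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"},
    {"milk", "butter"}, {"bread", "milk"},
]]
frequent = apriori(baskets, min_support=0.6)
```

Here {bread, milk} appears in 3 of 5 baskets (support 0.6) and survives, while {milk, butter} (support 0.4) is pruned; real Apriori implementations add the hash-tree counting structure described above.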
Dimensionality reduction
While more data generally yields more accurate results, it can also degrade the performance of machine learning algorithms (e.g. through overfitting) and make it difficult to visualize datasets.
Dimensionality reduction is a technique used when the number of features, or dimensions, in a given
dataset is too high. It reduces the number of data inputs to a manageable size while also preserving
the integrity of the dataset as much as possible. It is commonly used in the preprocessing data stage,
and there are a few different dimensionality reduction methods that can be used, such as:
Principal component analysis
Principal component analysis (PCA) is a type of dimensionality reduction algorithm which is
used to reduce redundancies and to compress datasets through feature extraction. This method uses a
linear transformation to create a new data representation, yielding a set of "principal components."
The first principal component is the direction which maximizes the variance of the dataset. While the
second principal component also finds the maximum variance in the data, it is completely
uncorrelated to the first principal component, yielding a direction that is perpendicular, or orthogonal,
to the first component. This process repeats based on the number of dimensions, where each next principal component is the direction orthogonal to the prior components with the most variance.
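A minimal PCA sketch computes the principal components from the eigendecomposition of the covariance matrix. This is illustrative only (in practice a library implementation would be used), and the synthetic dataset is invented: 3-D points that mostly vary along a single direction.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (toy sketch)."""
    Xc = X - X.mean(axis=0)               # center the data
    cov = np.cov(Xc, rowvar=False)        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]     # sort directions by descending variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, components    # projected data, directions

rng = np.random.default_rng(1)
# 3-D data lying (up to small noise) along the direction (2, 1, 0.5).
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.05 * rng.normal(size=(200, 3))
Z, comps = pca(X, n_components=1)
```

The first recovered component should align (up to sign) with the direction of greatest variance, here (2, 1, 0.5) normalized.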
Singular value decomposition
Singular value decomposition (SVD) is another dimensionality reduction approach which
factorizes a matrix, A, into three matrices. SVD is denoted by the formula A = USV^T, where U and V are orthogonal matrices and S is a diagonal matrix whose diagonal entries are the singular values of matrix A. Similar to PCA, it is commonly used to reduce noise and compress data, such as
image files.
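The factorization and its use for compression can be sketched with NumPy's built-in SVD: keeping only the largest singular value yields the best rank-1 approximation of the matrix. The small matrix here is invented for illustration.

```python
import numpy as np

# An arbitrary 3x2 matrix to factorize.
A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U S V^T

# Reconstruct A exactly from its factors.
A_full = U @ np.diag(s) @ Vt
# Rank-1 approximation: keep only the largest singular value (compression).
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
```

Discarding the smaller singular values is exactly how SVD-based compression trades reconstruction error for storage: the spectral-norm error of the rank-1 approximation equals the largest discarded singular value.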
Autoencoders
Autoencoders leverage neural networks to compress data and then recreate a new representation of the original data's input. The hidden layer specifically acts as a bottleneck that compresses the input layer prior to reconstruction within the output layer. The stage from the input layer to the hidden layer is referred to as "encoding," while the stage from the hidden layer to the output layer is known as "decoding."
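The bottleneck idea can be sketched with a toy linear autoencoder trained by plain gradient descent in NumPy. Real autoencoders use nonlinear layers and a deep learning framework; the data, dimensions, learning rate and iteration count here are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 3-D data that lies on a single line, so a 1-D bottleneck suffices.
X = rng.normal(size=(500, 1)) @ np.array([[1.0, 2.0, 3.0]])

W_enc = rng.normal(scale=0.1, size=(3, 1))  # input -> 1-D bottleneck ("encoding")
W_dec = rng.normal(scale=0.1, size=(1, 3))  # bottleneck -> output ("decoding")
lr = 0.02
for _ in range(3000):
    Z = X @ W_enc                 # encode through the bottleneck
    X_hat = Z @ W_dec             # decode back to 3-D
    err = (X_hat - X) / len(X)    # scaled reconstruction error
    grad_dec = Z.T @ err          # gradients of the mean squared error
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Reconstruction error after training; near zero for this rank-1 data.
mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

Because the data actually lives in one dimension, squeezing it through the 1-D bottleneck loses almost nothing; with higher-dimensional structure, the bottleneck width controls how much information the encoding can preserve.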
Applications of unsupervised learning
➢ News sections: Unsupervised learning can categorize articles on the same story from various online news outlets; for example, grouping different outlets' coverage of a presidential election.
➢ Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such as object
recognition.
➢ Medical imaging: Unsupervised machine learning provides essential features to medical imaging devices, such as image detection, classification and segmentation, used in radiology and pathology to diagnose patients quickly and accurately.
➢ Anomaly detection: Unsupervised learning models can comb through large amounts of data and discover atypical data points within a dataset. These anomalies can raise awareness around faulty equipment, human error or breaches in security.
➢ Customer personas: Defining customer personas makes it easier to understand common traits and business clients' purchasing habits. Unsupervised learning allows businesses to build better buyer persona profiles, enabling organizations to align their product messaging more appropriately.
➢ Recommendation Engines: Using past purchase behavior data, unsupervised learning can help to discover
data trends that can be used to develop more effective cross-selling strategies. This is used to make
relevant add-on recommendations to customers during the checkout process for online retailers.
SEMI-SUPERVISED LEARNING
Semi-supervised learning is a branch of machine learning that combines supervised and unsupervised
learning by using both labeled and unlabeled data to train artificial intelligence (AI) models for classification
and regression tasks. While semi-supervised learning is generally employed for the same use cases in
which one might otherwise use supervised learning methods, it’s distinguished by various techniques that
incorporate unlabeled data into model training, in addition to the labeled data required for conventional
supervised learning. Semi-supervised learning methods are especially relevant in situations where
obtaining a sufficient amount of labeled data is prohibitively difficult or expensive, but large amounts of
unlabeled data are relatively easy to acquire. In such scenarios, neither fully supervised nor unsupervised
learning methods will provide adequate solutions.
While a supervised classification model can technically learn a decision boundary using only a few labeled data points, it might not generalize well to real-world examples, making the model's predictions unreliable. The classic "half-moons" dataset visualizes the shortcomings of supervised models relying on too few labeled data points. Though the "correct" decision boundary would separate each of the two half-moons, a supervised learning model is likely to overfit the few labeled data points available. The unlabeled data points clearly convey helpful context, but a traditional supervised algorithm cannot process unlabeled data. Using only the very limited labeled data points available, a supervised model may learn a decision boundary that generalizes poorly and is prone to misclassifying new examples.
Semi-supervised learning vs unsupervised learning
Unlike semi-supervised (and fully supervised) learning, unsupervised learning algorithms use neither labeled data nor loss functions. Unsupervised learning eschews any "ground truth" context against which model accuracy can be measured and optimized. An increasingly common semi-supervised approach, particularly for large language models, is to "pre-train" models via unsupervised tasks that require the model to learn meaningful representations of unlabeled data sets. When such tasks involve a "ground truth" and loss function (without manual data annotation), they're called self-supervised learning. After subsequent "supervised fine-tuning" on a small amount of labeled data, pre-trained models can often achieve performance comparable to fully supervised models. While unsupervised learning methods can be useful in many scenarios, that lack of context can make them ill-suited to classification on their own. Take, for example, how a typical clustering algorithm (grouping data points into a pre-determined number of clusters based on their proximity to one another) would treat the half-moons dataset. A typical unsupervised algorithm, k-means clustering, might incorrectly group data points together based only on their relative closeness to "average" data points (centroids).
Semi-supervised learning vs self-supervised learning
Both semi- and self-supervised learning aim to circumvent the need for large amounts of labeled data, but whereas semi-supervised learning trains on a mix of labeled and unlabeled data, self-supervised learning derives its own supervisory signals from the structure of the unlabeled data itself, through so-called pretext tasks. When combined with supervised downstream tasks, self-supervised pretext tasks thus comprise part of a semi-supervised learning process: a learning method using both labeled and unlabeled data for model training.
How does semi-supervised learning work?
Semi-supervised learning relies on certain assumptions about the unlabeled data used to train
the model and the way data points from different classes relate to one another. A necessary condition of
semi-supervised learning (SSL) is that the unlabeled examples used in model training must be relevant
to the task the model is being trained to perform. In more formal terms, SSL requires that the
distribution p(x) of the input data must contain information about the posterior distribution
p(y|x)—that is, the conditional probability of a given data point (x) belonging to a certain class (y). So,
for example, if one is using unlabeled data to help train an image classifier to differentiate between
pictures of cats and pictures of dogs, the training dataset should contain images of both cats and
dogs, and images of horses and motorcycles will not be helpful. Accordingly, while a 2018 study of semi-supervised learning algorithms found that "increasing the amount of unlabeled data tends to improve the performance of SSL techniques," it also found that "adding unlabeled data from a
mismatched set of classes can actually hurt performance compared to not using any unlabeled data at all."[1] The basic condition of p(x) having a meaningful relationship to p(y|x) gives rise to multiple assumptions about the nature of that relationship. These assumptions are the driving force behind most, if not all, SSL methods: generally speaking, any semi-supervised learning algorithm relies on one or more of the following assumptions being explicitly or implicitly satisfied.
Cluster assumption
The cluster assumption states that data points belonging to the same cluster (a set of data points more similar to each other than they are to other available data points) will also belong to the same class. While sometimes considered to be its own independent assumption, the cluster assumption has also been described by van Engelen and Hoos as "a generalization of the other assumptions."[2] In this view, the determination of data point clusters depends on which notion of similarity is being used: the smoothness assumption, low-density assumption and manifold assumption each simply leverage a different definition of what comprises a "similar" data point.
Smoothness assumption
The smoothness assumption states that if two data points, x and x', are close to each other in the input space (the set of all possible values for x), then their labels, y and y', should be the same. This assumption, also known as the continuity assumption, is common to most supervised learning: for example, classifiers learn a meaningful approximation (or "representation") of each relevant class during training; once trained, they determine the classification of new data points via the representation they most closely resemble. In the context of SSL, the smoothness assumption has the added benefit of being applied transitively to unlabeled data. Consider a scenario involving three data points:
➢ a labeled data point, x1
➢ an unlabeled data point, x2, that's close to x1
➢ another unlabeled data point, x3, that's close to x2 but not close to x1
The smoothness assumption tells us that x2 should have the same label as x1. It also tells us that x3 should have the same label as x2. Therefore, we can assume that all three data points have the same label: x1's label is transitively propagated to x3 via x3's proximity to x2.
Low-density assumption
The low-density assumption states that the decision boundary between classes should not pass through high-density regions. Put another way, the decision boundary should lie in an area that contains few data points. The low-density assumption could thus be thought of as an extension of the cluster assumption (in that a high-density cluster of data points represents a class, rather than the boundary between classes) and the smoothness assumption (in that if multiple data points are near each other, they should share a label, and thus fall on the same side of the decision boundary). Together, the smoothness and low-density assumptions can inform a far more intuitive decision boundary than would be possible with supervised methods that can only consider the (very few) labeled data points. (Source: van Engelen, et al, 2018)
Manifold assumption
The manifold assumption states that the higher-dimensional input space comprises multiple lower-dimensional manifolds on which all data points lie, and that data points on the same manifold share the same label. As an intuitive example, consider a piece of paper crumpled up into a ball. The location of any point on the crumpled surface can only be mapped with three-dimensional x, y, z coordinates. But if that crumpled-up ball is flattened back into a sheet of paper, those same points can be mapped with two-dimensional x, y coordinates. This is called dimensionality reduction, and it can be achieved mathematically using methods like autoencoders or convolutions.
In machine learning, dimensions correspond not to the familiar physical dimensions, but to each
attribute or feature of data. For example, in machine learning, a small RGB image measuring 32x32
pixels has 3,072 dimensions: 1,024 pixels, each of which has three values (for red, green and blue).
Comparing data points with so many dimensions is challenging, both because of the complexity and computational resources required and because most of that high-dimensional space does not contain information meaningful to the task at hand. The manifold assumption holds that when a model learns the proper dimensionality-reduction function to discard irrelevant information, disparate data points converge to a more meaningful representation for which the other SSL assumptions are more reliable. Mapping the data points to a lower-dimensional manifold can provide a more accurate decision boundary, which can then be translated back to higher-dimensional space. (Source: van Engelen, et al, 2018)
Transductive learning
Transductive learning methods use available labels to discern label predictions for a given set of unlabeled data points, so that they can be used by a supervised base learner. Whereas inductive methods aim to train a classifier that can model the entire (labeled and unlabeled) input space, transductive methods aim only to yield label predictions for unlabeled data. The algorithms used for transductive learning are largely unrelated to the algorithm(s) to be used by the supervised classifier model to be trained using this newly labeled data.
Label propagation
Label propagation is a graph-based algorithm that computes label assignments for unlabeled
data points based on their relative proximity to labeled data points, using the smoothness assumption
and cluster assumption. The intuition behind the algorithm is that one can map a fully connected
graph in which the nodes are all available data points, both labeled and unlabeled. The closer two
nodes are based on some chosen measure of distance, like Euclidean distance, the more heavily the edge between them is weighted in the algorithm. Starting from the labeled data points, labels then iteratively propagate through neighboring unlabeled data points, using the smoothness and cluster assumptions. (Figure: left, the original labeled and unlabeled data points; right, after label propagation, the unlabeled data points have been assigned pseudo-labels.)
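The procedure can be sketched on a tiny one-dimensional dataset. This is a simplified version of the algorithm: the Gaussian kernel, clamping scheme and iteration count are illustrative assumptions, and the data are invented to form two obvious groups.

```python
import numpy as np

X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])   # two well-separated groups
labels = {0: 0, 3: 1}                          # only one labeled point per group

# Fully connected similarity graph: closer points get heavier edges.
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)           # row-normalized edge weights

Y = np.zeros((len(X), 2))                      # per-point class-probability table
for i, y in labels.items():
    Y[i, y] = 1.0
for _ in range(100):
    Y = P @ Y                                  # propagate labels along edges
    for i, y in labels.items():                # clamp labeled points to ground truth
        Y[i] = 0.0
        Y[i, y] = 1.0
pseudo = Y.argmax(axis=1)                      # pseudo-labels for every point
```

Because within-group edges vastly outweigh cross-group edges, the two labels spread to their respective neighbors, assigning class 0 to the first three points and class 1 to the last three.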
Active learning
Active learning algorithms do not automate the labeling of data points: instead, they are used
in SSL to determine which unlabeled samples would provide the most helpful information if
manually labeled.[3] The use of active learning in semi-supervised settings has achieved promising
results: for example, a recent study found that it more than halved the amount of labeled data
required to effectively train a model for semantic segmentation.[4]
Inductive learning
Inductive methods of semi-supervised learning aim to directly train a classification (or regression) model, using both labeled and unlabeled data. Inductive SSL methods can generally be differentiated by the way in which they incorporate unlabeled data: via a pseudo-labeling step, an unsupervised pre-processing step, or by direct incorporation into the model's objective function.
Wrapper methods
A relatively simple way to extend existing supervised algorithms to a semi-supervised setting is to first train the model on the available labeled data (or simply use a suitable pre-existing classifier) and then generate pseudo-label predictions for unlabeled data points. The model can then be re-trained using both the originally labeled data and the pseudo-labeled data, not differentiating between the two. The primary benefit of wrapper methods, beyond their simplicity, is that they are compatible with nearly any type of supervised base learner. Most wrapper methods introduce some regularization techniques to reduce the risk of reinforcing potentially inaccurate pseudo-label predictions.
Self-training
Self-training is a basic wrapper method. It requires probabilistic, rather than deterministic, pseudo-label predictions: for example, a model that outputs "85 percent dog, 15 percent cat" instead of simply outputting "dog." Probabilistic pseudo-label predictions allow self-training algorithms to accept only predictions that exceed a certain confidence threshold, in a process akin to entropy minimization.[5] This process can be done iteratively, to either optimize the pseudo-classification process or reach a certain number of pseudo-labeled samples.
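A minimal self-training loop might look like the following sketch. The nearest-centroid base learner, the softmax over negative distances, the 0.9 confidence threshold and the two-cluster data are all illustrative assumptions, not a prescribed recipe.

```python
import numpy as np

def fit_centroids(X, y):
    # Base learner: the mean of each class's points (classes 0 and 1).
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(X, centroids):
    # Turn negative distances into class probabilities with a softmax.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])     # one labeled point per class
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
X_all, y_true = X_unl.copy(), np.array([0] * 20 + [1] * 20)

for _ in range(3):                             # a few self-training rounds
    centroids = fit_centroids(X_lab, y_lab)
    proba = predict_proba(X_unl, centroids)
    confident = proba.max(axis=1) > 0.9        # accept confident predictions only
    if not confident.any():
        break
    # Absorb confident pseudo-labels into the labeled set, then re-fit.
    X_lab = np.vstack([X_lab, X_unl[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unl = X_unl[~confident]

final_preds = predict_proba(X_all, fit_centroids(X_lab, y_lab)).argmax(axis=1)
```

The confidence threshold is what keeps the loop from reinforcing its own mistakes: only points the current model is nearly certain about are promoted to the labeled set before re-fitting.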
Co-training
Co-training methods extend the self-training concept by training multiple supervised base learners to assign pseudo-labels. This diversification is intended to reduce the tendency to reinforce poor initial predictions. It's therefore important that the predictions of each base learner not be strongly correlated with one another. A typical approach is to use different algorithms for each classifier. Another is for each classifier to focus on a different subset of the data: for example, in video data, training one base learner on visual data and the other on audio data.
Unsupervised pre-processing
Unlike wrapper methods (and intrinsically semi-supervised algorithms), which use labeled and unlabeled data simultaneously, some SSL methods use unlabeled and labeled data in separate stages: an unsupervised pre-processing stage, followed by a supervised stage. Like wrapper methods, such techniques can essentially be used for any supervised base learner. But in contrast to wrapper methods, the "main" supervised model is ultimately trained only on originally (human-annotated) labeled data points. Such pre-processing techniques range from extracting useful features from unlabeled data to pre-clustering unlabeled data points to using "pre-training" to determine the initial parameters of a supervised model (in a process akin to the pretext tasks performed in self-supervised learning).
Cluster-then-label
One straightforward semi-supervised technique involves clustering all
data points (both labeled and unlabeled) using an unsupervised algorithm. Leveraging the
clustering assumption, those clusters can be used to help train an independent classifier model—or,
if the labeled data points in a given cluster are all of the same class, pseudo-labeling the unlabeled
data points and proceeding in a manner similar to wrapper methods. As demonstrated by the "half-moons" example earlier in this article, simple methods (like k-nearest neighbors) may yield inadequate predictions. More refined clustering algorithms, like DBSCAN (which implements the low-density assumption),[6] have achieved greater reliability.
Pre-training and feature extraction
Unsupervised (or self-supervised) pre-training allows models to learn useful representations of the input space, reducing the amount of labeled data needed to fine-tune a model with supervised learning. A common approach is to employ a neural network, often an autoencoder, to learn an embedding or feature representation of the input data, then use these learned features to train a supervised base learner. This often entails dimensionality reduction, helping to make use of the manifold assumption.
Intrinsically semi-supervised methods
Some SSL methods directly incorporate unlabeled data into the objective function of the base learner, rather than processing unlabeled data in a separate pseudo-labeling or pre-processing step.
Semi-supervised support vector machines
When data points of different categories are not linearly separable (when no straight line can neatly, accurately define the boundary between categories), support vector machine (SVM) algorithms map data to a higher-dimensional feature space in which the categories can be separated by a hyperplane. In determining this decision boundary, SVM algorithms maximize the margin between the decision boundary and the data points closest to it. This, in practice, applies the low-density assumption. In a supervised setting, a regularization term penalizes the algorithm when labeled data points fall on the wrong side of the decision boundary. In semi-supervised SVMs (S3VMs), this isn't possible for unlabeled data points (whose classification is unknown); thus, S3VMs also penalize data points that lie within the prescribed margin.
Intrinsically semi-supervised deep learning models
A variety of neural network architectures have been adapted for semi-supervised learning. This is achieved by adding or modifying the loss terms typically used in these architectures, allowing for the incorporation of unlabeled data points in training. Proposed semi-supervised deep learning architectures include ladder networks, pseudo-ensembles, temporal ensembling, and select modifications to generative adversarial networks (GANs).
REINFORCEMENT LEARNING
In reinforcement learning, an agent learns to make decisions by interacting with an
environment. It is used in robotics and other decision-making settings. Reinforcement learning (RL)
is a type of machine learning process that focuses on decision making by autonomous agents. An
autonomous agent is any system that can make decisions and act in response to its environment
independent of direct instruction by a human user. Robots and self-driving cars are examples of
autonomous agents. In reinforcement learning, an autonomous agent learns to perform a task by trial
and error in the absence of any guidance from a human user.[1] It particularly addresses sequential
decision-making problems in uncertain environments, and shows promise in artificial intelligence
development.
In Markov decision processes, state space refers to all of the information provided by an
environment’s state. Action space denotes all possible actions the agent may take within a state.
Exploration-exploitation trade-off
Because an RL agent has no manually labeled input data guiding
its behavior, it must explore its environment, attempting new actions to discover those that receive
rewards. From these reward signals, the agent learns to prefer actions for which it was rewarded in
order to maximize its gain. But the agent must continue exploring new states and actions as well. In
doing so, it can then use that experience to improve its decision-making. RL algorithms thus
require an agent to both exploit knowledge of previously rewarded state-actions and explore other
state-actions. The agent cannot exclusively pursue exploration or exploitation. It must continuously
try new actions while also preferring single (or chains of) actions that produce the largest
cumulative reward.[6]
Components of reinforcement learning
Beyond the agent-environment-goal triumvirate, four principal sub-elements characterize reinforcement learning problems.
Policy. This defines the RL agent's behavior by mapping perceived environmental states to specific actions the agent must take when in those states. It can take the form of a rudimentary function or a more involved computational process. For instance, a policy guiding an autonomous vehicle may map pedestrian detection to a stop action.
Reward signal. This designates the RL problem's goal. Each of the RL agent's actions either receives a reward from the environment or not. The agent's only objective is to maximize its cumulative rewards from the environment. For self-driving vehicles, the reward signal can be reduced travel time, decreased collisions, remaining on the road and in the proper lane, avoiding extreme decelerations or accelerations, and so forth. This example shows that RL may incorporate multiple reward signals to guide an agent.
Value function. The reward signal differs from the value function in that the former denotes immediate benefit while the latter specifies long-term benefit. Value refers to a state's desirability per all of the states (with their incumbent rewards) that are likely to follow. An autonomous vehicle may be able to reduce travel time by exiting its lane, driving on the sidewalk and accelerating quickly, but these latter three actions may reduce its overall value function. Thus, the vehicle as an RL agent may accept marginally longer travel time to increase its reward in the latter three areas.
Model. This is an optional sub-element of reinforcement learning systems. Models allow agents to predict environment behavior for possible actions. Agents then use model predictions to determine possible courses of action based on potential outcomes. This can be the model guiding the autonomous vehicle that helps it predict the best routes, what to expect from surrounding vehicles given their position and speed, and so forth.[7] Some model-based approaches use direct human feedback in initial learning and then shift to autonomous learning.
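The exploration-exploitation trade-off can be illustrated with an epsilon-greedy agent on a toy two-armed bandit. The reward probabilities, epsilon value and step count are invented for illustration; this is a sketch of the idea, not a production RL algorithm.

```python
import random

random.seed(0)
true_reward = [0.3, 0.7]      # hidden payoff probability of each arm
estimates = [0.0, 0.0]        # agent's running value estimates per arm
counts = [0, 0]               # how often each arm has been pulled
epsilon = 0.1                 # fraction of steps spent exploring

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)                        # explore: random arm
    else:
        arm = max(range(2), key=lambda a: estimates[a])  # exploit: best estimate
    reward = 1.0 if random.random() < true_reward[arm] else 0.0
    counts[arm] += 1
    # Incremental mean update of the chosen arm's value estimate.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
```

Exploration (the epsilon branch) is what lets the agent discover that the second arm pays better; exploitation then concentrates pulls on it, so its value estimate converges toward the true 0.7 payoff rate.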
Online versus offline learning
There are two general methods by which an agent collects data for learning policies:
Online. Here, an agent collects data directly from interacting with its surrounding environment. This
data is processed and collected iteratively as the agent continues interacting with that environment.
Offline. When an agent does not have direct access to an environment, it can learn through logged data
of that environment. This is offline learning. A large subset of research has turned to offline learning
given practical difficulties in training models through direct interaction with environments.