Statistical Programming in R Course Guide
Prepared by: Mr. Nripesh Kumar Nrip, Program Coordinator
Forwarded by: Dr. Daljeet Singh Bawa, HOD
Approved by: Dr. Yamini Agarwal, Director
Modelled on the architecture and layout of Nalanda Vishwa Vidyalaya, the institute is a scenic marvel of lush green landscape with modern interiors. The Institute, which is ISO 9001:2015 certified, is under the ambit of Bharati Vidyapeeth University (BVU), Pune, as approved by the Govt. of India on the recommendation of the UGC under Section 3 of the UGC Act, vide its letter notification No. F. 9 – 16 / 2004 – U3 dated 25th February 2005.
Strategically located in West Delhi on the main Rohtak Road, BVIMR, New Delhi has a splendid layout on a sprawling four-acre plot with state-of-the-art facilities; all classrooms, the library, labs and the auditorium are fully air-conditioned. The adjacent Metro station, "Paschim Vihar (East)", connects the Institute to the whole of Delhi and the NCR.
We nurture our learners to be job providers rather than job seekers. We do this by fostering skills and enhancing the knowledge base of our students through curricular, co-curricular and extracurricular activities led by our faculty, who keep themselves abreast of developments through research, FDPs and attendance at seminars and conferences. The alumni play a key role here through the SAARTHI Mentorship programme, which helps create a learner-centric academic environment and bridges the industry-academia gap.
Our faculty make a distinctive contribution not only to students but also to academia through publications, seminars and conferences, apart from delivering quality education. We also believe in enhancing corporate-level interaction, including industrial projects undertaken by our students under the continuous guidance of our faculty. These efforts have made the Institute one of the premier institutes of management.
Dr. Rakhee Chhibber (Guest Faculty, BVIMR)
Programme: BCA CBCS – Revised Syllabus w.e.f. Year 2022 – 2023
Semester: V
Course Code: 504-3-A
Course Title: Statistical Programming using R (Data Science)
Prepared by: Dr. [Link]
Type: DSE | Credits: 3 | Evaluation: IA | Marks: 100
Course Objectives:
To teach beginners R programming up to a mastery level.
To cover a variety of topics that are important for data science and prepare students for real-life prediction and data-engineering tasks.
To impart knowledge of the concepts related to probability and their application to data sets.
To give an idea of how data is managed in various environments, with emphasis on prediction measures as implemented on data sets.
Course Outcomes:
CO1: Remember the definitions of concepts and their implementation in R.
CO2: Understand the concept of data and the statistical techniques for its implementation.
CO3: Design different data behaviours and their predictions.
CO4: Analyze data sets and study historical data.
CO5: Convert historical data into a prediction model using R.
Unit No. | Unit | Sessions (Hrs.) | COs | Teaching Methodology | Cognition Level | Evaluation Tools

Unit 1 | Introduction of Probability Concept, Types of Probability, Permutation and Combination concept, Addition and Multiplication Theorem, Conditional Probability, Bayes's Theorem | 8 | CO1, CO2 | Lecture with PPTs | Understand | Problems and their Solutions

Unit 2 | Random Variable Concept, Discrete and Continuous Random Variables, Probability density function, Mathematical Expectation and related Theorems | 5 | CO1, CO2 | Problem Illustration | Apply (Analyze) | Problems and their Solutions

Unit 3 | Data Distribution: Types of Data distribution, Exponential distribution, Binomial distribution, Normal distribution, Poisson distribution, Random number generation, Monte Carlo Simulation | 7 | CO3 | Concept Explanation, Mathematical Problems and their Solutions | Analyze | Problems and their Solutions

Unit 6 | Graphical Analysis using R: Basic Plotting, Manipulating the plotting window, Box Whisker Plots, Scatter Plots, Pair Plots, Pie Charts, Bar Charts | 5 | CO5 | Software Demonstration and use of the R Language | Evaluate | Problems and their Solutions

Unit 7 | Advanced R: Statistical models in R, Correlation and regression analysis, Analysis of Variance (ANOVA), creating data for complex analysis, Summarizing data, and case studies | 10 | CO5 | Software Demonstration and use of the R Language | Evaluate | Problems and their Solutions
1. CO-PO Mapping
CO/PO | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12
CO1   |  3  |  3  |  -  |  -  |  1  |  -  |  2  |  -  |  -  |  -   |  -   |  -
CO2   |  2  |  3  |  2  |  -  |  1  |  -  |  2  |  -  |  -  |  -   |  -   |  -
CO3   |  3  |  3  |  3  |  3  |  2  |  1  |  2  |  -  |  -  |  -   |  -   |  -
CO4   |  1  |  3  |  3  |  -  |  2  |  -  |  -  |  -  |  -  |  -   |  -   |  -
CO5   |  1  |  3  |  3  |  3  |  3  |  1  |  -  |  -  |  -  |  -   |  -   |  -
CO (average) | 2 | 3 | 2 | 1 | 2 | 1 | 1 | 0
2. Evaluation
Internals: 40%
Externals: 60%
Total : 100%
3. Assessment Mapping
Component | Total Marks | CO1 | CO2 | CO3 | CO4 | CO5
Test      | 10          | 2   | 2   | 2   | 3   | 1
Internal  | 40          | 7   | 9   | 9   | 10  | 4
4. Rationale for Mapping Program Outcomes and Course Outcomes:
CO1 & PO1, mapped at 3: These objectives and outcomes provide students with a comprehensive understanding of programming fundamentals and their practical applications.
CO1 & PO2, mapped at 3: These objectives and outcomes prepare students to excel in both programming fundamentals and problem-solving skills in the field of computer science.
CO1 & PO5, mapped at 1: These objectives and outcomes work together to prepare students to excel in both the foundational aspects of programming and the application of cutting-edge technology in software development.
CO1 & PO7, mapped at 2: These objectives and outcomes work together to prepare students to excel in programming fundamentals and to maintain relevance and competitiveness as computing professionals.
CO2 & PO1, mapped at 2: These focus on the importance of applying mathematical and computational knowledge to conceptualize and address problems in various domains, ensuring that students can apply what they've learned effectively.
CO2 & PO2, mapped at 3: These objectives and outcomes prepare students for success in computer science and problem-solving.
CO2 & PO3, mapped at 2: These objectives and outcomes prepare students for success in the field of computer science, problem-solving, and technology integration.
CO2 & PO5, mapped at 1: These objectives and outcomes improve students' understanding and retention of core computer science concepts, dynamic programming, and more.
Mapped at 1 driven problem-solving ensuring that students make
responsible and compliant decisions in their computing
practices.
CO2 & PO7, mapped at 2: These objectives and outcomes improve students' understanding and emphasize the need for ongoing learning and adaptation within the ever-evolving computing industry, preparing students to excel as computing professionals.
CO3 & PO1, mapped at 3: These objectives and outcomes prepare students for proficiency in data structures and computational problem-solving.
CO3 & PO2, mapped at 3: These objectives and outcomes prepare students to excel in data structures and problem-solving in the field of computer science.
CO3 & PO3, mapped at 3: These objectives and outcomes prepare students for success in data structures, problem-solving, and technology integration in practical contexts.
CO3 & PO4, mapped at 3: These objectives and outcomes provide students with a strong foundation in data structures and emphasize the ability to use scientific methods to experiment, collect data, and draw meaningful conclusions, ensuring a comprehensive skill set in the field of computer science.
CO3 & PO5, mapped at 2: These objectives and outcomes equip students with a strong foundation in software development in today's rapidly evolving technological landscape.
CO3 & PO7, mapped at 2: These objectives and outcomes ensure that students are well-prepared for success as computing professionals.
CO4 & PO1, mapped at 1: These objectives and outcomes prepare students for proficiency in handling data in this common format.
CO4 & PO2, mapped at 3: These objectives and outcomes prepare students for proficiency in working with CSV data and addressing related challenges.
CO4 & PO3, mapped at 3: These objectives and outcomes prepare students to excel in data analysis and problem-solving in the context of emerging technologies and business scenarios.
CO4 & PO5, mapped at 2: These objectives and outcomes focus on teaching students how to analyze problems based on CSV files, and techniques for developing innovative software solutions, ensuring students are well-prepared for success in data analysis.
CO5 & PO1, mapped at 1: These objectives and outcomes ensure that students can make well-informed decisions about data structures in their problem-solving processes.
CO5 & PO2, mapped at 3: These objectives and outcomes prepare students for proficiency in data structure selection and problem-solving in the field of computer science.
CO5 & PO3, mapped at 3: These objectives and outcomes prepare students to excel in data structure selection and problem-solving using emerging technologies in practical contexts.
CO5 & PO4, mapped at 3: These objectives and outcomes ensure students are well-prepared for effective data-driven problem-solving.
CO5 & PO5, mapped at 3: These objectives and outcomes prepare students for effective data-driven problem-solving and software development in a rapidly evolving technological landscape.
5. Session plan
21 CES 1
Factor analysis.
Module V: Introduction to R programming language
26. Introduction to R programming language, Getting R, Managing R, Arithmetic and Matrix Operations | Lecture with PPT and practical on R-studio with coding | CO1, CO2, CO5
27. Introduction to Functions, Control Structures | Lecture with PPT and practical on R-studio with coding | CO1, CO2, CO5
28. Working with Objects and Data: Introduction to Objects, Manipulating Objects, Constructing Data Objects | Lecture, Demonstration and Practical Exercise | CO1, CO2, CO5
29. Types of Data items, Structure of Data items | Demonstration, Practical Exercise | CO1, CO2, CO5
30. Reading and Getting Data, Manipulating Data, Storing Data | Demonstration, Practical Exercise | CO1, CO2, CO5
Module VI: Graphical Analysis using R
31. Basic Plotting | Lecture, Demonstration and Practical Exercise | CO1, CO2, CO3, CO5
32. Manipulating the plotting window | Lecture, Demonstration and Practical Exercise | CO1, CO2, CO3, CO5
33. Box Whisker Plots | Demonstration, Practical Exercise | CO1, CO2, CO3, CO5
34. Scatter Plots, Pair Plots | Demonstration, Practical Exercise | CO1, CO2, CO3, CO5
35. Pie Charts, Bar Charts | Lecture, Demonstration and Practical Exercise | CO1, CO2, CO3, CO5
Module VII: Advanced R
36 CES 2
6. Textbook:
1. Elaine Rich and Kevin Knight, Artificial Intelligence, Tata McGraw Hill.
2. Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning, Cambridge University Press.
3. B. Yegnanarayana, Artificial Neural Networks, PHI, 2005.
4. Tom Mitchell, Machine Learning, McGraw Hill, 1997.
5. E. Alpaydin, Introduction to Machine Learning, PHI, 2005.
8. Reference Book:
1. Christopher M. Bishop. Pattern Recognition and Machine Learning (Springer)
2. Introduction to Artificial Intelligence and Expert Systems by Dan W. Patterson, Prentice Hall
of India
3. Andrew Ng, Machine Learning Yearning, [Link]
4. Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, Shroff/O'Reilly, 2017.
5. Andreas Muller and Sarah Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists, Shroff/O'Reilly, 2016.
9. MOOC
a) Swayam : [Link]
b) NPTEL: [Link]
c) EDX : [Link]
Unit 1: Introduction to Probability
Topics
Types of Probability
Conditional Probability
Bayes’s Theorem
Introduction of Probability Concept
History of Probability
Probability theory, as a formal branch of mathematics, traces its roots back to the 16th and 17th
centuries. However, the concepts underpinning probability have existed much earlier in the form
of gambling and games of chance. Ancient civilizations like the Egyptians, Greeks, and Romans
used early forms of probability in divination and decision-making processes. Despite these early
instances, it wasn’t until the Renaissance that probability began to be studied systematically.
The formal study of probability is often attributed to the correspondence between French
mathematicians Blaise Pascal and Pierre de Fermat in the 1650s. Their discussions were centered
around problems in gambling, such as the “problem of points,” which concerned dividing stakes
in a game of chance that was interrupted before its conclusion. This collaboration laid the
groundwork for the mathematical theory of probability.
During the same period, the Italian mathematician Gerolamo Cardano, in his work *Liber de
Ludo Aleae* (The Book on Games of Chance), was one of the first to formalize the calculation
of odds and outcomes in gambling. Although Cardano's work was published posthumously, it
demonstrated a clear understanding of the principles of probability and laid a foundation for
future developments.
The 17th century saw the formalization of probability theory, which continued into the 18th
century. One of the key figures during this period was the Dutch mathematician Christiaan
Huygens, who, in 1657, published the first book on probability theory titled *De Ratiociniis in
Ludo Aleae* (On Reasoning in Games of Chance). Huygens' work built upon the ideas of Pascal
and Fermat, further formalizing the concepts of expected value and fair games.
In the 18th century, probability theory was further developed by figures such as Jakob Bernoulli
and Pierre-Simon Laplace. Bernoulli’s work, *Ars Conjectandi* (The Art of Conjecture),
published posthumously in 1713, introduced the law of large numbers, a fundamental theorem in
probability theory. This theorem states that as the number of trials of a random event increases,
the average of the results will converge to the expected value.
Laplace, in his seminal work *Théorie Analytique des Probabilités* (Analytical Theory of
Probability) published in 1812, provided a comprehensive framework for probability theory and
applied it to various fields, including astronomy, statistics, and social sciences. Laplace's
definition of probability as the ratio of favorable outcomes to the total number of equally likely
outcomes became the foundation of classical probability theory.
The 19th century saw probability theory expanding beyond gambling and games of chance into
broader applications. The development of statistics and the theory of errors in measurement
contributed significantly to the evolution of probability. Carl Friedrich Gauss, in the early 19th
century, introduced the concept of the normal distribution, also known as the Gaussian
distribution, which became central to probability theory and statistics.
Another major development during this period was the concept of the random walk, introduced
by Karl Pearson in 1905, and the notion of Brownian motion, studied by Albert Einstein in 1905.
These concepts laid the foundation for the theory of stochastic processes, which would become a
significant area of research in the 20th century.
The 20th century marked a significant shift in the formalization and abstraction of probability
theory. The Russian mathematician Andrey Kolmogorov played a crucial role in this process. In
1933, Kolmogorov published *Grundbegriffe der Wahrscheinlichkeitsrechnung* (Foundations
of the Theory of Probability), which established a rigorous axiomatic foundation for probability
theory. Kolmogorov’s axioms provided a formal mathematical structure for probability, defining
it as a measure on a sigma-algebra of events.
The mid-20th century also saw the application of probability theory in various scientific fields,
including quantum mechanics, genetics, economics, and computer science. The development of
Bayesian probability, named after the Reverend Thomas Bayes, who introduced Bayes' Theorem
in the 18th century, gained significant attention. Bayesian probability provided a framework for
updating probabilities based on new evidence, becoming widely used in statistics, decision
theory, and machine learning.
Today, probability theory is a cornerstone of modern mathematics and is applied in a wide range
of disciplines. From predicting stock market trends to understanding the behavior of subatomic
particles, probability theory continues to evolve and find new applications. The development of
computational methods and algorithms has further expanded the scope of probability theory,
allowing for the analysis of complex systems and large datasets.
The history of probability is a testament to the power of mathematical abstraction and its ability
to provide insights into the uncertain and unpredictable aspects of the world. As probability
theory continues to evolve, it will undoubtedly play an increasingly important role in shaping our
understanding of the world and the decisions we make.
Probability Concept: What is Probability?
Basic Definition
In its simplest form, probability is defined as the ratio of the number of favorable outcomes to the total number of possible outcomes in a given experiment. It is typically expressed as a fraction, decimal, or percentage, and always falls within the range of 0 to 1, where 0 means the event is impossible and 1 means the event is certain.
Life is full of uncertainties. We don’t know the outcomes of a particular situation until it
happens. Will it rain today? Will I pass the next math test? Will my favorite team win the toss?
Will I get a promotion in the next 6 months? All these questions are examples of the uncertain situations we live in. Let us map them to a few common terms which we will use going forward.
Experiment – an uncertain situation that can have multiple outcomes. Whether it rains on a given day is an experiment.
Outcome – the result of a single trial. So, if it rains today, the outcome of today's trial of the experiment is "It rained".
Event – one or more outcomes of an experiment. "It rained" is one of the possible events for this experiment.
Probability – a measure of how likely an event is. So, if there is a 60% chance that it will rain tomorrow, the probability of the outcome "it rained" for tomorrow is 0.6.
Sample Space – the set of all possible outcomes, e.g. when rolling a die we can roll a one, two, three, four, five or six.
Mutually Exclusive – two events are mutually exclusive if both cannot occur at the same time; for example, we cannot roll a six and an odd number at the same time.
Independent – two events are independent if the occurrence of one does not affect the probability of the other occurring, e.g. rolling a 6 the first time does not affect the probability of rolling a 6 the next time.
Example
Consider a simple example of rolling a six-sided die. The die has six faces, numbered from 1 to 6. If we want to calculate the probability of rolling a 4, we can determine the following:
- Total number of possible outcomes: there are 6 possible outcomes (1, 2, 3, 4, 5, and 6).
- Number of favorable outcomes: there is 1 favorable outcome (rolling a 4).
- Probability: P(rolling a 4) = 1/6 ≈ 0.167.
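Since this course is about statistical programming in R, here is a minimal sketch (my own illustration, not part of the original text) that compares the classical value 1/6 with an empirical estimate obtained by simulating many rolls; the seed and the sample size of 100,000 are arbitrary choices.

# Estimate P(rolling a 4) by simulation and compare with the classical value.
set.seed(42)                                  # for reproducibility
rolls <- sample(1:6, size = 100000, replace = TRUE)
empirical <- mean(rolls == 4)                 # relative frequency of a 4
classical <- 1 / 6                            # favourable / total outcomes
round(c(classical = classical, simulated = empirical), 4)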
Types of Probability
Probability can be interpreted and calculated in several ways depending on the context:
1. Classical Probability: This is the traditional approach where all outcomes are assumed to be
equally likely. It is often used in situations involving games of chance, such as flipping a coin or
rolling a die.
Applications of Probability
- Statistics : Probability forms the basis for many statistical methods, including hypothesis
testing, confidence intervals, and regression analysis. It helps statisticians make inferences
about populations based on sample data.
- Finance : In finance, probability is used to model and assess risks, price financial instruments,
and develop investment strategies. Techniques like Monte Carlo simulations rely on
probabilistic models to predict market behavior.
- Science and Engineering : Probability is essential in scientific research for analyzing
experimental data and modeling natural phenomena. Engineers use probability to assess system
reliability, manage risks, and optimize processes.
- Everyday Life : People use probability, often unconsciously, in everyday decision-making. For
example, when deciding whether to carry an umbrella based on a weather forecast, you are
considering the probability of rain.
Probability is a versatile concept in mathematics that can be interpreted and applied in various
ways depending on the context. Understanding the different types of probability is crucial for
effectively applying probabilistic reasoning in diverse fields such as statistics, finance,
engineering, and everyday decision-making. This section delves into the primary types of
probability, providing detailed explanations and practical examples for each.
1. Classical Probability
Definition: Classical probability, also known as "a priori" or "theoretical probability," is based
on the assumption that all possible outcomes of an experiment are equally likely. It is calculated
using the ratio of the number of favorable outcomes to the total number of possible outcomes.
Formula:
P(Event) = (Number of favorable outcomes) / (Total number of possible outcomes)
Key Characteristics:
Relies on known and finite sample spaces.
Assumes perfect randomness and equal likelihood of all outcomes.
Often used in games of chance and combinatorial problems.
Practical Examples:
1. Rolling a Fair Die:
- Experiment: Rolling a standard six-sided die.
- Sample Space: {1, 2, 3, 4, 5, 6}
- Event: Rolling an even number (2, 4, or 6).
- Probability Calculation: P(even) = 3/6 = 1/2 = 0.5.
- Interpretation: There is a 50% chance of rolling an even number.
2. Spinning a Spinner:
- Experiment: Spinning a spinner with 8 equal sections numbered 1 to 8.
- Event: Landing on a number greater than 5 (6, 7, or 8).
- Probability Calculation: P = 3/8 = 0.375.
- Interpretation: There is a 37.5% chance of the spinner landing on a number greater than 5.
2. Empirical Probability
Definition:
Empirical probability, also known as "experimental" or "a posteriori" probability, is based on
observed data or experiments rather than theoretical calculations. It is determined by conducting
experiments or collecting data and calculating the relative frequency of the event occurring.
Key Characteristics:
- Relies on actual experiments or historical data.
- Can accommodate complex and non-uniform sample spaces.
- Useful when theoretical probabilities are difficult to determine.
Practical Examples:
1. Weather Forecasting:
- Experiment: Recording daily occurrences of rain over a year.
- Data: Suppose it rained 120 days out of 365.
- Probability Calculation: 120 / 365 ≈ 0.329.
- Interpretation: Based on past data, there is approximately a 32.9% chance of rain on any given day.
2. Quality Control in Manufacturing:
- Experiment: Inspecting 1,000 units produced by a factory.
- Data: Found 50 defective units.
- Probability Calculation: 50 / 1,000 = 0.05, i.e., a 5% observed defect rate.
3. Sports Performance:
- Experiment: Tracking a basketball player's free-throw success rate over 200 attempts.
- Data: Successfully made 150 free throws.
- Probability Calculation: 150 / 200 = 0.75.
- Interpretation: The player has a 75% probability of making a free throw based on past
performance.
4. Epidemiology Studies:
- Experiment: Observing the occurrence of a particular disease in a population over a decade.
- Data: 300 out of 10,000 individuals developed the disease.
- Probability Calculation: 300 / 10,000 = 0.03, i.e., a 3% observed incidence over the decade.
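As a hedged illustration in R, the empirical probabilities above are simply relative frequencies of observed counts; the variable names below are my own.

# Empirical probability = observed occurrences / total observations,
# using the counts quoted in the examples above.
rain_days <- 120;  days_total  <- 365
defects   <- 50;   units_total <- 1000
round(c(p_rain = rain_days / days_total,        # about 0.329
        p_defect = defects / units_total), 3)   # 0.05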
3. Subjective Probability
Definition:
Subjective probability is based on personal judgment, intuition, or experience rather than on
formal calculations or empirical data. It reflects an individual's degree of belief in the
occurrence of an event.
Key Characteristics:
- Influenced by personal opinions, biases, and experiences.
- Not necessarily quantifiable or consistent across different individuals.
- Useful in scenarios where objective data is unavailable or incomplete.
Practical Examples:
1. Investment Decisions:
- Scenario: An investor assesses the likelihood of a stock's price increasing based on their
intuition and market experience.
- Subjective Probability: The investor believes there is a 70% chance the stock will rise,
based on their analysis of market trends and company performance.
- Interpretation: The probability is a personal estimate and may differ from objective
measures.
2. Medical Diagnoses:
- Scenario: A doctor estimates the probability that a patient has a specific disease based on
symptoms and medical history.
- Subjective Probability: The doctor believes there is an 80% chance the patient has the
disease, informed by their clinical experience.
- Interpretation: The probability reflects the doctor's judgment and may be adjusted with
further tests.
3. Project Management:
- Scenario: A project manager assesses the likelihood of a project being completed on time.
- Subjective Probability: Based on team performance and project complexity, the manager
estimates a 60% probability of on-time completion.
- Interpretation: The estimate relies on the manager’s experience and perception of project
dynamics.
4. Personal Decision-Making:
- Scenario: Deciding whether to carry an umbrella based on the forecast and personal
judgment.
- Subjective Probability: Believing there is a 40% chance of rain based on the weather
forecast and personal observation of cloud patterns.
- Interpretation: The decision is influenced by both objective data and personal intuition.
4. Bayesian Probability
Definition:
Bayesian probability is an interpretation of probability that incorporates prior knowledge or
beliefs and updates them as new evidence becomes available. It is grounded in Bayes' Theorem,
which provides a mathematical framework for revising probabilities.
The theorem states:

P(H|E) = [P(E|H) × P(H)] / P(E)

Where:
P(H|E) is the posterior probability of the hypothesis H given the evidence E.
P(E|H) is the likelihood of the evidence E given H.
P(H) is the prior probability of H.
P(E) is the marginal probability of the evidence E.
Key Characteristics:
- Combines prior beliefs with new evidence.
- Provides a dynamic approach to probability updating.
- Widely used in fields like statistics, machine learning, and decision theory.
Practical Examples:
1. Medical Testing:
- Scenario: Determining the probability that a patient has a disease given a positive test result.
- Prior Probability (P(Disease)): 1% (prevalence of the disease).
- Likelihood (P(Positive|Disease)): 99% (test accuracy).
- Marginal Probability (P(Positive)): Calculated based on overall prevalence and test
accuracy.
- Bayesian Calculation: assuming the test's false-positive rate is also 1%, P(Positive) = (0.99 × 0.01) + (0.01 × 0.99) = 0.0198, so P(Disease|Positive) = (0.99 × 0.01) / 0.0198 ≈ 0.50.
- Interpretation: Even with a positive test, the posterior probability may remain low due to the disease's low prevalence.
3. Quality Assurance:
- Scenario: Estimating the probability that a product is defective after an initial test result.
- Prior Probability (P(Defective)): 5%.
- Likelihood (P(Passed Test|Defective)): 20%.
- Likelihood (P(Passed Test|Not Defective)): 95%.
- Bayesian Calculation: P(Defective|Passed Test) = (0.20 × 0.05) / (0.20 × 0.05 + 0.95 × 0.95) = 0.01 / 0.9125 ≈ 0.011.
- Interpretation: Even if a product passes the test, there remains a non-zero probability that it
is defective, adjusted based on test characteristics.
4. Legal Proceedings:
- Scenario: Assessing the probability of a defendant's guilt based on new evidence.
- Prior Probability (P(Guilt)): 10% (based on initial evidence).
- Likelihood (P(New Evidence|Guilt)): 90%.
- Likelihood (P(New Evidence|Innocent)): 30%.
- Bayesian Calculation: P(Guilt|New Evidence) = (0.90 × 0.10) / (0.90 × 0.10 + 0.30 × 0.90) = 0.09 / 0.36 = 0.25.
- Interpretation: The new evidence significantly increases the probability of guilt compared
to the prior probability.
5. Frequentist Probability (Additional Type)
Definition:
Frequentist probability defines the probability of an event as the limit of its relative frequency in
many trials. It interprets probability strictly in terms of long-run frequencies of events.
Key Characteristics:
- Objective interpretation based on long-term frequencies.
- Does not incorporate prior beliefs or subjective opinions.
- Commonly used in classical statistical inference.
Practical Examples:
1. Coin Flipping:
- Scenario: Flipping a fair coin many times and recording the relative frequency of heads.
- Frequency Calculation: (number of heads) / (total number of flips), which settles near 0.5 as the number of flips grows.
- Interpretation: The probability of getting heads is approximately 50%, aligning with the theoretical probability.
2. Manufacturing Defects:
- Scenario: Monitoring the defect rate in a production line over time.
- Frequency Calculation: Out of 50,000 units produced, 250 are defective, so the relative frequency of a defect is 250 / 50,000 = 0.005 (0.5%).
3. Elections Polling:
- Scenario: Conducting a poll with 1,000 respondents to estimate voter preference.
- Frequency Calculation: If 600 respondents favor Candidate A, the frequentist probability is 600 / 1,000 = 0.6.
- Interpretation: Based on the poll, there is a 60% probability that a randomly selected voter
favors Candidate A.
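A small R sketch of the frequentist idea (my own illustration, not part of the source examples): the running relative frequency of heads converges towards 0.5 as the number of tosses grows.

# Law of large numbers: relative frequency of heads approaches 0.5.
set.seed(1)
tosses  <- sample(c("H", "T"), size = 10000, replace = TRUE)
running <- cumsum(tosses == "H") / seq_along(tosses)
running[c(10, 100, 1000, 10000)]   # frequency after 10, 100, 1000 and 10000 tosses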
Example
In Newcastle, 70% of small businesses use the internet to advertise new products; 50% of
small businesses use flyers to advertise new products and a quarter of small businesses
use both flyers and the internet.
(A) What is the probability that a randomly chosen small business in Newcastle
uses either flyers or the internet to advertise new products?
(B) What is the proportion of small businesses in Newcastle that use neither the
internet nor flyers to advertise new products?
Solution (A)
Let F denote the event that a business advertises new products using flyers and I denote the
event that a business uses the internet to advertise new products.
We wish to find P(F or I). Using the Addition Law, we have:

P(F or I) = P(F) + P(I) - P(F and I) = 0.5 + 0.7 - 0.25 = 0.95

There is a 95% probability that a randomly chosen small business in Newcastle uses either flyers or the internet to advertise new products.
Solution (B)
The proportion that uses neither is the complement of using at least one: 1 - P(F or I) = 1 - 0.95 = 0.05, i.e. 5% of small businesses in Newcastle use neither the internet nor flyers to advertise new products.
Tree Diagrams
Tree diagrams can be used to help us visualize and calculate complex probabilities. When drawing a tree diagram we begin with a dot. From this dot, lines ("branches") are drawn, extending to the right of the first dot, to represent all possible outcomes for the given situation. The probability of each of these outcomes is written just above the corresponding line.
To calculate the probability that two events both happen, we draw another "branch" extending from the "branch" corresponding to one of these events, to represent the second event occurring after the first. Above this line we write the probability (or conditional probability, for events which are not independent) of the second event occurring after the first. Multiplying these probabilities "along the branches" gives the required probability.
To calculate the probability that one of several mutually exclusive final outcomes occurs, we add the probabilities of the corresponding branches "down the columns".
Example
60% of employees at a department store in Newcastle are women. Government research
into methods of commuting to city jobs in the North East has shown on average that:
12% of people cycle into work.
A quarter of the people drive.
10% of people walk.
And the rest use public transport.
What is the probability that a randomly selected employee of the department store in
Newcastle commutes using public transport and is male? Now calculate the probability
that a randomly selected employee is female and drives into work.
Solution: We can use a tree diagram to present all of the information given to us and calculate the required probabilities.
The proportion of people using public transport is 1 - 0.12 - 0.25 - 0.10 = 0.53, and 40% of employees are male, so the probability that a randomly selected employee commutes by public transport and is male is 0.53 × 0.4 = 0.212.
The final probabilities on the tree are calculated using the multiplication law. So, the probability that the employee is female and drives is 0.6 × 0.25 = 0.15.
Tip: To make sure all your calculations are correct, you can check that your final probabilities add up to 1. This must be the case because exactly one of the possible final outcomes must (is certain to) occur.
Decision Trees - Decision trees are very similar to the probability tree diagrams but are used
specifically to calculate expected monetary values.
Example 4
The manager of a small business has the opportunity to buy a fixed quantity of a new product
and offer it for sale for a limited time.
There will be a fixed cost of £100,000 to buy the product and offer it for sale. The amount of the
product that the manager would be able to sell is not certain but market research has suggested
that:
The probability that sales would be “poor” is 0.25. Selling this quantity would raise an
income of £75,000.
The probability that sales would be “medium” is 0.6. Selling this quantity would raise an
income of £110,000.
The probability that sales would be “good” is 0.15. Selling this quantity would raise an
income of £145,000.
The product can be sold for a trial period before a final decision is made and it costs £18,000 to
run the trial. The results of the trial will be “poor” with probability 0.35, “medium” with
probability 0.4 or “good” with probability 0.25. Knowing the outcome of the trial changes the
probabilities for the main sales project:
Solution (A)
On the decision tree, the end values are the incomes from buying the new product when sales are "poor", "medium" or "good": £75,000, £110,000 and £145,000 respectively.
Solution (B)
To calculate the expected monetary value (EMV), we multiply each possible income by its probability and sum the results.
For No Trial:
EMV=0.25×£75,000+0.6×£110,000+0.15×£145,000=£106,500.
The manager has an expected income of £106,500 from selling the new product without a trial.
Solution (C)
To solve the decision problem it is best to first calculate the separate EMVs for when a trial is run and when a trial is not run. We must then compare the EMVs for each option (trial or no trial) and choose the option with the highest EMV. This is the optimal course of action for the company.
When a trial is carried out and has a poor result:
EMV=0.75×£75,000+0.15×£110,000+0.1×£145,000=£87,250.
When a trial is carried out and has a medium result:
EMV=0.25×£75,000+0.5×£110,000+0.25×£145,000=£110,000.
When a trial is carried out and has a good result:
EMV=0.1×£75,000+0.15×£110,000+0.75×£145,000=£132,750.
Now, to calculate the overall EMV for the trial option, we multiply each of these by the probability of the corresponding trial result:
EMV = P(Poor result) × £87,250 + P(Medium result) × £110,000 + P(Good result) × £132,750
    = 0.35 × £87,250 + 0.4 × £110,000 + 0.25 × £132,750
    = £107,725.
We now need to calculate the expected profit (or loss) the business would make from each
option (trial or no trial).
No Trial:
Expected Profit = EMV (no trial) - Cost of new product
                = £106,500 - £100,000
                = £6,500.
So if the manager goes ahead with the product without a trial, the expected profit is £6,500.
Trial:
Expected Profit = EMV (trial) - Cost of new product - Cost of trial
                = £107,725 - £100,000 - £18,000
                = -£10,275.
With the trial, there will be an expected loss of £10,275.
From these results we can see that the optimal course of action for the company is to sell the new product without the trial period, as this yields the higher EMV. It is important to note that although the expected monetary value is higher when the manager chooses not to run the trial, the realized profit or loss may or may not turn out better than it would have been if a trial had been carried out.
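For reference, the expected monetary values above can be reproduced in R; this is only a sketch using the probabilities and incomes stated in the example, with my own variable names.

# EMV for the "no trial" option and the resulting expected profit.
p      <- c(poor = 0.25, medium = 0.60, good = 0.15)
income <- c(poor = 75000, medium = 110000, good = 145000)
emv_no_trial    <- sum(p * income)        # 106500
profit_no_trial <- emv_no_trial - 100000  # 6500
c(EMV = emv_no_trial, expected_profit = profit_no_trial)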
Permutation and Combination Concept
What is a Permutation?
A permutation is an arrangement of all or part of a set of objects in which the order of the objects matters.
Permutations are frequently confused with another mathematical technique called combinations. However, in combinations the order of the chosen items does not influence the selection. In other words, the arrangements ab and ba are considered different arrangements in permutations, while in combinations these arrangements are equal.
The number of permutations of k elements selected from a set of n elements is:

P(n, k) = n! / (n - k)!

Where:
n – the total number of elements in a set
k – the number of selected elements arranged in a specific order
! – factorial
Factorial (noted as "!") is the product of all positive integers less than or equal to the number preceding the factorial sign. For example, 3! = 1 × 2 × 3 = 6.
The formula above is used in situations when we want to select only several elements from a set of elements and arrange the selected elements in a specific order.
Example of a Permutation
You are a partner in a private equity firm. You want to invest $5 million in two projects. Instead
of equal allocation, you decided to invest $3 million in the most promising project and $2
million in the less promising project. Your analysts shortlisted six projects for
potential investment. How many possible arrangements are available for your investment
decision?
The example above is a permutation problem. Since the allocation of the money for the
two projects is not equal, the selection order matters in this problem. For example, consider the
following arrangement: invest $3 million in Project A and $2 million in Project B vs. invest $2
million in Project A and $3 million in Project B. The options are not equal to each other.
Therefore, we must use the formula above to determine the number of available arrangements:

P(6, 2) = 6! / (6 - 2)! = 6! / 4! = 6 × 5 = 30
Therefore, you can get 30 possible investment arrangements based on the six projects shortlisted
by your analysts.
When all the objects in the set are distinct, the permutation formula is straightforward.
Example 1: Consider the set {A, B, C}. The number of ways to arrange all three letters is 3! = 3 × 2 × 1 = 6.
The possible permutations are: ABC, ACB, BAC, BCA, CAB, and CBA.
Example 2: For a set of 5 different books, the number of ways to arrange 2 books on a shelf is:

P(5, 2) = 5! / (5 - 2)! = 5 × 4 = 20
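A quick R check of these counts (a sketch; the helper function n_perm is my own name, not a built-in):

# Permutations without repetition: P(n, r) = n! / (n - r)!
n_perm <- function(n, r) factorial(n) / factorial(n - r)
n_perm(6, 2)   # 30 investment arrangements
n_perm(5, 2)   # 20 ways to arrange 2 of 5 books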
Permutations with Repetition
Permutations with repetition involve arranging items where some items may be repeated, and
we're interested in the number of different sequences that can be formed considering the
repetitions.
For example, on a pizza, you might have a combination of three toppings: pepperoni, ham, and
mushroom. The order doesn’t matter. For example, using letters for the toppings, you can have
PHM, PMH, HPM, and so on. It doesn’t matter for the person who eats the pizza because you
have the same combination of three toppings. In other words, the order of these three letters does
not matter and they form one combination.
However, imagine we’re using those letters for a weak password. In this case, the order is
crucial, making them permutations. PHM, PMH, HPM, etc., are distinct permutations. If the
password is PHM, entering HPM will not work. When you have at least two permutations, the
number of permutations is greater than the number of combinations. Learn more about
the differences between permutation vs combination.
A so-called "combination lock" should really be known as a permutation lock, because the order of the digits matters!
Some arrangements also allow repetition. For example, in a four-digit PIN, you can repeat values, such as 1-1-1-1. Analysts also
call this permutations with replacement.
To calculate the number of permutations, take the number of possibilities for each event and then
multiply that number by itself X times, where X equals the number of events in the sequence.
For example, with four-digit PINs, each digit can range from 0 to 9, giving us 10 possibilities for
each digit. We have four digits. Consequently, the number of permutations with repetition for
these PINs = 10 * 10 * 10 * 10 = 10,000.
Imagine that a class with 15 children can choose one cookie from five types of cookies:
Gingerbread, Sugar, Chocolate Chip, Mint, and Peanut Butter. There are enough cookies that
they are free to choose one of any type. How many possible permutations of cookies are there?
In this example,
o n = 5 because there are five possible cookie choices.
o r = 15 because there are 15 students in the class, making it the size of the permutation.
Consequently, there are 5^15 = 30,517,578,125 permutations with repetition. That's over 30 billion
permutations! If you were to make random guesses for the cookie choice of all 15 children,
you’d have a probability of 1/30,517,578,125 of correctly guessing the selections for the entire
class! That assumes you don’t have insider knowledge about each child’s cookie preference! I
think you’d have better luck in a lottery!
In general, the number of permutations with repetition is n^r,
where:
n is the number of distinct types of items.
r is the number of positions to fill.
Examples
Example 1: 3 Types of Items, 2 Positions
Suppose we have 3 types of items (say, A, B, and C) and want to arrange 2 items.
To find the number of permutations:
Apply the formula: n^r = 3^2 = 9
List the permutations:
o AA
o AB
o AC
o BA
o BB
o BC
o CA
o CB
o CC
There are 9 unique ways to arrange 2 items where each position can be filled with any of the 3
types of items.
Example 2: 4 Types of Items, 3 Positions
Consider 4 distinct items (say, 1, 2, 3, and 4), and we want to arrange 3 items.
Apply the formula: n^r = 4^3 = 64
List a few permutations (for illustration):
o 111
o 112
o 113
o 121
o 122
o 123
o (and so on...)
There are 64 possible arrangements of 3 items with 4 possible choices for each position.
Permutations with repetition involve:
Formula: n^r, where n is the number of distinct items and r is the number of positions to be filled.
Order Matters: The arrangement of items is significant, so permutations consider the sequence of items.
Repetition Allowed: The same item can appear in multiple positions.
This concept is widely applicable in scenarios like password generation, where each position can be filled by any of the allowed characters, and in scenarios where choices are repeated multiple times.
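In R these counts are just powers, as in this small sketch based on the PIN and cookie examples above:

# Permutations with repetition: n^r
10^4                                               # four-digit PINs: 10,000
format(5^15, big.mark = ",", scientific = FALSE)   # cookie choices: 30,517,578,125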
For the first book, you have 10 books from which to choose. For the second book, you have nine.
There are eight options for the third book, and so on. Like before, this process involves
multiplying the number of possible outcomes together. However, we must reduce the number of
outcomes for each subsequent event.
Mathematically, we’d calculate the permutations for the book example using the following
method:
10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 = 3,628,800
There are 3,628,800 permutations for ordering 10 books on a shelf without repeating books.
Whew! I bet you didn't realize there were so many possibilities with 10 books. I'll stick to
alphabetical order!
When you multiply all numbers from 1 to n, it’s a factorial. In the book example, we multiplied
all numbers from 1 to 10. Instead of using the long string of multiplication, you can write it as
10! and read it as 10 factorial.
In general, n! equals the product of all numbers up to n. For example, 3! = 3 * 2 * 1 = 6. The
exception is 0! = 1, which simplifies equations.
Factorials are crucial concepts for permutations without replication. The number of permutations
for n unique objects is n!. This number snowballs as the number of items increases.
The formula for permutations without repetition is:

P(n, r) = n! / (n - r)!

Where:
o n = the number of unique items. For instance, n = 10 for the book example because there are
10 books.
o r = the size of the permutation. For example, r = 5 for the five books we want to place on the
shelf.
This equation works both for the complete and partial sets of permutations without repetitions,
depending on the values you enter in the equation. For complete sets, n = r. Additionally, r
cannot be greater than n because there are no repetitions.
For the book example, we have 10 books, but we can put only five on the shelf. The first book
still has 10 options. However, for placing the second book, we have only nine options because
we already placed one. We have eight options for the third book and so on until we place the
fifth book. Mathematically, we’d write this as the following for the five books:
10 * 9 * 8 * 7 * 6 = 30,240
There are 30,240 permutations for placing five books out of our 10 books on a shelf.
Using the equation to calculate the number of permutations
Now, we’ll use the formula to calculate this example. Again, we’ll use n=10 and r=5.
Notice how the 5! cancels itself out in the fraction? That leaves us with the 10 * 9 * 8 * 7 * 6 that
we had before.
Here’s how the equation works. The numerator calculates the complete number of permutations
for all the unique items. The denominator cancels out the permutations in which we’re not
interested. For the book example, the denominator cancels out permutations with more than five
books.
Using one form of the notation, we’d write this problem as P (10, 5) = 30,240.
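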
Worked Example of Using Permutations to Calculate Probabilities
When you’re given a probability problem that uses permutations, you need to follow these steps
to solve the problem.
1. Set up a ratio to determine the probability.
2. Determine whether the numerator and denominator require combinations, permutations, or a mix. For this post, we'll stick with permutations.
3. Decide whether these are permutations with repetition, without repetition, or a mix.
4. Both types require you to identify the n and r to enter into the equations.
Problem: What is the probability that a four-digit PIN does not have repeated digits?
This question builds on several of the examples in this post.
Let’s set up our ratio for the probability. In this example, we can use the following ratio for the
events of interests and the total number of events.
Numerator
Let’s tackle the numerator. We need to find the number of four-digit PINs that do not have
repeating digits. That’s a permutation because order matters, and it’s without replication because
we can’t have repeats. Let’s identify the n and r. We’ll use n=10 because 10 digits are available
for the first item and r=4 because we’re discussing four-digit PINs.
Let’s enter that into the equation for permutations without repetition to calculate the numerator:
Denominator
For the denominator, we need to calculate all possible permutations for four-digit PINs with
repeats. We need to enter our n and r into the equation for permutations with repeats.
n^r = 10^4 = 10,000
Consequently, the probability of a four-digit PIN with no repeating digits equals 5,040 / 10,000 = 0.504, or about 50.4%.
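The same calculation in R (a sketch combining the two counting rules used above):

# P(no repeated digits in a four-digit PIN)
no_repeat <- factorial(10) / factorial(10 - 4)   # 10 * 9 * 8 * 7 = 5040
all_pins  <- 10^4                                # 10,000 possible PINs
no_repeat / all_pins                             # 0.504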
Circular Permutation
Circular permutations refer to arrangements of objects in a circle where the order of the objects
is important, but rotations of the arrangement are considered identical. This concept is useful in
problems where the arrangement is cyclic and you want to count distinct configurations that are
rotationally unique.
Formula for Circular Permutations
For n distinct objects arranged in a circle, the number of distinct circular permutations is
given by:

(n - 1)!

Example 3: 5 Distinct Objects
Consider 5 distinct objects: A, B, C, D, and E.
1. Calculate the number of circular permutations: (n−1)!=(5−1)!=4!=24
2. List the permutations:
o ABCDE
o BCDEA
o CDEAB
o DEABC
o EABCD
o (and so forth)
Each permutation is distinct in a circular arrangement, and there are 24 unique ways to
arrange 5 objects in a circle.
Example 4: 6 Distinct Objects
Consider 6 distinct objects: 1, 2, 3, 4, 5, and 6.
1. Calculate the number of circular permutations: (n−1)!=(6−1)!=5!=120
2. List the permutations:
o 123456
o 234561
o 345612
o 456123
o 561234
o 612345
o (and so on)
There are 120 unique circular permutations for 6 distinct objects.
Example 5: 7 Distinct Objects
Consider 7 distinct objects: A, B, C, D, E, F, and G.
1. Calculate the number of circular permutations: (n - 1)! = (7 - 1)! = 6! = 720
2. List the permutations:
o ABCDEFG
o BCDEFGA
o CDEFGAB
o DEFGABC
o EFGABCD
o FGABCDE
o GABCDEF
o (and so forth)
With 7 distinct objects, there are 720 unique circular permutations.
Circular Permutations Formula: For n distinct objects, the number of unique circular permutations is (n - 1)!.
Consider Rotations as Identical: Each arrangement can be rotated n ways, so you only count one arrangement per unique rotation.
Circular permutations are particularly useful in scenarios where the arrangement is cyclic, such as seating around circular tables, clock arrangements, or certain scheduling problems.
Basics of Combinations
Definition of Combination
A combination is a selection of all or part of a set of objects without regard to the order of
arrangement. The concept of combination is used when the order of selection does not
matter.
Formula for Combination
The number of combinations of n distinct objects taken r at a time is given by:

C(n, r) = n! / [r! (n - r)!]

where:
n! is the factorial of n,
r! is the factorial of r, and
(n - r)! is the factorial of (n - r).
Example 1: Consider the set {A, B, C, D}. The number of ways to select 2 letters out of 4 is C(4, 2) = 4! / (2! × 2!) = 6.
The possible combinations are: AB, AC, AD, BC, BD, and CD.
Example 2: In a lottery where 6 numbers are chosen out of 49, the number of possible combinations is C(49, 6) = 49! / (6! × 43!) = 13,983,816.
When repetition is allowed in combinations, the formula changes slightly. The number of combinations of n objects taken r at a time with repetition is given by:

C(n + r - 1, r) = (n + r - 1)! / [r! (n - 1)!]
Example 3: If you have three types of fruits (apple, banana, cherry), and you want to select 2 fruits with repetition, the number of possible combinations is C(3 + 2 - 1, 2) = C(4, 2) = 6.
The possible combinations are AA, AB, AC, BB, BC, and CC.
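R has a built-in choose() function for combinations; the following sketch simply reproduces the counts above.

choose(4, 2)          # 6 ways to pick 2 letters from {A, B, C, D}
choose(49, 6)         # 13,983,816 possible lottery selections
choose(3 + 2 - 1, 2)  # 6 ways to pick 2 fruits from 3 types, repetition allowed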
Practical Applications of Permutations and Combinations
Permutations in Real-Life Scenarios
Example 4: Password Generation Consider a scenario where you need to create a password
using 4 letters (where repetition is not allowed) from the alphabet set of 26 distinct letters.
The total number of permutations possible is P(26, 4) = 26 × 25 × 24 × 23 = 358,800.
Example: Forming a Committee. The number of ways to choose 3 members from a group of 10 is C(10, 3) = 120. This means there are 120 different ways to form a committee of 3 members from 10 people.
Example 7: Selecting Ingredients In a recipe, you can choose 4 ingredients out of 8
available. The number of combinations is C(8, 4) = 8! / (4! × 4!) = 70.
Advanced Permutation and Combination Concepts
When some objects in a set are identical, the formula for permutations needs to be adjusted.
The number of distinct permutations of n objects, where there are n1, n2, ..., nk objects of each repeated type, is given by:

n! / (n1! × n2! × ... × nk!)

Example 8: Arranging Letters. Consider the word "BALLOON". It has 7 letters, with L and O each appearing twice, so the total number of distinct permutations of these letters is 7! / (2! × 2!) = 5,040 / 4 = 1,260.
So, there are 1,260 unique ways to arrange the letters in "BALLOON".
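As a sketch in R (the object names are my own), the BALLOON count can be checked with factorials of the letter frequencies:

# Distinct arrangements of a word with repeated letters: n! / (n1! * n2! * ...)
letters_word <- strsplit("BALLOON", "")[[1]]
counts <- table(letters_word)                               # A=1, B=1, L=2, N=1, O=2
factorial(length(letters_word)) / prod(factorial(counts))   # 1260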
Example 1: Arranging Books on a Shelf Suppose you have 5 different books and you want
to arrange them on a shelf. The total number of permutations is 5! = 5 × 4 × 3 × 2 × 1 = 120.
Example: Arranging People in a Row. For 4 people standing in a row, the number of possible arrangements is 4! = 24. Thus, there are 24 possible ways to arrange these 4 people in a row.
Example 4: Selecting and Arranging Employees Suppose you need to select 3 employees
from a group of 6 and assign them different positions. The number of permutations is P(6, 3) = 6! / 3! = 6 × 5 × 4 = 120.
There are 120 different ways to select and assign these 3 employees to the positions.
Example 5: Creating a Password If you need to create a 4-character password using the
letters A, B, C, and D, with repetition not allowed, the total number of permutations is 4! = 24.
Example 6: Lottery Number Arrangement Imagine a lottery where you must choose 3
numbers from a set of 5 (1, 2, 3, 4, 5), and the order matters. The number of possible
permutations is P(5, 3) = 5! / (5 - 3)! = 5 × 4 × 3 = 60.
Example: Arranging Letters. The word "TRAIN" has 5 distinct letters, so there are 5! = 120 different ways to arrange the letters in "TRAIN".
Example 8: Organizing a Race Suppose 7 runners are competing in a race, and you want to
know how many different ways the first 3 places can be awarded. The number of permutations is P(7, 3) = 7! / 4! = 7 × 6 × 5 = 210.
This indicates there are 210 possible ways to award the top 3 positions.
Example 9: Forming Committees If you need to select a president, vice-president, and
treasurer from a group of 8 members, the number of permutations is P(8, 3) = 8! / 5! = 8 × 7 × 6 = 336.
Example 10: Deck of Cards If you want to arrange 5 cards from a standard deck of 52,
without repetition, the number of permutations is P(52, 5) = 52 × 51 × 50 × 49 × 48 = 311,875,200.
The Addition Theorem is used to find the probability of the occurrence of at least one of
two events. There are two cases to consider:
1. Addition Theorem for Mutually Exclusive Events
For mutually exclusive events (events that cannot occur at the same time), the probability of either event A or event B occurring is the sum of their individual probabilities:
P(A or B) = P(A) + P(B)
2. Addition Theorem for Non-Mutually Exclusive Events
For non-mutually exclusive events, the probability of either event A or event B occurring
is the sum of their individual probabilities, minus the probability of both events occurring
together (to avoid double-counting).
P(A or B) = P(A) + P(B) - P(A and B)
Multiplication Theorem in Probability
The Multiplication Theorem is used to find the probability of the occurrence of two or
more events together. Like the Addition Theorem, the Multiplication Theorem also has
two cases:
1. Independent Events: Events where the occurrence of one does not affect the occurrence of the other; for example, tossing two coins. For independent events, P(A and B) = P(A) × P(B).
2. Dependent Events: Events where the occurrence of one affects the occurrence of the other; for example, drawing cards from a deck without replacement. For dependent events, P(A and B) = P(A) × P(B|A).
Here are examples for both the Addition and Multiplication Theorems in
probability:
Example 1: Addition Theorem
Scenario: You are attending a party where a game involves drawing a card from a standard deck of 52 cards. You win a prize if you draw either a spade or a face card (jack, queen, or king).
Problem: What is the probability of drawing either a spade or a face card?
Solution: The events are not mutually exclusive, because the jack, queen and king of spades are both spades and face cards, so
P(spade or face card) = P(spade) + P(face card) - P(spade and face card) = 13/52 + 12/52 - 3/52 = 22/52 = 11/26 ≈ 0.42.
Example 2: Multiplication Theorem
Scenario: You are playing a board game where you need to roll two six-sided dice. To
win the game, you need to roll a 4 on the first die and a 5 on the second die.
Problem: What is the probability of rolling a 4 on the first die and a 5 on the second
die?
Solution:
P(4 on first die and 5 on second die) = P(4 on first die) × P(5 on second die) = 1/6 × 1/6 = 1/36.
So, the probability of rolling a 4 on the first die and a 5 on the second die is 1/36.
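To tie this back to R, a minimal simulation (my own illustration) confirms that the relative frequency of this joint outcome is close to 1/36 ≈ 0.0278.

# Multiplication theorem for independent events, checked by simulation.
set.seed(7)
n  <- 100000
d1 <- sample(1:6, n, replace = TRUE)
d2 <- sample(1:6, n, replace = TRUE)
c(simulated = mean(d1 == 4 & d2 == 5), theoretical = 1/36)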
Bayes’s Theorem: Detailed Concept
Definition
Bayes's Theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis, H, based on new evidence, E. It is named after the Reverend Thomas Bayes, who first provided an equation that allows new evidence to update beliefs about the likelihood of a given event.
Bayes's Theorem links the conditional probability of the hypothesis given the evidence, P(H|E), with the conditional probability of the evidence given the hypothesis, P(E|H), along with the prior probability of the hypothesis, P(H), and the marginal likelihood of the evidence, P(E).
Concept
Bayes’s Theorem is built upon the concept of conditional probability, which is the
probability of an event occurring given that another event has already occurred. In real
life, we often encounter situations where we have some prior knowledge about an event,
and as we gather more information, we refine our predictions or beliefs about that event.
Bayes’s Theorem is particularly useful in situations where we need to make decisions
based on incomplete or evolving information. It allows us to revise our predictions or
hypotheses by incorporating new data. This concept is widely used in various fields like
medical diagnosis, machine learning, finance, and more.
The Formula
The mathematical formula for Bayes's Theorem is:

P(H|E) = [P(E|H) × P(H)] / P(E)

Where:
P(H|E) is the posterior probability, the probability of the hypothesis H given the evidence E.
P(E|H) is the likelihood, the probability of the evidence E given that the hypothesis H is true.
P(H) is the prior probability, the initial probability of the hypothesis H before considering the evidence E.
P(E) is the marginal likelihood or evidence, the total probability of the evidence E under all possible hypotheses.
Example
Let’s consider a classic example related to medical diagnosis.
Scenario: A patient is being tested for a rare disease. The test for the disease is 99%
accurate, meaning it correctly identifies the disease 99% of the time if the patient has it,
and it correctly identifies 99% of healthy patients as not having the disease. However, the
disease is quite rare, affecting only 1 in 10,000 people.
Problem: If the test result comes back positive, what is the probability that the patient
actually has the disease?
Solution:
Let H be the event that the patient has the disease.
Let E be the event that the test result is positive.
First, we need to calculate the marginal likelihood P(E), which is the total probability of
getting a positive test result under both scenarios (having the disease or not having the
disease):
P(E)=P(E∣H)⋅P(H)+P(E∣¬H)⋅P(¬H)
Substituting the values:
P(E)=(0.99×0.0001)+(0.01×0.9999)
P(E)=0.000099+0.009999=0.010098
Applying Bayes's Theorem:

P(H|E) = [P(E|H) × P(H)] / P(E) = (0.99 × 0.0001) / 0.010098 = 0.000099 / 0.010098 ≈ 0.0098
So, even after a positive test result, the probability that the patient actually has the disease is
only about 0.98%, which is surprisingly low. This result is due to the rarity of the disease
combined with the fact that the test, while accurate, still has a small false positive rate.
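The same calculation can be written out in R (a sketch; the variable names are my own):

# Bayes's Theorem for the medical-test example above.
prior     <- 1 / 10000   # P(disease)
sens      <- 0.99        # P(positive | disease)
false_pos <- 0.01        # P(positive | no disease)
p_pos     <- sens * prior + false_pos * (1 - prior)   # 0.010098
posterior <- sens * prior / p_pos                     # about 0.0098, i.e. 0.98%
c(P_positive = p_pos, P_disease_given_positive = posterior)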
Prior and Posterior Probability: The prior probability reflects what we know before
considering new evidence, while the posterior probability updates this belief in light of the
new evidence.
Impact of Rare Events: When dealing with rare events, even highly accurate tests can lead to
counterintuitive results. This is known as the base rate fallacy, where the base rate (prior
probability) of the event significantly influences the outcome.
Medical Diagnosis: Estimating the likelihood of a disease given a test result, as illustrated
in the example above.
Machine Learning: Algorithms like Naive Bayes classifiers rely on Bayes’s Theorem for
classifying data.
Finance: Estimating the likelihood of market movements based on new financial data or
news.
Law: Assessing the likelihood of a suspect’s guilt given new evidence in a case.
Bayes’s Theorem provides a robust framework for updating beliefs and making decisions
under uncertainty, making it a powerful tool in both theoretical and applied probability.
Permutation vs Combination
The key differences between permutation and combination are listed as follows:

Aspect        | Permutation                                   | Combination
Definition    | Arrangements of elements in a specific order. | Selections of elements without considering the order.
Formula       | nPr = n! / (n - r)!                           | nCr = n! / [(n - r)! × r!]
Notation      | nPr or P(n, r)                                | nCr or C(n, r)
Order Matters | Yes, order matters.                           | No, order doesn't matter.
Unit 2: Random Variables

Topics
Random Variable Concept
Discrete and Continuous Random Variables
Probability Density Function
Mathematical Expectation and its Theorems
Random Variable Concept - Introduction to Random Variables
Definition
A random variable is a variable whose value is a numerical outcome of a random experiment; it assigns a number to each possible outcome of that experiment.
In simpler terms, a random variable is a way to quantify the outcomes of a random process. For
example, when you roll a die, the outcome can be any number between 1 and 6. If we define a
random variable X to represent the outcome, then X can take any of these six values.
There are two main types of random variables: Discrete and Continuous.
A Discrete Random Variable takes on a countable number of distinct values. These values are
often integers and can be listed out. Common examples include the number of heads when
flipping a coin multiple times, the number of students in a classroom, or the number rolled on a
die.
Example:
Let X represent the number of heads in three flips of a fair coin. The possible values of X are 0,
1, 2, or 3, because you can get anywhere from 0 to 3 heads in three flips.
A Continuous Random Variable can take on an infinite number of possible values within a
given range. These values are often real numbers, and they are typically measured rather than
counted. Examples include the height of students in a class, the time it takes to run a race, or the
temperature at a particular location.
Example:
Let Y represent the time it takes for a runner to complete a marathon. Y can take any value
from, say, 2 hours to 6 hours, including any fractional value within this range (e.g., 3.5 hours,
4.1 hours).
Probability Distribution
A Probability Distribution describes how the probabilities are distributed over the values of a
random variable. The probability distribution depends on whether the random variable is
discrete or continuous.
For a discrete random variable, the probability distribution is described by a Probability Mass
Function (PMF). The PMF gives the probability that a random variable is exactly equal to some
value.
Example:
For a fair six-sided die, let X be the outcome when the die is rolled. The PMF is:
P(X = x) = 1/6, for x = 1, 2, 3, 4, 5, 6
For a continuous random variable, the probability distribution is described by a Probability Density
Function (PDF), which gives the relative likelihood of the variable falling near each value.
Example:
Let Y represent the height of adult men in a population, and assume Y follows a normal
distribution with a mean of 70 inches and a standard deviation of 3 inches. The PDF of Y
describes how heights are distributed around the mean. The probability that a randomly chosen
man is between 68 and 72 inches tall is given by the area under the PDF curve between these
two values.
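In R, the area under a normal PDF between two values is obtained with pnorm(); a quick sketch for this heights example (mean 70, standard deviation 3):
# P(68 <= Y <= 72) for Y ~ Normal(mean = 70, sd = 3)
pnorm(72, mean = 70, sd = 3) - pnorm(68, mean = 70, sd = 3)   # about 0.495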
a. Expectation (Mean)
The Expectation or Mean of a random variable is the long-run average value of repetitions of
the experiment it represents. It gives a measure of the central tendency of the distribution.
- For a Discrete Random Variable X with possible values x1, x2, …, xn and corresponding
probabilities P(X = x1), P(X = x2), …, P(X = xn), the expectation E(X) is:
E(X) = Σ xi ⋅ P(X = xi)
- For a Continuous Random Variable Y with a probability density function f(y), the
expectation E(Y) is:
E(Y) = ∫ y ⋅ f(y) dy
Example:
If you roll a fair die, the expected value (mean) of the outcome is:
E(X) = (1 + 2 + 3 + 4 + 5 + 6) × (1/6) = 3.5
b. Variance and Standard Deviation
The Variance of a random variable measures the spread or dispersion of the values around the
mean. It is the expected value of the squared deviation of the random variable from its mean:
Var(X) = E[(X − E(X))²]
The Standard Deviation is the square root of the variance, providing a measure of spread in the
same units as the original variable.
Example:
If you roll a fair die, the variance of the outcome can be calculated as:
Var(X) = E(X²) − [E(X)]² = 91/6 − (3.5)² ≈ 2.92
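Both quantities can be verified numerically in R from the definition of expectation; a minimal sketch for a fair die:
x <- 1:6
p <- rep(1/6, 6)
mean_x <- sum(x * p)               # E(X) = 3.5
var_x  <- sum((x - mean_x)^2 * p)  # Var(X) = 35/12, about 2.92
c(mean = mean_x, variance = var_x)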
Random variables are extensively used in statistical analysis and modeling across various fields:
- Finance: Modeling stock prices, where the price at any given time can be considered a
random variable.
- Insurance: Estimating risks, where claims and losses are modeled as random variables.
Real-Life Example
Scenario: Imagine a company produces light bulbs, and historically, 5% of the bulbs are
defective. Let X be the random variable representing the number of defective bulbs in a sample
of 100 bulbs.
Solution:
- The expected number of defective bulbs is E(X) = n × p = 100 × 0.05 = 5.
- The variance can be calculated using the formula for the variance of a binomial distribution,
Var(X) = n × p × (1 − p):
Var(X) = 100 × 0.05 × 0.95 = 4.75
This analysis helps the company understand the expected number of defective bulbs and the
variability around this expectation, which is crucial for quality control and decision-making.
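A sketch of the same calculation in R, using the binomial mean and variance formulas, with rbinom() used to simulate repeated samples of 100 bulbs as a check:
n <- 100; p <- 0.05
mean_defects <- n * p              # expected number of defective bulbs: 5
var_defects  <- n * p * (1 - p)    # theoretical variance: 4.75
sims <- rbinom(10000, size = n, prob = p)   # simulated defect counts
c(theoretical_var = var_defects, simulated_var = var(sims))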
Random variables are the building blocks of statistical analysis, allowing us to model and
understand the uncertainty inherent in various processes. By defining, analyzing, and
interpreting random variables, we can make informed decisions in fields as diverse as finance,
engineering, science, and beyond.
Discrete and Continuous Random Variable
A Random Variable is a variable that takes on different values based on the outcomes of a
random experiment. It quantifies the outcomes of random phenomena and is a key concept in
probability and statistics. Random variables can be categorized into two main types: Discrete
and Continuous.
Definition
A Discrete Random Variable is a random variable that can take on a countable number of
distinct values. These values are typically integers and can be listed individually. The term
"discrete" indicates that there are gaps between the possible values of the variable, meaning the
variable can only assume specific points on the number line.
Characteristics
Countable Outcomes: The values a discrete random variable can take are finite or countably
infinite. For example, the number of students in a classroom or the number of heads when
flipping a coin multiple times.
Probability Mass Function (PMF): The probability distribution of a discrete random variable
is described by a Probability Mass Function (PMF). The PMF gives the probability that the
random variable is exactly equal to each possible value. The sum of all probabilities in the
PMF equals 1.
Example 1: Flipping a Coin
Let’s say you flip a fair coin three times. Define the discrete random variable X as the
number of heads obtained.
Possible values of X: 0, 1, 2, 3.
Example 2: Rolling a Die
Consider rolling a fair six-sided die. Define the discrete random variable Y as the outcome of
the roll.
Possible values of Y: 1, 2, 3, 4, 5, 6.
Definition
A Continuous Random Variable is a random variable that can take on an infinite number of
possible values within a given range. These values are uncountable and typically include real
numbers, meaning the variable can assume any value within a certain interval.
Characteristics
Uncountable Outcomes: The values of a continuous random variable are uncountable and can
include any real number within a certain range. Examples include the height of people, the
time taken to complete a task, or the temperature in a room.
Example 1: Height of Students
Let’s say the height of students in a class is normally distributed with a mean of 170 cm and a
standard deviation of 10 cm. Define the continuous random variable Z as the height of a
randomly selected student.
The probability that a student's height is between 160 cm and 180 cm is given by the area
under the PDF curve from 160 to 180.
Example 2: Time to Complete a Task
Consider the time T it takes to complete a task, which could be any value between, say, 0 and
10 hours. If T is uniformly distributed, the PDF is constant over the interval.
The probability that the task takes between 4 and 6 hours is the area under the PDF from 4 to
6, calculated as:
P(4 ≤ T ≤ 6) = (6 − 4) × (1/10) = 0.2
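In R, punif() gives the cumulative probability of a uniform random variable, so the same area can be computed as shown in this short sketch:
# T ~ Uniform(0, 10); P(4 <= T <= 6)
punif(6, min = 0, max = 10) - punif(4, min = 0, max = 10)   # 0.2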
Feature            Discrete Random Variable                        Continuous Random Variable
Possible Values    Countable, distinct values (e.g., 0, 1, 2, 3)   Uncountable, infinite values (e.g., 1.5, 2.75, 3.14)
Discrete random variables are often used in situations where the data is inherently countable.
Some applications include:
Continuous random variables are used when the data can take any value within a range. Some
applications include:
Finance: Modeling stock prices, where prices can take any real number within a range.
Environmental Science: Measuring pollutants in the air, which can take any value within a
given concentration range.
Probability density function
The Probability Density Function describes how probability is spread over the values of a continuous
random variable, and it tends to receive less attention than discrete probability in most courses of study.
However, this function is very useful in many areas of real life, such as predicting rainfall, financial
modelling (for example, the stock market), and studying income disparity in the social sciences.
This section explores the Probability Density Function in detail, including its definition, the conditions
required for such a function to exist, and various examples.
The Probability Density Function is used for calculating probabilities for continuous random variables.
Differentiating the cumulative distribution function (CDF) gives the probability density function (PDF),
and integrating the PDF gives back the CDF; both functions represent the probability distribution of a
continuous random variable, and the PDF is defined over a specific range.
Probability density function is the function that represents the density of probability for a continuous
random variable over the specified ranges.
Probability Density Function is abbreviated as PDF and for a continuous random variable X, Probability
Density Function is denoted by f(x).
PDF of the random variable is obtained by differentiating CDF (Cumulative Distribution Function) of X.
The probability density function should be non-negative for all possible values of the variable, and the
total area between the density curve and the x-axis should be equal to 1.
Let X be the continuous random variable with probability density function f(x). For f(x) to be a valid
probability density function it should satisfy the conditions below:
f(x) ≥ 0, ∀ x ∈ R
∫ f(x) dx = 1, where the integral is taken over the whole real line
So, the PDF should be a non-negative and piecewise continuous function whose total integral evaluates
to 1.
Example of a Probability Density Function
Let X be a continuous random variable and the probability density function pdf is given by f(x) = x – 1 ,
0 < x ≤ 5. We have to find P (1 < x ≤ 2).
To find the probability P(1 < x ≤ 2) we integrate the pdf f(x) = x – 1 with the limits 1 and 2:
P(1 < x ≤ 2) = ∫₁² (x − 1) dx = [x²/2 − x]₁² = (2 − 2) − (1/2 − 1) = 0.5
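The same probability can be checked in R with numeric integration via integrate(); a brief sketch:
f <- function(x) x - 1                 # the given density expression
integrate(f, lower = 1, upper = 2)     # 0.5, with a small numerical error bound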
Let Y be a continuous random variable and F(y) be the cumulative distribution function (CDF) of Y.
Then, the probability density function (PDF) f(y) of Y is obtained by differentiating the CDF of Y.
If we want to calculate the probability for X lying between the interval a and b, then we can use the
following formula:
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx = F(b) − F(a)
A Probability Density Function (PDF) is a function that describes the likelihood of a continuous random
variable taking on a particular value. Unlike discrete random variables, where probabilities are assigned
to specific outcomes, continuous random variables can take on any value within a range. Probability
Density Function (PDF) tells us:
Relative Likelihood: how likely the variable is to fall near a particular value.
Distribution Shape: how the probability is spread across the range of possible values.
How to Find Probability from Probability Density Function
To find the probability from the probability density function we have to follow some steps.
Step 1: First check whether the PDF is valid or not using the necessary conditions.
Step 2: If the PDF is valid, write the required probability and its limits using the formula.
Step 3: Integrate the PDF between the given limits to obtain the required probability.
If X is a continuous random variable and f(x) is its probability density function, the probability for the
random variable is given by the area under the pdf curve. The graph of a PDF often looks like a bell
curve, with the probability of X given by the area below the curve. The following graph gives the
probability for X lying between the interval a and b.
Let f(x) be the probability density function for continuous random variable x. Following are some
probability density function properties:
f(x) ≥ 0, ∀ x ∈ R
Total area under probability density curve is equal to 1.
For a continuous random variable X, the end values of the interval can be ignored when calculating
probabilities, i.e., for X lying between the interval a and b, P(a ≤ X ≤ b) = P(a < X < b).
The probability of a continuous random variable taking any single exact value is zero.
The probability density function is defined over the domain of the variable, i.e., over the range of
continuous values that the variable can take.
Mean of the probability density function refers to the average value of the random variable. The mean is
also called the expected value or expectation. It is denoted by μ or E[X], where X is the random variable.
Mean of the probability density function f(x) for the continuous random variable X is given by:
μ = E[X] = ∫ x ⋅ f(x) dx
Median is the value which divides the probability density function graph into two equal halves. If x = M
is the median, then the area under the curve from −∞ to M and the area under the curve from M to ∞ are
equal, each being 1/2.
Variance of the probability density function refers to the expected squared deviation from the mean of a
random variable. It is denoted by Var(X), where X is the random variable.
Variance of the probability density function f(x) for the continuous random variable X is given by:
Var(X) = E[(X − μ)²] = ∫ (x − μ)² ⋅ f(x) dx
Standard Deviation of Probability Density Function
Standard Deviation is the square root of the variance. It is denoted by σ and is given by:
σ = √Var(X)
The key differences between Probability Density Function (PDF) and Cumulative Distribution Function
(CDF) are listed in the following table:
Aspect             Probability Density Function (PDF)                                          Cumulative Distribution Function (CDF)
Area Under Curve   The area under the PDF curve between a and b gives P(a ≤ X ≤ b).            The value of the CDF at x gives P(X ≤ x) directly.
Properties         The PDF is always non-negative: f(x) ≥ 0 for all x. The total area under    The CDF is a monotonically increasing function: F(x1) ≤ F(x2) if x1 ≤ x2,
                   the PDF curve is equal to 1.                                                and 0 ≤ F(x) ≤ 1 for all x.
Normal example     f(x) = (1/(σ√(2π))) ⋅ e^(−(x−μ)²/(2σ²))                                     F(x) = (1/2)[1 + erf((x−μ)/(σ√2))]
Types of Probability Density Function
Uniform Distribution
Binomial Distribution
Normal Distribution
Chi-Square Distribution
The PDF is the function defined for single variable whereas joint PDF is the function defined for two or
more than two variables, and other key differences between these both concepts are listed in the
following table:
Applications of Probability Density Function
Probability density functions are used in statistics for calculating probabilities for random
variables.
Mathematical Expectation and their Theorem
Because random variables are random, knowing the outcome on any one realisation of the
random process is not possible. Instead, we can talk about what we might expect to happen, or
what might happen on average.
This is the idea of mathematical expectation. In its simplest form, the mathematical expectation of a
random variable is the mean of the random variable, computed from its probability distribution.
Mathematical expectation goes far beyond just computing means, but we begin here as the idea
of a mean is easily understood.
The definition looks different in detail for discrete and continuous random variables, but the
intention is the same.
Definition 3.1 (Expectation) The expectation or expected value (or mean) of a random
variable X is defined as
E(X) = Σ x ⋅ pX(x) in the discrete case, and E(X) = ∫ x ⋅ fX(x) dx in the continuous case,
where the sum or integral is taken over RX, the range of X.
Effectively E(X) is a weighted average of the points in RX, the weights being the probabilities in
the discrete case and probability densities in the continuous case.
Example - (Expectation for continuous variables) Consider a continuous random
variable X with pdf
Example - (Expectation for a coin toss) Consider tossing a coin once and counting the number
of tails. Let this random variable be T. The probability function is
P(T = 0) = 1/2 and P(T = 1) = 1/2, so E(T) = 0 × 1/2 + 1 × 1/2 = 0.5.
Of course, 0.5 tails can never actually be observed in practice on one toss. But it would be
silly to round up (or down) and say that the expected number of tails on one toss of a coin is one
(or zero). The expected value of 0.5 simply means that over a large number of repeats of this
random process, we expect a tail to occur in half of those repeats.
Example (Mean not defined) Consider the distribution of Z, with the probability density
function
Expectation of a function of a random variable
Let X be a discrete random variable with a probability function pX(x), or a continuous random
variable with pdf fX(x). Also assume g(X) is a real-valued function of X. We can then define the
expected value of g(X).
Definition (Expectation for function of a random variable) The expected value of some
function g(⋅) of a random variable X is:
E[g(X)] = Σ g(x) ⋅ pX(x) in the discrete case, and E[g(X)] = ∫ g(x) ⋅ fX(x) dx in the continuous case.
Unit 3
Distribution
Topics
Distribution
Types of Data distribution
Exponential distribution
Binomial distribution
Normal distribution
Poisson distribution
Random number generation
Monte Carlo Simulation.
Distribution
In statistics and probability, the term distribution refers to the way in which values or
observations are spread or arranged across a range of possibilities. It describes how probabilities
or frequencies of different outcomes are distributed in a dataset or a random variable. A
distribution can provide insights into the characteristics of a dataset, such as its central tendency,
dispersion, shape, and outliers.
Basic Definitions
A distribution is essentially a function or a set of rules that assigns probabilities to the possible
outcomes of a random variable. A random variable is a quantity that can take on different values
due to randomness or uncertainty.
1. Discrete Distribution: This type of distribution is used when the set of possible
outcomes is countable (e.g., number of heads when flipping a coin).
2. Continuous Distribution: This type is used when outcomes can take any value within a
range (e.g., height of people, temperature, time).
1. Probability Distribution
A probability distribution for a continuous random variable is usually described by its probability
density function (PDF) and its cumulative distribution function (CDF). The PDF describes the relative
likelihood of different values, but to get the actual probability, one has to integrate the PDF over an
interval. The CDF, F(x) = P(X ≤ x), provides the probability that X will be less than or equal to a
particular value, helping to understand the spread of values in a more intuitive way.
Moments of Distribution
The moments of a distribution are statistical measures that describe different characteristics of a
distribution. They include:
Mean (First Moment): The average or expected value of the distribution. It provides a
measure of the central tendency.
Variance (Second Moment): The expected squared deviation from the mean. It provides a
measure of the spread or dispersion of the distribution.
Skewness (Third Moment): It measures the asymmetry of the distribution around the
mean. Positive skewness indicates that the right tail is longer, while negative skewness
indicates a longer left tail.
Normal Distribution
The normal distribution, often called the Gaussian distribution, is one of the most important
continuous probability distributions. It is symmetric about the mean and characterized by its
bell-shaped curve. The standard normal distribution has a mean of 0 and a standard deviation of
1.
Many natural phenomena, such as heights, test scores, and measurement errors, follow a normal
distribution.
Uniform Distribution
In a uniform distribution, all outcomes are equally likely. For a discrete uniform distribution,
each of the possible values has the same probability. For a continuous uniform distribution, the
probability density function is constant over the range of possible outcomes.
Binomial Distribution
The binomial distribution is a discrete probability distribution that models the number of
successes in a fixed number of independent trials, where each trial has two possible outcomes
(success or failure) and a constant probability of success p.
Exponential Distribution
The exponential distribution is often used to model the time between events in a Poisson
process. It is characterized by a constant hazard rate, which makes it a good model for waiting
times.
Normal Distribution Example: Heights of adult males in a population tend to follow a normal
distribution. Most individuals will have a height around the mean (e.g., 170 cm), with fewer
individuals being extremely short or tall. The mean height and standard deviation can describe
the entire distribution of heights.
Binomial Distribution Example: Suppose a factory tests a batch of 100 items, and the
probability of an item being defective is 0.05. The number of defective items in this batch
follows a binomial distribution with parameters n=100 and p=0.05.
Exponential Distribution Example: The time between arrivals of buses at a bus stop might
follow an exponential distribution. If buses arrive on average every 10 minutes, the time
between arrivals would follow an exponential distribution with a rate λ=1/10.
Uniform Distribution Example: Rolling a fair six-sided die is an example of a discrete uniform
distribution. Each outcome (1, 2, 3, 4, 5, or 6) has an equal probability of 1/6. Another example
is selecting a random number between 0 and 1, which would follow a continuous uniform
distribution over that interval.
1. Normal Distribution
The normal distribution is one of the most well-known and widely used continuous probability
distributions. It is often referred to as the Gaussian distribution after Carl Friedrich Gauss, who
introduced it in the early 19th century.
Properties
Symmetry: The normal distribution is symmetric about the mean. This symmetry results
in the mean, median, and mode being equal.
Bell-Shaped Curve: The distribution has a characteristic bell-shaped curve, where the
majority of data points are concentrated around the mean, and the tails taper off
symmetrically on either side.
Mean and Standard Deviation: The distribution is fully described by its mean μ
and standard deviation σ. The spread or width of the distribution is determined
by the standard deviation.
68-95-99.7 Rule: In a normal distribution, approximately 68% of the data lies within one
standard deviation of the mean, 95% within two standard deviations, and 99.7% within
three standard deviations.
Example
An example of a normal distribution is the distribution of IQ scores. The mean IQ score is
typically set at 100, with a standard deviation of 15. Most people have an IQ score close to 100,
and fewer individuals have extremely high or low scores.
2. Binomial Distribution
The binomial distribution is a discrete probability distribution that represents the number of
successes in a fixed number of independent trials, where each trial has only two possible
outcomes: success or failure. This distribution is particularly useful in situations involving a
sequence of independent yes/no experiments.
Properties
Number of Trials: The binomial distribution requires a fixed number of trials n.
Success Probability: Each trial has the same probability of success p.
Independent Trials: The outcome of each trial is independent of the others.
Discrete: The outcomes are discrete values representing the number of successes in the
trials.
Probability Mass Function (PMF)
The PMF of the binomial distribution is given by:
P(X = k) = nCk ⋅ p^k ⋅ (1 − p)^(n−k), for k = 0, 1, …, n
Example
An example of a binomial distribution is the number of heads in 10 coin flips, where each flip
has a 50% chance of landing heads. Here, n=10, p=0.5 and k is the number of heads observed in
10 flips.
3. Poisson Distribution
The Poisson distribution is a discrete probability distribution used to model the number of
events occurring within a fixed interval of time or space. It is particularly useful for modeling
rare events that occur independently of each other.
Properties
Rare Events: The distribution is used for rare events or occurrences.
Single Parameter: The Poisson distribution is characterized by a single parameter λ,
which represents both the mean and variance of the distribution.
Non-Negative Integers: The distribution models the probability of observing a non-
negative integer number of events.
Memoryless Property: The Poisson process is memoryless, meaning that the occurrence
of an event in one interval does not affect the probability of an event in another interval.
Probability Mass Function (PMF)
The PMF of the Poisson distribution is given by:
P(X = x) = (e^(−λ) ⋅ λ^x) / x!, for x = 0, 1, 2, …
Example
An example of a Poisson distribution is the number of emails a person receives per hour. If a
person receives, on average, 5 emails per hour, then the number of emails received in a given
hour follows a Poisson distribution with λ=5.
4. Uniform Distribution
The uniform distribution is a type of probability distribution where all outcomes are equally
likely. There are two types: discrete uniform distribution and continuous uniform distribution.
Properties
Equally Likely Outcomes: In a uniform distribution, every outcome has the same
probability of occurring.
Constant Probability: The PDF or PMF is constant across the range of possible
outcomes.
Range: The distribution is defined over a specific range [a, b], and all values
within this range have the same probability.
Probability Density Function (PDF)
For a continuous uniform distribution over the interval [a, b] the PDF is given by:
f(x) = 1/(b − a), for a ≤ x ≤ b, and 0 otherwise
Example
An example of a discrete uniform distribution is the rolling of a fair die: each of the six sides
(1 through 6) has an equal probability of appearing. An example of a continuous uniform
distribution is a random number drawn from the interval [0, 1], where every value in the interval
is equally likely.
5. Exponential Distribution
The exponential distribution is a continuous probability distribution often used to model the
time between events in a Poisson process. It describes the waiting time between independent
events that occur at a constant average rate.
Properties
Memoryless: Similar to the Poisson process, the exponential distribution is
memoryless, meaning the probability of an event occurring in the future does not
depend on how much time has already passed.
Mean and Variance: The mean and variance of the exponential distribution are related
to the rate parameter λ. The mean is 1/λ, and the variance is 1/λ².
Positive Values: The distribution only takes positive values, as it models time
intervals.
Probability Density Function (PDF)
The PDF of the exponential distribution is given by:
f(x) = λ ⋅ e^(−λx), for x ≥ 0
Where:
λ is the rate parameter and x is the waiting time.
Example
An example of an exponential distribution is the time between arrivals of buses at a bus stop. If
buses arrive on average every 10 minutes, the time between consecutive bus arrivals follows an
exponential distribution with λ=1/10.
6. Chi-Square Distribution
The chi-square distribution is a continuous probability distribution that arises in statistical
hypothesis testing, particularly in the context of testing the goodness-of-fit or independence
between categorical variables.
Properties
Non-Negative Values: The chi-square distribution only takes positive values, as it
involves the sum of squared terms.
Degrees of Freedom: The shape of the chi-square distribution depends on the degrees of
freedom k, which is related to the number of independent variables in the analysis.
Right-Skewed: The chi-square distribution is skewed to the right, with more degrees of
freedom leading to a distribution that becomes more symmetric.
Probability Density Function (PDF)
The PDF of the chi-square distribution with k degrees of freedom is given by:
f(x) = x^(k/2 − 1) ⋅ e^(−x/2) / (2^(k/2) ⋅ Γ(k/2)), for x > 0
Example
An example of a chi-square distribution is in hypothesis testing for categorical data. If we want
to test whether the observed frequencies of different categories match the expected frequencies,
we use the chi-square distribution to evaluate the goodness-of-fit.
Exponential Distribution
Exponential Distribution
The Exponential Distribution is another important distribution and is typically used to model
times between events or arrivals. The distribution has one parameter, λ which is assumed to be
the average rate of arrivals or occurrences of an event in a given time interval.
If the random variable X follows an Exponential distribution then we write: X~Exp(λ).
The probability density function is:
f(x) = λ ⋅ e^(−λx), for x ≥ 0
Example
It is assumed that the average time customers spends on hold when contacting a gas company's
call centre is five minutes. The company has a policy that if a customer waits for longer than 15
minutes they are entitled to claim £5 off their next quarterly bill.
If the company employs a new team, at some expense, then the average waiting time is reduced
to four minutes.
The director of the company must decide whether or not to employ a new team. He thinks the
idea is only worthwhile if the probability that a customer waits for longer than 15
minutes is reduced by at least 0.025.
This situation can be modelled using Exponential distributions: one for waiting times (times on
hold) under the current team and one for waiting times under a new team.
With the current team the mean waiting time is 5 minutes and so the mean rate of calls
answered per minute is given by λ1 = 1/5 = 0.2. The corresponding Exponential distribution
is Exp(0.2).
Similarly, with a new team, we have λ2=1/4=0.25 and so the corresponding Exponential
distribution is Exp(0.25).
Determine whether the director should employ a new team or keep his current team.
Solution
Let X denote the waiting time of a customer under the current team. We know that X follows an
Exponential distribution with parameter λ1=0.2 so we have X∼Exp(0.2). Now we need to
calculate P(X > 15) = e^(−0.2 × 15) = e^(−3) ≈ 0.0498.
So, the probability that a customer waits for longer than 15 minutes is 0.050 (to 3 d.p.).
Now we consider the probability of this event under a new team. Let Y denote the time a
customer waits under a new team. We know that Y follows an Exponential distribution with
parameter λ2 = 0.25, so we have Y∼Exp(0.25). Thus,
P(Y > 15) = e^(−0.25 × 15) = e^(−3.75) ≈ 0.024 (to 3 d.p.).
Recall that the director of the company would only opt for recruiting a new team if the
probability that a customer waits longer than 15 minutes is reduced by at least 0.025.
The reduction is 0.050 − 0.024 = 0.026. Since 0.026 > 0.025, the director of the company should
recruit a new team.
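The two tail probabilities can be computed in R with pexp(); a short sketch of the director's comparison:
p_current <- 1 - pexp(15, rate = 0.2)    # P(X > 15) with the current team, about 0.0498
p_new     <- 1 - pexp(15, rate = 0.25)   # P(Y > 15) with a new team, about 0.0235
p_current - p_new                        # reduction, about 0.026, which exceeds 0.025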
Binomial Distribution
In probability theory and statistics, the binomial distribution is the discrete probability
distribution that gives only two possible results in an experiment, either Success or Failure. For
example, if we toss a coin, there could be only two possible outcomes: heads or tails, and if any
test is taken, then there could be only two results: pass or fail. This distribution is also called a
binomial probability distribution.
There are two parameters, n and p, used here in a binomial distribution. The variable 'n' states the
number of times the experiment runs and the variable 'p' is the probability of success in any one
trial. Suppose a die is thrown randomly 10 times; then the probability of getting a 2 on any one
throw is ⅙. When you throw the die 10 times, you have a binomial distribution with n = 10 and
p = ⅙.
The binomial distribution formula is:
P(x; n, p) = nCx ⋅ p^x ⋅ q^(n−x)
Where,
n = the number of experiments
x = 0, 1, 2, 3, 4, …
p = Probability of Success in a single experiment
q = Probability of Failure in a single experiment = 1 – p
The binomial distribution formula can also be written in the form of n Bernoulli trials,
where nCx = n!/[x!(n−x)!]. Hence,
P(x; n, p) = {n!/[x!(n−x)!]} ⋅ p^x ⋅ q^(n−x)
Mean, μ = np
Variance, σ² = npq
Standard Deviation, σ = √(npq)
Where p is the probability of success
q is the probability of failure, where q = 1-p
Example 2: For the same question given above, find the probability of:
a) Getting at most 2 heads
Solution: P(at most 2 heads) = P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
P(X = 0) = (½)^5 = 1/32
P(X = 1) = 5C1 (½)^5 = 5/32
P(X = 2) = 5C2 p² q^(5−2) = [5!/(2! 3!)] × (½)² × (½)³ = 5/16
Therefore,
P(X ≤ 2) = 1/32 + 5/32 + 5/16 = ½
Example 3:
A fair coin is tossed 10 times. What is the probability of getting exactly 6 heads and of getting at least six
heads.
Solution:
Let x denote the number of heads in an experiment.
Here, the number of times the coin tossed is 10. Hence, n=10.
The probability of getting a head, p = ½.
The probability of getting a tail, q = 1 − p = 1 − ½ = ½.
The binomial distribution is given by the formula:
P(X = x) = nCx ⋅ p^x ⋅ q^(n−x), where x = 0, 1, 2, 3, …
Therefore, P(X = x) = 10Cx (½)^x (½)^(10−x) = 10Cx (½)^10
(i) The probability of getting exactly 6 heads is:
P(X = 6) = 10C6 (½)^10
P(X = 6) = 105/512.
Hence, the probability of getting exactly 6 heads is 105/512.
(ii) The probability of getting at least 6 heads is P(X ≥ 6)
P(X ≥ 6) = P(X = 6) + P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10)
P(X ≥ 6) = 10C6(½)^10 + 10C7(½)^10 + 10C8(½)^10 + 10C9(½)^10 + 10C10(½)^10
P(X ≥ 6) = 193/512.
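These binomial probabilities can be reproduced directly in R with dbinom() and pbinom(); a short sketch for the coin-flip examples:
dbinom(6, size = 10, prob = 0.5)         # P(X = 6) = 105/512, about 0.205
1 - pbinom(5, size = 10, prob = 0.5)     # P(X >= 6) = 193/512, about 0.377
sum(dbinom(0:2, size = 5, prob = 0.5))   # P(X <= 2) for the earlier 5-flip example, 0.5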
Example 4 :
In a box of floppy discs it is known that 95% will work. A sample of three of the discs is selected at
random.
Find the probability that (a) none (b) 1, (c) 2, (d) all 3 of the sample will work.
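A sketch of this example in R, where the number of working discs in the sample follows Binomial(n = 3, p = 0.95):
dbinom(0:3, size = 3, prob = 0.95)
# gives P(0), P(1), P(2), P(3) discs working:
# 0.000125, 0.007125, 0.135375, 0.857375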
Normal Distribution
In probability theory and statistics, the Normal Distribution, also called the Gaussian
Distribution, is the most significant continuous probability distribution. Sometimes it is also
called a bell curve. A large number of random variables are either nearly or exactly represented
by the normal distribution, in every physical science and in economics. Furthermore, it can
be used to approximate other probability distributions, which supports the use of the word
'normal', as it is the distribution most commonly encountered.
The probability density function of the normal distribution is:
f(x) = (1/(σ√(2π))) ⋅ e^(−(x − μ)² / (2σ²))
Where,
x is the variable
μ is the mean
σ is the standard deviation
The area under the curve gives the probability that the variable lies within a particular range for a given
experiment. The probability density function can be evaluated directly by just providing the mean and
standard deviation values.
Example: Using the empirical rule in a normal distribution - You collect SAT scores from
students in a new test preparation course. The data follows a normal distribution with a mean
score (M) of 1150 and a standard deviation (SD) of 150.
Following the empirical rule:
Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and below
the mean.
Around 95% of scores are between 850 and 1,450, 2 standard deviations above and below the
mean.
Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and below
the mean.
The empirical rule is a quick way to get an overview of your data and check for any outliers or
extreme values that don’t follow this pattern.
If data from small samples do not closely follow this pattern, then other distributions like the t-
distribution may be more appropriate. Once you identify the distribution of your variable, you
can apply appropriate statistical tests.
When you take repeated samples from a population and compute the mean of each, the sampling
distribution of the mean is the distribution of the means of these different samples.
The central limit theorem shows the following:
Law of Large Numbers: As you increase sample size (or the number of samples), then the
sample mean will approach the population mean.
With multiple large samples, the sampling distribution of the mean is normally distributed,
even if your original variable is not normally distributed.
Parametric statistical tests typically assume that samples come from normally distributed
populations, but the central limit theorem means that this assumption isn’t necessary to meet
when you have a large enough sample.
You can use parametric tests for large samples from populations with any kind of distribution as
long as other important assumptions are met. A sample size of 30 or more is generally
considered large.
For small samples, the assumption of normality is important because the sampling distribution of
the mean isn’t known. For accurate results, you have to be sure that the population is normally
distributed before you can use parametric tests with small samples.
Formula of the normal curve
Once you have the mean and standard deviation of a normal distribution, you can fit a normal
curve to your data using a probability density function.
In a probability density function, the area under the curve tells you probability. The normal
distribution is a probability distribution, so the total area under the curve is always 1 or 100%.
The formula for the normal probability density function looks fairly complicated. But to use it,
you only need to know the population mean and standard deviation.
For any value of x, you can plug in the mean and standard deviation into the formula to find the
probability density of the variable taking on that value of x.
Example: Using the probability density function. You want to know the probability that SAT
scores in your sample exceed 1380.
On your graph of the probability density function, the probability is the shaded area under the
curve that lies to the right of where your SAT scores equal 1380.
Every normal distribution can be seen as a version of the standard normal distribution that has been
stretched or squeezed and moved horizontally right or left.
While individual observations from normal distributions are referred to as x, they are referred to
as z in the z-distribution. Every normal distribution can be converted to the standard normal
distribution by turning the individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-score
of a value.
We convert normal distributions into the standard normal distribution for several reasons:
To find the probability of observations in a distribution falling above or below a given value.
To find the probability that a sample mean significantly differs from a known population
mean.
To compare scores on different distributions with different means and standard deviations.
Finding probability using the z-distribution
Each z-score is associated with a probability, or p-value, that tells you the likelihood of values
below that z-score occurring. If you convert an individual value into a z-score, you can then find
the probability of all values up to that value occurring in a normal distribution.
Example: Finding probability using the z-distribution To find the probability of SAT scores in
your sample exceeding 1380, you first find the z-score.
The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you
how many standard deviations away 1380 is from the mean.
z = (x − μ)/σ = (1380 − 1150)/150 ≈ 1.53
For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or
less (93.7%), and it’s the area under the curve left of the shaded area.
To find the shaded area, you take away 0.937 from 1, which is the total area under the curve.
Probability of x > 1380 = 1 – 0.937 = 0.063
That means it is likely that only 6.3% of SAT scores in your sample exceed 1380.
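In R, the same tail probability can be computed either from the raw score with pnorm() or from the z-score; a minimal sketch:
1 - pnorm(1380, mean = 1150, sd = 150)   # about 0.063
z <- (1380 - 1150) / 150                 # z is about 1.53
1 - pnorm(z)                             # same result using the standard normal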
Normal Distribution Table
The table here shows the area from 0 to Z-value.
Question 2: If the value of random variable is 2, mean is 5 and the standard deviation is 4,
then find the probability density function of the Gaussian distribution.
Solution: Given,
Variable, x = 2
Mean = 5 and
Standard deviation = 4
By the formula of the probability density of the normal distribution, we can write:
f(2; 5, 4) = 1/(4√(2π)) ⋅ e^(−(2 − 5)²/(2 × 4²)) = 1/(4√(2π)) ⋅ e^(−9/32)
f(2; 5, 4) ≈ 0.0753
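The density value can be checked in R with dnorm(); a one-line sketch:
dnorm(2, mean = 5, sd = 4)   # about 0.0753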
There are two main parameters of normal distribution in statistics namely mean and standard
deviation. The location and scale parameters of the given normal distribution can be estimated
using these two parameters.
In a normal distribution, the mean, median and mode are equal.(i.e., Mean = Median=
Mode).
The total area under the curve should be equal to 1.
The normally distributed curve should be symmetric at the centre.
Exactly half of the values are to the right of the centre and exactly half of the values are to the
left of the centre.
The normal distribution should be defined by the mean and standard deviation.
The normal distribution curve must have only one peak. (i.e., Unimodal)
The curve approaches the x-axis but never touches it, extending indefinitely farther away from the
mean.
Applications
The normal distributions are closely associated with many things such as:
Marks scored on the test
Heights of different persons
Size of objects produced by the machine
Blood pressure and so on.
Examples
It is good to know the standard deviation, because we can say that any value is:
likely to be within 1 standard deviation (68 out of 100 should be)
very likely to be within 2 standard deviations (95 out of 100 should be)
almost certainly within 3 standard deviations (997 out of 1000 should be)
Poisson Distribution
Poisson distribution is a theoretical discrete probability distribution, and its probability function is known
as the Poisson distribution probability mass function. It is used to find the probability of an independent
event occurring in a fixed interval of time with a constant mean rate. The Poisson probability mass
function can also be used over other fixed intervals such as volume, area, distance, etc. A Poisson random
variable describes a phenomenon well when there are few successes over many trials. The Poisson
distribution is used as a limiting case of the binomial distribution when the number of trials is indefinitely
large. If a Poisson distribution models the same binomial phenomenon, λ is replaced by np. The Poisson
distribution is named after the French mathematician Siméon Denis Poisson.
Let us try to understand this with an example: a customer care center receives 100 calls per hour,
8 hours a day. As we can see, the calls are independent of each other. The probability of the
number of calls per minute has a Poisson probability distribution. There can be any number of
calls per minute irrespective of the number of calls received in the previous minute. Below is the
curve of the probabilities for a fixed value of λ of a function following Poisson distribution:
Probability Mass Function for Poisson distribution
If we are to find the probability that more than 150 calls could be received per hour, the call
center could improve its standards on customer care by employing more services and catering to
the needs of its customers, based on the understanding of the Poisson distribution.
Poisson distribution formula is used to find the probability of an event that happens
independently, discretely over a fixed time period, when the mean rate of occurrence is constant
over time. The Poisson distribution formula is applied when there is a large number of possible
outcomes. For a random discrete variable X that follows the Poisson distribution, and λ is the
average rate of value, then the probability of x is given by:
f(x) = P(X = x) = (e^(−λ) ⋅ λ^x)/x!
Where
x = 0, 1, 2, 3...
e is Euler's number (e ≈ 2.718)
λ is the average rate of the expected value and λ = variance, also λ > 0
Poisson Distribution Mean and Variance
For Poisson distribution, which has λ as the average rate, for a fixed interval of time, then the
mean of the Poisson distribution and the value of variance will be the same. So for X following
Poisson distribution, we can say that λ is the mean as well as the variance of the distribution.
Hence: E(X) = V(X) = λ
where
E(X) is the expected mean
V(X) is the variance
λ>0
The Poisson distribution is applicable in events that have a large number of rare and independent
possible events. The following are the properties of the Poisson Distribution. In the Poisson
distribution,
The events are independent.
Only the average number of successes occurring in the given period of time is known, and no two
events can occur at exactly the same instant.
The Poisson distribution is the limiting case of the binomial distribution when the number of trials n is
indefinitely large and np = λ is finite, where λ is constant.
mean = variance = λ
The standard deviation is always equal to the square root of the mean μ.
The exact probability that the random variable X with mean μ takes the value a is given by
P(X = a) = (μ^a ⋅ e^(−μ)) / a!
If the mean is large, then the Poisson distribution is approximately a normal distribution.
Applications of Poisson Distribution
There are various applications of the Poisson distribution. The random variables that follow a
Poisson distribution are as follows:
To count the number of defects of a finished product
To count the number of deaths in a country by any disease or natural calamity
To count the number of infected plants in the field
To count the number of bacteria in the organisms or the radioactive decay in atoms
To calculate the waiting time between the events.
Important Notes
The formula for the Poisson distribution is f(x) = P(X = x) = (e^(−λ) ⋅ λ^x)/x!.
For the Poisson distribution, λ is always greater than 0.
For Poisson distribution, the mean and the variance of the distribution are equal.
Poisson Distribution Examples
An example to find the probability using the Poisson distribution is given below:
Example 1:
A random variable X has a Poisson distribution with parameter λ such that P (X = 1) = (0.2) P (X
= 2). Find P (X = 0).
Solution:
For the Poisson distribution, the probability function is defined as:
P(X = x) = (e^(−λ) ⋅ λ^x)/x!, where λ is a parameter.
Given that, P(X = 1) = (0.2) P(X = 2)
(e^(−λ) ⋅ λ^1)/1! = (0.2)(e^(−λ) ⋅ λ^2)/2!
⇒ λ = λ²/10
⇒ λ = 10
Now, substituting λ = 10 in the formula, we get:
P(X = 0) = (e^(−λ) ⋅ λ^0)/0!
P(X = 0) = e^(−10) ≈ 0.0000454
Thus, P(X = 0) ≈ 0.0000454
Example 2:
Telephone calls arrive at an exchange according to the Poisson process at a rate λ= 2/min.
Calculate the probability that exactly two calls will be received during each of the first 5 minutes
of the hour.
Solution:
Assume that "N" is the number of calls received during a 1-minute period.
Therefore,
P(N = 2) = (e^(−2) ⋅ 2²)/2!
P(N = 2) = 2e^(−2)
Now, let "M" be the number of minutes among the 5 minutes considered during which exactly 2 calls
are received. Thus "M" follows a binomial distribution with parameters n = 5 and p = 2e^(−2).
P(M = 5) = (2e^(−2))^5 = 32 × e^(−10)
P(M = 5) ≈ 0.00145, where "e" is a constant approximately equal to 2.718.
Poisson Distribution Examples
Example 1: In a cafe, the customer arrives at a mean rate of 2 per min. Find the probability of
arrival of 5 customers in 1 minute using the Poisson distribution formula.
Solution:
Given: λ = 2, and x = 5.
Using the Poisson distribution formula:
P(X = x) = (e^(−λ) ⋅ λ^x)/x!
P(X = 5) = (e^(−2) ⋅ 2^5)/5!
P(X = 5) ≈ 0.036
Answer: The probability of the arrival of 5 customers per minute is about 3.6%.
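dpois() evaluates the Poisson probability mass function directly; a one-line sketch for the cafe example:
dpois(5, lambda = 2)   # P(X = 5) when the mean rate is 2 per minute, about 0.036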
Example 2: Find the mass probability of function at x = 6, if the value of the mean is 3.4.
Solution:
Given: λ = 3.4, and x = 6.
Using the Poisson distribution formula:
P(X = x) = (e^(−λ) ⋅ λ^x)/x!
P(X = 6) = (e^(−3.4) ⋅ 3.4^6)/6!
P(X = 6) ≈ 0.072
Answer: The probability mass function at x = 6 is about 0.072, i.e., 7.2%.
Example 3: 3% of the electronic units manufactured by a company are defective. Find the
probability that in a sample of 200 units, fewer than 2 units are defective.
Solution:
The probability of defective units p = 3/100 = 0.03
Given, n = 200.
We observe that p is small and n is large here. Thus it is a Poisson distribution.
Mean λ= np = 200 × 0.03 = 6
P(X = x) is given by the Poisson Distribution Formula as (e^(−λ) ⋅ λ^x)/x!
P(X < 2) = P(X = 0) + P(X = 1)
= (e^(−6) ⋅ 6^0)/0! + (e^(−6) ⋅ 6^1)/1!
= e^(−6) + 6e^(−6)
= 0.00248 + 0.01487
P(X < 2) ≈ 0.0174
Answer: The probability that fewer than 2 units are defective is approximately 0.0174
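In R, the Poisson approximation can be compared with the exact binomial answer using ppois() and pbinom(); a brief sketch:
ppois(1, lambda = 6)                 # P(X < 2) under the Poisson approximation, about 0.017
pbinom(1, size = 200, prob = 0.03)   # exact binomial probability, which should be close to the value above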
Random number generation
Random number generation is the process of producing one or more random numbers from a chosen
range (for example, 0 to 10,000), positive or negative, with repeats or with no repeats.
About Random Number Generators
There are two main types of random number generators: pseudo-random and true random.
A pseudo-random number generator (PRNG) is typically programmed using a randomizing
math function to select a "random" number within a set range. These random number generators
are pseudo-random because the computer program or algorithm may have unintended selection
bias. In other words, randomness from a computer program is not necessarily an organic, truly
random event.
A true random number generator (TRNG) relies on randomness from a physical event that is
external to the computer and its operating system. Examples of such events are blips in
atmospheric noise, or points at which a radioactive material decays. A true random number
generator receives information from these types of unpredictable events to produce a truly
random number.
Most software tools, including R's built-in random number functions, use a randomizing computer
program to produce random numbers, so they are pseudo-random number generators.
How to Generate Random Numbers
1. What is your range? Set a minimum number and a maximum number. The random number(s)
generated are selected from your range of numbers, with the min and max numbers included.
2. How many numbers? Specify how many random numbers to generate.
3. Allow repeats? If you choose No your random numbers will be unique and there is no chance
of getting a duplicate number. If you choose Yes the random number generator may produce
a duplicate number in your set of numbers.
4. Sort numbers? You can decide not to sort your random numbers. You can also order your
random numbers ascending, lowest to highest or descending, highest to lowest.
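R's built-in pseudo-random number generator covers the same options; the sketch below draws numbers from a custom range with or without repeats and sets a seed so the draws are reproducible.
set.seed(123)                               # make the pseudo-random draws reproducible
sample(0:10000, size = 5)                   # 5 unique integers from 0 to 10,000 (no repeats)
sample(0:10000, size = 5, replace = TRUE)   # repeats allowed
sort(runif(5, min = -1, max = 1))           # 5 sorted continuous values, negative or positive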
Monte Carlo Simulation
A Monte Carlo simulation is a technique that uses repeated random sampling to estimate the range of
possible outcomes of an uncertain process. The Monte Carlo method has been described as "faking it a
billion times until the reality emerges." It relies on the assumption that many random samples mimic
patterns in the total population.
Importance of Monte Carlo simulations
Monte Carlo simulations are simple conceptually but enable users to solve problems in complex
systems. They are particularly useful for long-term predictions because of their accuracy. Monte
Carlo simulations are also a good alternative to machine learning when there isn't enough data to
make a machine learning model accurate. As the number of inputs increases, so does the number
of forecasts.
They also enable accurate simulations involving randomness. For a simple example, someone
could use a Monte Carlo simulation to calculate the probability of a particular outcome -- say,
rolling a seven -- when rolling two dice. There are 36 possible combinations, and six of those
combinations add up to seven. The mathematical or expected probability of rolling a seven is
6/36, or 16.67%.
External factors, such as the shape of the dice or the surface they are rolled on, cause the actual
or experimental probability to differ from the mathematical probability. Rolling the dice
1,000 times and getting a seven on 170 of those rolls would give an experimental probability of
170/1,000, or 17%, which is close to the expected probability but not exact. Each roll is an
iteration of the Monte Carlo simulation, which gets more accurate with each iteration. This
property -- that the experimental probability gets closer to the expected probability with more
iterations -- is known as the law of large numbers.
Someone could use Microsoft Excel, IBM SPSS Statistics or a similar program to run this
experiment.
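The dice experiment is just as easy to run in R; the following sketch rolls two dice a large number of times and estimates the probability of a total of seven:
set.seed(1)
n <- 100000
rolls <- sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE)
mean(rolls == 7)   # experimental probability, close to 6/36, i.e. about 0.1667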
The 4 steps in a Monte Carlo simulation
Although they might vary from case to case, the general steps to a Monte Carlo simulation are as
follows:
1. Build the model. Determine the mathematical model or transfer algorithm.
2. Choose the variables to simulate. Pick the variables, and determine an appropriate probability
distribution for each random variable.
3. Run repeated simulations. Run the random variables through the mathematical model to
perform many iterations of the simulation.
4. Aggregate the results, and determine the mean, standard deviation and variance to check whether
the result is as expected. Visualize the results on a histogram.
Monte Carlo simulations are used across many fields; almost any field or industry that deals with an
uncertain condition has a use for it.
Industry use cases for a Monte Carlo simulation include the following:
Finance, such as risk assessment and long-term forecasting.
Project management, such as estimating the duration or cost of a project.
Engineering and physics, such as analyzing weather patterns, traffic flow or energy
distribution.
Quality control and testing, such as estimating the reliability and failure rate of a product.
Healthcare and biomedicine, such as modeling the spread of diseases.
Use cases for Monte Carlo simulations also encompass different technologies. In IT alone, there
are many use cases for Monte Carlo simulations. Some of those use cases specific to IT are the
following:
Network and system design. Monte Carlo simulations can be used to model different
designs, identify potential bottlenecks, and perform capacity planning and resource
allocation.
Artificial intelligence. Monte Carlo simulations provide the basis for resampling techniques
for estimating the accuracy of a model on a given data set.
Cybersecurity. Monte Carlo simulations can be used to simulate different cyber attacks,
evaluate the probability of them occurring, evaluate their hypothetical impact and identify
vulnerabilities in IT systems.
Performance testing. Monte Carlo simulations can be used for load testing applications and
estimating the potential impact for increased usage or scaling.
Monte Carlo simulations are used in research and real-world business applications. They are
specifically useful in research because of their ability to uncover data insights and enable the
researcher to see multiple possible outcomes. Real-world scenarios for Monte Carlo
simulations include the following:
A researcher performing a risk assessment of potential toxic chemicals in South Korean
cabbage kimchi.
A telecom service provider gauging the ability of its network to handle swells in viewer
traffic during the Olympics.
A company tracking potential price movements of a given asset to price stock options.
A random walk study of the spread of COVID-19.
A smartphone manufacturer measuring a smartphone's performance in different
temperatures.
An analyst predicting the outcomes of a presidential election.
Advantages of Monte Carlo simulations
Monte Carlo simulations are used in many different areas for a reason. They are a relatively
simple way to make complex predictions. They offer answers to hypothetical questions and
assign a certain level of order to randomness. Other advantages of Monte Carlo simulations
include the following:
Improve decision-making. Monte Carlo simulations help users make decisions with a
degree of confidence.
Solve complex problems simply. Monte Carlo simulations show both what could happen
and how likely each outcome is.
Visualize the range of possible outcomes and their likelihood of occurring. Monte Carlo
simulations make it easy to visualize what the result of a standard decision or outcome might
be next to the result of an unusual outcome.
Monte Carlo simulations draw their random inputs from probability distributions. Some distributions are
defined by set minimum and maximum values; they can be either symmetrical, where the most probable
value equals the mean and the median, or asymmetrical.
Uniform distributions. These are continuous distributions defined by known minimum and
maximum values. All outcomes have the same probability of occurring.
Lognormal distributions. These are continuous distributions defined by a mean and standard
deviation. The values are positive and create a curve that skews right.
Different probability distributions have different shapes and are suitable for different
contexts.
Exponential distributions. These continuous distributions are used to illustrate the time
between independent occurrences given the occurrence rate.
Weibull distributions. These continuous distributions can model skewed data and
approximate other distributions.
Poisson distributions. These discrete probability distributions describe the probability of a given
number of events occurring in a fixed period of time.
Discrete distributions. These discrete probability distributions help define the finite values
of all possible outcome values.
The Monte Carlo method was developed in the 1940s by Stanislaw Ulam, who shared the idea with the
mathematician John von Neumann. He asked von Neumann to run the simulation on the Electronic Numerical
Integrator and Computer (ENIAC) machine, which was one of the first computers.
The simulation was named after a casino in Monaco. The randomness in a roulette table
resembles the chance element of Monte Carlo simulations. In 1949, Ulam published the first
unclassified document describing the Monte Carlo simulation.
Assume that you are creating a work schedule for a research and development project. You
notice that some degree of uncertainty exists in the activity duration estimates. You therefore
decide to use a Monte Carlo Simulation to analyze the impact of the risks that will affect
your project.
First, you create the work schedule and estimate the duration of each activity by using the three-
point estimating technique. You estimate optimistic, pessimistic and most likely durations for
each activity as shown in the below table.
Then you calculate the expected duration of each activity by using the PERT formula:
Expected duration = (Optimistic + 4 × Most Likely + Pessimistic) / 6
After calculating the duration of each activity, the table becomes as follows
Now you run the Monte Carlo Simulation by using Excel or software and get the chances of
completion of the project.
Let’s assume that you get the results after performing the Monte Carlo Simulation. Below table
shows the results.
Note that these results are only for illustration. They are not from an actual simulation.
If you analyze the results, you will see that the probability of completing the project within the best-case
duration is the lowest, and the probability of completing it within the worst-case duration is the highest.
As can be seen from the table, this simulation provides a range of results that can improve your
decision making.
Most business situations, such as uncertainty in market demand, unknown quantity of sales,
variable costs and many others, are too complex for an analytical solution. But the Monte Carlo
Simulation enables you to evaluate your plan numerically; you can change numbers, ask 'what
if' and see the results.
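A minimal R sketch of this idea is shown below; the three activities, their optimistic/most likely/pessimistic durations, the triangular-style sampling and the 25-day target are hypothetical stand-ins, since the original table is not reproduced here.
set.seed(42)
# Hypothetical three-point estimates (in days) for three activities
opt  <- c(4, 6, 3)    # optimistic
most <- c(6, 8, 5)    # most likely
pess <- c(10, 14, 9)  # pessimistic
n <- 10000
total <- replicate(n, {
  # sample each activity duration between its optimistic and pessimistic values,
  # weighted towards the most likely value (inverse-CDF draw from a triangular distribution)
  sum(mapply(function(a, m, b) {
    u <- runif(1)
    if (u < (m - a) / (b - a)) a + sqrt(u * (b - a) * (m - a))
    else b - sqrt((1 - u) * (b - a) * (b - m))
  }, opt, most, pess))
})
quantile(total, c(0.1, 0.5, 0.9))   # total project durations at several confidence levels
mean(total <= 25)                   # estimated chance of finishing within a 25-day target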
Unit 4
Test of Hypothesis
Topics
Hypothesis is usually considered as the principal instrument in research. Its main function is to
suggest new experiments and observations. In fact, many experiments are carried out with the
deliberate object of testing hypotheses. Decision-makers often face situations wherein they are
interested in testing hypotheses based on available information and then take decisions based on
such testing. In social science, where direct knowledge of population parameter(s) is rare,
hypothesis testing is the often-used strategy for deciding whether a sample data offer such
support for a hypothesis that generalization can be made. Thus, hypothesis testing enables us to
make probability statements about population parameter(s). The hypothesis may not be proved
absolutely, but in practice it is accepted if it has withstood a critical testing. Before we explain
how hypotheses are tested through different tests meant for the purpose, it will be appropriate to
explain clearly the meaning of a hypothesis and the related concepts for better understanding of
the hypothesis testing techniques.
WHAT IS A HYPOTHESIS?
Ordinarily, when one talks about hypothesis, one simply means a mere assumption or some
supposition to be proved or disproved. But for a researcher, a hypothesis is a formal question that he intends to resolve. Thus a hypothesis may be defined as a proposition or a set of propositions
set forth as an explanation for the occurrence of some specified group of phenomena either
asserted merely as a provisional conjecture to guide some investigation or accepted as highly
probable in the light of established facts. Quite often a research hypothesis is a predictive
statement, capable of being tested by scientific methods, that relates an independent variable to
some dependent variable. For example, consider statements like the following ones:
“Students who receive counselling will show a greater increase in creativity than students not
receiving counselling”
Or
“the automobile A is performing as well as automobile B.”
These are hypotheses capable of being objectively verified and tested. Thus, we may conclude
that a hypothesis states what we are looking for and it is a proposition which can be put to a test
to determine its validity.
time the research programmes have bogged down. Some prior study may be done by researcher
in order to make hypothesis a testable one. A hypothesis “is testable if other deductions can be
made from it which, in turn, can be confirmed or disproved by observation.”1
(iii) Hypothesis should state relationship between variables, if it happens to be a relational
hypothesis.
(iv) Hypothesis should be limited in scope and must be specific. A researcher must remember
that narrower hypotheses are generally more testable and he should develop such hypotheses.
(v) Hypothesis should be stated as far as possible in most simple terms so that the same is easily
understandable by all concerned. But one must remember that simplicity of hypothesis has
nothing to do with its significance.
(vi) Hypothesis should be consistent with most known facts i.e., it must be consistent with a
substantial body of established facts. In other words, it should be one which judges accept as
being the most likely.
(vii) Hypothesis should be amenable to testing within a reasonable time. One should not use
even an excellent hypothesis, if the same cannot be tested in reasonable time for one cannot
spend a life-time collecting data to test it.
(viii) Hypothesis must explain the facts that gave rise to the need for explanation. This means
that by using the hypothesis plus other known and accepted generalizations, one should be able
to deduce the original problem condition. Thus hypothesis must actually explain what it claims
to explain; it should have empirical reference.
The null hypothesis (Ho) is the opposite of the research hypothesis and expresses that
there is no relationship between variables, or no differences between groups; for
example:
Ho: There is no relationship between intelligence and academic results.
Ho: First year university students do not obtain higher grades after an intensive Statistics
course.
Ho: Males and females will not differ in their levels of stress.
The purpose of hypothesis testing is to test whether the null hypothesis (there is no difference, no
effect) can be rejected or retained. If the null hypothesis is rejected, then the research
hypothesis can be accepted. If the null hypothesis is retained (accepted), then the research hypothesis is
rejected.
In hypothesis testing, a value is set to assess whether the null hypothesis is accepted or
rejected and whether the result is statistically significant:
o A critical value is the score the sample would need to decide against the null hypothesis.
o A probability value is used to assess the significance of the statistical test. If the null
hypothesis is rejected, then the alternative to the null hypothesis is accepted.
This example illustrates how these five steps can be applied to test a hypothesis:
o Let’s say that you conduct an experiment to investigate whether students’ ability to
memorise words improves after they have consumed caffeine.
o The experiment involves two groups of students: the first group consumes caffeine;
the second group drinks water.
o Both groups complete a memory test.
o A randomly selected individual in the experimental condition (i.e. the group that
consumes caffeine) has a score of 27 on the memory test. The scores of people in
general on this memory measure are normally distributed with a mean of 19 and a
standard deviation of 4.
o The researcher predicts an effect (differences in memory for these groups) but does
not predict a particular direction of effect (i.e. which group will have higher scores on
the memory test). Using the 5% significance level, what should you conclude?
Step 2: We know that the characteristics of the comparison distribution (student population)
are:
Population M = 19, Population SD= 4, normally distributed. These are the mean and standard
deviation of the distribution of scores on the memory test for the general student population.
Step 3: For a two-tailed test (the direction of the effect is not specified) at the 5% level (2.5%
at each tail), the cut-off sample scores are +1.96 and -1.96.
Step 4: Your sample score of 27 needs to be converted into a Z value. To calculate Z = (27-
19)/4= 2 (check the Converting into Z scores section if you need to review how to do this
process)
Step 5: A ‘Z’ score of 2 is more extreme than the cut off Z of +1.96 (see figure above). The
result is significant and, thus, the null hypothesis is rejected.
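The same calculation can be checked in R using the values from this example:
x_score <- 27; mu <- 19; sigma <- 4
z <- (x_score - mu) / sigma              # z = 2
z
qnorm(0.975)                             # two-tailed 5% cut-off, about 1.96
2 * (1 - pnorm(abs(z)))                  # two-tailed p-value, about 0.0455, so reject H0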
Correlation analysis
Correlation analysis explores the association between variables. The purpose of correlational
analysis is to discover whether there is a relationship between variables, which is unlikely to
occur by sampling error. The null hypothesis is that there is no relationship between the two
variables. Correlation analysis provides information about:
The direction of the relationship: positive or negative- given by the sign of the
correlation coefficient.
The strength or magnitude of the relationship between the two variables- given by the
correlation coefficient, which varies from 0 (no relationship between the variables) to
1 (perfect relationship between the variables).
A negative correlation indicates that high scores on one variable are associated with low
scores on the other variable. The graph shows that a person who scores high on perceived
stress will probably score low on mastery. The slope of the graph is downwards- as it moves
to the right. In the figure below, higher scores on mastery are associated with lower scores on
perceived stress.
2. The strength or magnitude of the relationship
The strength of a linear relationship between two variables is measured by a statistic known
as the correlation coefficient, which varies from 0 to -1, and from 0 to +1. There are several
correlation coefficients; the most widely used are Pearson’s r and Spearman’s rho. The
strength of the relationship is interpreted as follows:
Small/weak: r= .10 to .29
Medium/moderate: r= .30 to .49
Large/strong: r= .50 to 1
It is important to note that correlation analysis does not imply causality. Correlation is used
to explore the association between variables, however, it does not indicate that one variable
causes the other. The correlation between two variables could be due to the fact that a third
variable is affecting the two variables.
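In R, the correlation coefficient and a test of the null hypothesis of no relationship can be obtained with cor() and cor.test(). A small sketch with made-up perceived stress and mastery scores (not real study data):
stress  <- c(30, 25, 28, 22, 35, 27, 31, 20, 26, 33)   # hypothetical scores
mastery <- c(12, 18, 14, 20, 10, 16, 11, 22, 17, 9)
cor(stress, mastery)                                    # Pearson's r (negative here)
cor.test(stress, mastery)                               # tests H0: rho = 0
cor.test(stress, mastery, method = "spearman")          # Spearman's rho as an alternative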
Example of Hypothesis Testing
Step 4: Compute the Test Statistic
Step 6: Compare the Test Statistic and Make a Decision
Interpretation:
There is sufficient evidence to suggest that the average study time of students is different
from 25 hours per week at the 0.05 significance level.
The university's claim that students study an average of 25 hours per week may not be
accurate. The sample data indicates that students may study less than 25 hours on average,
though further investigation could provide more clarity.
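A test of this kind can be run in R with t.test(); the sketch below uses a small made-up sample of weekly study hours (not the actual data behind this example):
study_hours <- c(22, 20, 25, 18, 23, 21, 24, 19, 22, 20)   # hypothetical sample
t.test(study_hours, mu = 25)       # H0: mean study time is 25 hours per week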
In statistics, the standard error is the standard deviation of the sampling distribution of a statistic. The
sample mean of a dataset generally differs from the actual population mean. The standard error is represented
as SE. It is used to measure how accurately the given sample represents its
population. Statistics is a vast topic in which we learn about data, sample and population,
mean, median, mode, dependent and independent variables, standard deviation, variance, etc.
Here you will learn the standard error formula along with the SE of the mean and of the estimate.
Standard Error Meaning
The standard error is one of the mathematical tools used in statistics to estimate the
variability. It is abbreviated as SE. The standard error of a statistic or an estimate of a
parameter is the standard deviation of its sampling distribution. We can define it as an
estimate of that standard deviation.
For the sample mean, SE = s / √n, where 's' is the sample standard deviation and n is the number of observations.
The standard error of the mean shows us how much the sample mean would vary across repeated samples
estimating the same quantity. Thus, if the effect of random variation is notable, the
standard error of the mean will be large. But if there is practically no change in the
data points after repeated experiments, then the value of the standard error of the mean will
be close to zero.
Standard Error of Estimate (SEE)
The standard error of the estimate is an estimate of the accuracy of any predictions. It is
denoted as SEE. The regression line minimizes the sum of squared deviations of prediction,
also known as the sum of squares error. SEE is the square root of the average
squared deviation. The deviation of the estimates from the intended values is given by the
standard error of estimate formula.
Where xi stands for data values, x bar is the mean value and n is the sample size.
How to calculate Standard Error
Step 1: Note the number of measurements (n) and determine the sample mean (μ). It is the
average of all the measurements.
Step 2: Determine how much each measurement varies from the mean.
Step 3: Square all the deviations determined in step 2 and add altogether: Σ(xi – μ)²
Step 4: Divide the sum from step 3 by one less than the total number of measurements (n-1).
Step 5: Take the square root of the obtained number, which is the standard deviation (σ).
Step 6: Finally, divide the standard deviation obtained by the square root of the number of
measurements (n) to get the standard error of your estimate.
Go through the example given below to understand the method of calculating standard error.
Standard Error Example
Calculate the standard error of the given data:
y: 5, 10, 12, 15, 20
Solution: First we have to find the mean of the given data:
Mean = (5 + 10 + 12 + 15 + 20)/5 = 62/5 = 12.4
Now, the sample standard deviation:
s = √[ Σ(yi − ȳ)² / (n − 1) ]
= √[ ((5 − 12.4)² + (10 − 12.4)² + (12 − 12.4)² + (15 − 12.4)² + (20 − 12.4)²) / 4 ] = √(125.2 / 4) ≈ 5.59
Hence, the standard error of the mean is SE = s/√n = 5.59/√5 ≈ 2.50.
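The same steps can be carried out in R for the data above:
y <- c(5, 10, 12, 15, 20)
n <- length(y)
s <- sd(y)            # sample standard deviation (uses n - 1 in the denominator)
se <- s / sqrt(n)     # standard error of the mean
mean(y); s; se        # 12.4, about 5.59, about 2.50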
Examples for Calculating the Standard Error of the Sampling Distribution of a Sample Mean
Example 1
The mean height of all adults in a particular country is 163 cm with a standard deviation of
2.5 cm. Calculate the standard error of the sampling distribution of a sample mean if the
sample size is 40. Round to three decimal places.
Step 1: Identify the standard deviation of the population, σ, and the sample size, N.
The problem states that standard deviation of the population is σ=2.5 cm, and the sample size
is N=40.
Step 2: Calculate the standard error of the sampling distribution of a sample mean by
dividing the population standard deviation by the square root of the sample size.
Using the formula for the standard error of the sampling distribution of a sample mean with
the information identified in step 1, we have: SE = σ/√N = 2.5/√40 ≈ 0.395 cm.
What is t-distribution?
Student’s t-distribution, also known as the t-distribution, is a probability distribution that is
used in statistics for making inferences about the population mean when the sample size is
small or when the population standard deviation is unknown. It is similar to the standard
normal distribution (Z-distribution), but it has heavier tails. Theoretical work on the t-distribution
was done by W.S. Gosset, who published his findings under the pen name
"Student"; that is why it is called Student's t-distribution (and the associated test, Student's t-test). The t-score represents the number of
standard deviations the sample mean is away from the population mean.
T-Score
The T-score, also known as the t-value or t-statistic, is a standardized score that quantifies
how many standard deviations a data point or sample mean is from the population mean. It is
commonly used in statistical hypothesis testing, particularly in scenarios where the sample
size is small or the population standard deviation is unknown.
The formula for calculating the T-score in the context of a t-distribution is given by:
t = (x̄ − μ) / (s / √n)
where,
t = t-score,
x̄ = sample mean,
μ = population mean,
s = standard deviation of the sample,
n = sample size
As we know, we use t-distribution when the standard deviation of the population is unknown
and the sample size is small. The formula for the t-distribution looks very similar to the
normal distribution; the only difference is that instead of the standard deviation of the
population, we will use the standard deviation of the sample.
When to Use the t-Distribution?
Student's t-distribution is used when:
The sample size is 30 or less.
The population standard deviation (σ) is unknown.
The population distribution is assumed to be roughly normal (unimodal and not heavily skewed).
Mathematical Derivation of t-Distribution
The t-distribution has been derived mathematically under the assumption of a normally distributed population, and the probability density function is
f(t) = Γ((df + 1)/2) / (√(df·π) · Γ(df/2)) · (1 + t²/df)^(−(df + 1)/2)
This equation gives the probability density function (pdf) of the t-distribution for df degrees of freedom.
Interpretation of t-Distribution
A confidence interval for the mean is a statistical range computed from the data, designed to
encompass a plausible "population" mean. This interval is expressed as x̄ ± t(α/2, n−1) · s/√n, where t(α/2, n−1) is the critical t value for the chosen confidence level and n − 1 degrees of freedom.
Degrees of freedom refer to the number of independent observations in a set of data. When
estimating a mean score or a proportion from a single sample, the number of independent
observations is equal to the sample size minus one.
Hence, the distribution of the t statistic from samples of size 10 would be described by a t
distribution having 10 – 1 or 9 degrees of freedom. Similarly, a t- distribution having 15
degrees of freedom would be used with a sample of size 16.
t-Distribution Table
The t-distribution table gives the t-value for different levels of significance and different degrees
of freedom. The calculated t-value is compared with the tabulated t-value. For example,
suppose one performs a Student's t-test at a 5% level of significance and obtains both a calculated
t-value and the corresponding tabulated t-value. If the calculated t-value is higher than the tabulated
t-value, we say that there is a significant difference between the population mean and the sample
mean at the 5% level of significance; otherwise, we say that there is no significant difference
between the population mean and the sample mean at the 5% level of significance.
t-scores and p-values
t-scores :
It represents the deviation of a data point from the mean in a t-distribution, expressed in
terms of standard deviations. Particularly useful for small sample sizes or cases with
unknown population standard deviations.
We can obtain them from a t-table or through online tools, providing a numerical
measure of how atypical a data point is within the distribution.
t-score is important in determining confidence intervals, aiding in estimating the range
within which the true population parameter is likely to fall. The critical value of t is
integral in confidence interval calculations, guiding the determination of upper and lower
bounds.
p-value:
The p-value (probability value) is a statistical measure that helps assess the evidence against
a null hypothesis.
p-value describes the likelihood of observing data at least as extreme as the sample data if the null hypothesis were true.
You can use statistical software to obtain the p-value associated with the calculated t-score
directly, or you can use the t-table, which provides critical values for different levels of
significance and degrees of freedom: find the row corresponding to your degrees of freedom and
locate where your t-score falls among the critical values to bracket the p-value.
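In R, the p-value for a given t-score can be obtained directly from the t-distribution functions instead of a printed table; for example (the t-score and degrees of freedom below are illustrative):
t_score <- 2.1; df <- 14                          # hypothetical values
2 * pt(abs(t_score), df, lower.tail = FALSE)      # two-tailed p-value
qt(0.975, df)                                     # critical t at the 5% level (two-tailed)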
Limitations of Using a T-Distribution
Sensitivity to Departure from Normality: The t-distribution assumes normality in the
underlying population. When data deviates significantly from a normal distribution, reliance
on the t-distribution may introduce inaccuracies in statistical inferences.
Limited Applicability for Large Samples: As sample sizes increase, the t-distribution
converges to the normal distribution. Therefore, for sufficiently large samples and known
population standard deviation, the normal distribution is more appropriate, and using the t-
distribution may not offer additional benefits.
Impact of Outliers and Small Sample Sizes: The t-distribution can be sensitive to outliers,
and its tails can be influenced by small sample sizes. Outliers may distort results, and in
cases where the sample size is very small, the t-distribution may have heavier tails, affecting
the accuracy of inferences.
Requires Random Sampling: The assumptions underlying the t-distribution, such as
random sampling and independence of observations, need to be met for valid results. If these
assumptions are violated, the accuracy of inferences drawn from the t-distribution may be
compromised.
T- Distribution Applications
1. Testing for the Hypothesis of the Population Mean: T-distributions are commonly used in
hypothesis tests regarding the population mean. This involves assessing whether a sample
mean is significantly different from a hypothesized population mean.
2. Testing for the Hypothesis of the Difference Between Two Means: T-tests can be
employed to examine if there is a significant difference between the means of two
independent samples. This can be done under the assumption of equal variances or when
variances are unequal. In scenarios where samples are not independent, such as paired or
dependent samples, t-tests can be used to assess the significance of the mean difference
between related observations.
3. Testing for the Hypothesis about the Coefficient of Correlation: T-distributions play a
role in hypothesis testing related to correlation coefficients. This includes situations where
the population correlation coefficient is assumed to be zero (ρ=0) or when testing for a non-
zero correlation coefficient (ρ≠0).
Difference Between T-Distribution and Normal Distribution
Chi-Square Test
The chi-square test statistic is computed as
χ² = Σ (O − E)² / E
Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
The degrees of freedom in a statistical calculation represent the number of variables that can
vary. The degrees of freedom can be calculated to ensure that chi-square tests are statistically
valid. These tests are frequently used to compare observed data with data expected to be obtained
if a particular hypothesis were true.
The Observed values are those you gather yourselves.
The expected values are the anticipated frequencies, based on the null hypothesis.
Fundamentals of Hypothesis Testing
Hypothesis testing is a technique for interpreting and drawing inferences about a population
based on sample data. It aids in determining which sample data best support mutually exclusive
population claims.
Null Hypothesis (H0) - The Null Hypothesis is the assumption that the event will not occur. A
null hypothesis has no bearing on the study's outcome unless it is rejected.
H0 is the symbol for it, and it is pronounced H-naught.
Alternate Hypothesis(H1 or Ha) - The Alternate Hypothesis is the logical opposite of the null
hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null
hypothesis. H1 is the symbol for it.
Types of Chi-Square Tests
There are two main types of Chi-Square tests:
1. Independence
2. Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical
test which examines whether two categorical variables are likely to be related to each other
or not. This test is used when we have counts of values for two nominal or categorical
variables, and it is considered a non-parametric test. A relatively large sample size and
independence of observations are the required criteria for conducting this test.
Example:
In a movie theatre, suppose we made a list of movie genres. Let us consider this as the first
variable. The second variable is whether or not the people who came to watch those genres of
movies bought snacks at the theatre. Here the null hypothesis is that the genre of the film
and whether people bought snacks or not are unrelated. If this is true, the movie genres
don't impact snack sales.
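A sketch of how such a test might be run in R, using a made-up genre-by-snacks contingency table (the counts are purely illustrative):
snacks <- matrix(c(50, 30,     # Action: bought snacks / did not
                   20, 40,     # Drama
                   35, 25),    # Comedy
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Action", "Drama", "Comedy"),
                                 c("Bought snacks", "No snacks")))
chisq.test(snacks)              # tests independence of genre and snack buying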
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines whether a
variable is likely to come from a given distribution or not. We must have a set of data values
and an idea of the expected distribution of this data. We can use this test when we have value counts
for categorical variables. It offers a way of deciding whether the data values have a
"good enough" fit to our hypothesized distribution, or whether the sample is representative of the entire population.
Example:
Suppose we have bags of balls with five different colours in each bag. The given condition is
that the bag should contain an equal number of balls of each colour. The idea we would like
to test here is that the proportions of the five colours of balls in each bag must be exact.
3. Chi-Square Test for Homogeneity
Example: A fast-food chain wants to see if the preference for a particular menu item is
consistent across different cities. The test can compare the distribution of preferences in
multiple cities to see if they are homogeneous.
4. Chi-Square Test for a Contingency Table
Example: A study investigates whether smoking status (smoker/non-smoker) is related to the
presence of lung disease (yes/no). The test can evaluate the relationship between smoking
and lung disease in the sample.
5. Chi-Square Test for Population Proportions
Example: A political analyst wants to see if voter preference (candidate A vs. candidate B) is
the same across different age groups. The test can determine if the proportions of preferences
differ significantly between age groups.
observed defect rate to the expected 10% rate and check whether the defect rate is
statistically consistent with the company's claim.
F-Test Examples
distribution is more unequal in one area compared to the other.
Similarly, you can calculate the expected value for each of the cells.
Categorical variables belong to a subset of variables that can be divided into discrete
categories. Names or labels are the most common categories. These variables are also
known as qualitative variables because they depict the variable's quality or
characteristics.
Categorical variables can be divided into two categories:
1. Nominal Variable: A nominal variable's categories have no natural ordering. Example:
Gender, Blood groups
2. Ordinal Variable: A variable that allows the categories to be sorted is an ordinal
variable. An example is customer satisfaction (Excellent, Very Good, Good, Average,
Bad, and so on).
How to Solve Chi-Square Problems?
A Chi-Square Test is used to examine whether the observed results are in line with the
expected values. When the data to be analysed come from a random sample, and when the
variable in question is a categorical variable, the Chi-Square test is the most appropriate
test. A categorical variable consists of selections such as breeds of dogs, types
of cars, genres of movies, educational attainment, male vs. female, etc. Survey responses and
questionnaires are the primary sources of these types of data. The Chi-square test is most
commonly used for analysing this kind of data. This type of analysis is helpful for
researchers who are studying survey response data. The research can range from customer
and marketing research to political sciences and economics.
Chi-Square Distribution
When the degrees of freedom k are greater than about ninety, the Chi-square distribution is closely approximated by a normal distribution.
The P-Value in a Chi-Square test is a statistical measure that helps to assess the importance of
your test results.
Here P denotes the probability; hence for the calculation of p-values, the Chi-Square test comes
into the picture. The different p-values indicate different types of hypothesis interpretations.
1. P ≤ 0.05: the result is statistically significant and the null hypothesis is rejected.
2. P > 0.05: the result is not statistically significant and the null hypothesis is not rejected.
The concepts of probability and statistics are entangled with Chi-Square Test. Probability is the
estimation of something that is most likely to happen. Simply put, it is the possibility of an event
or outcome of the sample. Probability can understandably represent bulky or complicated data.
And statistics involves collecting and organising, analysing, interpreting and presenting the data.
Finding P-Value
When you run all of the Chi-square tests, you'll get a test statistic called X2. You have two
options for determining whether this test statistic is statistically significant at some alpha level:
1. Compare the test statistic X2 to a critical value from the Chi-square distribution table.
2. Compare the p-value of the test statistic X2 to a chosen alpha level.
Test statistics are calculated by taking into account the sampling distribution of the test statistic
under the null hypothesis, the sample data, and the approach which is chosen for performing the
test.
The p-value is defined as follows in each case:
For a lower-tailed test: p-value = P(TS ≤ ts | H0 is true) = cdf(ts)
For an upper-tailed test: p-value = P(TS ≥ ts | H0 is true) = 1 − cdf(ts)
For a two-sided test, assuming the distribution of the test statistic under H0 is symmetric about 0: p-value = 2 · P(TS ≥ |ts| | H0 is true) = 2 · (1 − cdf(|ts|))
Where:
P = probability of the event
TS = test statistic; ts = the observed value of the test statistic computed from your sample
cdf() = cumulative distribution function of the test statistic's distribution (TS) under H0
Tools and Software for Chi-Square Analysis
Here are some commonly used tools and software for performing Chi-Square analysis:
1. SPSS (Statistical Package for the Social Sciences) is a widely used software for statistical
analysis, including Chi-Square tests. It provides an easy-to-use interface for performing Chi-
Square tests for independence, goodness of fit, and other statistical analyses.
2. R is a powerful open-source programming language and software environment for statistical
computing. The chisq.test() function in R allows Chi-Square tests to be conducted easily.
3. The SAS suite is used for advanced analytics, including Chi-Square tests. It is often used in
research and business environments for complex data analysis.
4. Microsoft Excel offers a Chi-Square test function (CHISQ.TEST) for users who prefer
working within spreadsheets. It's a good option for basic Chi-Square analysis with smaller
datasets.
5. Python (with libraries like SciPy or Pandas) offers robust tools for statistical analysis. The
scipy.stats.chisquare() and scipy.stats.chi2_contingency() functions can be used to perform Chi-Square tests.
There are two limitations to using the chi-square test that you should be aware of.
The chi-square test, for starters, is extremely sensitive to sample size. Even insignificant
relationships can appear statistically significant when a large enough sample is used.
Keep in mind that "statistically significant" does not always imply "meaningful" when
using the chi-square test.
Be mindful that the chi-square can only determine whether two variables are related. It
does not necessarily follow that one variable has a causal relationship with the other. It
would require a more detailed analysis to establish causality.
a linear trend in the proportions across the ordered groups. It’s commonly used in epidemiology
to analyze trends in disease rates over time or across different exposure levels.
4. Monte Carlo Simulation for Chi-Square Test
When the sample size is very small or when expected frequencies are too low, the Chi-Square
distribution may not provide accurate p-values. In such cases, Monte Carlo simulation can be
used to generate an empirical distribution of the test statistic, providing a more accurate
significance level.
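In R, chisq.test() can produce such a simulated p-value directly via its simulate.p.value argument; a small sketch with an illustrative table that has low expected counts:
observed <- matrix(c(3, 1, 2, 6), nrow = 2)                 # made-up 2 x 2 table with small counts
chisq.test(observed, simulate.p.value = TRUE, B = 10000)    # Monte Carlo p-value from 10,000 replicates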
5. Bayesian Chi-Square Test
In Bayesian statistics, the Chi-Square test can be adapted to incorporate prior knowledge or
beliefs about the data. This approach is useful when existing information should influence the
analysis, leading to potentially more accurate conclusions.
A die is rolled 120 times, and the following frequencies of outcomes are observed:
Outcome Frequency
1 20
2 22
3 18
4 15
5 25
6 20
Check if the die is fair using a significance level of 0.05.
Solution:
Step 1: Set up hypotheses
o H0: The die is fair (expected frequencies are equal).
o Ha: The die is not fair (expected frequencies differ).
Step 2: Calculate expected frequencies. A fair die has an equal probability for each outcome,
so each expected frequency is 120/6 = 20.
Step 3: Compute the test statistic:
χ² = Σ (O − E)²/E = (0 + 4 + 4 + 25 + 25 + 0)/20 = 58/20 = 2.9
Step 4: Compare with critical value Degrees of freedom (df) = 6 - 1 = 5. The critical value
from the Chi-Square table for α=0.05 and df = 5 is 11.07.
Since χ2=2.9 < 11.07, we fail to reject H0.
Conclusion: There is no evidence that the die is unfair (we fail to reject H0).
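The same goodness-of-fit test can be reproduced in R (equal expected probabilities are the default):
observed <- c(20, 22, 18, 15, 25, 20)   # frequencies from the table above
chisq.test(observed)                    # X-squared = 2.9, df = 5, p-value about 0.72, so fail to reject H0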
Step 4: Compare with critical value Degrees of freedom = 4 - 1 = 3. The critical value for df = 3
and α=0.05 is 7.815.
Two machines are producing metal rods. A random sample of 10 rods from machine 1 has a
variance of 4, and a random sample of 12 rods from machine 2 has a variance of 6. Test at the
0.05 significance level if the variances of the machines differ.
Solution:
Step 1: Set up hypotheses
Step 2: Compute the F-statistic: F = larger sample variance / smaller sample variance = 6/4 = 1.5.
Step 3: Find the critical value. The degrees of freedom are df1 = n1 − 1 = 10 − 1 = 9 and
df2=n2−1=12−1=11. The critical value at α=0.05 is 3.18 for df1=9 and df2=11.
Since F=1.5< 3.18, we fail to reject H0.
Conclusion: There is no significant difference in the variances of the two machines.
A teacher wants to know if two classes have different variability in their exam scores. A random
sample from class 1 (n=15) has a variance of 16, and a sample from class 2 (n=18) has a variance
of 9. Test the hypothesis at the 0.05 significance level.
Solution:
Step 1: Set up hypotheses
Step 2: Compute the F-statistic: F = larger sample variance / smaller sample variance = 16/9 ≈ 1.78
Step 3: Find critical value Degrees of freedom: df1=14, df2=17. The critical value
from the F-distribution table at α=0.05 is approximately 2.46.
Since F=1.78 < 2.46, we fail to reject H0
Conclusion: The variability in exam scores is not significantly different between the
two classes.
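When only the sample variances are available, the F-ratio and the relevant quantiles of the F-distribution can be computed directly in R; a sketch using the numbers from this example:
s1_sq <- 16; n1 <- 15        # class 1: sample variance 16
s2_sq <- 9;  n2 <- 18        # class 2: sample variance 9
F_stat <- s1_sq / s2_sq      # larger variance over smaller, about 1.78
df1 <- n1 - 1; df2 <- n2 - 1
qf(0.95, df1, df2)                             # upper 5% critical value of the F-distribution
2 * pf(F_stat, df1, df2, lower.tail = FALSE)   # two-sided p-value for H0: equal variances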
ANALYSIS
OF VARIANCE
(ANOVA)
Analysis of variance (abbreviated as ANOVA) is an extremely useful technique concerning
researches in the fields of economics, biology, education, psychology, sociology, business/industry
and in researches of several other disciplines. This technique is used when multiple sample
cases are involved. As stated earlier, the significance of the difference between the means of
two samples can be judged through either z-test or the t-test, but the difficulty arises when we
happen to examine the significance of the difference amongst more than two sample means at
the same time. The ANOVA technique enables us to perform this simultaneous test and as such
is considered to be an important tool of analysis in the hands of a researcher. Using this
technique, one can draw inferences about whether the samples have been drawn from
populations having the same mean.
The ANOVA technique is important in the context of all those situations where we want to
compare more than two populations such as in comparing the yield of crop from several
varieties of seeds, the gasoline mileage of four automobiles, the smoking habits of five groups
of university students and so on. In such circumstances one generally does not want to consider
all possible combinations of two populations at a time for that would require a great number of
tests before we would be able to arrive at a decision. This would also consume lot of time and
money, and even then certain relationships may be left unidentified (particularly the interaction
effects). Therefore, one quite often utilizes the ANOVA technique and through it investigates
the differences among the means of all the populations simultaneously.
WHAT IS ANOVA?
Professor R.A. Fisher was the first man to use the term ‘Variance’ * and, in fact, it was he who
developed a very elaborate theory concerning ANOVA, explaining its usefulness in practical
field. Later on Professor Snedecor and many others contributed to the development of this
technique. ANOVA is essentially a procedure for testing the difference among different groups
of data for homogeneity. “The essence of ANOVA is that the total amount of variation in a set
of data is broken down into two types, that amount which can be attributed to chance and that
amount which can be attributed to specified causes.”1 There may be variation between samples
and also within sample items. ANOVA consists in splitting the variance for analytical
purposes. Hence, it is a method of analysing the variance to which a response is subject into its
various components corresponding to various sources of variation. Through this technique one
can explain whether various varieties of seeds or fertilizers or soils differ significantly so that a
policy decision could be taken accordingly, concerning a particular variety in the context of
agriculture researches. Similarly, the differences in various types of feed prepared for a
particular class of animal or various types of drugs manufactured for curing a specific disease
may be studied and judged to be significant or not through the application of ANOVA
technique. Likewise, a manager of a big concern can analyse the performance of various
salesmen of his concern in order to know whether their performances differ significantly
The basic principle of ANOVA is to test for differences among the means of the populations
by examining the amount of variation within each of these samples, relative to the amount of
variation between the samples. In terms of variation within the given population, it is assumed
that the values of (Xij) differ from the mean of this population only because of random effects
i.e., there are influences on (Xij) which are unexplainable, whereas in examining differences
between populations we assume that the difference between the mean of the jth population and
the grand mean is attributable to what is called a ‘specific factor’ or what is technically
described as treatment effect. Thus while using ANOVA, we assume that each of the samples
is drawn from a normal population and that each of these populations has the same variance.
We also assume that all factors other than the one or more being tested are effectively
controlled. This, in other words, means that we assume the absence of many factors that might
affect our conclusions concerning the factor(s) to be studied.
In short, we have to make two estimates of population variance, viz., one based on between-samples
variance and the other based on within-samples variance. Then the said two estimates
of population variance are compared with the F-test, wherein we work out:
F = (estimate of population variance based on between-samples variance) / (estimate of population variance based on within-samples variance)
Illustration 1
Set up an analysis of variance table for the following per acre production data for three varieties of
wheat, each grown on 4 plots and state if the variety differences are significant.
The above table shows that the calculated value of F is 1.5, which is less than the table value of
4.26 at the 5% level with d.f. v1 = 2 and v2 = 9, and hence could have arisen due to chance.
This analysis supports the null hypothesis of no difference in sample means. We may, therefore,
conclude that the difference in wheat output due to varieties is insignificant and is just a matter
of chance.
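A one-way ANOVA of this kind can be run in R with aov(). The sketch below uses illustrative per-plot yields for three varieties, four plots each (chosen so that F comes out to about 1.5 on 2 and 9 d.f., as in the text, though they are not necessarily the exact figures from the original table):
yield   <- c(6, 7, 3, 8,  5, 5, 3, 7,  5, 4, 3, 4)     # hypothetical per-acre production
variety <- factor(rep(c("A", "B", "C"), each = 4))     # three varieties, four plots each
fit <- aov(yield ~ variety)
summary(fit)    # gives the F value; compare it with the 5% table value for (2, 9) d.f.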
Factor
Analysis
What is Factor Analysis?
Factor analysis, a method within the realm of statistics and part of the general linear model
(GLM), serves to condense numerous variables into a smaller set of factors. By doing so, it
captures the maximum shared variance among the variables and condenses them into a unified
score, which can subsequently be utilized for further analysis. Factor analysis operates under
several assumptions: linearity in relationships, absence of multicollinearity among variables,
inclusion of relevant variables in the analysis, and genuine correlations between variables and
factors. While multiple methods exist, principal component analysis stands out as the most
prevalent approach in practice.
What does Factor mean in Factor Analysis?
In the context of factor analysis, a “factor” refers to an underlying, unobserved variable or latent
construct that represents a common source of variation among a set of observed variables. These
observed variables, also known as indicators or manifest variables, are the measurable variables
that are directly observed or measured in a study.
How to do Factor Analysis (Factor Analysis Steps)?
Factor analysis is a statistical method used to describe variability among observed, correlated
variables in terms of a potentially lower number of unobserved variables called factors. Here are
the general steps involved in conducting a factor analysis:
1. Determine the Suitability of Data for Factor Analysis
Bartlett’s Test: Check the significance level to determine if the correlation matrix is suitable
for factor analysis.
Kaiser-Meyer-Olkin (KMO) Measure: Verify the sampling adequacy. A value greater than
0.6 is generally considered acceptable.
2. Choose the Extraction Method
Principal Component Analysis (PCA): Used when the main goal is data reduction.
Principal Axis Factoring (PAF): Used when the main goal is to identify underlying factors.
3. Factor Extraction
Use the chosen extraction method to identify the initial factors.
Extract eigenvalues to determine the number of factors to retain. Factors with eigenvalues
greater than 1 are typically retained in the analysis.
Compute the initial factor loadings.
4. Determine the Number of Factors to Retain
Scree Plot: Plot the eigenvalues in descending order to visualize the point where the plot
levels off (the “elbow”) to determine the number of factors to retain.
Eigenvalues: Retain factors with eigenvalues greater than 1.
5. Factor Rotation
Orthogonal Rotation (Varimax, Quartimax): Assumes that the factors are uncorrelated.
Oblique Rotation (Promax, Oblimin): Allows the factors to be correlated.
Rotate the factors to achieve a simpler and more interpretable factor structure.
Examine the rotated factor loadings.
6. Interpret and Label the Factors
Analyze the rotated factor loadings to interpret the underlying meaning of each factor.
Assign meaningful labels to each factor based on the variables with high loadings on that
factor.
7. Compute Factor Scores (if needed)
Calculate the factor scores for each individual to represent their value on each factor.
8. Report and Validate the Results
Report the final factor structure, including factor loadings and communalities.
Validate the results using additional data or by conducting a confirmatory factor analysis if
necessary.
degree to which each variable is associated with each factor.
Loading the Data
First, we need to load the data that we want to analyze. For this example, we will use the iris
dataset that comes with R. This dataset contains measurements of the sepal length, sepal
width, petal length, and petal width of three different species of iris flowers.
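The original code for this step is not reproduced here; a minimal sketch of loading the four numeric measurements might look like this:
data(iris)                 # built-in dataset shipped with R
iris_num <- iris[, 1:4]    # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (Species dropped)
head(iris_num)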
Data Preparation
Before conducting factor analysis, we need to prepare the data by scaling the variables to
have a mean of zero and a standard deviation of one. This is important because factor
analysis is sensitive to differences in scale between variables.
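A sketch of the scaling and extraction steps, under the assumptions above. Note that factanal()'s maximum-likelihood method refuses a two-factor model for only four variables, so this sketch uses PCA-based extraction from the psych package instead:
iris_scaled <- scale(iris[, 1:4])                                  # mean 0, standard deviation 1 for each column
library(psych)                                                     # provides principal() for PCA-based extraction
fa_fit <- principal(iris_scaled, nfactors = 2, rotate = "varimax")
print(fa_fit)                                                      # loadings, eigenvalues and variance explained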
The output of the summary() function shows the results of the factor analysis, including the
number of factors extracted, the eigenvalues for each factor, and the percentage of variance
explained by each factor.
This summary shows that the factor analysis extracted 2 factors, and provides the
standardized loadings (or factor loadings) for each variable on each factor. It also shows the
eigenvalues and proportion of variance explained by each factor, as well as the results of a
test of the hypothesis that 2 factors are sufficient. The goodness of fit statistic is also
reported.
Interpreting the Results of Factor Analysis
Once the factor analysis is complete, we can interpret the results by examining the factor
loadings, which represent the correlations between the observed variables and the extracted
factors. In general, loadings greater than 0.4 or 0.5 are considered significant.
The output of the loadings function shows the factor loadings for each variable and factor.
We can interpret these loadings to identify the underlying factors that explain the correlations
among the observed variables. In this example, it appears that the first factor is strongly
associated with petal length and petal width, while the second factor is strongly associated
with sepal length and sepal width.
Validating the Results of Factor Analysis
Finally, it is important to validate the results of the factor analysis by checking the
assumptions of the technique, such as normality and linearity. Additionally, it is important to
examine the factor structure for different subsets of the data to ensure that the results are
consistent and stable.
Factor Analysis using factanal( ) function:
The factanal() function is used to perform factor analysis on a data set. The factanal()
function takes several arguments described below
Syntax:
factanal(x, factors, rotation, scores, covmat)
where,
x – The data set to be analyzed.
factors – The number of factors to extract.
rotation – The rotation method to use. Popular rotation methods include varimax,
oblimin, and promax.
scores – Whether to compute factor scores for each observation.
covmat – A covariance matrix to use instead of the default correlation matrix.
# Install the required package
install.packages("psych")
# Load the psych package for
# data analysis and visualization
library(psych)
# Load the mtcars dataset
data(mtcars)
# Perform factor analysis on the mtcars dataset
factor_analysis <- factanal(mtcars, factors = 3, rotation = "varimax")
# Print the results of the factor analysis
print(factor_analysis)
In this example, we load the psych package, which provides functions for data analysis and
visualization, and the mtcars data set, which contains information about different car models.
We then use the factanal() function to perform factor analysis on the mtcars data set,
specifying that we want to extract three factors and use the varimax rotation method. Finally,
we print the results of the factor analysis.
Conclusion
In conclusion, factor analysis is a useful statistical technique for identifying underlying
factors or latent variables that explain the correlations among a set of observed variables. In
R programming, the psych package provides a range of functions for conducting factor
analysis, which can be used to extract meaningful insights from complex datasets.
Unit 5
Introduction
to R
Programming
language
Topics
Getting R
Managing R
Arithmetic and Matrix Operations
Introduction to Functions
Control structures
Working with Objects and Data :
Introduction to Objects,
Manipulating Objects,
Constructing Data Objects,
types of Data items,
Structure of Data items,
Reading and Getting Data,
Manipulating Data,
Storing data
R Programming Language – Introduction
The R Language stands out as a powerful tool in the modern era of statistical computing and data
analysis. Widely embraced by statisticians, data scientists, and researchers, the R Language
offers an extensive suite of packages and libraries tailored for data manipulation, statistical
modeling, and visualization. In this article, we explore the features, benefits, and applications of
the R Programming Language, shedding light on why it has become an indispensable asset for
data-driven professionals across various industries.
The R programming language is an implementation of the S programming language, combined
with lexical scoping semantics inspired by Scheme. The project was conceived in
1992, with an initial version released in 1995 and a stable beta version in 2000.
R programming is a leading tool for machine learning, statistics, and data analysis, allowing for
the easy creation of objects, functions, and packages. Designed by Ross Ihaka and Robert
Gentleman at the University of Auckland and developed by the R Development Core Team, R
Language is platform-independent and open-source, making it accessible for use across all
operating systems without licensing costs. Beyond its capabilities as a statistical package, R
integrates with other languages like C and C++, facilitating interaction with various data sources
and statistical tools. With a growing community of users and high demand in the Data Science
job market, R is one of the most sought-after programming languages today. Originating as an
implementation of the S programming language with influences from Scheme, R has evolved
since its conception in 1992, with its first stable beta version released in 2000.
Why Use R Language?
The R Language is a powerful tool widely used for data analysis, statistical computing, and
machine learning. Here are several reasons why professionals across various fields prefer R:
1. Comprehensive Statistical Analysis:
R language is specifically designed for statistical analysis and provides a vast array of
statistical techniques and tests, making it ideal for data-driven research.
2. Extensive Packages and Libraries:
The R Language boasts a rich ecosystem of packages and libraries that extend its capabilities,
allowing users to perform advanced data manipulation, visualization, and machine learning
tasks with ease.
3. Strong Data Visualization Capabilities:
R language excels in data visualization, offering powerful tools like ggplot2 and plotly,
which enable the creation of detailed and aesthetically pleasing graphs and plots.
4. Open Source and Free:
As an open-source language, R is free to use, which makes it accessible to everyone, from
individual researchers to large organizations, without the need for costly licenses.
5. Platform Independence:
The R Language is platform-independent, meaning it can run on various operating systems,
including Windows, macOS, and Linux, providing flexibility in development environments.
6. Integration with Other Languages:
R can easily integrate with other programming languages such as C, C++, Python, and Java,
allowing for seamless interaction with different data sources and statistical packages.
7. Growing Community and Support:
R language has a large and active community of users and developers who contribute to its
continuous improvement and provide extensive support through forums, mailing lists, and
online resources.
8. High Demand in Data Science:
R is one of the most requested programming languages in the Data Science job market,
making it a valuable skill for professionals looking to advance their careers in this field.
1. Comprehensive Statistical Analysis:
R language provides a wide array of statistical techniques, including linear and nonlinear
modeling, classical statistical tests, time-series analysis, classification, and clustering.
2. Advanced Data Visualization:
With packages like ggplot2, plotly, and lattice, R excels at creating complex and
aesthetically pleasing data visualizations, including plots, graphs, and charts.
3. Extensive Packages and Libraries:
The Comprehensive R Archive Network (CRAN) hosts thousands of packages that
extend R’s capabilities in areas such as machine learning, data manipulation,
bioinformatics, and more.
4. Open Source and Free:
R is free to download and use, making it accessible to everyone. Its open-source nature
encourages community contributions and continuous improvement.
5. Platform Independence:
R is platform-independent, running on various operating systems, including Windows,
macOS, and Linux, which ensures flexibility and ease of use across different environments.
6. Integration with Other Languages:
R language can integrate with other programming languages such as C, C++, Python, Java,
and SQL, allowing for seamless interaction with various data sources and computational
processes.
7. Powerful Data Handling and Storage:
R efficiently handles and stores data, supporting various data types and structures, including
vectors, matrices, data frames, and lists.
8. Robust Community and Support:
R has a vibrant and active community that provides extensive support through forums,
mailing lists, and online resources, contributing to its rich ecosystem of packages and
documentation.
9. Interactive Development Environment (IDE):
RStudio, the most popular IDE for R, offers a user-friendly interface with features like
syntax highlighting, code completion, and integrated tools for plotting, history, and
debugging.
10. Reproducible Research:
R supports reproducible research practices with tools like R Markdown and Knitr, enabling
users to create dynamic reports, presentations, and documents that combine code, text, and
visualizations.
Advantages of R language
R is the most comprehensive statistical analysis package, as new technology and concepts
often appear first in R.
The R programming language is open source, so you can run R anywhere and at any time.
The R programming language is suitable for GNU/Linux and Windows operating systems.
R programming is cross-platform and runs on any operating system.
In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R language
In the R programming language, the standard of some packages is less than perfect.
R commands give little attention to memory management, so the R programming
language may consume all available memory.
Because R is community-maintained, there is essentially nobody to complain to if something doesn't work.
The R programming language is much slower than other programming languages such as Python
and MATLAB.
Applications of R language
We use R for Data Science. It gives us a broad variety of libraries related to statistics and
provides the environment for statistical computing and design.
R is used by many quantitative analysts as their programming tool; it helps in data
importing and cleaning.
R is a prevalent language, so many data analysts and research programmers use it.
Hence, it is used as a fundamental tool in finance.
Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro, and many more use
R nowadays.
R vs Python
R Programming Language and Python are both used extensively for Data Science. Both are
very useful, open-source languages. For data analysis, statistical computing, and
machine learning, both languages are strong tools with sizeable communities and huge libraries
for data science jobs. A theoretical comparison between R and Python is provided below:
R Programming Language is used for machine learning algorithms, linear regression, time
series, statistical inference, etc. It was designed by Ross Ihaka and Robert Gentleman in 1993. R
is an open-source programming language that is widely used as a statistical software and data
analysis tool. R generally comes with the Command-line interface. R is available across widely
used platforms like Windows, Linux, and macOS. Also, the R programming language is the
latest cutting-edge tool.
Difference between R Programming and Python Programming
Below are some major differences between R and Python:
Introduction: R is a language and environment for statistical programming, which includes statistical computing and graphics; Python is a general-purpose programming language for data analysis and scientific computing.
Python supports a very large community for general-purpose data science. It is one of the most popular
choices for data analysis, primarily because of its fantastic ecosystem of data-centric
packages: Pandas and NumPy are among the packages that make importing, analysing, and
visualising data much easier.
R Programming has a rich ecosystem for standard machine learning and data mining
techniques. It works well for statistical analysis of large datasets, offers a number of different
options for exploring data, and makes it easy to use probability distributions and apply different
statistical tests.
R vs Python
Data modeling: R supports the Tidyverse, making it easy to import, manipulate, visualize, and report on data; Python offers NumPy, SciPy, scikit-learn and TensorFlow.
Statistical analysis and machine learning are critical components of data science, involving the
application of statistical methods, models, and techniques to extract insights, identify patterns,
and draw meaningful conclusions from data. Both R and Python have widely used programming
languages for statistical analysis, each offering a variety of libraries and packages to perform
diverse statistical and machine learning tasks. Some comparison of statistical analysis and
modeling capabilities in R and Python.
Linear Regression: R uses the lm() function and formulas; Python uses Statsmodels Ordinary Least Squares (OLS).
ANOVA and t-tests: R uses built-in functions (aov, t.test); Python uses SciPy (ANOVA, t-tests).
Principal Component Analysis (PCA): R uses the princomp() function; Python uses scikit-learn (PCA).
Decision Trees: R uses the rpart() function; Python uses scikit-learn (DecisionTreeClassifier).
Random Forest: R uses the randomForest() function; Python uses scikit-learn (RandomForestClassifier).
R is much more difficult compared to Python because it is mainly used for statistics purposes, while Python does not have as many libraries for data science as R.
R might not be as fast as some other languages, while Python might not be as specialized for statistical work as R.
Python and R are among the most useful languages in data science; both deal with identifying,
representing, and extracting meaningful information from data sources so that it can be used in
business logic. Each has popular packages for data collection, data exploration, data modeling,
data visualization, and statistical analysis.
Introduction to R Studio
R Studio is an integrated development environment (IDE) for R. The IDE is a GUI where you can
write your code, see the results, and also see the variables that are generated during the course
of programming.
R Studio is available as both Open source and Commercial software.
R Studio is also available as both Desktop and Server versions.
R Studio is also available for various platforms such as Windows, Linux, and macOS.
RStudio is an open-source tool that provides an IDE for the R language, as well as enterprise-ready
professional software for data science teams to develop and share their work.
R Studio can be downloaded from its official website, where instructions for installation are also
available.
After the installation process is over, the R Studio interface looks like:
The console panel(left panel) is the place where R is waiting for you to tell it what to do, and see
the results that are generated when you type in the commands.
To the top right, you have the Environmental/History panel. It contains 2 tabs:
o Environment tab: It shows the variables that are generated during the course of
programming in a workspace that is temporary.
o History tab: In this tab, you’ll see all the commands that are used till now from the start
of usage of R Studio.
To the right bottom, you have another panel, which contains multiple tabs, such as files,
plots, packages, help, and viewer.
o The Files tab shows the files and directories that are available within the default
workspace of R.
o The Plots tab shows the plots that are generated during the course of programming.
o The Packages tab helps you to look at what are the packages that are already installed in
the R Studio and it also gives a user interface to install new packages.
o The Help tab is the most important one, where you can get help from the R
Documentation on the functions that are built into R.
o The final tab is the Viewer tab, which can be used to see local web
content that is generated using R.
Features of R Studio
A friendly user interface
Writing and storing reusable programs
All imported data and newly created objects (such as variables, functions, etc.) are easily
accessible
Comprehensive assistance for any object
Code autocompletion
The capacity to organise and share your work with your partners more effectively through
the creation of projects
Plot snippets
Simple terminal and console switching
Tracking of operational history
There are numerous articles from RStudio Support on using the IDE.
Once you choose your working directory, you need to use this setting button in the more tab and
click it and then you get a popup menu, where you need to select “Set as working directory”.
This will select the current directory, which you have chosen using this file browser as
your working directory. Once you set the working directory, you are ready to program in
R Studio.
Step 2: Then select the New Project option.
R Studio is an integrated development environment (IDE) for R: a GUI where you can write your code, see the results, and see the variables that are generated during the course of programming. R Studio is available as open-source software in both Desktop and Server versions.
Creating an R file
There are two ways to create an R file in R studio:
You can click on the File tab, from there when you click it will give a drop-down menu,
where you can select the new file and then R script, so that, you will get a new file open.
Use the plus button, which is just below the file tab and you can choose R script, from
there, to open a new R script file.
Once you open an R script file, this is how an R Studio with the script file open looks like.
So, 3 panels console, environment/history and file/plots panels are there. On top left you have a
new window, which is now being opened as a script file. Now you are ready to write a script file
or some program in R Studio.
In the above example, a variable 'a' is assigned the value 11 in the first line of the code, and 'b' is 'a' times 10, which is the second command. Here, the code evaluates a times 10 and assigns the value to b. The third statement, print(c(a, b)), concatenates a and b and prints the result. So, this is how a script file is written in R.
After writing a script file, there is a need to save this file before execution.
Saving an R File
Let us see how to save the R file. From the File menu, you can click either the Save or the Save As button. If you click the Save button, it will automatically save the file as Untitled x, where x can be 1 or 2 depending upon how many R scripts you have already opened.
Or, it is a nice idea to use the Save as button, just below the Save one, so that, you can rename
the script file according to your wish. Let us suppose we have clicked the Save as button. This
will pop out a window like this, where you can rename the script file as test.R. Once you rename,
then by clicking the save button you can save the script file.
So now, we have seen how to open an R script and how to write some code in the R script file
and save the file.
Using the run command: This “run” command can be executed using the GUI, by
pressing the run button there, or you can use the Shortcut key control + enter.
This “source” command can be executed using the GUI, by pressing the source button
there, or you can use the Shortcut key control + shift + S.
It will execute the whole R file and only print the output which you wanted to print.
This “source with echo” command can be executed using the GUI, by pressing the source
with echo button there, or you can use the Shortcut key control + shift + enter.
It will print the commands also, along with the output you are printing.
So, this is an example, where R file is executed, using the source with echo command.
It can be seen in the console, that it printed the command a = 11 and the command b = a*10 and
also the output print(c(a, b)) with the values.
So, a is 11 and b is 11 times 10, this is 110. This is how the output will be printed in the console.
Values of a and b are also shown in the environment panel.
Run command over Source command:
Run can be used to execute the selected lines of R code.
Source and Source with echo can be used to run the whole file.
The advantage of using Run is, you can troubleshoot or debug the program when
something is not behaving according to your expectations.
The disadvantages of using run command are, it populates the console and makes it
messy unnecessarily.
Clear the Console and the Environment in R Studio
R Studio is an integrated development environment (IDE) for R: a GUI where you can write your code, see the results, and also see the variables that are generated during the course of programming.
Clearing the Console
In some cases, when you run code using "source" or "source with echo", your console will become messy and it is necessary to clear it. So let's now look at how to clear the console. The console can be cleared using the shortcut key "Ctrl + L".
Example:
In this below screenshot, an R code is written in the script tab defined a and calculated b and
printed a, b. When this code is executed using “source with echo” all the commands will get
printed in the console tab. Now, to clear this console click on the console tab and enter the
key combination “ctrl + L“. Once it is done the console will get cleared.
Note: Remember that clearing the console will not delete the variables that are there in the
workspace. You can see that in the environment tab even though we have cleared the console
in the workspace we still have the variables that are created earlier.
Clearing the Environment
Variables on the R environment can be cleared in two ways:
Using rm() command:
When you want to clear a single variable from the R environment you can use the “ rm()”
command followed by the variable you want to remove.
rm(variable)
variable: the name of the variable you want to remove.
If you want to delete all the variables that are in the environment, you can use rm() with the argument list set to ls():
rm(list = ls())
Using the GUI:
We can also clear all the variables in the environment using the GUI in the environment
pane.
How does it work?
You see this brush button in the environment pane.
So when you press the brush button it will pop up a window saying “you want to
remove all the objects from the environment?”
And if you say yes it will clear all the variables which are shown here and you can
see the environment is empty now.
Matrix in R – Arithmetic Operations
Arithmetic operations include addition (+), subtraction (-), multiplication (*), division (/) and modulus (%%). In this section we are going to see matrix creation and arithmetic operations on matrices in the R programming language.
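The examples that follow assume two matrices named matrix1 and matrix2 whose creation is not shown here; a minimal sketch of how they might be defined (the values are illustrative), together with element-wise addition, is:
# create two 2x2 matrices with illustrative values
matrix1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
matrix2 <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2, byrow = TRUE)
print(" addition result")
# add matrices element-wise
print(matrix1 + matrix2)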
Subtraction
# display matrix
print(matrix1)
# display matrix
print(matrix2)
print(" subtraction result")
# subtract matrices
print(matrix1-matrix2)
Multiplication
# display matrix
print(matrix1)
# display matrix
print(matrix2)
print(" multiplication result")
# multiply matrices
print(matrix1*matrix2)
Division
# display matrix
print(matrix1)
# display matrix
print(matrix2)
print(" Division result")
Modulo operation
Modulo returns the remainder of the elements in a matrix. The operator used: %%. The main
difference between division and modulo operator is that division returns quotient and modulo
returns remainder.
Algebraic Operations on a Matrix in R
What is a Matrix?
A Matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we
know rows are the ones that run horizontally and columns are the ones that run vertically.
In R Programming Language matrices are two-dimensional, homogeneous data structures.
These are some examples of matrices.
# R program to demonstrate
# basic operations on a single matrix
Unary operations
Many unary operations can be performed on a matrix in R. This includes sum, min, max,
etc.
# R program to demonstrate
# unary operations on a matrix
# Create a 3x3 matrix
a = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
cat("The 3x3 matrix:\n")
print(a)
# maximum element in the matrix
cat("Largest element is:\n")
print(max(a))
# minimum element in the matrix
cat("Smallest element is:\n")
print(min(a))
# sum of element in the matrix
cat("Sum of elements is:\n")
print(sum(a))
Binary operations
These operations are applied to a matrix element-wise, and a new matrix is created. You can use all the basic arithmetic operators like +, -, *, /, etc. Note that R has no compound assignment operators such as += or -=; to modify an existing matrix you reassign the result, for example a <- a + b.
# R program to demonstrate
# binary operations on a matrix
# Create a 3x3 matrix
a = matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3, byrow = TRUE )
cat("The 3x3 matrix:\n")
print(a)
# Create another 3x3 matrix
b = matrix( c(1, 2, 5, 4, 6, 2, 9, 4, 3), nrow = 3, ncol = 3, byrow = TRUE )
cat("The another 3x3 matrix:\n")
print(b)
cat("Matrix addition:\n")
print(a + b)
cat("Matrix subtraction:\n")
print(a-b)
Linear algebraic operations
One can perform many linear algebraic operations on a given matrix In R. Some of them are as
follows:
Rank, determinant, transpose, inverse, trace of a matrix
Rank: The rank of a matrix is the maximum number of linearly independent rows or
columns in the matrix. In other words, it is the dimension of the column space (or
equivalently, the row space) of the matrix.
Determinant: The determinant has various geometric interpretations, such as measuring the
volume scaling factor of a linear transformation represented by the matrix.
Transpose: The transpose of a matrix is obtained by swapping its rows and columns.
Inverse: The inverse of a square matrix is a matrix that, when multiplied with the original
matrix, results in the identity matrix.
Trace: The trace of a square matrix is the sum of its diagonal elements.
# R program to demonstrate
# Linear algebraic operations on a matrix
# Rank of a matrix
cat("Rank of A:\n")
print(Rank(A))
# Trace of matrix A
cat("Trace of A:\n")
print(tr(A))
# Determinant of a matrix
cat("Determinant of A:\n")
print(det(A))
# Transpose of a matrix
cat("Transpose of A:\n")
print(t(A))
# Inverse of matrix A
cat("Inverse of A:\n")
print(inv(A))
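Rank(), tr(), and inv() used above are not base R functions; they come from add-on packages. A base-R-only sketch of the same operations, using an illustrative invertible matrix A, is:
# illustrative 3x3 invertible matrix
A <- matrix(c(1, 2, 3,
              0, 1, 4,
              5, 6, 0), nrow = 3, byrow = TRUE)
print(qr(A)$rank)     # rank of A
print(sum(diag(A)))   # trace of A: sum of the diagonal elements
print(det(A))         # determinant of A
print(t(A))           # transpose of A
print(solve(A))       # inverse of A (A must be non-singular)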
Nullity of a matrix
The nullity of a matrix is the dimension of the null space (also known as the kernel) of the matrix. By the rank-nullity theorem, the nullity equals the number of columns minus the rank.
# R program to demonstrate
# nullity of a matrix
# No of column
col = ncol(a)
# Rank of matrix
rank = Rank(a)
# Calculating nullity
nullity = col - rank
Functions in R Programming
A function accepts input arguments and produces the output by executing valid R commands that
are inside the function.
Functions are useful when you want to perform a certain task multiple times.
In R Programming Language when you are creating a function the function name and the file in
which you are creating the function need not be the same and you can have one or more
functions in R.
Creating a Function in R Programming
Functions are created in R by using the command function(). The general structure of the
function file is as follows:
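The general structure, shown here as a minimal sketch, is:
f <- function(arg1, arg2) {
  # valid R statements
  result <- arg1 + arg2
  return(result)
}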
Note: In the above syntax f is the function name, this means that you are creating a function with
name f which takes certain arguments and executes the following statements.
Parameters or Arguments in R Functions:
Parameters and arguments are two terms for the same thing in functions.
Parameters or arguments are the values passed into a function.
A function can have any number of arguments, separated by commas inside the parentheses.
Example:
# function to add 2 numbers
add_num <- function(a,b)
{
sum_result <- a+b
return(sum_result)
}
# calling add_num function
sum = add_num(35,34)
#printing result
print(sum)
Output
[1] 69
No. of Parameters:
A function should be called with the right number of parameters, neither fewer nor more, or else it will give an error.
Default Value of Parameter:
Some functions have default values, and you can also give default values in your user-defined functions. These values are used if the user does not pass a value for that parameter when calling the function.
Return Value:
You can use return() function if you want your function to return the result.
Calling a Function in R
After creating a Function, you have to call the function to use it.
Calling a function in R is very easy: you call a function by writing its name and passing the required parameter values.
Rectangle = function(length=5, width=4){
area = length * width
return(area)
}
# Case 1:
print(Rectangle(2, 3))
# Case 2:
print(Rectangle(width = 8, length = 4))
# Case 3:
print(Rectangle())
Output
[1] 6
[1] 32
[1] 20
1. Built-in Function: Built-in functions in R are pre-defined functions that are available in R
programming languages to perform common tasks or operations.
2. User-defined Function: R language allow us to write our own function.
Built-in Function in R Programming Language
Built-in Function are the functions that are already existing in R language and you just need to
call them to use.
Here we will use built-in functions like sum(), max() and min().
print(sum(4:6))
print(max(4:6))
print(min(4:6))
Output
[1] 15
[1] 6
[1] 4
Mathematical Functions
cos(), sin(), and tan() calculate a number's cosine, sine, and tangent.
Statistical Functions
evenOdd = function(x){
if(x %% 2 == 0)
return("even")
else
return("odd")
}
print(evenOdd(4))
print(evenOdd(3))
Output
[1] "even"
[1] "odd"
R Function Examples
Now let’s look at some use cases of functions in R with some examples.
# A simple R function to calculate
# area of a circle
areaOfCircle = function(radius){
area = pi*radius^2
return(area)
}
print(areaOfCircle(2))
Output
[1] 12.56637
Rectangle = function(length, width){
  # (assumed first lines of this function; it computes the area and perimeter of a rectangle)
  area = length * width
  perimeter = 2 * (length + width)
  result = list("Area" = area, "Perimeter" = perimeter)
  return(result)
}
resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])
Output
$Area
[1] 6
$Perimeter
[1] 10
Sometimes creating an R script file, loading it, and executing it is a lot of work when you only want to create a very small function. So, what we can do in this kind of situation is use an inline function.
To create an inline function you have to use the function command with the argument x and then
the expression of the function.
# A simple R program to
# demonstrate the inline function
f = function(x) x^2*4+x/3
print(f(4))
print(f(-2))
print(f(0))
Output
[1] 65.33333
[1] 15.33333
[1] 0
Example
In the function "Cylinder" given below, three arguments are defined: "diameter", "length" and "radius", but the volume calculation does not involve the argument "radius". Now, when you pass "diameter" and "length", even though you are not passing "radius", the function will still execute, because radius is not used in the calculations inside the function.
Let’s illustrate this in an R code given below:
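A sketch of code consistent with the output below (assuming diameter = 5 and length = 10) is:
Cylinder = function(diameter, length, radius){
  # 'radius' is declared as an argument but never used in the calculation
  volume = pi * (diameter / 2)^2 * length
  return(volume)
}
# 'radius' is not passed, yet the call still works
print(Cylinder(diameter = 5, length = 10))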
Output
[1] 196.3495
If you do not pass the argument but it is used inside the definition of the function, it will throw an error that "radius" is missing while being used in the function definition.
Example
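A sketch of code consistent with the error shown below (using the same arguments as above) is:
Cylinder = function(diameter, length, radius){
  # 'radius' is now used inside the function but is never supplied in the call
  print(radius)
  volume = pi * (diameter / 2)^2 * length
  return(volume)
}
Cylinder(diameter = 5, length = 10)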
Output
Error in print(radius) : argument "radius" is missing, with no default
Control Statements in R Programming
Control statements are expressions used to control the execution and flow of the program based
on the conditions provided in the statements. These structures are used to make a decision after
assessing the variable. In this article, we’ll discuss all the control statements with the examples.
if condition
This control structure checks whether the expression provided in parentheses is true. If it is true, the statements inside the braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
Example:
x <- 100
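A minimal sketch of an if statement using this value of x (the condition and message are illustrative):
if(x > 50){
  print("x is greater than 50")
}
Output:
[1] "x is greater than 50"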
if-else condition
It is similar to if condition but when the test expression in if condition fails, then statements
in else condition are executed.
Syntax:
if(expression){
statements
....
....
}
else{
statements
....
....
}
Example:
x <- 5
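A minimal sketch of an if-else statement using this value of x (the condition and messages are illustrative):
if(x %% 2 == 0){
  print("x is even")
} else {
  print("x is odd")
}
Output:
[1] "x is odd"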
for loop
It is a type of loop or sequence of statements executed repeatedly until exit condition is reached.
Syntax:
for(value in vector){
statements
....
....
}
Example:
x <- letters[4:10]
for(i in x){
print(i)
}
Output:
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Nested loops
Nested loops are similar to simple loops. Nested means loops inside loop. Moreover, nested
loops are used to manipulate the matrix.
Example:
# Defining matrix
m <- matrix(2:15, 2)
for (r in seq(nrow(m))) {
for (c in seq(ncol(m))) {
print(m[r, c])
}
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
while loop
while loop is another kind of loop iterated until a condition is satisfied. The testing expression is
checked first before executing the body of loop.
Syntax:
while(expression){
statement
....
....
}
Example:
x=1
# Print 1 to 5
while(x <= 5){
print(x)
x=x+1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
repeat loop
Syntax:
repeat {
statements
....
....
if(expression) {
break
}
}
Example:
x=1
# Print 1 to 5
repeat{
print(x)
x=x+1
if(x > 5){
break
}
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
return statement
return statement is used to return the result of an executed function and returns control to the
calling function.
Syntax:
return(expression)
Example:
# Checks value is either positive, negative or zero
func <- function(x){
if(x > 0){
return("Positive")
}else if(x < 0){
return("Negative")
}else{
return("Zero")
}
}
func(1)
func(0)
func(-1)
Output:
[1] "Positive"
[1] "Zero"
[1] "Negative"
next statement
The next statement is used to skip the current iteration without executing the remaining statements, and the loop continues with the next iteration cycle without terminating.
Example:
# Defining vector
x <- 1:10
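A minimal sketch of a loop that uses next to skip the odd values of x (the condition is illustrative):
for(i in x){
  if(i %% 2 != 0){
    next      # skip the rest of this iteration for odd values
  }
  print(i)
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10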
R – Objects
Every programming language has its own data types to store values or any information so that
the user can assign these data types to the variables and perform operations respectively.
Operations are performed accordingly to the data types.
These data types can be character, integer, float, long, etc. Based on the data type,
memory/storage is allocated to the variable. For example, in C language character variables are
assigned with 1 byte of memory, integer variable with 2 or 4 bytes of memory and other data
types have different memory allocation for them.
Unlike other programming languages, variables are assigned to objects rather than data types
in R programming.
Type of Objects
There are six basic types of objects in the R language: vectors, lists, matrices, arrays, factors, and data frames.
Vectors
Atomic vectors are one of the basic types of objects in R programming. Atomic vectors can store
homogeneous data types such as character, doubles, integers, raw, logical, and complex. A single
element variable is also said to be vector.
Example:
# Create vectors
x <- c(1, 2, 3, 4)
y <- c("a", "b", "c", "d")
z <- 5
print(x)
print(class(x))
print(y)
print(class(y))
print(z)
print(class(z))
Output:
[1] 1 2 3 4
[1] "numeric"
[1] "a" "b" "c" "d"
[1] "character"
[1] 5
[1] "numeric"
Lists
A list is another type of object in R programming. Lists can contain heterogeneous data types, such as vectors or other lists.
Example:
# Create list
ls <- list(c(1, 2, 3, 4), list("a", "b", "c"))
# Print
print(ls)
print(class(ls))
Output:
[[1]]
[1] 1 2 3 4
[[2]]
[[2]][[1]]
[1] "a"
[[2]][[2]]
[1] "b"
[[2]][[3]]
[1] "c"
[1] "list"
Matrices
To store values as 2-Dimensional array, matrices are used in R. Data, number of rows and
columns are defined in the matrix() function.
Syntax:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
Example:
x <- c(1, 2, 3, 4, 5, 6)
# Create matrix
mat <- matrix(x, nrow = 2)
print(mat)
print(class(mat))
Output:
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
[1] "matrix"
Factors
Factor object encodes a vector of unique elements (levels) from the given data vector.
Example:
# Create vector
s <- c("spring", "autumn", "winter", "summer",
"spring", "autumn")
print(factor(s))
print(nlevels(factor(s)))
Output:
[1] spring autumn winter summer spring autumn
Levels: autumn spring summer winter
[1] 4
Arrays
array() function is used to create n-dimensional array. This function takes dim attribute as an
argument and creates required length of each dimension as specified in the attribute.
Syntax:
array(data, dim = length(data), dimnames = NULL)
Example:
# Create a 3x3x3 array (assumed definition; the values 1, 2, 3 are recycled, consistent with the output below)
arr <- array(c(1, 2, 3), dim = c(3, 3, 3))
print(arr)
Output:
, , 1
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

, , 2
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3

, , 3
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
[3,]    3    3    3
Data Frames
Data frames are 2-dimensional tabular data objects in R programming. A data frame consists of multiple columns, and each column represents a vector. Columns in a data frame can have different modes of data, unlike matrices.
Example:
# Create vectors
x <- 1:5
y <- LETTERS[1:5]
z <- c("Albert", "Bob", "Charlie", "Denver", "Elie")
# Combine the vectors into a data frame and print it (assumed step, consistent with the output below)
df <- data.frame(x, y, z)
print(df)
Output:
  x y       z
1 1 A  Albert
2 2 B     Bob
3 3 C Charlie
4 4 D  Denver
5 5 E    Elie
Data Manipulation in R with Dplyr Package
In order to manipulate the data, R provides a library called dplyr which consists of many built-in
methods to manipulate the data. So to use the data manipulation function, first need to import the
dplyr package using library(dplyr) line of code. Below is the list of a few data manipulation
functions present in dplyr package.
filter() method
The filter() function is used to produce the subset of the data that satisfies the condition specified
in the filter() method. In the condition, we can use conditional operators, logical operators, NA
values, range operators etc. to filter out data. Syntax of filter() function is given below-
filter(dataframeName, condition)
Example:
In the below code we used the filter() function to fetch the data of players who scored more than 100 runs from the "stats" data frame.
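A sketch of code consistent with the output below (assuming dplyr is loaded and a 'stats' data frame like the one used in the later examples) is:
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))
# players who scored more than 100 runs
filter(stats, runs > 100)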
Output
player runs wickets
1 B 200 20
2 C 408 NA
distinct() method
The distinct() method removes duplicate rows from data frame or based on the specified
columns. The syntax of distinct() method is given below-
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:
Here in this example, we used distinct() method to remove the duplicate rows from the data
frame and also remove duplicates based on a specified column.
# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D', 'A', 'A'),
runs=c(100, 200, 408, 19, 56, 100),
wickets=c(17, 20, NA, 5, 2, 17))
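The distinct() calls consistent with the two outputs below (assuming dplyr is loaded) are:
# remove completely duplicated rows
distinct(stats)
# remove duplicates based on the player column, keeping all the other columns
distinct(stats, player, .keep_all = TRUE)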
Output
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
5 A 56 2
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
arrange() method
In R, the arrange() method is used to order the rows based on a specified column. The syntax of
arrange() method is specified below-
arrange(dataframeName, columnName)
Example:
In the below code we ordered the data based on the runs from low to high using arrange()
function.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# ordered data based on runs
arrange(stats, runs)
Output
player runs wickets
1 D 19 5
2 A 100 17
3 B 200 20
4 C 408 NA
select() method
The select() method is used to extract the required columns as a table by specifying the required
column names in select() method. The syntax of select() method is mentioned below-
select(dataframeName, col1,col2,…)
Example:
Here in the below code we fetched the player, wickets column data only using select() method.
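A call consistent with the output below (using the 'stats' data frame defined in the previous example) is:
# fetch only the player and wickets columns
select(stats, player, wickets)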
Output
player wickets
1 A 17
2 B 20
3 C NA
4 D 5
rename() method
The rename() function is used to change the column names. This can be done by the below
syntax-
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to “runs_scored” in stats data frame.
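A call consistent with the output below is:
# rename the runs column to runs_scored
rename(stats, runs_scored = runs)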
Output
player runs_scored wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
mutate() and transmute() methods
These methods are used to create new variables. The mutate() function creates new variables without dropping the old ones, but the transmute() function drops the old variables and creates new variables. The syntax of both methods is mentioned below-
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, we created a new column avg using mutate() and transmute() methods.
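The exact calls are not shown; from the output below, the new column appears to be avg = runs / 4 (and the data frame used here has 7 wickets for player C), so a plausible sketch is:
# add an avg column while keeping the existing columns
mutate(stats, avg = runs / 4)
# create only the avg column, dropping the existing columns
transmute(stats, avg = runs / 4)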
Output
player runs wickets avg
1 A 100 17 25.00
2 B 200 20 50.00
3 C 408 7 102.00
4 D 19 5 4.75
avg
1 25.00
2 50.00
3 102.00
4 4.75
Here the mutate() function adds a new column to the existing data frame without dropping the old ones, whereas the transmute() function creates the new variable but drops all the old columns.
summarize() method
Using the summarize method we can summarize the data in the data frame by using aggregate
functions like sum(), mean(), etc. The syntax of summarize() method is specified below-
summarize(dataframeName, aggregate_function(columnName))
Example:
In the below code we presented the summarized data present in the runs column using
summarize() method.
# summarize method
summarize(stats, sum(runs), mean(runs))
Output
sum(runs) mean(runs)
1 727 181.75
R Data Types
Different forms of data that can be saved and manipulated are defined and categorized using data
types in computer languages, including R. Each R data type has unique properties and associated
operations.
What are R Data types?
R Data types are used to specify the kind of data that can be stored in a variable.
For effective memory consumption and precise computation, the right data type must be
selected.
Each R data type has its own set of rules and restrictions.
Variables do not need to be declared with a data type in R, and the data type of a variable can even be changed.
Example of R data Type:
For example, the raw data type holds raw bytes; a raw value can be created with as.raw(), e.g. single_raw <- as.raw(255).
# A simple R program
# to illustrate Numeric data type
# Assign a decimal value to x
x = 5.6
# print the class name of variable
print(class(x))
# print the type of variable
print(typeof(x))
Output
[1] "numeric"
[1] "double"
# A simple R program
# to illustrate Numeric data type
# Assign an integer value to y
y=5
# print the class name of variable
print(class(y))
# print the type of variable
print(typeof(y))
Output
[1] "numeric"
[1] "double"
When R stores a number in a variable, it stores it as a "double" value, that is, as a double-precision decimal type. This means that a value such as 5 here is stored as 5.0, with a type of double and a class of numeric. That y is not an integer here can be confirmed with the is.integer() function.
# A simple R program
# to illustrate Numeric data type
# Assign a integer value to y
y=5
# is y an integer?
print(is.integer(y))
Output
[1] FALSE
# A simple R program
# to illustrate integer data type
# Create an integer value
x = as.integer(5)
# print the class name of x
print(class(x))
Output
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
3. Logical Data type in R
R has logical data types that take either a value of true or false.
A logical value is often created via a comparison between variables.
Boolean values, which have two possible values, are represented by this R data type: FALSE or
TRUE
# A simple R program
# to illustrate logical data type
# Sample values
x=4
y=3
# Comparing two values
z=x>y
# print the logical value
print(z)
# print the class and the type of z
print(class(z))
print(typeof(z))
Output
[1] TRUE
[1] "logical"
[1] "logical"
# A simple R program
# to illustrate complex data type
# Assign a complex value to x
x = 4 + 3i
# print the class name of x
print(class(x))
# print the type of x
print(typeof(x))
Output
[1] "complex"
[1] "complex"
# A simple R program
# to illustrate character data type
# Assign a character value to char
char = "Geeksforgeeks"
# print the class name of char
print(class(char))
# print the type of char
print(typeof(char))
Output
[1] "character"
[1] "character"
There are several tasks that can be done using R data types. Let’s understand each task with its
action and the syntax for doing the task along with an R code to illustrate the task.
6. Raw data type in R
To save and work with data at the byte level in R, use the raw data type. It represents data as a series of unprocessed bytes and enables low-level operations on binary data. Here is a small example of R's raw data type:
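A small example consistent with the output below is:
# create a raw vector of five bytes
x <- as.raw(c(1, 2, 3, 4, 5))
print(x)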
Output
[1] 01 02 03 04 05
Five elements make up this raw vector x, each of which represents a raw byte value.
Syntax
class(object)
Example
# A simple R program
# to find data type of an object
# Logical
print(class(TRUE))
# Integer
print(class(3L))
# Numeric
print(class(10.5))
# Complex
print(class(1+2i))
# Character
print(class("12-04-2020"))
Output
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "complex"
[1] "character"
Type verification
You can verify the data type of an object if you are unsure about its data type.
To do that, you need to use the prefix "is." before the data type as a command.
Syntax:
is.data_type(object)
Example
# A simple R program
# Verify if an object is of a certain datatype
# Logical
print(is.logical(TRUE))
# Integer
print(is.integer(3L))
# Numeric
print(is.numeric(10.5))
# Complex
print(is.complex(1+2i))
# Character
print(is.character("12-04-2020"))
print([Link]("a"))
print([Link](2+3i))
Output
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
You can also convert an object from one data type to another by using the prefix "as." before the target data type as a command.
Syntax
as.data_type(object)
Note: Not all coercions are possible; if attempted, an impossible coercion will return an "NA" value.
For Detailed Explanation – Data Type Conversion in R
Example
# A simple R program
# convert data type of an object to another
# Logical
print(as.numeric(TRUE))
# Integer
print(as.complex(3L))
# Numeric
print(as.logical(10.5))
# Complex
print(as.character(1+2i))
# Coercion not possible
print(as.numeric("12-04-2020"))
Output
[1] 1
[1] 3+0i
[1] TRUE
[1] "1+2i"
[1] NA
Warning message:
In print(as.numeric("12-04-2020")) : NAs introduced by coercion
Reading and Writing Data to and from R
Functions for Reading Data into R:
There are a few very useful functions for reading data into R.
1. read.table() and read.csv() are two popular functions used for reading tabular data into R.
2. readLines() is used for reading lines from a text file.
3. source() is a very useful function for reading in R code files from another R program.
4. dget() function is also used for reading in R code files.
5. load() function is used for reading in saved workspaces
6. unserialize() function is used for reading single R objects in binary format.
Functions for Writing Data to Files:
There are similar functions for writing data to files
1. write.table() is used for writing tabular data to text files (e.g. CSV).
2. writeLines() function is useful for writing character data line-by-line to a file or
connection.
3. dump() is a function for dumping a textual representation of multiple R objects.
4. dput() function is used for outputting a textual representation of an R object.
5. save() is useful for saving an arbitrary number of R objects in binary format to a file.
6. serialize() is used for converting an R object into a binary format for outputting to a
connection (or file).
skip, the number of lines to skip from the beginning
stringsAsFactors, should character variables be coded as factors? This defaults to TRUE
because back in the old days, if you had data that were stored as strings, it was because
those strings represented levels of a categorical variable. Now we have lots of data that is
text data and they don’t always represent categorical variables. So you may want to set
this to be FALSE in those cases. If you always want this to be FALSE, you can set a
global option via options(stringsAsFactors = FALSE). I’ve never seen so much heat
generated on discussion forums about an R function argument than the stringsAsFactors
argument.
Check the following example of how to work with read.table() in R. For this example, a data set called the wine data set will be used. You can download the data set by clicking here. The data set was originally taken from the UCI Repository; you can get more details about the data set from there.
readLines() function is mainly used for reading lines from a text file and writeLines()
function is useful for writing character data line-by-line to a file or connection. Check the
following example to deal with readLines() and writeLines(). First, download the sample
text from here and then read it into R.
Download the Sample Text
con <- file("[Link]", "r")
w<-readLines(con)
close(con)
w[1]
w[2]
w[3]
Output:
> w[1]
[1] "This is a sample text file."
> w[2]
[1] "Read this file using readLines() function."
> w[3]
[1] "And you can wrtie a file using writeLines() function."
You can also write contents into a file using writeLines() function in R. Following
example shows how to do that.
sample<-c("Class,Alcohol,Malic acid,Ash","1,14.23,1.71,2.43","1,13.2,1.78,2.14")
writeLines(sample,"F://[Link]")
You can write them into tsv file also using below code.
sample<-c("Class,Alcohol,Malic acid,Ash","1,14.23,1.71,2.43","1,13.2,1.78,2.14")
t<- gsub(",", "\t", sample)
writeLines(t, "F://[Link]")
# Now read in 'dput' output from the file
y <- dget("F:/w.R")
y
dump() Function in R:
Binary Formats in R:
The complement to the textual format is the binary format, which is sometimes useful for efficiency purposes. Sometimes there is no useful way to represent your data in a textual manner, and then the binary format helps to import and export data in R. The main functions for converting R objects into a binary format are save(), save.image(), and serialize(). Individual R objects can be saved to a file using the save() function.
x <- data.frame(col1 = rep(10,10), col2 = runif(10,min=0,max=10))
y<-rnorm(10)
z<-100:110
#Save 'x', 'y' and 'z' to a file
save(x,y,z,file="F:/[Link]")
#OR
save(x,y,z,file="F:/[Link]")
#Load 'x', 'y' and 'z' into your workspace
load("F:/[Link]")
#OR
load("F:/[Link]")
If you have a lot of objects that you want to save to a file in one run, you can save all objects in your workspace using the save.image() function.
# serialize an object to a raw vector (assumed here to be 'x' created above)
s <- serialize(x, NULL)
s
save(s, file="F:/test_serialization.rda")
load("F:/test_serialization.rda")
unserialize(s)
Now you are familiar with save() and load() function in R. They allow you to save a
named R object to a file or other connection and restore that object again. When loaded
the named object is restored to the current environment with the same name it had when
saved. This is annoying for example when you have a saved model object resulting from
a previous fit and you want to compare it with the model object returned when the R code
is rerun. Unless you change the name of the model fit object in your script you can’t have
both the saved object and the newly created one available in the same environment at the
same time. saveRDS() provides a far better solution to this problem and to the general
one of saving and loading objects created with R. saveRDS() serializes an R object into a
format that can be saved.
save() does a similar thing, but with one important difference: saveRDS() doesn't save both the object and its name, it just saves a representation of the object. As a result, the saved object can be loaded into a named object within R that is different from the name it had when originally serialized. Another difference is that save() can save many objects to a file in a single call, whilst saveRDS(), being a lower-level function, works with a single object at a time.
# save a single object to file
women
saveRDS(women, "F://[Link]")
# restore it under a different name
women2 <- readRDS("F://[Link]")
identical(women, women2)
Output:
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
identical(women, women2)
[1] TRUE
CSV Files in R
In R you can read and write various file formats like CSV, Excel, JSON, XML, etc. A CSV file is a text file in which the values in the columns are separated by a comma. The read.csv() function is used to read a CSV file in your working directory. Similarly, the write.csv() function is used to write a CSV file. You can download the sample data set by clicking here and then read it using the read.csv() function.
Download the IRIS Data Set
Reading a CSV File in R:
mydata <- read.csv(file="[Link]", header=TRUE, sep=",")
head(mydata)
dim(mydata)
summary(mydata)
Output:
> head(mydata)
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
> dim(mydata)
[1] 150 6
> summary(mydata)
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median : 75.50 Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Writing into a CSV File:
The write.csv() function is used to create the CSV file.
mydata <- read.csv(file="[Link]", header=TRUE, sep=",")
t <- tail(mydata)
t
write.csv(t, "Iris_tail.csv", row.names = FALSE)
Output:
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 145 6.7 3.3 5.7 2.5 virginica
146 146 6.7 3.0 5.2 2.3 virginica
147 147 6.3 2.5 5.0 1.9 virginica
148 148 6.5 3.0 5.2 2.0 virginica
149 149 6.2 3.4 5.4 2.3 virginica
150 150 5.9 3.0 5.1 1.8 virginica
Unit 6
Graphical
analysis in R
Topics
Basic Plotting in R
Scatter Plots
Line Plot
Histogram
Boxplot
Bar Plot
Graphical analysis in R involves using visual representations to explore, analyze, and interpret
data. R is a powerful statistical computing language with extensive capabilities for creating
various types of plots and charts. Here's a detailed guide to performing graphical analysis using
R, covering the basics, common types of plots, and practical examples.
Data visualization is a technique used for the graphical representation of data. By using elements
like scatter plots, charts, graphs, histograms, maps, etc., we make our data more understandable.
Data visualization makes it easy to recognize patterns, trends, and exceptions in our data. It
enables us to convey information and results in a quick and visual way.
It is easier for a human brain to understand and retain information when it is represented in a
pictorial form. Therefore, Data Visualization helps us interpret data quickly, examine different
variables to see their effects on the patterns, and derive insights from our data.
Getting Started
Before creating plots, ensure you have R and a suitable IDE like RStudio installed. You can use
R’s built-in plotting functions or packages like `ggplot2` for more advanced and aesthetically
pleasing graphics.
Base R Graphics
There are some key elements of a statistical graphic. These elements are the basics of the
grammar of graphics. R provides some built-in functions which are included in the graphics
package for data visualization in R. Let’s discuss each of the elements one by one to gain the
basic knowledge of graphics.
Now we are going to use the default mtcars dataset for data visualization in R.
#To load graphics package
library("graphics")
#To load datasets package
library("datasets")
#To load mtcars dataset
data(mtcars)
#To analyze the structure of the dataset
str(mtcars)
It contains data about the design, performance and fuel economy of 32 automobiles from 1973 to
1974, extracted from the 1974 Motor Trend US magazine.
Basic Plotting in R
R provides several base plotting functions. Here’s a quick overview:
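A plausible call that produces the dot plot described below is:
#To plot mpg (Miles per Gallon) for each car, with the car index on the x-axis
plot(mtcars$mpg, ylab = "Miles per Gallon")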
Output:
Here, we get a scatter/dot plot wherein we can observe that there are only six cars with miles per
gallon (mpg) more than 25.
#To find relation between hp (Horse Power) and mpg (Miles per Gallon)
plot(mtcars$hp,mtcars$mpg, xlab = "HorsePower", ylab = "Miles per Gallon", type = "h", col =
"blue")
Here, we can observe that hp and mpg have a negative correlation, which means that as Horse
Power increases Miles per Gallon decreases.
1. Scatter Plot
Scatter plots are used to visualize the relationship between two continuous variables.
#Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
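A basic scatter plot call for this sample data (the title and colour are illustrative):
# Basic scatter plot
plot(x, y, main = "Scatter Plot", xlab = "X-axis", ylab = "Y-axis", pch = 19, col = "blue")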
2. Line Plot
Line plots are used to visualize data trends over time or other continuous variables.
# Sample data
x <- 1:10
y <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
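A basic line plot call for this sample data (the title and colour are illustrative):
# Basic line plot
plot(x, y, type = "l", main = "Line Plot", xlab = "X-axis", ylab = "Y-axis", col = "red")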
3. Histogram
Histograms display the distribution of a single continuous variable.
It is used to divide values into groups of continuous ranges measured against the frequency range
of the variable.
For example:
#To find histogram for mpg (Miles per Gallon)
hist(mtcars$mpg,xlab = "Miles Per Gallon", main = "Histogram for MPG", col = "yellow")
#Sample data
data <- rnorm(1000)   # Generating random data
# Basic histogram
hist(data, main="Histogram", xlab="Values", ylab="Frequency", col="lightblue",
border="black")
4. Bar Plot
Bar plots are used to compare different categories.
# Sample data
categories <- c("A", "B", "C")
values <- c(3, 7, 5)
# Basic bar plot
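# (a plausible call for the sample data above; the colour is illustrative)
barplot(values, names.arg = categories, main = "Bar Plot", col = "lightgreen")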
It is used to represent data in the form of rectangular bars, both in vertical and horizontal ways,
and the length of the bar is proportional to the value of the variable.
For example:
#To draw a barplot of hp
#Horizontal
barplot(mtcars$hp,xlab = "HorsePower", col = "cyan", horiz = TRUE)
#Vertical
barplot(mtcars$hp, ylab = "HorsePower", col = "cyan", horiz = FALSE)
5. Box Plot
Box plots are used to show the distribution of a continuous variable and highlight outliers.
It is used to represent descriptive statistics of each variable in a dataset. It represents the
minimum, first quartile, median, third quartile, and the maximum values of a variable.
#To draw boxplots for disp (Displacement) and hp (Horse Power)
boxplot(mtcars[,3:4])
# Sample data
data <- list(Group1=rnorm(50), Group2=rnorm(50, mean=5))
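A plausible box plot call for this sample data (the colours are illustrative):
# Basic box plot of the two groups
boxplot(data, main = "Box Plot", col = c("lightblue", "lightpink"))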
Advanced Plotting with ggplot2
The ggplot2 package in R is based on the grammar of graphics, which is a set of rules for
describing and building graphs. By breaking up graphs into semantic components such as scales
and layers, ggplot2 implements the grammar of graphics.
install.packages("ggplot2")
library(ggplot2)
we are going to use the mtcars dataset from the datasets package in R that can
be loaded as follows:
library("datasets")
data(mtcars)
str(mtcars)
# Sample data
df <- [Link](x=1:10, y=c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29))
# Scatter plot
ggplot(df, aes(x=x, y=y)) +
geom_point() +
ggtitle("Scatter Plot") +
xlab("X-axis") +
ylab("Y-axis")
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
#To draw scatter plot
ggplot(mtcars, aes(x= cyl , y= vs)) + geom_point()
Since this plot has a lot of overlapped values, which is known as overplotting, we will
use geom_jitter() function to add a certain amount of noise to avoid it.
Here, we can also use the argument alpha to set the transparency of the points to further reduce
overplotting for data visualization in R.
#Transparency set to 50%
ggplot(mtcars, aes(x= cyl , y= vs)) + geom_jitter(width = 0.1, alpha = 0.5)
For example:
#To add the labels
ggplot(mtcars, aes(x= cyl , y= vs ,color = am)) +
geom_jitter(width = 0.1, alpha = 0.5) +
labs(x = "Cylinders",y = "Engine Type", color = "Transmission(0 = automatic, 1 = manual)")
# Line plot
ggplot(df, aes(x=x, y=y)) + geom_line() + ggtitle("Line Plot") + xlab("X-axis") + ylab("Y-axis")
# Sample data
df <- data.frame(values=rnorm(1000))
# Histogram
ggplot(df, aes(x=values)) +
geom_histogram(binwidth=0.5, fill="lightblue", color="black") +
ggtitle("Histogram") +
xlab("Values") +
ylab("Frequency")
Themes
It is used to change the attributes of non-data elements of our plot, like text, lines, background, etc. We use the theme() function to make changes to these elements for data visualization in R.
Faceting
It is used to further drill down data and split the data by one or more variables, and then plot the
subsets of the data altogether for optimum data visualization in R.
For example:
# Sample data
df <- data.frame(Category=c("A", "B", "C"), Value=c(3, 7, 5))
# Bar plot
ggplot(df, aes(x=Category, y=Value)) +
geom_bar(stat="identity", fill="lightgreen") +
ggtitle("Bar Plot") +
xlab("Category") +
ylab("Value")
# Sample data
df <- data.frame(Group=rep(c("Group1", "Group2"), each=50), Value=c(rnorm(50), rnorm(50,
mean=5)))
# Box plot
ggplot(df, aes(x=Group, y=Value, fill=Group)) +
geom_boxplot() + ggtitle("Box Plot") + xlab("Group") + ylab("Value")
#To draw a Box plot
ggplot(mtcars, aes(x = cyl,y = mpg)) + geom_boxplot(fill = "cyan", alpha = 0.5) +
theme_bw() + labs(title = "Cylinder count vs Miles per Gallon",x = "Cylinders",
y = "Miles per Gallon")
Customizing Plots
- Titles and Labels: Add titles and axis labels to make your plots more informative.
- Colors and Themes: Customize colors and themes to enhance visual appeal.
- Legends and Annotations: Add legends, annotations, and text to provide more context.
Saving Plots
You can save plots to files using `ggsave` in `ggplot2` or functions like `png()`, `jpeg()`, and
`pdf()` in base R.
  ylab("Y-axis")
ggsave("scatter_plot.png")
Unit 7
Advance R
Topics
Advanced R
Statistical models in R
Correlation and regression analysis
Analysis of Variance (ANOVA)
Creating data for complex analysis
Summarizing data, and case studies.
Correlation and Regression Analysis in R
This section contains R methods for computing and visualizing correlation analyses. Recall that,
correlation analysis is used to investigate the association between two or more variables. A
simple example, is to evaluate whether there is a link between maternal age and child’s weight at
birth.
Correlation analysis
1. Pearson correlation (r), which measures a linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data. It can be used only when x and y come from a normal distribution. The plot of y = f(x) is called the linear regression line. The Pearson correlation formula is:
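Stated here for completeness, the standard form of the sample Pearson coefficient (in LaTeX notation) is

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

where \bar{x} and \bar{y} are the sample means of x and y.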
2. Kendall tau and Spearman rho, which are rank-based correlation coefficients (non-
parametric)
3. The Spearman correlation method computes the correlation between the rank of x
and the rank of y variables.
The Kendall correlation method measures the correspondence between the ranking of x and y
variables. The total number of possible pairings of x with y observations is n(n−1)/2, where n is
the size of x and y.
The procedure is as follow:
1. Begin by ordering the pairs by the x values. If x and y are correlated, then they would have
the same relative rank orders.
2. Now, for each yi, count the number of yj>yi (concordant pairs (c)) and the number of yj<yi
(discordant pairs (d)).
Kendall's correlation coefficient is then defined as follows:
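Stated here for completeness, the standard form of the coefficient (in LaTeX notation) is

\tau = \frac{n_c - n_d}{\frac{1}{2}\, n(n-1)}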
where n_c is the number of concordant pairs and n_d is the number of discordant pairs.
Visualizing the relationship using scatter plot
We can show the correlation in the form of scatter plot as follows:
library("ggpubr")
## Loading required package: ggplot2
ggscatter(my_data, x = "mpg", y = "wt",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
## `geom_smooth()` using formula 'y ~ x'
Scatter plot with smooth fit curve
Preliminary checks before finding the Pearson correlation coefficient
In order to apply the Pearson correlation test, the data should satisfy some conditions.
Preliminary tests to check the test assumptions:
1. Is the covariation linear? Yes, from the plot above, the relationship is linear. In situations where the scatter plot shows curved patterns, we are dealing with a nonlinear association between the two variables.
2. Do the data for each of the 2 variables (x, y) follow a normal distribution?
Pearson correlation test
Correlation test between mpg and wt variables:
res <- cor.test(my_data$wt, my_data$mpg, method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: my_data$wt and my_data$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
Similarly the Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a
rank-based measure of association. This test may be used if the data do not necessarily come
from a bivariate normal distribution.
res2 <- cor.test(my_data$wt, my_data$mpg, method="kendall")
## Warning in cor.test(my_data$wt, my_data$mpg, method = "kendall"): Cannot
## compute exact p-value with ties
res2
##
## Kendall's rank correlation tau
##
## data: my_data$wt and my_data$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7278321
Further Spearman’s rho statistic is also used to estimate a rank-based measure of association.
This test may be used if the data do not come from a bivariate normal distribution.
res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
## Warning in cor.test(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
res3
##
## Spearman's rank correlation rho
##
## data: my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.886422
Interpreting correlation test
Result of a correlation test can be interpreted using the correlation coefficient and p-value
of the test. Correlation coefficient is comprised between -1 and 1:
1. -1 indicates a strong negative correlation : this means that every time x increases, y decreases
2. 0 means that there is no association between the two variables (x and y)
3. 1 indicates a strong positive correlation : this means that y increases with x
Use of the p-value: If the p-value of the correlation test is less than 0.05, the null hypothesis of no significant correlation is rejected at the 5% significance level.
Problem 1: From the following data, compute Karl Pearson’s coefficient of correlation.
Solution: As the first step, read the variables price and supply and use the cor.test() function on the variable pair. The R code and the result are shown below:
price=c(10,20,30,40,50,60,70)
supply=c(8,6,14,16,10,20,24)
resp1=cor.test(price,supply,method='pearson')
resp1
##
## Pearson's product-moment correlation
##
## data: price and supply
## t = 3.6145, df = 5, p-value = 0.01531
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2707625 0.9774828
## sample estimates:
## cor
## 0.8504201
Interpretation: The Pearson coefficient is 0.8504201 and the p-value is 0.01531 < 0.05, so the null hypothesis of no correlation is rejected. It is therefore statistically reasonable to conclude that there is a significant positive correlation between price and supply based on the sample.
Problem: From the following data compute correlation between height of father and height
of daughters by Karl Pearson’s coefficient of correlation.
Solution: As the first step, read the variables height_F and height_D and use the cor.test() function on the variable pair. The R code and the result are shown below:
height_F=c(65,66,67,67,68,69,71,73)
height_D=c(67,68,64,69,72,70,69,73)
resp2=cor.test(height_F,height_D,method='pearson')
resp2
##
## Pearson's product-moment correlation
##
## data: height_F and height_D
## t = 2.0717, df = 6, p-value = 0.08369
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1080788 0.9281049
## sample estimates:
## cor
## 0.6457766
Interpretation: The Pearson coefficient is 0.6457766, but the p-value is 0.08369 > 0.05, so the null hypothesis is not rejected. It is therefore statistically reasonable to conclude that there is no significant correlation between the heights of fathers and daughters based on the sample.
We can show the correlation in the form of scatter plot as follows:
library("ggpubr")
data1=data.frame(height_F,height_D)
ggscatter(data1, x = "height_F", y = "height_D",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Height of Father (cm)", ylab = "Height of Daughter (cm)")
## `geom_smooth()` using formula 'y ~ x'
Problem: The scores for nine students in history and algebra are as follows:
ress3 <- cor.test(History, Algebra, method = "spearman")   # History and Algebra hold the scores (assumed call, consistent with the output below)
ress3
##
## Spearman's rank correlation rho
##
## data: History and Algebra
## S = 12, p-value = 0.002028
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9
Interpretation: The correlation coefficient is 0.9 and the p-value is 0.002028 < 0.05, so the null hypothesis is rejected. It is therefore statistically reasonable to conclude that there is a significant positive correlation between the scores in History and Algebra based on the sample.
We can show the correlation in the form of scatter plot as follows (Kassambara 2020):
library("ggpubr")
data2=data.frame(History,Algebra)
ggscatter(data2, x = "History", y = "Algebra",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Marks in History", ylab = "Marks in Algebra")
## `geom_smooth()` using formula 'y ~ x'
Scatter plot with smooth fit curve
Correlation Matrix
Previously, we described how to perform correlation test between two variables. In this section,
you’ll learn how to compute a correlation matrix, which is used to investigate the dependence
between multiple variables at the same time. The result is a table containing the correlation
coefficients between each variable and the others.
There are different methods for correlation analysis : Pearson parametric correlation test,
Spearman and Kendall rank-based correlation analysis.
Compute correlation matrix in R
As you may know, The R function cor() can be used to compute a correlation matrix. A
simplified format of the function is :
syntax cor(x, method = c("pearson", "kendall", "spearman"))
Example: Here, we’ll use a data ( few numeric columns) derived from the built-in R data
set mtcars as the first example:
# Load data
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
Unfortunately, the function cor() returns only the correlation coefficients between variables. In the next section, we will use the Hmisc R package to calculate the correlation p-values. The function rcorr() [in the Hmisc package] can be used to compute the significance levels for Pearson and Spearman correlations. It returns both the correlation coefficients and the p-values of the correlation for all possible pairs of columns in the data table.
Syntax rcorr(x, type = c("pearson","spearman"))
The following R code illustrate the use of rcorr() function on my_data.
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## [Link], units
resH <- rcorr(as.matrix(my_data))
resH
Below is the code to compute the correlation.
1. Loading the dataset
> data1 <- swiss
> head(data1, 4)
2. Creating a scatter plot using the ggplot2 library
> library(ggplot2)
> ggplot(data1, aes(x = Fertility, y = Infant.Mortality)) + geom_point() +
+   geom_smooth(method = "lm", se = TRUE, color = 'black')
3. Testing the assumptions (linearity and normality)
Linearity: visible from the plot itself (true, the relationship is linear)
Normality: using the Shapiro test (this is a test of normality; here we are checking whether the variables are normally distributed or not)
> shapiro.test(data1$Fertility)
Shapiro-Wilk normality test
data: data1$Fertility
W = 0.97307, p-value = 0.3449
> shapiro.test(data1$Infant.Mortality)
Shapiro-Wilk normality test
data: data1$Infant.Mortality
W = 0.97762, p-value = 0.4978
Both p-values are greater than 0.05, so we can assume normality.
4. Correlation Coefficient
> cor(data1$Fertility, data1$Infant.Mortality)
[1] 0.416556
5. Checking for the significance
> Tes <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")
>
> Tes
Since the p-value is less than 0.05 (here it is 0.003585), we can conclude that Fertility and Infant Mortality are significantly correlated, with a correlation coefficient of 0.4166 and a p-value of 0.003585.
Regression analysis
Can you predict a company's revenue by analyzing the budget it allocates to its marketing team? Yes, you can. Do you know how to predict using linear regression in R? Not yet? Well, let me show you how. In this section, we will discuss one of the simplest machine-learning techniques, linear regression in R. Regression is an almost 200-year-old tool that is still effective in data science, and it is one of the oldest statistical tools used in machine-learning predictive analysis.
Y = B0 + B1X
Where, Y – Dependent variable
X – Independent variable
B0 and B1 – Regression parameter
Let’s try to understand the practical application of linear Regression in R with another example.
Let’s say we have a dataset of the blood pressure and age of a certain group of people. With the
help of this data, we can train a simple linear regression model in R, which will be able to
predict blood pressure at ages that are not present in our dataset.
The equation of the regression line for our dataset is:
BP = 98.7147 + 0.9709 Age
where BP plays the role of Y and Age plays the role of X.
Now let’s see how to do this
Step 1: Import the Dataset
Import the dataset of Age vs. Blood Pressure, a CSV file, using the function read.csv() in R, and store this dataset in a data frame bp.
bp <- read.csv("bp.csv") # replace with the path of the downloaded Age vs. Blood Pressure CSV
Step 2: Create the Data Frame for Predicting Values
Create a data frame that will store Age 53. This data frame will help us predict blood pressure at
Age 53 after creating a linear regression model.
p <- as.data.frame(53)
colnames(p) <- "Age"
Step 3: Create a Scatter Plot using the ggplot2 Library
Taking the help of the ggplot2 library in R, we can see that there is a correlation between Blood
Pressure and Age, as we can see that the increase in Age is followed by an increase in blood
pressure.
We can also use the plot() function in R for the scatterplot and the abline() function to draw the straight line. It is quite evident from the graph that the points are scattered in a manner that lets us fit a straight line through them.
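The plotting code itself is not reproduced in this excerpt; a minimal sketch, using the BP and Age columns referenced in the later steps:
# Scatter plot with a fitted regression line using ggplot2
library(ggplot2)
ggplot(bp, aes(x = Age, y = BP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# Equivalent base-R version: scatter plot plus the fitted straight line
plot(bp$Age, bp$BP, xlab = "Age", ylab = "Blood Pressure")
abline(lm(BP ~ Age, data = bp))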
Step 4: Calculate the Correlation Between Age and Blood Pressure
We can also verify our above analysis that there is a correlation between Blood Pressure and
Age by taking the help of the cor( ) function in R, which is used to calculate the correlation
between two variables.
cor(bp$BP,bp$Age)
[1] 0.6575673
Step 5: Create a Linear Regression Model
Now, leveraging the lm() function in R, let’s build a linear model. Using ‘BP ~ Age’ as the
formula, with Age as the independent variable and Blood Pressure as the dependent variable, we
apply this to our dataset named ‘bp’. The model seamlessly fits the data, showcasing the power
of R linear Regression.
model <- lm(BP ~ Age, data = bp)
Summary of Our Linear Regression Model
summary(model)
Output:
##
## Call:
## lm(formula = BP ~ Age, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.724 -6.994 -0.520 2.931 75.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05
Interpretation of the Model
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## Age 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B0 = 98.7147 (Y- intercept)
B1 = 0.9709 (Age coefficient)
BP = 98.7147 + 0.9709 Age
This means that a one-unit change in Age is associated with a 0.9709-unit change in Blood Pressure.
Standard Error
The standard error captures the sampling variability of each coefficient estimate: about 10.0005 for the intercept and about 0.2102 for the Age coefficient.
T value
The t value is the coefficient divided by its standard error; it measures how large the estimate is relative to its uncertainty. The larger the coefficient relative to the standard error, the larger the t value. Each t value comes with a p-value, which indicates how statistically significant the variable is to the model at a 95% confidence level. We compare this p-value with alpha = 0.05; in our case, the p-values of both the intercept and Age are less than alpha, which implies that both are statistically significant to our model.
We can calculate the confidence intervals using confint(model, level = .95).
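As a quick illustration of the two points above, here is a short sketch of the confint() call mentioned in the text and of how a t value is formed from the summary output shown earlier:
# 95% confidence intervals for the intercept and the Age coefficient
confint(model, level = 0.95)
# A t value is simply the estimate divided by its standard error,
# e.g. for Age: 0.9709 / 0.2102 is approximately 4.62, matching the summary above
0.9709 / 0.2102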
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05
Residual Standard Error
The residual standard error, or the standard error of the model, is basically the average error of the model, which is 17.31 in our case; it means that our model can be off by an average of 17.31 when predicting blood pressure. The smaller the error, the better the model predicts.
Multiple R-squared
Multiple R-squared = 1 − (sum of squared errors / total sum of squares).
Adjusted R-squared
If we add variables, whether or not they are significant for prediction, the value of R-squared will increase. This is why adjusted R-squared is used: if an added variable is not significant for the model's predictions, the adjusted R-squared will decrease. It is one of the most helpful tools to avoid overfitting the model.
F – statistics
The F-statistic is the ratio of the mean square of the model to the mean square of the error. In other words, it compares how much variation the model explains with how much error remains; the higher the F value, the better the model is doing relative to the error.
One is the degrees of freedom of the numerator of the F-statistic, and 28 is the degrees of freedom of the errors.
Step 6: Run a Sample Test
Now, let’s try using our model to predict the value of blood pressure for someone at age 53.
BP = 98.7147 + 0.9709 Age
The above formula will be used to calculate blood pressure at the age of 53, and this will be achieved using the predict() function. We pass the name of the linear regression model first and then, separated by a comma, the new data: the data frame p in which Age 53 was saved earlier.
predict(model, newdata = p)
Output:
## 1
## 150.1708
So, the predicted value of blood pressure is 150.17 at age 53
So far we have predicted Blood Pressure from its association with Age alone, but there can be more than one independent variable that shows a correlation with the dependent variable. This is called Multiple Regression.
Multiple Linear Regression Model
Multiple linear regression analysis is a statistical technique for finding the association of multiple independent variables with a dependent variable. For example, the revenue generated by a company depends on various factors, including market size, price, promotion, competitors' prices, etc. Basically, multiple linear regression in R establishes a linear relationship between a dependent variable and multiple independent variables.
The equation of multiple linear regression is as follows:
Y = B0 + B1X1 + B2X2 + ... + BkXk + E
where Y is the dependent variable, X1, ..., Xk are the independent variables, B0, B1, ..., Bk are the multiple linear regression coefficients, and E is the error term.
Taking another example, the Wine dataset: with the help of AGST and HarvestRain, we are going to predict the price of wine. Here AGST and HarvestRain are the independent (predictor) variables.
Here’s how we can build a multiple R linear regression model.
Step 1: Import the Dataset
Using the function read.csv(), import both data sets, wine.csv and wine_test.csv, into the data frames wine and wine_test, respectively.
wine <- read.csv("wine.csv")
wine_test <- read.csv("wine_test.csv")
Step 2: Find the Correlation Between Different Variables
Using the cor( ) function and round( ) function, we can round off the correlation between all
variables of the dataset wine to two decimal places.
round(cor(wine),2)
Output:
Year Price WinterRain AGST HarvestRain Age FrancePop
## Year 1.00 -0.45 0.02 -0.25 0.03 -1.00 0.99
## Price -0.45 1.00 0.14 0.66 -0.56 0.45 -0.47
## WinterRain 0.02 0.14 1.00 -0.32 -0.28 -0.02 0.00
## AGST -0.25 0.66 -0.32 1.00 -0.06 0.25 -0.26
## HarvestRain 0.03 -0.56 -0.28 -0.06 1.00 -0.03 0.04
## Age -1.00 0.45 -0.02 0.25 -0.03 1.00 -0.99
## FrancePop 0.99 -0.47 0.00 -0.26 0.04 -0.99 1.00
Step 3: Create Scatter Plots Using ggplot2 Library
Create a scatter plot using the library ggplot2 in R. This clearly shows that AGST and the Price
of the wine are highly correlated. Similarly, the scatter plot between HarvestRain and the Price
of wine also shows their correlation.
ggplot(wine,aes(x = AGST, y = Price)) + geom_point() +geom_smooth(method = "lm")
ggplot(wine,aes(x = HarvestRain, y = Price)) + geom_point() +geom_smooth(method = "lm")
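The lm() call and summary for this model are not reproduced in this excerpt; a minimal sketch, assuming Price is regressed on AGST and HarvestRain as described above (the fitted object is named model1, the name used in the prediction step below):
# Fit the multiple linear regression model and inspect its summary
model1 <- lm(Price ~ AGST + HarvestRain, data = wine)
summary(model1)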
A one-unit change in HarvestRain is associated with a 0.00457-unit change in Price.
Standard Error
The standard error captures the sampling variability of each coefficient estimate: about 1.85443 for the intercept, 0.11128 for AGST, and 0.00101 for HarvestRain.
In this case, the p-values of the intercept, AGST, and HarvestRain are all less than alpha (alpha = 0.05), which implies that all are statistically significant to our model.
## Residual standard error: 0.3674 on 22 degrees of freedom
## Multiple R-squared: 0.7074, Adjusted R-squared: 0.6808
## F-statistic: 26.59 on 2 and 22 DF, p-value: 1.347e-06
Residual Standard Error
The residual standard error, or the standard error of the model, is 0.3674 in our case, which means that our model can be off by an average of 0.3674 when predicting the price of wines. The smaller the error, the better the model predicts. We have also looked at the residuals, which need to follow a normal distribution.
Multiple R-squared = 1 − (sum of squared errors / total sum of squares).
Two is the degrees of freedom of the numerator of the F-statistic, and 22 is the degrees of freedom of the errors.
Step 5: Predict the Values for Our Test Set
prediction <- predict(model1, newdata = wine_test)
Predicted values with the test data set
wine_test
prediction
## 1 2
## 6.982126 7.101033
Advantages of Simple Linear Regression in R
1. Simple to understand: Linear regression in R programming is easy to grasp, even for beginners,
making it accessible to anyone interested in data analysis.
2. Easy interpretation: It’s straightforward to interpret the relationship between two variables
because linear regression provides coefficients that tell you how the dependent variable changes
with a one-unit change in the independent variable.
3. Fast computations: Linear regression in R is computationally efficient, so you can analyze large
datasets quickly, which is great for projects with tight deadlines.
4. Visualizations: You can easily create scatterplots and regression lines in R to visualize the
relationship between variables, helping you understand your data better.
Disadvantages of Simple Linear Regression in R
Assumes linearity: Linear regression assumes that the relationship between variables is linear. If
this isn’t true, your model may not be accurate.
Assumes equal variance: Linear regression also assumes that the variability of the data
(residuals) is the same across all values of the independent variable. If this assumption is
violated, your predictions might not be reliable.
Sensitive to outliers: Linear regression in R Programming is sensitive to outliers, which are data
points that don’t fit the pattern of the rest of the data. Outliers can skew your results and make
your model less accurate.
Limited to two variables: Linear regression can only analyze the relationship between two
variables. If your data is more complex and involves multiple predictors, you might need to use
more advanced techniques.
Can’t predict outside the data range: Linear regression shouldn’t be used to make predictions
outside the range of your data because it might not give you accurate results.
ANOVA (ANalysis Of VAriance) is a statistical test to determine whether two or more population means are different. In other words, it is used to compare two or more groups to see if they are significantly different.
In practice, however:
the Student t-test is used to compare 2 groups;
ANOVA generalizes the t-test beyond 2 groups, so it is used to compare 3 or more groups.
Note that there are several versions of the ANOVA (e.g., one-way ANOVA, two-way ANOVA, mixed ANOVA, repeated measures ANOVA, etc.). In this article, we present the simplest form only, the one-way ANOVA, and we refer to it simply as ANOVA in the remainder of the article.
Although ANOVA is used to make inferences about the means of different groups, the method is called "analysis of variance" because it compares the "between" variance (the variance between the different groups) with the "within" variance (the variance within each group). If the between variance is significantly larger than the within variance, the group means are declared to be different. Otherwise, we cannot conclude one way or the other. The two variances are compared by taking the ratio (variance between / variance within) and then comparing this ratio to a threshold from the Fisher probability distribution (a threshold based on a specific significance level, usually 5%).
This is enough theory regarding the ANOVA method for now. In the remainder of this article, we discuss it from a more practical point of view, and in particular we cover the following points:
the aim of the ANOVA, when it should be used and the null/alternative hypothesis
the underlying assumptions of the ANOVA and how to check them
how to perform the ANOVA in R
how to interpret results of the ANOVA
understand the notion of post-hoc test and interpret the results
how to visualize results of ANOVA and post-hoc tests
Data
Data for the present article is the penguins dataset (an alternative to the well-known iris dataset),
accessible via the {palmerpenguins} package:
# install.packages("palmerpenguins")
library(palmerpenguins)
The dataset contains data for 344 penguins of 3 different species (Adelie, Chinstrap and
Gentoo). The dataset contains 8 variables, but we focus only on the flipper length and the
species for this article, so we keep only those 2 variables:
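The code that builds the reduced data frame dat is not shown in this excerpt; a minimal sketch, assuming the column names used by the {palmerpenguins} package:
# Keep only the species and flipper length columns
dat <- penguins[, c("species", "flipper_length_mm")]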
library(tidyverse)
ggplot(dat) +
  aes(x = species, y = flipper_length_mm, color = species) +
  geom_jitter() +
  theme(legend.position = "none")
Here, the factor is the species variable which contains 3 modalities or groups (Adelie, Chinstrap
and Gentoo).
Aim and hypotheses of ANOVA
As mentioned in the introduction, the ANOVA is used to compare groups (in practice, 3 or more
groups). More generally, it is used to:
study whether measurements are similar across different modalities (also called levels or
treatments in the context of ANOVA) of a categorical variable
compare the impact of the different levels of a categorical variable on a quantitative variable
explain a quantitative variable based on a qualitative variable
In this context and as an example, we are going to use an ANOVA to help us answer the
question: “Is the length of the flippers different between the 3 species of penguins?”.
The null and alternative hypotheses of an ANOVA are:
H0: all means are equal (the 3 species are equal in terms of flipper length)
H1: at least one mean is different (at least one species is different from the other 2 species in terms of flipper length)
Be careful that the alternative hypothesis is not that all means are different. The opposite of all means being equal (H0) is that at least one mean is different from the others (H1).
In this sense, if the null hypothesis is rejected, it means that at least one species is different from
the other 2, but not necessarily that all 3 species are different from each other. It could be that
flipper length for the species Gentoo is different than for the species Chinstrap and Adelie, but
flipper length is similar between Chinstrap and Adelie. Other types of tests (known as post-hoc tests and covered later in this article) must be performed to test whether all 3 species differ.
Underlying assumptions of ANOVA
As for many statistical tests, there are some assumptions that need to be met in order to be able
to interpret the results. When one or several assumptions are not met, although it is technically
possible to perform these tests, it would be incorrect to interpret the results and trust the
conclusions.
Below are the assumptions of the ANOVA, how to test them and which other tests exist if an
assumption is not met:
Variable type: ANOVA requires a mix of one continuous quantitative dependent variable
(which corresponds to the measurements to which the question relates) and
one qualitative independent variable (with at least 2 levels which will determine the groups
to compare).
Independence: the data, collected from a representative and randomly selected portion of the
total population, should be independent between groups and within each group. The
assumption of independence is most often verified based on the design of the experiment and
on the good control of experimental conditions rather than via a formal test. If you are still
unsure about independence based on the experiment design, ask yourself if one observation
is related to another (if one observation has an impact on another) within each group or
between the groups themselves. If not, it is most likely that you have independent samples. If
observations between samples (forming the different groups to be compared) are dependent
(for example, if three measurements have been collected on the same individuals as it is often
the case in medical studies when measuring a metric (i) before, (ii) during and (iii) after a
treatment), the repeated measures ANOVA should be preferred in order to take into account
the dependency between the samples.
Normality:
o In case of small samples, residuals should approximately follow a normal distribution. The normality assumption can be tested visually thanks to a histogram and a QQ-plot, and/or formally via a normality test such as the Shapiro-Wilk or Kolmogorov-Smirnov test. If, even after a transformation of your data (e.g., logarithmic transformation, square root, Box-Cox, etc.), the residuals still do not follow approximately a normal distribution, the Kruskal-Wallis test can be applied (kruskal.test(variable ~ group, data = dat) in R). This non-parametric test, robust to non-normal distributions, has the same goal as the ANOVA (comparing 3 or more groups), but it uses sample medians instead of sample means to compare groups.
o In case of large samples, normality is not required (this is a common misconception!). By the central limit theorem, sample means of large samples are often well approximated by a normal distribution even if the data are not normally distributed (Stevens 2013). It is therefore not required to test the normality assumption when the number of observations in each group/sample is large (usually n ≥ 30).
Equality of variances: the variances of the different groups should be equal in the populations
(an assumption called homogeneity of the variances, or even sometimes referred as
homoscedasticity, as opposed to heteroscedasticity if variances are different across groups).
This assumption can be tested graphically (by comparing the dispersion in
a boxplot or dotplot for instance), or more formally via the Levene’s test
(leveneTest(variable ~ group) from the {car} package) or Bartlett’s test, among others. If the
hypothesis of equal variances is rejected, another version of the ANOVA can be used: the
Welch ANOVA (oneway.test(variable ~ group, var.equal = FALSE)). Note that the Welch ANOVA does not require homogeneity of the variances, but the distributions should still follow approximately a normal distribution. Note that the Kruskal-Wallis test does not require the assumptions of normality nor homoscedasticity of the variances.
Outliers: An outlier is a value or an observation that is distant from the other observations.
There should be no significant outliers in the different groups, or the conclusions of your
ANOVA may be flawed. There are several methods to detect outliers in your data but in
order to deal with them, it is your choice to either:
o use the non-parametric version (i.e., the Kruskal-Wallis test)
Choosing the appropriate test depending on whether assumptions are met may be confusing so
here is a brief summary:
1. Check that your observations are independent.
2. Sample sizes:
o In case of small samples, test the normality of residuals:
In our case, each species has well over 30 observations, so normality of the residuals is not strictly required. However, for the sake of illustration, we act as if the sample sizes were small in order to illustrate what would need to be done in that case.
Remember that normality of residuals can be tested visually via a histogram and a QQ-plot,
and/or formally via a normality test (Shapiro-Wilk test for instance).
Before checking the normality assumption, we first need to compute the ANOVA (more on that
in this section). We then save the results in res_aov :
res_aov <- aov(flipper_length_mm ~ species,
data = dat
)
We can now check normality visually:
par(mfrow = c(1, 2)) # combine plots
# histogram
hist(res_aov$residuals)
# QQ-plot
library(car)
qqPlot(res_aov$residuals,
id = FALSE # id = FALSE to remove point identification
)
From the histogram and QQ-plot above, we can already see that the normality assumption seems to be met. Indeed, the histogram roughly forms a bell curve, indicating that the residuals follow a normal distribution. Furthermore, points in the QQ-plot roughly follow the straight line and most of them are within the confidence bands, also indicating that residuals follow approximately a normal distribution.
Some researchers stop here and assume that normality is met, while others also test the
assumption via a formal normality test. It is your choice to test it (i) only visually, (ii) only via a
normality test, or (iii) both visually AND via a normality test. Bear in mind, however, the two
following points:
1. ANOVA is quite robust to small deviations from normality. This means that it is not an issue
(from the perspective of the interpretation of the ANOVA results) if a small number of points
deviates slightly from the normality,
2. normality tests are sometimes quite conservative, meaning that the null hypothesis of
normality may be rejected due to a limited deviation from normality. This is especially the
case with large samples as power of the test increases with the sample size.
In practice, I tend to prefer the (i) visual approach only, but again, this is a matter of personal
choice and also depends on the context of the analysis.
Still for the sake of illustration, we also now test the normality assumption via a normality
test. You can use the Shapiro-Wilk test or the Kolmogorov-Smirnov test, among others.
Remember that the null and alternative hypothesis of these tests are:
H0: data come from a normal distribution
H1: data do not come from a normal distribution
In R, we can test normality of the residuals with the Shapiro-Wilk test thanks to the shapiro.test() function:
shapiro.test(res_aov$residuals)
##
## Shapiro-Wilk normality test
##
## data: res_aov$residuals
## W = 0.99452, p-value = 0.2609
The p-value of the Shapiro-Wilk test on the residuals is larger than the usual significance level of α = 5%, so we do not reject the hypothesis that residuals follow a normal distribution (p-value = 0.261).
This result is in line with the visual approach. In our case, the normality assumption is thus met
both visually and formally.
Side note: remember that the p-value is the probability of having observations as extreme as the ones we have observed in the sample(s), given that the null hypothesis is true. If the p-value < α (indicating that it is not likely to observe the data we have in the sample given that the null hypothesis is true), the null hypothesis is rejected; otherwise the null hypothesis is not rejected. See more about the p-value and significance level if you are unfamiliar with these important statistical concepts.
Remember that if the normality assumption were not met, some transformation(s) would need to be applied to the raw data in the hope that the residuals would better fit a normal distribution, or you would need to use the non-parametric version of the ANOVA, the Kruskal-Wallis test.
Note that the normality assumption can also be tested on the "raw" data (i.e., the observations) instead of the residuals. However, if you test the normality assumption on the raw data, it must be tested for each group separately, as the ANOVA requires normality in each group.
Testing normality on all residuals or on the observations per group is equivalent, and will give
similar results. Indeed, saying “The distribution of Y within each group is normally distributed”
is the same as saying “The residuals are normally distributed”.
Remember that residuals are the distance between the actual value of Y and the mean value of Y
for a specific value of X, so the grouping variable is induced in the computation of the residuals.
So in summary, in ANOVA you actually have two options for testing normality:
1. Checking normality separately for each group on the “raw” data (Y values)
2. Checking normality on all residuals (but not per group)
In practice, you will see that it is often easier to just use the residuals and check them all
together, especially if you have many groups or few observations per group.
If you are still not convinced: remember that an ANOVA is a special case of a linear model.
Suppose your independent variable is a continuous variable (instead of a categorical variable),
the only option you have left is to check normality on the residuals, which is precisely what is
done for testing normality in linear regression models.
Equality of variances - homogeneity
Assuming residuals follow a normal distribution, it is now time to check whether the variances
are equal across species or not. The result will have an impact on whether we use the ANOVA
or the Welch ANOVA.
This can again be verified visually—via a boxplot or dotplot—or more formally via a statistical
test (Levene’s test, among others).
Visually, we have:
# Boxplot
boxplot(flipper_length_mm ~ species,
data = dat
)
# Dotplot
library("lattice")
dotplot(flipper_length_mm ~ species,
data = dat
)
Both the boxplot and the dotplot show a similar variance for the different species. In the boxplot,
this can be seen by the fact that the boxes and the whiskers have a comparable size for all
species.
There are a couple of outliers as shown by the points outside the whiskers, but this does not
change the fact that the dispersion is more or less the same between the different species.
In the dotplot, this can be seen by the fact that points for all 3 species have more or less the
same range, a sign of the dispersion and thus the variance being similar.
Like the normality assumption, if you feel that the visual approach is not sufficient, you can
formally test for equality of the variances with a Levene’s or Bartlett’s test. Notice that the
Levene’s test is less sensitive to departures from normal distribution than the Bartlett’s test.
The null and alternative hypothesis for both tests are:
H0: variances are equal
H1: at least one variance is different
In R, the Levene’s test can be performed thanks to the leveneTest() function from
the {car} package:
# Levene's test
library(car)
leveneTest(flipper_length_mm ~ species,
  data = dat
)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.3306 0.7188
## 339
The p-value being larger than the significance level of 0.05, we do not reject the null hypothesis,
so we cannot reject the hypothesis that variances are equal between species (p-value = 0.719).
This result is also in line with the visual approach, so the homogeneity of variances is met both
visually and formally.
Another method to test normality and homogeneity
For your information, it is also possible to test the homogeneity of the variances and the
normality of the residuals visually (and both at the same time) via the plot() function:
par(mfrow = c(1, 2)) # combine plots
# 1. Homogeneity of variances
plot(res_aov, which = 3)
# 2. Normality
plot(res_aov, which = 2)
The plot on the left-hand side shows that there is no evident relationship between residuals and fitted values (the mean of each group), so homogeneity of variances is assumed. If homogeneity of variances were violated, the red line would not be flat (horizontal).
The plot on the right-hand side shows that residuals follow approximately a normal distribution, so normality is assumed. If normality were violated, points would consistently deviate from the dashed line.
Outliers
There are several techniques to detect outliers. In this article, we focus on the most simple one
(yet very efficient)—the visual approach via a boxplot:
boxplot(flipper_length_mm ~ species,
data = dat
)
There is one outlier in the group Adelie, as defined by the interquartile range criterion. This
point is, however, not seen as a significant outlier so we can assume that the assumption of no
significant outliers is met.
ANOVA
We showed that all assumptions of the ANOVA are met.
We can thus proceed to the implementation of the ANOVA in R, but first, let’s do some
preliminary analyses to better understand the research question.
Preliminary analyses
A good practice before actually performing the ANOVA in R is to visualize the data in relation
to the research question. The best way to do so is to draw and compare boxplots of the
quantitative variable flipper_length_mm for each species.
This can be done with the boxplot() function in base R (same code as the visual check of equal variances):
boxplot(flipper_length_mm ~ species,
data = dat
)
Or with the {ggplot2} package:
library(ggplot2)
ggplot(dat) +
aes(x = species, y = flipper_length_mm) +
geom_boxplot()
The boxplots above show that, at least for our sample, penguins of the species Gentoo seem to
have the biggest flipper, and Adelie species the smallest flipper.
Besides a boxplot for each species, it is also a good practice to compute some descriptive
statistics such as the mean and standard deviation by species.
This can be done, for instance, with the aggregate() function:
aggregate(flipper_length_mm ~ species,
data = dat,
function(x) round(c(mean = mean(x), sd = sd(x)), 2)
)
## species flipper_length_mm.mean flipper_length_mm.sd
## 1 Adelie 189.95 6.54
## 2 Chinstrap 195.82 7.13
## 3 Gentoo 217.19 6.48
or with the summarise() and group_by() functions from the {dplyr} package:
library(dplyr)
group_by(dat, species) %>%
  summarise(
    mean = mean(flipper_length_mm, na.rm = TRUE),
    sd = sd(flipper_length_mm, na.rm = TRUE)
  )
## # A tibble: 3 × 3
## species mean sd
## <fct> <dbl> <dbl>
## 1 Adelie 190. 6.54
## 2 Chinstrap 196. 7.13
## 3 Gentoo 217. 6.48
Mean is also the lowest for Adelie and highest for Gentoo. Boxplots and descriptive statistics
are, however, not enough to conclude that flippers are significantly different in the 3 populations
of penguins.
ANOVA in R
As you guessed by now, only the ANOVA can help us to make inference about the population
given the sample at hand, and help us to answer the initial research question “Is the length of the
flippers different between the 3 species of penguins?”.
ANOVA in R can be done in several ways, of which two are presented below:
1. With the oneway.test() function:
# 1st method:
oneway.test(flipper_length_mm ~ species,
  data = dat,
  var.equal = TRUE # assuming equal variances
)
##
## One-way analysis of means
##
## data: flipper_length_mm and species
## F = 594.8, num df = 2, denom df = 339, p-value < 2.2e-16
2. With the summary() and aov() functions:
# 2nd method:
res_aov <- aov(flipper_length_mm ~ species,
data = dat
)
summary(res_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 52473 26237 594.8 <2e-16 ***
## Residuals 339 14953 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
As you can see from the two outputs above, the test statistic (F = in the first method and F
value in the second one) and the p-value (p-value in the first method and Pr(>F) in the second
one) are exactly the same for both methods, which means that in case of equal variances, results
and conclusions will be unchanged.
The advantage of the first method is that it is easy to switch from the ANOVA (used when variances are equal) to the Welch ANOVA (used when variances are unequal). This can be done by replacing var.equal = TRUE with var.equal = FALSE, as presented below:
oneway.test(flipper_length_mm ~ species,
  data = dat,
  var.equal = FALSE # assuming unequal variances
)
##
## One-way analysis of means (not assuming equal variances)
##
## data: flipper_length_mm and species
## F = 614.01, num df = 2.00, denom df = 172.76, p-value < 2.2e-16
The advantage of the second method, however, is that:
the full ANOVA table (with degrees of freedom, mean squares, etc.) is printed, which may be of interest in some (theoretical) cases
results of the ANOVA (res_aov) can be saved for later use (especially useful for post-hoc tests)
Interpretations of ANOVA results
Given that the p-value is smaller than 0.05, we reject the null hypothesis, so we reject the hypothesis that all means are equal. Therefore, we can conclude that at least one species is different from the others in terms of flipper length (p-value < 2.2e-16).
(For the sake of illustration, if the p-value were larger than 0.05: we could not reject the null hypothesis that all means are equal, so we could not reject the hypothesis that the 3 considered species of penguins are equal in terms of flipper length.)
A nice and easy way to report results of an ANOVA in R is with the report() function from
the {report} package:
# install.packages("remotes")
# remotes::install_github("easystats/report") # You only need to do that once
library("report") # Load the package every time you start R
report(res_aov)
## The ANOVA (formula: flipper_length_mm ~ species) suggests that:
##
## - The main effect of species is statistically significant and large (F(2, 339)
## = 594.80, p < .001; Eta2 = 0.78, 95% CI [0.75, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
As you can see, the function interprets the results for you and indicates a large and significant
main effect of the species on the flipper length (p-value < .001).
Note that the report() function can be used for other analyses. See more tips and tricks in R if
you find this one useful.
What’s next?
If the null hypothesis is not rejected (p-value ≥ 0.05), it means that we do not reject the hypothesis that all groups are equal. The ANOVA more or less stops here.
Other types of analyses can be performed, of course, but, given the data at hand, we could not show that at least one group was different, so we usually do not go further with the ANOVA.
On the contrary, if the null hypothesis is rejected (as is our case, since the p-value < 0.05), we have shown that at least one group is different. We can decide to stop here if we are only interested in testing whether all species are equal in terms of flipper length.
But most of the time, when we have shown with an ANOVA that at least one group is different, we are also interested in knowing which one(s) is (are) different. Results of an ANOVA, however, do NOT tell us which group(s) is (are) different from the others.
To test this, we need to use other types of tests, referred to as post-hoc tests (in Latin, "after this", so after obtaining statistically significant ANOVA results) or multiple pairwise-comparison tests. This family of statistical tests is the topic of the following sections.
Post-hoc test
Issue of multiple testing
In order to see which group(s) is(are) different from the others, we need to compare groups 2 by
2. In practice, since there are 3 species, we are going to compare species 2 by 2 as follows:
1. Chinstrap versus Adelie
2. Gentoo vs. Adelie
3. Gentoo vs. Chinstrap
In theory, we could compare species thanks to 3 Student’s t-tests since we need to compare 2
groups and a t-test is used precisely in that case.
However, if several t-tests are performed, the issue of multiple testing (also referred to as multiplicity) arises. In short, when several statistical tests are performed, some will have p-values less than α purely by chance, even if all null hypotheses are in fact true.
To demonstrate the problem, consider our case where we have 3 hypotheses to test and a desired
significance level of 0.05.
The probability of observing at least one significant result (at least one p-value < 0.05) just due
to chance is:
P(at least 1 significant result) = 1 − P(no significant results) = 1 − (1 − 0.05)^3 ≈ 0.1426
So, with as few as 3 tests being considered, we already have a 14.26% chance of observing at
least one significant result, even if all of the tests are actually not significant.
And as the number of groups increases, the number of comparisons increases as well, so the
probability of having a significant result simply due to chance keeps increasing.
For example, with 10 groups we need to make 45 comparisons, and the probability of having at least one significant result by chance becomes 1 − (1 − 0.05)^45 ≈ 90%. So it is very likely to observe a significant result just by chance when comparing 10 groups, and when we have 14 groups or more we are almost certain (99%) to have a false positive!
Post-hoc tests take into account that multiple tests are performed and deal with the problem by adjusting α in some way, so that the probability of observing at least one significant result due to chance remains below our desired significance level.
Post-hoc tests in R and their interpretation
Post-hoc tests are a family of statistical tests so there are several of them. The most common
ones are:
Tukey HSD, used to compare all groups to each other (so all possible comparisons of 2 groups).
Dunnett, used to make comparisons with a reference group. For example, consider 2 treatment
groups and one control group. If you only want to compare the 2 treatment groups with respect
to the control group, and you do not want to compare the 2 treatment groups to each other, the
Dunnett’s test is preferred.
Bonferroni correction if one has a set of planned comparisons to do.
The Bonferroni correction is simple: you divide the desired global α level by the number of comparisons.
In our example, we have 3 comparisons, so if we want to keep a global α = 0.05, we use α′ = 0.05 / 3 = 0.0167. We can then simply perform a Student's t-test for each comparison and compare the obtained p-values with this new α′.
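A short sketch of how this could be done in R with the built-in pairwise.t.test() function (the same function is used later in this article with another adjustment method):
# Pairwise t-tests with Bonferroni-adjusted p-values,
# equivalent in spirit to comparing raw p-values with alpha'
pairwise.t.test(dat$flipper_length_mm, dat$species,
  p.adjust.method = "bonferroni"
)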
The other two post-hoc tests are presented in the next sections.
Note that variances are assumed to be equal for all three methods (unless you use the Welch’s t-
test instead of the Student’s t-test with the Bonferroni correction). If variances are not equal, you
can use the Games-Howell test, among others.
Tukey HSD test
In our case, since there is no “reference” species and we are interested in comparing all species,
we are going to use the Tukey HSD test.
In R, the Tukey HSD test is done as follows. This is where the second method of performing the ANOVA comes in handy, because the results (res_aov) are reused for the post-hoc test:
library(multcomp)
# Tukey HSD test:
post_test <- glht(res_aov,
linfct = mcp(species = "Tukey")
)
summary(post_test)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = flipper_length_mm ~ species, data = dat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Chinstrap - Adelie == 0 5.8699 0.9699 6.052 1.03e-08 ***
## Gentoo - Adelie == 0 27.2333 0.8067 33.760 < 1e-08 ***
## Gentoo - Chinstrap == 0 21.3635 1.0036 21.286 < 1e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
In the output of the Tukey HSD test, we are interested in the table displayed after Linear Hypotheses:, and more precisely in the first and last columns of the table. The first column shows the comparisons that have been made; the last column (Pr(>|t|)) shows the adjusted p-values for each comparison (with the null hypothesis being that the two groups are equal and the alternative hypothesis being that the two groups are different).
It is these adjusted p-values that are used to test whether two groups are significantly different or
not, and we can be confident that the entire set of comparisons collectively has an error rate of
0.05.
In our example, we tested:
1. Chinstrap versus Adelie (line Chinstrap - Adelie == 0)
2. Gentoo vs. Adelie (line Gentoo - Adelie == 0)
3. Gentoo vs. Chinstrap (line Gentoo - Chinstrap == 0)
All three adjusted p-values are smaller than 0.05, so we reject the null hypothesis for all comparisons, which means that all species are significantly different in terms of flipper length.
The results of the post-hoc test can be visualized with the plot() function:
par(mar = c(3, 8, 3, 3))
plot(post_test)
We see that the confidence intervals do not cross the zero line, which indicates that all groups are significantly different.
Note that the Tukey HSD test can also be done in R with the TukeyHSD() function:
TukeyHSD(res_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = flipper_length_mm ~ species, data = dat)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 5.869887 3.586583 8.153191 0
## Gentoo-Adelie 27.233349 25.334376 29.132323 0
## Gentoo-Chinstrap 21.363462 19.000841 23.726084 0
With this code, it is the column p adj (also the last column) that is of interest. Notice that the conclusions are the same as above: all species are significantly different in terms of flipper length.
The results can also be visualized with the plot() function:
plot(TukeyHSD(res_aov))
Dunnett’s test
We have seen in this section that as the number of groups increases, the number of comparisons
also increases. And as the number of comparisons increases, the post-hoc analysis must lower
the individual significance level even further, which leads to lower statistical power (so a
difference between group means in the population is less likely to be detected).
One method to mitigate this and increase the statistical power is by reducing the number of
comparisons. This reduction allows the post-hoc procedure to use a larger individual error rate to
achieve the desired global error rate.
While comparing all possible groups with a Tukey HSD test is a common approach, many
studies have a control group and several treatment groups. For these studies, you may need to
compare the treatment groups only to the control group, which reduces the number of
comparisons.
Dunnett's test does precisely this: it only compares a group taken as the reference to all other groups, but it does not compare all groups to each other.
So to recap:
the Tukey HSD test allows comparing all groups, but at the cost of less power
Dunnett's test only allows comparisons against a reference group, but with the benefit of more power
Now, again for the sake of illustration, consider that the species Adelie is the reference species
and we are only interested in comparing the reference species against the other 2 species. In that
scenario, we would use the Dunnett’s test.
In R, the Dunnett’s test is done as follows (the only difference with the code for the Tukey HSD
test is in the line linfct = mcp(species = "Dunnett")):
library(multcomp)
# Dunnett's test:
post_test <- glht(res_aov,
linfct = mcp(species = "Dunnett")
)
summary(post_test)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Dunnett Contrasts
##
##
## Fit: aov(formula = flipper_length_mm ~ species, data = dat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Chinstrap - Adelie == 0 5.8699 0.9699 6.052 7.59e-09 ***
## Gentoo - Adelie == 0 27.2333 0.8067 33.760 < 1e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The interpretation is the same as for the Tukey HSD test, except that in Dunnett's test we only compare:
1. Chinstrap versus Adelie (line Chinstrap - Adelie == 0)
2. Gentoo vs. Adelie (line Gentoo - Adelie == 0)
Both adjusted p-values (displayed in the last column) are below 0.05, so we reject the null
hypothesis for both comparisons.
This means that both the species Chinstrap and Gentoo are significantly different from the reference species Adelie in terms of flipper length. (Nothing can be said about the comparison between Chinstrap and Gentoo, though.)
Again, the results of the post-hoc test can be visualized with the plot() function:
par(mar = c(3, 8, 3, 3))
plot(post_test)
We see that the confidence intervals do not cross the zero line, which indicates that both the species Gentoo and Chinstrap are significantly different from the reference species Adelie.
Note that in R, by default, the reference category for a factor variable is the first category in
alphabetical order. This is the reason that, by default, the reference species is Adelie.
The reference category can be changed with the relevel() function (or with
the {questionr} addin). Considering that we want Gentoo as the reference category instead of
Adelie:
# Change reference category:
dat$species <- relevel(dat$species, ref = "Gentoo")
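The refitting step is not shown in this excerpt; a minimal sketch of what has to happen between relevel() and the summary below, assuming the same aov() and glht() calls as before (the name res_aov2 is hypothetical):
# Re-run the ANOVA and Dunnett's test so the new reference level is used
res_aov2 <- aov(flipper_length_mm ~ species, data = dat)
post_test <- glht(res_aov2,
  linfct = mcp(species = "Dunnett")
)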
summary(post_test)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Dunnett Contrasts
##
##
## Fit: aov(formula = flipper_length_mm ~ species, data = dat)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Adelie - Gentoo == 0 -27.2333 0.8067 -33.76 <1e-10 ***
## Chinstrap - Gentoo == 0 -21.3635 1.0036 -21.29 <1e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
par(mar = c(3, 8, 3, 3))
plot(post_test)
From the results above, we conclude that the Adelie and Chinstrap species are significantly different from the Gentoo species in terms of flipper length (adjusted p-values < 1e-10).
Note that even if your study does not have a reference group to compare the other groups against, it is still often better to do multiple comparisons determined by specific research questions than to do all pairwise tests. By reducing the number of post-hoc comparisons to only what is necessary, and no more, you maximize the statistical power.
Other p-values adjustment methods
For interested readers, note that you can use other p-value adjustment methods with the pairwise.t.test() function:
pairwise.t.test(dat$flipper_length_mm, dat$species,
  p.adjust.method = "holm"
)
)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dat$flipper_length_mm and dat$species
##
## Gentoo Adelie
## Adelie < 2e-16 -
## Chinstrap < 2e-16 3.8e-09
##
## P value adjustment method: holm
By default, the Holm method is applied, but other methods exist. See ?p.adjust for all available options.
Visualization of ANOVA and post-hoc tests on the same plot
If you are interested in including results of ANOVA and post-hoc tests on the same plot (directly
on the boxplots), here are two pieces of code which may be of interest to you.
The first piece of code sets up the variables, tests and comparisons to use:
# Edit from here
x <- which(names(dat) == "species") # name of grouping variable
y <- which(
  names(dat) == "flipper_length_mm" # names of variables to test
)
method1 <- "anova" # one of "anova" or "kruskal.test"
method2 <- "t.test" # one of "t.test" or "wilcox.test"
my_comparisons <- list(c("Chinstrap", "Adelie"), c("Gentoo", "Adelie"), c("Gentoo", "Chinstrap")) # comparisons for post-hoc tests
# Edit until here
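The plotting code for this first piece is not reproduced in this excerpt; a hedged sketch of how these settings could be combined on a single boxplot, assuming the {ggpubr} package:
# Boxplot with the global test p-value and pairwise post-hoc p-values
library(ggpubr)
ggboxplot(dat, x = names(dat)[x], y = names(dat)[y]) +
  stat_compare_means(method = method1) + # global ANOVA p-value
  stat_compare_means(comparisons = my_comparisons, method = method2) # pairwise p-values
The second piece of code, shown next, uses the ggbetweenstats() function from the {ggstatsplot} package.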
library(ggstatsplot)
ggbetweenstats(
  data = dat,
  x = species,
  y = flipper_length_mm,
  type = "parametric", # ANOVA or Kruskal-Wallis
  var.equal = TRUE, # ANOVA or Welch ANOVA
  plot.type = "box",
  pairwise.comparisons = TRUE,
  pairwise.display = "significant",
  centrality.plotting = FALSE,
  bf.message = FALSE
)
As you can see on the above plot, boxplots by species are presented together with p-values of the
ANOVA (after p = in the subtitle of the plot) and p-values of the post-hoc tests (above each
comparison).
Unit-wise
MCQs
Unit-1 – Introduction to Probability
3. How many ways can a committee of 4 members be selected from a group of 10 people?
A. 210
B. 5040
C. 3024
D. 1260
Answer: A
(Explanation: Combination formula: 10C4 = 10! / (4!(10-4)!) = 210)
8. What is the Conditional Probability, P(A|B), when P(A and B) = 0.2 and P(B) = 0.5?
A. 0.5
B. 0.4
C. 0.1
D. 0.2
Answer: B
(Explanation: P(A|B) = P(A ∩ B) / P(B) = 0.2 / 0.5 = 0.4)
10. Using Bayes's Theorem, if P(B|A) = 0.7, P(A) = 0.5, and P(B) = 0.6, what is P(A|B)?
A. 0.583
B. 0.233
C. 0.5833
D. 0.175
Answer: C
(Explanation: P(A|B) = [P(B|A) * P(A)] / P(B) = (0.7 * 0.5) / 0.6 = 0.5833)
Unit 2 – Random Variable
2. Which of the following is a key difference between discrete and continuous random
variables?
A. Discrete random variables have only positive values.
B. Continuous random variables are countable.
C. Discrete random variables take distinct values, while continuous random variables take any
value in a range.
D. Continuous random variables are finite.
Answer: C
4. The probability density function (PDF) is used for which type of random variable?
A. Discrete random variables.
B. Continuous random variables.
C. Both discrete and continuous random variables.
D. Neither discrete nor continuous random variables.
Answer: B
5. The sum of the probabilities for a discrete random variable must be equal to which of
the following?
A. 0
B. 0.5
C. 1
D. It can be any value
Answer: C
6. For a continuous random variable, the area under the probability density function (PDF)
over its entire range equals:
A. 0
B. 1
C. Infinity
D. A negative value
Answer: B
7. What is the mathematical expectation (or expected value) of a random variable XXX?
A. The mode of XXX.
B. The square root of XXX.
C. The weighted average of all possible values of XXX.
D. The highest possible value of XXX.
Answer: C
Unit 3- Data Distribution
3. Which of the following distributions is best suited for modeling the number of
occurrences of an event in a fixed interval of time or space?
a) Normal distribution
b) Poisson distribution
c) Exponential distribution
d) Binomial distribution
Answer: b) Poisson distribution
4. In a normal distribution, approximately how much data lies within one standard
deviation from the mean?
a) 50%
b) 68%
c) 95%
d) 99.7%
Answer: b) 68%
5. Which distribution is used to model the time between events in a Poisson process?
a) Normal distribution
b) Exponential distribution
c) Binomial distribution
d) Uniform distribution
Answer: b) Exponential distribution
10. Which method is commonly used for generating random numbers in simulations?
a) Numerical integration
b) Monte Carlo method
c) Gaussian elimination
d) Newton-Raphson method
Answer: b) Monte Carlo method
12. Which distribution is used to model events that occur at a constant rate but randomly
in time or space?
a) Poisson distribution
b) Normal distribution
c) Exponential distribution
d) Binomial distribution
Answer: a) Poisson distribution
13. The central limit theorem states that the sampling distribution of the sample mean
approaches which distribution as the sample size increases?
a) Binomial distribution
b) Poisson distribution
c) Exponential distribution
d) Normal distribution
Answer: d) Normal distribution
14. Random number generation in computer simulations is typically based on which type
of algorithms?
a) Deterministic algorithms
b) Stochastic algorithms
c) Gaussian algorithms
d) Recursion algorithms
Answer: a) Deterministic algorithms
Unit 4- Testing of Hypothesis
1. Which of the following is the first step in the hypothesis testing procedure?
a) Collecting data
b) Formulating a hypothesis
c) Choosing a level of significance
d) Drawing a conclusion
Answer: b) Formulating a hypothesis
2. The null hypothesis (H₀) usually represents which of the following?
a) A claim of effect or difference
b) A claim of no effect or no difference
c) A researcher's alternative hypothesis
d) A bias in data collection
Answer: b) A claim of no effect or no difference
Standard Error and Sampling Distribution
3. The standard error of the mean measures which of the following?
a) The spread of the entire population
b) The variability between sample means
c) The deviation of individual data points from the mean
d) The degree of correlation between variables
Answer: b) The variability between sample means
4. As the sample size increases, what happens to the standard error of the mean?
a) Increases
b) Decreases
c) Stays the same
d) Depends on the data
Answer: b) Decreases
Estimation
5. A point estimate is:
a) A single value estimate of a population parameter
b) The interval in which the population parameter is likely to fall
c) The same as the sample standard deviation
d) Always larger than the population mean
Answer: a) A single value estimate of a population parameter
6. Which of the following is a method for estimating parameters?
a) Method of moment estimation
b) Likelihood estimation
c) Bayesian estimation
d) All of the above
Answer: d) All of the above
Student's t-Distribution
7. The Student’s t-distribution is most appropriate when:
a) The population standard deviation is known
b) The population standard deviation is unknown and the sample size is small
c) The sample size is large
d) The data is non-normal
Answer: b) The population standard deviation is unknown and the sample size is small
8. As the degrees of freedom increase, the t-distribution approaches which distribution?
a) Binomial distribution
b) Normal distribution
c) Chi-square distribution
d) F-distribution
Answer: b) Normal distribution
Chi-Square Test and Goodness of Fit
9. The Chi-square test is used to test for:
a) Equality of two means
b) Independence between two categorical variables
c) Equality of variances
d) Correlation between variables
Answer: b) Independence between two categorical variables
10. A goodness of fit test is used to determine:
a) Whether sample data matches a population distribution
b) Whether two populations have the same variance
c) The mean difference between two independent samples
d) The strength of correlation between two variables
Answer: a) Whether sample data matches a population distribution
F-test and Analysis of Variance (ANOVA)
11. An F-test is used to compare:
a) The means of two groups
b) The variances of two or more groups
c) The correlation coefficients of two variables
d) The proportions in a population
Answer: b) The variances of two or more groups
12. In ANOVA, a significant F-ratio indicates that:
a) All group means are equal
b) At least one group mean is different
c) All variances are equal
d) All groups have the same number of observations
Answer: b) At least one group mean is different
Factor Analysis
13. Factor analysis is primarily used to:
a) Test hypotheses
b) Reduce the dimensionality of data
c) Compare two population means
d) Estimate the population proportion
Answer: b) Reduce the dimensionality of data
14. In factor analysis, the factors are typically:
a) Observable variables
b) Latent (unobservable) variables
c) Test statistics
d) Confidence intervals
Answer: b) Latent (unobservable) variables
15. Which of the following is an assumption of factor analysis?
a) Data is binary
b) Variables are highly correlated
c) Samples are normally distributed
d) Variances are unequal
Answer: b) Variables are highly correlated
Unit 5- Introduction to R Programming Language
1. Which of the following is true about R programming language?
A) R is only used for statistical analysis.
B) R is a programming language and software environment for statistical computing.
C) R can only be run on Unix systems.
D) R does not support data manipulation.
Answer: B) R is a programming language and software environment for statistical
computing.
15. Which function is used to view the first few rows of a data frame in R?
A) head()
B) view()
C) summary()
D) show()
Answer: A) head()
Unit 6- Graphical Analysis using R
a. barplot()
b. boxplot()
c. plot()
d. hist()
Answer: c) plot()
2. Which argument in the plot() function defines the title of the plot?
a. xlab
b. ylab
c. main
d. title
Answer: c) main
3. What does a boxplot display in a dataset?
a. Frequency distribution
b. Mean, variance, standard deviation
c. Minimum, lower quartile, median, upper quartile, maximum
d. Correlation between two variables
Answer: c) Minimum, lower quartile, median, upper quartile, maximum
4. Which function is used to create a box-whisker plot in R?
a. boxplot()
b. plot()
c. barplot()
d. hist()
Answer: a) boxplot()
5. How can the layout of multiple plots be adjusted in R?
a. par(mfrow)
b. [Link]()
c. plot(multi)
d. mfrow()
Answer: a) par(mfrow)
6. In a pie chart, which function argument defines the labels of each slice?
a. labels
b. xlab
c. legend
d. slices
Answer: a) labels
7. Which function is used to create a pie chart in R?
a. pie()
b. barplot()
c. plot()
d. boxplot()
Answer: a) pie()
8. In R, what is the correct way to create a bar chart for categorical data?
a. plot()
b. barplot()
c. hist()
d. pie()
Answer: b) barplot()
9. Which argument in the boxplot() function adds a notch around the median to make medians easier to
compare?
a. notch
b. median
c. hline
d. horiz
Answer: a) notch
10. How can you modify the size of plot windows in R?
a. [Link]()
b. [Link]()
c. par()
d. [Link]()
Answer: c) par()
11. Which function creates a matrix of scatter plots in R?
a. matrix()
b. pairs()
c. scatter()
d. matplot()
Answer: b) pairs()
12. Which argument in the par() function helps control the margins of the plot?
a. mar
b. xlab
c. xlim
d. grid
Answer: a) mar
13. What is the default method for handling overlapping text labels in R plots?
a. Text wrapping
b. Clipping
c. Overplotting
d. Plot resizing
Answer: c) Overplotting
14. Which argument in the barplot() function allows you to specify whether the bars should
be horizontal?
a. horizontal
b. barh
c. beside
d. horiz
Answer: d) horiz
15. Which argument in the pie() function allows you to set colors for the slices?
a. col
b. slice
c. color
d. fill
Answer: a) col
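The functions and arguments referred to in these questions can be combined in a short script. A minimal sketch, using built-in data purely for illustration:

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))        # 2 x 2 layout with smaller margins
plot(mtcars$wt, mtcars$mpg,                      # scatter plot with a title
     main = "Weight vs MPG", xlab = "Weight", ylab = "MPG")
boxplot(mpg ~ cyl, data = mtcars, notch = TRUE,  # box-whisker plot, notched medians
        main = "MPG by cylinders")
barplot(table(mtcars$cyl), horiz = TRUE,         # horizontal bar chart of counts
        main = "Cylinder counts")
pie(c(40, 35, 25), labels = c("A", "B", "C"),    # pie chart with labels and colours
    col = c("red", "green", "blue"), main = "Market share")
par(mfrow = c(1, 1))                             # restore the default layout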
Unit 7- Advance R
1. What does the apply() function in R do?
a) Apply a function over a vector
b) Apply a function over the margins of an array
c) Apply a function to a dataframe
d) Apply a function over multiple vectors
Answer: b) Apply a function over the margins of an array
2. Which of the following is used for memory management in R?
a) gc()
b) rm()
c) save()
d) summary()
Answer: a) gc()
3. In R, the function to create a copy of an object without making it reference-based is:
a) clone()
b) deepcopy()
c) [Link]()
d) No such function; R uses copy-on-modify by default
Answer: d) No such function; R uses copy-on-modify by default
12. In a regression model, the R-squared value indicates:
a) The number of independent variables
b) The goodness-of-fit of the model
c) The intercept of the regression line
d) The p-value of the regression
Answer: b) The goodness-of-fit of the model
Summarizing Data
13. Which function is commonly used in R for generating summary statistics for each variable
in a data frame?
a) summary()
b) describe()
c) summarize()
d) overview()
Answer: a) summary()
14. What does the aggregate() function in R do?
a) Summarizes multiple data sets into one
b) Splits data into groups and computes summary statistics
c) Creates frequency distributions
d) Combines columns from multiple data frames
Answer: b) Splits data into groups and computes summary statistics
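For example, with the built-in mtcars data (used here only as an illustration):

summary(mtcars$mpg)                               # min, quartiles, median, mean, max
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # group by cylinders, mean mpg per group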
Case Studies
15. In a real-world regression case study, which of the following would indicate overfitting?
a) High accuracy on both training and test sets
b) High accuracy on training data but low accuracy on test data
c) Low R-squared value
d) High p-value for most predictors
Answer: b) High accuracy on training data but low accuracy on test data
Practice Questions
Questions on Binomial Distribution
Q1 What is meant by binomial distribution?
The binomial distribution is the discrete probability distribution of the number of successes in a fixed
number of independent trials, each of which has only two possible outcomes: success or failure.
Q2 Mention the formula for the binomial distribution.
The formula for binomial distribution is:
P(X = x) = nCx p^x q^(n−x),   x = 0, 1, ..., n,   where q = 1 − p
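In R these probabilities are available through dbinom() (point probabilities) and pbinom() (cumulative probabilities). A minimal sketch, with n = 10 and p = 0.3 chosen purely as example values:

n <- 10; p <- 0.3
dbinom(4, size = n, prob = p)          # P(X = 4)
choose(n, 4) * p^4 * (1 - p)^(n - 4)   # the same value from the formula above
pbinom(4, size = n, prob = p)          # P(X <= 4)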
Practice Problems
Solve the following problems based on binomial distribution:
1. The mean and variance of the binomial variate X are 8 and 4 respectively. Find P(X<3).
2. The binomial variate X lies within the range {0, 1, 2, 3, 4, 5, 6}, provided that P(X=2) = 4P(X=4).
Find the parameter “p” of the binomial variate X.
3. In a binomial distribution, X is a binomial variate with n = 100, p = ⅓, and P(X=r) is maximum. Find
the value of r.
4. The probability that a mountain-bike rider travelling along a certain track will have a tyre burst is
0.05. Find the probability that among 17 riders: (a) exactly one has a burst tyre (b) at most three
have a burst tyre (c) two or more have burst tyres.
5. (a) A transmission channel transmits zeros and ones in strings of length 8, called ‘words’. Possible
distortion may change a one to a zero or vice versa; assume this distortion occurs with
probability .01 for each digit, independently. An error-correcting code is employed in the
construction of the word such that the receiver can deduce the word correctly if at most one digit is
in error. What is the probability the word is decoded incorrectly?
(b) Assume that a word is a sequence of 10 zeros or ones and, as before, the probability of incorrect
transmission of a digit is .01. If the error-correcting code allows correct decoding of the word if no
more than two digits are incorrect, compute the probability that the word is decoded correctly.
6. An examination consists of 10 multi-choice questions, in each of which a candidate has to deduce
which one of five suggested answers is correct. A completely unprepared student guesses each
answer completely randomly. What is the probability that this student gets 8 or more questions
correct? Draw the appropriate moral!
7. The probability that a machine will produce all bolts in a production run within specification is
0.998. A sample of 8 machines is taken at random. Calculate the probability that (a) all 8 machines,
(b) 7 or 8 machines, (c) at least 6 machines will produce all bolts within specification
8. The probability that a machine develops a fault within the first 3 years of use is 0.003. If 40
machines are selected at random, calculate the probability that 38 or more will not develop any
faults within the first 3 years of use.
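As an illustration of how such answers can be checked, here is a hedged R sketch for problem 4 above (17 riders, each with an assumed independent probability 0.05 of a burst tyre):

n <- 17; p <- 0.05
dbinom(1, n, p)       # (a) exactly one burst tyre
pbinom(3, n, p)       # (b) at most three burst tyres
1 - pbinom(1, n, p)   # (c) two or more burst tyres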
Q6 How do you use a normal distribution table?
As we know, the label for rows contains the integer part and the first decimal place of z. In contrast, the title
for columns comprises the second decimal place of z. The values within the table are the probabilities
corresponding to the table type. Hence, to get the value of 0.56 from the z-table, identify the probability
value corresponding to the 0.5 row and 0.06 column (=0.2123).
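The same value can be obtained in R without a printed table: pnorm(z) returns P(Z ≤ z) for the standard normal, so the table entry P(0 < Z ≤ 0.56) is simply pnorm(0.56) − 0.5.

pnorm(0.56)          # P(Z <= 0.56), approximately 0.7123
pnorm(0.56) - 0.5    # approximately 0.2123, matching the z-table entry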
Questions on Poisson Distribution
Q1 What is a Poisson distribution?
A Poisson distribution is a discrete probability distribution that gives the probability of a given number of
independent events occurring in a fixed interval of time.
Q2 When do we use Poisson distribution?
The Poisson distribution is used when independent events occur at a constant average rate within a given
interval of time.
Q3 What is the difference between the Poisson distribution and normal distribution?
The major difference between the Poisson distribution and the normal distribution is that the Poisson
distribution is discrete whereas the normal distribution is continuous. If the mean of the Poisson distribution
becomes larger, then the Poisson distribution is similar to the normal distribution.
Q4 Are the mean and variance of the Poisson distribution the same?
The mean and the variance of the Poisson distribution are the same, which is equal to the average number of
successes that occur in the given interval of time.
Q5 Mention the three important constraints in Poisson distribution.
The three important constraints used in Poisson distribution are:
The number of trials (n) tends to infinity
The probability of success (p) tends to zero
np = λ, the mean of the distribution, remains finite.
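These limiting conditions are what make the Poisson distribution a good approximation to the binomial when n is large and p is small. A small R sketch, with n = 1000 and p = 0.001 chosen only for illustration (so that the mean is np = 1):

n <- 1000; p <- 0.001
dbinom(0:4, n, p)      # exact binomial probabilities of 0 to 4 events
dpois(0:4, n * p)      # Poisson approximation with the same mean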
Practice Problems
Solve the following problems based on the Poisson distribution:
1. Large sheets of metal have faults in random positions but on average have 1 fault per 10 m². What is the
probability that a sheet 5 m × 8 m will have at most one fault?
2. If 250 litres of water are known to be polluted with 10⁶ bacteria, what is the probability that a sample of 1
cc of the water contains no bacteria?
3. Suppose vehicles arrive at a signalised road intersection at an average rate of 360 per hour and the cycle
of the traffic lights is set at 40 seconds. In what percentage of cycles will the number of vehicles arriving be
(a) exactly 5, (b) less than 5? If, after the lights change to green, there is time to clear only 5 vehicles before
the signal changes to red again, what is the probability that waiting vehicles are not cleared in one cycle?
4. Previous results indicate that 1 in 1000 transistors are defective on average.
(a) Find the probability that there are 4 defective transistors in a batch of 2000.
(b) What is the largest number, N, of transistors that can be put in a box so that the probability of no
defectives is at least 1/2?
5. A manufacturer sells a certain article in batches of 5000. By agreement with a customer the following
method of inspection is adopted: A sample of 100 items is drawn at random from each batch and inspected.
If the sample contains 4 or fewer defective items, then the batch is accepted by the customer. If more than 4
defectives are found, every item in the batch is inspected. If inspection costs are 75 p per hundred articles,
and the manufacturer normally produces 2% of defective articles, find the average inspection costs per
batch.
6. A book containing 150 pages has 100 misprints. Find the probability that a particular page contains
(a) no misprints,
(b) 5 misprints,
(c) at least 2 misprints,
(d) more than 1 misprint.
7. For a particular machine, the probability that it will break down within a week is 0.009. The
manufacturer has installed 800 machines over a wide area. Calculate the probability that
(a) 5, (b) 9, (c) less than 5, (d) more than 4 machines break down in a week.
8. At a given university, the probability that a member of staff is absent on any one day is 0.001. If there are
800 members of staff, calculate the probabilities that the number absent on any one day is
(a) 6,
(b) 4,
(c) 2,
(d) 0,
(e) less than 3,
(f) more than 1.
9. The number of failures occurring in a machine of a certain type in a year has a Poisson distribution with
mean 0.4. In a factory there are ten of these machines. What is
(a) the expected total number of failures in the factory in a year?
(b) the probability that there are fewer than two failures in the factory in a year?
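As an illustration, a hedged R sketch for problem 1 above: a 5 m × 8 m sheet has area 40 m², so with an average of 1 fault per 10 m² the expected number of faults is λ = 4.

lambda <- 40 / 10    # expected number of faults on the sheet
ppois(1, lambda)     # P(at most one fault) = P(X <= 1), roughly 0.09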
Case Studies
Case Study – Exponential Distribution
The time in minutes, X, between the arrivals of successive customers at a post office is
exponentially distributed with pdf f(x) = λe^(−λx) for x ≥ 0.
(ii) If the next customer arrives after 12:35 p.m., then the time between the two arrivals
is more than 5 minutes, and we require P(X > 5). To calculate P(X > x), note that it is
equivalent to 1 − P(X ≤ x), which for the exponential distribution equals e^(−λx).
Poisson Processes
The Exponential distribution is often used as a model for the times between events. We
have looked at events occurring randomly in time in association with the Poisson
distribution. The Poisson distribution gives the probabilities for the number of events
taking place in the given time period whereas the exponential distribution gives the
probabilities for times between the events. Both of these concern events occurring
randomly in time at a constant average rate, λ. This is known as a Poisson process.
For example, consider a series of randomly occurring events, such as customers entering a
bank. The times of arrival can be pictured as points scattered along a time line.
There are two ways we can view the data.
1. The number of arrivals in each minute (for example 1, 1, 0, 3, 1), which follows a Poisson distribution
with mean λ.
2. The time between successive arrivals, which has an exponential distribution with parameter λ.
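The numerical arrival rate in the original case is given in a figure that is not reproduced here, so the sketch below simply assumes an illustrative rate of λ = 0.2 customers per minute (a mean inter-arrival time of 5 minutes):

lambda <- 0.2                # assumed arrival rate, customers per minute
1 - pexp(5, rate = lambda)   # exponential view: P(X > 5) = exp(-5 * lambda), about 0.37
dpois(0:3, lambda)           # Poisson view: P(0, 1, 2 or 3 arrivals in one minute)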
Two other important distributions are the t-distribution and the Chi-squared (χ²) distribution. The
t-distribution is used when testing a hypothesis about a mean or a difference between two means. The
Chi-square distribution is used when analysing categorical data.
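Both tests are available directly in R. A minimal sketch using the built-in mtcars data purely for illustration:

# t-test: do automatic and manual cars differ in mean mpg?
t.test(mpg ~ am, data = mtcars)

# Chi-square test of independence between two categorical variables
# (small expected counts produce a warning; shown here only for syntax)
chisq.test(table(mtcars$am, mtcars$cyl))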
Caselet
Case – Binomial Distribution
A University Engineering Department has introduced a new software package called SOLVIT. To save
money, the University’s Purchasing Department has negotiated a bargain price for a 4-user license that
allows only four students to use SOLVIT at any one time. It is estimated that this should allow 90% of
students to use the package when they need it. The Students’ Union has asked for more licenses to be
bought since engineering students report having to queue excessively to use SOLVIT. As a result, the
Computer Centre monitors the use of the software. Their findings show that on average 20 students are
logged on at peak times and 4 of these want to use SOLVIT. Was the Purchasing Department’s estimate
correct?
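One way to examine this in R: if, as the monitoring suggests, each of the 20 logged-on students independently wants SOLVIT with probability 4/20 = 0.2, the proportion of peak moments at which the four licences suffice is P(X ≤ 4) for X ~ Binomial(20, 0.2). This is a sketch of one plausible model, not the official solution:

pbinom(4, size = 20, prob = 0.2)   # about 0.63, well short of the estimated 90%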
Case – Normal Distribution
A machine fills bags of sugar; the weights are approximately normally distributed with a mean of 1010 g
and a standard deviation of 20 g. Some values are less than 1000 g ... can you fix that?
One option is to adjust the machine so that 1000 g sits at −3 standard deviations below the mean: from the
standard normal curve, only about 0.1% of bags would then be underweight. But maybe that is a smaller
percentage than we really need.
At −2.5 standard deviations: below −3 standard deviations is 0.1%, and between −3 and −2.5 standard
deviations is about 0.5%, so together that is 0.1% + 0.5% = 0.6% (a good choice, I think).
So let us adjust the machine to have 1000 g at −2.5 standard deviations from the mean.
Now, we can adjust it in one of two ways:
increase the amount of sugar in each bag (which changes the mean), or
make the filling process more consistent (which reduces the standard deviation).
The standard deviation is 20g, and we need 2.5 of them:
2.5 × 20g = 50g
So the machine should average 1050g.
Or we can keep the same mean (of 1010g), but then we need 2.5 standard deviations to
be equal to 10g:
10g / 2.5 = 4g
So the standard deviation should be 4g.
(We hope the machine is that accurate!)
Or perhaps we could have some combination of better accuracy and slightly larger
average size, I will leave that up to you!
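The calculations above can be checked with pnorm(); a brief sketch:

pnorm(1000, mean = 1010, sd = 20)   # current settings: about 31% of bags below 1000 g
pnorm(1000, mean = 1050, sd = 20)   # raise the mean to 1050 g: about 0.6% below
pnorm(1000, mean = 1010, sd = 4)    # keep 1010 g, tighten the sd to 4 g: about 0.6% below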
1. A manufacturer produces light-bulbs that are packed into boxes of 100. If
quality control studies indicate that 0.5% of the light-bulbs produced are
defective, what percentage of the boxes will contain: (a) no defective? (b) 2 or
more defectives?
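A hedged R sketch for checking this problem, both exactly (binomial) and with the Poisson approximation, where λ = np = 0.5:

n <- 100; p <- 0.005
dbinom(0, n, p)        # (a) exact: about 60.6% of boxes contain no defective
1 - pbinom(1, n, p)    # (b) exact: about 9% contain 2 or more defectives
dpois(0, n * p)        # Poisson approximation to (a), about 60.7%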
A Business Planning Example using Monte-Carlo Simulation
Imagine you are the marketing manager for a firm that is planning to introduce a new
product. You need to estimate the first year net profit from this product, which will depend
on:
Sales volume in units
Unit cost
Fixed costs
Net profit will be calculated as Net Profit = Sales Volume × (Selling Price − Unit Cost) − Fixed Costs.
Fixed costs (for overhead, advertising, etc.) are known to be $120,000. But
the other factors all involve some uncertainty. Sales volume (in units) can cover quite a
range, and the selling price per unit will depend on competitor actions. Unit costs will also
vary depending on vendor prices and production experience.
Uncertain Variables
To build a risk analysis model, we must first identify the uncertain variables -- also
called random variables. While there's some uncertainty in almost all variables in a
business model, we want to focus on variables where the range of values is significant.
Sales and Price
Based on your market research, you believe that there are equal chances that the market
will be Slow, OK, or Hot.
In the "Slow market" scenario, you expect to sell 50,000 units at an average selling
price of $11.00 per unit.
In the "OK market" scenario, you expect to sell 75,000 units, but you'll likely
realize a lower average selling price of $10.00 per unit.
In the "Hot market" scenario, you expect to sell 100,000 units, but this will bring in
competitors who will drive down the average selling price to $8.00 per unit.
Intuition might suggest that plugging the average value of our uncertain inputs (Sales
Volume, Selling Price, and Unit Cost) into our model should produce the average value of
the output (Net Profit). However, as we’ll see in a moment, the Net Profit figure of
$117,750 calculated by this model, based on average values for the uncertain factors, is
quite misleading. The true average Net Profit is closer to $93,000! As Dr. Sam Savage
warns, "Plans based on average assumptions will be wrong on average."
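A minimal Monte-Carlo sketch of this example in R. The three market scenarios are taken as equally likely, as stated above; the unit-cost distribution is not specified in the text, so it is modelled here, purely as an assumption, as uniform between $5.50 and $7.50 (an average of about $6.50, the value implied by the $117,750 "average inputs" figure):

set.seed(1)
n <- 100000                                            # number of simulated years
scenario  <- sample(c("Slow", "OK", "Hot"), n, replace = TRUE)
volume    <- c(Slow = 50000, OK = 75000, Hot = 100000)[scenario]
price     <- c(Slow = 11, OK = 10, Hot = 8)[scenario]
unit_cost <- runif(n, min = 5.5, max = 7.5)            # assumed unit-cost distribution
profit    <- volume * (price - unit_cost) - 120000     # net profit for each simulated year
mean(profit)                     # close to $93,000, well below the $117,750 based on averages
quantile(profit, c(0.05, 0.95))  # a rough range of likely outcomes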
Internal Sample Paper
Internal Examination (Sep 2024)
Instructions: Use of a calculator is allowed, if required, for subjects like Financial Management, Operations,
etc. (scientific calculators are not allowed).
Use of unfair means will lead to cancellation of the paper, followed by disciplinary action.
Attempt any two questions from Section-I and any two questions from Section-II.
Section-I
(Theoretical Concept and Practical/Application oriented)
Answer in 500 words. Each question carries 08 marks.
Q1.
Q2.
Q3.
Q4. Write a short note on any two. Answer in 300 words. Each carries 04 marks.
a)
b)
c)
Section-II
(Analytical Question / Case Study / Essay Type Question to test analytical and
Comprehensive Skills)
Answer in 700 words. Attempt any 2 questions. Each question carries 12 marks
Q5.
Q6.
Q7.
Declaration by Faculty:
I, Dr. Rakhee Chhibber, Visiting Faculty teaching Statistical Analysis using R in the BCA course (V
Semester), have incorporated all the necessary pages, sections, and question papers mentioned in the
checklist above, except the previous five years' papers, because this course has been introduced in the
University for the first time.
Signature