ENGG 2780A / ESTR 2020: Statistics for Engineers
Spring 2025
L1: (Probability and)
Statistics
Sinno Jialin Pan
Outline
Probability vs. statistics
Statistical inference problems
• Bayesian inference
• Classical inference
Review of probability
What is Probability?
Probability is a mathematical language for quantifying uncertainty.
Examples: the number of heads out of 100 flips; the number of rainy days 🌧 per month; the range of an hourly stock price.
Each example defines a random variable, studied with probability theory.
Probability Theory
p = 1/2:  x ~ Binomial(100, p)
λ = 5:  🌧 x ~ Poisson(λ)
x ~ Normal(μ, σ²)
Probability Theory (cont.)
Assume the probability distribution (the data-generating process) is known:
• a family of distributions
• the parameter(s) of the distribution
Then we can compute P(x = k) OR P(k1 ≤ x ≤ k2).
Independence: P(x_i | x_j) = P(x_i)
Conditional independence: P(x_i | y, x_j) = P(x_i | y)
Bayes’ rule: P(θ | x) = P(x | θ) P(θ) / P(x)
The Central Dogma of Statistics
Data = independent samples. We have samples of observed data, but do not know the underlying distribution.
Statistics
Observations of heads for 100 flips: x ~ Binomial(100, p)
Historical records of the number of rainy days 🌧 for the past few years: x ~ Poisson(λ)
Historical hourly prices of the stock for the past few months: x ~ Normal(μ, σ²)
Probability vs. Statistics
Probability theory (e.g., the Central Limit Theorem): from the data-generating process to observed data.
Statistical inference: from observed data back to the data-generating process.
“Theory without Practice is empty; but Practice without Theory is blind” – Immanuel Kant
Descriptive statistics vs. Inferential statistics
Descriptive statistics: use numbers to summarize and describe data. They do not involve generalization beyond the data at hand.
[Figure: 2021 Report on Annual Earnings and Hours Survey (from [Link])]
Statistical inference tasks
HTTHTTHTTT, etc.
Estimation: model the flips as x ~ Binomial(10, p) and estimate the parameter θ = p.
Classical statistics: parameters are considered as deterministic quantities that happen to be unknown; θ is estimated from observed data (point estimation).
Bayesian statistics: parameters are considered as random variables with prior distributions f_Θ(θ) or p_Θ(θ); the posterior f_{Θ|X}(θ | x) or p_{Θ|X}(θ | x) is obtained via Bayes’ rule:
P(θ | x) = P(x | θ) P(θ) / P(x)
Statistical inference tasks
HTTHTTHTTT
Hypothesis testing: biased or fair?
Statistical inference tasks
HTTHTTHTTT
Confidence interval estimation: e.g., an interval for p with 95% confidence.
Schedule
Week     Date    Lecture   Topic
Week 1   Jan 6   L1        Probability vs Statistics
Week 2   Jan 13  L2        Bayesian statistics
Week 3   Jan 20  L2 & L3   Prediction, estimation, & hypothesis testing
Week 4   Jan 27  L3        Prediction, estimation, & hypothesis testing
Week 5   Feb 3             Lunar New Year Vacation (no class)
Week 6   Feb 10  L4        Sampling statistics
Week 7   Feb 17  L5        Classical point estimation
Week 8   Feb 24            Midterm Exam (during lecture)
Week 9   Mar 3             Reading Week (no class)
Week 10  Mar 10  L6        Confidence interval I
Week 11  Mar 17  L7        Confidence interval II
Week 12  Mar 24  L8        Hypothesis test
Week 13  Mar 31  L9        Composite hypothesis test
Week 14  Apr 7   L10       Comparing populations
Review of Probability
Random variables quantify the outcomes of random (non-deterministic) events.
Discrete random variables: distributions are defined by a Probability Mass Function (PMF),
P(X = x_i) = p(x_i), i = 1, …, k,  with  Σ_{i=1}^{k} p(x_i) = 1.
Continuous random variables: distributions are defined by a Probability Density Function (PDF) f(x),
P(a ≤ X ≤ b) = ∫_a^b f(x) dx,  with  ∫_{−∞}^{∞} f(x) dx = 1.
Note: P(X = x) ≠ f(x). In fact P(X = x) = 0, since
P(x ≤ X ≤ x + δ) = ∫_x^{x+δ} f(u) du ≈ f(x) δ → 0  as δ → 0.
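Since a PDF yields probabilities only through integration, a quick numerical check helps make the point. This is a minimal sketch in plain Python (standard library only; the helper names `normal_pdf` and `prob_between` are ours, not from any library), approximating P(a ≤ X ≤ b) for a normal variable with a midpoint Riemann sum:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # PDF of N(mu, sigma^2): f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_between(a, b, mu=0.0, sigma=1.0, steps=100_000):
    # P(a <= X <= b) = integral of f over [a, b], via a midpoint Riemann sum
    h = (b - a) / steps
    return h * sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps))

total = prob_between(-8, 8)          # integrating over (essentially) the whole line gives ~1
p_one_sigma = prob_between(-1, 1)    # P(-1 <= Z <= 1), about 0.6827
```

Integrating over essentially the whole real line returns ≈ 1, while P(−1 ≤ Z ≤ 1) ≈ 0.68 recovers the familiar one-sigma rule.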
Binomial random variables
A Bernoulli variable X_i with parameter p:
  X_i:        0       1
  PMF(X_i):   1 − p   p
X = X_1 + … + X_n ~ Binomial(n, p), with
PMF(k; n, p) = P(X = k; n, p) = C(n, k) p^k (1 − p)^{n−k}.
How likely are we to get 2 heads in 3 coin flips if the probability of heads is p? With H ~ Binomial(3, p),
P(H = 2) = C(3, 2) p² (1 − p) = 3 p² (1 − p).
p = 0.5:  3 × 0.5² × 0.5 = 0.375
p = 0.7:  3 × 0.7² × 0.3 = 0.441
p = 1:    3 × 1² × 0 = 0
How likely are we to get 200 heads in 300 coin flips if the probability of heads is p? With H ~ Binomial(300, p),
P(H = 200) = C(300, 200) p²⁰⁰ (1 − p)¹⁰⁰.
p = 0.5:  ≈ 2 × 10⁻⁹
p = 0.7:  ≈ 0.022
p = 1:    0
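Both examples can be evaluated directly from the binomial PMF. A minimal sketch (standard library only; `binom_pmf` is our own helper, not a library function):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# 2 heads in 3 flips
p2_fair = binom_pmf(2, 3, 0.5)     # 0.375
p2_biased = binom_pmf(2, 3, 0.7)   # 0.441
# 200 heads in 300 flips (Python's big integers keep C(300, 200) exact)
p200_fair = binom_pmf(200, 300, 0.5)    # ~2e-9
p200_biased = binom_pmf(200, 300, 0.7)  # ~0.022
```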
Normal random variables
Standard normal: z ~ 𝒩(0, 1). For x ~ 𝒩(μ, σ²):
z = (x − μ) / σ,   x = σ z + μ.
PDF of 𝒩(μ, σ²):  f(x) = 1/(σ√(2π)) · e^{−(x−μ)² / (2σ²)}
PDF of 𝒩(0, 1):   f(x) = 1/√(2π) · e^{−x²/2}
Cumulative Distribution Function (CDF) of 𝒩(0, 1):
Φ(t) = P(X ≤ t) = 1/√(2π) ∫_{−∞}^{t} e^{−x²/2} dx
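Python’s standard library exposes the normal CDF through `statistics.NormalDist`, which makes the standardization identity P(X ≤ t) = Φ((t − μ)/σ) easy to verify; the particular μ, σ, t values below are purely illustrative:

```python
from statistics import NormalDist

std = NormalDist()            # the standard normal N(0, 1)
phi_0 = std.cdf(0.0)          # Phi(0) = 0.5 by symmetry
phi_126 = std.cdf(1.26)       # ~0.896

# Standardization: for X ~ N(mu, sigma^2), P(X <= t) = Phi((t - mu) / sigma)
mu, sigma, t = 210.0, 7.94, 200.0      # illustrative values
direct = NormalDist(mu, sigma).cdf(t)
via_z = std.cdf((t - mu) / sigma)
```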
Mean and variance
                      𝔼[x]   Var[x]
x ~ Bernoulli(p)      p      p(1 − p)
x ~ Binomial(n, p)    np     np(1 − p)
x ~ 𝒩(μ, σ²)          μ      σ²
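The table’s moments can be checked empirically by simulation. A sketch assuming nothing beyond the standard library; the sample sizes are arbitrary choices:

```python
import random

random.seed(0)  # reproducible
n, p = 100, 0.3
# each sample is one Binomial(n, p) draw, built as a sum of n Bernoulli(p) trials
samples = [sum(1 for _ in range(n) if random.random() < p) for _ in range(10_000)]

emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / len(samples)
# theory: E[X] = n p = 30,  Var[X] = n p (1 - p) = 21
```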
The Central Limit Theorem
Suppose X_1, …, X_n are independent with the same PMF/PDF, with
𝔼[X_i] = μ,  Var[X_i] = σ² > 0,  and let  X = Σ_{i=1}^{n} X_i.
For every t (positive or negative), the CDF of the standardized sum Z converges:
lim_{n→∞} P(X ≤ 𝔼[X] + t √Var[X]) = Φ(t),
equivalently,
lim_{n→∞} P((X − 𝔼[X]) / √Var[X] ≤ t) = Φ(t).
So X can be approximated by 𝒩(𝔼[X], Var[X]) = 𝒩(nμ, nσ²).
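A quick simulation illustrates the theorem with sums of i.i.d. Uniform(0, 1) variables (μ = 1/2, σ² = 1/12): the empirical CDF of the sum at 𝔼[X] + t√Var[X] should approach Φ(t). This is only a sketch; the sample sizes are arbitrary:

```python
import random
from statistics import NormalDist

random.seed(1)
n, trials = 50, 10_000
mu, var = 0.5, 1 / 12        # mean and variance of a single Uniform(0, 1)

# X = X_1 + ... + X_n for i.i.d. Uniform(0, 1) summands
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

# CLT: P(X <= E[X] + t sqrt(Var[X])) -> Phi(t)
t = 1.0
threshold = n * mu + t * (n * var) ** 0.5
empirical = sum(1 for s in sums if s <= threshold) / trials
phi_t = NormalDist().cdf(t)   # ~0.8413
```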
Use the CLT to estimate the probability of at least 200 heads in 300 coin flips if the probability of heads is p. With H ~ Binomial(300, p):  μ = 300p,  σ = √(300 p (1 − p)).
p = 0.5:  μ = 150, σ ≈ 8.66
P(H ≥ 200) ≈ 𝒩(x ≥ 200; μ, σ²) = 𝒩(z ≥ (200 − 150)/8.66 = 5.77; 0, 1) ≈ 0
p = 0.7:  μ = 210, σ ≈ 7.94
P(H ≥ 200) ≈ 𝒩(x ≥ 200; μ, σ²) = 𝒩(z ≥ (200 − 210)/7.94 ≈ −1.26; 0, 1) = Φ(1.26) ≈ 0.896
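The approximation above can be reproduced with `statistics.NormalDist`; `clt_tail` is our own helper name, a sketch rather than a library routine:

```python
from statistics import NormalDist

def clt_tail(k, n, p):
    # P(H >= k) for H ~ Binomial(n, p), approximated by N(n p, n p (1 - p))
    mu = n * p
    sigma = (n * p * (1 - p)) ** 0.5
    return 1 - NormalDist(mu, sigma).cdf(k)

tail_fair = clt_tail(200, 300, 0.5)     # z = 5.77, essentially 0
tail_biased = clt_tail(200, 300, 0.7)   # z = -1.26, ~0.896
```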
Bayesian statistical inference
1. Assign prior probabilities to parameters
2. Observe data
3. Update probabilities via Bayes’ rule:
P_{Θ|X}(θ | x) = P_Θ(θ) P_{X|Θ}(x | θ) / P_X(x) = P_Θ(θ) P_{X|Θ}(x | θ) / Σ_{θ′} P_Θ(θ′) P_{X|Θ}(x | θ′)
Bayes’ rule (four versions)
Each version combines a prior on Θ with a likelihood of x given θ to give the posterior; the denominator is the normalizing constant Z(x).
Θ discrete, X discrete:
p_{Θ|X}(θ | x) = p_Θ(θ) p_{X|Θ}(x | θ) / Σ_{θ′} p_Θ(θ′) p_{X|Θ}(x | θ′)
Θ discrete, X continuous:
p_{Θ|X}(θ | x) = p_Θ(θ) f_{X|Θ}(x | θ) / Σ_{θ′} p_Θ(θ′) f_{X|Θ}(x | θ′)
Θ continuous, X discrete:
f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / ∫ f_Θ(θ′) p_{X|Θ}(x | θ′) dθ′
Θ continuous, X continuous:
f_{Θ|X}(θ | x) = f_Θ(θ) f_{X|Θ}(x | θ) / ∫ f_Θ(θ′) f_{X|Θ}(x | θ′) dθ′
Note: these 4 versions are obtained by using the PMF (p) for discrete variables and the PDF (f) for continuous variables.
Why is the denominator a constant w.r.t. θ?
E.g.,
∫ f_Θ(θ′) f_{X|Θ}(x | θ′) dθ′ = ∫ f_{X,Θ}(x, θ′) dθ′ = f_X(x),
the marginalized density of x, also denoted Z(x). It depends only on the observed data x and is constant w.r.t. θ.
A coin might be one of the following three types:
θ = 1: fair (H/T), prior 90%
θ = 2: two-headed (H/H), prior 5%
θ = 3: two-tailed (T/T), prior 5%
You flip a head (H_1). How do you adjust your beliefs (priors)?
P(θ = 1 | H_1) = P(H_1 | θ = 1) P(θ = 1) / Z(H_1) = 0.5 × 0.9 / Z(H_1) = 0.45 / Z(H_1)
P(θ = 2 | H_1) = P(H_1 | θ = 2) P(θ = 2) / Z(H_1) = 1 × 0.05 / Z(H_1) = 0.05 / Z(H_1)
P(θ = 3 | H_1) = 0
Z(H_1) = 0.45 + 0.05 + 0 = 0.5
P(θ = 1 | H_1) = 0.9,  P(θ = 2 | H_1) = 0.1,  P(θ = 3 | H_1) = 0
Bayes’ rule variant with an extra condition c:
P(a | b, c) = P(b | a, c) P(a | c) / P(b | c),  with Z(b, c) = P(b | c).
Adjusted priors:
θ = 1 | H_1: 90%,  θ = 2 | H_1: 10%,  θ = 3 | H_1: 0%
You flip another head (H_2). How do you readjust? H_1 and H_2 are independent given a specific coin, so
P(θ = 1 | H_2 H_1) = P(H_2 | θ = 1, H_1) P(θ = 1 | H_1) / Z(H_2, H_1) = 0.5 × 0.9 / Z(H_2, H_1) = 0.45 / Z(H_2, H_1)
P(θ = 2 | H_2 H_1) = P(H_2 | θ = 2, H_1) P(θ = 2 | H_1) / Z(H_2, H_1) = 1 × 0.1 / Z(H_2, H_1) = 0.1 / Z(H_2, H_1)
P(θ = 3 | H_2 H_1) = 0
Z(H_2, H_1) = 0.45 + 0.1 + 0 = 0.55
P(θ = 1 | H_2 H_1) = 0.45 / 0.55 ≈ 0.82,  P(θ = 2 | H_2 H_1) ≈ 0.18,  P(θ = 3 | H_2 H_1) = 0
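The two updates can be replayed in code. This is a sketch of the discrete Bayes-rule step (the `update` helper is our own name, not a library routine):

```python
def update(prior, likelihood):
    # One discrete Bayes-rule step: posterior(theta) = prior(theta) * likelihood(theta) / Z
    z = sum(prior[t] * likelihood[t] for t in prior)   # normalizing constant Z
    return {t: prior[t] * likelihood[t] / z for t in prior}

prior = {1: 0.9, 2: 0.05, 3: 0.05}    # theta = 1 fair, 2 two-headed, 3 two-tailed
p_heads = {1: 0.5, 2: 1.0, 3: 0.0}    # P(H | theta)

after_h1 = update(prior, p_heads)     # first head:  0.9 / 0.1 / 0
after_h2 = update(after_h1, p_heads)  # second head: ~0.82 / ~0.18 / 0
```

Applying the same step twice reproduces the posteriors computed on the slide.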
Bayes’ rule for multiple random variables
Using continuous variables as an example:
f_{Θ|X_1,…,X_n}(θ | x_1, …, x_n) = f_{X_1,…,X_n|Θ}(x_1, …, x_n | θ) f_Θ(θ) / Z(x_1, …, x_n)
  ∝ f_{X_1,…,X_n|Θ}(x_1, …, x_n | θ) f_Θ(θ)
  = f_{X_1|Θ}(x_1 | θ) ⋯ f_{X_n|Θ}(x_n | θ) f_Θ(θ),  if x_1, …, x_n are independent given θ.
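With conditionally independent observations, the posterior can also be computed in one batch by multiplying the per-observation likelihoods. A sketch (all names are ours) using the three-coin example: the batch posterior after two heads matches the sequential result, ≈ 0.82 / 0.18 / 0.

```python
def posterior_iid(prior, cond_pmf, data):
    # posterior(theta) is proportional to prior(theta) * product_i P(x_i | theta),
    # valid when the x_i are independent given theta
    unnorm = {}
    for t in prior:
        like = 1.0
        for x in data:
            like *= cond_pmf(x, t)
        unnorm[t] = prior[t] * like
    z = sum(unnorm.values())           # Z(x_1, ..., x_n)
    return {t: v / z for t, v in unnorm.items()}

def coin_pmf(x, theta):
    # three-coin example: theta = 1 fair, 2 two-headed, 3 two-tailed
    p_h = {1: 0.5, 2: 1.0, 3: 0.0}[theta]
    return p_h if x == 'H' else 1 - p_h

post = posterior_iid({1: 0.9, 2: 0.05, 3: 0.05}, coin_pmf, ['H', 'H'])
```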
Some commonly used prior distributions
Beta(α, β):  f(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)  for θ ∈ [0, 1], where
B(α, β) = Γ(α) Γ(β) / Γ(α + β),  Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,
and Γ(α) = (α − 1)! for positive integer α.
The Beta distribution is widely used to model the (prior) distribution of a random variable whose range is [0, 1], where α and β are (hyper)parameters.
As ∫_{−∞}^{∞} Beta(α, β) dθ = 1, we have
∫_0^1 θ^{α−1} (1 − θ)^{β−1} dθ = B(α, β).
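The normalization identity can be checked numerically with `math.gamma` and a midpoint Riemann sum. A sketch; α = 2, β = 5 are arbitrary test values and `beta_kernel` is our own name:

```python
import math

def beta_kernel(theta, alpha, beta):
    # the unnormalized Beta density: theta^(alpha-1) * (1 - theta)^(beta-1)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1)

alpha, beta = 2.0, 5.0
# B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta)
b_const = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)  # 1/30 here

# midpoint Riemann sum of the kernel over [0, 1]
steps = 200_000
h = 1.0 / steps
integral = h * sum(beta_kernel((i + 0.5) * h, alpha, beta) for i in range(steps))
```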
Some commonly used prior distributions (cont.)
Gamma(α, β):  f(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}  for θ > 0,  and 0 for θ ≤ 0, where
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,  and Γ(α) = (α − 1)! for positive integer α.
The Gamma distribution is widely used to model the (prior) distribution of a non-negative random variable, where α and β are (hyper)parameters.
As ∫_{−∞}^{∞} Gamma(α, β) dθ = 1, we have
∫_0^∞ θ^{α−1} e^{−βθ} dθ = Γ(α) / β^α.
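The same kind of numerical check works for the Gamma identity, truncating the integral where the exponential tail is negligible. A sketch with arbitrary test values α = 3, β = 2 (`gamma_kernel` is our own name):

```python
import math

def gamma_kernel(theta, alpha, beta):
    # the unnormalized Gamma density on theta > 0: theta^(alpha-1) * e^(-beta * theta)
    return theta ** (alpha - 1) * math.exp(-beta * theta)

alpha, beta = 3.0, 2.0
closed_form = math.gamma(alpha) / beta ** alpha   # Gamma(3) / 2^3 = 2 / 8 = 0.25

# midpoint Riemann sum over [0, upper]; the tail beyond `upper` is negligible
upper, steps = 40.0, 200_000
h = upper / steps
integral = h * sum(gamma_kernel((i + 0.5) * h, alpha, beta) for i in range(steps))
```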