AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS
L T P C: 3 0 0 3
COURSE OBJECTIVES:
CO1: Explain the data analytics pipeline
CO2: Apply descriptive analytics techniques to visualize data
CO3: Perform statistical inferences from data
CO4: Apply sampling test techniques to get the variance in the data
CO5: Build models for predictive analytics
TEXT BOOKS:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016.
(first two chapters for Unit I).
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
M. RAMNATH, AP / AI&DS, RIT
ramnath@[Link]
Department of Artificial Intelligence and Data Science, Ramco Institute of Technology
UNIT III INFERENTIAL STATISTICS 10
Topics (Text Book 2):
Populations – samples – random sampling – sampling distribution – standard error of the mean – hypothesis testing – z-test – z-test procedure – decision rule – calculations – decisions – interpretations – one-tailed and two-tailed tests – estimation – point estimate – confidence interval – level of confidence – effect of sample size.
Populations:
Any complete set of observations (or potential observations) may be
characterized as a population.
Real Population:
A real population is one in which all potential observations are accessible at the
time of sampling.
An example of a real population is the ages of all visitors to Disneyland on a
given day.
Hypothetical Populations:
A hypothetical population is one in which all potential observations are not
accessible at the time of sampling.
Samples:
Any subset of observations from a population may be characterized as a
sample. In typical applications of inferential statistics, the sample size is small
relative to the population size.
Optimal Sample Size:
There is no simple rule of thumb for determining the best or optimal sample size
for any particular situation. Often sample sizes are in the hundreds or even the
thousands for surveys, but they are less than 100 for most experiments. Optimal
sample size depends on the answers to a number of questions, including “What
is the estimated variability among observations?” and “What is an acceptable
amount of error in our conclusion?” Once these types of questions have been
answered, specific procedures can be followed, with the aid of
hypothesis-testing guidelines, to determine the optimal sample size for any
situation.
RANDOM SAMPLING:
A selection process that guarantees all potential observations in the
population have an equal chance of being selected.
It’s important to note that randomness describes the selection process—that
is, the conditions under which the sample is taken—and not the particular
pattern of observations in the sample.
Casual or Haphazard, Not Random
A casual or haphazard sample doesn’t qualify as a random sample.
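As a minimal sketch of the idea, assuming the population can be listed in full, Python's standard `random.sample` draws a simple random sample in which every observation has the same chance of selection. The population of ID numbers 1 to 679 and the sample size of 25 are illustrative assumptions, not values from these notes.

```python
import random

# Hypothetical population: ID numbers 1..679 standing in for listed observations.
population = list(range(1, 680))

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Simple random sample without replacement: every ID is equally likely to be chosen.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Note that the randomness lies in the call to `sample`, not in how the printed IDs happen to look.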
Random Assignment:
A procedure designed to ensure that each subject has an equal chance of being
assigned to any group in an experiment.
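One common way to carry out random assignment is to shuffle the subjects and then deal them out to the groups in turn. The sketch below assumes 20 hypothetical subjects and two groups named "treatment" and "control"; the helper `randomly_assign` is illustrative, not a procedure prescribed by the text.

```python
import random

def randomly_assign(subjects, groups, seed=None):
    """Shuffle the subjects, then deal them out to the groups in rotation."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment

# Hypothetical subject labels S01..S20.
subjects = [f"S{i:02d}" for i in range(1, 21)]
assignment = randomly_assign(subjects, ["treatment", "control"], seed=7)

for group, members in assignment.items():
    print(group, members)
```

Dealing in rotation also keeps the group sizes equal, which is a design choice rather than a requirement of random assignment.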
TABLES OF RANDOM NUMBERS:
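Suppose a random sample is to be drawn from a student directory in which each
name has been assigned a three-digit number between 001 and 679.
Since the first number on the specimen page in Table H is 100 (disregard the
fourth and fifth digits in each five-digit number), the person identified with
that number is included in the sample.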
The next three-digit number, 325, identifies the second person. Ignore the
next number, 765, since none of the numbers between 680 and 999 is
identified with any names in the student directory.
Also, ignore repeat appearances of any number between 001 and 679. The
next three-digit number, 135, identifies the third person. Continue this
process until the specified sample size has been achieved.
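The table-reading procedure above can be sketched in code. The five-digit numbers below are placeholders whose first three digits follow the sequence described in the text (100, 325, 765, 135); the fourth and fifth digits, the repeated entry, and the final entry are assumptions added for illustration.

```python
def sample_from_table(five_digit_numbers, directory_size, sample_size):
    """Read three-digit IDs from a stream of five-digit random numbers."""
    selected = []
    for number in five_digit_numbers:
        candidate = int(f"{number:05d}"[:3])  # keep digits 1-3, disregard digits 4-5
        if candidate < 1 or candidate > directory_size:
            continue  # no name carries this number (e.g. 680 to 999)
        if candidate in selected:
            continue  # ignore repeat appearances
        selected.append(candidate)
        if len(selected) == sample_size:
            break
    return selected

# Placeholder table entries (not copied from Table H).
table_entries = [10097, 32533, 76520, 13586, 10012, 34673]
chosen = sample_from_table(table_entries, directory_size=679, sample_size=4)
print(chosen)  # [100, 325, 135, 346]
```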
Efficient Use of Tables:
The inefficiency of the previous procedure becomes apparent when a random
sample must be obtained from a large population, such as that defined by a
city telephone directory.
It would be most laborious to assign a different number to each name in the
directory prior to consulting the table of random numbers.
Instead, most investigators refer directly to the random number table, using
each random number as a guide to a particular name in the directory.
For example, a six-digit random number, such as 239421, identifies the name
on page 239 (the first three digits) and line 421 (the last three digits).
This process is repeated for a series of six-digit random numbers until the
required number of names has been sampled.
Sampling Distribution:
A sampling distribution refers to a probability distribution of a statistic that
comes from choosing random samples of a given population.
Also known as a finite-sample distribution, it describes how the values of the
statistic are spread across the possible samples from a specific population.
Sampling Distribution of the Mean:
The sampling distribution of the mean refers to the probability distribution of
means for all possible random samples of a given size from some population.
Probability of a Particular Sample Mean:
CREATING A SAMPLING DISTRIBUTION FROM SCRATCH
MEAN OF ALL SAMPLE MEANS ($\mu_{\bar{X}}$):
The mean of the sampling distribution of the mean always equals the mean
of the population.
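To see this property concretely, the sketch below builds a sampling distribution of the mean by listing every possible sample. The four-value population and the sample size of 2 (drawn with replacement) are assumptions chosen to keep the enumeration small.

```python
from collections import Counter
from itertools import product

# Hypothetical four-value population, chosen only for illustration.
population = [2, 3, 4, 5]
sample_size = 2

# Every possible sample of size 2, drawn with replacement (4 * 4 = 16 samples).
all_samples = list(product(population, repeat=sample_size))
sample_means = [sum(s) / sample_size for s in all_samples]

# Sampling distribution of the mean: each distinct mean with its probability.
counts = Counter(sample_means)
sampling_distribution = {m: counts[m] / len(sample_means) for m in sorted(counts)}

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

for m, p in sampling_distribution.items():
    print(f"sample mean = {m:.1f}  probability = {p:.4f}")
print(population_mean, mean_of_sample_means)  # both equal 3.5
```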
STANDARD ERROR OF THE MEAN ($\sigma_{\bar{X}}$):
A rough measure of the average amount by which sample means deviate
from the mean of the sampling distribution or from the population mean.
The standard error of the mean equals the standard deviation of the
population divided by the square root of the sample size:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma_{\bar{X}}$ represents the standard error of the mean; σ represents the standard
deviation of the population; and n represents the sample size.
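As a rough check on the rule that the standard error equals σ divided by the square root of n, the sketch below compares that formula with the standard deviation of all possible sample means for a small hypothetical population (values 2, 3, 4, 5; samples of size 2 drawn with replacement). It also applies the formula to the SAT example used later in these notes, where σ is given as 110 and n as 100.

```python
import math
import statistics
from itertools import product

def standard_error(sigma, n):
    """Standard error of the mean: population sd divided by sqrt(sample size)."""
    return sigma / math.sqrt(n)

# Hypothetical population and sample size, chosen only for illustration.
population = [2, 3, 4, 5]
n = 2

sample_means = [sum(s) / n for s in product(population, repeat=n)]

sigma = statistics.pstdev(population)            # population standard deviation
se_formula = standard_error(sigma, n)            # sigma / sqrt(n)
se_enumerated = statistics.pstdev(sample_means)  # sd of all 16 sample means

print(round(se_formula, 4), round(se_enumerated, 4))  # the two values agree

# SAT example from these notes: sigma = 110, n = 100.
print(standard_error(110, 100))  # 11.0
```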
The standard error of the mean serves as a special type of standard deviation
that measures variability in the sampling distribution.
You might find it helpful to think of the standard error of the mean as a rough
measure of the average amount by which sample means deviate from the mean
of the sampling distribution or from the population mean.
Central Limit Theorem :
Regardless of the population shape, the shape of the sampling distribution of
the mean approximates a normal curve if the sample size is sufficiently large.
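A small simulation can make the theorem tangible. The sketch below assumes a strongly skewed exponential population (mean 1, standard deviation 1), a sample size of 50, and 2000 repeated samples; all three are illustrative choices. If the sampling distribution is close to normal, roughly 95 percent of sample means should fall within 1.96 standard errors of the population mean.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 50
NUM_SAMPLES = 2000

def draw_sample_mean(n):
    """Mean of one random sample of size n from an exponential population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n) with sigma = 1

# Share of sample means within 1.96 standard errors of the population mean.
within = sum(abs(m - 1.0) <= 1.96 * expected_se for m in sample_means) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f} (formula gives {expected_se:.3f})")
print(f"share within 1.96 SE: {within:.3f}")
```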
Few sample means with extreme values.
Many sample means with intermediate values.
TESTING A HYPOTHESIS ABOUT SAT SCORES:
Suppose we wish to test whether the mean SAT math score for all local
freshmen equals the national average of 500. Given a mean math score of 533
for a random sample of 100 freshmen, let’s test the hypothesis that, with
respect to the national average, nothing special is happening in the local
population. Insofar as an investigator usually suspects just the opposite
(namely, that something special is happening in the local population), he or
she hopes to reject the hypothesis that nothing special is happening,
henceforth referred to as the null hypothesis.
Hypothesized Sampling Distribution:
Regardless of the population shape, the shape of the sampling distribution of
the mean approximates a normal curve if the sample size is sufficiently large.
If the null hypothesis is true, then the distribution of sample means—that is,
the sampling distribution of the mean for all possible random samples, each
of size 100, from the local population of freshmen—will be centered about
the national average of 500. (Remember, the mean of the sampling
distribution always equals the population mean.)
Common Outcomes:
An observed sample mean qualifies as a common outcome if the difference
between its value and that of the hypothesized population mean is small enough
to be viewed as a probable outcome under the null hypothesis.
Rare Outcomes:
An observed sample mean qualifies as a rare outcome if the difference between
its value and the hypothesized population mean is too large to be reasonably
viewed as a probable outcome under the null hypothesis.
Boundaries for Common and Rare Outcomes:
If the one observed sample mean is located between 478 and 522, it will
qualify as a common outcome (readily attributed to variability) under the null
hypothesis, and the null hypothesis will be retained.
If, however, the one observed sample mean is greater than 522 or less than
478, it will qualify as a rare outcome (not readily attributed to variability)
under the null hypothesis, and the null hypothesis will be rejected.
Because the observed sample mean of 533 does exceed 522, the null
hypothesis is rejected.
On the basis of the present test, it is unlikely that the sample of 100 freshmen,
with a mean math score of 533, originates from a population whose mean
equals the national average of 500, and, therefore, the investigator can
conclude that the mean math score for the local population of freshmen
probably differs from (exceeds) the national average.
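The boundaries of 478 and 522 can be recovered with a short calculation, sketched here under two assumptions: the population standard deviation is 110 (the value these notes give for the SAT example) and the boundaries sit 1.96 standard errors on either side of the hypothesized mean.

```python
import math

hyp_mean = 500      # hypothesized population mean (national average)
sigma = 110         # population standard deviation given for the SAT example
n = 100             # sample size
critical_z = 1.96   # assumed cutoff for the middle 95 percent of a normal curve

se = sigma / math.sqrt(n)             # standard error of the mean
lower = hyp_mean - critical_z * se    # about 478
upper = hyp_mean + critical_z * se    # about 522

sample_mean = 533
if sample_mean < lower or sample_mean > upper:
    decision = "reject H0"   # rare outcome
else:
    decision = "retain H0"   # common outcome

print(f"boundaries: {lower:.2f} to {upper:.2f}; decision: {decision}")
```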
z TEST FOR A POPULATION MEAN:
Regardless of the population shape, the shape of the sampling distribution of
the mean approximates a normal curve if the sample size is sufficiently large.
Converting a Raw Score to z:
$$z = \frac{X - \mu}{\sigma}$$
Converting a Sample Mean to z:
$$z = \frac{\bar{X} - \mu_{hyp}}{\sigma_{\bar{X}}}$$
For the SAT example, with $\sigma_{\bar{X}} = 110/\sqrt{100} = 11$:
$$z = \frac{533 - 500}{11} = 3$$
The observed z of 3 exceeds the value of 1.96 specified in the hypothesized
sampling distribution. Thus, the observed z qualifies as a rare outcome under the
null hypothesis, and the null hypothesis is rejected. The results of this test with z
are the same as those for the original hypothesis test with $\bar{X}$.
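The conversion and decision can be sketched as a small function. It assumes the usual form z = (sample mean − hypothesized mean) / standard error, a two-tailed cutoff of 1.96, and the SAT values from these notes (sample mean 533, hypothesized mean 500, σ = 110, n = 100); the function name `z_test` is illustrative.

```python
import math

def z_test(sample_mean, hyp_mean, sigma, n, critical_z=1.96):
    """Two-tailed z test for a population mean with known sigma."""
    se = sigma / math.sqrt(n)              # standard error of the mean
    z = (sample_mean - hyp_mean) / se      # deviation in standard error units
    decision = "reject H0" if abs(z) >= critical_z else "retain H0"
    return z, decision

z, decision = z_test(sample_mean=533, hyp_mean=500, sigma=110, n=100)
print(z, decision)  # 3.0 reject H0
```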
Assumptions of z Test:
When a hypothesis test evaluates how far the observed sample mean deviates, in
standard error units, from the hypothesized population mean, as in the present
example, it is referred to as a z test or, more accurately, as a z test for a
population mean.
This z test is accurate only when
(1) the population is normally distributed or the sample size is large enough to
satisfy the requirements of the central limit theorem.
(2) the population standard deviation is known.
In the present example, the z test is appropriate because the sample size of 100 is
large enough to satisfy the central limit theorem and the population standard
deviation is known to be 110.
STATEMENT OF THE RESEARCH PROBLEM:
NULL HYPOTHESIS (H0):
A statistical hypothesis that usually asserts that nothing special is happening with
respect to some characteristic of the underlying population.
For the SAT example, the null hypothesis is
$$H_0: \mu = 500$$
where H0 represents the null hypothesis and μ is the population mean for the
local freshman class.
The null hypothesis always makes a precise statement about a characteristic
of the population, never about a sample.
Remember, the purpose of a hypothesis test is to determine whether a
particular outcome, such as an observed sample mean, could have reasonably
originated from a population with the hypothesized characteristic.
ALTERNATIVE HYPOTHESIS (H1):
The alternative hypothesis (H1) asserts the opposite of the null hypothesis. A
decision to retain the null hypothesis implies a lack of support for the alternative
hypothesis, and a decision to reject the null hypothesis implies support for the
alternative hypothesis.
For the SAT example, the alternative hypothesis is
$$H_1: \mu \neq 500$$
where H1 represents the alternative hypothesis, μ is the population mean for the
local freshman class, and ≠ signifies “is not equal to.”
Confidence Level:
Confidence level = 1 - significance level (alpha).
For example, if our significance level is 0.05, there is a 5% probability of
rejecting the null hypothesis when it is true, and the corresponding
confidence level is 1 - 0.05 = 0.95, or 95%.
WHY HYPOTHESIS TESTS?
Hypothesis tests are used to assess whether a difference between two samples
represents a real difference between the populations from which the samples
were taken.
A null hypothesis of 'no difference' is taken as a starting point, and we
calculate the probability of observing a difference at least as large as the
one found if both sets of data came from the same population.
If the difference is significant, reject the null hypothesis (H0) in favor of
the alternative hypothesis (H1).
Importance of the Standard Error:
If the ratio of the observed difference to the standard error is small enough
to be reasonably attributed to chance, we retain H0.
Otherwise, if the ratio of the observed difference to the standard error is too
large to be reasonably attributed to chance, as in the SAT example, we reject
H0.
Possibility of Incorrect Decisions:
If H0 is true (and, therefore, the hypothesized distribution of z about H0
also is true), there is a slight possibility that, just by chance, the one observed z
actually originates from one of the shaded rejection regions of the hypothesized
distribution of z, thus causing the true H0 to be rejected. This type of incorrect
decision (rejecting a true H0) is referred to as a type I error or a false alarm.
Type I Error :
Rejecting a true null hypothesis.
On first impulse, it might seem desirable to abolish the shaded rejection
regions in the hypothesized sampling distribution to ensure that a true H0
never is rejected.
A most unfortunate consequence of this strategy, however, is that no H0, not
even a radically false H0, ever would be rejected. This second type of
incorrect decision (retaining a false H0) is referred to as a type II error or a
miss.
Type II Error: Retaining a false null hypothesis.
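Both kinds of error can be estimated by simulation. The sketch below assumes the SAT setting (hypothesized mean 500, σ = 110, n = 100, two-tailed cutoff 1.96) and draws each sample mean directly from a normal sampling distribution. The "true" mean of 533 used for the type II case is purely a hypothetical choice, as are the trial count and seed.

```python
import math
import random

def rejection_rate(true_mean, hyp_mean=500, sigma=110, n=100,
                   critical_z=1.96, trials=20000, seed=1):
    """Fraction of simulated tests that reject H0 when the population mean is true_mean."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample_mean = rng.gauss(true_mean, se)   # one draw from the sampling distribution
        z = (sample_mean - hyp_mean) / se
        if abs(z) >= critical_z:
            rejections += 1
    return rejections / trials

# H0 true: every rejection is a type I error (false alarm).
type_i_rate = rejection_rate(true_mean=500)

# H0 false (hypothetical true mean of 533): every retention is a type II error (miss).
type_ii_rate = 1 - rejection_rate(true_mean=533)

print(f"type I error rate:  {type_i_rate:.3f}")   # close to 0.05
print(f"type II error rate: {type_ii_rate:.3f}")
```

Shrinking the rejection regions (a larger cutoff) lowers the first rate and raises the second, which is the trade-off described above.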
STRONG OR WEAK DECISIONS:
Retaining H0 Is a Weak Decision:
H0 is retained whenever the observed z qualifies as a common outcome on
the assumption that H0 is true. Therefore, H0 could be true.
If there is greater than a 5% chance of a result as extreme as the sample result
when the null hypothesis is true, then the null hypothesis is retained.
This does not necessarily mean that the researcher accepts the null hypothesis
as true—only that there is not currently enough evidence to reject it.
Rejecting H0 Is a Strong Decision:
H0 is rejected whenever the observed z qualifies as a rare outcome (one that
could have occurred just by chance with a probability of .05 or less) on the
assumption that H0 is true.
This suspiciously rare outcome implies that H0 is probably false (and
conversely, that H1 is probably true). Therefore, the rejection of H0 can be
viewed as a strong decision.
ONE-TAILED AND TWO-TAILED TESTS:
Two-Tailed or Nondirectional Test:
Rejection regions are located in both tails of the sampling distribution.
One-Tailed or Directional Test:
The rejection region is located in just one tail of the sampling distribution:
either the upper tail (upper tail critical) or the lower tail (lower tail critical).
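The practical difference between the two kinds of test shows up in the critical z values. The sketch below assumes a significance level of 0.05 and uses `statistics.NormalDist` from the Python standard library. A two-tailed test splits the 0.05 between both tails, while a one-tailed test places all of it in one tail.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Positive critical z for a one-tailed or two-tailed test at level alpha."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tails == 2:
        return std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    if tails == 1:
        return std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail
    raise ValueError("tails must be 1 or 2")

two_tailed = critical_z(0.05, tails=2)
one_tailed = critical_z(0.05, tails=1)

print(f"two-tailed: reject H0 if z >= {two_tailed:.2f} or z <= {-two_tailed:.2f}")
print(f"one-tailed, upper tail critical: reject H0 if z >= {one_tailed:.3f}")
print(f"one-tailed, lower tail critical: reject H0 if z <= {-one_tailed:.3f}")
```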
POINT ESTIMATE FOR μ:
A point estimate for μ uses a single value to represent the unknown
population mean.
The best single point estimate for the unknown population mean is simply
the observed value of the sample mean.
CONFIDENCE INTERVAL (CI) FOR μ:
A confidence interval for μ uses a range of values that, with a known degree of
certainty, includes the unknown population mean.
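A confidence interval for μ based on z is commonly built as the sample mean plus and minus a multiple of the standard error. The sketch below assumes that form, a 95 percent level of confidence, and the SAT values from these notes (sample mean 533, σ = 110, n = 100); the function name is illustrative.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, level=0.95):
    """Interval: sample mean plus or minus z_conf times the standard error."""
    se = sigma / math.sqrt(n)
    z_conf = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95 percent
    return sample_mean - z_conf * se, sample_mean + z_conf * se

lower, upper = confidence_interval(sample_mean=533, sigma=110, n=100)
print(f"95% CI for the population mean: {lower:.2f} to {upper:.2f}")
```

The interval does not include 500, which agrees with the earlier decision to reject the null hypothesis at the .05 level.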
LEVEL OF CONFIDENCE:
The level of confidence indicates the percent of time that a series of confidence
intervals includes the unknown population characteristic, such as the population
mean.
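The "percent of time" reading can be checked by simulation. The sketch below assumes a normal population with a true mean of 500 and σ = 110 (hypothetical values borrowed from the SAT example), draws 2000 random samples of size 100, builds a 95 percent interval from each, and counts how often the interval includes the true mean.

```python
import math
import random
import statistics

def coverage(true_mean=500, sigma=110, n=100, z_conf=1.96, trials=2000, seed=3):
    """Fraction of simulated confidence intervals that include the true mean."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - z_conf * se <= true_mean <= x_bar + z_conf * se:
            hits += 1
    return hits / trials

coverage_rate = coverage()
print(f"share of intervals that include the true mean: {coverage_rate:.3f}")
```

Any single interval either includes the true mean or does not; the 95 percent describes the long-run behavior of the procedure.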
EFFECT OF SAMPLE SIZE:
The larger the sample size, the smaller the standard error and, hence, the more
precise (narrower) the confidence interval will be.
Indeed, as the sample size grows larger, the standard error will approach zero
and the confidence interval will shrink to a point estimate.
Given this perspective, the sample size for a confidence interval, unlike that
for a hypothesis test, never can be too large.
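A quick calculation shows how the interval narrows as n grows. The sketch assumes σ = 110 (from the SAT example), a 95 percent level of confidence (z of 1.96), and a handful of illustrative sample sizes. The width of the interval is taken as 2 × 1.96 × σ / √n.

```python
import math

SIGMA = 110    # population standard deviation from the SAT example
Z_CONF = 1.96  # 95 percent level of confidence

def ci_width(n, sigma=SIGMA, z_conf=Z_CONF):
    """Full width of the confidence interval for a sample of size n."""
    return 2 * z_conf * sigma / math.sqrt(n)

# Illustrative sample sizes; each is four times the previous one.
widths = {n: ci_width(n) for n in (25, 100, 400, 1600)}

for n, width in widths.items():
    print(f"n = {n:5d}  standard error = {SIGMA / math.sqrt(n):6.2f}  width = {width:6.2f}")
```

Each fourfold increase in sample size cuts the width in half, since the standard error shrinks with the square root of n.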