Data Screening (Missing Values, Outliers, Normality, etc.)

The purpose of data screening is to: (a) check that data have been entered correctly, such as checking for out-of-range values; (b) check for missing values, and decide how to deal with them; (c) check for outliers, and decide how to deal with them; (d) check for normality, and decide how to deal with non-normality.

1. Finding incorrectly entered data

Your first step with data screening is using "Frequencies":
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.

Output below is for only the four "system" variables in our dataset, because the output for all variables in our dataset would take up too much space in this document. The "Statistics" box tells you the number of missing values for each variable. We will use this information later when we discuss missing values.
Each variable is then presented as a frequency table. For example, below we see the output for "system1". By looking at the coding manual for the "Legal beliefs" survey, you can see that the available responses for "system1" are 1 through 11. By looking at the output below, you can see that there is a number out of range: "13". (NOTE: in your dataset there will not be a "13" because I gave you the screened dataset; I have included the "13" in this example to show you what it looks like when a number is out of range.) Since 13 is an invalid number, you then need to identify why "13" was entered. For example, did the person entering data make a mistake? Or did the subject respond with a "13" even though the question indicated that only numbers 1 through 11 are valid? You can identify the source of the error by looking at the hard copies of the data. For example, first identify which subject gave the "13" by clicking on the variable name to highlight it (system1), and then using the "find" function: Edit --> Find. Then scroll to the left to identify the subject number, and hunt down the hard copy of the data for that subject number.
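The out-of-range check above can be sketched in a few lines of Python. This is an illustration only, not SPSS syntax; the function name and the sample data are made up for this example.

```python
# Hypothetical re-creation of the frequencies check: flag responses that
# fall outside the range the coding manual allows (1 through 11 for
# "system1" in this handout).
def out_of_range(values, low, high):
    """Return (row_index, value) pairs for entries outside [low, high].

    None marks a missing response and is skipped here; missing values
    are handled separately in the next section.
    """
    return [(i, v) for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

# Fabricated example data -- not the actual survey file.
system1 = [3, 7, 11, 13, None, 1, 5]
print(out_of_range(system1, 1, 11))   # -> [(3, 13)]
```

The returned row index plays the role of the subject number you would then look up in the hard copies.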
2. Missing Values

Why do missing values occur? Missing values are either random or non-random. Random missing values may occur because the subject inadvertently did not answer some questions. For example, the study may be overly complex or long, or the subject may be tired or not paying attention, and miss the question. Random missing values may also occur through data entry mistakes. Non-random missing values may occur because the subject purposefully did not answer some questions. For example, the question may be confusing, so many subjects do not answer it. Also, the question may not provide appropriate answer choices, such as "no opinion" or "not applicable", so the subject chooses not to answer it. Also, subjects may be reluctant to answer some questions because of social desirability concerns about the content of the question, such as questions about sensitive topics like past crimes, sexual history, or prejudice toward certain groups.

Why is missing data a problem? Missing values mean reduced sample size and loss of data. You conduct research to measure empirical reality, so missing values thwart the purpose of research. Missing values may also indicate bias in the data. If the missing values are non-random, then the study is not accurately measuring the intended constructs. The results of your study may have been different if the missing data were not missing.

How do I identify missing values?
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.

Output below is for only the four "system" variables in our dataset, because the output for all variables in our dataset would take up too much space in this document. The "Statistics" box tells you the number of missing values for each variable.
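The counting in the "Statistics" box can be sketched as follows (illustrative Python, not SPSS; variable names and data are fabricated):

```python
# Sketch of the "Statistics" box: count missing responses per variable,
# treating None as a missing value.
def missing_report(dataset):
    """dataset maps variable name -> list of responses (None = missing)."""
    return {name: sum(v is None for v in values)
            for name, values in dataset.items()}

# Made-up mini-dataset for illustration.
data = {
    "system1": [3, 7, None, 11, 5],
    "system2": [2, None, None, 9, 4],
}
print(missing_report(data))   # -> {'system1': 1, 'system2': 2}
```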
How do I deal with missing values? Irrespective of whether the missing values are random or non-random, you have three options when dealing with missing values.

Option 1 is to do nothing. Leave the data as is, with the missing values in place. This is the most frequent approach, for a few reasons. First, the number of missing values is typically small. Second, missing values are typically random. Third, even if there are a few missing values on individual items, you typically create composites of the items by averaging them together into one new variable, and this composite variable will not have missing values because it is an average of the existing data. However, if you choose this option, you must keep in mind how SPSS will treat the missing values. SPSS will use either "listwise deletion" or "pairwise deletion" of the missing values. You can elect either one when conducting each test in SPSS.

a. Listwise deletion -- SPSS will not include cases (subjects) that have missing values on the variable(s) under analysis. If you are only analyzing one variable, then listwise deletion is simply analyzing the existing data. If you are analyzing multiple variables, then listwise deletion removes a case (subject) if there is a missing value on any of the variables. The disadvantage is a loss of data, because you are removing all data from subjects who may have answered some of the questions but not others (e.g., the missing data).

b. Pairwise deletion -- SPSS will include all available data. Unlike listwise deletion, which removes cases (subjects) that have missing values on any of the variables under analysis, pairwise deletion only removes the specific missing values from the analysis (not the entire case). In other words, all available data are included. For example, if you are conducting a correlation on multiple variables, then SPSS will conduct the bivariate correlation between all available data points, and ignore missing values only where they exist on some variables.
In this case, pairwise deletion will result in different sample sizes for each correlation. Pairwise deletion is useful when the sample size is small or missing values are numerous, because there are not many values to begin with, so why omit even more with listwise deletion?

c. To better understand how listwise deletion versus pairwise deletion influences your results, try conducting the same test using both deletion methods. Does the outcome change?

Option 2 is to delete cases with missing values. For example, for every missing value in the dataset, you can delete the subject with the missing value. Thus, you are left with complete data for all subjects. The disadvantage of this approach is that you reduce the sample size of your data. If you have a large dataset, this may not be a big disadvantage, because you have enough subjects even after you delete the cases with missing values. Another disadvantage is that the subjects with missing values may be different from the subjects without missing values (e.g., missing values that are non-random), so you have a nonrepresentative sample after removing the cases with missing values. One situation in which I use Option 2 is when particular subjects have not answered an entire scale or page of the study.

Option 3 is to replace the missing values, called imputation. There is little agreement about whether or not to conduct imputation. There is some agreement, however, about which type of imputation to conduct. For example, you typically do NOT conduct Mean substitution or Regression substitution. Mean substitution is replacing the missing value with the mean of the variable. Regression substitution uses regression analysis to replace the missing value. Regression analysis is designed to predict one variable based upon another variable, so it can be used to predict the missing value based upon the subject's answer to another variable.
Both Mean substitution and Regression substitution can be found using: Transform --> Replace Missing Values. The favored type of imputation is replacing the missing values using more sophisticated estimation methods. The "Missing Values Analysis" add-on module contains these estimation methods (under Analyze --> Missing Value Analysis); versions of SPSS without the add-on module do not have them.
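The three ideas above -- listwise deletion, pairwise deletion, and (discouraged) mean substitution -- can be sketched in Python. This is an illustration of the concepts, not of SPSS's internals, and the two toy variables are made up:

```python
# Two fabricated variables with missing values (None = missing).
x = [1, 2, None, 4, 5]
y = [2, None, 3, 4, None]

# Listwise: keep only cases complete on *all* variables under analysis.
listwise = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

# Pairwise: for this x-y pair, keep cases complete on *these two* variables.
# With two variables this matches listwise; with three or more, each pair
# gets its own n, which is why sample sizes differ across correlations.
pairwise_n = len([1 for a, b in zip(x, y) if a is not None and b is not None])

# Mean substitution (generally discouraged, as noted above): replace each
# missing value with the variable's mean over the observed cases.
def mean_substitute(values):
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

print(len(listwise), pairwise_n)   # -> 2 2
print(mean_substitute(x))          # -> [1, 2, 3.0, 4, 5]
```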
3. Outliers -- what are outliers?

Outliers are extreme values compared to the rest of the data. The determination of values as "outliers" is subjective. While there are a few benchmarks for determining whether a value is an "outlier", those benchmarks are arbitrarily chosen, similar to how "p<.05" is also arbitrarily chosen.

Should I check for outliers? Outliers can render your data non-normal. Since normality is one of the assumptions for many of the statistical tests you will conduct, finding and eliminating the influence of outliers may render your data normal, and thus appropriate for analysis using those statistical tests. However, I know no one who checks for outliers. Just because a value is extreme compared to the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed. The subject chose to respond with that value, so removing that value is arbitrarily throwing away data simply because it does not fit the "assumption" that data should be "normal". Conducting research is about discovering empirical reality. If the subject chose to respond with that value, then that data is a reflection of reality, so removing the "outlier" is the antithesis of why you conduct research.

There is one more (less theoretical, more practical) reason why I know no one who conducts outlier analysis. It is common practice to use multiple questions to measure constructs because it increases the power of your statistical analysis. You typically create a "composite" score (the average of all the questions) when analyzing your data. For example, in a study about happiness, you may use an established happiness scale, or create your own happiness questions that measure all the facets of the happiness construct. When analyzing your data, you average together all the happiness questions into one happiness composite measure.
While there may be some outliers in each individual question, averaging the items together reduces the probability of outliers, due to the increased amount of data composited into the variable.

Checking outliers:
1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window.
3. Click "Statistics", and check "Outliers".
4. Click "Plots", and uncheck "Stem-and-leaf".
5. Click OK.

Output on the next page is for "system1". The "Descriptives" box tells you descriptive statistics about the variable, including the values of Skewness and Kurtosis, with an accompanying standard error for each. This information will be useful later when we talk about "normality". The "5% Trimmed Mean" indicates the mean value after removing the top and bottom 5% of scores. By comparing the "5% Trimmed Mean" to the "Mean", you can identify whether extreme scores (such as outliers that would be removed when trimming the top and bottom 5%) are having an influence on the variable.
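The idea behind the 5% trimmed mean can be sketched in Python (illustrative only; the scores are fabricated, and SPSS's exact trimming rule may round differently):

```python
# Drop the top and bottom 5% of scores, then average what is left.
# A large gap between this and the ordinary mean suggests extreme
# scores are pulling the mean around.
def trimmed_mean(values, proportion=0.05):
    data = sorted(values)
    k = int(len(data) * proportion)        # scores to drop from each end
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

# Four fabricated low outliers among 36 ordinary scores.
scores = [1, 1, 1, 1] + [6] * 36
print(sum(scores) / len(scores))   # ordinary mean -> 5.5
print(trimmed_mean(scores))        # trimmed mean, approx. 5.72
```

The trimmed mean moves toward 6 because two of the extreme "1" scores are dropped from each end before averaging.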
"Extreme Values" and the Boxplot relate to each other. The boxplot is a graphical display of the data that shows: (1) the median, which is the middle black line; (2) the middle 50% of scores, which is the shaded region; (3) the top and bottom 25% of scores, which are the lines extending out of the shaded region; (4) the smallest and largest (non-outlier) scores, which are the horizontal lines at the ends of the boxplot; and (5) outliers. The boxplot shows both "mild" outliers and "extreme" outliers. Mild outliers are any scores more than 1.5*IQR from the rest of the scores, and are indicated by open dots. IQR stands for "Interquartile Range", the range spanned by the middle 50% of the scores. Extreme outliers are any scores more than 3*IQR from the rest of the scores, and are indicated by stars. However, keep in mind that these benchmarks are arbitrarily chosen, similar to how p<.05 is arbitrarily chosen. For "system1", there is an open dot. Notice that the dot says "42", but, by looking at the "Extreme Values" box, there are actually FOUR lowest scores of "1", one of which is case 42. Since all four scores of "1" overlap each other, the boxplot can only display one case. In summary, this output tells us there are four outliers, each with a value of "1".
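The mild/extreme fences can be sketched in Python. This is an illustration with made-up data; quartile conventions vary, and SPSS's quartile rule may differ slightly from the "inclusive" method used here:

```python
from statistics import quantiles

# Mild outliers lie more than 1.5*IQR beyond the quartiles, extreme
# outliers more than 3*IQR -- the same arbitrary benchmarks noted above.
def classify_outliers(values):
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)
    return mild, extreme

scores = [1, 4, 4, 5, 5, 5, 6, 6, 7, 20]   # fabricated example
print(classify_outliers(scores))            # -> ([1], [20])
```

Here the low score of 1 is flagged as a mild outlier (open dot) and the 20 as an extreme outlier (star).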
4. Outliers -- checking outliers within groups

Another way to look for univariate outliers is to do outlier analysis within different groups in your study. For example, imagine a study that manipulated the presence or absence of a weapon during a crime, and the Dependent Variable measured the level of emotional reaction to the crime. In addition to looking for univariate outliers on your DV, you may also want to look for univariate outliers within each condition. In our dataset about "Legal Beliefs", let's treat gender as the grouping variable.

1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window. Move "sex" into the "Factor List".
3. Click "Statistics", and check "Outliers".
4. Click "Plots", and uncheck "Stem-and-leaf".
5. Click OK.

Output below is for "system1". The "Descriptives" box tells you descriptive statistics about the variable. Notice that information for "males" and "females" is displayed separately.
"Extreme Values" and the Boxplot relate to each other. Notice the difference between males and females.
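Running the same fence check separately within each group (mirroring the "Factor List") can be sketched as follows. The grouping variable and data are fabricated, and the quartile method is the same "inclusive" approximation as before:

```python
from statistics import quantiles

# Apply an IQR fence within each level of a grouping variable.
def outliers_by_group(values, groups, fence=1.5):
    result = {}
    for g in dict.fromkeys(groups):           # preserve first-seen order
        sub = [v for v, grp in zip(values, groups) if grp == g]
        q1, _, q3 = quantiles(sub, n=4, method="inclusive")
        iqr = q3 - q1
        result[g] = [v for v in sub
                     if v < q1 - fence * iqr or v > q3 + fence * iqr]
    return result

system1 = [5, 6, 5, 1, 6, 7, 5, 20]                 # fabricated
sex     = ["m", "m", "m", "m", "f", "f", "f", "f"]
print(outliers_by_group(system1, sex))   # -> {'m': [1], 'f': [20]}
```

A score can be an outlier within its group even when it looks unremarkable in the pooled data, which is the point of the grouped analysis.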
5. Outliers -- dealing with outliers

First, we need to identify why the outlier(s) exist. It is possible the outlier is due to a data entry mistake, so you should first conduct the check described above in "1. Finding incorrectly entered data" to ensure that any outlier you find is not due to data entry errors. It is also possible that the subjects responded with the "outlier" value for a reason. For example, maybe the question is poorly worded or constructed. Or maybe the question is adequately constructed, but the subjects who responded with the outlier values are different from the subjects who did not respond with the extreme scores. You can create a new variable that categorizes all the subjects as either "outlier subjects" or "non-outlier subjects", and then re-examine the data to see if there is a difference between these two types of subjects. Also, by looking at the subject numbers for the outliers displayed in all the boxplots, you may find that the same subjects are responsible for outliers on many questions in the survey. Remember, however, that just because a value is extreme compared to the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed.

Second, if you want to reduce the influence of the outliers, you have four options.

Option 1 is to delete the value. If you have only a few outliers, you may simply delete those values, so they become blank or missing values.

Option 2 is to delete the variable. If you feel the question was poorly constructed, or if there are too many outliers in that variable, or if you do not need that variable, you can simply delete the variable. Also, if transforming the value or variable (e.g., Options #3 and #4) does not eliminate the problem, you may want to simply delete the variable.

Option 3 is to transform the value. You have a few options for transforming the value. You can change the value to the next highest (non-outlier) number.
For example, if you have a 100-point scale, and you have two outliers (95 and 96), and the next highest (non-outlier) number is 89, then you could simply change the 95 and 96 to 89s. Alternatively, if the two outliers were 5 and 6, and the next lowest (non-outlier) number was 11, then the 5 and 6 would change to 11s. Another option is to change the value to the next highest (non-outlier) number PLUS one unit increment. For example, the 95 and 96 would change to 90s (i.e., 89 plus 1 unit higher). The 5 and 6 would change to 10s (i.e., 11 minus 1 unit lower).

Option 4 is to transform the variable. Instead of changing the individual outliers (as in Option #3), we are now talking about transforming the entire variable. Transformation creates normal distributions, as described in the next section about "Normality". Since outliers are one cause of non-normality, see the next section to learn how to transform variables, and thus reduce the influence of outliers.

Third, after dealing with the outlier, re-run the outlier analysis to determine whether any new outliers emerge or the data are outlier free. If new outliers emerge, and you want to reduce their influence, choose one of the four options again. Then re-run the outlier analysis, and repeat as needed.

6. Normality

Below, I describe five steps for determining and dealing with normality. However, the bottom line is that almost no one checks their data for normality; instead, they assume normality and use the statistical tests that are based upon assumptions of normality, because those tests have more power (ability to find significant results in the data).

First, what is normality? A normal distribution is a symmetric bell-shaped curve defined by two things: the mean (average) and variance (variability).

Second, why is normality important? The central idea behind statistical inference is that as sample size increases, distributions will approximate normal. Most statistical tests rely upon the assumption that your data are "normal". Tests that rely upon the assumption of normality are called parametric tests. If your data are not normal, then you would use statistical tests that do not rely upon the assumption of normality, called non-parametric tests. Non-parametric tests are less powerful than parametric tests, which means they have less ability to detect real differences or variability in your data. In other words, you want to conduct parametric tests because you want to increase your chances of finding significant results.

Third, how do you determine whether data are "normal"? There are three interrelated approaches to determine normality, and all three should be conducted. First, look at a histogram with the normal curve superimposed. A histogram provides a useful graphical representation of the data. SPSS can also superimpose the theoretical "normal" distribution onto the histogram of your data so that you can compare your data to the normal curve. To obtain a histogram with the superimposed normal curve:

1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click "Charts", and select "Histogram, with normal curve".
4. Click OK.

Output below is for "system1". Notice the bell-shaped black line superimposed on the distribution. All samples deviate somewhat from normal, so the question is how much deviation from the black line indicates "non-normality"? Unfortunately, graphical representations like histograms provide no hard-and-fast rules. After you have viewed many (many!) histograms, over time you will get a sense for the normality of data. In my view, the histogram for "system1" shows a fairly normal distribution.
Second, look at the values of Skewness and Kurtosis. Skewness involves the symmetry of the distribution. A distribution with normal skewness is perfectly symmetric. A positively skewed distribution has scores clustered to the left, with the tail extending to the right. A negatively skewed distribution has scores clustered to the right, with the tail extending to the left. Kurtosis involves the peakedness of the distribution. A distribution with normal kurtosis is bell-shaped: neither too peaked nor too flat. Positive kurtosis is indicated by a peak. Negative kurtosis is indicated by a flat distribution. Descriptive statistics for skewness and kurtosis can be found using the Frequencies, Descriptives, or Explore commands. I like to use the "Explore" command because it provides other useful information about normality, so:

1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window.
3. Click "Plots", and uncheck "Stem-and-leaf".
4. Click OK.

The "Descriptives" box tells you descriptive statistics about the variable, including the values of Skewness and Kurtosis, with an accompanying standard error for each. Both Skewness and Kurtosis are 0 in a normal distribution, so the farther the values are from 0, the more non-normal the distribution. The question is how much skew or kurtosis renders the data non-normal? This is an arbitrary determination, and sometimes difficult to judge from the values of Skewness and Kurtosis alone. Luckily, there are more objective tests of normality, described next.
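The skewness and kurtosis numbers can be sketched with the simple moment formulas below. Note this is an illustration: SPSS applies small-sample corrections, so its Explore output will be close to, but not identical with, these values. The sample data are made up.

```python
# Sample skewness and excess kurtosis (both 0 for a perfect normal
# distribution) from the raw moments of the data.
def skew_kurtosis(values):
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3          # "excess" kurtosis: normal -> 0
    return skew, kurt

# A right-skewed fabricated sample: skewness comes out positive.
sample = [1, 1, 2, 2, 2, 3, 3, 4, 9]
skew, kurt = skew_kurtosis(sample)
print(round(skew, 2), round(kurt, 2))
```

The single high score of 9 stretches the right tail, which is exactly what a positive skewness value reports.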
Third, the descriptive statistics for Skewness and Kurtosis are not as informative as established tests for normality that take into account both Skewness and Kurtosis simultaneously. The Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk (S-W) test are designed to test normality by comparing your data to a normal distribution with the same mean and standard deviation as your sample:

1. Select Analyze --> Descriptive Statistics --> Explore
2. Move all variables into the "Variable(s)" window.
3. Click "Plots", uncheck "Stem-and-leaf", and check "Normality plots with tests".
4. Click OK.

The "Tests of Normality" box gives the K-S and S-W test results. If the test is NOT significant, then the data are normal, so any value above .05 indicates normality. If the test is significant (less than .05), then the data are non-normal. In this case, both tests indicate the data are non-normal. However, one limitation of the normality tests is that the larger the sample size, the more likely you are to get significant results. Thus, you may get significant results with only slight deviations from normality. In this case, our sample size is large (n = 327), so the significance of the K-S and S-W tests may indicate only slight deviations from normality. You need to eyeball your data (using histograms) to determine for yourself whether the data rise to the level of non-normal.
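The core idea of the one-sample K-S test can be sketched as follows. This is a rough illustration, not SPSS's procedure: SPSS applies the Lilliefors correction when the mean and SD are estimated from the sample, so its p-values will differ; the 1.36/sqrt(n) cutoff used here is a standard large-sample approximation for alpha = .05 without that correction. The data are fabricated.

```python
from statistics import NormalDist, mean, stdev

# D is the largest gap between the sample's empirical CDF and the CDF of
# a normal distribution with the sample's own mean and SD.
def ks_normality(values):
    data = sorted(values)
    n = len(data)
    ref = NormalDist(mean(data), stdev(data))
    d = max(max(abs((i + 1) / n - ref.cdf(v)),
                abs(i / n - ref.cdf(v)))
            for i, v in enumerate(data))
    return d, 1.36 / n ** 0.5        # (D statistic, approx .05 cutoff)

# Heavily skewed fabricated data: D should exceed the cutoff.
sample = [1] * 30 + [2] * 5 + [10] * 5
d, cutoff = ks_normality(sample)
print(d > cutoff)                     # -> True
```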
The "Normal Q-Q Plot" provides a graphical way to determine the level of normality. The black line indicates the values your sample should adhere to if the distribution were normal. The dots are your actual data. If the dots fall exactly on the black line, then your data are normal. If they deviate from the black line, your data are non-normal. In this case, you can see substantial deviation from the straight black line.
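The numbers behind a Normal Q-Q plot can be sketched without any plotting: each sorted score is paired with the value a normal distribution (with the sample's mean and SD) would predict for that rank. Points "on the line" simply mean the two members of each pair are roughly equal. SPSS's exact plotting positions may differ slightly from the (i - 0.5)/n rule assumed here, and the data are made up.

```python
from statistics import NormalDist, mean, stdev

# Pair each sorted observation with its expected normal quantile.
def qq_points(values):
    data = sorted(values)
    n = len(data)
    ref = NormalDist(mean(data), stdev(data))
    return [(ref.inv_cdf((i - 0.5) / n), v)
            for i, v in enumerate(data, start=1)]

sample = [2, 4, 4, 5, 5, 6, 6, 8]     # fabricated, roughly symmetric
for expected, observed in qq_points(sample):
    print(round(expected, 2), observed)
```

For roughly normal data like this, the expected and observed columns track each other closely; a skewed sample would bend away from the diagonal at one end.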
Fourth, if your data are non-normal, what are your options for dealing with non-normality? You have four basic options.

a. Option 1 is to leave your data non-normal, and conduct the parametric tests that rely upon the assumptions of normality. Just because your data are non-normal does not instantly invalidate the parametric tests. Normality (versus non-normality) is a matter of degree, not a strict cut-off point. Slight deviations from normality may render the parametric tests only slightly inaccurate. The issue is the degree to which the data are non-normal.

b. Option 2 is to leave your data non-normal, and conduct the non-parametric tests designed for non-normal data.

c. Option 3 is to transform the data. Transforming your data involves using mathematical formulas to modify the data into normality.
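The transformation option can be sketched with a log transformation, a common choice for pulling in a long right tail. This is an illustration with fabricated data, using the simple moment formula for skewness (0 = symmetric); note that log10 requires positive values, so shift the variable first if it contains zeros or negatives.

```python
from math import log10

# Simple moment-based skewness: positive = right tail.
def skewness(values):
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 1, 2, 2, 3, 3, 4, 5, 40]          # fabricated long right tail
logged = [log10(v) for v in raw]            # the transformation itself
print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The transformed variable's skewness is much closer to 0 than the raw variable's, which is the sense in which transformation "creates" normality; you would then re-run the normality checks on the transformed variable.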