4.18 Intro to Data Science: Measures of Dispersion

  • We’ve considered the measures of central tendency: mean, median and mode.
  • These help us categorize typical values in a group.
  • An entire group is called a population.
  • Sometimes a population is quite large, such as the people likely to vote in the next U.S. presidential election, a group of more than 100,000,000 people.
  • For practical reasons, the polling organizations trying to predict who will become the next president work with carefully selected small subsets of the population known as samples.
  • Here we introduce measures of dispersion (also called measures of variability) that help you understand how spread out the values are.
  • We’ll calculate each measure of dispersion both by hand and with functions from the module statistics, using the following population of 10 six-sided die rolls:

    1, 3, 4, 2, 6, 5, 3, 4, 5, 2

Variance

  • To determine variance, begin with the mean of these values—3.5.
  • Next, subtract the mean from every die value:

    -2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5

  • Then, square each of these results (yielding only positives):

    6.25, 0.25, 0.25, 2.25, 6.25, 2.25, 0.25, 0.25, 2.25, 2.25

  • Finally, calculate the mean of these squares, which is 2.25 (22.5 / 10)—this is the population variance.
  • Squaring the difference between each die value and the mean of all die values emphasizes outliers—the values that are farthest from the mean—which can be important in data analysis.
  • The following code uses the statistics module’s pvariance function to confirm our manual result:
In [1]:
import statistics
In [2]:
statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[2]:
2.25
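
The manual steps above can also be traced in code. The following is a minimal sketch, not part of the original notebook; the variable names (rolls, deviations, squares, manual_variance) are chosen here for illustration only.

```python
import statistics

# The ten die rolls used throughout this section.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Step 1: the mean of the values.
mean = sum(rolls) / len(rolls)

# Step 2: subtract the mean from every die value.
deviations = [value - mean for value in rolls]

# Step 3: square each difference.
squares = [d ** 2 for d in deviations]

# Step 4: the mean of the squares is the population variance.
manual_variance = sum(squares) / len(squares)

print(mean)             # 3.5
print(manual_variance)  # 2.25
print(manual_variance == statistics.pvariance(rolls))  # True
```

Each intermediate list matches the values worked out by hand above, so printing deviations or squares is a quick way to check a single step.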

Standard Deviation

  • The standard deviation is the square root of the variance (in this case, 1.5), which tones down the effect of the outliers.
  • The smaller the variance and standard deviation are, the closer the data values are to the mean and the less overall dispersion (that is, spread) there is between the values and the mean.
  • The following code calculates the population standard deviation with the statistics module’s pstdev function, confirming our manual result:
In [3]:
statistics.pstdev([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[3]:
1.5
In [4]:
import math
In [5]:
math.sqrt(statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2]))
Out[5]:
1.5
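
To illustrate how a smaller variance and standard deviation indicate values closer to the mean, the sketch below compares the die rolls with a second data set. The clustered list is hypothetical data invented for this comparison; it was chosen only because it has the same mean (3.5) as the die rolls.

```python
import statistics

# The die rolls from this section.
rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Hypothetical, tightly clustered values with the same mean (3.5).
# This list is an assumption made for illustration only.
clustered = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]

# Print the mean, population variance and population standard
# deviation of each data set.
for name, values in [('rolls', rolls), ('clustered', clustered)]:
    print(name,
          statistics.mean(values),
          statistics.pvariance(values),
          statistics.pstdev(values))
```

Both data sets have the same mean, but every clustered value is only 0.5 away from it, so its variance (0.25) and standard deviation (0.5) are smaller than the 2.25 and 1.5 calculated for the die rolls.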

Advantage of Population Standard Deviation vs. Population Variance

  • Suppose you’ve recorded the March Fahrenheit temperatures in your area.
  • You might have 31 numbers such as 19, 32, 28 and 35.
  • The units for these numbers are degrees.
  • When you square the differences between your temperatures and their mean to calculate the population variance, the units of the population variance become “degrees squared.”
  • When you take the square root of the population variance to calculate the population standard deviation, the units once again become degrees, which are the same units as your temperatures.
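
As a rough illustration of the units, the sketch below uses only the four sample temperatures mentioned above rather than a full month of 31 readings, so the results describe those four values and nothing more. The variable names are chosen here for illustration.

```python
import statistics

# Only the four sample values mentioned in the text; a real March
# record would contain 31 readings.
temperatures = [19, 32, 28, 35]  # degrees Fahrenheit

variance = statistics.pvariance(temperatures)  # units: degrees squared
std_dev = statistics.pstdev(temperatures)      # units: degrees

print(f'variance: {variance} degrees squared')
print(f'standard deviation: {std_dev:.2f} degrees')
```

For these four values the variance is 36.25 degrees squared, while the standard deviation is about 6.02 degrees, which can be compared directly with the temperatures themselves.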

©1992–2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 4 of the book Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud.
