Give You a Definite Maybe
An Introductory Handbook for Probability, Statistics, and Excel
has been prepared by Ian Johnston of Malaspina University-College (now Vancouver
Island University), in Nanaimo, BC, for use in Liberal Studies. The
text is in the public domain and may be used by anyone, in whole or in part,
without permission and without charge, provided the source is acknowledged.
Released May 2000. Minor editorial
and formatting changes were made in November 2004.
Five: A Normal Distribution and The Normal Curve
In the previous sections we have talked about frequency
distributions. These refer, you
will recall, to the particular scores in a set of results. Some results are more
frequent than others, and the density
of the distribution may vary (e.g., clustered near the mean, spread out
widely on either side of the mean, bunched at either end of the values, and so
on). There are innumerable ways in
which the scores in a set of results can be distributed.
Of particular interest to us in the remainder of this module is a frequency
distribution in which the dispersion of the scores is symmetrical about the
central mean, that is, in which the frequency of results above the average
matches exactly the frequency of
results below the average and in which the most frequent result falls at the
average (see Histogram D in the sample histograms in Section Two). In such a
distribution, the high point (the most frequent result) will be
in the middle of the distribution, and each side of the high point will be a
mirror image of the other.
In such a distribution, the histogram would be perfectly symmetrical around the
centre; in other words, the tallest column (i.e., the most frequent value) will
occur exactly in the centre of the diagram and the other columns (frequencies)
will fall away equally on either side of the central value. Such a perfectly
symmetrical distribution is called a normal
distribution (in popular language the shape of this frequency distribution
is commonly called a bell curve).
Note that a normal distribution may come with very different dimensions (tall and
skinny, short and wide), but the characteristics mentioned above hold in all
cases (the high point, i.e., the most frequent value, is always in the centre,
and the two sides of the curve are perfectly symmetrical). In other words, the
characteristic bell shape is always present.
Here are a few examples of histograms illustrating normal distributions. These
histograms illustrate the probability distribution for success in
various coin tosses. The x-axis here indicates the number of heads in a particular
sequence of coin tosses; the y-axis represents the theoretical frequency of that
result in the given number of fair tosses.
These histograms will have different sizes and shapes, because the frequency
distribution changes with the number of tosses. But notice that all the
histograms are perfectly symmetrical
around the centre (the tallest and therefore most frequent value).
Note, once again, that these diagrams represent probability distributions, or the
frequency of results theoretically calculated. And since the total of all the
probabilities for an event equals 1, the
shaded area contained in all the columns equals 1.
The histogram above indicates that in a three-coin-toss sequence (or three coins
tossed simultaneously) there are four possible results: 0 heads, 1 head, 2 heads,
and 3 heads (the values on the X-Axis). The
percentage frequency of these four possibilities we read off the Y-Axis. We read
the following diagrams in the same way: the number of heads on
the X-Axis, and the percent probability on the Y-Axis. Notice the perfect
symmetry in these distributions.
In the above histogram for a six-coin-toss sequence, there are seven possible
results (from 0 heads to 6 heads). The most frequent result is in the centre (3
heads), and the frequencies decline as one moves away from the centre (indicated
by the decreasing height of the columns).
Notice in the above histogram (for 20 coin tosses) how at the extremes (0, 1, 2,
18, 19, 20) the percent probability is so small that the value does not show on
the graph. Virtually all the results in
a 20-coin-toss sequence will fall between 3 and 17, with the most frequent value
in the centre (at 10). The
frequencies on either side of 10 are perfectly symmetrical (we can see that by
the equal heights of 9 and 11, of 8 and 12, of 7 and 13, of 6 and 14, of 5 and
15, of 4 and 16, and of 3 and 17).
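The theoretical frequencies shown in these coin-toss histograms can be recomputed directly. The following is a minimal sketch in Python (the handbook itself uses Excel, so the language and the function name heads_distribution are assumptions made only for illustration). It counts the number of ways of getting each number of heads and divides by the total number of equally likely sequences.

```python
from math import comb

def heads_distribution(n_tosses):
    """Theoretical probability of 0, 1, ..., n_tosses heads with a fair coin."""
    total_outcomes = 2 ** n_tosses  # every sequence of heads/tails is equally likely
    return [comb(n_tosses, k) / total_outcomes for k in range(n_tosses + 1)]

for n in (3, 6, 20):
    probs = heads_distribution(n)
    row = ", ".join(f"{k}: {100 * p:.2f}%" for k, p in enumerate(probs))
    print(f"{n} tosses -> {row}")
```

For 20 tosses the printed values at 0, 1, 2, 18, 19, and 20 heads are each about 0.02 percent or less, which is why those columns do not show on the graph.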
B. The Normal Curve
Notice how in the diagrams above, as the number of columns increases, the entire
shape of the histogram begins to approximate a curve, with the shaded areas all
under the top line. And, in fact, we can
readily convert these histograms (using rectangles) to a curve by joining up the
central points on the top of each column.
What we have when we do this is exactly the same frequency distribution picture
as we had with the columns, except that we have filled in the gaps between
columns. Now we do not have the body of the columns, but that does not matter,
because the important part of the histogram picture is the line defined by the
top centre points of the columns (which indicates the percent probability of any
particular value along the x-axis). In
such a diagram, the important factor is the area under the curve, for that
graphically presents the total frequencies. Equal areas under such a curve will
represent equal frequencies (more on this later).
When we join up the columns in the histogram in this way, we produce a
particularly useful statistical shape, the normal curve.
The normal curve or the normal distribution is an extremely important statistical
concept, as important in many areas of enquiry as the right-angle triangle is in
Euclidean geometry, and for the remainder of our short study of statistics we
shall be dealing only with this frequency distribution. So understand clearly
what the normal distribution means.
When we say that a particular population characteristic is normally distributed,
we mean the following:
1. The normal frequency curve shows that the highest frequency falls in the
centre (i.e., at the mean of the values in the distribution) with an equal and
exactly similar curve on either side of that centre. Thus, the most frequent
value in a normal distribution is the average,
with half the values falling below the average and half above it.
2. The normal curve, often called a bell curve, is perfectly symmetrical.
Therefore the mean (the arithmetical average), the mode (the most
frequent value), and the median (the middle value) will coincide at the centre
of the curve (the high point). Make
sure you understand this point.
3. The further away any particular value is from the average (above or
below), the less frequent that value will be (i.e., the frequencies will
diminish on either side of the high central point).
4. Because the two halves on either side of the centre are exactly
symmetrical, the frequency of values above the mean will match exactly the
frequencies of values below the mean, provided the distances between the values
and mean are identical. Thus, the
frequency of a value 3 units to the right of the mean will be identical to the
frequency of the value 3 units to the left of the mean. This is a key idea;
please make sure you understand it.
5. The total frequency of all values in the population will be contained by
the area under the curve. This is
obvious enough, since the total area under the curve represents all the possible
occurrences of that characteristic.
6. Various areas under the curve will therefore indicate the percentage of
the total frequency. For instance, 50 percent of the area under the curve lies to
the left of the mean (i.e., half of all normally distributed results will fall
in this area), and 50 percent of the area under the curve lies to the right of
the mean. Therefore, 50 percent of
all scores will lie to the left and 50 percent to the right of the mean. Equal
areas under the curve represent equal numbers in the frequency. Again, please
make sure you understand this important idea.
7. Normal curves may have different shapes (i.e., tall and skinny, short and
low, and so on). What will
determine the overall shape of the symmetrical curve will be the values of the
mean and the standard deviation in the population (these will define the shape
in the same way the centre point and the radius define a circle). But the general
characteristics listed above will remain the same.
Please make very sure you understand each one of the above points, because much
of what we do from this point on assumes that you are quite familiar with the
properties of the normal curve.
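For readers who like to see the formula behind these properties, the standard equation of the normal curve is sketched below. It is not given in this handbook and is not needed for the exercises; the symbols \mu (the mean), \sigma (the standard deviation), and d (any distance from the mean) are notation introduced here only for illustration. The first expression shows that the mean and standard deviation alone fix the shape (point 7), the second states the symmetry in point 4, and the third states that the total area under the curve is 1 (point 5).

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
f(\mu + d) = f(\mu - d),
\qquad
\int_{-\infty}^{\infty} f(x)\,dx = 1 .
```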
Normal distributions are particularly important for a number of reasons (as we
shall see), not the least of which is that many of the important characteristics
we wish to study (including all inherited characteristics) are normally
distributed. What that means is
that if we gather a very large number of samples of a particular measurement
(e.g., height) and construct a frequency distribution, the result will be
normal, that is, will manifest the characteristics listed above.
Note carefully that the normal curve is a theoretical depiction of the
distribution of frequencies of the values. It
does not tell us that in any particular series of measurements of a normally
distributed item half must lie above and half below the mean. It indicates that,
in any series of values, there is a .5 probability that any particular score will
lie above the mean and a .5 probability that it will lie below, and that the
average will fall in the centre of the distribution. Or,
put another way, in any measurement of a heritable characteristic (height,
intelligence, weight, and so on) 50 percent of the population will be below the
arithmetical average (the mean), because such characteristics are normally
distributed. It is not the case
that in any distribution exactly 50 percent of the population will fall below
the mean, but that must be the case if the frequency distribution is a normal
one.
Not all values are normally distributed (please remember that): for example, the
salaries of those working at Malaspina University-College, the responses to a
public opinion questionnaire, levels of contaminant in the Georgia Strait. But
what makes this particular frequency distribution so important is
that a great many things in our world are normally distributed (e.g., population
heights, mortality rates, stock market fluctuations, yearly temperature
averages, girth of trees, all repeated human measurements of a single natural
phenomenon, heritable characteristics, and so on). It is an enormously useful and
important analytical concept (2).
C. Further Properties of the Normal Curve
We have noted above some of the properties of the normal curve (most frequent
value is at the centre, symmetry about the central value, diminishing frequency
with the distance from the centre). However,
there are many more important features.
You may have noticed that the shape of the curve in a normal distribution has a
clear point on each side where the curve goes from concave (bulging outward) to
convex (bulging inwards). If you
were walking up the curve you would notice that at first the slope increases,
but at a particular point it would begin to decrease as you approach the summit.
The point at which this occurs is called the point of inflection.
If one draws a perpendicular line from the points of inflection, one on either
side of the mean, to the base line (the X-axis), then the distance from that
point to the value of the mean on the X-axis (in the centre) is equal to the
standard deviation. Make sure you understand
this very important property of the normal curve.
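The claim that the curve changes its bend exactly one standard deviation from the mean can be checked numerically. The sketch below is an illustration only: it uses Python's standard statistics.NormalDist, an arbitrary mean of 10 and standard deviation of 3, and a helper named curvature_sign that is a hypothetical name, not part of any library. The helper looks at three neighbouring heights of the curve and reports whether the middle one lies below the line joining its neighbours (+1, the slope is still increasing) or above it (-1, the slope is decreasing).

```python
from statistics import NormalDist

def curvature_sign(dist, x, step=1e-3):
    """+1 where the slope of the curve is increasing at x, -1 where it is decreasing."""
    second_difference = dist.pdf(x - step) - 2 * dist.pdf(x) + dist.pdf(x + step)
    return 1 if second_difference > 0 else -1

mean, sd = 10, 3                      # illustrative values only
curve = NormalDist(mean, sd)

# Sample points just outside and just inside one standard deviation from the mean.
for x in (mean - sd - 0.5, mean - sd + 0.5, mean, mean + sd - 0.5, mean + sd + 0.5):
    print(f"x = {x:5.1f}   sign = {curvature_sign(curve, x):+d}")
```

The sign flips between 6.5 and 7.5 and again between 12.5 and 13.5, that is, on either side of 7 and 13, which are the mean minus and plus one standard deviation.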
Notice that these two perpendicular lines drawn from the points of inflection on
either side of the mean divide the area under the curve further, so that we now
have four separate areas, as follows (see diagram on the next page):
1. The area between the mean and one standard deviation above the mean (Area A);
2. The area between the mean and one standard deviation below the mean (Area B);
3. The area to the right of one standard deviation above the mean (Area C);
4. The area to the left of one standard deviation below the mean (Area D).
Since the normal curve is perfectly symmetrical, Area A will equal Area B, and
Area C will equal Area D. And the total of
A, B, C, and D will equal the total area under the curve (i.e., the entire
population). Since the curve never
quite touches the X-axis at either end, there may be a value beyond the tails (a
highly improbable value), but its frequency will be so low that we can virtually
ignore it.
Calculations indicate that in any normal distribution, no matter what its height
or width, about 68 percent of all the
observations fall within one standard deviation from the mean (i.e., in Areas A
and B combined). Thus, 34 percent
will lie between the mean and 1 standard deviation above the mean (in Area A),
and 34 percent between the mean and 1 standard deviation below the mean (in Area
B). Hence, in a normal distribution
32 percent of the observations will fall outside 1 standard deviation, 16
percent on either side (i.e., 16 percent of the population will fall in Area C
and 16 percent in Area D).
We may express this, more appropriately, in the language of probability, as
follows: in any normal distribution, there is approximately a .68 probability
that a particular value will fall within 1 standard deviation (SD) of the mean;
there is approximately a .34 probability that a particular value will lie
between the mean and 1 SD above the mean (in Area A) and approximately a .34
probability that a particular value will lie between the mean and 1 SD below the
mean (in Area B). Similarly, there
is approximately a .16 probability that a particular value will lie more than
1 SD above the mean (in Area C), and approximately a .16 probability that a
particular value will lie more than 1 SD below the mean (in Area D).
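The four areas can be computed rather than looked up. The following sketch assumes Python's standard statistics.NormalDist (not a tool the handbook uses) and works with the standard normal curve, whose mean is 0 and standard deviation is 1; because the percentages are the same for every normal curve, nothing is lost by that choice. The cdf method gives the area under the curve to the left of a value.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

area_a = z.cdf(1) - z.cdf(0)    # between the mean and 1 SD above it
area_b = z.cdf(0) - z.cdf(-1)   # between 1 SD below the mean and the mean
area_c = 1 - z.cdf(1)           # to the right of 1 SD above the mean
area_d = z.cdf(-1)              # to the left of 1 SD below the mean

for name, area in (("A", area_a), ("B", area_b), ("C", area_c), ("D", area_d)):
    print(f"Area {name}: {area:.4f}")
print(f"Total:  {area_a + area_b + area_c + area_d:.4f}")
```

The results (about .3413 for A and B, about .1587 for C and D) round to the .34 and .16 used in the text.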
The diagram below illustrates the areas under the normal curve for one and two
standard deviations above and below the mean (i.e., this is the same as the
previous diagram, except that the vertical lines indicating two standard
deviations from the mean have been added to it, thus creating six areas under
the curve).
The vertical lines represent the mean (at the centre), and distances of 1 and 2
standard deviations on either side of the mean. As before, Area A and Area B are
equal, each defined by the
mean and 1 standard deviation on either side of it. Each of these areas (A and B)
contains approximately 34 percent of all
the values in a normal distribution.
Area C and Area D, which are also equal, are defined by the vertical lines
representing 1 and 2 standard deviations from the mean (on either side). Each of
these areas will contain approximately 13.5 percent of all the
values in a normal distribution.
Areas E and F, at the extreme ends of the curve, are defined as the areas marked
off by the vertical lines representing 2 standard deviations and the tail ends of
the curve. Each of these areas will
contain approximately 2.5 percent of all the values in a normal distribution
(i.e., in a normal distribution, 5 percent of the population will be beyond 2
standard deviations: 2.5 percent above the mean, and 2.5 percent below the mean).
If we continued to draw standard deviation vertical lines to mark off three
standard deviations from the mean (not shown on the diagram), we would have two
very small areas at the extreme tips of the curve to indicate the values lying
more than three standard deviations from the mean. This area contains .3 percent
of all the values in the normal distribution.
The same information given in the above paragraphs in terms of percentages can be
restated in the language of probability as follows:
In any normal distribution, there is a .34
probability that any particular value will fall between the mean and 1
standard deviation above the mean (in Area A), a .34 probability that any
particular value will fall between the mean and 1 standard deviation below
the mean (Area B); furthermore, there is a .135 probability that any
particular value will fall between 1 and 2 standard deviations above the
mean (Area C) and a .135 probability that any particular value will fall
between 1 and 2 standard deviations below the mean (Area D). Finally, there is a
.475 probability that any particular value will
fall within 2 standard deviations above the mean (somewhere in Areas A and
C) and a .475 probability that any particular value will fall within 2
standard deviations below the mean (somewhere within Areas B and D).
Further analysis of the mathematics of normal curves reveals that the
area contained by the perpendicular lines representing 3 standard deviations
from the mean contains 99.7 percent of the area under the curve and thus
represents 99.7 percent of all the scores in the data set. In other words, there
is a 99.7 percent chance (or
p = .997) that in any normal distribution, any particular value will fall
within 3 standard deviations from the mean (3).
Thus, the areas beyond three standard deviations contain only .3
percent of the total area. This
means that in a normally distributed characteristic, the probability of a
value lying more than three standard deviations from the mean is .003, or
.0015 at the top end (above the mean) and .0015 at the bottom end (below the
mean). Thus, it is very rare
indeed (but not impossible) for an observed value in a normal distribution to
occur more than 3 standard deviations from the mean.
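The 68, 95, and 99.7 percent figures all come from a single calculation. One common way to write it, sketched here in Python as an illustration and not as the method used to prepare this handbook, is that the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function in Python's math module. The name prob_within is a hypothetical helper.

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a normally distributed value lies within k SDs of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    inside = prob_within(k)
    outside = 1 - inside
    print(f"within {k} SD: {inside:.4f}   beyond: {outside:.4f}   each tail: {outside / 2:.5f}")
```

The more exact value for 2 standard deviations is about .9545, so each tail beyond 2 standard deviations is nearer 2.3 percent than the rounded 2.5 percent used in the exercises.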
D. A Simple Application of the Mathematical Properties of
the Normal Curve
This mathematical information about a normal curve is enormously valuable. For if
we know that a
population is normally distributed (i.e., that the frequency distribution in the
population follows a normal curve), then if we know the mean of that curve and
the standard deviation, we know the probabilities of any particular value
falling within specified areas of the curve. We can thus make some important
predictions about that population.
For instance, suppose we know that the height of men in a population (say, in
Prince George) is normally distributed, that the mean height (from a sample we
collect) is 68 in., and the standard deviation is 4 in. We then know the
probabilities for the distribution of heights in Prince
George, as follows:
Approximately 34 percent of the men will be between 68 in. (the mean) and 72 in.
(1 SD above the mean, 68 + 4); approximately 34 percent will be between 68 in.
(the mean) and 64 in. (1 SD below the mean, 68 - 4); approximately 13.5 percent
will be between 72 in. and 76 in. (between 1 SD above the mean and 2 SD above the
mean); approximately 13.5 percent will be between 64 in. and 60 in. (between 1 SD
and 2 SD below the mean); approximately 2.5 percent will be between 76 in.
and 80 in. (between 2 and 3 SD above the mean); and approximately 2.5 percent
will be between 60 in. and 56 in. (between 2 SD and 3 SD below the mean).
So if a child of yours informs you that she is getting married to some man from
Prince George, you already know some important things about your prospective
son-in-law, even though you have never met.
There is a .34
probability that his height will be between 68 in. and 72 in.; there is a
.34 probability that his height will be between 68 in. and 64 in.; or,
putting these two together, that there is a .68 probability that his height
is between 64 in. and 72 in.
We could obviously continue this analysis to take into account all the percentage
frequencies indicated by the normal curve.
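The Prince George example can be worked through with a few lines of code. This is a sketch only, using Python's statistics.NormalDist with the mean of 68 in. and standard deviation of 4 in. given above; the helper name prob_between is hypothetical. The probability of a height falling between two values is the area to the left of the upper value minus the area to the left of the lower one.

```python
from statistics import NormalDist

heights = NormalDist(mu=68, sigma=4)  # inches, from the Prince George example

def prob_between(dist, low, high):
    """Probability that a value from dist lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)

print(f"64 to 72 in.:  {prob_between(heights, 64, 72):.4f}")   # within 1 SD
print(f"72 to 76 in.:  {prob_between(heights, 72, 76):.4f}")   # 1 to 2 SD above
print(f"60 to 76 in.:  {prob_between(heights, 60, 76):.4f}")   # within 2 SD
print(f"over 76 in.:   {1 - heights.cdf(76):.4f}")             # beyond 2 SD above
```

The same helper answers questions that do not fall on a whole number of standard deviations, such as the probability of a height between 70 in. and 75 in.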
This mathematical analysis of the normal curve holds for the frequencies of any
value which is normally distributed. Once
we know the mean and the standard deviation, we are able to predict the
probability of the value for any particular member of the population. And this
process is possible, to repeat the point, for any measurable
factor whose frequencies are normally distributed (e.g., mortality rates, some
test scores, volume of wood in trees, and so on). Thus, once we know that a
characteristic is normally distributed, and what
the values are for the mean and the standard deviation, we are in a position to
draw a number of conclusions about the probable distribution of the entire
population.
E. Normal Curve: Summary
It is vitally important for an initial understanding of statistics to grasp the
point that the features of the normal curve apply to all frequency
distributions of normally distributed items. Normal curves may have many
different heights and widths, but in all
cases, these characteristics apply:
1. The mean, median, and mode coincide at the high point of the curve and
divide the results into two equal and perfectly symmetrical halves.
2. Of all the scores in a perfectly normal distribution, approximately 34
percent will lie between the mean and 1 Standard Deviation above the mean, and
approximately 34 percent will lie between the mean and 1 Standard Deviation
below the mean.
3. Of all the scores in a perfectly normal distribution, approximately 95
percent will fall between the lines representing 2 Standard Deviations from the
mean (i.e., about 27 percent of all scores will fall between 1 and 2 standard
deviations, with 13.5 percent on either side of the mean).
4. Of all the scores, approximately 99.7 percent will lie between the lines
indicating 3 standard deviations from the mean (i.e., approximately 5 percent of
the sample will fall between 2 and 3 standard deviations, or approximately 2.5
percent on either side of the mean).
Note that these characteristics hold for any normal distribution regardless of
the height or width of the normal curve. Thus,
once we know that the frequencies of a particular mathematical measurement are
normally distributed, we know that the above groupings of the results should
occur in any very large sample.
F. Self-Test on Normal Distribution Curve
1. The duration times of a certain brand of battery are normally
distributed, with a mean of 80 hours and a standard deviation of 10 hours. As a
marketing gimmick, the manufacturer decides to guarantee to replace
any battery which fails prior to a certain time. Approximately how long a
guarantee should the company provide
so that no more than 2.5 percent of the batteries fail prior to the guaranteed
time?
2. You have a contract to make one thousand uniforms for the Canadian navy. The
heights of sailors are normally distributed, with a mean of 69 inches
and a standard deviation of 2 inches. What
percentage of the uniforms will have to fit sailors shorter than 67 inches? What
percentage will have to be suitable for sailors taller than 73 inches?
3. Let us assume the results from all large tests are normally distributed. In
the final results for Subject A, the mean percentage score is 80 and
the Standard Deviation 5. In
Subject B, the mean percentage score is 70 and the Standard Deviation 2.5.
Suppose you score 75 percent in both courses. What percentage of students
received results better than you
in Subject A and in Subject B? What
is your percentile rank in each subject?
The answers to these questions are given in Section I below.
G. A Normal Distribution and Bernoulli’s Theorem
It is important to
grasp the point that the bell-like shape of a normal distribution only occurs
with a great many samples from normally distributed data. In fact, in any quality
normally distributed (e.g., any heritable
quality, like height), as Bernoulli’s Theorem tells us, the frequency
distribution of the results will get closer and closer to the shape of a normal
distribution as we increase the number of measurements (i.e., data in the
sample).
To follow this
point more clearly, consider the following diagrams. They represent the frequency
distributions of random numbers
taken from a population of numbers which is known to be normally distributed
(Excel generated the numbers and produced the charts). In this case, the mean of
the total population is 10 and the standard
deviation 3 (chosen arbitrarily).
The first diagram
illustrates the frequency distribution for a sample of 100 numbers. You will
notice that it does not look very bell-like. The second diagram illustrates the
frequency distribution for a sample of
1000 numbers. You can see that the
characteristic shape of the normal distribution is beginning to emerge.
The final two
diagrams illustrate the frequency distributions for samples of 2000 and 3000
numbers respectively. Clearly, the
final diagram, although still not a perfect bell curve, approximates much more
closely than any of the others the characteristic shape of the normal
distribution. A larger sample (say,
10,000) would look even closer to the symmetrical bell shape.
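The same experiment can be repeated without charts by watching a few summary numbers settle down as the sample grows. The sketch below is an assumption-laden stand-in for the Excel exercise: it uses Python's random.gauss to draw from a normal population with mean 10 and standard deviation 3, an arbitrary fixed seed so that a run can be repeated, and a hypothetical helper fraction_within_one_sd. For each sample size it prints the sample mean, the sample standard deviation, and the share of values lying within 1 standard deviation of the population mean, which should drift toward roughly .68.

```python
import random
from statistics import mean, stdev

def fraction_within_one_sd(sample, centre, spread):
    """Share of the sample lying within one `spread` of `centre`."""
    inside = sum(1 for x in sample if abs(x - centre) <= spread)
    return inside / len(sample)

rng = random.Random(2004)  # arbitrary seed, chosen only to make the run repeatable

for size in (100, 1000, 3000, 100_000):
    sample = [rng.gauss(10, 3) for _ in range(size)]
    print(f"n = {size:>6}   mean = {mean(sample):6.3f}   SD = {stdev(sample):5.3f}   "
          f"within 1 SD = {fraction_within_one_sd(sample, 10, 3):.3f}")
```

Small samples can wander noticeably from the theoretical figures; the larger ones stay close to them, which is the numerical counterpart of the histograms becoming more bell-like.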
When we are
dealing with random number generation from a population which is not normally
distributed but which is uniformly random, then increasing the number in the
sample is not going to bring the histogram closer and closer to a bell shape.
Here, for example, are histograms for 1000 and for 2000 numbers between 1 and 400
randomly generated, but this time from a population which is not normally
distributed. Notice that there is no emerging bell curve shape as one increases
the number of samples from 1000 to 2000.
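A quick numerical check makes the same contrast. The sketch below assumes that "uniformly random" means every whole number from 1 to 400 is equally likely, and uses Python's random.randint with an arbitrary seed; the helper share_within_one_sd is a hypothetical name. In a normal distribution about 68 percent of values lie within 1 standard deviation of the mean; for uniformly distributed values the share stays near 58 percent however large the sample becomes.

```python
import random
from statistics import mean, pstdev

def share_within_one_sd(sample):
    """Share of the sample lying within one standard deviation of its own mean."""
    centre, spread = mean(sample), pstdev(sample)
    return sum(1 for x in sample if abs(x - centre) <= spread) / len(sample)

rng = random.Random(400)  # arbitrary seed, chosen only to make the run repeatable

for size in (1000, 2000, 50_000):
    uniform_sample = [rng.randint(1, 400) for _ in range(size)]
    print(f"n = {size:>6}   within 1 SD = {share_within_one_sd(uniform_sample):.3f}")
```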
H. A Final Word
It is particularly important that you take away from this section and
the previous sections a clear sense of the meanings of the following key terms:
mean, standard deviation, z-score
(positive and negative), normal distribution, normal curve.
In addition, you must retain a clear sense that knowing the standard deviation
and the mean of a certain normal curve enables one to ascertain the probability
that certain results will fall within a certain distance of the mean.
From now on we assume that students are all familiar with the concept that the
area under the normal curve indicates the theoretical distribution of
frequencies in any normally distributed data. Various areas under the curve
represent the various probabilities that
any one score will fall within the designated area. Thus, the smaller the area
for any group of scores, the
smaller the probability that any score in that group will occur. The tails of the
curve (beyond 3 standard deviations) contain very small
areas, and thus the probabilities of scores within those areas are very low
(less than .01).
As a rough guide, remember that the majority (approximately 68 percent) of all
scores in a normal distribution should fall within 1 standard deviation of the
mean (or between a z-score of +1 and
-1); almost all (95 percent of the scores) should fall within 2
standard deviations of the mean (or between a z-score
of +2 and -2); and the probability of a score falling within 3 standard
deviations of the mean is approximately 100 percent.
This does not mean that it is impossible for a score in a normal distribution to
fall further than 3 SD from the mean, simply that such a result is very rare (the
value of p is close to 0).
Remember, too, that these characteristics refer only to data which is normally
distributed. These figures do not
apply in other sorts of distributions (in which the shape of the frequency curve
will be different).
You will understand very little of what comes in the next sections if you have
not grasped clearly the above information.
I. Answers to Self-Test on the Normal Distribution Curve
1. The manufacturer does not want to replace more than 2.5 percent of his
batteries. Since the lifetime of
the batteries is normally distributed, we know that 95 percent of them will fall
within 2 standard deviations of the mean, that is, between 80 + 2 SD and
80 - 2 SD, or 80 + 20 and 80 - 20, or between 100 hr and 60 hr. Thus, 5 percent
of the population of batteries will fall outside this
range, 2.5 percent above and 2.5 percent below. We are not worried about the
batteries above this range,
because owners are not going to complain about batteries lasting longer; the
area of the population we are concerned with is the 2.5 percent below 2 standard
deviations (i.e., below 60 hr). Therefore,
the manufacturer should set his guarantee at 60 hr.
2. Sailors shorter than 67 inches fall into an area of the normal curve from
the lower extremity to the line marking 1 SD below the mean (since the mean is 69
in. and the Standard Deviation 2 in.). In
a normal distribution, the area to the left of 1 SD below the mean is
approximately 16 percent of the total population. Similarly, sailors taller than
73 in. fall into the area more than 2 SD to the right
of the mean. In a normal
distribution, the area more than 2 SD to the right of the mean is approximately
2.5 percent of the total population.
3. In Subject A your score of 75 is 5 marks below the mean (of 80). This is
equivalent to 1
Standard Deviation below the mean (or a z-score
of -1). Since the marks are
normally distributed, the percentage of students getting better marks than you
includes the entire population to the right of one Standard Deviation below the
mean, or 84 percent. In Subject B
your mark of 75 is 5 marks above the mean of 70 (or a z-score
of +2, since the Standard Deviation is 2.5). Thus, the students who did better
than you are those in the
area to the right of two Standard Deviations above the mean, or 2.5 percent. The
percentile rank is the percentage of students who fared worse than
you. Thus, in the first test, you
have a percentile score of 16; in the second test you have a percentile score of
97.5.
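The three answers can be checked against the more exact areas under the curve. This is a sketch for checking only, assuming Python's statistics.NormalDist; the helper z_score is a hypothetical name. The method inv_cdf works backwards from an area to the value that cuts it off, and cdf gives the area to the left of a value.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Question 1: battery life, mean 80 hr, SD 10 hr; time below which 2.5 percent fail.
guarantee_hours = NormalDist(80, 10).inv_cdf(0.025)

# Question 2: sailors' heights, mean 69 in., SD 2 in.
sailors = NormalDist(69, 2)
share_shorter_67 = sailors.cdf(67)
share_taller_73 = 1 - sailors.cdf(73)

# Question 3: a score of 75 in Subject A (mean 80, SD 5) and Subject B (mean 70, SD 2.5).
z_a, z_b = z_score(75, 80, 5), z_score(75, 70, 2.5)
percentile_a = 100 * NormalDist().cdf(z_a)
percentile_b = 100 * NormalDist().cdf(z_b)

print(f"Q1 guarantee: {guarantee_hours:.1f} hr")
print(f"Q2 shorter than 67 in.: {share_shorter_67:.3f}, taller than 73 in.: {share_taller_73:.3f}")
print(f"Q3 z-scores: {z_a:+.1f}, {z_b:+.1f}; percentile ranks: {percentile_a:.1f}, {percentile_b:.1f}")
```

The small differences from the answers above (for instance, a guarantee of about 60.4 hr rather than 60 hr) arise because the exercises use the rounded 95 percent figure for 2 standard deviations.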
Notes to Section Five
(1) The adjective normal does
not mean "usual" or "customary" (although such a
distribution is, in fact, quite common), but comes from "normative,"
meaning ideal.
(2) The credit for first
recognizing and developing the properties of the normal curve is generally given
to the English mathematician Abraham de Moivre, 1667 to 1754, an acquaintance of
Newton's and a member of the Royal Society, who used as his statistical
laboratory the London coffee houses where all sorts of gambling went on. The
basic principle underlying Normal Distribution is that any data which
are influenced by many small and unrelated random effects (like, for example,
weight) are going to be normally distributed (at least to a very near
approximation). This principle is
called the Central Limit Theorem. See
Appendix E for an illustration of how combining independent random effects
produces a normal distribution.
(3) These percentage figures are approximate. The
more exact figures are as follows: the area between the mean and one standard
deviation contains 34.13 percent of all results on either side of the mean; the
area between the mean and two standard deviations contains 47.72 percent of all
results on either side of the mean; the area between the mean and three standard
deviations contains 49.87 percent of all results on either side of the mean. For
a complete layout of the area under the normal curve at different
standard deviations see Table A in Appendix B. For the purpose of our exercises
we will use the approximate values given
above, except where noted.