Unit: Exploring One-Variable Data
Chapter: Normal Distribution
Reference: Data Distribution, Describing Data, Central Tendency, Normal Distribution, Bell-Shaped Curve, Symmetry, Empirical Rule, Z-Scores & Percentiles, Normal Probability Plots, Central Limit Theorem, Sampling Distribution, Hypothesis Testing, Confidence Intervals
After studying this chapter, you should be able to:
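- Describe the Normal distribution, its symmetry, and the Empirical Rule.
- Compute and interpret Z-scores and percentiles, and state the Central Limit Theorem.
- Recognize how these ideas underpin hypothesis testing and confidence intervals.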
Normal Distribution, Symmetry & Empirical Rule
- Normal Distribution: The Normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is characterized by a bell-shaped curve. It is symmetric around its mean (average) and is defined by two parameters: the mean (μ) and the standard deviation (σ). The shape of the Normal distribution is completely determined by these two parameters.
The Normal distribution is widely used in statistics due to its many important properties and applications. Many natural phenomena and measurements are approximately Normally distributed, which is why Normality is a common assumption in statistical analyses.
- Symmetry of the Normal Distribution: The Normal distribution is symmetric, meaning that the left and right halves of the distribution are mirror images of each other. This symmetry is evident in the bell-shaped curve, where the peak (mode), mean, and median all coincide at the center.
Mathematically, if X follows a Normal distribution with mean μ and standard deviation σ, then the probability density function (pdf) of X is given by:
f(x) = (1 / (σ√(2π))) · e^(–(x – μ)² / (2σ²)), for –∞ < x < ∞
The graph of this function produces the bell-shaped curve characteristic of the Normal distribution.
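As a minimal sketch, the bell-shaped density described above can be evaluated directly from the standard Normal pdf formula. The function name `normal_pdf` and the values μ = 100 and σ = 15 are illustrative assumptions, not part of the chapter.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

mu, sigma = 100.0, 15.0  # illustrative values, assumed for this sketch

# Symmetry: points equally far from the mean have the same density.
left = normal_pdf(mu - 10, mu, sigma)
right = normal_pdf(mu + 10, mu, sigma)

# The curve peaks at the mean.
peak = normal_pdf(mu, mu, sigma)

print(f"f(mu - 10) = {left:.6f}, f(mu + 10) = {right:.6f}, f(mu) = {peak:.6f}")

# Cross-check against the standard library's own Normal density.
print(f"library value at mu - 10: {NormalDist(mu, sigma).pdf(mu - 10):.6f}")
```

The two densities at μ ± 10 should match, and both should be smaller than the density at μ, which is the symmetry and single peak described in the text.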
- Empirical Rule (68-95-99.7 Rule): The Empirical Rule, also known as the 68-95-99.7 Rule, is a fundamental property of the Normal distribution that describes the approximate proportion of data values falling within certain intervals around the mean. This rule is based on the properties of standard deviations and applies to data that follow a normal distribution.
According to the Empirical Rule:
- Approximately 68% of the data falls within one standard deviation of the mean (μ ± σ).
- Approximately 95% of the data falls within two standard deviations of the mean (μ ± 2σ).
- Approximately 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ).
This rule provides a quick way to estimate the spread and proportion of data within different intervals of a normal distribution without having to calculate the exact probabilities.
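The three percentages can be checked against exact Normal probabilities. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper name `within_k_sigma` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def within_k_sigma(k):
    """Probability that a Normal value lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within_k_sigma(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

Because every Normal distribution becomes the standard Normal after standardizing, the same proportions hold for any μ and σ.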
The Normal distribution, its symmetry, and the Empirical Rule are fundamental concepts in statistics and play a significant role in hypothesis testing, confidence intervals, and various statistical analyses. They also serve as a basis for understanding other important distributions in probability theory and data analysis.
Z-scores & Percentiles, Central Limit Theorem
- Z-scores: A Z-score (also known as a standard score) measures how many standard deviations a data point is away from the mean of its distribution. It is a standardized value that allows us to compare and interpret data from different distributions. The formula for calculating the Z-score of a data point, x, in a distribution with mean μ and standard deviation σ, is given by:
Z = (x – μ) / σ
If the Z-score is positive, the data point is above the mean, and if it is negative, the data point is below the mean. A Z-score of 0 indicates that the data point is equal to the mean. Z-scores help us determine how unusual or typical a particular data point is within its distribution.
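A short sketch of the Z-score formula follows. The two tests and their means and standard deviations are hypothetical numbers, chosen only to show how standardizing makes scores on different scales comparable.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical example: the same student on two differently scaled tests.
z_math = z_score(85, mu=70, sigma=10)        # 1.5 standard deviations above the mean
z_reading = z_score(620, mu=500, sigma=100)  # 1.2 standard deviations above the mean
z_at_mean = z_score(70, mu=70, sigma=10)     # 0.0, exactly at the mean

print(z_math, z_reading, z_at_mean)
```

Although 620 is the larger raw score, the math result is the more unusual one relative to its own distribution, because its Z-score is higher.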
- Percentiles: Percentiles are measures used to divide a dataset into 100 equal parts, each representing a percentage of the data. The pth percentile is the value below which p% of the data falls. For example, the 25th percentile (also known as the first quartile) is the value below which 25% of the data falls.
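Several conventions exist for computing percentiles, and they can give slightly different answers on small datasets. One common method for finding the value of a specific percentile in a dataset is: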
- Order the data in ascending order.
- Compute the position of the percentile (position = (p/100) * (n + 1)), where n is the number of data points.
- If the position is an integer, the percentile is the data value at that position.
- If the position is not an integer, the percentile is the average of the data values at the positions directly above and below the calculated position.
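The steps above can be sketched as a small function. Two details are assumptions rather than part of the stated procedure: positions are treated as 1-based, and positions that fall before the first or after the last data value are clamped to the nearest end. The sample scores are illustrative.

```python
import math

def percentile(data, p):
    """p-th percentile using position = (p / 100) * (n + 1) on the sorted data."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)              # Step 1: ascending order
    n = len(ordered)
    position = p * (n + 1) / 100        # Step 2: 1-based position
    if position <= 1:                   # assumption: clamp to the smallest value
        return ordered[0]
    if position >= n:                   # assumption: clamp to the largest value
        return ordered[-1]
    lower = math.floor(position)
    if position == lower:               # Step 3: integer position
        return ordered[lower - 1]
    # Step 4: average the values just below and just above the position
    return (ordered[lower - 1] + ordered[lower]) / 2

scores = [90, 70, 100, 80, 75, 95, 85]  # illustrative data, n = 7
print(percentile(scores, 25))  # position 2.0 -> 75
print(percentile(scores, 50))  # position 4.0 -> 85
print(percentile(scores, 60))  # position 4.8 -> (85 + 90) / 2 = 87.5
```

Library routines often interpolate between neighbouring values instead of averaging them, so their results may differ slightly from this method.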
- Central Limit Theorem: The Central Limit Theorem (CLT) is a fundamental result in statistics that describes the sampling distribution of the sample mean for large random samples. It states that, for independent observations drawn from a population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean approaches a Normal distribution as the sample size (n) increases, whatever the shape of the population distribution. That Normal distribution has mean equal to the population mean (μ) and standard deviation equal to σ / √n, known as the standard error.
The key implications of the Central Limit Theorem are:
- The distribution of sample means tends to be approximately Normal, regardless of the shape of the population distribution, as long as the sample size is sufficiently large. A common rule of thumb is n ≥ 30, although strongly skewed populations may need larger samples.
- The larger the sample size, the closer the sampling distribution of the sample mean will be to a normal distribution.
- The Central Limit Theorem is crucial in inferential statistics, where it allows us to make inferences about population parameters based on sample statistics.
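A small simulation can illustrate these implications. This is a sketch under stated assumptions: the population is taken to be Exponential with rate 1 (strongly right-skewed, with mean 1 and standard deviation 1), and the seed, sample sizes, and number of trials are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_means(n, trials):
    """Means of `trials` independent samples of size n from an Exponential(1) population."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]

# Exponential(1) has population mean 1 and population standard deviation 1,
# so the CLT predicts sample means centered on 1 with spread 1 / sqrt(n).
for n in (2, 30, 200):
    means = sample_means(n, trials=2000)
    observed_center = statistics.fmean(means)
    observed_spread = statistics.stdev(means)
    predicted_spread = 1.0 / n ** 0.5
    print(f"n = {n:>3}: center = {observed_center:.3f}, "
          f"spread = {observed_spread:.3f}, predicted spread = {predicted_spread:.3f}")
```

The observed spread should track σ / √n closely, and a histogram of `means` would look increasingly bell-shaped as n grows, even though the population itself is far from Normal.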
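Example: Compare the exam scores of two classes and assess how closely each set of scores resembles a Normal distribution.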
Class A Exam Scores: 78, 82, 85, 88, 90, 92, 95, 98
Class B Exam Scores: 70, 75, 80, 85, 90, 95, 100
Solution:
Step 1: Calculate the Mean and Standard Deviation for Each Class. Each class is treated as a complete population, so the sum of squared deviations is divided by n.
- For Class A: Mean (μ) = (78 + 82 + 85 + 88 + 90 + 92 + 95 + 98) / 8 = 708 / 8 = 88.5. Standard Deviation (σ) = √[((78 – 88.5)² + (82 – 88.5)² + … + (98 – 88.5)²) / 8] = √(312 / 8) = √39 ≈ 6.2
- For Class B: Mean (μ) = (70 + 75 + 80 + 85 + 90 + 95 + 100) / 7 = 595 / 7 = 85. Standard Deviation (σ) = √[((70 – 85)² + (75 – 85)² + … + (100 – 85)²) / 7] = √(700 / 7) = √100 = 10
Step 2: Create Histograms to Compare Distributions. Let's create histograms for each class to visualize the distribution of exam scores.
Step 3: Compare to the Normal Distribution. Now, let's compare the histograms to a Normal distribution with the same mean and standard deviation for each class.
For Class A, the Normal distribution would have the same mean (μ = 88.5) and standard deviation (σ ≈ 6.2). It would look like a bell-shaped curve centered at 88.5.
For Class B, the Normal distribution would have the same mean (μ = 85) and standard deviation (σ = 10). It would also look like a bell-shaped curve, centered at 85 and wider than the curve for Class A.
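Step 4: Conclusion. After comparing the histograms to the Normal distribution, we can see that:
- Class A's scores are somewhat closer to a Normal distribution: they are nearly symmetric and sit slightly closer together near the center than at the extremes.
- Class B's scores are perfectly symmetric about 85 but evenly spaced and more spread out, so the histogram looks flat rather than bell-shaped, which suggests a weaker resemblance to a Normal distribution.
With only seven or eight scores per class, these comparisons are rough impressions rather than firm conclusions.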
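The statistics in Step 1 and a rough version of the histograms in Step 2 can be reproduced with the standard library. The bin edges (width 10, starting at 65) and the helper names `summarize` and `bin_counts` are assumptions made for this sketch; a different choice of bins would change the picture.

```python
import statistics

class_a = [78, 82, 85, 88, 90, 92, 95, 98]
class_b = [70, 75, 80, 85, 90, 95, 100]

def summarize(scores):
    """Return (mean, population standard deviation), dividing by n as in Step 1."""
    return statistics.fmean(scores), statistics.pstdev(scores)

def bin_counts(scores, start=65, width=10, bins=4):
    """Count scores in [start, start + width), [start + width, start + 2 * width), ..."""
    counts = [0] * bins
    for s in scores:
        idx = (s - start) // width
        if 0 <= idx < bins:  # scores outside the binned range are ignored
            counts[idx] += 1
    return counts

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    mean, sd = summarize(scores)
    print(f"{name}: mean = {mean:.2f}, population sd = {sd:.2f}")
    for i, count in enumerate(bin_counts(scores)):
        low = 65 + 10 * i
        print(f"  {low:>3}-{low + 9:<3} {'#' * count}")
```

With these bins, Class A piles up in the middle while Class B is close to flat, but with so few scores the shapes are only suggestive.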
Key Points
Shape of the Distribution:
Compare the shape of the data distribution to a normal distribution. A Normal distribution is bell-shaped and symmetric, with the left and right halves mirroring each other.
Skewness:
Check for skewness in the data. If the data is skewed, it may not resemble a normal distribution.
Outliers:
Identify outliers in the data. Outliers can significantly affect the shape of the distribution and its resemblance to a normal distribution.
Central Tendency:
Compare the mean, median, and mode of the data. In a normal distribution, they are equal, but in other distributions, they may differ.
Range and Spread:
Analyze the range and spread of the data. In a Normal distribution the spread around the mean is predictable and is summarized by the Empirical Rule.
Z-scores:
Calculate Z-scores for data points and check if they follow a standard Normal distribution (mean = 0, standard deviation = 1).
Percentiles:
Compare percentiles of the data with the corresponding percentiles of a standard Normal distribution (using Z-scores).
Central Limit Theorem:
Consider the sample size and determine if the Central Limit Theorem applies. For large samples, the sample means tend to follow a normal distribution regardless of the population distribution.
Histograms:
Create histograms for the data and compare them to the bell-shaped curve of a normal distribution.
Probability Plots:
Use probability plots (e.g., Q-Q plots) to visually assess how closely the data aligns with a normal distribution.
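The points of a Normal Q-Q plot can be computed without a plotting library. This sketch assumes plotting positions of (i – 0.5) / n, which is one of several conventions in use, and reuses the Class A scores from the example. The helper names `qq_pairs` and `pearson_r` are hypothetical.

```python
from statistics import NormalDist, fmean

def qq_pairs(data):
    """Pair each sorted value with the standard Normal quantile at (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Correlation between theoretical quantiles and observed values."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

class_a = [78, 82, 85, 88, 90, 92, 95, 98]
pairs = qq_pairs(class_a)
for quantile, value in pairs:
    print(f"{quantile:6.3f}  {value}")
print(f"straightness (correlation): {pearson_r(pairs):.3f}")
```

Points that lie close to a straight line, which shows up as a correlation near 1, are consistent with Normality. Curved or S-shaped patterns point to skewness or unusual tails.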
Kurtosis:
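Assess the kurtosis of the data. It is often described as measuring the "peakedness" or "flatness" of the distribution, but it mainly reflects how heavy the tails are. A Normal distribution has a kurtosis of 3 (equivalently, an excess kurtosis of 0).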
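Kurtosis can be computed as the fourth central moment divided by the squared variance. The sketch below assumes this population (moment) form with no small-sample bias correction, and the simulated Normal sample is there only for comparison.

```python
import random
from statistics import fmean

def moment_kurtosis(data):
    """Fourth standardized moment m4 / m2**2 (population form, no bias correction)."""
    mean = fmean(data)
    m2 = fmean((x - mean) ** 2 for x in data)
    m4 = fmean((x - mean) ** 4 for x in data)
    return m4 / m2 ** 2

# Evenly spaced scores (Class B from the example) give a flat, light-tailed shape.
class_b = [70, 75, 80, 85, 90, 95, 100]
print(f"Class B kurtosis: {moment_kurtosis(class_b):.2f}")  # 1.75, well below 3

# A large simulated Normal sample should land close to 3.
rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(f"Simulated Normal kurtosis: {moment_kurtosis(normal_sample):.2f}")
```

Values well below 3 indicate lighter tails than a Normal distribution, and values well above 3 indicate heavier tails.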
Statistical Tests:
Utilize statistical tests (e.g., Anderson-Darling test, Kolmogorov-Smirnov test) to formally assess the fit of the data to a Normal distribution.
Transformations:
Consider data transformations (e.g., logarithmic, square root) to achieve a more Normal-like distribution.
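As a hedged illustration of a transformation at work, the sketch below simulates right-skewed data and compares the skewness before and after taking logarithms. The lognormal population, its parameters, and the seed are assumptions chosen so that the log transform is known to help. Real data may respond differently.

```python
import math
import random
from statistics import fmean

def skewness(data):
    """Third standardized moment m3 / m2**1.5 (population form)."""
    mean = fmean(data)
    m2 = fmean((x - mean) ** 2 for x in data)
    m3 = fmean((x - mean) ** 3 for x in data)
    return m3 / m2 ** 1.5

rng = random.Random(7)  # fixed seed so the run is repeatable

# Simulated right-skewed data: the logarithm of a lognormal value is Normal.
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]
logged = [math.log(x) for x in raw]

print(f"skewness before transform: {skewness(raw):.2f}")
print(f"skewness after log transform: {skewness(logged):.2f}")
```

A skewness near 0 after the transform suggests a more symmetric, more Normal-like shape. Logarithms and square roots only apply to positive (or non-negative) data.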
Real-World Application:
Analyze the context of the data and determine if the departure from Normality is reasonable given the specific application.
Sample Size:
Remember that for small sample sizes, the data distribution may not resemble a normal distribution, even if the population distribution is Normal.