Unit: Inference for Categorical Data: Chi Square
Chapter: Chi-Square Test for Independence
Reference: Exploring data, Sampling & Experimental design, Probability, Inference, Confidence Intervals, Power & Sample size, Designing Studies, Bivariate data, Probability models, Chi-square tests, Inference for categorical data, Inference for Means & Proportions, Multivariate data analysis.
After studying this chapter, you should be able to:
- Explore data and describe common sampling methods and experimental designs.
- Apply probability-based inference and construct confidence intervals.
- Analyze bivariate data and work with probability models.
- Carry out inference for means and proportions, and describe techniques for multivariate data.
Exploring Data, Sampling & Experimental Design
Exploring Data:
- Descriptive Statistics: Descriptive statistics summarize and present data using measures of center (mean, median) and measures of spread (range, interquartile range, standard deviation).
- Graphical Displays: Histograms, stem-and-leaf plots, boxplots, and scatterplots are used to visualize data distributions, identify patterns, and detect outliers.
- Shape of Distributions: Distributions can be symmetric, skewed left or right, or bimodal. Skewness and modality provide insights into data patterns.
- Center and Spread: The mean is affected by outliers, while the median is more robust. The standard deviation quantifies the variability around the mean.
- Z-Scores: Z-scores standardize data by measuring how many standard deviations an observation is from the mean. They help identify unusual observations.
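The measures above can be sketched in a few lines of Python using only the standard library. The data set is hypothetical, and the |z| > 2 cutoff for "unusual" is a common rule of thumb rather than something fixed by the text.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [12, 15, 14, 10, 13, 15, 11, 48]

mean = statistics.mean(data)            # pulled upward by the outlier
median = statistics.median(data)        # more robust to the outlier
stdev = statistics.stdev(data)          # sample standard deviation (n - 1 divisor)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                           # interquartile range

# Z-score: how many standard deviations each observation is from the mean.
z_scores = [(x - mean) / stdev for x in data]

# Rule-of-thumb flag (an assumption): |z| > 2 counts as unusual.
unusual = [x for x, z in zip(data, z_scores) if abs(z) > 2]

print(f"mean={mean}, median={median}, stdev={stdev:.2f}, IQR={iqr}")
print("unusual observations:", unusual)
```

In this sample the single large value raises the mean well above the median, which illustrates why the median is described as the more robust measure of center.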
Sampling and Experimental Design:
- Random Sampling: Simple random sampling ensures every member of a population has an equal chance of being selected, reducing bias in samples.
- Stratified Sampling: Dividing the population into homogeneous subgroups (strata) and then randomly sampling from each stratum helps ensure representation.
- Cluster Sampling: Dividing the population into clusters and randomly selecting entire clusters can be more practical when sampling is challenging.
- Systematic Sampling: Selecting every "k-th" element from a population after a random start helps achieve randomness in an ordered dataset.
- Experimental vs. Observational Studies: Experimental studies involve manipulating variables to establish causation, while observational studies observe variables without manipulation.
- Control Groups: Experimental designs often include control groups that do not receive the treatment, allowing comparison to assess the treatment's effect.
- Randomization: Assigning subjects to treatment and control groups randomly helps eliminate selection bias and establish causal relationships.
- Blinding: Single-blind and double-blind designs reduce bias by preventing participants and/or experimenters from knowing which treatment is given.
- Placebo Effect: The placebo effect occurs when a subject's belief in a treatment causes an actual response, highlighting the importance of control groups.
- Sampling Bias: Sampling bias occurs when certain groups are underrepresented or overrepresented in a sample, potentially leading to inaccurate conclusions.
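The sampling methods and random assignment described above can be illustrated with a minimal sketch. The population, the stratum labels, and all function names here are hypothetical and chosen only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 units, each tagged with a stratum.
population = [{"id": i, "stratum": "A" if i < 60 else "B"} for i in range(100)]

def simple_random_sample(units, n):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)

def stratified_sample(units, n_per_stratum):
    """Randomly sample the same number of units from each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit["stratum"], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters):
    """Randomly pick whole clusters and keep every unit in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [unit for cluster in chosen for unit in cluster]

def systematic_sample(units, k):
    """Take every k-th unit after a random start."""
    start = rng.randrange(k)
    return units[start::k]

def randomize(units):
    """Randomly split units into treatment and control groups."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

clusters = [population[i:i + 10] for i in range(0, 100, 10)]
treatment, control = randomize(population)
print(len(simple_random_sample(population, 10)), len(treatment), len(control))
```

The stratified version samples equal counts per stratum for simplicity; sampling in proportion to stratum size is another common choice.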
Probability Inference & Confidence Intervals
Probability Inference & Confidence Intervals:
- Population and Sample: Probability inference involves making statements about a population based on a sample. Confidence intervals provide a range of plausible values for a population parameter.
- Parameter and Statistic: A parameter is a numerical summary of a population, while a statistic is a numerical summary of a sample. Inference aims to estimate population parameters using sample statistics.
- Sampling Distribution: The distribution of a statistic (like the sample mean) across all possible samples of a given size from a population. The central limit theorem states that the sampling distribution of the sample mean approaches normality as sample size increases.
- Margin of Error: The half-width of a confidence interval, that is, the amount added to and subtracted from the point estimate. It depends on the sample size, the variability in the data, and the chosen confidence level.
- Confidence Level: The long-run proportion of confidence intervals, built by the same method from repeated samples, that contain the true population parameter. Common confidence levels are 90%, 95%, and 99%.
- Confidence Interval Formula: A confidence interval is typically calculated as: point estimate ± margin of error. For example, for a confidence interval for a population mean, it is often: sample mean ± critical value * (standard deviation / √n).
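- Critical Value: The z-score (when the population standard deviation is known or the sample is large) or t-score (when the population standard deviation is estimated from the sample) that corresponds to a specific confidence level. Together with the standard error, it determines the width of the confidence interval.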
- Interpretation: A 95% confidence interval means that if we were to take many samples and construct confidence intervals for each, about 95% of these intervals would contain the true population parameter.
- Hypothesis Testing vs. Confidence Intervals: Hypothesis testing involves making decisions about population parameters based on sample data, while confidence intervals provide a range of likely values for the population parameter.
- Precision and Sample Size: Increasing the sample size generally leads to narrower confidence intervals, providing more precise estimates of population parameters.
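As a rough illustration of the formula point estimate ± critical value * (standard deviation / √n), the sketch below computes an interval for a mean. It assumes a z critical value is appropriate (large sample or known standard deviation); with a small sample and an estimated standard deviation, a t critical value would take its place. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (lower, upper) for a z-based confidence interval for a mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)  # margin of error
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary statistics: mean 100, standard deviation 15.
small = mean_confidence_interval(100, 15, n=36)
large = mean_confidence_interval(100, 15, n=144)

print(f"n=36:  ({small[0]:.2f}, {small[1]:.2f})")
print(f"n=144: ({large[0]:.2f}, {large[1]:.2f})")
```

Quadrupling the sample size halves the margin of error, which is the precision effect described in the last bullet.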
Bivariate Data & Probability Models
Bivariate Data:
- Bivariate Data: Bivariate data involves pairs of observations on two variables. It explores relationships and patterns between these variables.
- Scatterplot: A graphical representation of bivariate data that uses points to show the relationship between two variables. It helps identify trends, clusters, and outliers.
- Correlation Coefficient (r): A measure of the strength and direction of a linear relationship between two quantitative variables. It ranges from -1 to +1.
- Positive and Negative Correlation: Positive correlation means that as one variable increases, the other tends to increase. Negative correlation means as one variable increases, the other tends to decrease.
- Strength of Correlation: The closer the absolute value of the correlation coefficient is to 1, the stronger the linear relationship between the variables.
- Line of Best Fit (Regression Line): A line that summarizes the trend in scatterplot data. It minimizes the sum of squared vertical distances between data points and the line.
- Residuals: The differences between observed and predicted values from the regression line. Residual plots help assess the adequacy of the model.
- Coefficient of Determination (R-squared): A measure that indicates the proportion of the variability in the response variable that is explained by the regression model.
- Outliers: Data points that do not follow the overall pattern of the data. They can have a significant impact on correlation and regression results.
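The correlation coefficient, least-squares line, residuals, and R-squared can be computed directly from their definitions. The following is a minimal sketch on a small hypothetical data set; the variable names are illustrative.

```python
from math import sqrt

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of squares and cross-products about the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

r = s_xy / sqrt(s_xx * s_yy)         # correlation coefficient, between -1 and +1
slope = s_xy / s_xx                  # least-squares slope
intercept = mean_y - slope * mean_x  # least-squares intercept

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

# For simple linear regression, R-squared equals r squared.
r_squared = 1 - sum(e ** 2 for e in residuals) / s_yy

print(f"r={r:.3f}, line: y = {intercept:.1f} + {slope:.1f}x, R^2={r_squared:.2f}")
```

The residuals of a least-squares line with an intercept sum to zero, so a residual plot is read for patterns (curvature, fanning) rather than for its average.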
Probability Models:
- Random Variables: A random variable assigns a numerical value to each outcome of a random process. It can be discrete or continuous.
- Probability Distribution: A function that describes the probabilities of different outcomes of a random variable. It may be described using a probability mass function (PMF) or probability density function (PDF).
- Discrete Probability Distributions: Examples include the binomial distribution (the number of successes in a fixed number of independent trials with two outcomes) and the Poisson distribution (the number of events in a fixed interval, often used for rare events).
- Continuous Probability Distributions: Examples include the normal distribution (bell curve) and the exponential distribution (for time between events in a Poisson process).
- Standard Normal Distribution: A special case of the normal distribution with a mean of 0 and a standard deviation of 1. Z-scores are used to standardize and compare values from different normal distributions.
- Using Probability Models: Probability models help predict outcomes and understand the likelihood of different events. They are fundamental for making informed decisions based on uncertain or random processes.
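A short sketch of the distributions named above, using only the standard library. The helper names `binomial_pmf` and `poisson_pmf` are hypothetical, and the parameter values are chosen only for illustration.

```python
from math import comb, exp, factorial
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

# Discrete: probability of exactly 3 heads in 10 fair coin flips.
p_three_heads = binomial_pmf(3, 10, 0.5)

# Discrete: probability of no events when 2 are expected on average.
p_no_events = poisson_pmf(0, 2.0)

# Continuous: standard normal distribution (mean 0, standard deviation 1).
standard_normal = NormalDist(0, 1)
p_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(f"{p_three_heads:.4f} {p_no_events:.4f} {p_within_one_sd:.4f}")
```

A PMF gives probabilities of individual values directly, while for a continuous distribution probabilities come from areas, here differences of the cumulative distribution function.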
Inference for Means & Proportions, Multivariate Data
Inference for Mean and Proportion:
- Sample Mean and Population Mean: The sample mean is a point estimate of the population mean. Inference methods allow us to make statements about the population mean using sample data.
- Sampling Distribution of the Sample Mean: The sampling distribution of the sample mean is approximately normal for large samples, thanks to the Central Limit Theorem.
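- One-Sample t-Test: Used to test hypotheses about the population mean when the population standard deviation is unknown and must be estimated from the sample. The choice of the t-distribution over the normal distribution matters most when the sample size is small.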
- Confidence Intervals for the Mean: Confidence intervals provide a range of plausible values for the population mean with a certain level of confidence.
- Margin of Error for a Mean: The margin of error for a mean in a confidence interval depends on the sample size, standard deviation, and chosen confidence level.
- Two-Sample t-Test: Used to compare means of two independent samples, testing whether their means are significantly different.
- Paired t-Test: Used to compare means of two related samples, where each data point in one sample is paired with a data point in the other.
- Inference for Proportions: Similar to means, we can make inferences about population proportions using sample proportions and confidence intervals.
- Hypothesis Testing for Proportions: Hypothesis tests can be conducted to compare sample proportions to a hypothesized population proportion.
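A hypothesis test for a single proportion can be sketched as follows. This is a minimal example assuming the usual large-sample normal approximation; the function name and the counts (60 successes in 100 trials) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: p = p0 using the normal approximation."""
    p_hat = successes / n              # sample proportion
    se = sqrt(p0 * (1 - p0) / n)       # standard error under the null hypothesis
    z = (p_hat - p0) / se              # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value

# Hypothetical data: 60 successes in 100 trials, tested against p0 = 0.5.
p_hat, z, p_value = one_proportion_z_test(60, 100, 0.5)
print(f"p_hat={p_hat}, z={z:.2f}, p-value={p_value:.4f}")
```

The standard error uses the hypothesized proportion p0 because the test statistic is computed under the assumption that the null hypothesis is true.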
Multivariate Data Analysis:
- Multivariate Data: Multivariate data involves more than two variables. Techniques in multivariate analysis help explore relationships among multiple variables.
- Correlation Matrix: A table showing correlations between pairs of variables. It helps identify patterns and associations within the data.
- Covariance Matrix: A matrix whose entries are the covariances between pairs of variables, computed from each variable's deviations from its mean. The diagonal entries are the variances of the individual variables.
- Principal Component Analysis (PCA): A dimensionality reduction technique that transforms variables into a new set of uncorrelated variables (principal components) to capture most of the variability.
- Multivariate Regression Analysis: Extends linear regression to multiple predictor variables. It models the relationships between a response variable and multiple predictors.
- Cluster Analysis: Groups similar observations into clusters based on the characteristics of multiple variables. It helps identify patterns and similarities within the data.
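To make the covariance and correlation matrices concrete, here is a small sketch in plain Python. The three variables and their values are hypothetical, and sample covariance (n - 1 divisor) is assumed.

```python
from math import sqrt

# Hypothetical data: three variables measured on five observations.
columns = {
    "hours":  [1, 2, 3, 4, 5],
    "score":  [2, 4, 5, 4, 5],
    "errors": [5, 4, 3, 2, 1],
}

def covariance(a, b):
    """Sample covariance from deviations about each variable's mean."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    """Covariance rescaled by both standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

names = list(columns)
cov_matrix = [[covariance(columns[r], columns[c]) for c in names] for r in names]
corr_matrix = [[correlation(columns[r], columns[c]) for c in names] for r in names]

for name, row in zip(names, corr_matrix):
    print(f"{name:>6}", [round(value, 2) for value in row])
```

Both matrices are symmetric; the correlation matrix has ones on its diagonal, while the covariance matrix has the variances there.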
Example: A car manufacturer claims that its new hybrid car model has an average gas mileage of 50 miles per gallon (mpg) or more. A consumer advocacy group is sceptical of this claim and decides to test it. They collect a random sample of 30 cars of the new hybrid model and measure their gas mileage. The sample mean gas mileage is 48 mpg, with a sample standard deviation of 4 mpg. Test the manufacturer's claim at a 5% significance level.
Solution:
Step 1: Define Hypotheses:
- Null Hypothesis (H₀): The average gas mileage of the new hybrid car model is 50 mpg or more. H₀: μ ≥ 50.
- Alternative Hypothesis (H₁): The average gas mileage of the new hybrid car model is less than 50 mpg. H₁: μ < 50.
Step 2: Choose a Significance Level: We are given a 5% significance level (α = 0.05).
Step 3: Collect and Analyze Data:
- Sample size (n) = 30
- Sample mean (x̄) = 48 mpg
- Sample standard deviation (s) = 4 mpg
- Test statistic: t = (x̄ – μ₀) / (s / √n) = (48 – 50) / (4 / √30) ≈ -2.74
Step 4: Determine the Critical Value or P-value: Since this is a one-tailed test (we're testing if the gas mileage is less than 50 mpg), we need to find the critical value or p-value corresponding to the significance level α = 0.05 for a t-distribution with degrees of freedom (df) = n – 1 = 30 – 1 = 29.
Using a t-distribution table or calculator, the critical t-value is approximately -1.699 (for α = 0.05 and df = 29).
Step 5: Make a Decision: Since the calculated t-value (-2.74) is more extreme than the critical t-value (-1.699), we reject the null hypothesis.
Step 6: Interpret the Result: There is sufficient evidence to conclude that the average gas mileage of the new hybrid car model is less than 50 mpg at a 5% significance level.
Conclusion: Based on the sample data and hypothesis test, the consumer advocacy group has enough evidence to reject the manufacturer's claim that the average gas mileage of the new hybrid car model is 50 mpg or more.
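The arithmetic of this example can be checked with a short sketch. The critical value -1.699 is taken from the example itself (t-table, α = 0.05, df = 29) and is not computed here; the variable names are illustrative.

```python
from math import sqrt

# Values from the example.
n = 30              # sample size
sample_mean = 48.0  # mpg
s = 4.0             # sample standard deviation, mpg
mu_0 = 50.0         # mean claimed under the null hypothesis, mpg

# One-sample t statistic: (sample mean - hypothesized mean) / standard error.
standard_error = s / sqrt(n)
t_statistic = (sample_mean - mu_0) / standard_error

df = n - 1          # degrees of freedom
t_critical = -1.699 # left-tail table value quoted in the example

# Left-tailed test: reject H0 when t falls below the critical value.
reject_null = t_statistic < t_critical

print(f"t = {t_statistic:.2f}, df = {df}, reject H0: {reject_null}")
```

The statistic comes out near -2.74, which lies below -1.699, so the decision matches Step 5.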
Key Points
- Null Hypothesis (H₀): The initial assumption or claim that is typically based on existing knowledge or a manufacturer's statement.
- Alternative Hypothesis (H₁ or Hₐ): The statement that contradicts the null hypothesis and represents what you're trying to determine with the test.
- Significance Level (α): The predetermined level of significance used to decide whether to reject the null hypothesis. Common values are 0.05, 0.01, etc.
- One-Tailed Test: A test that looks for an effect in one direction only (less than or greater than a certain value).
- Two-Tailed Test: A test that looks for an effect in either direction (not equal to a certain value).
- Test Statistic: A numerical value calculated from sample data that measures how far the sample results are from what's expected under the null hypothesis.
- P-value: The probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true.
- Critical Value: The threshold test statistic value beyond which you'd reject the null hypothesis, determined by the significance level and the distribution (e.g., t-distribution, z-distribution).
- Degrees of Freedom (df): The number of values in the final calculation of a statistic that are free to vary. For t-distributions, it's typically n – 1 (sample size minus 1).
- Type I Error (α): Rejecting the null hypothesis when it is actually true. The probability of making this error is equal to the chosen significance level.
- Type II Error (β): Failing to reject the null hypothesis when it is actually false. The probability of making this error is denoted as β.
- Critical Region: The set of values that lead to the rejection of the null hypothesis in hypothesis testing. It's based on the chosen significance level.
- P-value Method: Compare the calculated p-value to the significance level. If p-value ≤ α, reject the null hypothesis.
- Comparing Test Statistic and Critical Value: For the critical value method, if the calculated test statistic is more extreme than the critical value, reject the null hypothesis.
- Interpreting Results: Draw conclusions based on whether you reject or fail to reject the null hypothesis, considering the context of the problem.
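The link between the significance level and the Type I error rate can be explored with a small simulation. This is a sketch under stated assumptions: samples are drawn from a normal population for which the null hypothesis is true, the population standard deviation is treated as known so a z-test applies, and the seed, sample size, and number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the run is reproducible
alpha = 0.05            # significance level
n, sigma, mu_0 = 25, 1.0, 0.0  # assumed sample size, known sd, null mean

def two_sided_p_value(sample):
    """P-value for a two-sided z-test of H0: mu = mu_0 with known sigma."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - mu_0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

trials = 2000
rejections = 0
for _ in range(trials):
    # The null hypothesis is true here, so every rejection is a Type I error.
    sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
    if two_sided_p_value(sample) <= alpha:  # p-value method
        rejections += 1

type_i_error_rate = rejections / trials
print(f"Observed Type I error rate: {type_i_error_rate:.3f} (alpha = {alpha})")
```

Across many repetitions the observed rejection rate should settle near α, which is what "the probability of a Type I error equals the significance level" means in practice.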