What assumptions are necessary to perform a large sample test for the difference between two population means?
Suppose we wish to compare the means of two distinct populations. Figure \(\PageIndex{1}\) illustrates the conceptual framework of our investigation in this and the next section. Each population has a mean and a standard deviation. We arbitrarily label one population as Population \(1\) and the other as Population \(2\), and subscript the parameters with the numbers \(1\) and \(2\) to tell them apart. We draw a random sample from Population \(1\) and label the sample statistics it yields with the subscript \(1\). Without reference to the first sample we draw a sample from Population \(2\) and label its sample statistics with the subscript \(2\).
Figure \(\PageIndex{1}\): Independent Sampling from Two Populations
Definition: Independence
Samples from two distinct populations are independent if each one is drawn without reference to the other, and has no connection with the other.
Our goal is to use the information in the samples to estimate the difference \(\mu _1-\mu _2\) in the means of the two populations and to make statistically valid inferences about it.
Confidence Intervals
Since the mean \(\bar{x_1}\) of the sample drawn from Population \(1\) is a good estimator of \(\mu _1\) and the mean \(\bar{x_2}\) of the sample drawn from Population \(2\) is a good estimator of \(\mu _2\), a reasonable point estimate of the difference \(\mu _1-\mu _2\) is \(\bar{x_1}-\bar{x_2}\). In order to widen this point estimate into a confidence interval, we first suppose that both samples are large, that is, that both \(n_1\geq 30\) and \(n_2\geq 30\). If so, then the following formula for a confidence interval for \(\mu _1-\mu _2\) is valid. The symbols \(s_{1}^{2}\) and \(s_{2}^{2}\) denote the squares of \(s_1\) and \(s_2\). (In the relatively rare case that both population standard deviations \(\sigma _1\) and \(\sigma _2\) are known, they would be used instead of the sample standard deviations.)
\(100(1-\alpha )\%\) Confidence Interval for the Difference Between Two Population Means: Large, Independent Samples
\[(\bar{x_1}-\bar{x_2})\pm z_{\alpha /2}\sqrt{\frac{s_{1}^{2}}{n_1}+\frac{s_{2}^{2}}{n_2}}\]
The samples must be independent, and each sample must be large: \(n_1\geq 30\) and \(n_2\geq 30\).
Example \(\PageIndex{1}\)
To compare customer satisfaction levels of two competing cable television companies, \(174\) customers of Company \(1\) and \(355\) customers of Company \(2\) were randomly selected and were asked to rate their cable companies on a five-point scale, with \(1\) being least satisfied and \(5\) most satisfied. The survey results are summarized in the following table:
Company \(1\): \(n_1=174\), \(\bar{x_1}=3.51\), \(s_1=0.51\)
Company \(2\): \(n_2=355\), \(\bar{x_2}=3.24\), \(s_2=0.52\)
Construct a point estimate and a \(99\%\) confidence interval for \(\mu _1-\mu _2\), the difference in average satisfaction levels of customers of the two companies as measured on this five-point scale.
Solution:
The point estimate of \(\mu _1-\mu _2\) is \[\bar{x_1}-\bar{x_2}=3.51-3.24=0.27\] In words, we estimate that the average customer satisfaction level for Company \(1\) is \(0.27\) points higher on this five-point scale than it is for Company \(2\).
To apply the formula for the confidence interval, proceed exactly as was done in Chapter 7. The \(99\%\) confidence level means that \(\alpha =1-0.99=0.01\) so that \(z_{\alpha /2}=z_{0.005}\). From Figure 7.1.6 "Critical Values of " we read directly that \(z_{0.005}=2.576\). Thus \[(\bar{x_1}-\bar{x_2})\pm z_{\alpha /2}\sqrt{\frac{s_{1}^{2}}{n_1}+\frac{s_{2}^{2}}{n_2}}=0.27\pm 2.576\sqrt{\frac{0.51^{2}}{174}+\frac{0.52^{2}}{355}}=0.27\pm 0.12\]
We are \(99\%\) confident that the difference in the population means lies in the interval \([0.15,0.39]\), in the sense that in repeated sampling \(99\%\) of all intervals constructed from the sample data in this manner will contain \(\mu _1-\mu _2\). In the context of the problem we say we are \(99\%\) confident that the average level of customer satisfaction for Company \(1\) is between \(0.15\) and \(0.39\) points higher, on this five-point scale, than that for Company \(2\).
Hypothesis Testing
Hypotheses concerning the relative sizes of the means of two populations are tested using the same critical value and \(p\)-value procedures that were used in the case of a single population. All that is needed is to know how to express the null and alternative hypotheses and to know the formula for the standardized test statistic and the distribution that it follows. The null and alternative hypotheses will always be expressed in terms of the difference of the two population means.
Thus the null hypothesis will always be written \[H_0: \mu _1-\mu _2=D_0\] where \(D_0\) is a number that is deduced from the statement of the situation. As was the case with a single population, the alternative hypothesis can take one of three forms, with the same terminology:
\(H_a: \mu _1-\mu _2<D_0\) (left-tailed test)
\(H_a: \mu _1-\mu _2>D_0\) (right-tailed test)
\(H_a: \mu _1-\mu _2\neq D_0\) (two-tailed test)
As long as the samples are independent and both are large, the following formula for the standardized test statistic is valid, and it has the standard normal distribution. (In the relatively rare case that both population standard deviations \(\sigma _1\) and \(\sigma _2\) are known, they would be used instead of the sample standard deviations.)
Standardized Test Statistic for Hypothesis Tests Concerning the Difference Between Two Population Means: Large, Independent Samples
\[Z=\frac{(\bar{x_1}-\bar{x_2})-D_0}{\sqrt{\frac{s_{1}^{2}}{n_1}+\frac{s_{2}^{2}}{n_2}}}\]
The test statistic has the standard normal distribution. The samples must be independent, and each sample must be large: \(n_1\geq 30\) and \(n_2\geq 30\).
Example \(\PageIndex{2}\)
Refer to Example \(\PageIndex{1}\) concerning the mean satisfaction levels of customers of two competing cable television companies. Test at the \(1\%\) level of significance whether the data provide sufficient evidence to conclude that Company \(1\) has a higher mean satisfaction rating than does Company \(2\). Use the critical value approach.
Solution:
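Step 1. The hypotheses are expressed in terms of the difference of the two population means:
\[H_0: \mu _1-\mu _2=0\] \[vs.\] \[H_a: \mu _1-\mu _2>0\; \; @\; \; \alpha =0.01\]
Step 2. Since the samples are independent and both are large, the test statistic is
\[Z=\frac{(\bar{x_1}-\bar{x_2})-D_0}{\sqrt{\frac{s_{1}^{2}}{n_1}+\frac{s_{2}^{2}}{n_2}}}\]
Step 3. Inserting the data into the formula for the test statistic gives
\[Z=\frac{(\bar{x_1}-\bar{x_2})-D_0}{\sqrt{\frac{s_{1}^{2}}{n_1}+\frac{s_{2}^{2}}{n_2}}}=\frac{(3.51-3.24)-0}{\sqrt{\frac{0.51^{2}}{174}+\frac{0.52^{2}}{355}}}=5.684\]
Step 4. Since the symbol in \(H_a\) is \(>\), this is a right-tailed test with the single critical value \(z_{\alpha }=z_{0.01}=2.326\), so the rejection region is \([2.326,\infty )\).
Step 5. The test statistic \(5.684\) falls in the rejection region, so the decision is to reject \(H_0\). The data provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the mean customer satisfaction for Company \(1\) is higher than that for Company \(2\).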
Figure \(\PageIndex{2}\): Rejection Region and Test Statistic for Example \(\PageIndex{2}\)
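The arithmetic of Examples \(\PageIndex{1}\) and \(\PageIndex{2}\) can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and using only the standard library; the function names (`standard_error`, `confidence_interval`, `z_statistic`) are illustrative choices and not part of the text. It takes the critical values from the standard normal quantile function rather than from a printed table, so they carry more decimal places than \(2.576\) and \(2.326\).

```python
from math import sqrt
from statistics import NormalDist


def standard_error(s1, n1, s2, n2):
    """Estimated standard deviation of x1_bar - x2_bar for independent samples."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def confidence_interval(x1_bar, s1, n1, x2_bar, s2, n2, level):
    """Large-sample confidence interval for mu_1 - mu_2 at the given level."""
    alpha = 1 - level
    z_half = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    diff = x1_bar - x2_bar
    margin = z_half * standard_error(s1, n1, s2, n2)
    return diff - margin, diff + margin


def z_statistic(x1_bar, s1, n1, x2_bar, s2, n2, d0=0.0):
    """Standardized test statistic for H0: mu_1 - mu_2 = d0."""
    return ((x1_bar - x2_bar) - d0) / standard_error(s1, n1, s2, n2)


# Summary statistics used in Examples 1 and 2.
x1_bar, s1, n1 = 3.51, 0.51, 174
x2_bar, s2, n2 = 3.24, 0.52, 355

low, high = confidence_interval(x1_bar, s1, n1, x2_bar, s2, n2, level=0.99)
z = z_statistic(x1_bar, s1, n1, x2_bar, s2, n2, d0=0.0)

# Right-tailed test at alpha = 0.01: reject H0 when z exceeds z_alpha.
alpha = 0.01
z_crit = NormalDist().inv_cdf(1 - alpha)
reject = z > z_crit

print(f"99% CI: [{low:.2f}, {high:.2f}]")
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Both examples use the same standard error, so the sketch computes it in a single helper. The only differences between the interval and the test are the quantile used and the value \(D_0\) subtracted from the point estimate.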
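Example \(\PageIndex{3}\)
Perform the test of Example \(\PageIndex{2}\) using the \(p\)-value approach.
Solution:
The first three steps are identical to those in Example \(\PageIndex{2}\).
Step 4. The observed significance or \(p\)-value of the test is the area of the right tail of the standard normal distribution that is cut off by the test statistic \(Z=5.684\). This area is \(0.0000\) to four decimal places.
Step 5. Since \(p<\alpha =0.01\), the decision is to reject the null hypothesis, the same conclusion as in Example \(\PageIndex{2}\).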
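The \(p\)-value approach of Example \(\PageIndex{3}\) can be sketched the same way. This is a hedged illustration, not the text's own procedure: the helper name `right_tailed_p_value` is hypothetical, and the tail area is computed with the complementary error function from Python's standard `math` module instead of being read from a normal table.

```python
from math import erfc, sqrt


def right_tailed_p_value(z):
    """Area under the standard normal curve to the right of z.

    Uses P(Z >= z) = 0.5 * erfc(z / sqrt(2)), which stays accurate
    even when the tail area is extremely small.
    """
    return 0.5 * erfc(z / sqrt(2))


# Test statistic from Example 2, recomputed from the summary statistics.
z = ((3.51 - 3.24) - 0) / sqrt(0.51 ** 2 / 174 + 0.52 ** 2 / 355)

alpha = 0.01
p_value = right_tailed_p_value(z)
reject = p_value < alpha

print(f"Z = {z:.3f}, p-value = {p_value:.2e}, reject H0: {reject}")
```

Because the test statistic lies more than five standard deviations above zero, the tail area rounds to zero at four decimal places, so the decision does not depend on the exact value.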
What are the assumptions for testing the difference between two means?
The two populations have the same variance. This assumption is called the assumption of homogeneity of variance. The populations are normally distributed. Each value is sampled independently from each other value.
What assumptions are required to use the two sample test of means?
Assumptions: The observations are independent. The two groups are independent. The two populations from which the data are sampled are each normally distributed.
What assumptions are necessary for testing the difference in the two population variances?
Assumptions for using the t-statistic to test for a difference between the means of two populations: There should be homogeneity of variance. The two populations need to be independent. The normality assumption must hold.
What are the assumptions of large sample theory?
A larger sample size means the distribution of results should approach a normal bell-shaped curve. The final assumption is homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of samples are approximately equal.
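The claim that larger samples push results toward a bell-shaped curve can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than anything taken from the text: it draws repeated samples from an exponential population with mean \(1\) and standard deviation \(1\) (a deliberately skewed choice), and the function name, sample size, repetition count, and seed are all arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_means(n, reps, seed=0):
    """Return `reps` sample means, each from n draws of an Exponential(1) population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]


n, reps = 50, 2000
means = simulate_sample_means(n, reps)

center = mean(means)                 # should be near the population mean, 1
spread = stdev(means)                # should be near sigma / sqrt(n)
theoretical_spread = 1.0 / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean;
# for an approximately normal sampling distribution this is close to 0.95.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_spread for m in means) / reps

print(f"mean of sample means: {center:.3f}")
print(f"std of sample means:  {spread:.3f} (theory: {theoretical_spread:.3f})")
print(f"share within 1.96 SE: {within:.3f}")
```

Even though the population is strongly skewed, with \(n=50\) the sample means cluster around the population mean with spread close to \(\sigma /\sqrt{n}\), which is the behavior the large-sample procedures rely on.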