What is it called when variables are related, connected, or associated with one another?

Let's take a step back and look at what this class is all about. Basically, we are trying to either build theories or test theories. Theories are explanations for why certain variables are related to each other.

What do we mean by variables being related to each other? Fundamentally, it means that the values of one variable correspond to the values of another variable, for each case in the dataset. In other words, knowing the value of one variable, for a given case, helps you to predict the value of the other one. If the variables are perfectly related, then knowing the value of one variable tells you exactly what the value of the other variable is.

To actually measure relationships among variables, you have to know what level of measurement the variable is. The level of measurement determines what kinds of mathematical operations can meaningfully be performed on the values of a variable. In this course, we basically deal with just three kinds of relationships:

Variables | Test for Relationship | Example
Both variables are nominal level | Chi-square test | See which divisions have the most female employees
Independent variable is nominal; dependent variable is interval or ratio | T-test (if the independent variable has only 2 categories); ANOVA | Test hypothesis that male employees are more satisfied than female employees
Both variables are interval level | Correlation; Regression | Look at relationship between job satisfaction and salary level
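To make the table concrete, here is a minimal sketch of how each of these tests might be run in R. It assumes a purely hypothetical data frame called employees, with nominal columns division and gender and interval/ratio columns satisfaction and salary; none of these appear in the data used later in this chapter.

# Both variables nominal: chi-square test on a contingency table
chisq.test(xtabs(~ division + gender, data = employees))

# Nominal independent variable with 2 categories, interval dependent variable: t-test
t.test(satisfaction ~ gender, data = employees)

# Nominal independent variable with 3+ categories, interval dependent variable: ANOVA
summary(aov(satisfaction ~ division, data = employees))

# Both variables interval: correlation (regression would use lm)
cor(employees$salary, employees$satisfaction)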

As you know, many social science variables, such as attitude scales, are really ordinal level measurements. But there are not many measures of ordinal relationship, and all are beyond the scope of this class. So what do you do? There are two choices: one, treat them as nominal and use chi-square tests, or two, treat them as interval and use correlation and regression. People normally do the latter (treat them as interval).

This chapter is about exploring the associations between pairs of variables in a sample. These are called bivariate associations. An association is any relationship between two variables that makes them dependent, i.e. knowing the value of one variable gives us some information about the possible values of the second variable. The main goal of this chapter is to show how to use descriptive statistics and visualisations to explore associations among different kinds of variables.

Associations between numeric variables

Descriptive statistics

Statisticians have devised various ways to quantify an association between two numeric variables in a sample. The common measures seek to calculate some kind of correlation coefficient. The terms ‘association’ and ‘correlation’ are closely related; so much so that they are often used interchangeably. Strictly speaking, correlation has a narrower definition: a correlation is defined by a metric (the ‘correlation coefficient’) that quantifies the degree to which an association tends to a certain pattern.

The most widely used measure of correlation is Pearson’s correlation coefficient (also called the Pearson product-moment correlation coefficient). Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The mathematical formula for Pearson’s correlation coefficient applied to a sample is: \[ r_{xy} = \frac{1}{N-1}\sum\limits_{i=1}^{N}{\frac{x_i-\bar{x}}{s_x} \frac{y_i-\bar{y}}{s_y}} \] We’re using \(x\) and \(y\) here to refer to each of the variables in the sample. The \(r_{xy}\) denotes the correlation coefficient, \(s_x\) and \(s_y\) denote the standard deviation of each sample, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(N\) is the sample size.
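As a quick check on the formula, here is a minimal sketch that computes the coefficient ‘by hand’ from standardised values and compares it with R’s built-in cor function, using the wind and pressure variables from the storms data introduced below (and assuming those columns contain no missing values):

x <- storms$wind
y <- storms$pressure
n <- length(x)
# sum the products of the standardised values, then divide by N - 1
r_xy <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)
r_xy
cor(x, y)  # should agree with the 'by hand' version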

Remember, a correlation coefficient quantifies the degree to which an association tends to a certain pattern. In the case of Pearson’s correlation coefficient, the coefficient is designed to summarise the strength of a linear (i.e. ‘straight line’) association. We’ll return to this idea in a moment.

Pearson’s correlation coefficient takes a value of 0 if two variables are uncorrelated, and a value of +1 or -1 if they are perfectly related. ‘Perfectly related’ means we can predict the exact value of one variable given knowledge of the other. A positive value indicates that high values of one variable are associated with high values of the second. A negative value indicates that high values of one variable are associated with low values of the second. The words ‘high’ and ‘low’ are relative to the arithmetic mean.

In R we can use the cor function to calculate Pearson’s correlation coefficient. For example, the Pearson correlation coefficient between pressure and wind is given by:

cor(storms$wind, storms$pressure)

## [1] -0.9254911

This is negative, indicating wind speed tends to decline with increasing pressure. It is also quite close to -1, indicating that this association is very strong. We saw this in the Introduction to ggplot2 chapter when we plotted atmospheric pressure against wind speed.

The Pearson’s correlation coefficient must be interpreted with care. Two points are worth noting:

  1. Because it is designed to summarise the strength of a linear relationship, Pearson’s correlation coefficient will be misleading when this relationship is curved, or even worse, hump-shaped.

  2. Even if the relationship between two variables really is linear, Pearson’s correlation coefficient tells us nothing about the slope (i.e. the steepness) of the relationship.

If those last two statements don’t make immediate sense, take a close look at this figure:

This shows a variety of different relationships between pairs of numeric variables. The numbers in each subplot are the Pearson’s correlation coefficients associated with the pattern. Consider each row:

  1. The first row shows a series of linear relationships that vary in their strength and direction. These are all linear in the sense that the general form of the relationship can be described by a straight line. This means that it is appropriate to use Pearson’s correlation coefficient in these cases to quantify the strength of association, i.e. the coefficient is a reliable measure of association.

  2. The second row shows a series of linear relationships that vary in their direction, but are all examples of a perfect relationship—we can predict the exact value of one variable given knowledge of the other. What these plots show is that Pearson’s correlation coefficient measures the strength of association without telling us anything about the steepness of the relationship.

  3. The third row shows a series of different cases where it is definitely inappropriate to use Pearson’s correlation coefficient. In each case, the variables are related to one another in some way, yet the correlation coefficient is always 0. Pearson’s correlation coefficient completely fails to flag the relationship because it is not even close to being linear.

Other measures of correlation

What should we do if we think the relationship between two variables is non-linear? We should not use Pearson’s correlation coefficient to measure association in this case. Instead, we can calculate something called a rank correlation. The idea is quite simple. Instead of working with the actual values of each variable we ‘rank’ them, i.e. we sort each variable from lowest to highest and then assign the labels ‘first’, ‘second’, ‘third’, etc. to different observations. Measures of rank correlation are based on a comparison of the resulting ranks. The two most popular are Spearman’s \(\rho\) (‘rho’) and Kendall’s \(\tau\) (‘tau’).
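A minimal sketch of the ranking idea, again using the storms data and assuming no missing values: Spearman’s \(\rho\) is simply Pearson’s correlation coefficient calculated on the ranks.

# rank() replaces each value with its position when sorted from lowest to highest
cor(rank(storms$wind), rank(storms$pressure))
# ...which is the same as asking cor for Spearman's rho directly
cor(storms$wind, storms$pressure, method = "spearman")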

We won’t examine the mathematical formula for each of these as they don’t really help us understand them much. We do need to know how to interpret rank correlation coefficients though. The key point is that both coefficients behave in a very similar way to Pearson’s correlation coefficient. They take a value of 0 if the ranks are uncorrelated, and a value of +1 or -1 if they are perfectly related. Again, the sign tells us about the direction of the association.

We can calculate both rank correlation coefficients in R using the cor function again. This time we need to set the method argument to the appropriate value: method = "kendall" or method = "spearman". For example, the Spearman’s \(\rho\) and Kendall’s \(\tau\) measures of correlation between pressure and wind are given by:

cor[storms$wind, storms$pressure, method = "kendall"]

## [1] -0.7627645

cor[storms$wind, storms$pressure, method = "spearman"]

## [1] -0.9025831

These roughly agree with the Pearson correlation coefficient, though Kendall’s \(\tau\) seems to suggest that the relationship is weaker. Kendall’s \(\tau\) is often smaller than Spearman’s \(\rho\) correlation. Although Spearman’s \(\rho\) is used more widely, it is more sensitive to errors and discrepancies in the data than Kendall’s \(\tau\).

Graphical summaries

Correlation coefficients give us a simple way to summarise associations between numeric variables. They are limited though, because a single number can never summarise every aspect of the relationship between two variables. This is why we always visualise the relationship between two variables. The standard graph for displaying associations among numeric variables is a scatter plot, using horizontal and vertical axes to plot two variables as a series of points. We saw how to construct scatter plots using ggplot2 in the Introduction to ggplot2 chapter so we won’t step through the details again.

There are a few other options beyond the standard scatter plot. Specifically, ggplot2 provides a couple of different geom_XX functions for producing a visual summary of relationships between numeric variables in situations where over-plotting of points is obscuring the relationship. One such example is the geom_count function:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_count(alpha = 0.5)

The geom_count function is used to construct a layer in which data are first grouped into sets of identical observations. The number of cases in each group is counted, and this number (‘n’) is used to scale the size of points. Take note—it may be necessary to round numeric variables first (e.g. via mutate) to make a usable plot if they aren’t already discrete.
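For example, if wind and pressure took many distinct values, we might round them to the nearest 10 before plotting so that identical observations can be grouped. This is only a sketch of the idea (it assumes dplyr is loaded for mutate); the appropriate rounding depends on the data.

storms_rounded <- mutate(storms, wind = round(wind, -1), pressure = round(pressure, -1))
ggplot(storms_rounded, aes(x = pressure, y = wind)) +
  geom_count(alpha = 0.5)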

Two further options for dealing with excessive over-plotting are the geom_bin_2d and geom_hex functions. The geom_bin_2d function divides the plane into rectangles, counts the number of cases in each rectangle, and then uses the number of cases to assign the rectangle’s fill colour. The geom_hex function does essentially the same thing, but instead divides the plane into regular hexagons. Note that geom_hex relies on the hexbin package, so this needs to be installed to use it. Here’s an example of geom_hex in action:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_hex(bins = 25)

Notice that this looks exactly like the ggplot2 code for making a scatter plot, other than the fact that we’re now using geom_hex in place of geom_point.
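The rectangle-based version follows exactly the same template; only the geom changes (this assumes a version of ggplot2 recent enough to provide geom_bin_2d; older versions call it geom_bin2d). The bins argument controls how finely the plane is divided:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_bin_2d(bins = 25)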

Associations between categorical variables

Numerical summaries

Numerically exploring associations between pairs of categorical variables is not as simple as the numeric variable case. The general question we need to address is, “do different combinations of categories seem to be under or over represented?” We need to understand which combinations are common and which are rare. The simplest thing we can do is ‘cross-tabulate’ the number of occurrences of each combination. The resulting table is called a contingency table. The counts in the table are sometimes referred to as frequencies.

The xtabs function (xtabs = ‘cross-tabulation’) can do this for us. For example, the frequencies of each storm category and month combination are given by:

xtabs(~ type + month, data = storms)

##                      month
## type                    6   7   8   9  10  11  12
##   Extratropical        27  38  23 149 129  42   4
##   Hurricane             3  31 300 383 152  25   2
##   Tropical Depression  22  59 150 156  84  42   0
##   Tropical Storm       31 123 247 259 204  61   1

The first argument sets the variables to cross-tabulate. The xtabs function uses R’s special formula language, so we can’t leave out that ~ at the beginning. After that, we just provide the list of variables to cross-tabulate, separated by the + sign. The second argument tells the function which data set to use. This isn’t a dplyr function, so the first argument is not the data for once.

What does this tell us? It shows us how many observations are associated with each combination of values of type and month. We have to stare at the numbers for a while, but eventually it should be apparent that hurricanes and tropical storms are more common in August and September (months ‘8’ and ‘9’). More severe storms occur in the middle of the storm season—perhaps not all that surprising.
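If staring at raw counts is hard work, one option (not used elsewhere in this chapter) is to convert them to proportions within each month using prop.table, which makes the columns directly comparable:

# proportions within each month (margin = 2 means 'within columns')
round(prop.table(xtabs(~ type + month, data = storms), margin = 2), 2)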

If both variables are ordinal we can also calculate a descriptive statistic of association from a contingency table. It makes no sense to do this for nominal variables because their values are not ordered. Pearson’s correlation coefficient is not appropriate here. Instead, we have to use some kind of rank correlation coefficient that accounts for the categorical nature of the data. Spearman’s \(\rho\) and Kendall’s \(\tau\) are designed for numeric data, so they can’t be used either.

One measure of association that is appropriate for categorical data is Goodman and Kruskal’s \(\gamma\) (‘gamma’). This behaves just like the other correlation coefficients we’ve looked at: it takes a value of 0 if the categories are uncorrelated, and a value of +1 or -1 if they are perfectly associated. The sign tells us about the direction of the association. Unfortunately, there isn’t a base R function to compute Goodman and Kruskal’s \(\gamma\), so we have to use a function from one of the packages that implements it (e.g. the GKgamma function in the vcdExtra package) if we need it.
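A minimal sketch of how this might look, assuming the vcdExtra package is installed and (a further assumption) that we are happy to treat type as ordered by storm severity:

library(vcdExtra)
# GKgamma expects a two-way table whose rows and columns are in a meaningful order,
# so the levels of type would first need to be reordered by severity
GKgamma(xtabs(~ type + month, data = storms))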

Graphical summaries

Bar charts can be used to summarise the relationship between two categorical variables. The basic idea is to produce a separate bar for each combination of categories in the two variables. The lengths of these bars are proportional to the values they represent, which are either the raw counts or the proportions in each category combination. This is the same information displayed in a contingency table. Using ggplot2 to display this information is not very different from producing a bar graph to summarise a single categorical variable.

Let’s do this for the type and year variables in storms, breaking the process up into two steps. As always, we start by using the ggplot function to construct a graphical object containing the necessary default data and aesthetic mapping:

bar_plt <- ggplot(storms, aes(x = year, fill = type))
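A sketch of what the second step might look like: add a bar geom and print the object. Using position = "dodge" is one reasonable choice here, drawing a separate bar for each storm type within each year.

bar_plt <- bar_plt + geom_bar(position = "dodge")
bar_plt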
