What is it called when variables are related, connected, or associated with one another?
Let's take a step back and look at what this class is all about. Basically, we are trying to either build theories or test theories. Theories are explanations for why certain variables are related to each other.
What do we mean by variables being related to each other? Fundamentally, it means that the values of one variable correspond to the values of another variable, for each case in the dataset. In other words, knowing the value of one variable, for a given case, helps you to predict the value of the other one. If the variables are perfectly related, then knowing the value of one variable tells you exactly what the value of the other variable is. To actually measure relationships among variables, you have to know what level of measurement the variable is. The level of measurement determines what kinds of mathematical operations can meaningfully be performed on the values of a variable. In this course, we basically deal with just three kinds of relationships: relationships between two numeric variables, relationships between two categorical variables, and relationships between a categorical and a numeric variable.
As you know, many social science variables, such as attitude scales, are really ordinal-level measurements. But there are not many measures of ordinal relationship, and all are beyond the scope of this class. So what do you do? There are two choices: one, treat them as nominal and use chi-square tests; or two, treat them as interval and use correlation and regression. People normally do the latter (treat them as interval).

This chapter is about exploring the associations between pairs of variables in a sample. These are called bivariate associations. An association is any relationship between two variables that makes them dependent, i.e. knowing the value of one variable gives us some information about the possible values of the second variable. The main goal of this chapter is to show how to use descriptive statistics and visualisations to explore associations among different kinds of variables.

Associations between numeric variables

Descriptive statistics

Statisticians have devised various ways to quantify an association between two numeric variables in a sample. The common measures seek to calculate some kind of correlation coefficient. The terms ‘association’ and ‘correlation’ are closely related; so much so that they are often used interchangeably. Strictly speaking, correlation has a narrower definition: a correlation is defined by a metric (the ‘correlation coefficient’) that quantifies the degree to which an association tends to a certain pattern. The most widely used measure of correlation is Pearson’s correlation coefficient (also called the Pearson product-moment correlation coefficient). Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The mathematical formula for Pearson’s correlation coefficient applied to a sample is: \[ r_{xy} = \frac{1}{N-1}\sum\limits_{i=1}^{N}{\frac{x_i-\bar{x}}{s_x} \frac{y_i-\bar{y}}{s_y}} \] We’re using \(x\) and \(y\) here to refer to each of the variables in the sample. The \(r_{xy}\) denotes the correlation coefficient, \(s_x\) and \(s_y\) denote the standard deviation of each sample, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(N\) is the sample size.

Remember, a correlation coefficient quantifies the degree to which an association tends to a certain pattern. In the case of Pearson’s correlation coefficient, the coefficient is designed to summarise the strength of a linear (i.e. ‘straight line’) association. We’ll return to this idea in a moment. Pearson’s correlation coefficient takes a value of 0 if two variables are uncorrelated, and a value of +1 or -1 if they are perfectly related. ‘Perfectly related’ means we can predict the exact value of one variable given knowledge of the other. A positive value indicates that high values of one variable are associated with high values of the second. A negative value indicates that high values of one variable are associated with low values of the second. The words ‘high’ and ‘low’ are relative to the arithmetic mean. In R we can use the `cor` function to calculate Pearson’s correlation coefficient.
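The original code is not shown here, so below is a minimal sketch. It assumes the `storms` data frame bundled with the dplyr package, which has numeric `pressure` and `wind` columns (an assumption; the chapter's exact data set isn't reproduced on this page):

```r
library(dplyr) # provides the `storms` data frame assumed here

# Pearson's correlation between atmospheric pressure and wind speed
cor(storms$pressure, storms$wind)
```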
This is negative, indicating wind speed tends to decline with increasing pressure. It is also quite close to -1, indicating that this association is very strong. We saw this in the Introduction to ggplot2 chapter when we plotted atmospheric pressure against wind speed. The Pearson’s correlation coefficient must be interpreted with care. Two points are worth noting:

1. Because it is designed to summarise the strength of a linear relationship, Pearson’s correlation coefficient can be misleading when the relationship between two variables is curved or hump-shaped.

2. Even when the relationship really is linear, the coefficient tells us nothing about the slope (i.e. the steepness) of that relationship, only how tightly the points cluster around a straight line.
If those last two statements don’t make immediate sense, take a close look at this figure: This shows a variety of different relationships between pairs of numeric variables. The numbers in each subplot are the Pearson correlation coefficients associated with the pattern. Consider each row:

- The first row shows linear relationships of varying strength and direction; the coefficient moves from +1 through 0 to -1 as the scatter around the line increases and the direction of the association flips.
- The second row shows perfect linear relationships that differ only in their slope; the coefficient is +1 or -1 regardless of how steep the line is.
- The third row shows strongly non-linear relationships; the two variables are clearly related in each panel, yet the coefficient is close to 0 because none of the patterns resembles a straight line.
Other measures of correlation

What should we do if we think the relationship between two variables is non-linear? We should not use Pearson’s correlation coefficient to measure association in this case. Instead, we can calculate something called a rank correlation. The idea is quite simple. Instead of working with the actual values of each variable we ‘rank’ them, i.e. we sort each variable from lowest to highest and then assign the labels ‘first’, ‘second’, ‘third’, etc. to different observations. Measures of rank correlation are based on a comparison of the resulting ranks. The two most popular are Spearman’s \(\rho\) (‘rho’) and Kendall’s \(\tau\) (‘tau’). We won’t examine the mathematical formulas for these as they don’t really help us understand them much. We do need to know how to interpret rank correlation coefficients though. The key point is that both coefficients behave in a very similar way to Pearson’s correlation coefficient. They take a value of 0 if the ranks are uncorrelated, and a value of +1 or -1 if they are perfectly related. Again, the sign tells us about the direction of the association. We can calculate both rank correlation coefficients in R using the `cor` function again, this time setting its `method` argument to `"spearman"` or `"kendall"`.
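Again as a sketch, using the same assumed `storms` data as before:

```r
library(dplyr) # for the assumed `storms` data frame

# Rank correlations between pressure and wind speed
cor(storms$pressure, storms$wind, method = "spearman")
cor(storms$pressure, storms$wind, method = "kendall")
```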
These roughly agree with the Pearson correlation coefficient, though Kendall’s \(\tau\) seems to suggest that the relationship is weaker. Kendall’s \(\tau\) is often smaller than Spearman’s \(\rho\). Although Spearman’s \(\rho\) is used more widely, it is more sensitive to errors and discrepancies in the data than Kendall’s \(\tau\).

Graphical summaries

Correlation coefficients give us a simple way to summarise associations between numeric variables. They are limited though, because a single number can never summarise every aspect of the relationship between two variables. This is why we always visualise the relationship between two variables. The standard graph for displaying associations among numeric variables is a scatter plot, using horizontal and vertical axes to plot two variables as a series of points. We saw how to construct scatter plots using ggplot2 in the [Introduction to ggplot2] chapter so we won’t step through the details again. There are a few other options beyond the standard scatter plot. Specifically, ggplot2 provides a couple of different geoms for producing a visual summary of relationships between numeric variables in situations where over-plotting of points is an issue. One option is the `geom_count` function, which counts the number of observations at each distinct location and scales the size of each point by that count.
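For example, a sketch with the same assumed `storms` data and columns:

```r
library(ggplot2)
library(dplyr) # for the assumed `storms` data frame

# One point per distinct (pressure, wind) combination, sized by its count
ggplot(storms, aes(x = pressure, y = wind)) +
  geom_count(alpha = 0.5)
```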
Two further options for dealing with excessive over-plotting are the `geom_bin2d` and `geom_hex` functions. These divide the plotting area into a grid of small rectangles or hexagons, count the number of observations falling inside each one, and use a fill colour scale to display the counts.
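A sketch of the rectangular-binning version; `geom_hex` is used in exactly the same way but requires the hexbin package to be installed:

```r
library(ggplot2)
library(dplyr) # for the assumed `storms` data frame

# Count observations in a grid of rectangles and map counts to fill colour
ggplot(storms, aes(x = pressure, y = wind)) +
  geom_bin2d(bins = 25)
```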
Notice that this looks exactly like the ggplot2 code for making a scatter plot, other than the fact that we’re now using a binning geom in place of `geom_point`.

Associations between categorical variables

Numerical summaries

Numerically exploring associations between pairs of categorical variables is not as simple as the numeric variable case. The general question we need to address is, “do different combinations of categories seem to be under or over represented?” We need to understand which combinations are common and which are rare. The simplest thing we can do is ‘cross-tabulate’ the number of occurrences of each combination. The resulting table is called a contingency table. The counts in the table are sometimes referred to as frequencies. The `xtabs` function (short for ‘cross-tabulation’) can do this for us:
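A sketch, assuming (as in the rest of this chapter) a storms data set with a categorical `type` column and a `year` column; these names are assumptions, so adjust them to match your own data (in dplyr's bundled `storms` data, for instance, the classification column is called `status`):

```r
# Cross-tabulate the number of observations in each type-by-year combination
xtabs(~ type + year, data = storms)
```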
The first argument is a formula specifying the variables to cross-tabulate, and the `data` argument tells `xtabs` which data set to use. What does this tell us? It shows us how many observations are associated with each combination of the values of `type` and `year`.

If both variables are ordinal we can also calculate a descriptive statistic of association from a contingency table. It makes no sense to do this for nominal variables because their values are not ordered. Pearson’s correlation coefficient is not appropriate here. Instead, we have to use some kind of rank correlation coefficient that accounts for the categorical nature of the data. Spearman’s \(\rho\) and Kendall’s \(\tau\) are designed for numeric data, so they can’t be used either. One measure of association that is appropriate for categorical data is Goodman and Kruskal’s \(\gamma\) (“gamma”). This behaves just like the other correlation coefficients we’ve looked at: it takes a value of 0 if the categories are uncorrelated, and a value of +1 or -1 if they are perfectly associated. The sign tells us about the direction of the association. Unfortunately, there isn’t a base R function to compute Goodman and Kruskal’s \(\gamma\), so we have to use a function from one of the packages that implements it (e.g. the `GoodmanKruskalGamma` function in the DescTools package).

Graphical summaries

Bar charts can be used to summarise the relationship between two categorical variables. The basic idea is to produce a separate bar for each combination of categories in the two variables. The lengths of these bars are proportional to the values they represent, which is either the raw counts or the proportions in each category combination. This is the same information displayed in a contingency table. Using ggplot2 to display this information is not very different from producing a bar graph to summarise a single categorical variable. Let’s do this for the `type` and `year` variables of our storms data:
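A sketch of the kind of code that produces this plot, under the same assumed column names as before:

```r
library(ggplot2)

# Bar length = number of observations in each year, split by storm type
ggplot(storms, aes(x = year, fill = type)) +
  geom_bar()
```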
Notice that we’ve included two aesthetic mappings. We mapped the `year` variable to the x axis and the `type` variable to the fill colour.
This is called a stacked bar chart. Each year has its own bar (that’s the `x = year` mapping), and each bar is divided into coloured segments whose lengths are determined by the number of observations associated with each storm type in that year (that’s the `fill = type` mapping). We have all the right information in this graph,
but it could be improved. Look at the labels on the x axis. Not every bar is labelled. This occurs because `year` is stored as a numeric vector, so ggplot2 treats it as a continuous variable and only places labels at a few ‘pretty’ break points. One way to fix this is to convert `year` into something categorical. We can convert a numeric vector to a character vector with the `as.character` function, using dplyr’s `mutate` to apply it inside the data frame:
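A sketch of that step (the name `storms_alter` is just illustrative):

```r
library(dplyr)

# Copy of the data in which year is a character vector rather than a number
storms_alter <- mutate(storms, year = as.character(year))
```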
We must load and attach dplyr to make this work. The new data frame (called `storms_alter` in the sketch above) is identical to the original, except that `year` is now a character vector. Now we just need to construct and display the ggplot2 object again using this new data frame:
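Continuing the sketch:

```r
# `year` is now categorical, so every bar gets its own axis label
ggplot(storms_alter, aes(x = year, fill = type)) +
  geom_bar()
```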
That’s an improvement. However, the ordering of the storm categories is not ideal because the order in which the different groups are presented does not reflect the ordinal scale we have in mind for storm category. We saw this same problem in the Exploring categorical variables chapter: ggplot2 does not ‘know’ the correct order of the storm categories, so it falls back on a default (alphabetical) ordering. We need to somehow embed the information about the required category order of `type` in the data itself. We can do this by converting `type` into a factor with explicitly ordered levels:
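A sketch of the conversion. The level names below are illustrative guesses and must exactly match the values that actually appear in `type`:

```r
library(dplyr)

# The required ordering, from 'weakest' to 'strongest' storm category
cat_names <- c("Extratropical", "Tropical Depression",
               "Tropical Storm", "Hurricane")

# Convert type from a character vector to a factor with ordered levels
storms_alter <- mutate(storms_alter, type = factor(type, levels = cat_names))
```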
This may look a little confusing at first glance, but all we did here was create a vector of the category names, in the required order, and then use `mutate` and the `factor` function to convert `type` from a character vector into a factor with those levels.
Factors

Factors are very useful. They crop up all the time in R. Unfortunately, they are also a pain to work with and a frequent source of errors. A complete treatment of factors would require a whole new chapter, so to save space, we’ve just shown one way to work with them via the `factor` function.

A stacked bar chart is the default produced by `geom_bar`. If we would rather place the bars for the different storm types side by side within each year, we can set the `position` argument of `geom_bar` to `"dodge"`:
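A sketch of this variant:

```r
# Side-by-side ('dodged') bars instead of stacked segments
ggplot(storms_alter, aes(x = year, fill = type)) +
  geom_bar(position = "dodge")
```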
This final figure shows that on average, storm systems spend more time as hurricanes and tropical storms than as tropical depressions or extratropical systems. Other than that, the story is a little messy. For example, 1997 was an odd year, with few storm events and relatively few hurricanes.

Categorical-numerical associations

We’ve seen how to summarise the relationship between a pair of variables when they are of the same type: numeric vs. numeric or categorical vs. categorical. The obvious next question is, “How do we display the relationship between a categorical and a numeric variable?” As usual, there is a range of different options.

Descriptive statistics

Numerical summaries can be constructed by taking the various ideas we’ve explored for numeric variables (means, medians, etc.) and applying them to subsets of data defined by the values of the categorical variable. This is easy to do with the dplyr `group_by` and `summarise` functions.

Graphical summaries

The most common visualisation for exploring categorical-numerical relationships is the ‘box and whiskers plot’ (or just ‘box plot’). It’s easier to understand these plots once we’ve seen an example. To construct a box and whiskers plot we need to set the ‘x’ and ‘y’ axis aesthetics for the categorical and numeric variables, and we use the `geom_boxplot` function to add the appropriate layer:
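A sketch, plotting the distribution of atmospheric pressure within each storm category (again assuming `type` and `pressure` columns):

```r
library(ggplot2)

# One box-and-whiskers summary of pressure per storm category
ggplot(storms, aes(x = type, y = pressure)) +
  geom_boxplot()
```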
It’s fairly obvious why this is called a box and whiskers plot. Here’s a quick overview of the component parts of each box and whiskers:

- The horizontal line inside the box is the sample median.
- The box spans the interquartile range: its lower and upper edges are the first and third quartiles, so the middle 50% of observations fall inside it.
- The whiskers extend from each edge of the box to the most extreme observation lying within 1.5 times the interquartile range of that edge.
- Observations beyond the whiskers are plotted as individual points; these are candidate outliers.
The resulting plot compactly summarises the distribution of the numeric variable within each of the categories. We can see information about the central tendency, dispersion and skewness of each distribution. In addition, we can get a sense of whether there are potential outliers by noting the presence of individual points outside the whiskers. What does the above plot tell us about atmospheric pressure and storm type? It shows that typical atmospheric pressure declines as we move through the storm categories, with hurricanes associated with the lowest pressures, which is consistent with the strong negative correlation between pressure and wind speed we found earlier.

Alternatives to box and whiskers plots

Box and whiskers plots are a good choice for exploring categorical-numerical relationships. They provide a lot of information about how the distribution of the numeric variable changes across categories. Sometimes we may want to squeeze even more information about these distributions into a plot. One way to do this is to make multiple histograms (or dot plots, if we don’t have much data). We
already know how to make a histogram, and we have seen how aesthetic properties such as `fill` can be used to distinguish the different categories of a variable within a layer. Let’s use these ideas to overlay the wind speed histograms of the different storm types:
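A sketch, assuming the same `wind` and `type` columns as before:

```r
library(ggplot2)

# Overlay the wind-speed histogram of each storm type; position = "identity"
# stops ggplot2 from stacking the bars, and alpha makes the overlap visible
ggplot(storms, aes(x = wind, fill = type)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 5)
```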
We define two mappings: the continuous variable (`wind`) is mapped to the x axis, and the categorical variable (`type`) is mapped to the fill colour. Plotting several histograms in one layer like this places a lot of information in one plot, but it can be hard to make sense of this when the histograms overlap a lot. If the overlapping histograms are too difficult to interpret we might consider producing a separate one for each category. We’ve already seen a quick way to do this. Faceting works well here:
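A sketch of the faceted version:

```r
library(ggplot2)

# One wind-speed histogram per storm category, stacked in a single column
# so the x axes line up and the distributions are easy to compare
ggplot(storms, aes(x = wind)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ type, ncol = 1)
```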
We can see quite a lot in this plot and the last. The tropical depression, tropical storm, and hurricane histograms do not overlap (with a few minor exceptions). These three storm categories are obviously defined with respect to wind speed. Perhaps they represent different phases of one underlying physical phenomenon? The extratropical storm system seems to be something altogether different. In fact, an extratropical storm is a different kind of weather system from the other three. It can turn into a tropical depression (winds < 39 mph) or a subtropical storm (winds > 39 mph), but only a subtropical storm can turn into a hurricane. We’re oversimplifying, but the point is that the simple ordinal scale we envisaged for the `type` variable doesn’t fully describe how these storm categories relate to one another.

What does it mean when variables are associated?

Association between two variables means the values of one variable relate in some way to the values of the other. It is usually measured by correlation for two continuous variables and by cross-tabulation and a chi-square test for two categorical variables.
What does it mean for two variables to have an association?

Two variables have a positive association (correlation) when the values of one variable tend to increase as the values of the other variable increase. A perfect positive association means the relationship holds for every observation: knowing the value of one variable tells you exactly what the value of the other will be, and the two always change in the same direction.
What is it called when two variables move in tandem?

When two variables move in tandem, they are said to have a positive correlation. Though one variable may not directly influence the other, the two variables may at least change in the same direction.