What is it called when variables are related, connected, or associated with one another?

Let's take a step back and look at what this class is all about. Basically, we are trying to either build theories or test theories. Theories are explanations for why certain variables are related to each other.

What do we mean by variables being related to each other? Fundamentally, it means that the values of one variable correspond to the values of another variable, for each case in the dataset. In other words, knowing the value of one variable, for a given case, helps you to predict the value of the other one. If the variables are perfectly related, then knowing the value of one variable tells you exactly what the value of the other variable is.

To actually measure relationships among variables, you have to know what level of measurement the variable is. The level of measurement determines what kinds of mathematical operations can meaningfully be performed on the values of a variable. In this course, we basically deal with just three kinds of relationships:

Variables | Test for Relationship | Example
Both variables are nominal level | Chi-square test | See which divisions have the most female employees
Independent variable is nominal; dependent variable is interval or ratio | T-test (if the independent variable has only 2 categories); ANOVA | Test hypothesis that male employees are more satisfied than female employees
Both variables are interval level | Correlation; Regression | Look at relationship between job satisfaction and salary level
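To make the table concrete, here is a minimal sketch of how each of these tests might be run in R. It assumes a purely hypothetical data frame called employees, with nominal columns division and gender and interval/ratio columns satisfaction and salary; none of these appear in the data used later in this chapter.

# Both variables nominal: chi-square test on a contingency table
chisq.test(xtabs(~ division + gender, data = employees))

# Nominal independent variable with 2 categories, interval dependent variable: t-test
t.test(satisfaction ~ gender, data = employees)

# Nominal independent variable with 3+ categories, interval dependent variable: ANOVA
summary(aov(satisfaction ~ division, data = employees))

# Both variables interval: correlation (regression would use lm)
cor(employees$salary, employees$satisfaction)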

As you know, many social science variables, such as attitude scales, are really ordinal level measurements. But there are not many measures of ordinal relationship, and all are beyond the scope of this class. So what do you do? There are two choices: one, treat them as nominal and use chi-square tests, or two, treat them as interval and use correlation and regression. People normally do the latter (treat them as interval).

This chapter is about exploring the associations between pairs of variables in a sample. These are called bivariate associations. An association is any relationship between two variables that makes them dependent, i.e. knowing the value of one variable gives us some information about the possible values of the second variable. The main goal of this chapter is to show how to use descriptive statistics and visualisations to explore associations among different kinds of variables.

Associations between numeric variables

Descriptive statistics

Statisticians have devised various ways to quantify an association between two numeric variables in a sample. The common measures seek to calculate some kind of correlation coefficient. The terms ‘association’ and ‘correlation’ are closely related; so much so that they are often used interchangeably. Strictly speaking, correlation has a narrower definition: a correlation is defined by a metric (the ‘correlation coefficient’) that quantifies the degree to which an association tends to a certain pattern.

The most widely used measure of correlation is Pearson’s correlation coefficient (also called the Pearson product-moment correlation coefficient). Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The mathematical formula for Pearson’s correlation coefficient applied to a sample is: \[ r_{xy} = \frac{1}{N-1}\sum\limits_{i=1}^{N}{\frac{x_i-\bar{x}}{s_x} \frac{y_i-\bar{y}}{s_y}} \] We’re using \(x\) and \(y\) here to refer to each of the variables in the sample. The \(r_{xy}\) denotes the correlation coefficient, \(s_x\) and \(s_y\) denote the standard deviation of each sample, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(N\) is the sample size.
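As a quick check on the formula, here is a minimal sketch that computes the coefficient ‘by hand’ from standardised values and compares it with R’s built-in cor function, using the wind and pressure variables from the storms data introduced below (and assuming those columns contain no missing values):

x <- storms$wind
y <- storms$pressure
n <- length(x)
# sum the products of the standardised values, then divide by N - 1
r_xy <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)
r_xy
cor(x, y)  # should agree with the 'by hand' version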

Remember, a correlation coefficient quantifies the degree to which an association tends to a certain pattern. In the case of Pearson’s correlation coefficient, the coefficient is designed to summarise the strength of a linear (i.e. ‘straight line’) association. We’ll return to this idea in a moment.

Pearson’s correlation coefficient takes a value of 0 if two variables are uncorrelated, and a value of +1 or -1 if they are perfectly related. ‘Perfectly related’ means we can predict the exact value of one variable given knowledge of the other. A positive value indicates that high values of one variable are associated with high values of the second. A negative value indicates that high values of one variable are associated with low values of the second. The words ‘high’ and ‘low’ are relative to the arithmetic mean.

In R we can use the cor function to calculate Pearson’s correlation coefficient. For example, the Pearson correlation coefficient between pressure and wind is given by:

cor(storms$wind, storms$pressure)

## [1] -0.9254911

This is negative, indicating wind speed tends to decline with increasing pressure. It is also quite close to -1, indicating that this association is very strong. We saw this in the Introduction to ggplot2 chapter when we plotted atmospheric pressure against wind speed.

The Pearson’s correlation coefficient must be interpreted with care. Two points are worth noting:

  1. Because it is designed to summarise the strength of a linear relationship, Pearson’s correlation coefficient will be misleading when this relationship is curved, or even worse, hump-shaped.

  2. Even if the relationship between two variables really is linear, Pearson’s correlation coefficient tells us nothing about the slope (i.e. the steepness) of the relationship.

If those last two statements don’t make immediate sense, take a close look at this figure:

This shows a variety of different relationships between pairs of numeric variables. The numbers in each subplot are the Pearson’s correlation coefficients associated with the pattern. Consider each row:

  1. The first row shows a series of linear relationships that vary in their strength and direction. These are all linear in the sense that the general form of the relationship can be described by a straight line. This means that it is appropriate to use Pearson’s correlation coefficient in these cases to quantify the strength of association, i.e. the coefficient is a reliable measure of association.

  2. The second row shows a series of linear relationships that vary in their direction, but are all examples of a perfect relationship—we can predict the exact value of one variable given knowledge of the other. What these plots show is that Pearson’s correlation coefficient measures the strength of association without telling us anything about the steepness of the relationship.

  3. The third row shows a series of different cases where it is definitely inappropriate to use Pearson’s correlation coefficient. In each case, the variables are related to one another in some way, yet the correlation coefficient is always 0. Pearson’s correlation coefficient completely fails to flag the relationship because it is not even close to being linear.

Other measures of correlation

What should we do if we think the relationship between two variables is non-linear? We should not use Pearson’s correlation coefficient to measure association in this case. Instead, we can calculate something called a rank correlation. The idea is quite simple. Instead of working with the actual values of each variable we ‘rank’ them, i.e. we sort each variable from lowest to highest and then assign the labels ‘first’, ‘second’, ‘third’, etc. to different observations. Measures of rank correlation are based on a comparison of the resulting ranks. The two most popular are Spearman’s \(\rho\) (‘rho’) and Kendall’s \(\tau\) (‘tau’).
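A minimal sketch of the ranking idea, again using the storms data and assuming no missing values: Spearman’s \(\rho\) is simply Pearson’s correlation coefficient calculated on the ranks.

# rank() replaces each value with its position when sorted from lowest to highest
cor(rank(storms$wind), rank(storms$pressure))
# ...which is the same as asking cor for Spearman's rho directly
cor(storms$wind, storms$pressure, method = "spearman")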

We won’t examine the mathematical formula for each of these as they don’t really help us understand them much. We do need to know how to interpret rank correlation coefficients though. The key point is that both coefficients behave in a very similar way to Pearson’s correlation coefficient. They take a value of 0 if the ranks are uncorrelated, and a value of +1 or -1 if they are perfectly related. Again, the sign tells us about the direction of the association.

We can calculate both rank correlation coefficients in R using the cor function again. This time we need to set the method argument to the appropriate value: method = "kendall" or method = "spearman". For example, the Spearman’s \(\rho\) and Kendall’s \(\tau\) measures of correlation between pressure and wind are given by:

cor[storms$wind, storms$pressure, method = "kendall"]

## [1] -0.7627645

cor[storms$wind, storms$pressure, method = "spearman"]

## [1] -0.9025831

These roughly agree with the Pearson correlation coefficient, though Kendall’s \(\tau\) seems to suggest that the relationship is weaker. Kendall’s \(\tau\) is often smaller than Spearman’s \(\rho\) correlation. Although Spearman’s \(\rho\) is used more widely, it is more sensitive to errors and discrepancies in the data than Kendall’s \(\tau\).

Graphical summaries

Correlation coefficients give us a simple way to summarise associations between numeric variables. They are limited though, because a single number can never summarise every aspect of the relationship between two variables. This is why we always visualise the relationship between two variables. The standard graph for displaying associations among numeric variables is a scatter plot, using horizontal and vertical axes to plot two variables as a series of points. We saw how to construct scatter plots using ggplot2 in the Introduction to ggplot2 chapter so we won’t step through the details again.

There are a few other options beyond the standard scatter plot. Specifically, ggplot2 provides a couple of different geom_XX functions for producing a visual summary of relationships between numeric variables in situations where over-plotting of points is obscuring the relationship. One such example is the geom_count function:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_count(alpha = 0.5)

The geom_count function is used to construct a layer in which data are first grouped into sets of identical observations. The number of cases in each group is counted, and this number (‘n’) is used to scale the size of points. Take note—it may be necessary to round numeric variables first (e.g. via mutate) to make a usable plot if they aren’t already discrete.
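For example, if wind and pressure took many distinct values, we might round them to the nearest 10 before plotting so that identical observations can be grouped. This is only a sketch of the idea (it assumes dplyr is loaded for mutate); the appropriate rounding depends on the data.

storms_rounded <- mutate(storms, wind = round(wind, -1), pressure = round(pressure, -1))
ggplot(storms_rounded, aes(x = pressure, y = wind)) +
  geom_count(alpha = 0.5)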

Two further options for dealing with excessive over-plotting are the geom_bin_2d and geom_hex functions. The geom_bin_2d function divides the plane into rectangles, counts the number of cases in each rectangle, and then uses the number of cases to assign the rectangle’s fill colour. The geom_hex function does essentially the same thing, but instead divides the plane into regular hexagons. Note that geom_hex relies on the hexbin package, so this needs to be installed to use it. Here’s an example of geom_hex in action:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_hex(bins = 25)

Notice that this looks exactly like the ggplot2 code for making a scatter plot, other than the fact that we’re now using geom_hex in place of geom_point.
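The rectangle-based version follows exactly the same template; only the geom changes (this assumes a version of ggplot2 recent enough to provide geom_bin_2d; older versions call it geom_bin2d). The bins argument controls how finely the plane is divided:

ggplot(storms, aes(x = pressure, y = wind)) +
  geom_bin_2d(bins = 25)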

Associations between categorical variables

Numerical summaries

Numerically exploring associations between pairs of categorical variables is not as simple as the numeric variable case. The general question we need to address is, “do different combinations of categories seem to be under or over represented?” We need to understand which combinations are common and which are rare. The simplest thing we can do is ‘cross-tabulate’ the number of occurrences of each combination. The resulting table is called a contingency table. The counts in the table are sometimes referred to as frequencies.

The xtabs function (xtabs = ‘cross-tabulation’) can do this for us. For example, the frequencies of each storm category and month combination are given by:

xtabs(~ type + month, data = storms)

##                      month
## type                    6   7   8   9  10  11  12
##   Extratropical        27  38  23 149 129  42   4
##   Hurricane             3  31 300 383 152  25   2
##   Tropical Depression  22  59 150 156  84  42   0
##   Tropical Storm       31 123 247 259 204  61   1

The first argument sets the variables to cross-tabulate. The xtabs function uses R’s special formula language, so we can’t leave out that ~ at the beginning. After that, we just provide the list of variables to cross-tabulate, separated by the + sign. The second argument tells the function which data set to use. This isn’t a dplyr function, so the first argument is not the data for once.

What does this tell us? It shows us how many observations are associated with each combination of values of type and month. We have to stare at the numbers for a while, but eventually it should be apparent that hurricanes and tropical storms are more common in August and September (months ‘8’ and ‘9’). More severe storms occur in the middle of the storm season—perhaps not all that surprising.
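If staring at raw counts is hard work, one option (not used elsewhere in this chapter) is to convert them to proportions within each month using prop.table, which makes the columns directly comparable:

# proportions within each month (margin = 2 means 'within columns')
round(prop.table(xtabs(~ type + month, data = storms), margin = 2), 2)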

If both variables are ordinal we can also calculate a descriptive statistic of association from a contingency table. It makes no sense to do this for nominal variables because their values are not ordered. Pearson’s correlation coefficient is not appropriate here. Instead, we have to use some kind of rank correlation coefficient that accounts for the categorical nature of the data. Spearman’s \(\rho\) and Kendall’s \(\tau\) are designed for numeric data, so they can’t be used either.

One measure of association that is appropriate for categorical data is Goodman and Kruskal’s \(\gamma\) (‘gamma’). This behaves just like the other correlation coefficients we’ve looked at: it takes a value of 0 if the categories are uncorrelated, and a value of +1 or -1 if they are perfectly associated. The sign tells us about the direction of the association. Unfortunately, there isn’t a base R function to compute Goodman and Kruskal’s \(\gamma\), so we have to use a function from one of the packages that implements it (e.g. the GKgamma function in the vcdExtra package) if we need it.
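A minimal sketch of how this might look, assuming the vcdExtra package is installed and (a further assumption) that we are happy to treat type as ordered by storm severity:

library(vcdExtra)
# GKgamma expects a two-way table whose rows and columns are in a meaningful order,
# so the levels of type would first need to be reordered by severity
GKgamma(xtabs(~ type + month, data = storms))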

Graphical summaries

Bar charts can be used to summarise the relationship between two categorical variables. The basic idea is to produce a separate bar for each combination of categories in the two variables. The lengths of these bars are proportional to the values they represent, which are either the raw counts or the proportions in each category combination. This is the same information displayed in a contingency table. Using ggplot2 to display this information is not very different from producing a bar graph to summarise a single categorical variable.

Let’s do this for the type and year variables in storms, breaking the process up into two steps. As always, we start by using the ggplot function to construct a graphical object containing the necessary default data and aesthetic mapping:

bar_plt <- ggplot(storms, aes(x = year, fill = type))
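A sketch of what the second step might look like: add a bar geom and print the object. Using position = "dodge" is one reasonable choice here, drawing a separate bar for each storm type within each year.

bar_plt <- bar_plt + geom_bar(position = "dodge")
bar_plt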
