To determine whether a linear correlation exists between the bivariate data, the strength of that correlation must first be tested. The Pearson product–moment correlation coefficient (PPMCC), denoted by r, measures the strength and direction of the linear correlation between the bivariate data and takes values in the range −1 ≤ r ≤ 1. A value of r = 0 only shows that there is no linear correlation between the bivariate data; it does not suggest that other forms of relationship do not exist. It is important to note that the significance of the value of r changes in specific contexts. In psychology, for example, a value of r of at least 0.6 could already be considered a strong positive linear correlation, because there are many confounding variables (CVs), variables other than the factors of the bivariate data that could influence the correlation coefficient. This is very relevant in the context of this exploration, as it deals with the possible effects of education-induced stress in students. From the value of r obtained, 0.5904, it can be deduced that there is indeed a moderately strong positive linear relationship between the bivariate data. Furthermore, as previously stated, because the data deals with humans, the r value could be considered very strong, since there are too many CVs to exclude to achieve a perfectly accurate r value. Regardless, what the r value does suggest is that there is some association between the bivariate data being tested and some degree of linear dependence.
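For reference, a minimal sketch of how r is computed from raw bivariate data is given below. The data values shown are placeholders only, since the exploration's table is not reproduced in this section; substituting the actual mean-years-of-schooling and suicide figures should return the quoted r = 0.5904.

```python
import math

# Placeholder data only, not the exploration's actual values.
# x = mean years of schooling, y = number of suicides of males aged 10-25.
x = [10.1, 10.3, 10.4, 10.6, 10.8, 11.0, 11.1, 11.3, 11.5, 11.7]
y = [40, 55, 48, 62, 58, 70, 66, 75, 72, 80]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# PPMCC: covariance of x and y divided by the product of their spreads
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.4f}")  # always lies in -1 <= r <= 1
```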
Confidence intervals are often used in statistics to address how well the sample used can estimate the population value, as they provide a range of acceptable values for the parameter of interest in a population. The sampling distribution of r is not normally distributed: as the population correlation, denoted by ρ, increases positively, the sampling distribution becomes more negatively skewed, and vice versa. As |ρ| approaches 1, the sampling variance approaches 0.
Confidence intervals are also calculated at a confidence level, most commonly 95% (α = 0.05), which allows us to more reliably determine which values are acceptable for an unknown parameter. As such, Fisher's transformation, denoted by z, is used to test the reliability of the PPMCC and to calculate the confidence interval, using the equation z = (1/2) ln((1 + r)/(1 − r)); a 95% confidence interval on the transformed scale is then z ± 1.96/√(n − 3), which is converted back to the r scale. Therefore, the population correlation ρ is likely to lie in the range −0.063 ≤ ρ ≤ 0.890. By carefully selecting a sample of the target population, the aim is to describe more accurately the correlation between the bivariate data in the entire population. From the r value obtained and the estimated ρ value, it can be inferred at the 95% confidence level that there is a correlation between the mean years of schooling and the number of suicides of males aged 10–25 years old.
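As a numerical check, the interval quoted above can be reproduced from the values given in this exploration (r = 0.5904 and the 10 data points); any small differences from the figures above are rounding.

```python
import math

r = 0.5904      # sample PPMCC quoted in the exploration
n = 10          # number of bivariate data points
z_crit = 1.96   # two-sided critical value for a 95% confidence level

# Fisher's transformation of r and its standard error
z = 0.5 * math.log((1 + r) / (1 - r))   # equivalently math.atanh(r)
se = 1 / math.sqrt(n - 3)

# Confidence interval on the transformed scale, then back-transform to rho
lo, hi = z - z_crit * se, z + z_crit * se
rho_lo, rho_hi = math.tanh(lo), math.tanh(hi)

# close to the quoted -0.063 <= rho <= 0.890 (differences are rounding)
print(f"95% CI for rho: ({rho_lo:.3f}, {rho_hi:.3f})")
```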
This suggests that, for the data used in this exploration, the mean years of schooling is strongly related to the increased number of suicides observed in the country in the past decade. The confidence interval allows us to more reliably predict the correlation between data sets when the whole population cannot be sampled due to financial and resource constraints. Yet a high correlation in a single set of bivariate data does not imply causation. The r value only gives statistical evidence that there could be an association between the bivariate data; concluding that an association exists based on a single set of bivariate data alone would result in many inaccuracies and false assumptions. The estimated value of ρ, which is based on the value of r, is consequently affected as well. Although selecting a target sample can reduce the error in the estimated linear correlation between the bivariate data, it is still insufficient to provide the most accurate representation of their linear relationship. Anscombe's quartet demonstrates how summary statistics, despite their ability to describe the characteristics of a vast and complex data set with just a few numbers, can lead to misleading outcomes, and that simply basing conclusions on the PPMCC is ineffective without taking other variables into account.
To more reliably observe the relationship between the bivariate data, the scatter diagram can be split into 4 equal quadrants around the mean point, denoted by the coordinates (x̄, ȳ). The mean point acts as a reference point: a line of best fit drawn through it is used to find the direction and strength of the association between the variables. The mean point is calculated by finding the mean of the x values and the mean of the y values respectively.
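Written out for the n = 10 bivariate data points used here, the coordinates of the mean point are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```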
It would appear that there is a moderately strong positive linear relationship between the bivariate data, but inaccuracies can still occur because the line of best fit only passes through the mean point, which, as previously stated, is affected by the presence of outliers. To improve the accuracy of the line drawn, residuals can be used, as they tell us how far off any predictions made would be. A residual is the vertical distance by which a data point sits above or below the line of best fit, also known as the regression line; the residual is positive when the data point lies above the regression line and negative when it lies below, and there is no residual when the regression line passes through the data point. The residual of each data point can therefore be written as eᵢ = yᵢ − ŷᵢ, the observed value minus the value predicted by the regression line. The purpose of the least squares regression line is to make the sum of the squares of the residuals as small as possible, effectively keeping the squared vertical distances of the data points from the regression line as small as possible. The slope or gradient of the regression line is found from m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)².

By splitting the scatter diagram into 4 equal quadrants, it becomes easier to see how the different points are distributed across the graph. If the bivariate data were expected to show a positive linear relationship, more of the data points would be expected to fall in quadrants 2 and 3.
Whereas, if the bivariate data were expected to show a negative linear relationship, more data points would be expected in quadrants 1 and 4. As can be observed from Figure 3, out of the 10 data points, 7 lie in either quadrant 2 or 3 while only 3 lie in quadrants 1 and 4. This suggests that there is a moderately strong positive linear relationship between the bivariate data, which is further supported by the equation of the line of best fit, Y = mX + C, where the value of the gradient m is 15.69. This means that, on average, for every increase or decrease of 1 year in the mean years of schooling, the number of suicides in male students aged 10–25 is predicted to increase or decrease by 15.69.
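As an illustration of how the gradient m and intercept C of the least squares regression line Y = mX + C follow from the slope formula given earlier, the sketch below fits a line to the same placeholder data used in the earlier sketch and lists the residuals; with the exploration's actual data the gradient should come out as the quoted 15.69.

```python
# Least squares regression line Y = mX + C from the slope formula above.
# The data below are placeholders, not the exploration's actual values.
x = [10.1, 10.3, 10.4, 10.6, 10.8, 11.0, 11.1, 11.3, 11.5, 11.7]
y = [40, 55, 48, 62, 58, 70, 66, 75, 72, 80]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# m minimises the sum of squared residuals; the line passes through (x_bar, y_bar)
m = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
c = y_bar - m * x_bar

predicted = [m * xi + c for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

print(f"Y = {m:.2f}X + {c:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```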
The positive linear relationship between the bivariate data is therefore moderately strong, and the least squares regression line can be used to predict the outcome y for any change in the value of x. SSR is the regression sum of squares and quantifies how far the estimated regression line lies from the sample mean. SSE is the error sum of squares and quantifies the variation of the data points around the regression line, and SST is the total sum of squares, which quantifies the variation of the data points around the mean.
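In symbols, with ŷᵢ the value predicted by the regression line and ȳ the mean of the observed y values, the three quantities and their link to r² (used in the next paragraph) are shown below; the numerical value simply squares the quoted r and may differ slightly from the exploration's own rounding.

```latex
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE},
```

```latex
r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 0.5904^2 \approx 0.349 \text{ here.}
```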
From the r value obtained, squaring it gives the r² value, which measures the goodness of fit of the regression line. It is important to note, however, that although there appears to be a moderately strong positive linear relationship between the 2 variables, correlation still does not imply causation, as there could be other CVs that affected the outcome between the bivariate data, such as the financial status or health of the students. These CVs can contribute to the formation of outliers, which reduce the accuracy of the estimated relationship between the bivariate data because they stray too far from the line of best fit.
From Figure 3, it can be seen that the data point at 66 appears to be such an outlier. Least squares regression is only reliable when no residual differs dramatically from the rest: because the method minimises the sum of the squared errors, any outliers in the scatter diagram have a disproportionately large effect on the relationship derived for the bivariate data, since each error is squared. To mitigate this weakness of least squares regression, the least absolute deviations method can be used; it takes the absolute value of each residual instead, which makes it more robust when the data contains many outliers.
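A minimal sketch of the least absolute deviations idea is given below; it uses placeholder data with one deliberate outlier and a general-purpose numerical optimiser rather than any method from the exploration itself. The squared-error line is pulled towards the outlier noticeably more than the absolute-deviation line.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data with one deliberate outlier in y (not the exploration's values)
x = np.array([10.1, 10.3, 10.4, 10.6, 10.8, 11.0, 11.1, 11.3, 11.5, 11.7])
y = np.array([40, 55, 48, 62, 58, 70, 130, 75, 72, 80])  # 130 is the outlier

# Least squares: closed form via the slope formula used earlier
m_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c_ls = y.mean() - m_ls * x.mean()

# Least absolute deviations: minimise the sum of |residuals| numerically
def sum_abs_residuals(params):
    m, c = params
    return np.sum(np.abs(y - (m * x + c)))

m_lad, c_lad = minimize(sum_abs_residuals, x0=[m_ls, c_ls], method="Nelder-Mead").x

print(f"least squares:             Y = {m_ls:.2f}X + {c_ls:.2f}")
print(f"least absolute deviations: Y = {m_lad:.2f}X + {c_lad:.2f}")
```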
Nevertheless, the remaining 9 data points appear to lie relatively close to the line of best fit, and coupled with a relatively strong r value of 0.5904, this suggests that these data points contribute meaningfully to the accuracy of the moderately strong positive linear relationship between the bivariate data.