Simple and Multiple Regression with R – Part 1. by Duane Edwards

The R statistical program is becoming very popular among both statisticians and programmers. What is surprising about this popularity is that the program is largely command-line driven and lacks the graphical user interface (GUI) that would make it friendly to the average statistician. What is appealing about the program, however, is its simplicity and flexibility. The following demonstration aims to illustrate how easy and flexible the program is in spite of the lack of a GUI.

To demonstrate, I will carry out simple and multiple regression analyses on a small data set. I will also generate the corresponding graphs and compute Pearson's correlation coefficient.

The data used in this analysis were collected in a study of suicide in Guyana. The study sought to find whether there is any correlation between political, social and religious integration and suicide rates in Guyana. Political and religious integration were measured using Shannon's entropy, which indicates the level of disintegration (the reverse of integration) in each of the 10 regions of Guyana. Social integration was derived from figures in the Government of Guyana's poverty reduction strategy paper showing the coping mechanisms used in the various regions. The figures for suicide rate represent the number of deaths by suicide per 100,000 persons over the period from 2003 to 2007.

Region  Suicide rate per 100,000 persons  Political Integration  Social Integration  Religious Integration
1       33.86                             1.57                   27.2                0.54
2       188.80                            1.25                   35.38               1.40
3       126.47                            1.21                   22.94               1.49
4       106.01                            1.57                   26.47               1.17
5       124.10                            1.16                   21.00               1.44
6       252.38                            1.09                   33.20               1.44
7       89.55                             2.02                   16.44               0.95
8       89.79                             2.26                   26.95               0.84
9       66.75                             2.24                   22.72               0.13
10      65.07                             1.57                   30.99               0.68

With the data presented above in table form, I will carry out first a simple regression, followed by a multiple regression analysis to demonstrate the simplicity and flexibility of the R statistical program.

Step 1. Create the variables and assign them values in R.

> pint <- c(1.57,1.25,1.21,1.75,1.16,1.09,2.02,2.26,2.24,1.57)
> sint <- c(27.2,35.38,22.94,26.47,21.00,33.20,16.44,26.95,22.72,30.99)
> rint <- c(0.54,1.40,1.49,1.17,1.44,1.44,0.95,0.84,0.13,0.68)
> suir <- c(33.86,188.80,126.47,106.01,124.10,252.38,89.55,89.79,66.75,65.07)

 

I have simply created four variables, namely pint (political disintegration), sint (social integration), rint (religious disintegration) and suir (suicide rate), and assigned them their respective values. In R the main assignment operator is the less-than sign followed by a hyphen (<-). Together they form an arrow indicating that the values on the right are assigned to the variable on the left.

With the data loaded and the values assigned, let's say we want to see graphically the relationship between one of these independent variables (specifically, political disintegration) and the suicide rate. We simply use the plot(x, y) command:

> plot(pint, suir)

The result is a scatter plot with pint on the horizontal axis and suir on the vertical axis.

Immediately, we can see that whatever relationship there is between the two variables, that relationship is inverse. It means that the regions with the least political disintegration (the most political integration) tend to be the regions with the highest suicide rates.
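As a small extension of the plot described above, the same vectors can be collected into a data frame and plotted with axis labels and region numbers, which makes it easier to see which region each point belongs to. This is only a sketch: the names guyana and region are my own labels for illustration, and the values are those entered in Step 1.

```r
# Values as entered in Step 1
pint <- c(1.57, 1.25, 1.21, 1.75, 1.16, 1.09, 2.02, 2.26, 2.24, 1.57)
sint <- c(27.2, 35.38, 22.94, 26.47, 21.00, 33.20, 16.44, 26.95, 22.72, 30.99)
rint <- c(0.54, 1.40, 1.49, 1.17, 1.44, 1.44, 0.95, 0.84, 0.13, 0.68)
suir <- c(33.86, 188.80, 126.47, 106.01, 124.10, 252.38, 89.55, 89.79, 66.75, 65.07)

# 'guyana' and 'region' are illustrative names, not part of the original analysis
guyana <- data.frame(region = 1:10, pint, sint, rint, suir)

# Scatter plot via the formula interface: suicide rate against political disintegration
plot(suir ~ pint, data = guyana,
     xlab = "Political disintegration (Shannon's entropy)",
     ylab = "Suicide rate per 100,000 persons",
     main = "Suicide rate vs. political disintegration")

# Label each point with its region number, placed just above the point
text(guyana$pint, guyana$suir, labels = guyana$region, pos = 3)
```

Keeping the variables in one data frame is a matter of taste for a data set this small, but it guarantees that all four vectors have the same length, since data.frame() stops with an error otherwise.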

Step 2. Running the regression.

The code for running the regression is very simple: lm(y ~ x). Here lm stands for linear model, while the y and x (separated by the tilde ~) in the parentheses represent the dependent and independent variables respectively. Replacing y and x with the actual variables gives us lm(suir ~ pint). Now let's code this into R, storing the fitted model as model1:

> model1 <- lm(suir ~ pint)

> summary(model1)

Call:
lm(formula = suir ~ pint)

Residuals:
   Min     1Q Median     3Q    Max
-84.21 -29.23   6.65  28.49  91.02

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   259.69      66.91   3.881  0.00467 **
pint          -90.20      40.16  -2.246  0.05490 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 53.47 on 8 degrees of freedom
Multiple R-squared: 0.3867, Adjusted R-squared: 0.3101
F-statistic: 5.045 on 1 and 8 DF, p-value: 0.0549
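The figures in the summary can also be pulled out of the fitted model individually, which is handy when they are needed for further calculation. The following is a minimal sketch using standard accessor functions; the object new_region and the entropy value 1.5 are hypothetical and chosen purely for illustration.

```r
# Values as entered in Step 1
pint <- c(1.57, 1.25, 1.21, 1.75, 1.16, 1.09, 2.02, 2.26, 2.24, 1.57)
suir <- c(33.86, 188.80, 126.47, 106.01, 124.10, 252.38, 89.55, 89.79, 66.75, 65.07)

model1 <- lm(suir ~ pint)

# Intercept and slope as a named numeric vector
coef(model1)

# 95% confidence intervals for both coefficients
confint(model1, level = 0.95)

# R-squared and the p-value of the slope, taken from the summary object
fit_summary <- summary(model1)
fit_summary$r.squared
fit_summary$coefficients["pint", "Pr(>|t|)"]

# Predicted suicide rate for a hypothetical region with entropy 1.5
# (new_region and 1.5 are illustrative, not part of the study)
new_region <- data.frame(pint = 1.5)
predict(model1, newdata = new_region)
```

Since the p-value for pint is slightly above 0.05, the 95% confidence interval for the slope should include zero, which is another way of reading the same result.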

 

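With the summary() command, we have all that is needed to assess the relationship between political integration (or disintegration) and suicide rates in Guyana. The slope of -90.20 confirms the inverse relationship seen in the plot, although its p-value of 0.0549 falls just short of significance at the 5% level, and the R-squared of 0.3867 indicates that political disintegration accounts for about 39% of the variation in suicide rates across the regions.

In addition, Pearson's correlation can be obtained with another simple command, cor(var1, var2, method='pearson'):

> cor(pint, suir, method='pearson')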

[1] -0.6218784
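In a simple regression with one independent variable, the square of Pearson's correlation equals the Multiple R-squared reported by summary(), and the significance test of the correlation gives the same p-value as the test of the slope. The sketch below checks both, using the values entered in Step 1; the variable names r and r_squared are my own.

```r
# Values as entered in Step 1
pint <- c(1.57, 1.25, 1.21, 1.75, 1.16, 1.09, 2.02, 2.26, 2.24, 1.57)
suir <- c(33.86, 188.80, 126.47, 106.01, 124.10, 252.38, 89.55, 89.79, 66.75, 65.07)

# Pearson's correlation and the R-squared of the simple regression
r <- cor(pint, suir, method = "pearson")
r_squared <- summary(lm(suir ~ pint))$r.squared

# The squared correlation should match R-squared up to floating-point error
r^2
all.equal(r^2, r_squared)

# cor.test() adds a significance test for the correlation;
# its p-value matches that of the slope in the regression summary
cor.test(pint, suir, method = "pearson")$p.value
```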