Estimating the OLS coefficients, constructing a linear model, and calculating the MSE using cross-validation

Here's my very first technical blog post!!

Since I am a beginner in R, I found the problem quite overwhelming at first, partly because I was searching for the wrong things. So it took me fairly long to figure out how to go about it.

There are several built-in functions in R to construct a linear regression model and perform cross-validation. The following is the shortest code I found to do that:

You need to create a data frame from the dataset as follows:

fitData <- data.frame(OLStrain$Y, OLStrain$X1, ...)



Don't forget to name the columns:

names(fitData) <- c("Y", "X1", ...)

Now we create a linear model, fitting an ordinary linear model with Y as the response and X1, X2, X3, ... (however many predictors your problem has) as the predictors. This is done as follows:

m1 <- lm(Y ~ .,data=fitData)
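To make this concrete, here is a minimal sketch on simulated data. The dataset, the true coefficients (2, 3 and -1.5) and the noise level are all made up for illustration; substitute your own data frame in practice:

```r
# Simulate a small dataset with two predictors and a known linear relationship
set.seed(42)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 2 + 3 * X1 - 1.5 * X2 + rnorm(n, sd = 0.5)

fitData <- data.frame(Y = Y, X1 = X1, X2 = X2)

# Y ~ . means "regress Y on every other column of fitData"
m1 <- lm(Y ~ ., data = fitData)
coef(m1)   # intercept and slopes, roughly 2, 3 and -1.5
```

summary(m1) will additionally show standard errors, t-statistics and R-squared for the fit.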

To learn more about linear models, and the different parameters that you can play around with, here are the reference sites that I found useful:
1. http://astrostatistics.psu.edu/datasets/2006tutorial/html/stats/html/lm.html
2. http://psychweb.psy.umt.edu/denis/datadecision/lm_R/index.html
3. http://www.econ.uiuc.edu/~wsosa/econ507/ClassicalLinearModel.pdf
You could also go to http://stackoverflow.com and search for your question; you will very likely find a post relating to the problem you are facing.


Now for the next step.
This model object is then fed to the cv.lm function (from the DAAG package). Refer to the following for more information on the function:

http://www.statmethods.net/stats/regression.html  OR

http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/DAAG/html/cv.lm.html

cv.lm(df=fitData, m1, m=5)

Tada!!!! You are done.
You can see the results for each of the folds.

The above can also be done without using built-in functions.
Ahhh!!! That was really a tough deal for me... Here's what I did, which worked:

STEP 1:

I have taken an example of 5 folds.
The following simply divides the dataset into 5 folds of equal size (it assumes the number of rows is divisible by 5). However, you could modify the function to create folds of different sizes.

sizes = function(n, k=5) {
  # one entry per fold; assumes n is divisible by k
  rep(n/k, k)
}
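Following up on the note above about folds of different sizes, here is one way to handle an n that is not divisible by k. The helper name sizes2 is my own; the idea is simply to spread the remainder over the first few folds:

```r
# Fold sizes for n observations and k folds when n %% k may be non-zero:
# every fold gets floor(n/k) indices, and the remainder is spread
# over the first folds, one extra index each.
sizes2 <- function(n, k = 5) {
  s <- rep(floor(n / k), k)
  r <- n %% k
  if (r > 0) s[1:r] <- s[1:r] + 1
  s
}

sizes2(100)      # 20 20 20 20 20
sizes2(103)      # 21 21 21 20 20
sum(sizes2(103)) # 103, so every observation lands in exactly one fold
```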


STEP 2:

Create a random sample to be used as the test set. (Remember, in cross-validation we divide the dataset into multiple folds and use one of them as the test set and the rest as the training set.)

Now in this example, I have divided the dataset into 5 folds; I assign one fold as the test set, and the rest of the data is used for training the model. This is called K-FOLD CROSS-VALIDATION.
There are other kinds of cross validation techniques that you could use.

Okay coming back to the problem,

testing = function(n, k=5) {
  indices = list()       # will hold one vector of test indices per fold
  size = sizes(n, k=k)   # call the function sizes to get the size of each fold
  values = 1:n           # row indices not yet assigned to a fold
  for (i in 1:k) {
    s = sample(values, size[i])   # draw this fold's test indices at random
    indices[[i]] = s
    values = setdiff(values, s)   # remove them from the pool
  }
  indices
}
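To convince yourself that these folds really partition the data, you can check that every row index appears in exactly one fold. The two helpers are restated here so the snippet runs on its own (the equal-size version assumes n is divisible by k):

```r
# Restated helpers: equal fold sizes, then a random partition into k folds
sizes <- function(n, k = 5) rep(n / k, k)

testing <- function(n, k = 5) {
  indices <- list()
  size <- sizes(n, k = k)
  values <- 1:n
  for (i in 1:k) {
    s <- sample(values, size[i])   # sample only from still-unassigned indices
    indices[[i]] <- s
    values <- setdiff(values, s)
  }
  indices
}

set.seed(1)
folds <- testing(20, k = 5)
lengths(folds)        # 4 4 4 4 4
sort(unlist(folds))   # 1 2 ... 20: each index appears exactly once
```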


  MSEError = 0    # accumulator for the per-fold MSEs
  result = list()
  alltestingindices = testing(dim(OLStrain)[1])   # the function defined above
  for (i in 1:5) {
    testingindices = alltestingindices[[i]]
    cv.train = OLStrain[-testingindices,]
    cv.test = OLStrain[testingindices,]

    X <- as.matrix(cbind(1, cv.train$X1, cv.train$X2, ...))
    Y <- as.matrix(cv.train$Y)
In order to find the regression coefficients that minimize the sum of squared errors, we need to use the following equation:

                           beta = (X^T X)^(-1) X^T Y

    # (X^T X)^(-1) X^T Y, inverting via the Cholesky factor of X^T X
    OLS <- chol2inv(chol(t(X) %*% X)) %*% t(X) %*% Y

    predictedY <- as.matrix(OLS[1] + OLS[2]*cv.test$X1 + OLS[3]*cv.test$X2 + OLS[4]*cv.test$X3 + OLS[5]*cv.test$X4 ...)

    MSEError <- MSEError + mean((cv.test$Y - predictedY)^2)

  }
  MSEError <- MSEError / 5   # average the MSE over the 5 folds
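Putting the pieces together, here is a self-contained run of the manual 5-fold loop on simulated data. The dataset, its column names, and the true coefficients are made up for illustration; the snippet also checks that the chol2inv closed-form solution agrees with lm on each training fold:

```r
# Simulated stand-in for OLStrain with two predictors
set.seed(7)
n <- 100
OLStrain <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
OLStrain$Y <- 1 + 2 * OLStrain$X1 - OLStrain$X2 + rnorm(n, sd = 0.3)

# Fold helpers (restated so this block runs on its own; n divisible by k)
sizes <- function(n, k = 5) rep(n / k, k)
testing <- function(n, k = 5) {
  indices <- list(); size <- sizes(n, k); values <- 1:n
  for (i in 1:k) {
    s <- sample(values, size[i]); indices[[i]] <- s
    values <- setdiff(values, s)
  }
  indices
}

MSEError <- 0
alltestingindices <- testing(nrow(OLStrain))
for (i in 1:5) {
  testingindices <- alltestingindices[[i]]
  cv.train <- OLStrain[-testingindices, ]
  cv.test  <- OLStrain[testingindices, ]

  X <- as.matrix(cbind(1, cv.train$X1, cv.train$X2))
  Y <- as.matrix(cv.train$Y)

  # beta = (X^T X)^(-1) X^T Y via the Cholesky factor
  OLS <- chol2inv(chol(t(X) %*% X)) %*% t(X) %*% Y

  # sanity check: the closed form matches lm on the same training fold
  stopifnot(isTRUE(all.equal(
    as.vector(OLS),
    unname(coef(lm(Y ~ X1 + X2, data = cv.train))),
    tolerance = 1e-6)))

  predictedY <- OLS[1] + OLS[2] * cv.test$X1 + OLS[3] * cv.test$X2
  MSEError <- MSEError + mean((cv.test$Y - predictedY)^2)
}
MSEError <- MSEError / 5   # average MSE over the folds
```

With noise of sd 0.3, the averaged MSE should land near the noise variance, 0.3^2 = 0.09.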

Booyah !!! you have done it!




