The source of data for this project is:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.

Read more: http://groupware.les.inf.puc-rio.br/har

Summary

The objective is to predict 'classe', a factor with 5 levels, from the reference data. The data supplied for the JHU/Coursera course are contained in the file 'pml-training.csv', which will be used for building, validating, and testing models. The second file, 'pml-testing.csv', lacks any classe values and will only be used for the submission test on Coursera.

  1. Initial cleaning: remove columns containing NA values (in any of their flavors: NA, empty strings, "#DIV/0!") and columns not relevant to predicting the classe variable.

  2. Determining variable importance: fit an initial rough model on 20% of the 'training' set, using randomForest() with 3-fold cross-validation, chiefly to determine the relative importance of the features.

  3. Second model: fit randomForest with 3-fold CV on 50% of the data, using a reduced set of predictors chosen with varImp(). Examine the results and the initial "out of bag" errors.

  4. Perform additional validation by predicting with the second model and evaluating its accuracy against the remaining 30% of the 'training' set.

  5. Predict 20 cases for the submission portion of the project and report the results.

Note: I recorded and report the system time required to run each model, as well as the time to process all the code chunks in this report. My hardware and software configurations are also reported for reference.

Summary Results

The second model, a 24-predictor random forest validated with 3-fold cross-validation and an estimated out-of-sample error of ~1%, achieved 99% accuracy on an additional out-of-sample test. It also scored 20 of 20 correct on the course submission test.

Data Cleaning & Exploration

Data Cleaning

      startknit <- Sys.time()  ## start time; see end of report for total time

      sourceData <- read.csv("pml-training.csv", 
                         na.strings = c("NA", "", "#DIV/0!"))

      submitTestSet <- read.csv("pml-testing.csv", 
                        na.strings = c("NA", "", "#DIV/0!"))
      # View(sourceData)
      
# remove columns containing NAs from both sets
      cols2get <- colSums(is.na(sourceData))==0
      cols2get2 <- colSums(is.na(submitTestSet))==0
      cleanData <- sourceData[, cols2get]
      cleanSubmit <- submitTestSet[,cols2get2]
      
      ## View(cleanData)

# remove column 'X' (the row index), as well as the time, date, and
# window columns, which are not related to prediction of 'classe'.
# I will keep 'user_name' in the set - it may or may not be important
# in the model response.
      
      cleanData <- cleanData[,c(2,8:60)]
      cleanSubmit <- cleanSubmit[,c(2,8:60)]
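
As a quick sanity check (my addition, not part of the original analysis), the two cleaned sets should now share the same 53 predictor columns, differing only in the final column ('classe' in cleanData vs. 'problem_id' in cleanSubmit):

      # verify the predictor columns align between the two cleaned sets
      stopifnot(identical(names(cleanData)[1:53],
                          names(cleanSubmit)[1:53]))
      dim(cleanData)    # expected: 19622 rows, 54 columns
      dim(cleanSubmit)  # expected: 20 rows, 54 columns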

Data Exploration

Based on a cursory examination of the first few columns of the data with exploratory plots, several variables appear to be promising predictors. Note: these are a small sample of the plots that were made; the code for all plots is printed in the Appendix.

Comments about plots 1-4: In plot 1, the differences between users are clear, so I will retain user_name as a feature in the initial rough model. In plot 2, classe appears to respond to the absolute values of the two variables, suggesting a non-linear relationship between them. Plot 3, like plot 1, shows differences by user_name. In plot 4, classes D and E show some separation, together, from the other classes.

Modeling

Pre-modeling (partitions and train control)

# doParallel with 2 cores reduced runtime ~30%.      
      require(parallel)  
      require(doParallel)  
      mc <- makeCluster(detectCores())
      registerDoParallel(mc)

      
      require(caret)
      
# create 3 data partitions: 
#     20% initial model(trainset1) 
#     50% second model(trainset2)
#     30% validation(validset)
      
      set.seed(616) # for repeatability
      
      trndx20 <- createDataPartition(cleanData$classe,
                                     p = 0.2, list = FALSE)
       
      trainset1 <- cleanData[trndx20,] # contains 20 % of rows
      tempset1 <- cleanData[-trndx20,] # contains 80 % of rows
       
      trndx50 <- createDataPartition(tempset1$classe,
                                     p = (0.5/0.8), list = FALSE)

      trainset2 <- tempset1[trndx50,] # contains 50% of original rows
      validset <- tempset1[-trndx50,] # contains 30% of original rows
      
# common train control parameters will be used in all train() calls
      
      controla <- trainControl(method="cv", 
                             number = 3, 
                             allowParallel = TRUE)
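
A quick check (my addition, assuming the partitions above) confirms the intended 20/50/30 split; note the second call uses p = 0.5/0.8 because it draws from the 80% remainder:

      # fraction of the original rows in each partition (approximate,
      # since createDataPartition balances classes within each split)
      round(c(train1 = nrow(trainset1),
              train2 = nrow(trainset2),
              valid  = nrow(validset)) / nrow(cleanData), 2)
      # expected: train1 = 0.20, train2 = 0.50, valid = 0.30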

Initial Random Forest Model, all predictors, 20% of observations

      start1 <- Sys.time()
      
      model1 <- train(classe~.,data=trainset1, 
                    method="rf",
                    trControl = controla)
      
      round(Sys.time() - start1,2) # model1 run time
## Time difference of 2.66 mins
      model1
## Random Forest 
## 
## 3927 samples
##   53 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## 
## Summary of sample sizes: 2618, 2619, 2617 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9541604  0.9419626  0.007313994  0.009255129
##   29    0.9595058  0.9487652  0.010364708  0.013129760
##   57    0.9536464  0.9413324  0.016357197  0.020733942
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 29.
      model1$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 29
## 
##         OOB estimate of  error rate: 2.9%
## Confusion matrix:
##      A   B   C   D   E class.error
## A 1109   6   0   0   1 0.006272401
## B   19 725  13   3   0 0.046052632
## C    1  25 652   7   0 0.048175182
## D    0   2  19 622   1 0.034161491
## E    0   4   7   6 705 0.023545706
      varImp(model1)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                      Overall
## roll_belt             100.00
## pitch_forearm          64.27
## yaw_belt               49.56
## magnet_dumbbell_z      48.74
## magnet_dumbbell_y      47.58
## roll_forearm           45.08
## pitch_belt             37.94
## accel_dumbbell_y       25.09
## roll_dumbbell          23.56
## magnet_dumbbell_x      21.67
## accel_belt_z           19.89
## magnet_belt_z          18.98
## accel_forearm_x        17.66
## magnet_belt_y          17.38
## total_accel_dumbbell   16.82
## accel_dumbbell_z       15.96
## magnet_forearm_z       14.26
## roll_arm               13.85
## yaw_dumbbell           13.49
## gyros_belt_z           12.22

The predictor user_name does not appear among the 20 most important variables in this model, so my hypothesis that user_name would be important is rejected.

Given model1's high accuracy (see the confusion matrix above), I will use the varImp() results from model1 to reduce the feature set and fit a second model on the 50% partition named 'trainset2'. I will use a scaled importance value of 10 as the cut-off, which removes user_name and many other predictors.

Model 2: Random Forest on 50% partition with reduced feature set

# reduce feature set to those with imp >= 10.0
      
      importance <- varImp(model1)$importance # get varImp()
      importance$vars <- rownames(importance) 
      importance <- importance[order(importance$Overall, 
                                     decreasing=TRUE),]
      impCols <- importance[(importance$Overall >= 10.0),2] # subset
      
      impCols
##  [1] "roll_belt"            "pitch_forearm"        "yaw_belt"            
##  [4] "magnet_dumbbell_z"    "magnet_dumbbell_y"    "roll_forearm"        
##  [7] "pitch_belt"           "accel_dumbbell_y"     "roll_dumbbell"       
## [10] "magnet_dumbbell_x"    "accel_belt_z"         "magnet_belt_z"       
## [13] "accel_forearm_x"      "magnet_belt_y"        "total_accel_dumbbell"
## [16] "accel_dumbbell_z"     "magnet_forearm_z"     "roll_arm"            
## [19] "yaw_dumbbell"         "gyros_belt_z"         "magnet_belt_x"       
## [22] "gyros_dumbbell_y"     "accel_forearm_z"      "magnet_arm_x"
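As a quick check (my addition), the 10.0 cut-off retains 24 of the 57 ranked predictors, which matches the "24 predictor" count reported for model2 below:

      length(impCols)  # 24 predictors retained at the 10.0 cut-off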
# create second model
      
      trainset2 <- trainset2[,c(impCols,"classe")]
      
      start2 <- Sys.time()
      
      model2 <- train(classe~.,data=trainset2, 
                    method="rf", 
                    trControl = controla)
      
      round(Sys.time()-start2,2) # model2 run time 
## Time difference of 3.18 mins
      model2
## Random Forest 
## 
## 9812 samples
##   24 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## 
## Summary of sample sizes: 6541, 6541, 6542 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD   Kappa SD   
##    2    0.9831839  0.9787263  0.0024457371  0.003090716
##   13    0.9808399  0.9757649  0.0009795966  0.001234489
##   24    0.9735019  0.9664834  0.0029369553  0.003709115
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
      model2$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.35%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 2781    6    0    1    2 0.003225806
## B   20 1855   23    0    1 0.023170090
## C    0   17 1684    9    1 0.015780245
## D    0    1   36 1568    3 0.024875622
## E    0    1    2    9 1792 0.006651885
# validation of model 2 on balance of data
      
      validate <- predict(model2,validset)
      confusionMatrix(validate,validset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671   10    0    0    0
##          B    3 1119    8    0    1
##          C    0    9 1015   17    3
##          D    0    0    2  947    1
##          E    0    0    1    0 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9907          
##                  95% CI : (0.9878, 0.9929)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9882          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9833   0.9893   0.9824   0.9954
## Specificity            0.9976   0.9975   0.9940   0.9994   0.9998
## Pos Pred Value         0.9941   0.9894   0.9722   0.9968   0.9991
## Neg Pred Value         0.9993   0.9960   0.9977   0.9966   0.9990
## Prevalence             0.2845   0.1934   0.1744   0.1639   0.1837
## Detection Rate         0.2840   0.1902   0.1725   0.1610   0.1829
## Detection Prevalence   0.2857   0.1922   0.1775   0.1615   0.1831
## Balanced Accuracy      0.9979   0.9904   0.9917   0.9909   0.9976

Model 2, a 3-fold cross-validated random forest trained on half of the data, was validated with 99% accuracy on an out-of-sample test using the remaining 30% of the data. This model used as features the 24 variables listed earlier in the report as 'impCols'.
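
For completeness (my addition), the estimated out-of-sample error can be read directly off the validation confusion matrix as 1 minus accuracy:

      cm <- confusionMatrix(validate, validset$classe)
      round(1 - cm$overall["Accuracy"], 4)  # ~0.0093, i.e. just under 1%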

Submission Test

      answer20 <- predict(model2,cleanSubmit)
      answer20
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# results were submitted and scored 20 of 20 correct.
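
For reference, the predictions were uploaded as one small text file per problem. A minimal sketch of the writer, assuming the per-problem file format described in the assignment instructions (the helper mirrors the example function the course provided; the exact form may differ):

      # write one file per prediction: problem_id_1.txt ... problem_id_20.txt
      pml_write_files <- function(x) {
          for (i in seq_along(x)) {
              filename <- paste0("problem_id_", i, ".txt")
              write.table(x[i], file = filename, quote = FALSE,
                          row.names = FALSE, col.names = FALSE)
          }
      }
      pml_write_files(as.character(answer20))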

Appendix

This appendix has:

  1. Code for the plots.

  2. Hardware & session information for reproducibility.

  3. Total elapsed time to knit the Rmd, including all models, text, and charts.

Plot Code

      require(ggplot2)
      qplot(roll_belt,pitch_belt, data=cleanData, color = user_name,
            main="Plot 1: Roll Belt and Pitch Belt by user_name")

      qplot(pitch_arm,roll_arm, data=cleanData, color = classe,
            main="Plot 2: Roll Arm and Pitch Arm by classe")
      
      qplot(pitch_arm,roll_arm, data=cleanData, color = user_name,
            main="Plot 3: Roll Arm and Pitch Arm by user_name")
      
      qplot(pitch_arm,yaw_arm, data=cleanData, color = classe,
            main="Plot 4: Yaw Arm and Pitch Arm by classe")

# the following plots were not shown in the report
#      boxplot(cleanData)
#      boxplot(roll_belt~classe,data=cleanData, main="Roll Belt by Classe")
#      boxplot(pitch_belt~classe,data=cleanData, main="Pitch Belt by Classe")
#      boxplot(yaw_belt~classe,data=cleanData, main="Yaw Belt by Classe")
#      qplot(yaw_arm,roll_arm, data=cleanData, color = classe,
#            main="Roll Arm and Yaw Arm by classe")
#      qplot(roll_belt,pitch_belt, data=cleanData, color = classe,
#           main="Roll and Pitch Belt by classe")
#      qplot(roll_belt,yaw_belt, data=cleanData, color = classe,
#            main="Roll and Yaw Belt by classe")
#      qplot(yaw_belt,pitch_belt, data=cleanData, color = classe,
#            main="Yaw Belt and Pitch Belt by classe")

Variable Importance plot:

        plot(varImp(model1), main="Ranked variable importance in model 1")

Hardware and Session Info

My hardware: HP/Compaq C770US, Pentium Dual CPU T2390 @ 1.86GHz with 3 GB of DDR2 RAM.

      print(sessionInfo(),locale=FALSE)  
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] randomForest_4.6-10 caret_6.0-47        lattice_0.20-31    
## [4] doParallel_1.0.8    iterators_1.0.7     foreach_1.4.2      
## [7] ggplot2_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6         compiler_3.2.0      formatR_1.2        
##  [4] nloptr_1.0.4        plyr_1.8.3          class_7.3-12       
##  [7] tools_3.2.0         digest_0.6.8        lme4_1.1-7         
## [10] evaluate_0.7        gtable_0.1.2        nlme_3.1-120       
## [13] mgcv_1.8-6          Matrix_1.2-1        yaml_2.1.13        
## [16] SparseM_1.6         brglm_0.5-9         proto_0.3-10       
## [19] e1071_1.6-4         BradleyTerry2_1.0-6 stringr_1.0.0      
## [22] knitr_1.10.5        gtools_3.5.0        grid_3.2.0         
## [25] nnet_7.3-9          rmarkdown_0.7       minqa_1.2.4        
## [28] reshape2_1.4.1      car_2.0-25          magrittr_1.5       
## [31] scales_0.2.5        codetools_0.2-11    htmltools_0.2.6    
## [34] MASS_7.3-40         splines_3.2.0       pbkrtest_0.4-2     
## [37] colorspace_1.2-6    quantreg_5.11       labeling_0.3       
## [40] stringi_0.4-1       munsell_0.4.2

Elapsed Time to knit all chunks

      round(Sys.time() - startknit,2)  ## time  to knit all chunks
## Time difference of 6.23 mins
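
One housekeeping step worth noting (my addition, not in the original chunks): the parallel workers registered at the start of the Modeling section can be released once all models have run:

      stopCluster(mc)    # release the doParallel workers
      registerDoSEQ()    # restore the sequential foreach backend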

End of report. Thank you for your review and feedback.