Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: pml-training.csv
The test data are available here: pml-testing.csv
The data for this project come from this source
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
library(plyr);
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lattice)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
## Warning: Failed to load RGtk2 dynamic library, attempting to install it.
## Please install GTK+ from http://r.research.att.com/libs/GTK_2.24.17-X11.pkg
## If the package still does not load, please ensure that GTK+ is installed and that it is on your PATH environment variable
## IN ANY CASE, RESTART R BEFORE TRYING TO LOAD THE PACKAGE AGAIN
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(kernlab);
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(knitr)
library(e1071)
trainingdf <- read.csv("pml-training.csv")
testingdf <- read.csv("pml-testing.csv")
dim(trainingdf)
## [1] 19622 160
dim(testingdf)
## [1] 20 160
groupByClasse <- trainingdf %>% group_by(classe) %>% summarise(counts = n())
g <- ggplot(groupByClasse, aes(x = classe, y = counts))
g <- g + geom_bar(stat = "identity")
g <- g + ggtitle("Total number of records for each group")
g <- g + xlab("Groups")
g <- g + ylab("Counts")
plot(g)
rm(groupByClasse)
excludecolumns <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
"cvtd_timestamp", "new_window")
# Helper to exclude a given set of columns
getDataExcludingSomeColumns <- function(tdata, excludecolumns) {
  # Keep every column whose name is not in excludecolumns
  exdata <- tdata[, !(names(tdata) %in% excludecolumns)]
  exdata
}
# Now remove the columns
trainingdf <- getDataExcludingSomeColumns(trainingdf, excludecolumns)
testingdf <- getDataExcludingSomeColumns(testingdf, c(excludecolumns, 'problem_id'))
dim(trainingdf)
## [1] 19622 154
dim(testingdf)
## [1] 20 153
After excluding these obviously non-predictive columns, trainingdf is left with 154 columns and testingdf with 153. The one extra column is classe, which trainingdf contains and testingdf does not.
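The one-column difference can be checked directly rather than inferred from the dimensions. A minimal sketch, using two small made-up data frames (toyTrain and toyTest are illustrative names, not part of the original analysis) in place of the real data:

```r
# Toy stand-ins for trainingdf and testingdf (hypothetical data)
toyTrain <- data.frame(roll_belt = c(1.4, 1.5),
                       pitch_belt = c(8.1, 8.2),
                       classe = c("A", "B"))
toyTest <- data.frame(roll_belt = 1.3, pitch_belt = 8.0)

# Columns present in the training data but absent from the test data
onlyInTrain <- setdiff(names(toyTrain), names(toyTest))
onlyInTrain
```

On the real data, the same setdiff call on names(trainingdf) and names(testingdf) would be expected to return only classe.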
# Removing the Measured statistic columns
measuredStaticstucColPattern <- "kurtosis_|skewness_|max_|min_|amplitude_|avg_|stddev_|var_"
# Remove the summary-statistic columns: each one is an aggregate of a raw measurement (for example, max of yaw_belt), so it carries no row-level information
getDataExceludedMatchingColumnPattern <- function (tdata, excludecolumnsPattern) {
  # Use !grepl() rather than -grep(): with no matches, -grep() would drop every column
  exdata <- tdata[, !grepl(excludecolumnsPattern, colnames(tdata))]
  exdata
}
trainingdf <- getDataExceludedMatchingColumnPattern(trainingdf, measuredStaticstucColPattern)
testingdf <- getDataExceludedMatchingColumnPattern(testingdf, measuredStaticstucColPattern)
dim(trainingdf)
## [1] 19622 54
dim(testingdf)
## [1] 20 53
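One detail worth knowing when dropping columns by pattern: negative indexing with grep() behaves unexpectedly when the pattern matches nothing, because grep() then returns integer(0) and indexing with -integer(0) selects zero columns. A small sketch on a made-up data frame (toy is hypothetical) that contrasts this with a logical mask built by grepl():

```r
toy <- data.frame(roll_belt = 1:3,
                  max_yaw_belt = 4:6,
                  classe = c("A", "B", "C"))

# Pattern with a match: both approaches drop max_yaw_belt
keptGrep  <- toy[, -grep("max_", colnames(toy)), drop = FALSE]
keptGrepl <- toy[, !grepl("max_", colnames(toy)), drop = FALSE]

# Pattern with no match: -grep() selects nothing, !grepl() keeps everything
noMatchGrep  <- toy[, -grep("kurtosis_", colnames(toy)), drop = FALSE]
noMatchGrepl <- toy[, !grepl("kurtosis_", colnames(toy)), drop = FALSE]

ncol(noMatchGrep)   # zero columns survive
ncol(noMatchGrepl)  # all three columns survive
```

The pattern used in this report does match columns in both data frames, so the dimensions shown above are unaffected either way.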
# Now remove the columns that have more than 50% NA values
removedNAsColumns <- function(df) {
  numRows <- nrow(df)
  missingDf <- is.na(df)
  removedColumns <- which(colSums(missingDf) > numRows * 50 / 100)
  # It is possible that none of the columns has more than 50% NAs
  if (length(removedColumns) > 0) {
    # Drop by position; a character vector of names cannot be negated with `-`
    df <- df[, -removedColumns, drop = FALSE]
  }
  df
}
trainingdf <- removedNAsColumns(trainingdf)
testingdf <- removedNAsColumns(testingdf)
dim(trainingdf)
## [1] 19622 54
dim(testingdf)
## [1] 20 53
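The dimensions are unchanged, which suggests that no remaining column exceeds the 50% threshold. The same filter can also be written with colMeans(), which gives the missing fraction per column directly. A sketch on a made-up data frame (toy and its columns are hypothetical):

```r
toy <- data.frame(a = c(1, NA, NA, NA),   # 75% missing
                  b = c(1, 2, NA, 4),     # 25% missing
                  c = 1:4)                # nothing missing

# Fraction of missing values in each column
naFraction <- colMeans(is.na(toy))

# Keep only the columns with at most 50% missing values
kept <- toy[, naFraction <= 0.5, drop = FALSE]
names(kept)
```

Because the selection is a logical mask, it also behaves sensibly when no column crosses the threshold.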
The following code block checks whether any rows with NA values remain.
completeCase <- complete.cases(trainingdf)
nrows <- nrow(trainingdf)
sum(completeCase) == nrows
## [1] TRUE
In the code block above, sum(completeCase) == nrows returns TRUE, which confirms that the number of complete cases equals the number of rows in trainingdf. The same holds for testingdf.
## integer(0)
processedData <- function(rawdata) {
  # For each predictor column: replace NAs with the column mean,
  # then min-max scale the values to the range [0, 1]
  for (column in names(rawdata)) {
    if (column == "classe") {
      next
    }
    columnValue <- as.numeric(rawdata[, column])
    avgColumnValue <- mean(columnValue, na.rm = TRUE)
    minColumnValue <- min(columnValue, na.rm = TRUE)
    maxColumnValue <- max(columnValue, na.rm = TRUE)
    columnValue[is.na(columnValue)] <- avgColumnValue
    # Constant columns cannot be scaled; leave them unchanged
    if (maxColumnValue == minColumnValue) {
      next
    }
    # Vectorized scaling, rounded to 4 decimal places
    columnValue <- round((columnValue - minColumnValue) / (maxColumnValue - minColumnValue), 4)
    rawdata[, column] <- columnValue
  }
  rawdata
}
## Get the processed training data frame
trainingdf <- processedData(trainingdf)
testingdf <- processedData(testingdf)
dim(trainingdf)
## [1] 19622 54
dim(testingdf)
## [1] 20 53
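Note that processedData rescales each data frame with that data frame's own minimum and maximum, so trainingdf and testingdf are scaled with different constants. A common alternative is to compute the scaling parameters on the training data only and reuse them for any new data. The sketch below illustrates that idea on made-up data; fitMinMax and applyMinMax are hypothetical helper names and are not part of the original analysis.

```r
# Compute per-column min and max from the training data only
fitMinMax <- function(df) {
  list(min = sapply(df, min, na.rm = TRUE),
       max = sapply(df, max, na.rm = TRUE))
}

# Rescale a data frame with previously fitted parameters
applyMinMax <- function(df, params) {
  for (column in names(params$min)) {
    span <- params$max[[column]] - params$min[[column]]
    if (span == 0) next  # constant column: leave unchanged
    df[[column]] <- (df[[column]] - params$min[[column]]) / span
  }
  df
}

toyTrain <- data.frame(x = c(0, 5, 10), y = c(2, 4, 6))
toyTest  <- data.frame(x = c(5, 20), y = c(3, 2))

params      <- fitMinMax(toyTrain)
scaledTrain <- applyMinMax(toyTrain, params)
scaledTest  <- applyMinMax(toyTest, params)
scaledTest
```

With this approach, test values can fall outside [0, 1] when they lie beyond the training range, but a given raw value maps to the same scaled value in both data sets.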
inTrain <- createDataPartition(y = trainingdf$classe, p=.95, list = FALSE)
training <- trainingdf[inTrain, ]
testing <- trainingdf[-inTrain, ]
rfModel <- randomForest(classe~., data=training)
# Summary of the model
rfModel
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.17%
## Confusion matrix:
## A B C D E class.error
## A 5300 0 0 0 1 0.0001886437
## B 2 3605 1 0 0 0.0008314856
## C 0 7 3244 0 0 0.0021531836
## D 0 0 13 3042 1 0.0045811518
## E 0 0 0 6 3421 0.0017508025
# Confusion matrix on the held-out data
rfPredictionsTesting <- predict(rfModel, newdata = testing, type = "response")
rfCMatrix <- confusionMatrix(rfPredictionsTesting, testing$classe)
rfCMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 279 0 0 0 0
## B 0 188 0 0 0
## C 0 1 171 0 0
## D 0 0 0 160 0
## E 0 0 0 0 180
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.9943, 1)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9987
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.9947 1.0000 1.0000 1.0000
## Specificity 1.000 1.0000 0.9988 1.0000 1.0000
## Pos Pred Value 1.000 1.0000 0.9942 1.0000 1.0000
## Neg Pred Value 1.000 0.9987 1.0000 1.0000 1.0000
## Prevalence 0.285 0.1931 0.1747 0.1634 0.1839
## Detection Rate 0.285 0.1920 0.1747 0.1634 0.1839
## Detection Prevalence 0.285 0.1920 0.1757 0.1634 0.1839
## Balanced Accuracy 1.000 0.9974 0.9994 1.0000 1.0000
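The out-of-sample error implied by this held-out result is one minus the accuracy. As a worked check, the sketch below rebuilds the confusion table from the counts printed above and recomputes both numbers in base R (the variable names are illustrative):

```r
# Held-out confusion table copied from the output above
# (rows: predicted class, columns: reference class)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(279,   0,   0,   0,   0,
                 0, 188,   0,   0,   0,
                 0,   1, 171,   0,   0,
                 0,   0,   0, 160,   0,
                 0,   0,   0,   0, 180),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
outOfSampleError <- 1 - accuracy
round(c(accuracy = accuracy, error = outOfSampleError), 4)
```

That is 978 correct out of 979 held-out rows, an error of roughly 0.1%, which is in line with the 0.17% out-of-bag estimate reported in the model summary.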
#plot the model
plot(rfModel)
# Plot the variable importance
varImpPlot(rfModel)
# Confusion matrix with testing
predictionOnTesting <- predict(rfModel, newdata=testing)
confusionMatrix(predictionOnTesting, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 279 0 0 0 0
## B 0 188 0 0 0
## C 0 1 171 0 0
## D 0 0 0 160 0
## E 0 0 0 0 180
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.9943, 1)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9987
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.9947 1.0000 1.0000 1.0000
## Specificity 1.000 1.0000 0.9988 1.0000 1.0000
## Pos Pred Value 1.000 1.0000 0.9942 1.0000 1.0000
## Neg Pred Value 1.000 0.9987 1.0000 1.0000 1.0000
## Prevalence 0.285 0.1931 0.1747 0.1634 0.1839
## Detection Rate 0.285 0.1920 0.1747 0.1634 0.1839
## Detection Prevalence 0.285 0.1920 0.1757 0.1634 0.1839
## Balanced Accuracy 1.000 0.9974 0.9994 1.0000 1.0000
plot(rfCMatrix$table, col = rfCMatrix$byClass, main = paste("Random Forest Confusion Matrix: Accuracy =", round(rfCMatrix$overall['Accuracy'], 4)))
set.seed(33323)
decisionTreeModel <- rpart(classe ~ ., data=training, method="class")
library(rpart.plot)
# Normal plot
rpart.plot(decisionTreeModel)
# fancy Plot
fancyRpartPlot(decisionTreeModel)
# Predictions
predictionsDecisionTree <- predict(decisionTreeModel, testing, type = "class")
# Confusion matrix
cmtree <- confusionMatrix(predictionsDecisionTree, testing$classe)
cmtree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 240 27 8 10 10
## B 9 118 11 9 17
## C 6 11 133 21 14
## D 21 26 14 117 18
## E 3 7 5 3 121
##
## Overall Statistics
##
## Accuracy : 0.7446
## 95% CI : (0.7161, 0.7717)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6767
## Mcnemar's Test P-Value : 1.586e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8602 0.6243 0.7778 0.7312 0.6722
## Specificity 0.9214 0.9418 0.9356 0.9035 0.9775
## Pos Pred Value 0.8136 0.7195 0.7189 0.5969 0.8705
## Neg Pred Value 0.9430 0.9129 0.9521 0.9451 0.9298
## Prevalence 0.2850 0.1931 0.1747 0.1634 0.1839
## Detection Rate 0.2451 0.1205 0.1359 0.1195 0.1236
## Detection Prevalence 0.3013 0.1675 0.1890 0.2002 0.1420
## Balanced Accuracy 0.8908 0.7831 0.8567 0.8174 0.8248
# Accuracy plot
plot(cmtree$table, col = cmtree$byClass, main = paste("Decision Tree Confusion Matrix: Accuracy =", round(cmtree$overall['Accuracy'], 4)))
svmModel <- svm(classe ~ ., data = training)
# Prediction
svmPredictions <- predict(svmModel, newdata= testing)
# Confusion matrix
cmSVM <- confusionMatrix(svmPredictions, testing$classe)
cmSVM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 279 14 1 0 0
## B 0 169 4 0 0
## C 0 5 164 14 1
## D 0 1 2 146 3
## E 0 0 0 0 176
##
## Overall Statistics
##
## Accuracy : 0.954
## 95% CI : (0.939, 0.9663)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9417
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.8942 0.9591 0.9125 0.9778
## Specificity 0.9786 0.9949 0.9752 0.9927 1.0000
## Pos Pred Value 0.9490 0.9769 0.8913 0.9605 1.0000
## Neg Pred Value 1.0000 0.9752 0.9912 0.9831 0.9950
## Prevalence 0.2850 0.1931 0.1747 0.1634 0.1839
## Detection Rate 0.2850 0.1726 0.1675 0.1491 0.1798
## Detection Prevalence 0.3003 0.1767 0.1879 0.1553 0.1798
## Balanced Accuracy 0.9893 0.9446 0.9672 0.9526 0.9889
#plot
plot(cmSVM$table, col = cmSVM$byClass, main = paste("SVM Confusion Matrix: Accuracy =", round(cmSVM$overall['Accuracy'], 4)))
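To compare the three models side by side, the held-out accuracies reported above can be collected into one small table. A sketch in base R; the accuracy values are copied from the confusion matrix outputs, and the variable names are illustrative:

```r
# Held-out accuracies copied from the three confusionMatrix outputs
accuracy <- c(randomForest = 0.999, decisionTree = 0.7446, svm = 0.954)

comparison <- data.frame(model = names(accuracy),
                         accuracy = unname(accuracy),
                         error = round(1 - unname(accuracy), 4),
                         stringsAsFactors = FALSE)

# Sort from most to least accurate
comparison <- comparison[order(-comparison$accuracy), ]
comparison

best <- comparison$model[1]
best
```

On this split the random forest is the most accurate of the three, followed by the SVM and then the single decision tree.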
# Using Random Forest
rfPredictions <- predict(rfModel, newdata = testingdf)
rfPredictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B E
## Levels: A B C D E
# Using Decision tree
decisionTreePredictions <- predict(decisionTreeModel, newdata = testingdf, type= "class")
decisionTreePredictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## E A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Using SVM
dim(testingdf)
## [1] 20 53
dim(testing)
## [1] 979 54
svmPredictions <- predict(svmModel, newdata = testingdf)
svmPredictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
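The three sets of predictions for the 20 test cases agree almost everywhere. The sketch below copies the printed predictions into character vectors (rfPred, treePred and svmPred are illustrative names) and lists the test cases on which the models disagree:

```r
# Predictions copied from the three outputs above
rfPred   <- strsplit("B A B A A E D B A A B C B A E E A B B E", " ")[[1]]
treePred <- strsplit("E A B A A E D B A A B C B A E E A B B B", " ")[[1]]
svmPred  <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]

# Test cases where the decision tree or the SVM differs from the random forest
treeDiff <- which(rfPred != treePred)
svmDiff  <- which(rfPred != svmPred)

treeDiff
svmDiff
```

The decision tree differs from the random forest on cases 1 and 20, and the SVM differs only on case 20. Given the held-out accuracies above, the random forest predictions are the natural ones to prefer where the models disagree.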