Create one R script run_analysis.R to do the following:
- Merge the training and the test sets to create one data set.
- Extract only the measurements on the mean and standard deviation for each measurement.
- Use descriptive activity names to name the activities in the data set.
- Appropriately label the data set with descriptive feature names.
- Create a second, independent tidy data set with the average of each variable for each activity and each subject.
Step 1 : Merge the training and the test sets to create one data set.
-
The required dataset was downloaded and unzipped into a local directory using the following R function:

    get_zip_file <- function(zipdir, url) {
      # Create a temp file to hold the downloaded zip archive
      temp <- tempfile()
      # Download the file from the given url
      download.file(url, temp)
      # Create the zipdir directory
      dir.create(zipdir)
      # Unzip the archive into the directory
      unzip(temp, exdir = zipdir)
    }

-
The local directory is assigned to `zipdir` in order to access all the relevant files.

    zipdir <- "UCI HAR Dataset"

-
Observed data from `train/X_train.txt` and `test/X_test.txt` are read into data frames `data_Train` and `data_Test` respectively, from the appropriate folders in `zipdir`.

    data_Train <- read.table(paste(zipdir, "train/X_train.txt", sep = "/"), stringsAsFactors = FALSE)
    data_Test <- read.table(paste(zipdir, "test/X_test.txt", sep = "/"), stringsAsFactors = FALSE)

-
Subject information from `train/subject_train.txt` and `test/subject_test.txt` is read into data frames `data_SubTrain` and `data_SubTest` respectively.

    data_SubTrain <- read.table(paste(zipdir, "train/subject_train.txt", sep = "/"), stringsAsFactors = FALSE)
    data_SubTest <- read.table(paste(zipdir, "test/subject_test.txt", sep = "/"), stringsAsFactors = FALSE)

-
Activity labels from `train/y_train.txt` and `test/y_test.txt` are read into data frames `data_Train_Activity` and `data_Test_Activity` respectively.

    data_Train_Activity <- read.table(paste(zipdir, "train/y_train.txt", sep = "/"), stringsAsFactors = FALSE)
    data_Test_Activity <- read.table(paste(zipdir, "test/y_test.txt", sep = "/"), stringsAsFactors = FALSE)

-
The list of all features from `features.txt` is read into data frame `data_features`. These are the 561 observed variables in the train and test data sets.

    data_features <- read.table(paste(zipdir, "features.txt", sep = "/"), stringsAsFactors = FALSE)

-
The activity name for each activity, provided in the file `activity_labels.txt`, is loaded into data frame `data_activity_labels`, which will be used in Step 3.

    data_activity_labels <- read.table(paste(zipdir, "activity_labels.txt", sep = "/"), stringsAsFactors = FALSE)

-
Data frames `data_Train` and `data_Test` are merged using `rbind()` into data frame `data_Observations`.

    data_Observations <- rbind(data_Train, data_Test)

-
Data frames `data_SubTrain` and `data_SubTest` are merged using `rbind()` into data frame `data_Subject`.

    data_Subject <- rbind(data_SubTrain, data_SubTest)

-
Data frames `data_Train_Activity` and `data_Test_Activity` are merged using `rbind()` into data frame `data_Activity`.

    data_Activity <- rbind(data_Train_Activity, data_Test_Activity)

-
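Column names for data frame `data_Observations` are taken from the second column of `data_features` and assigned with `setnames()` from the data.table package.

    setnames(data_Observations, names(data_Observations), data_features[, 2])

-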
Column names for `data_Subject` and `data_Activity` are assigned in the same way.

    setnames(data_Subject, names(data_Subject), "subject")
    setnames(data_Activity, names(data_Activity), "activity")
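
The merge-and-name pattern described above can be tried without downloading anything. The following is a minimal sketch on made-up toy data: the `toy_*` objects and their values are hypothetical stand-ins, and plain `names<-` is used as a base R alternative to `setnames()`, not as the script's own method.

```r
# Toy stand-ins for the real files; every value here is invented for illustration.
features <- c("tBodyAccMag-mean()", "tBodyAccMag-std()", "tBodyAcc-mean()-X")
toy_train <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4), V3 = c(0.5, 0.6))
toy_test  <- data.frame(V1 = 0.7, V2 = 0.8, V3 = 0.9)

# Stack the training rows on top of the test rows; both frames share column names.
toy_obs <- rbind(toy_train, toy_test)

# Base R alternative to data.table::setnames(): assign the feature names directly.
names(toy_obs) <- features

# The subject column is stacked in the same train-then-test order so rows stay aligned.
toy_subject <- rbind(data.frame(V1 = c(1L, 1L)), data.frame(V1 = 2L))
names(toy_subject) <- "subject"
```

The row order matters: observations, subjects, and activities must all be stacked train first, then test, because the later `cbind()` matches them purely by position.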
Step 2 : Extract only the measurements on the mean and standard deviation for each measurement.
-
As only the measurements on the mean and standard deviation for each measurement are required, `grepl()` is used to extract the relevant columns.
-
`grepl()` is used to keep only those columns whose names end in `mean()` or `std()`. This gives a logical vector `selectColumns` that is `TRUE` only for those columns.

    selectColumns <- (grepl("-mean\\()$", names(data_Observations)) &
                        !grepl("-meanFreq\\()", names(data_Observations)) |
                        grepl("-std\\()$", names(data_Observations)))

-
`selectColumns` is used to extract only the relevant columns from the `data_Observations` data frame.

    data_Observations <- data_Observations[, selectColumns]

-
Now all three data frames are ready to be merged with `cbind()` to create one complete data frame, `data_All`.

    data_All <- cbind(data_Activity, data_Observations)
    data_All <- cbind(data_Subject, data_All)
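
To see what the `$` anchor in the pattern does, here is a small sketch on a handful of feature names in the style of `features.txt`. The toy data frame and the single combined pattern `-(mean|std)\\(\\)$` are illustrative assumptions, not the expression used in the script.

```r
# A few feature names in the style of features.txt, chosen for illustration.
feature_names <- c("tBodyAcc-mean()-X", "tBodyAccMag-mean()", "tBodyAccMag-std()",
                   "fBodyAccMag-meanFreq()", "tBodyAcc-std()-Z")

# Anchored at the end: only names that finish with mean() or std() match.
at_end <- grepl("-(mean|std)\\(\\)$", feature_names)

# Without the anchor, per-axis features such as "-mean()-X" match as well.
anywhere <- grepl("-(mean|std)\\(\\)", feature_names)

# Hypothetical two-row data frame carrying those names as columns.
toy_obs <- as.data.frame(matrix(seq_len(10), nrow = 2))
names(toy_obs) <- feature_names

# Keep the selected columns, then prepend the id columns by position.
selected <- toy_obs[, at_end, drop = FALSE]
toy_all <- cbind(subject = c(1L, 2L), activity = c(5L, 2L), selected)
```

With the anchored pattern the per-axis columns (`-X`, `-Y`, `-Z`) are left out, and `meanFreq()` is excluded either way because the pattern requires `-mean` to be followed directly by `()`.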
Step 3 : Use descriptive activity names to name the activities in the data set.
-
The activity name for each activity, loaded into data frame `data_activity_labels` in Step 1, is used to give each activity a proper description. `WALKING_UPSTAIRS` and `WALKING_DOWNSTAIRS` are shortened to `walkup` and `walkdown`, any remaining "_" is replaced by a space, and all labels are converted to lower case.

    data_activity_labels[, 2] <- gsub("WALKING_UPSTAIRS", "walkup", data_activity_labels[, 2])
    data_activity_labels[, 2] <- gsub("WALKING_DOWNSTAIRS", "walkdown", data_activity_labels[, 2])
    data_activity_labels[, 2] <- tolower(gsub("_", " ", data_activity_labels[, 2]))

-
The `activity` column of data frame `data_All` is converted to a factor.

    data_All$activity <- as.factor(data_All$activity)

-
Appropriate labels are assigned to the levels using `setattr()` from the data.table package.

    setattr(data_All$activity, "levels", data_activity_labels[, 2])
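
The same code-to-label mapping can be expressed in base R with `factor(levels = , labels = )`. The sketch below is an alternative under stated assumptions, not the script's method: the label table is typed in by hand to mirror `activity_labels.txt`, and `codes` is a made-up vector of activity ids.

```r
# Hand-typed copy of the id/label pairs, assumed to mirror activity_labels.txt.
activity_labels <- data.frame(
  id = 1:6,
  label = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
            "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Same clean-up as described above: shorten the two stair labels, then lower-case.
clean <- activity_labels$label
clean <- gsub("WALKING_UPSTAIRS", "walkup", clean)
clean <- gsub("WALKING_DOWNSTAIRS", "walkdown", clean)
clean <- tolower(gsub("_", " ", clean))

# Hypothetical activity codes; note that code 4 does not occur.
codes <- c(5L, 2L, 3L, 1L, 6L)

# Map each code to its label explicitly by id.
activity <- factor(codes, levels = activity_labels$id, labels = clean)
```

Because the levels are given explicitly, the mapping does not depend on every activity code being present in the data, which overwriting the `levels` attribute implicitly assumes.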
Step 4 : Appropriately label the data set with descriptive feature names.
-
Lower camel case is adopted for renaming the features for easy readability.
-
Features are renamed to make them more descriptive by substituting `Mean` for `-mean()`, `Std` for `-std()`, `timeDomain` for a leading `t` and `frequencyDomain` for a leading `f`.

    names(data_All) <- gsub("-mean\\()", "Mean", names(data_All))
    names(data_All) <- gsub("-std\\()", "Std", names(data_All))
    names(data_All) <- gsub("^t", "timeDomain", names(data_All))
    names(data_All) <- gsub("^f", "frequencyDomain", names(data_All))

-
There appears to be an error in the naming of some features in the original dataset. For example, `fBodyBodyGyroJerkMag-mean()` has `Body` repeated twice, hence every occurrence of `BodyBody` is replaced by `Body`.

    names(data_All) <- gsub("BodyBody", "Body", names(data_All))
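
The renaming rules can be checked on a few names before applying them to the full data frame. This is a minimal sketch: `describe_feature()` is a hypothetical helper introduced here for illustration, and it escapes both parentheses in the patterns.

```r
# Hypothetical helper that applies the substitutions described above, in order.
describe_feature <- function(x) {
  x <- gsub("-mean\\(\\)", "Mean", x)
  x <- gsub("-std\\(\\)", "Std", x)
  x <- gsub("^t", "timeDomain", x)
  x <- gsub("^f", "frequencyDomain", x)
  # Collapse the duplicated "Body" found in some original feature names.
  gsub("BodyBody", "Body", x)
}

old_names <- c("subject", "activity", "tBodyAccMag-mean()", "fBodyBodyGyroJerkMag-std()")
new_names <- describe_feature(old_names)
```

The id columns `subject` and `activity` pass through unchanged because neither starts with `t` or `f`, which is why the `^` anchors are safe to apply to every column name.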
Step 5 : Create a second, independent tidy data set with the average of each variable for each activity and each subject.
-
The assignment asks for the variables pertaining to the mean and standard deviation of the observations, and for a tidy data set holding the average of these variables for each combination of subject and activity.
-
It is assumed that the tidy data set is to be created for each variable extracted in the previous steps.
-
`library(reshape2)` is required to perform this final step.
-
`melt()` is used on `data_All` with `id = c("subject", "activity")` to reshape it into long format, in which every measurement gets its own row keyed by subject, activity, and variable name.

    tidydata <- melt(as.data.frame(data_All), id = c("subject", "activity"))

-
The average of each variable for each activity and each subject is computed into the final data frame `tidydata` using `dcast()`.

    tidydata <- dcast(tidydata, subject + activity ~ variable, mean)

-
All measure variables are renamed with the prefix `avg`, maintaining the lower camel case convention.

    names(tidydata) <- gsub("^timeDomain", "avgTimeDomain", names(tidydata))
    names(tidydata) <- gsub("^frequencyDomain", "avgFrequencyDomain", names(tidydata))

-
Write the tidy data to a text file.

    write.table(tidydata, file = "tidyData.txt", row.names = FALSE, col.names = TRUE)
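
For readers without reshape2, the per-subject, per-activity averaging can also be sketched in base R with `aggregate()`. This is an alternative shown on made-up toy data, not the method used in `run_analysis.R`. The `toy_*` names and values are hypothetical, and the output goes to a temporary file rather than `tidyData.txt`.

```r
# Toy data in the shape of data_All after Step 4; all values are invented.
toy_all <- data.frame(
  subject = c(1L, 1L, 1L, 2L),
  activity = factor(c("walking", "walking", "sitting", "walking")),
  timeDomainBodyAccMagMean = c(0.2, 0.4, 0.1, 0.6),
  timeDomainBodyAccMagStd = c(1, 3, 5, 7)
)

# Average every measurement column within each subject/activity group.
toy_tidy <- aggregate(. ~ subject + activity, data = toy_all, FUN = mean)
toy_tidy <- toy_tidy[order(toy_tidy$subject, toy_tidy$activity), ]

# Same "avg" prefix convention as described above.
names(toy_tidy) <- gsub("^timeDomain", "avgTimeDomain", names(toy_tidy))

# Round trip through a temporary file to confirm the output reads back cleanly.
out_file <- tempfile(fileext = ".txt")
write.table(toy_tidy, file = out_file, row.names = FALSE, col.names = TRUE)
read_back <- read.table(out_file, header = TRUE)
```

A file written this way can be loaded with `read.table(file, header = TRUE)`, which is also how the script's `tidyData.txt` would be read.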