Category Archives: R

Concrete data from the AppliedPredictiveModeling library

Variable names in concrete data:

Cement

BlastFurnaceSlag

FlyAsh

Water

Superplasticizer

CoarseAggregate

FineAggregate

Age

CompressiveStrength

Concrete data Summary

summary(concrete)
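The call above assumes the data set is already in the workspace. A minimal sketch of loading it, assuming the AppliedPredictiveModeling package is installed; `data(concrete)` should make both the `concrete` and `mixtures` data frames available:

```r
# Assumes the AppliedPredictiveModeling package is already installed.
library(AppliedPredictiveModeling)

# Loads the concrete data; the mixtures data frame comes with it.
data(concrete)

# Quick checks that both data frames are present.
dim(concrete)
colnames(mixtures)
```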

Figure: output of summary(concrete)

Mixtures data Summary

summary(mixtures)

Figure: output of summary(mixtures)

Mixtures data Feature Plot

This requires the caret package: library(caret)

We could list out all the variable names as shown below:

featurePlot(x = mixtures[, c("Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age")], y = mixtures$CompressiveStrength, plot = "pairs")

or simplify a bit:

names <- colnames(mixtures)

names <- names[-length(names)]

then plot:

featurePlot(x = mixtures[, names], y = mixtures$CompressiveStrength, plot = "pairs")

Figure: featurePlot pairs plot of the mixtures data
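Dropping the last column name by position only works while CompressiveStrength stays the last column. A small sketch of selecting the predictors by name instead; the `toy` data frame and its values are invented for illustration and stand in for mixtures:

```r
# Toy data frame standing in for mixtures (values are invented).
toy <- data.frame(
  Cement = c(0.10, 0.12, 0.15),
  Water = c(0.08, 0.07, 0.09),
  Age = c(7, 28, 90),
  CompressiveStrength = c(20.1, 35.4, 48.9)
)

# Drop the outcome column by name rather than by position.
predictors <- setdiff(colnames(toy), "CompressiveStrength")
predictors
```

The same `setdiff()` call on `colnames(mixtures)` should give the vector to pass as `x = mixtures[, predictors]`.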

R: Get AppliedPredictiveModeling Library

install.packages("AppliedPredictiveModeling")

Verify it installed correctly; try loading the library:

library(AppliedPredictiveModeling)

If the console also shows “also installing the dependency ‘CORElearn’”, that is because AppliedPredictiveModeling requires CORElearn, which should install automatically. If CORElearn is already installed, you won’t see that message.

Remove all instances of duplicates in R

I came across a data set that included two identification fields. The first belonged to the data set itself and was truly unique. The second was a cross-reference to a separate data set and unfortunately wasn’t unique. I wrote some R code to remove every row whose cross-reference identifier appears more than once.

The data set was loaded from a .csv into an R data frame, mydata. “File Number” is the name of the column where the duplicated values reside.
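One way this could be done in base R, as a minimal sketch rather than the exact code from the post. The toy `mydata` below and its values are made up; only the data frame name and the “File Number” column name come from the description above:

```r
# Toy stand-in for the data frame read from the .csv (values are made up).
mydata <- data.frame(
  ID = 1:6,
  `File Number` = c("A100", "B200", "A100", "C300", "B200", "D400"),
  check.names = FALSE
)

# duplicated() flags the second and later occurrences of a value; running it
# again from the end flags the earlier ones, so together they mark every
# row whose File Number is repeated anywhere.
file_no <- mydata[["File Number"]]
is_dup <- duplicated(file_no) | duplicated(file_no, fromLast = TRUE)

# Keep only rows whose File Number occurs exactly once.
unique_only <- mydata[!is_dup, ]
unique_only
```

Note that read.csv() with its default check.names = TRUE turns the header “File Number” into File.Number, so the column name in the code has to match however the file was read.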
