R Feature Selection with caret - Limit results plot to top 10 and also store full results into data frame

I am relatively new to R and trying my hand at feature selection for the first time. I followed a tutorial online that used the PimaIndiansDiabetes dataset as an example. I repeated the steps in this tutorial on my own dataset that has over 110 features.

I have included the sample code for the tutorial I used below. The only difference is that my code has a larger dataset and different naming conventions.

When I plot the importance values for my own results, over 110 items appear in the plot. Does anybody know how I can limit this to the top 10?

# ensure results are repeatable (seed value assumed; any fixed integer works)
set.seed(7)

# load the libraries and the dataset
library(mlbench)
library(caret)
data(PimaIndiansDiabetes)

# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq",
               preProcess="scale", trControl=control)

# estimate variable importance
importance <- varImp(model, scale=FALSE)

# summarize importance
print(importance)

# plot importance
plot(importance)

I also want to be able to store these full results into a dataframe.
I tried the following command:

importanceDF <- as.data.frame(importance)

but I get the following error

Error in as.data.frame.default(importance) :
cannot coerce class "varImp.train" to a data.frame

Apologies if this is a simple question, I have tried googling but have yet to find an answer that works.

Thanks in advance,



As per zacdav's answer I have applied the following logic:

temp <- importance
temp$importance <- importance$importance[1:5, ]

However, I noticed that when I originally print the importance, the order in the sample data is as follows:

glucose 0.7881
mass 0.6876
age 0.6869
pregnant 0.6195
pedigree 0.6062
pressure 0.5865
triceps 0.5536
insulin 0.5379

Then when I run
temp$importance <- importance$importance[1:5, ]

I get the following order:


This takes the first 5 rows in the order they appear in the original table rather than the top 5 by importance.

I tried running the following:

# put into DF
importanceDF <- importance$importance
# sort
importanceDF_Ordered <- importanceDF[order(-importanceDF$neg),]
temp <- importanceDF_Ordered

The last line then gives an error:

Error in `$<-.data.frame`(`*tmp*`, "importance", value = list(neg =
c(0.619514925373134, :
replacement has 5 rows, data has 8

1 Answer

Looking at the structure of the importance object (for example with str(importance)), you will see it is a list of three elements: a data.frame of the importance values for each response class, plus two metadata fields. You can extract the data.frame using the $ notation.



List of 3
$ importance:'data.frame': 8 obs. of 2 variables:
..$ neg: num [1:8] 0.62 0.788 0.586 0.554 0.538 ...
..$ pos: num [1:8] 0.62 0.788 0.586 0.554 0.538 ...
$ model : chr "ROC curve"
$ calledFrom: chr "varImp"
- attr(*, "class")= chr "varImp.train"

So to get the data.frame all you need to do is importance$importance
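Once you have that data.frame, sorting it is plain base R. Here is a minimal sketch; the data.frame is a hand-built stand-in for importance$importance using the rounded values quoted in the question, and the feature column and top5 names are my own additions for illustration.

```r
# Stand-in for importance$importance (assumption: rounded values from the question,
# in the dataset's original column order).
importanceDF <- data.frame(
  neg = c(0.6195, 0.7881, 0.5865, 0.5536, 0.5379, 0.6876, 0.6062, 0.6869),
  pos = c(0.6195, 0.7881, 0.5865, 0.5536, 0.5379, 0.6876, 0.6062, 0.6869),
  row.names = c("pregnant", "glucose", "pressure", "triceps",
                "insulin", "mass", "pedigree", "age")
)

# Sort by importance, largest first. neg and pos hold the same values in the
# str() output, so sorting on either column gives the same order.
importanceDF_Ordered <- importanceDF[order(importanceDF$neg, decreasing = TRUE), ]

# Optionally copy the feature names from the row names into a regular column.
importanceDF_Ordered$feature <- rownames(importanceDF_Ordered)

# Keep the five most important features.
top5 <- head(importanceDF_Ordered, 5)
print(top5)
```

On your own data, replace the stand-in with importanceDF <- importance$importance and change 5 to 10.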


To plot only a subset of the features, you can modify the object. I would suggest working on a copy so that the analysis does not need to be rerun. A crude example is as follows:

temp <- importance
temp$importance <- importance$importance[1:5, ]

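I have chosen to keep the first five rows, using the 1:5 row index, to overwrite the data.frame inside the temp object. Note that this keeps the first five rows as they are stored, which are not necessarily the five most important features.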
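To keep the five most important rows instead, sort the data.frame before subsetting and assign the result back into the copy. The sketch below builds a stand-in object by hand from the values in the question (an assumption, so that it runs without training a model); the names scores, imp_df and ranked are illustrative only.

```r
# Hand-built stand-in for varImp(model, scale = FALSE), mirroring the str() output.
scores <- c(pregnant = 0.6195, glucose = 0.7881, pressure = 0.5865, triceps = 0.5536,
            insulin = 0.5379, mass = 0.6876, pedigree = 0.6062, age = 0.6869)
importance <- structure(
  list(importance = data.frame(neg = scores, pos = scores, row.names = names(scores)),
       model = "ROC curve",
       calledFrom = "varImp"),
  class = "varImp.train"
)

# Copy the whole object so temp is still a varImp.train list, not a data.frame.
temp <- importance

# Sort the importance table, largest first, then keep the top five rows.
imp_df <- importance$importance
ranked <- imp_df[order(imp_df$neg, decreasing = TRUE), ]
temp$importance <- ranked[1:5, ]

print(rownames(temp$importance))
```

The "replacement has 5 rows, data has 8" error in the question appears when temp is itself the 8-row data.frame: temp$importance <- ... then tries to add a 5-row column to it. Keeping temp as a copy of the full object avoids that, because the assignment replaces a list element instead.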
If you are interested in calling the plot method directly use caret:::plot.varImp.train
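As far as I know, that plot method also takes a top argument, which sorts by importance and keeps only the highest-ranked features, so no copy is needed. A rough sketch, again using a hand-built stand-in object (an assumption, with values taken from the question) and only plotting when caret is installed:

```r
# Hand-built stand-in for varImp(model, scale = FALSE); values from the question.
scores <- c(pregnant = 0.6195, glucose = 0.7881, pressure = 0.5865, triceps = 0.5536,
            insulin = 0.5379, mass = 0.6876, pedigree = 0.6062, age = 0.6869)
importance <- structure(
  list(importance = data.frame(neg = scores, pos = scores, row.names = names(scores)),
       model = "ROC curve",
       calledFrom = "varImp"),
  class = "varImp.train"
)

# Plot only the five most important features; skipped if caret is not installed.
if (requireNamespace("caret", quietly = TRUE)) {
  library(caret)
  print(plot(importance, top = 5))
}
```

With the real object from the question this reduces to plot(importance, top = 10).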



Thank you so much for your help, I really appreciate it.
– Amy
Aug 6 at 2:14

Thanks for your help. However, I noticed that when I run plot(temp) it does not actually plot the features by importance. Rather, it takes the rows as they appear originally. I have edited my question to include the code I tried in order to fix this. Would you mind taking a look to see where I am going wrong, please?
– Amy
Aug 6 at 3:04

