# [Solved] Removal of constant columns in R

I was using the prcomp function when I received this error

```
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
```

I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?

I know this is a very simple task, but I have never been across any function that does this.

Thanks,

##
Solution #1:

The problem here is that your column variance is equal to zero. You can check which column of a data frame is constant this way, for example :

```
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
```

So if you want to exclude these columns, you can use :

```
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
```

**EDIT :** In fact it is simpler to use `apply`

instead. Something like this :

```
df[,apply(df, 2, var, na.rm=TRUE) != 0]
```

##
Solution #2:

I guess this Q&A is a popular Google search result but the answer is a bit slow for a large matrix, plus I do not have enough reputation to comment on the first answer. Therefore I post a new answer to the question.

For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.

```
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
```

This is the test. More than 90% of the time is reduced compared to the first answer. It is also faster than the answer from the second comment on the question.

```
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
```

##
Solution #3:

Since this Q&A is a popular Google search result but the answer is a bit slow for a large matrix and @raymkchow version is slow with NAs i propose a new version using exponential search and `data.table`

power.

This a function I implemented in dataPreparation package.

First build an exemple data.table, with more lines than columns (which is usually the case) and 10% of NAs

```
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)
```

Then benchmark all approaches:

```
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-whichAreConstant(df, verbose=FALSE)]) # My method
```

The results are the following:

```
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
```

`dataPreparation:whichAreConstant`

is 10 times faster than the other approachs.

Plus the more rows you have the more intersting it is to use.

##
Solution #4:

If you are after a dplyr solution that returns the non-constant variables in a df, I’d recommend the following. Optionally, you can add `%>% colnames()`

if the column names are desired:

```
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
```