R: Keep values below 99% quantile by group in data frame
Sometimes, especially in remote sensing, we have to deal with outliers. An implausibly long beetle dispersal distance (over 3 km), recorded as tree infestation from one year to the next, may simply result from a source tree located outside the study area. Therefore, estimates of population dispersal based on the presence of infested trees might not correspond to true beetle dispersal.
Will the trends change after removing 0.5, 1 or 5% of the dataset? To investigate this, we first need to identify the corresponding quantiles and then keep only the values below them. If the data are grouped by a specific variable, this can be tricky. Let's look at how to keep values below the 99% quantile by group!
Example:
I would like to remove values above the 99% quantile by group.
# create data frame
df <- data.frame(group = rep(c("A", "B"), each = 3), value = c(c(6, 5, 80, 4, 80) * 10, 3))
  group value
1     A    60
2     A    50
3     A   800
4     B    40
5     B   800
6     B     3
Get quantiles for individual groups
quant <- aggregate(df$value, by = list(df$group), FUN = quantile, probs = 0.99)
> quant
  Group.1     x
1       A 785.2
2       B 784.8
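To compare several removal thresholds at once, the cut-offs can also be computed for all groups in a single call. This is only a minimal sketch in base R: the names `probs` and `cutoffs` are my own choices for illustration, and it assumes the default quantile algorithm (type 7) of `quantile()`.

```r
# Same example data as above
df <- data.frame(group = rep(c("A", "B"), each = 3),
                 value = c(60, 50, 800, 40, 800, 3))

# Thresholds to compare (illustrative choice)
probs <- c(0.95, 0.99, 0.995)

# Split the values by group and compute all quantiles per group;
# the result is a matrix with one row per threshold and one column per group
cutoffs <- sapply(split(df$value, df$group), quantile, probs = probs)

cutoffs
# the "99%" row holds the same cut-offs as aggregate() above:
# 785.2 for group A and 784.8 for group B
```

Rows are named after the thresholds ("95%", "99%", "99.5%"), so a single cut-off can be picked with, for example, `cutoffs["99%", "A"]`.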
Select only the values below the 99% quantile of their own group
df[with(df, as.logical(ave(value, group, FUN = function(x) x < quantile(x, probs = 0.99)))), ]
Which results in:
  group value
1     A    60
2     A    50
4     B    40
6     B     3
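If the same filter is needed for different thresholds or columns, the one-liner can be wrapped in a small helper. The sketch below is one possible way to do it, not the only one: the function name `keep_below_quantile` and its arguments are hypothetical, and dropping rows with missing values is my own assumption about the desired behaviour.

```r
# Hypothetical helper: keep rows whose value lies strictly below the
# group-wise quantile given by `prob`
keep_below_quantile <- function(data, value_col, group_col, prob = 0.99) {
  values <- data[[value_col]]
  # ave() repeats each group's cut-off for every row of that group
  cutoff <- ave(values, data[[group_col]],
                FUN = function(x) quantile(x, probs = prob, na.rm = TRUE))
  # Assumption: rows with a missing value are dropped as well
  data[!is.na(values) & values < cutoff, , drop = FALSE]
}

df <- data.frame(group = rep(c("A", "B"), each = 3),
                 value = c(60, 50, 800, 40, 800, 3))

# Same result as the one-liner above
filtered <- keep_below_quantile(df, "value", "group", prob = 0.99)

# Number of rows kept at several thresholds
kept <- sapply(c(0.95, 0.99, 0.995), function(p)
  nrow(keep_below_quantile(df, "value", "group", prob = p)))
```

Note that with a strict `<` the largest value of every group is always removed, even in very small groups such as the ones in this example, so it is worth checking how many rows remain per group before interpreting any trend.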