How to find the statistical mode?

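In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object, not the value that occurs the most in its argument. But is there a standard library function that implements the statistical mode for a vector (or list)?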

Current answer

Here is another solution:

freq <- tapply(mySamples,mySamples,length)
#or freq <- table(mySamples)
as.numeric(names(freq)[which.max(freq)])
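A minimal sketch of the snippet above on made-up data (the values assigned to `mySamples` and `tied` are assumptions for illustration only). It also shows two limits of this approach: `as.numeric(names(...))` only makes sense for numeric input, and `which.max()` keeps just the first maximum, so ties have to be recovered separately.

```r
# Hypothetical sample data; `mySamples` is the variable name used in the answer.
mySamples <- c(3, 1, 2, 3, 2, 3, 1)

freq <- table(mySamples)                          # counts per distinct value
mode_value <- as.numeric(names(freq)[which.max(freq)])
print(mode_value)

# which.max() returns only the first maximum, so a tie is silently dropped.
# Comparing every count against the maximum recovers all tied values.
tied <- c(1, 1, 2, 2, 3)
tied_freq <- table(tied)
all_modes <- as.numeric(names(tied_freq)[as.vector(tied_freq) == max(tied_freq)])
print(all_modes)
```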

2010-03-30 20:21:29

Other answers

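A small modification of Ken Williams' answer, adding the optional parameters na.rm and return_multiple.

Unlike the answers that rely on names(), this answer preserves the data type of x in the returned value(s).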

stat_mode <- function(x, return_multiple = TRUE, na.rm = FALSE) {
  if(na.rm){
    x <- na.omit(x)
  }
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  mode_loc <- if(return_multiple) which(freq==max(freq)) else which.max(freq)
  return(ux[mode_loc])
}

To show that it works with the optional parameters and preserves the data type:

foo <- c(2L, 2L, 3L, 4L, 4L, 5L, NA, NA)
bar <- c('mouse','mouse','dog','cat','cat','bird',NA,NA)

str(stat_mode(foo)) # int [1:3] 2 4 NA
str(stat_mode(bar)) # chr [1:3] "mouse" "cat" NA
str(stat_mode(bar, na.rm=TRUE)) # chr [1:2] "mouse" "cat"
str(stat_mode(bar, return_multiple=FALSE, na.rm=TRUE)) # chr "mouse"

Thanks to @Frank for the simplification.

2017-07-20 13:43:38

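While I like Ken Williams' simple function, I would like to retrieve multiple modes if they exist. With that in mind, I use the following function, which returns all of the modes if there are several, or the single mode otherwise.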

rmode <- function(x) {
  x <- sort(x)  
  u <- unique(x)
  y <- lapply(u, function(y) length(x[x==y]))
  u[which( unlist(y) == max(unlist(y)) )]
}
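A short usage sketch for rmode(), with the function restated so the block runs on its own; the input vectors are made up for illustration. Note that the result is a plain vector of the same type as the input, and that sort() drops NA values by default, so NA never counts as a mode here.

```r
# rmode() restated from the answer above so the example is self-contained.
rmode <- function(x) {
  x <- sort(x)                       # sort() also drops NA by default
  u <- unique(x)
  y <- lapply(u, function(y) length(x[x == y]))
  u[which(unlist(y) == max(unlist(y)))]
}

one_mode  <- rmode(c(5, 3, 5, 1))                    # a single mode
two_modes <- rmode(c("b", "a", "b", "a", "c", NA))   # "a" and "b" are tied
print(one_mode)
print(two_modes)
```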

2014-12-24 16:08:02

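The generic function fmode in the collapse package, now available on CRAN, implements a C++ based mode using index hashing. It is significantly faster than any of the above approaches. It comes with methods for vectors, matrices, data.frames and dplyr grouped tibbles. Syntax: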

library(collapse)
fmode(x, g = NULL, w = NULL, ...)

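where x can be one of the above objects, g supplies an optional grouping vector or list of grouping vectors (for grouped mode calculations, also performed in C++), and w (optionally) supplies a numeric weight vector. In the grouped tibble method there is no g argument, and you can do data %>% group_by(idvar) %>% fmode.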
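To make the roles of g and w concrete, here is a base-R sketch of a grouped, weighted mode. It is not the collapse implementation: weighted_mode() and grouped_mode() are hypothetical helpers, and the sketch assumes that a weighted mode means the value whose weights add up to the largest total.

```r
# Hypothetical base-R helpers; these illustrate the idea only and are
# NOT how collapse::fmode is implemented.
weighted_mode <- function(x, w = rep(1, length(x))) {
  ux  <- unique(x)
  idx <- match(x, ux)                               # position of each element in ux
  totals <- vapply(seq_along(ux),
                   function(i) sum(w[idx == i]),    # total weight per distinct value
                   numeric(1))
  ux[which.max(totals)]                             # first value with the largest total
}

grouped_mode <- function(x, g, w = rep(1, length(x))) {
  pieces <- split(seq_along(x), g)                  # element indices per group
  sapply(pieces, function(i) weighted_mode(x[i], w[i]))
}

x <- c("a", "b", "b", "a", "a", "c")
g <- c(1, 1, 1, 2, 2, 2)
w <- c(5, 1, 1, 1, 1, 9)

by_group   <- grouped_mode(x, g)      # unweighted: most frequent value per group
by_group_w <- grouped_mode(x, g, w)   # weighted: largest total weight per group
print(by_group)
print(by_group_w)
```

With unit weights this reduces to the ordinary mode within each group; the weights only matter when they change which value has the largest total.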

2020-03-19 21:45:11

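I looked through all of these options and started to wonder about their relative features and performance, so I ran some tests. In case anyone else is curious, I'm sharing my results here.

I didn't want to bother with every function posted here, so I chose a sample based on a few criteria: the function should work on character, factor, logical and numeric vectors, it should deal with NAs and other problematic values appropriately, and the output should be "sensible", i.e. no numerics returned as character or other such silliness.

I also added a function of my own, which is based on the same idea as chrispy's, except adapted for more general use: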

library(magrittr)

Aksel <- function(x, freq=FALSE) {
    z <- 2
    if (freq) z <- 1:2
    run <- x %>% as.vector %>% sort %>% rle %>% unclass %>% data.frame
    colnames(run) <- c("freq", "value")
    run[which(run$freq==max(run$freq)), z] %>% as.vector   
}

set.seed(2)

F <- sample(c("yes", "no", "maybe", NA), 10, replace=TRUE) %>% factor
Aksel(F)

# [1] maybe yes  

C <- sample(c("Steve", "Jane", "Jonas", "Petra"), 20, replace=TRUE)
Aksel(C, freq=TRUE)

# freq value
#    7 Steve

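Finally, I ran five functions on two sets of test data using microbenchmark. The function names refer to their respective authors:

Chris' function was set to method="modes" and na.rm=TRUE by default to make it more comparable, but other than that the functions were used as presented here by their authors.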
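A comparison of this kind can be roughly sketched in base R with system.time(). This is not the benchmark described above: it uses only two approaches from this thread, restated under the hypothetical names mode_tabulate() and mode_table(), and made-up data with few versus many unique values. No timings are shown because they depend on the machine.

```r
# Two approaches from this thread, restated so the block is self-contained.
mode_tabulate <- function(x) {          # unique/tabulate/match approach
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mode_table <- function(x) {             # table()-based approach; numeric input only
  freq <- table(x)
  as.numeric(names(freq)[which.max(freq)])
}

set.seed(1)
few_unique  <- sample(1:10,    1e5, replace = TRUE)   # few distinct values
many_unique <- sample(1:50000, 1e5, replace = TRUE)   # many distinct values

# Elapsed seconds for `reps` repeated calls of f on x.
time_it <- function(f, x, reps = 5) {
  system.time(for (i in seq_len(reps)) f(x))[["elapsed"]]
}

timings <- c(
  tabulate_few  = time_it(mode_tabulate, few_unique),
  table_few     = time_it(mode_table,    few_unique),
  tabulate_many = time_it(mode_tabulate, many_unique),
  table_many    = time_it(mode_table,    many_unique)
)
print(timings)
```

When values are tied, the two functions may return different modes: the first picks the tied value that appears first in the data, the second the smallest one.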

In terms of speed alone, Ken's version wins handily, but it is also the only one of these that reports just one mode, no matter how many there really are. As is often the case, there's a trade-off between speed and versatility. With method="mode", Chris' version returns a value if and only if there is exactly one mode, and NA otherwise. I think that's a nice touch. I also think it's interesting how some of the functions are affected by an increased number of unique values, while others aren't nearly as much. I haven't studied the code in detail to figure out why that is, apart from eliminating logical/numeric as the cause.
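The "one mode or NA" behaviour mentioned above can be sketched as follows. This is not Chris' code: single_mode_or_na() is a hypothetical helper, and dropping NA values before counting is an assumption made for this example.

```r
# Hypothetical helper: return the mode only when it is unique, otherwise NA.
single_mode_or_na <- function(x) {
  x <- x[!is.na(x)]                          # assumption: NA is never a mode
  if (length(x) == 0) return(x[NA_integer_]) # nothing left to count
  ux   <- unique(x)
  freq <- tabulate(match(x, ux), nbins = length(ux))
  winners <- ux[freq == max(freq)]           # every value with the top count
  if (length(winners) == 1) winners else ux[NA_integer_]  # NA of the same type as x
}

unique_case <- single_mode_or_na(c(2, 2, 3, NA))   # 2 is the only mode
tied_case   <- single_mode_or_na(c(2, 2, 3, 3))    # 2 and 3 tie, so NA
print(unique_case)
print(tied_case)
```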

2016-05-27 02:49:33
