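Speed up the loop operation in R

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates something (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours now) and I have no idea how long it will run.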

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

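Any ideas how to speed up this operation?

Current answer

This could be made a lot faster by skipping the loop and using indexes or nested ifelse() statements instead.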

idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
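
Note that the indexed version reads column 10 before any sums have been accumulated, so it may not reproduce the running total that the loop in the question builds up. Below is a minimal sketch of one loop-free way to get that running total. It assumes the columns contain no NA values and that a "group" means a consecutive run of rows with equal values in columns 3 and 6. The helper name cumsum_by_run and the sample data are made up for illustration.

```r
# Hypothetical helper: running sum of column 9 that restarts whenever
# column 6 or column 3 differs from the previous row.
cumsum_by_run <- function(temp) {
  n <- nrow(temp)
  # TRUE where a row matches the previous row in columns 6 and 3
  same <- c(FALSE, temp[-1, 6] == temp[-n, 6] & temp[-1, 3] == temp[-n, 3])
  # every FALSE starts a new run, so cumsum(!same) labels the runs 1, 2, 3, ...
  run <- cumsum(!same)
  # cumulative sum of column 9 within each run
  temp[, 10] <- ave(temp[, 9], run, FUN = cumsum)
  temp
}

# Made-up sample data with nine columns V1..V9
temp <- as.data.frame(matrix(0, nrow = 6, ncol = 9))
temp$V3 <- c(1, 1, 1, 2, 2, 1)
temp$V6 <- c(5, 5, 5, 5, 5, 5)
temp$V9 <- c(10, 20, 30, 1, 2, 7)
res <- cumsum_by_run(temp)
res[, 10]
```

Because the runs are labelled by position rather than by value, two separate stretches of rows with the same keys are still summed separately, which is how the original loop behaves.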

2010-05-25 22:15:27

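Other answers

In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.

Another option is to avoid loops altogether and replace them with vectorized arithmetic. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once: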

temp[, 10] <- temp[, 9] + c(0, temp[-nrow(temp), 10])  # previous row's column 10, with 0 for the first row

This will be much faster, and then you can filter the rows with your condition:

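i <- 2:nrow(temp)
cond.i <- c(FALSE, (temp[i, 6] == temp[i-1, 6]) & (temp[i, 3] == temp[i-1, 3]))  # TRUE where a row matches the previous row
temp[!cond.i, 10] <- temp[!cond.i, 9]  # rows that start a new group take column 9 on its own

Vectorized arithmetic takes more time and thought about the problem, but it can sometimes save several orders of magnitude in execution time.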

2010-05-26 08:37:10

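I dislike rewriting code... Of course ifelse and lapply are better options, but sometimes it is difficult to make them fit.

I frequently use data.frames the way one would use lists, such as df$var[i].

Here is a made-up example: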

nrow=function(x){ ##required as I use nrow at times.
  if(class(x)=='list') {
    length(x[[names(x)[1]]])
  }else{
    base::nrow(x)
  }
}

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
})

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  d=as.list(d) #become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) #revert back to data.frame
})

data.frame version:

   user  system elapsed 
   0.53    0.00    0.53

list version:

   user  system elapsed 
   0.04    0.00    0.03

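Using a list of vectors is about 17 times faster than using a data.frame.

Any comments on why data.frames are internally so slow in this regard? One would think they operate like lists...

For even faster code, set class(d)='list' instead of calling d=as.list(d), and restore it afterwards with class(d)='data.frame':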

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  class(d)='list'
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame'
})
head(d)
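
The same idea also works without changing the class of the object: copy the columns into plain vectors, loop over those, and write the result back once. Below is a rough sketch under that assumption. The variable names seq_v, r_v and foo_v are made up here, and no timing is claimed for it.

```r
set.seed(1)
d <- data.frame(seq = rep(1:5, length.out = 10000), r = rnorm(10000))

# Pull the columns out once so the loop only touches plain vectors
seq_v <- d$seq
r_v   <- d$r
foo_v <- numeric(nrow(d))

mark <- NA_real_
for (i in seq_along(seq_v)) {
  if (seq_v[i] == 1) mark <- r_v[i]  # remember the latest value seen where seq == 1
  foo_v[i] <- mark
}

d$foo <- foo_v  # a single data.frame assignment after the loop
head(d)
```

Wrapping either variant in system.time() on your own data is the most reliable way to see whether the difference matters for you.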

2016-08-03 03:37:48

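The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10 hours now) and I have no idea about the runtime". When developing, I always put the following code into loops to get a feel for how changes affect the speed, and also to monitor how long the loop will take to complete.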

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100,2),"%    \r") # prints the percentage complete in realtime.
    # do stuff
  }
  return(temp)
}

This works with lapply as well.

dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100,2),"%    \r")
    #do stuff
  })
  return(temp)
}

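If the function within the loop is quite fast but the number of iterations is large, consider printing only every so often, because printing to the console has an overhead of its own. For example: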

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"%    \r") # prints every 100 times through the loop
    # do stuff
  }
  return(temp)
}
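
Base R also ships a text progress bar in the utils package, which saves formatting the percentage by hand. Below is a small sketch with a placeholder loop body. The function name loop_with_progress is hypothetical.

```r
loop_with_progress <- function(n) {
  pb <- txtProgressBar(min = 0, max = n, style = 3)  # style 3 shows a bar plus a percentage
  on.exit(close(pb))                                 # tidy up the bar even if the loop fails
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i          # placeholder for the real work
    setTxtProgressBar(pb, i)    # move the bar to iteration i
  }
  total
}

loop_with_progress(1000)
```

As with cat(), updating the bar on every iteration has a cost, so for very fast loop bodies it may be worth calling setTxtProgressBar() only every few hundred iterations.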

2018-05-25 11:34:53

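Have a look at the accumulate() function from {purrr}:

library(dplyr)  # provides %>%, as_tibble(), mutate(), lag() and select()

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%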
    mutate(V10 = V9 %>% 
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}
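
If adding a dependency on purrr is not an option, base R's Reduce() with accumulate = TRUE can express the same kind of fold. Below is a minimal sketch. The function name accumulate_base and the sample vectors are invented for illustration, and cond is assumed to be a logical vector with no NA values.

```r
# Hypothetical base-R counterpart: v9 holds the values to sum, and cond[i] is
# TRUE when row i continues the group of row i - 1.
accumulate_base <- function(v9, cond) {
  idx <- seq_along(v9)[-1]  # rows 2..n; row 1 seeds the fold via init
  Reduce(
    function(prev, i) if (cond[i]) prev + v9[i] else v9[i],
    idx,
    init = v9[1],
    accumulate = TRUE
  )
}

v9   <- c(10, 20, 30, 1, 2, 7)
cond <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
accumulate_base(v9, cond)
```

Reduce() still calls an R function once per row, so it is mainly a readability choice rather than a guaranteed speed-up.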

2020-09-13 11:05:57

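As mentioned by Ari at the end of his answer, the Rcpp and inline packages make it incredibly easy to make things fast. As an example, try this inline code (warning: not tested):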

body <- 'Rcpp::NumericMatrix nm(temp);
         int nrtemp = Rcpp::as<int>(nrt);
         // C++ indices are 0-based, so R column 10 is index 9, and so on
         for (int i = 0; i < nrtemp; ++i) {
             nm(i, 9) = i;
             if (i > 0) {
                 if (nm(i, 5) == nm(i - 1, 5) && nm(i, 2) == nm(i - 1, 2)) {
                     nm(i, 9) = nm(i, 8) + nm(i - 1, 9);
                 } else {
                     nm(i, 9) = nm(i, 8);
                 }
             } else {
                 nm(i, 9) = nm(i, 8);
             }
         }
         return Rcpp::wrap(nm);
        '

library(Rcpp)
library(inline)

settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body=body,
    plugin="Rcpp", settings=settings, cppargs="-I/usr/include")

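dayloop2 <- function(temp) {
    # extract a numeric matrix from temp, put it in tmp
    # (assumes temp is all numeric and already has 10 columns)
    tmp <- as.matrix(temp)
    nr <- nrow(tmp)              # the C++ loop runs over rows, so pass the row count
    nm <- dayloop(nr, tmp)
    temp[, 10] <- nm[, 10]       # copy the computed column back into the data.frame
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}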

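There is a similar procedure for #including things, where you just pass a parameter

inc <- '#include <header.h>'

to cxxfunction, as includes=inc. What's really cool about this is that it does all of the linking and compilation for you, so prototyping is really fast.

Disclaimer: I'm not totally sure that the class of tmp should be numeric and not numeric matrix or something else. But I'm mostly sure.

Edit: if you still need more speed after this, OpenMP is a parallelization facility that works well with C++. I haven't tried using it from inline, but it should work. The idea would be, in the case of n cores, to have loop iteration k carried out by core k % n. A suitable introduction can be found in Matloff's The Art of R Programming, chapter 16, "Resorting to C".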
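
The k % n assignment can be illustrated in plain R without any OpenMP code. The sketch below only shows which worker would receive which iteration. The values n_cores = 4 and the ten iterations are arbitrary choices for the example. Keep in mind that a running sum depends on the previous row, so this kind of split only applies to iterations that are independent of each other (for instance, separate groups).

```r
n_cores    <- 4     # assumed number of cores
iterations <- 0:9   # loop counter k, 0-based as in the C++ loop

# iteration k goes to worker k %% n_cores
worker <- iterations %% n_cores
assignment <- split(iterations, worker)
assignment
```

Each list element is named after a worker ("0" to "3") and holds the iterations that worker would run.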

2012-07-26 06:15:50
