Speeding up loop operations in R

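I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates some values (a simple operation). The data.frame has roughly 850K rows. My PC is still working (about 10 hours so far) and I have no idea how long the run will take.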

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

Any ideas on how to speed up this operation?

Current answer

Have a look at the accumulate() function from {purrr}:

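library(dplyr)  # provides %>%, as_tibble(), mutate(), lag(), select()

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    # cond[i] is TRUE when row i matches row i-1 in both V6 and V3
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>% 
             # accumulate2() expects .y to be one element shorter than .x
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i   # same run: add to the running total
               } else {
                 .i          # new run: restart from the current value
               }
             }) %>% unlist()) %>%
    select(-cond)
}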
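
To see what the accumulation step is doing without the tidyverse, here is a minimal base R sketch of the same idea using Reduce() with accumulate = TRUE. The vectors vals and cond are hypothetical toy inputs standing in for V9 and the condition column; they are not taken from the question's data.

```r
# Hypothetical toy inputs: vals plays the role of V9, cond marks rows that
# continue the previous row's run (the first element is always FALSE).
vals <- c(1, 2, 3, 4, 5)
cond <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Walk over positions 2..n, carrying the running total as the accumulator.
running <- Reduce(
  f = function(acc, i) if (cond[i]) acc + vals[i] else vals[i],
  x = seq_along(vals)[-1],
  init = vals[1],
  accumulate = TRUE
)

running
```

The total restarts wherever cond is FALSE, which is the behaviour the original loop implements row by row.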

2020-09-13 11:05:57

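Other answers

The biggest problem and the root of the inefficiency is indexing the data.frame, meaning all the lines where you use temp[,]. Try to avoid this as much as possible. I took your function and changed the indexing; here is version_A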

dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

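As you can see, I create a vector res which gathers the results. At the end I add it to the data.frame, so I don't need to mess with names. So how much better is it?

I ran each function on a data.frame with nrow from 1,000 to 10,000 in steps of 1,000 and measured the time with system.time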

X <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
system.time(dayloop2(X))

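The result is

You can see that your version depends exponentially on nrow(X). The modified version has a linear relation, and a simple lm model predicts that the computation for 850,000 rows takes 6 minutes and 10 seconds.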
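
The measurement and extrapolation described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code used for the original figures: the sizes are kept small so it finishes quickly, the seed is arbitrary, and dayloop2_A is repeated in condensed form only so that the snippet runs on its own.

```r
# Condensed copy of dayloop2_A so that this sketch is self-contained.
dayloop2_A <- function(temp) {
  res <- numeric(nrow(temp))
  for (i in seq_len(nrow(temp))) {
    same <- i > 1 &&
      temp[i, 6] == temp[i - 1, 6] &&
      temp[i, 3] == temp[i - 1, 3]
    res[i] <- if (same) temp[i, 9] + res[i - 1] else temp[i, 9]
  }
  temp$`Kumm.` <- res
  temp
}

set.seed(1)                        # assumption: fixed seed for repeatability
sizes <- seq(200, 1000, by = 200)  # assumption: small sizes to keep this quick

# Elapsed seconds for each size, on freshly simulated data.
elapsed <- sapply(sizes, function(n) {
  X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))
  system.time(dayloop2_A(X))[["elapsed"]]
})

timings <- data.frame(n = sizes, elapsed = elapsed)

# Linear fit of time against row count, then extrapolation to 850,000 rows.
fit <- lm(elapsed ~ n, data = timings)
predict(fit, newdata = data.frame(n = 850000))
```

The predicted value depends entirely on the machine and R version, so it will not match the 6 minutes 10 seconds quoted above.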

The power of vectorization

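As Shane and Calimo state in their answers, vectorization is the key to better performance. From your code you can move two things outside of the loop: the condition check, and the initialization of the results (which are temp[i,9]).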

This leads to this code

dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

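Compare the results for these functions, this time for nrow from 10,000 to 100,000 in steps of 10,000.

Tuning the tuned

Another tweak is to change the loop indexing from temp[i,9] to res[i] (they are exactly the same in the i-th loop iteration). Again, this is the difference between indexing a vector and indexing a data.frame. Second thing: when you look at the loop, you can see that there is no need to loop over all i, only over the ones that satisfy the condition. So here we go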

dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

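The performance you gain depends heavily on the data structure. Precisely, on the percentage of TRUE values in the condition. For my simulated data, the computation for 850,000 rows takes under one second.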
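
As a rough illustration of that dependence, the share of TRUE values in the condition can be measured directly. The sketch below assumes data simulated the same way as in the benchmark above (values drawn uniformly from 1:10); the seed and row count are arbitrary choices.

```r
set.seed(42)  # assumption: seed chosen only for repeatability
n <- 100000   # assumption: smaller than the 850,000 rows in the question
X <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))

# Same condition vector as in dayloop2_D.
cond <- c(FALSE, (X[-n, 6] == X[-1, 6]) & (X[-n, 3] == X[-1, 3]))

# Share of rows for which the loop body in dayloop2_D actually runs.
mean(cond)
```

With two independent columns that each take 10 equally likely values, a row matches its predecessor in both with probability 1/100, so the loop only visits about 1% of the rows. Real data with long runs of identical keys would have a much larger share and a correspondingly smaller gain.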

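If you want to go further, I see at least two things that can be done. First, write C code to do the conditional cumsum. Second, if you know that the longest run in your data is not large, you can change the loop to a vectorized while, something like:

while (any(cond)) {
    # n is nrow(temp); indx marks rows whose predecessor is already final
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}

The code used for the simulations and figures is available on GitHub.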
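
The vectorized while idea can be wrapped into a small function to make it runnable. This is a sketch under stated assumptions: the name cumsum_by_cond and its arguments are hypothetical, x stands for column 9, and cond is the logical vector built as in dayloop2_D. The first element of cond is forced to FALSE because the first row has no predecessor.

```r
# Hypothetical helper: conditional cumulative sum without a per-row loop.
cumsum_by_cond <- function(x, cond) {
  stopifnot(length(x) == length(cond), length(x) >= 1)
  n <- length(x)
  res <- x
  cond[1] <- FALSE  # guard: row 1 can never continue a run
  while (any(cond)) {
    # Rows that continue a run and whose predecessor is already final.
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx) - 1]
    cond[indx] <- FALSE
  }
  res
}

cumsum_by_cond(c(1, 2, 3, 4, 5), c(FALSE, TRUE, TRUE, FALSE, TRUE))
```

Each pass of the while loop finalizes one more position in every run, so the number of passes equals the length of the longest run of TRUE values, which is why this only pays off when runs are short.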

2010-06-03 22:34:17

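In R, you can often speed up loop processing by using the apply family of functions (in your case, it would probably be replicate). Have a look at the plyr package, which provides progress bars.

Another option is to avoid loops altogether and replace them with vectorized arithmetic. I'm not sure exactly what you are doing, but you can probably apply your function to all rows at once: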

temp[1:nrow(temp), 10] <- temp[1:nrow(temp), 9] + temp[0:(nrow(temp)-1), 10]

This will be much faster, and then you can filter the rows with your condition:

cond.i <- (temp[i, 6] == temp[i-1, 6]) & (temp[i, 3] == temp[i-1, 3])
temp[cond.i, 10] <- temp[cond.i, 9]

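Vectorized arithmetic requires more time and more thought about the problem, but it can sometimes save several orders of magnitude in execution time.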
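
One fully vectorized way to express the question's accumulation in base R is a cumulative sum that restarts at every new run. The sketch below is an illustration on simulated data, not code from any of the answers; the names same and run_id are hypothetical, and the column positions (3, 6, 9) follow the question.

```r
set.seed(1)  # assumption: arbitrary seed, small simulated data
n <- 20
temp <- as.data.frame(matrix(sample(1:3, n * 9, TRUE), n, 9))

# TRUE where a row matches the previous row in columns 6 and 3.
same <- c(FALSE, (temp[-1, 6] == temp[-n, 6]) & (temp[-1, 3] == temp[-n, 3]))

# Every FALSE starts a new run, so cumsum(!same) labels the runs 1, 2, 3, ...
run_id <- cumsum(!same)

# Cumulative sum of column 9, restarted at the beginning of every run.
temp$`Kumm.` <- ave(temp[, 9], run_id, FUN = cumsum)

head(temp[, c(3, 6, 9, 10)])
```

Here ave() applies cumsum() within each run and returns the results in the original row order, so no explicit loop over rows is needed.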

2010-05-26 08:37:10

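If you are using for loops, you are most likely coding R as if it were C or Java or some other language. R code that is properly vectorized is extremely fast.

Take, for example, these simple pieces of code, which generate a sequence of 100,000 integers:

The first code example shows how one would write the loop using a traditional coding paradigm. It takes 28 seconds to complete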

system.time({
    a <- NULL
    for(i in 1:1e5)a[i] <- i
})
   user  system elapsed 
  28.36    0.07   28.61

You can get an almost 100-fold improvement simply by pre-allocating the memory:

system.time({
    a <- rep(1, 1e5)
    for(i in 1:1e5)a[i] <- i
})

   user  system elapsed 
   0.30    0.00    0.29

But with the base R vector operation using the colon operator :, this is virtually instantaneous:

system.time(a <- 1:1e5)

   user  system elapsed 
      0       0       0
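
The three approaches above can be checked to produce the same vector. This is a small sketch with hypothetical variable names; it uses integer(n) and seq_len(n) rather than rep(1, 1e5) and 1:1e5 so that all three results have the same integer type.

```r
n <- 1e5

grow <- NULL
for (i in seq_len(n)) grow[i] <- i          # vector grows on every assignment

prealloc <- integer(n)                       # allocated once, typed as integer
for (i in seq_len(n)) prealloc[i] <- i

vectorised <- seq_len(n)                     # no explicit loop at all

# All three contain exactly the same values and type.
identical(grow, prealloc) && identical(prealloc, vectorised)
```

Timings differ between R versions and machines, so wrapping each variant in system.time() on your own setup is the only reliable way to compare them.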

2011-06-28 06:55:05

This could be made much faster by skipping the loops and using indexes or nested ifelse() statements.

idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."

2010-05-25 22:15:27
