人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

如果您正在Linux上工作,希望使用多个进程,并且只需要对一个或多个大对象执行读取操作,请使用makeForkCluster而不是makePSOCKcluster。这也节省了将大对象发送给其他进程的时间。

其他回答

For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.

dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) { 
        print( i, "; flushing to disk..." )
        write.table( dfinal, file=tempfile, append=!first, col.names=first )
        first <- FALSE
        dfinal <- NULL   # nuke it
    }

    # ... complex operations here that add data to 'dfinal' data frame  
}
print( "Loop done; flushing to disk and re-reading entire data set..." )
write.table( dfinal, file=tempfile, append=TRUE, col.names=FALSE )
dfinal <- read.table( tempfile )

I'm fortunate and my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set. Calling gc () "by hand" can help if the size of the data get close to available memory. Sometimes a different algorithm needs much less memory. Sometimes there's a trade off between vectorization and memory use. compare: split & lapply vs. a for loop. For the sake of fast & easy data analysis, I often work first with a small random subset (sample ()) of the data. Once the data analysis script/.Rnw is finished data analysis code and the complete data go to the calculation server for over night / over weekend / ... calculation.

使用knitr和将脚本放在Rmd块中也可以获得一些好处。

我通常将代码划分为不同的块,并选择将检查点保存到缓存或RDS文件中

在那里,你可以设置一个块被保存到“缓存”,或者你可以决定运行或不运行一个特定的块。这样,在第一次运行时,你只能处理“第一部分”,而在另一次执行时,你只能选择“第二部分”,等等。

例子:

part1
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter)  # build the corpus
```
part2
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```

作为一个副作用,这也可以让你在可重复性方面省去一些麻烦:)

请注意这些数据。table包的tables()似乎是Dirk的.ls.objects()自定义函数的一个很好的替代品(在前面的回答中有详细说明),尽管只是针对data.frames/tables,而不是矩阵,数组,列表。

我使用数据。表方案。使用它的:=运算符,你可以:

通过引用添加列 通过引用修改现有列的子集,通过引用修改组 通过引用删除列

这些操作都不会复制(可能很大的)数据。连一张桌子都没有。

聚合也特别快,因为数据。表使用更少的工作内存。

相关链接:

来自数据的新闻。表,伦敦R展示,2012年 什么时候我应该在data.table中使用:=操作符?