Handling java.lang.OutOfMemoryError when writing to Excel from R

The xlsx package can be used to read and write Excel spreadsheets from R. Unfortunately, even for moderately large spreadsheets, java.lang.OutOfMemoryError can occur. In particular,

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
  java.lang.OutOfMemoryError: Java heap space

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
  java.lang.OutOfMemoryError: GC overhead limit exceeded

(Other related exceptions are also possible but rarer.)

A similar question was asked about this error occurring when reading spreadsheets:

将一个大的 xlsx 文件导入 R?

The main advantage of using Excel spreadsheets as a data storage medium over CSV is that you can store multiple sheets in the same file, so here we consider a list of data frames to be written, one data frame per worksheet. This example dataset contains 40 data frames, each with two columns and up to 200k rows. It is designed to be big enough to be problematic, but you can change the size by altering n_sheets and n_rows.

library(xlsx)
set.seed(19790801)
n_sheets <- 40
the_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(2e5, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(the_data) <- paste("Sheet", seq_len(n_sheets))

The natural way of writing this to a file is to create a workbook using createWorkbook, then to loop over each data frame, calling createSheet and addDataFrame. Finally, the workbook can be written to file using saveWorkbook. I've added messages to the loop to make it easier to see where it falls over.

wb <- createWorkbook()
for(i in seq_along(the_data))
{
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}
saveWorkbook(wb, "test.xlsx")

Running this, 64-bit on a machine with 8GB of RAM, throws the GC overhead limit exceeded error while running addDataFrame for the first time.

How do I write large datasets to Excel spreadsheets using xlsx?


This is a known issue: http://code.google.com/p/rexcel/issues/detail?id=33

While unresolved, the issue page links to a solution by Gabor Grothendieck suggesting that the heap size should be increased by setting the java.parameters option before the rJava package is loaded. (rJava is a dependency of xlsx.)

options(java.parameters = "-Xmx1000m")

The value 1000 is the number of megabytes of RAM to allow for the Java heap; it can be replaced with any value you like. My experiments with this suggest that bigger values are better, and you can happily use your full RAM entitlement. For example, I got the best results using:

options(java.parameters = "-Xmx8000m")

on the machine with 8GB RAM.
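To illustrate the required ordering, here is a minimal sketch of a fresh session; the maxMemory check at the end is my own addition (using rJava's standard .jcall interface to java.lang.Runtime), not part of the linked solution:

# Set the heap size BEFORE any package starts the JVM; changing
# java.parameters after the JVM is running has no effect.
options(java.parameters = "-Xmx8000m")
library(rJava)  # loaded by xlsx anyway, but attached here so .jcall is visible
library(xlsx)

# Optional sanity check: ask the JVM for its maximum heap size in bytes.
rt <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
.jcall(rt, "J", "maxMemory") / 1024^2  # roughly 8000 if the option was applied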

A further improvement can be obtained by requesting a garbage collection in each iteration of the loop. As noted by @gjabel, R garbage collection can be performed using gc(). We can define a Java garbage collection function that calls the Java System.gc() method:

jgc <- function()
{
  .jcall("java/lang/System", method = "gc")
}

Then the loop can be updated to:

for(i in seq_along(the_data))
{
  gc()
  jgc()
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}

With both these code fixes, the code ran as far as i = 29 before throwing an error.

One technique that I tried unsuccessfully was to use write.xlsx2 to write the contents to file at each iteration. This was slower than the other code, and it fell over on the 10th iteration (but at least part of the contents were written to file).

for(i in seq_along(the_data))
{
  message("Writing sheet", i)
  write.xlsx2(
    the_data[[i]],
    "test.xlsx",
    sheetName = names(the_data)[i],
    append    = i > 1
  )
}

Building on @richie-cotton's answer, I found that adding gc() to the jgc function kept the CPU usage low.

jgc <- function()
{
  gc()
  .jcall("java/lang/System", method = "gc")
}

My previous for loop still struggled with the original jgc function, but with the extra command, I no longer run into the GC overhead limit exceeded error message.

You can also call gc() inside the loop if you are writing row by row. gc() triggers R's garbage collection, and it can help with any kind of memory issue.
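For illustration only, here is a minimal sketch of such a row-by-row loop; using addDataFrame with startRow to write one row at a time is my assumption about what "row by row" means here, not code from this answer:

sheet <- createSheet(wb, sheetName = "row by row")
df <- the_data[[1]]
for(i in seq_len(nrow(df)))
{
  gc()  # trigger R garbage collection on each iteration
  addDataFrame(
    df[i, ],
    sheet,
    startRow  = i,
    col.names = FALSE,
    row.names = FALSE
  )
}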

Solution for the above error: please use the R code below:

detach(package:xlsx)
detach(package:XLConnect)
library(openxlsx)

Then try to import the file again; you should not get any error. This worked for me.
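As a sketch of how the question's list of data frames (the_data above) could be written this way: openxlsx's write.xlsx accepts a named list and writes one worksheet per element, with no JVM involved, so the Java heap errors cannot occur:

library(openxlsx)
# Each element of a named list becomes its own worksheet;
# openxlsx is not Java-based, so java.parameters is irrelevant here
write.xlsx(the_data, file = "test.xlsx")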

I was having issues with write.xlsx() rather than reading... but then realised that I had accidentally been running 32-bit R. Swapping it out for 64-bit R fixed the issue.
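If you are unsure which build you are running, a quick check in base R:

# 8 indicates a 64-bit build, 4 indicates 32-bit
.Machine$sizeof.pointer

# the architecture string, e.g. "x86_64", tells you as well
R.version$arch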

Restart R and, before loading the R packages, insert:

options(java.parameters = "-Xmx2048m")

or

options(java.parameters = "-Xmx8000m")