Trimming a huge (3.5 GB) csv file to read into R

So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (which makes Access and SQL choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) spanning 40 years. In short, it's huge, and it's not going to fit into memory if I simply try to read it in.

So my question is this: given that I want all the counties but only a single year (and just the highest level of segment, leading to about 100,000 rows in the end), what would be the best way to get this rollup into R?

Currently I'm trying to chop out the irrelevant years with Python, getting around the file-size limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages are fine). Is there a similar way to read in files a piece at a time in R?

Any ideas would be appreciated.

Update:

  • Constraints
    • Needs to use my machine, so no EC2 instances
    • It can be as slow as it needs to be; speed and resources are not an issue, as long as my machine doesn't explode
    • As you can see below, the data contains mixed types, which I need to operate on later
  • Data
    • The data is 3.5 GB, with about 8.5 million rows and 17 columns
    • A few thousand rows (~2k) are malformed, with only one column instead of 17
      • These are entirely unimportant and can be dropped
    • I only need ~100,000 rows out of this file (see below)

Data example:

County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC  [Malformed row]
[8.5 Mill rows]

I want to chop out some columns and pick two of the 40 available years (2009-2010, out of 1980-2020), so that the data can fit into R:

County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]

Results:

After tinkering with all of the suggestions, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he supplied a sample implementation.

I've reproduced a slightly adapted version of Marek's implementation here as my final answer, using strsplit and cat to keep only the columns I want.

It should also be noted that this is MUCH less efficient than Python: Python chews through the 3.5 GB file in 5 minutes while R takes about 60. But if all you have is R, then this is the ticket.

## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n=1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n=1)
while (length(line)) {
  line.split <- strsplit(line, ';')
  if (length(line.split[[1]]) > 1) {      # skip the malformed single-column rows
    if (line.split[[1]][3] == '2009') {   # keep only the target year
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n=1)
}
close(file.in)
close(file.out)

Failings by approach:

  • sqldf
    • This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, SQLite chokes.
  • MapReduce
    • Honestly, the docs intimidated me a bit on this one, so I never got around to trying it. It also looked like it required the object to be in memory as well, which would defeat the point if that were the case.
  • bigmemory
    • This approach linked cleanly to the data, but it can only handle one type at a time. As a result, all my character vectors got dropped when put into a big.table. If I ever need to design large data sets for the future, though, I'd consider using only numbers just to keep this option alive.
  • scan
    • scan seemed to have the same type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:

  1. mapReduce (pure R)
  2. RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files

Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
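Roughly, that second route could look like the sketch below. The file name is a hypothetical pre-processed copy in which the character columns have been recoded as integers, since a big.matrix holds only a single atomic type (the limitation noted in the results above); the backing-file names are placeholders too.

library(bigmemory)
# file-backed matrix: the 3.5 GB stay on disk, not in RAM
big <- read.big.matrix("numeric_recode.csv", sep = ";", header = TRUE,
                       type = "double",
                       backingfile = "gdp.bin", descriptorfile = "gdp.desc")
# only the Year column is pulled into memory to build the row index
rows  <- which(big[, "Year"] %in% c(2009, 2010))
small <- as.data.frame(big[rows, ])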

Have you considered bigmemory? Check out this and this.

Perhaps you could migrate to MySQL or PostgreSQL to avoid the MS Access limitations.

It is quite easy to connect R to these systems with a DBI-based database connector (DBI is available on CRAN).

You could import the data into an SQLite database and then use RSQLite to select subsets.
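A minimal sketch of the query side, assuming the file has already been bulk-loaded into a table called gdp (for instance with the sqlite3 command-line tool's .separator and .import commands); the database and table names here are made up.

library(DBI)    # RSQLite provides the SQLite driver for DBI
con   <- dbConnect(RSQLite::SQLite(), "county.sqlite")
small <- dbGetQuery(con, "
  SELECT County, State, Year, Quarter, Segment, GDP
  FROM   gdp
  WHERE  Year IN (2009, 2010)")
dbDisconnect(con)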

"Is there a similar way to read in files a piece at a time in R?"

Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.
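To make the block-reading idea concrete, here is a rough sketch with readChar(); the file names, the 1 MB block size, the 2009 filter, and Unix line endings are all assumptions, and any partial line at the end of a block is carried over into the next one.

con <- file("bad_data.txt", "rb")
out <- file("chopped_data.txt", "wt")
leftover <- ""
repeat {
  block <- readChar(con, nchars = 1e6, useBytes = TRUE)
  if (length(block) == 0) break          # end of file
  block <- paste0(leftover, block)
  lines <- strsplit(block, "\n", fixed = TRUE)[[1]]
  if (endsWith(block, "\n")) {
    leftover <- ""
  } else {                               # last line was cut off mid-way
    leftover <- lines[length(lines)]
    lines    <- lines[-length(lines)]
  }
  writeLines(grep("^[^;]*;[^;]*;2009;", lines, value = TRUE), out)
}
if (nzchar(leftover) && grepl("^[^;]*;[^;]*;2009;", leftover))
  writeLines(leftover, out)
close(con)
close(out)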

If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.

If you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import directly into SQLite from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
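A minimal sketch of what that could look like with sqldf's read.csv.sql(), assuming the header names come through cleanly and the malformed one-column rows don't trip up the import; the file is staged in a temporary SQLite database and only the filtered rows ever reach R.

library(sqldf)
small <- read.csv.sql("bad_data.txt",
                      sql = "SELECT County, State, Year, Quarter, Segment, GDP
                             FROM file
                             WHERE Year IN (2009, 2010)",
                      header = TRUE, sep = ";")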

I would go for a database and then run some queries to extract the samples you need via DBI.

Please avoid importing a 3.5 GB csv file into SQLite. Or at least double-check that your HUGE db fits within SQLite's limits: http://www.sqlite.org/limits.html

It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...

Amazon's EC2 could be a good solution also for instantiating a server running R and MySQL.

My two humble pennies' worth.

scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the year to see if it's one you want? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be, which would speed up the process in the future.
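A quick sketch of that idea; the file name, the chunk size, and Year being the third field are placeholders. Note that each scan() call skips from the top of the file again, which is exactly where the stored index would pay off.

chunk <- 100000
skip  <- 0
keep  <- character(0)
repeat {
  # read a block of whole lines as strings
  block <- scan("bad_data.txt", what = "character", sep = "\n",
                skip = skip, nlines = chunk, quiet = TRUE)
  if (length(block) == 0) break
  # malformed one-column rows yield NA here and are dropped by the filter
  yrs  <- sapply(strsplit(block, ";", fixed = TRUE),
                 function(f) if (length(f) >= 3) f[3] else NA_character_)
  keep <- c(keep, block[yrs %in% c("2009", "2010")])
  skip <- skip + chunk
}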

My try with readLines. This piece of code creates a csv with the selected years.

file_in <- file("in.csv","r")
file_out <- file("out.csv","a")
x <- readLines(file_in, n=1)
writeLines(x, file_out) # copy headers


B <- 300000 # depends on how large one pack is
while(length(x)) {
  # match the Year field (3rd column), allowing an optional space after ';'
  ind <- grep("^[^;]*;[^;]*; ?20(09|10)", x)
  if (length(ind)) writeLines(x[ind], file_out)
  x <- readLines(file_in, n=B)
}
close(file_in)
close(file_out)

There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:

http://colbycol.r-forge.r-project.org/

It passes any arguments along to read.table, so the combination should let you subset pretty tightly.

The ff package is a transparent way to deal with huge files.

You may see the package website and/or a presentation about it.
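For illustration, a small sketch with ff's read.table.ffdf(); the file name and chunk size are made up, and fill = TRUE is an assumption for coping with the ~2k one-column rows.

library(ff)
# read in chunks; the full table lives on disk as an ffdf, not in RAM
dat <- read.table.ffdf(file = "bad_data.txt", sep = ";", header = TRUE,
                       next.rows = 500000, fill = TRUE, VERBOSE = TRUE)
# the Year column alone fits comfortably in RAM, so build the row index from it
idx   <- which(dat$Year[] %in% c(2009, 2010))
small <- dat[idx, ]   # the ~200k selected rows come back as a regular data.frame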

I hope this helps

These days, 3.5 GB just isn't really that big; I can get access to a machine with 244 GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, and read the first 10k lines into R to check it worked, and then you can fire up a big-memory instance like r3.8xlarge and read it all in! Just my 2c.

What about using readr and the read_*_chunked family?

So in your case:

testfile.csv

County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5

Actual code

require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)

This applies f to each chunk, remembering the col-names and combining the filtered results in the end. See ?callback which is the source of this example.

This results in:

# A tibble: 2 × 8
      County State  Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment`   GDP
*      <chr> <chr> <int>   <int>   <chr>         <chr>             <chr> <dbl>
1 Ada County    NC  2009       4    FIRE     Financial             Banks   801
2 Ada County    NC  2010       1    FIRE     Financial             Banks   825

You can even increase chunk_size but in this example there are only 4 lines.

Now, in 2017, I would suggest going for Spark and SparkR.

  • the syntax can be written in a simple, rather dplyr-like way

  • it fits quite well with limited memory (limited in the 2017 sense)

However, it may be an intimidating experience to get started...
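For the record, a rough SparkR sketch (Spark 2.x API); the local master, file name, and reader options are illustrative, it assumes the header names come through without stray spaces, and inferSchema is there so Year arrives as a number rather than a string.

library(SparkR)
sparkR.session(master = "local[*]")
# the 3.5 GB file is read lazily and filtered by Spark, not by R
gdp <- read.df("bad_data.txt", source = "csv",
               header = "true", sep = ";", inferSchema = "true")
small <- collect(select(
  filter(gdp, gdp$Year %in% c(2009, 2010)),
  "County", "State", "Year", "Quarter", "Segment", "GDP"))
sparkR.session.stop()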