Reading fixed-width text files

I'm trying to load this ugly formatted data set into my R session: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

Weekly SST data starts week centered on 3Jan1990


Nino1+2      Nino3        Nino34        Nino4
Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA
03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3
10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3
17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3

So far, I can read the file in with

  x = readLines(path)

But the file mixes white space and '-' as separators, and I'm no regex expert. Any help turning this into a nice, clean R data frame would be appreciated. Thanks!


I don't know a thing about R, but I can provide you with a regex that will match such lines:

\s[0-9]{2}[A-Z]{3}[0-9]{4}(\s{5}[0-9]+\.[0-9]+[ -][0-9]+\.[0-9]+){4}
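For what it's worth, here is a minimal sketch of using that regex from R to keep only the data rows. The sample lines are copied inline (note the leading space the raw file carries on data rows), and the backslashes are doubled because the pattern sits inside an R string literal:

```r
# Sample lines inline; real data rows in the file start with a space
lines <- c(
  "Weekly SST data starts week centered on 3Jan1990",
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)
# The regex above, escaped for an R string
pat <- "\\s[0-9]{2}[A-Z]{3}[0-9]{4}(\\s{5}[0-9]+\\.[0-9]+[ -][0-9]+\\.[0-9]+){4}"
data_lines <- lines[grepl(pat, lines)]  # drops the header line
```

This keeps the two data rows and discards the header.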

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file = url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip = 4,
  widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4))


head(x)


          V1   V2   V3   V4   V5   V6   V7   V8  V9
1 03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2 10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3 17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4 24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5 31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6 07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3
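The same call also accepts a col.names argument if you'd rather not keep the V1..V9 defaults. A sketch, run offline on the three sample rows via a text connection; the column names here are my own invention, not anything the file specifies:

```r
# Three raw rows copied inline (including the leading space the file has)
sample <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3",
  " 17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3")

x <- read.fwf(
  file = textConnection(sample),
  widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4),
  col.names = c("week", "sst12", "ssta12", "sst3", "ssta3",
                "sst34", "ssta34", "sst4", "ssta4"))  # hypothetical names
```

The numeric fields come back as numeric columns with no extra cleanup.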

Update

The package readr (released April, 2015) provides a simple and fast alternative.

library(readr)


x <- read_fwf(
  file = "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  skip = 4,
  col_positions = fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().

Another way to determine widths...

df <- read.fwf(
  file = url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  widths = c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  skip = 4
)

A -1 in the widths argument says there is a one-character column that should be ignored, a -5 says there is a five-character column that should be ignored, and so on.
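To see the skipping in action without touching the network, here is the same widths vector applied to one sample row (copied inline) through a text connection:

```r
# One raw row inline; the leading space is the -1 column being skipped
row <- " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3"
df <- read.fwf(
  file = textConnection(row),
  widths = c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4))
```

The negative entries drop the filler, so only nine columns survive: the date plus the eight SST/SSTA values.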

ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6

First off, that question comes directly from the Coursera "Getting and Cleaning Data" course by Jeff Leek. While there is another part to the question, the tough part is reading the file.

That said, the course is mostly intended for learning.

I hate R's fixed width procedure. It is slow, and for a large number of variables it very quickly becomes a pain to negate certain columns, etc.

I think it's easier to use readLines() and then build your variables from the result with substr():

x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))


# Skip 4 lines
x <- x[-(1:4)]


mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32))  # and so on and so on
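One caveat with this approach: every substr() column is character. A sketch of finishing the job with base type.convert(), using two sample rows inline (the column names are still the arbitrary var1/var2/... from above):

```r
# Two raw rows copied inline, leading space included
x <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)
mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32))
# Convert the value columns from character to numeric
mydata[-1] <- lapply(mydata[-1], type.convert, as.is = TRUE)
mydata$var1 <- trimws(mydata$var1)  # drop the padding around the date
```

type.convert() picks the natural type per column, so you don't have to call as.numeric() by hand on each one.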

You can now use the read_fwf() function in Hadley Wickham's readr package.

A huge performance improvement is to be expected, compared to base read.fwf().

I document here the list of alternatives for reading fixed-width files in R, as well as providing some benchmarks for which is fastest.

My preferred approach is to combine fread with stringi; it's competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table:

library(data.table)
library(stringi)


col_ends <-
  list(beg = c(1, 10, 15, 19, 23, 28, 32, 36,
               41, 45, 49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35, 40,
               44, 48, 53, 57, 61))


data = fread(
  "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  header = FALSE, skip = 4L, sep = NULL
)[, lapply(1:(length(col_ends$beg)),
           function(ii)
             stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))
][, paste0("V", c(2, 5, 8, 11)) := NULL]
#              V1   V3   V4   V6   V7   V9  V10  V12  V13
#    1: 03JAN1990 23.4 -0.4 25.1 -0.3 26.6  0.0 28.6  0.3
#    2: 10JAN1990 23.4 -0.8 25.2 -0.3 26.6  0.1 28.6  0.3
#    3: 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6  0.3
#    4: 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4  0.2
#    5: 31JAN1990 25.1 -0.2 25.8 -0.2 26.7  0.1 28.4  0.2
#   ---
# 1365: 24FEB2016 27.1  0.9 28.4  1.8 29.0  2.1 29.5  1.4
# 1366: 02MAR2016 27.3  1.0 28.6  1.8 28.9  1.9 29.5  1.4
# 1367: 09MAR2016 27.7  1.2 28.6  1.6 28.9  1.8 29.6  1.5
# 1368: 16MAR2016 27.5  1.0 28.8  1.7 28.9  1.7 29.6  1.4
# 1369: 23MAR2016 27.2  0.9 28.6  1.4 28.8  1.5 29.5  1.2

Note that fread automatically strips leading and trailing whitespace; sometimes this is undesirable, in which case set strip.white = FALSE.


We could also have started with a vector of column widths ww by doing:

ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)

col_ends <-
  list(beg = c(1, nd[-length(nd)] + 1L),
       end = nd)
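As a quick offline check that the cumsum() construction lines up, here it is applied with base substring() (which is vectorized over start/end, so no stringi needed) to one sample row with the leading whitespace already stripped, the way fread hands it over:

```r
# Rebuild the begin/end positions from the widths
ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)
col_ends <- list(beg = c(1, nd[-length(nd)] + 1L), end = nd)

# One stripped sample row inline
row <- "03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3"
fields <- substring(row, col_ends$beg, col_ends$end)
```

Every odd-numbered field is data and every filler field is pure whitespace, which is exactly what the := NULL step above throws away.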

And we could have picked which columns to exclude more robustly by using negative indices like:

col_ends <-
  list(beg = c(1, -10, 15, 19, -23, 28, 32, -36,
               41, 45, -49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35, 40,
               44, 48, 53, 57, 61))

Then replace col_ends$beg[ii] with abs(col_ends$beg[ii]), and in the next line use:

paste0("V", which(col_ends$beg < 0))
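A tiny sketch of that bookkeeping in isolation: which() picks out the filler columns marked with a negative start position, reproducing the hard-coded c(2, 5, 8, 11) from the := NULL step above:

```r
# Negative entries flag columns to drop
beg <- c(1, -10, 15, 19, -23, 28, 32, -36, 41, 45, -49, 54, 58)
drop_cols <- paste0("V", which(beg < 0))
```

If the layout ever changes, the drop list updates itself instead of needing to be re-counted by hand.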

Lastly, if you want the column names to be read programmatically as well, you could clean up the header line with readLines:

URL <- "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"  # the data file

cols <-
  gsub("\\s", "",
       sapply(1:(length(col_ends$beg)),
              function(ii)
                stri_sub(readLines(URL, n = 4L)[4L],
                         col_ends$beg[ii] + 1L,
                         col_ends$end[ii] + 1L)))


cols <- cols[cols != ""]
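The same clean-up can be exercised offline with base substring(), assuming the header line aligns as in the sample at the top (the +1L offset accounts for the leading space the raw file carries, which fread strips from the data but readLines does not):

```r
# The raw header line copied inline, leading space included
hdr <- " Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA"

# Begin/end positions derived from the widths, as earlier
ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)
beg <- c(1, nd[-length(nd)] + 1L)

# Slice the header with a +1 offset, squeeze out whitespace, drop fillers
cols <- gsub("\\s", "", substring(hdr, beg + 1L, nd + 1L))
cols <- cols[cols != ""]
```

That leaves "Week" plus four SST/SSTA pairs, one name per kept column.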

(note that combining this step with fread would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets)