Create a group number for each consecutive sequence

小开

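You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.

Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.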

my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector

The data.frame method for $ treats x as a list.

my.dataframe$new.col <- a.vector

When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix

my.dataframe[ , "new.col"] <- a.vector

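The single-index forms work because, when you do not specify whether you are working with rows or columns, the data.frame method assumes you mean columns.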
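As a quick check of the forms quoted above, the sketch below adds the same vector with each of them and compares the results. The names my.dataframe and a.vector follow the snippets above; the id column and the values are made up for illustration.

```r
# Hypothetical toy data; only the indexing forms come from the help text
my.dataframe <- data.frame(id = 1:3)
a.vector <- c(10, 20, 30)

by.single <- my.dataframe; by.single["new.col"] <- a.vector     # list-style, [
by.double <- my.dataframe; by.double[["new.col"]] <- a.vector   # list-style, [[
by.dollar <- my.dataframe; by.dollar$new.col <- a.vector        # $ method
by.matrix <- my.dataframe; by.matrix[, "new.col"] <- a.vector   # matrix-style, [i, j]

# Compare the four results pairwise
all.equal(by.single, by.double)
all.equal(by.single, by.dollar)
all.equal(by.single, by.matrix)
```

Which form to use is mostly a matter of taste; $ is the shortest to type, while the bracket forms accept a column name stored in a variable.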

For your example, this should work:

# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))


# find the rows where a 1 appears, i.e. where each sequence starts
from <- which(your.df$no == 1)
to <- c((from-1)[-1], nrow(your.df)) # up to which point the sequence runs


# generate a sequence (len) and based on its length, repeat a consecutive number len times
# (SIMPLIFY = FALSE keeps the result a list even when all groups have the same length)
get.seq <- mapply(from, to, 1:length(from), FUN = function(x, y, z) {
  len <- length(seq(from = x[1], to = y[1]))
  return(rep(z, times = len))
}, SIMPLIFY = FALSE)


# when we unlist, we get a vector, which we append to the original data.frame
your.df$group <- unlist(get.seq)
# since this is designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)




   no     h_freq   h_freqsq group
1   1 0.40998238 0.06463876     1
2   2 0.98086928 0.33093795     1
3   3 0.28908651 0.74077119     1
4   4 0.10476768 0.56784786     1
5   1 0.75478995 0.60479945     2
6   2 0.26974011 0.95231761     2
7   3 0.53676266 0.74370154     2
8   4 0.99784066 0.37499294     2
9   5 0.89771767 0.83467805     2
10  6 0.05363139 0.32066178     2
11  7 0.71741529 0.84572717     2
12  1 0.10654430 0.32917711     3
13  2 0.41971959 0.87155514     3
14  3 0.32432646 0.65789294     3
15  4 0.77896780 0.27599187     3
16  5 0.06100008 0.55399326     3
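The mapply step can also be written without an explicit function, because rep accepts a vector of repeat counts. This is a minimal sketch of that idea, assuming the same from/to logic as above and using a bare vector no in place of your.df$no:

```r
# Same fake sequence as above: runs of length 4, 7 and 5
no <- c(1:4, 1:7, 1:5)

from <- which(no == 1)              # first row of each run
to <- c(from[-1] - 1, length(no))   # last row of each run

# Repeat each group number once per row in its run
group <- rep(seq_along(from), times = to - from + 1)
group
```

Like the mapply version, this assumes every run starts at 1.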

小开

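In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.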

# Note that I use a global variable here
# normally not advisable, but I liked the
# use here to make the code shorter
index <<- 0
new_column = sapply(df$h_no, function(x) {
  # <<- is needed here as well; a plain = would only change a local copy
  if(x == 1) index <<- index + 1
  return(index)
})

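The function iterates over the values in h_no and always returns the category that the current value belongs to. If a value of 1 is detected, we increase the global variable index and continue.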
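If the global variable is a concern, the counter can live inside a closure instead. This is only a sketch of that variation; make_grouper is a hypothetical helper name, and h_no is a stand-in vector for df$h_no:

```r
# Hypothetical helper: returns a function that carries its own counter
make_grouper <- function() {
  index <- 0
  function(x) {
    if (x == 1) index <<- index + 1   # updates the enclosing index, not a global
    index
  }
}

h_no <- c(1:4, 1:7, 1:4)   # stand-in for df$h_no
new_column <- sapply(h_no, make_grouper())
new_column
```

The logic is unchanged: each 1 bumps the counter, and every value receives the counter's current state.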

小开

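If I understand the question correctly, you want to detect when h_no doesn't increase and then increment class. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)

Working

We only care about the h_no column for the moment, so we can extract it from the data frame: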

> h_no <- data$h_no

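We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is negative or zero. R provides the diff function, which gives us the vector of differences: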

> d.h_no <- diff(h_no)
> d.h_no
[1]  1  1  1 -3  1  1  1  1  1  1 -6  1  1  1

Once we have those, it is simple to find the ones that are non-positive:

> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[13] FALSE FALSE

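In R, TRUE and FALSE are basically the same as 1 and 0, so if we take the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.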

> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2

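But there are two problems: the numbers are one too small, and we are missing the first element (there should be four in the first class).

The first problem is simply solved: 1+cumsum(nonpos). The second just requires adding a 1 to the front of the vector, since the first element is always in class 1:

> classes <- c(1, 1 + cumsum(nonpos))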
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3

Now we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the class heading):

> data_w_classes <- cbind(data, class=classes)

data_w_classes now contains the result.
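The two facts the walkthrough relies on, that logicals are summed as 1 and 0 and that cumsum undoes diff, can be checked directly. A small sketch, using a stand-in h_no vector with the same values as in the walkthrough:

```r
h_no <- c(1:4, 1:7, 1:4)   # stand-in for data$h_no

# Logical values are coerced to 1 (TRUE) and 0 (FALSE) in arithmetic
sum(c(TRUE, FALSE, TRUE))

# cumsum reverses diff once the first element is put back in front
rebuilt <- cumsum(c(h_no[1], diff(h_no)))
all(rebuilt == h_no)
```

The need to put the first element back is the same reason the class vector needs a leading 1: diff returns one element fewer than its input.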

Final result

We can compress the lines together and wrap it all up into a function to make it easier to use:

classify <- function(data) {
  cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}

Or, since it makes sense for class to be a factor:

classify <- function(data) {
  cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}

You can use either function like this:

> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column

(This method of solving the problem is good because it avoids explicit iteration, which is generally recommended for R, and avoids generating lots of intermediate vectors, lists, etc. It's also kind of neat that it can be written on one line :) )
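For a self-contained run of the factor version, the sketch below applies classify to made-up input and also shows what happens with a repeated value. Both input data frames are hypothetical; only classify itself comes from the answer.

```r
classify <- function(data) {
  cbind(data, class = factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}

# Hypothetical input: three runs of lengths 4, 7 and 4
data <- data.frame(h_no = c(1:4, 1:7, 1:4))
classified <- classify(data)
table(classified$class)   # number of rows per class

# A repeated value (difference of zero) also starts a new class
ties <- classify(data.frame(h_no = c(1, 2, 2, 3)))
as.integer(ties$class)
```

Note that the function assumes at least one row: on an empty data frame the leading 1 would not match the number of rows.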

小开

Simple: your data frame is A

b <- A[,1]
b <- b==1
b <- cumsum(b)

Then b is your column.
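Counting the 1s is the shortest approach, but it relies on every sequence starting at 1. The sketch below uses a made-up vector whose first run starts at 2 to show how it differs from the diff-based approach from the earlier answer:

```r
h_no <- c(2, 3, 4, 1, 2, 3)   # hypothetical data: the first run starts at 2

by_ones <- cumsum(h_no == 1)                    # counts the 1s seen so far
by_diff <- c(1, 1 + cumsum(diff(h_no) <= 0))    # counts the drops seen so far

by_ones
by_diff
```

Here by_ones labels the first run 0 because no 1 has been seen yet, while by_diff labels it 1. When every run does start at 1, the two give the same grouping.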

小开

Data.frame[,'h_new_column'] <- as.integer(Data.frame[,'h_no'], breaks=c(1, 4, 7))

小开

I believe that using cbind is the simplest way to add a column to a data frame in R. Here is an example:

myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
newCol= seq(2,20,2)
myDf = cbind(myDf,newCol)
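In the example above, cbind takes the column name from the variable name. A short sketch of naming the column explicitly instead; the name doubled is made up for illustration:

```r
myDf <- data.frame(index = seq(1, 10, 1), Val = seq(1, 10, 1))

# Naming the argument sets the column name directly
named <- cbind(myDf, doubled = seq(2, 20, 2))   # "doubled" is a hypothetical name
names(named)

# List-style assignment should produce the same columns
myDf$doubled <- seq(2, 20, 2)
all.equal(named, myDf)
```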

小开

An approach based on identifying the number of groups (x in mapply) and their lengths (y in mapply):

mytb<-read.table(text="h_no  h_freq  h_freqsq group
1     0.09091 0.008264628 1
2     0.00000 0.000000000 1
3     0.04545 0.002065702 1
4     0.00000 0.000000000 1
1     0.13636 0.018594050 2
2     0.00000 0.000000000 2
3     0.00000 0.000000000 2
4     0.04545 0.002065702 2
5     0.31818 0.101238512 2
6     0.00000 0.000000000 2
7     0.50000 0.250000000 2
1     0.13636 0.018594050 3
2     0.09091 0.008264628 3
3     0.40909 0.167354628 3
4     0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL


# which() matches exactly 1; grep(1, ...) would also match values such as 10 or 11
positionsof1s<-which(mytb$h_no == 1)


mytb$newgroup<-unlist(mapply(function(x,y)
  rep(x,y),                      # repeat x number y times
  x= 1:length(positionsof1s),    # x is 1 to number of groups = g1:g3
  y= c( diff(positionsof1s),     # y is number of repeats of groups g1 to penultimate (g2) = 4, 7
        nrow(mytb)-              # this line and the following give the number of repeats for the last group (g3)
        (positionsof1s[length(positionsof1s )]-1 )  # number of rows - rows before the start of the last group (g3)
  ),
  SIMPLIFY=FALSE ) )             # keep a list even when all groups have the same length
mytb
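Once the positions of the 1s are known, base R's findInterval can assign the groups in one call. This is a different technique from the mapply approach above, sketched here on a stand-in vector with the same h_no values as mytb:

```r
h_no <- c(1:4, 1:7, 1:4)   # stand-in for mytb$h_no

positionsof1s <- which(h_no == 1)   # rows where a new group starts

# For each row number, count how many group starts lie at or before it
newgroup <- findInterval(seq_along(h_no), positionsof1s)
newgroup
```

findInterval requires its second argument to be sorted, which holds here because which returns positions in increasing order.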

小开

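The data.table function rleid is handy for things like this. We subtract the sequence 1:nrow(data) to transform consecutive sequences into constants, and then use rleid to create the group IDs: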

data$g = data.table::rleid(data$h_no - 1:nrow(data))
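To see why the subtraction works, the sketch below prints the intermediate values and rebuilds the IDs in base R. The cumsum line is only an illustrative stand-in for rleid on a single vector, not data.table's implementation, and h_no is a made-up vector:

```r
h_no <- c(1:4, 1:7, 1:4)   # stand-in for data$h_no

# Within a run, h_no and the row number rise together, so their difference is constant
offset <- h_no - seq_along(h_no)
offset

# Base R stand-in for rleid: start a new ID whenever the value changes
g <- cumsum(c(TRUE, diff(offset) != 0))
g
```

Unlike the approaches based on finding 1s, this one also starts a new group when a sequence skips a value, for example going from 2 to 4.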