Remove part of a string after "."

I am working with NCBI Reference Sequence accession numbers, such as those in the variable a:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")

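To retrieve information with the biomaRt package, I need to remove the .1, .2, etc. after the accession number. I normally do this with the following code: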

b <- sub("..*", "", a)


# [1] "" "" "" "" "" ""

But as you can see, this is not the correct approach for this variable. Can anyone help me with this?


You could do:

sub("\\.[0-9]+$", "", a)

or, if the version suffix is always a dot followed by a single digit:

library(stringr)
str_sub(a, start=1, end=-3)

You just need to escape the period:

a <- c("NM_020506.1","NM_020519.1","NM_001030297.2","NM_010281.2","NM_011419.3", "NM_053155.2")


gsub("\\..*","",a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
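
For context on why the escape matters: in a regular expression an unescaped . matches any single character, so the pattern from the question, "..*", matches the whole string starting at its first character. The following is a minimal sketch on a shortened sample (the names ids, unescaped, escaped, bracketed and anchored are only for illustration) comparing that pattern with a few ways of making the dot literal:

```r
ids <- c("NM_020506.1", "NM_001030297.2")  # hypothetical shortened sample

# Unescaped: "." matches any character, so the entire string is removed
unescaped <- sub("..*", "", ids)

# Escaped dot: removes everything from the first literal "." onwards
escaped <- sub("\\..*", "", ids)

# A character class is another way to make the dot literal
bracketed <- sub("[.].*", "", ids)

# Anchored variant: strips only a trailing ".<digits>" version suffix
anchored <- sub("\\.[0-9]+$", "", ids)

print(escaped)
```

The anchored form is the most conservative of the three, since it leaves a string alone unless it ends in a dot followed by digits.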

We can pretend they are filenames and remove extensions:

tools::file_path_sans_ext(a)
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"

If the strings all had a fixed length, substr from base R could be used directly. Since they do not, we can get the position of the "." with regexpr and pass that to substr:

substr(a, 1, regexpr("\\.", a)-1)
#[1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
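
One caveat with this approach: regexpr returns -1 for an element that contains no ".", and substr then yields an empty string for it. A small sketch, assuming some accessions might arrive without a version suffix (the vector x and the names pos and stripped are hypothetical), that leaves such elements unchanged:

```r
x <- c("NM_020506.1", "NM_020519")  # hypothetical input; second entry has no version

# Position of the first literal dot, or -1 when there is none;
# as.integer() drops the extra attributes that regexpr attaches
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Cut before the dot where one exists, otherwise keep the string as is
stripped <- ifelse(pos > 0, substr(x, 1, pos - 1), x)

print(stripped)
```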

We can use a lookahead regex to extract the part of each string before the ".":

library(stringr)


str_extract(a, ".*(?=\\.)")
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"
# [5] "NM_011419"    "NM_053155"

Another option is to use str_split from stringr:

library(stringr)
str_split(a, "\\.", simplify = TRUE)[, 1]
# [1] "NM_020506"    "NM_020519"    "NM_001030297" "NM_010281"    "NM_011419"    "NM_053155"
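
If you would rather avoid the stringr dependency, the same split-and-take-first idea can be sketched in base R with strsplit. The names ids, parts and first are only for illustration, and fixed = TRUE is used here so the dot is treated literally without escaping:

```r
ids <- c("NM_020506.1", "NM_020519.1", "NM_001030297.2")  # shortened sample

# Split each string on a literal "."; the result is a list of character vectors
parts <- strsplit(ids, ".", fixed = TRUE)

# Take the first piece of each split; vapply guarantees a character vector
first <- vapply(parts, `[`, character(1), 1)

print(first)
```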