Select random lines from a file


Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words


# This is the basic selection method
<$input_file sort -R | head -n $lines


# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq        | sort -R | head -n $lines


# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines

Of course, <$input_file can be replaced with any piped standard input. This (sort -R, plus $'...\t...' so that sed matches tab characters) works on GNU/Linux and BSD/macOS.
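For readers who would rather not chain sort, uniq and sed, the same three variants can be sketched in Python. This is only an illustration that holds all lines in memory; the function name sample_lines and its parameters are made up for this example.

```python
import random

def sample_lines(lines, k, unique=False, skip_blank=False, rng=random):
    """Return up to k randomly chosen lines from an iterable of lines."""
    pool = [line.rstrip("\n") for line in lines]
    if skip_blank:
        # Same effect as sed '/^[ \t]*$/d': drop empty or space/tab-only lines
        pool = [line for line in pool if line.strip(" \t")]
    if unique:
        # Same effect as sort | uniq: keep one copy of each distinct line
        pool = sorted(set(pool))
    # Sample without replacement; never ask for more lines than are available
    return rng.sample(pool, min(k, len(pool)))
```

Calling sample_lines(open("/usr/share/dict/words"), 100) would correspond to the basic method above.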


Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output


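According to a comment on the shuf answer, he shuffled 78,000,000,000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

Powershuf did it in 0.047 seconds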

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

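The reason it is so fast is that I don't read the whole file; I just move the file pointer 10 times and print the line that follows the pointer.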
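The pointer-jumping idea can be sketched roughly as follows. This is not the actual powershuf code, just a minimal illustration under a few assumptions: the file is non-empty, lines end with "\n", and the name sample_by_seeking is invented here. Note that a line that follows a long line is more likely to be picked, and the same line can be returned twice.

```python
import os
import random

def sample_by_seeking(path, n, rng=random):
    """Return n lines found by jumping to random byte offsets in the file."""
    size = os.path.getsize(path)
    picked = []
    with open(path, "rb") as f:
        for _ in range(n):
            f.seek(rng.randrange(size))  # jump to a random byte
            f.readline()                 # discard the rest of the line we landed in
            line = f.readline()          # the next complete line
            if not line:                 # we landed in the last line: wrap around
                f.seek(0)
                line = f.readline()
            picked.append(line.decode("utf-8", errors="replace").rstrip("\n"))
    return picked
```

The cost depends only on n, not on the file size, which is why the runtime stays in the millisecond range even for huge files.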

Gitlab Repo

Old attempt

First, I needed a file with 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt




















shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

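The bottleneck was the CPU, and shuf does not use multiple threads: it pinned 1 core at 100% while the other 15 sat idle.

Python is what I use regularly, so that is what I'll use to make this faster: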

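#!/bin/python3
import random

# First pass: count the newlines, reading the file in 64 KiB chunks
f = open("lines_78000000000.txt", "rt")
count = 0
while 1:
    buffer = f.read(65536)
    if not buffer: break
    count += buffer.count('\n')

# NOTE: readline(n) reads at most n characters from the current position
# (here the end of the file); it does not jump to line number n.
for i in range(10):
    f.readline(random.randint(1, count))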

This took me just under a minute:

$ time ./shuf.py




















./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

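I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.

I know it can get faster, but I'll leave some room for others to give it a try.

Line counter source: Luther Blissett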
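The two-pass idea from the old attempt (count the newlines, then pick random line numbers) could be completed along these lines. This is a sketch, not the code that was timed above; it assumes every line ends with "\n", and the names count_lines and sample_two_pass are made up for the example. The second pass reads line by line, so it will be slower than the chunked counting pass.

```python
import random

def count_lines(path, chunk_size=65536):
    """Count newline bytes by reading the file in fixed-size chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            buffer = f.read(chunk_size)
            if not buffer:
                break
            count += buffer.count(b"\n")
    return count

def sample_two_pass(path, n, rng=random):
    """Pick n distinct line numbers, then collect those lines in a second pass."""
    total = count_lines(path)
    wanted = set(rng.sample(range(total), min(n, total)))
    picked = []
    with open(path, "r") as f:
        for number, line in enumerate(f):
            if number in wanted:
                picked.append(line.rstrip("\n"))
                if len(picked) == len(wanted):
                    break  # every wanted line has been found
    return picked
```

The result comes back in file order; shuffle it afterwards if the order should be random as well.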


My preferred method is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.

# randomly sample 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()}
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt


# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
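The same per-line coin flip can be written in Python if awk is not at hand. This is a minimal sketch of the technique (each line is kept independently with a fixed probability, so the sample size is only approximately 5%); the function name bernoulli_sample is an assumption of this example.

```python
import random

def bernoulli_sample(lines, fraction=0.05, rng=random):
    """Yield the header plus each non-blank line with probability `fraction`."""
    for number, line in enumerate(lines, start=1):
        if line.rstrip("\n") == "":
            continue  # like awk's !/^$/ : skip empty lines
        # Line 1 is the header row and is always kept, like FNR==1
        if number == 1 or rng.random() <= fraction:
            yield line
```

Because it is a generator over an iterable, it streams the file once and never holds more than one line in memory.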


seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'


# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled
rand_line_sampler() {
    N_t=$(awk '{print $1}' "$1" | wc -l) # Number of total lines

    # Total lines minus desired lines, minus 1 (the first 0 is written separately)
    N_t_m_d=$(( N_t - $2 - 1 ))

    N_d_m_1=$(( $2 - 1 )) # Number of desired lines minus 1

    # vector of 0s (fail), one per line that will not be sampled
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
        echo "0" >> vector_0.temp
    done

    # vector of 1s (success), one per desired line
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
        echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp "$1" |
        awk '$1 != 0 {$1=""; print}' |
        sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}


rand_line_sampler "parameter_1" "parameter_2"
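The shuffled 0/1 vector used by the function above can be expressed compactly in Python, without temporary files. This is a sketch of the same idea that keeps the lines in memory; the name sample_with_mask is invented for this example.

```python
import random

def sample_with_mask(lines, n, rng=random):
    """Keep the lines whose position holds a 1 in a shuffled 0/1 mask."""
    lines = list(lines)
    n = max(0, min(n, len(lines)))
    mask = [1] * n + [0] * (len(lines) - n)  # n successes, the rest failures
    rng.shuffle(mask)                        # plays the role of `shuf`
    return [line for flag, line in zip(mask, lines) if flag]
```

As with the shell version, exactly n lines are returned and they keep their original relative order.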


The "c" below is the number of lines to select from the input. Modify as needed:

#!/bin/sh


gawk '
    BEGIN   { srand(); c = 5 }
    c/NR >= rand() { lines[x++ % c] = $0 }
    END { for (i in lines)  print lines[i] }
' "$@"
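The gawk script accepts line NR with probability c/NR, which is the acceptance rule of reservoir sampling, although it overwrites slots in rotation rather than at random. For comparison, here is a minimal Python sketch of the textbook variant (often called Algorithm R), which replaces a randomly chosen slot. The function name reservoir_sample is an assumption of this example.

```python
import random

def reservoir_sample(lines, k, rng=random):
    """Return k lines chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for seen, line in enumerate(lines, start=1):
        if len(reservoir) < k:
            reservoir.append(line)   # fill the reservoir with the first k lines
        else:
            j = rng.randrange(seen)  # uniform integer in [0, seen)
            if j < k:                # happens with probability k/seen
                reservoir[j] = line
    return reservoir
```

It needs a single pass and only k lines of memory, so it also works on piped input such as reservoir_sample(sys.stdin, 5).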


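Just for completeness' sake, and because it is available from Arch's community repos: there is also a tool called shuffle, but it has no command-line switch to limit the number of lines, and its man page warns: "Since shuffle reads the input into memory, it may fail on very large files."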