How does the UNIX sort command sort a very large file?

The UNIX sort command can sort a very large file like this:

sort large_file

How is the sort algorithm implemented?

Why doesn't it cause excessive memory consumption?


The algorithmic details of the UNIX sort command indicate that Unix sort uses an external R-way merge sort algorithm. The link goes into more detail, but in essence it divides the input into smaller chunks (each small enough to fit in memory) and then merges those chunks together at the end.
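As a rough sketch of the same idea done by hand with shell tools (this is not how sort is implemented internally; the chunk size and file names are only illustrative):

split -l 1000000 large_file chunk.            # cut the input into pieces small enough to sort in memory
for f in chunk.*; do sort "$f" -o "$f"; done  # sort each piece on its own
sort -m chunk.* > large_file.sorted           # merge the already-sorted pieces in a single pass
rm chunk.*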

The sort command stores its working data in temporary disk files (usually in /tmp).
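If /tmp is on a small filesystem, you can point those temporary files somewhere else with -T (the directory below is just an example):

sort -T /var/tmp large_file > large_file.sorted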

I'm not familiar with the program, but I guess it is done by means of external sorting (most of the problem is held in temporary files, while a relatively small part is held in memory at a time). See Donald Knuth's The Art of Computer Programming, Vol. 3, Sorting and Searching, Section 5.4 for a very in-depth discussion of the subject.

Warning: this script starts one shell per chunk; for really large files, this could be hundreds of shells.


Here is a script I wrote for this purpose. On a 4-processor machine it improved sort performance by 100%!

#! /bin/ksh


MAX_LINES_PER_CHUNK=1000000
ORIGINAL_FILE=$1
SORTED_FILE=$2
CHUNK_FILE_PREFIX=$ORIGINAL_FILE.split.
SORTED_CHUNK_FILES=$CHUNK_FILE_PREFIX*.sorted


usage ()
{
    echo "Parallel sort"
    echo "usage: psort file1 file2"
    echo "Sorts text file file1 and stores the output in file2"
    echo "Note: file1 will be split in chunks up to $MAX_LINES_PER_CHUNK lines"
    echo "      and each chunk will be sorted in parallel"
}


# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi


# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
rm -f $SORTED_FILE


# Splitting $ORIGINAL_FILE into chunks ...
split -l $MAX_LINES_PER_CHUNK $ORIGINAL_FILE $CHUNK_FILE_PREFIX


for file in $CHUNK_FILE_PREFIX*
do
    sort $file > $file.sorted &
done
wait


# Merging chunks to $SORTED_FILE ...
sort -m $SORTED_CHUNK_FILES > $SORTED_FILE


# Clean up any leftover files
rm -f $SORTED_CHUNK_FILES > /dev/null
rm -f $CHUNK_FILE_PREFIX* > /dev/null
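A possible invocation, assuming the script above is saved as psort and made executable (the file names are just placeholders):

./psort big_input.txt big_input.sorted.txt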

See also: "Sorting large files faster with a shell script"

Memory should not be a problem; sort already takes care of that. If you want to make optimal use of a multi-core CPU, I have implemented this in a small script (similar to some you might find on the net, but simpler/cleaner than most of them ;)).

#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b $2 $1 $1.part
suffix=sorttemp.`date +%s`
nthreads=$3
i=0
for fname in $1.part*
do
    let i++
    sort $fname > $fname.$suffix &
    mres=$(($i % $nthreads))
    test "$mres" -eq 0 && wait
done
wait
sort -m *.$suffix
rm $1.part*
#!/bin/bash


usage ()
{
    echo "Parallel sort"
    echo "usage: psort file1 file2"
    echo "Sorts text file file1 and stores the output in file2"
}


# test if we have two arguments on the command line
if [ $# != 2 ]
then
    usage
    exit
fi


pv $1 | parallel --pipe --files sort -S512M | parallel -Xj1 sort -S1024M -m {} ';' rm {} > $2
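A rough reading of that pipeline, stage by stage (my interpretation of the GNU parallel options, not text from the original answer):

# pv $1                                      streams the input file while showing a progress bar
# parallel --pipe --files sort -S512M        cuts stdin into blocks, sorts each block in a parallel job,
#                                            and prints the names of the temporary files holding the results
# parallel -Xj1 sort -S1024M -m {} ';' rm {} merges all of those sorted temporary files in a single job,
#                                            then deletes them; the merged output is redirected to $2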

Look carefully at the options of sort to speed up performance and understand its impact on your machine and problem. The key parameters on Ubuntu are:

  • Location of temporary files: -T directory_name
  • Amount of memory to use: -S N% (N% of all memory; the more the better, but avoid over-subscription, which causes swapping to disk. You can use it like "-S 80%" to use 80% of available RAM, or "-S 2G" for 2 GB of RAM.)

The questioner asks "Why no high memory usage?" The answer to that comes from history: older UNIX machines were small, so the default memory size is set small. Adjust this as big as possible for your workload to vastly improve sort performance. Set the working directory to a place on your fastest device that has enough space to hold at least 1.25 times the size of the file being sorted.
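For example, on a machine with plenty of RAM and a fast scratch disk, GNU sort can be given a large buffer, several threads, and a roomy temporary directory in one go (the path and sizes here are only illustrative):

sort -S 8G --parallel=4 -T /mnt/fast_scratch -o large_file.sorted large_file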

How to sort a large file using the -T option

I had to sort a large file by its 7th column.

I was using:

grep vdd  "file name" | sort -nk 7 |

I was getting the following error:

sort: write failed: /tmp/sort1hc37c: No space left on device

Then I used the -T option, as shown below:

grep vdda  "file name" | sort -nk 7  -T /dev/null/ |
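The same idea with -T pointed at a directory that actually has enough free space (the path here is hypothetical):

grep vdda "file name" | sort -nk 7 -T /path/with/space/ > sorted_output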