How do I split a file and keep the first line in each of the pieces?

Given: one big text-data file (e.g. CSV format) with a "special" first line (e.g., field names).

Wanted: an equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

I'm guessing some concoction of split and head will do the trick?


I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This assumes your input file is file.txt, that you're not passing a prefix argument to split, and that you're working in a directory that doesn't contain any other files matching split's default xa* output names. Also, replace the '4' with your desired number of lines per piece.

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
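For example, a minimal sketch of the same loop using mktemp (the bare mktemp call, which creates the file under /tmp, is my choice here):

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp) || exit 1   # unique, safely created temp file
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done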

Edit

Using GNU split it's possible to do this:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
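For instance, a sketch of that variable-directory idea (the part_ prefix and the data.dat name are purely illustrative): each chunk lands in its own directory as data.dat.

split_filter () { mkdir -p "$FILE"; { head -n 1 file.txt; cat; } > "$FILE/data.dat"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - part_

This produces part_aa/data.dat, part_ab/data.dat, and so on.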

This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.

trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
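Putting the trap and mktemp suggestions together, one possible sketch (exit code 13 kept from above; the bare mktemp template is my choice):

tmp_file=$(mktemp) || exit 1
# single quotes: "$tmp_file" is expanded when the trap fires, not here
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done
trap - SIGINT SIGTERM SIGQUIT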

You can use [mg]awk:

awk 'NR==1{
    header=$0;
    count=1;
    print header > "x_" count;
    next
}
!( (NR-1) % 100){
    count++;
    print header > "x_" count;
}
{
    print $0 > "x_" count
}' file

100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.
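Here it is on a single line, with the slice size pulled out into an awk variable (the -v lines=100 parameter and the parentheses around the redirection targets, added for portability across awk implementations, are my tweaks):

awk -v lines=100 'NR==1{header=$0; count=1; print header > ("x_" count); next} !((NR-1) % lines){count++; print header > ("x_" count)} {print $0 > ("x_" count)}' file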

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):

tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

I liked marco's awk version and adapted it into this simplified one-liner, where you can easily specify the split fraction as granularly as you want:

awk 'NR==1{print $0 > FILENAME ".split1";  print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
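For example (the numbers are mine): changing the condition to NR % 10 > 2 sends roughly 70% of the data rows to .split1 and the rest to .split2; a larger modulus gives you finer-grained fractions.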

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.

Here's my version:

in_file=$1
awk '{if (NR!=1) {print}}' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_" # Get all lines except the first, split into 100,000-line chunks
for file in "${in_file}_"*
do
    tmp_file=$(mktemp "$in_file.XXXXXX") # Create a safer temp file
    head -n 1 "$in_file" | cat - "$file" > "$tmp_file" # Concatenate the header from the main file with the split file's contents into the temp file
    mv -f "$tmp_file" "$file" # Overwrite the headerless split file with the header-containing one
done
done

Differences:

  1. in_file is the file argument you want to split maintaining headers
  2. Use awk instead of tail due to awk having better performance
  3. split into 100,000 line files instead of 4
  4. Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
  5. Use mktemp to safely handle temporary files
  6. Use single head | cat line instead of two lines
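Saved as, say, split_with_header.sh (a hypothetical name), the script would be run as:

bash split_with_header.sh file.txt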

Use GNU Parallel:

parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:

parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):

parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:

parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)

cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'

Based on Ole Tange's answer.

See comments for some tips on installing parallel

Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard utilities (head, split, find, grep, xargs, and sed), so it should work on most *nix systems. It should also work on Windows if you install mingw-w64 / git-bash.

csvheader=$(head -1 bigfile.csv)
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00


Line by line explanation:

  1. Capture the header to a variable named csvheader
  2. Split the bigfile.csv into a number of smaller files with prefix smallfile_
  3. Find all the small files and insert the csvheader into the FIRST line of each using xargs and sed -i. Note that the sed expression must be in "double quotes" so the variable is expanded.
  4. The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
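A quick sanity check (my addition) that every piece now starts with the header:

head -2 smallfile_*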

Inspired by @Arkady's comment on a one-liner.

  • MYFILE variable simply to reduce boilerplate
  • split doesn't print the names of the files it creates, but the --additional-suffix option lets us easily control what to expect
  • removal of intermediate files via rm $part (assumes no files with same suffix)

MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done

Evidence:

-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xaafoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xabfoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xacfoo
-rw-rw-r--  1 ec2-user ec2-user  32040110 Jun  1 23:18 mycsv.csv.xadfoo

and of course head -2 *foo to see the header is added.

A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header onto each piece with cat, or with whatever tool reads it in. So, something like:

  1. head -n 1 file.txt > header.txt
  2. tail -n +2 file.txt | split -l 100 - (strip the header first so it doesn't end up duplicated in the first piece; substitute your own line count for 100)
  3. cat header.txt xaa (and likewise for each remaining piece; see the loop below)
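To apply step 3 to every piece in one go, a small sketch (assumes split's default xa* output names and writes new files rather than editing in place):

for f in xa*; do cat header.txt "$f" > "with_header_$f"; done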