How to get only the first ten bytes of a binary file

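I am writing a bash script that needs to get the header (first 10 bytes) of a file and then, in another section, get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems that most utilities work with ASCII files. What is a good way to achieve this task?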


head -c 10 does the right thing here.

You can use the dd command to copy an arbitrary number of bytes from a binary file.

dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
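To check that dd really is binary-safe, the two commands can be tried on a small sample whose first 10 bytes contain NUL and newline bytes. This is only a sketch: the scratch directory and the names sample.bin, header.bin and body.bin are made up for illustration, and dd's status messages are discarded.

```shell
#!/usr/bin/env bash
# Sketch: split a small binary sample with dd and verify the result.
# All file names are illustrative.
workdir=$(mktemp -d)
infile=$workdir/sample.bin

# 17 bytes in total; the first 10 contain NUL (\000) and newline (\n) bytes
printf 'AB\000\nCD\000\nEFtail!!\n' >"$infile"

dd if="$infile" of="$workdir/header.bin" bs=10 count=1 2>/dev/null   # bytes 1-10
dd if="$infile" of="$workdir/body.bin"   bs=10 skip=1  2>/dev/null   # bytes 11-end

header_size=$(( $(wc -c <"$workdir/header.bin") ))
body_size=$(( $(wc -c <"$workdir/body.bin") ))

# Joining the two parts again should reproduce the input byte for byte
if cat "$workdir/header.bin" "$workdir/body.bin" | cmp -s - "$infile"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "header=$header_size body=$body_size roundtrip=$roundtrip"
rm -rf "$workdir"
```

Note that bs=10 also applies to the second command, so a large file is copied in 10-byte blocks there.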

To get the first 10 bytes, as noted already:

head -c 10

To get all but the first 10 bytes (at least with GNU tail):

tail -c+11
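Both commands can be wrapped in small helpers. The sketch below uses hypothetical function names (header_of, body_of) and a throwaway sample file that starts with ten gzip-like header bytes; od is only there to make the binary header visible.

```shell
#!/usr/bin/env bash
# Sketch: the two commands wrapped as small helpers (names are hypothetical).
# tail -c +N starts output at byte N (1-based), so +11 skips the first 10.
header_of() { head -c 10 -- "$1"; }
body_of()   { tail -c +11 -- "$1"; }

infile=$(mktemp)
# 10 binary header bytes (including several NULs), then a text payload
printf '\037\213\010\000\000\000\000\000\000\003payload\n' >"$infile"

header_of "$infile" | od -An -tx1   # hex view of the 10 header bytes
body_of "$infile"                   # prints the rest: payload
rm -f "$infile"
```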

How to split a stream (or a file) under

Two answers here!

Reading the SO request:

get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.

I understand:

How to split a file at specific point

All the other answers here access the same file twice, instead of just splitting it!

Here is my two cents:

The interesting thing about Un*x is that every job can be treated as a filter, so it is easy to split a stream using unbuffered I/O. Most standard Un*x tools (cat, grep, awk, sed, python, perl ...) work as filters.

1. Using head or dd but in a single pass

{ head -c 10 >head_part; cat >tail_part;} <file

This is the most efficient approach, as your file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.

Note: the second redirection, >tail_part, could be placed outside the whole list ({ ...;}) as well...
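The single-pass split can be wrapped in a function. This is a sketch, not a definitive implementation: split_at and the file names are made up, and it assumes that head consumes no more than the requested bytes from the shared file descriptor (as far as I know GNU head behaves this way for -c; other implementations may read ahead, in which case cat would miss some data).

```shell
#!/usr/bin/env bash
# Sketch: single-pass split wrapped in a function (split_at is a made-up name).
# Assumption: head reads no more than N bytes from the shared descriptor,
# so cat continues exactly at byte N+1.
split_at() {
    local n="$1" src="$2" head_out="$3" tail_out="$4"
    { head -c "$n" >"$head_out"; cat >"$tail_out"; } <"$src"
}

workdir=$(mktemp -d)
# First 10 bytes: "HEADER", a NUL, a newline, "01"; then 17 bytes of text
printf 'HEADER\000\n01rest of the data\n' >"$workdir/in.bin"
split_at 10 "$workdir/in.bin" "$workdir/head_part" "$workdir/tail_part"

# The two parts, joined again, should be identical to the input
cat "$workdir/head_part" "$workdir/tail_part" | cmp - "$workdir/in.bin" && echo "split OK"
```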

You could do same, using dd:

{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part

This remains more efficient than running two dd processes that each open the same file...

...and cat still uses a standard block size for the rest of the file.
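When the input is a pipe rather than a regular file, a single 10-byte read can return fewer bytes than requested. A hedged variant is to let dd issue ten 1-byte reads instead; the sample data and file names below are invented for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: same single-pass split, but fed from a pipe instead of a file.
# bs=1 count=10 makes dd issue ten 1-byte reads, so a short read on the
# pipe cannot truncate the header; cat then copies the remainder.
workdir=$(mktemp -d)
printf '0123456789remaining bytes\n' |
    { dd bs=1 count=10 of="$workdir/head_part" 2>/dev/null; cat; } >"$workdir/tail_part"

head_bytes=$(cat "$workdir/head_part")
tail_bytes=$(cat "$workdir/tail_part")
echo "head: $head_bytes"
echo "tail: $tail_bytes"
```

The cost is ten tiny reads for the header, which is negligible here; the bulk of the data still goes through cat.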

Another sample, reading line by line:

Split an HTTP (or mail) stream at the nearly empty line (a line containing only a carriage return, \r):

nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

or, to drop the empty last header line:

nc google.com 80 <<<$'GET / HTTP/1.0\r\nHost: google.com\r\n\r' |
{ sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

This will produce two files:

ls -l so_*.raw
-rw-r--r-- 1 root    root           307 Apr 25 11:40  so_head.raw
-rw-r--r-- 1 root    root           219 Apr 25 11:40  so_body.raw


grep www so_*.raw
so_body.raw:<A HREF="http://www.google.com/">here</A>.
so_head.raw:Location: http://www.google.com/
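To try the line-based split without a network connection, a canned response can stand in for nc. This sketch assumes GNU sed (both the -u option and \r in the pattern are GNU extensions); the response text and file names are invented.

```shell
#!/usr/bin/env bash
# Sketch: split a canned HTTP-style response at the first CR-only line.
# Assumes GNU sed: with -u it reads its input unbuffered, so it does not
# consume bytes past the blank line and cat receives exactly the body.
workdir=$(mktemp -d)
printf 'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nfirst body line\nsecond body line\n' |
    { sed -nu '/^\r$/q;p' >"$workdir/head.raw"; cat; } >"$workdir/body.raw"

header_lines=$(( $(wc -l <"$workdir/head.raw") ))
body_lines=$(( $(wc -l <"$workdir/body.raw") ))
first_body_line=$(head -n 1 "$workdir/body.raw")
echo "header lines: $header_lines, body lines: $body_lines"
echo "body starts with: $first_body_line"
```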

2. Pure bash way:

If the goal is to get the values of the first 10 bytes into a usable variable, here is a nice and efficient way:

Because ten bytes are few, the fork to head can be avoided. From Read a file by bytes in BASH:

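read8() {
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    # read exactly one byte; a NUL byte yields an empty string
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" ''; return 1; }
    # "'c" makes printf output the numeric value of character c
    printf -v "$_r8_var" %02X "'$_r8_car"
}
{
    first10=()
    for i in {0..9}; do
        read8 "first10[i]" || break
    done
    cat    # pass the rest of the input through
} < "$infile" >"$outfile"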

This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data in $outfile.

declare -p first10


declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")

This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:

{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} <<<"Hello world!"
d!

As I didn't redirect the output, the string d! is printed on the terminal.

echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C


printf '%b%b%b%b%b%b%b%b%b%b\n' ${first10[@]/#/\\x}
Hello worl

About binary

You said:

These are binary files and will likely have \0's and \n's throughout the first 10 bytes.

{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < <(gzip <<<"Hello world!") >/dev/null


echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03

(A sample with a \n is at the bottom of this answer ;)

As a function

read8() {
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" ''; return 1; }
    printf -v "$_r8_var" %02X "'$_r8_car"
}
get10() {
    local -n result=${1:-first10}          # 1st arg is the array name
    local -i _i
    result=()
    for ((_i=0; _i<${2:-10}; _i++)); do    # 2nd arg is the number of bytes
        read8 "result[_i]" || { unset "result[_i]"; return 1; }
    done
    cat
}

Then (here I use the special character ⛶ to show that there was no trailing newline):

get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶


echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4


get10 test 8 <<<'Hello World!'
rld!


printf %b ${test[@]/#/\\x}
Hello Wo⛶


get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!

( And the last character printed is a \n! ;)

Final binary demo:

get10 test 256 < <(gzip <<<'Hello world!')


printf '%b' ${test[@]/#/\\x} | gunzip
Hello world!


printf "  %s %s %s %s  %s %s %s %s    %s %s %s %s  %s %s %s %s\n" ${test[@]}
1F 8B 08 00  00 00 00 00    00 03 F3 48  CD C9 C9 57
28 CF 2F CA  49 51 E4 02    00 41 E4 A9  B2 0D 00 00
00

Note: this works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.
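As an illustration of the file-recognition use, here is a minimal sketch that reads the first four bytes in pure bash and compares them with a few well-known signatures. read8 is the function from this answer (with its expansions quoted); magic_of and its small signature table are hypothetical and deliberately limited to ASCII-only signatures.

```shell
#!/usr/bin/env bash
# Sketch: guess a file type from its first four bytes, read in pure bash.
# read8 comes from the answer; magic_of and the table are illustrative.
read8() {
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" ''; return 1; }
    printf -v "$_r8_var" %02X "'$_r8_car"
}

magic_of() {
    local file=$1 hex= byte i
    # The redirection applies to the whole loop, so each read8 call
    # consumes the next byte of the same open file.
    for i in 0 1 2 3; do
        read8 byte || break
        hex+=$byte
    done <"$file"
    case $hex in
        25504446) echo pdf ;;       # "%PDF"
        504B0304) echo zip ;;       # "PK\003\004"
        47494638) echo gif ;;       # "GIF8"
        *)        echo unknown ;;
    esac
}

sample=$(mktemp)
printf '%%PDF-1.4\n%%\n' >"$sample"
magic_of "$sample"     # prints: pdf
rm -f "$sample"
```

Only four bytes are read and no external process is started, which is the case where the pure bash approach pays off.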