Speeding up rsync with simultaneous/concurrent file transfers?

We need to transfer 15TB of data from one server to another as fast as possible. We're currently using rsync, but we're only getting speeds of around 150 Mb/s when our network is capable of 900+ Mb/s (tested with iperf). I've run tests on the disks, network, etc., and concluded that rsync is only transferring one file at a time, which is what's causing the slowdown.

I found a script that runs a separate rsync for each folder in a directory tree (allowing you to limit it to x concurrent transfers), but I can't get it to work; it still only runs one rsync at a time.

The script I found is copied below.

Our directory tree looks like this:

/main
   - /files
      - /1
         - 343
            - 123.wav
            - 76.wav
         - 772
            - 122.wav
         - 55
            - 555.wav
            - 324.wav
            - 1209.wav
         - 43
            - 999.wav
            - 111.wav
            - 222.wav
      - /2
         - 346
            - 9993.wav
         - 4242
            - 827.wav
      - /3
         - 2545
            - 76.wav
            - 199.wav
            - 183.wav
         - 23
            - 33.wav
            - 876.wav
         - 4256
            - 998.wav
            - 1665.wav
            - 332.wav
            - 112.wav
            - 5584.wav

So what I'd like is to create an rsync for each of the directories in /main/files, up to a maximum of 5 at a time. So in this case, 3 rsyncs would run, for /main/files/1, /main/files/2 and /main/files/3.

I've tried using it like this, but it just runs 1 rsync at a time for the /main/files/2 folder:

#!/bin/bash

# Define source, target, maxdepth and cd to source
source="/main/files"
target="/main/filesTest"
depth=1
cd "${source}"

# Set the maximum number of concurrent rsync threads
maxthreads=5
# How long to wait before checking the number of rsync threads again
sleeptime=5

# Find all folders in the source directory within the maxdepth level
find . -maxdepth ${depth} -type d | while read dir
do
    # Make sure to ignore the parent folder
    if [ `echo "${dir}" | awk -F'/' '{print NF}'` -gt ${depth} ]
    then
        # Strip leading dot slash
        subfolder=$(echo "${dir}" | sed 's@^\./@@g')
        if [ ! -d "${target}/${subfolder}" ]
        then
            # Create destination folder and set ownership and permissions to match source
            mkdir -p "${target}/${subfolder}"
            chown --reference="${source}/${subfolder}" "${target}/${subfolder}"
            chmod --reference="${source}/${subfolder}" "${target}/${subfolder}"
        fi
        # Make sure the number of rsync threads running is below the threshold
        while [ `ps -ef | grep -c [r]sync` -gt ${maxthreads} ]
        do
            echo "Sleeping ${sleeptime} seconds"
            sleep ${sleeptime}
        done
        # Run rsync in background for the current subfolder and move on to the next one
        nohup rsync -a "${source}/${subfolder}/" "${target}/${subfolder}/" </dev/null >/dev/null 2>&1 &
    fi
done

# Find all files above the maxdepth level and rsync them as well
find . -maxdepth ${depth} -type f -print0 | rsync -a --files-from=- --from0 ./ "${target}/"

rsync transfers files as fast as it can over the network. For example, try using it to copy one large file that doesn't exist at all on the destination. That speed is the maximum speed rsync can transfer data. Compare it with the speed of scp (for example). rsync is even slower at raw transfer when the destination file exists, because both sides have to have a two-way chat about what parts of the file are changed, but pays for itself by identifying data that doesn't need to be transferred.

A simpler way to run rsync in parallel would be to use parallel. The command below would run up to 5 rsyncs in parallel, each one copying one directory. Be aware that the bottleneck might not be your network, but the speed of your CPUs and disks, and running things in parallel just makes them all slower, not faster.

run_rsync() {
    # e.g. copies /main/files/blah to /main/filesTest/blah
    rsync -av "$1" "/main/filesTest/${1#/main/files/}"
}
export -f run_rsync
parallel -j5 run_rsync ::: /main/files/*
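The destination path in run_rsync comes from bash's `${var#prefix}` expansion, which strips the leading /main/files/ from the argument. A quick illustration (the sample path is taken from the question's directory tree):

```shell
# ${src#/main/files/} strips the /main/files/ prefix, so each source
# subdirectory maps onto the same name under /main/filesTest.
src="/main/files/1/343"
dst="/main/filesTest/${src#/main/files/}"
echo "$dst"    # → /main/filesTest/1/343
```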

Updated answer (Jan 2020)

xargs is now the recommended tool to achieve parallel execution. It's pre-installed almost everywhere. For running multiple rsync tasks the command would be:

ls /srv/mail | xargs -n1 -P4 -I% rsync -Pa % myserver.com:/srv/mail/

This will list all folders in /srv/mail and pipe them to xargs, which reads them one-by-one and runs 4 rsync processes at a time. The % character is replaced by the input argument in each command invocation.
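As a quick illustration of the -I% substitution, here is the same pattern with echo standing in for rsync (and no -P, so the output order is deterministic):

```shell
# Demo of xargs -I% substitution: % is replaced by each input line.
# echo stands in for rsync; -I already consumes one line per invocation.
out=$(printf 'a\nb\nc\n' | xargs -I% echo "item-%")
printf '%s\n' "$out"
# → item-a
#   item-b
#   item-c
```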

Original answer using parallel:

ls /srv/mail | parallel -v -j8 rsync -raz --progress {} myserver.com:/srv/mail/{}

There are a number of alternative tools and approaches for doing this listed around the web. For example:

  • The NCSA Blog has a description of using xargs and find to parallelize rsync without having to install any new software for most *nix systems.

  • And parsync provides a feature rich Perl wrapper for parallel rsync.

I've developed a Python package called parallel_sync:

https://pythonhosted.org/parallel_sync/pages/examples.html

Here is sample code showing how to use it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds)

Parallelism is 10 by default; you can increase it:

from parallel_sync import rsync
creds = {'user': 'myusername', 'key':'~/.ssh/id_rsa', 'host':'192.168.16.31'}
rsync.upload('/tmp/local_dir', '/tmp/remote_dir', creds=creds, parallelism=20)

However, note that ssh typically has MaxSessions set to 10 by default, so to go beyond 10 parallel transfers you'll have to modify your ssh settings.
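For reference (assuming OpenSSH on the remote side), the relevant directive lives in sshd_config and takes effect after a reload:

```
# /etc/ssh/sshd_config on the remote host
MaxSessions 30
# then: sudo systemctl reload sshd
```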

You can use xargs which supports running many processes at a time. For your case it will be:

ls -1 /main/files | xargs -I {} -P 5 -n 1 rsync -avh /main/files/{} /main/filesTest/

The simplest I've found is using background jobs in the shell:

for d in /main/files/*; do
rsync -a "$d" remote:/main/files/ &
done

Beware it doesn't limit the amount of jobs! If you're network-bound this is not really a problem but if you're waiting for spinning rust this will be thrashing the disk.

You could add

while [ $(jobs | wc -l | xargs) -gt 10 ]; do sleep 1; done

inside the loop for a primitive form of job control.
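Alternatively, bash 4.3+ has `wait -n`, which blocks until any one background job finishes; a minimal sketch of that pattern, with sleep standing in for rsync:

```shell
# Limit concurrent background jobs with wait -n (bash 4.3+)
# instead of polling `jobs | wc -l`. sleep stands in for rsync here.
max_jobs=4
count=0
for d in 1 2 3 4 5 6 7 8; do
    sleep 0.1 &                 # replace with: rsync -a "$d" remote:/main/files/ &
    (( ++count >= max_jobs )) && { wait -n; (( count-- )); }
done
wait   # drain the remaining jobs
echo "launched 8 jobs, at most $max_jobs at a time"
```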

Have you tried using rclone.org?

With rclone you could do something like

rclone copy "${source}/${subfolder}/" "${target}/${subfolder}/" --progress --multi-thread-streams=N

where --multi-thread-streams=N represents the number of threads you wish to spawn.

The shortest version I found uses the --cat option of parallel, as below. This version avoids xargs, relying only on features of parallel:

cat files.txt | \
parallel -n 500 --lb --pipe --cat rsync --files-from={} user@remote:/dir /dir -avPi


#### Arg explainer
# -n 500           :: split input into chunks of 500 entries
#
# --cat            :: create a tmp file referenced by {} containing the 500
#                     entry content for each process
#
# user@remote:/dir :: the root relative to which entries in files.txt are considered
#
# /dir             :: local root relative to which files are copied

Sample content from files.txt:

/dir/file-1
/dir/subdir/file-2
....

Note that this doesn't use -j 50 for the job count; that didn't work on my end. Instead I've used -n 500 for the record count per job, chosen as a reasonable number given the total number of records.

I've found UDR/UDT to be an amazing tool. The TL;DR: it's a UDT wrapper for rsync, utilizing multiple UDP connections rather than a single TCP connection.

References: https://udt.sourceforge.io/ & https://github.com/jaystevens/UDR#udr

If you use any RHEL distros, they've pre-compiled it for you... http://hgdownload.soe.ucsc.edu/admin/udr

The ONLY downside I've encountered is that you can't specify a different SSH port, so your remote server must use 22.

Anyway, after installing the rpm, it's literally as simple as:

udr rsync -aP user@IpOrFqdn:/source/files/* /dest/folder/

and your transfer speeds will increase drastically in most cases; depending on the server, I've seen easily a 10x increase in transfer speed.

Side note: if you choose to gzip everything first, then make sure to use --rsyncable arg so that it only updates what has changed.

Using parallel rsync on a regular disk would only cause the processes to compete for I/O, turning what should be a sequential read into an inefficient random read. Instead, you could try tarring the directory into a stream pulled over ssh from the destination server, then piping the stream into a tar extract.
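The mechanics of that pipeline can be demonstrated locally with tar alone; over the network you would insert ssh between the two tars (the hostname and paths below are placeholders), e.g. run `ssh user@source 'tar -C /main/files -cf - .' | tar -C /main/files -xf -` on the destination. Local sketch:

```shell
# Local demonstration of the tar-stream approach (no ssh involved):
# pack a directory to stdout and unpack it elsewhere in one pipeline.
mkdir -p /tmp/tar_demo_src /tmp/tar_demo_dst
echo "audio-bytes" > /tmp/tar_demo_src/123.wav
tar -C /tmp/tar_demo_src -cf - . | tar -C /tmp/tar_demo_dst -xf -
ls /tmp/tar_demo_dst    # → 123.wav
```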

3 tricks for speeding up rsync on local net.

1. Copying from/to local network: don't use ssh!

If you're copying between servers on a local network, there is no need to encrypt the data in transit!

By default, rsync uses ssh to transfer data over the network. To avoid this, you have to create an rsync server on the target host. You can run the daemon ad hoc with something like:

rsync --daemon --no-detach --config filename.conf

where a minimal configuration file could look like this (see man rsyncd.conf):

filename.conf

port = 12345
[data]
path = /some/path
use chroot = false

Then

rsync -ax rsync://remotehost:12345/data/. /path/to/target/.
rsync -ax /path/to/source/. rsync://remotehost:12345/data/.

2. Using Zstandard (zstd) for high-speed compression

Zstandard can be up to 8× faster than the common gzip, so using this newer compression algorithm will significantly improve your transfer!

rsync -axz --zc=zstd rsync://remotehost:12345/data/. /path/to/target/.
rsync -axz --zc=zstd /path/to/source/. rsync://remotehost:12345/data/.

3. Multiplexing rsync to reduce idle time due to filesystem traversal

This kind of optimisation is about disk access and filesystem structure; it has nothing to do with the number of CPUs! So it can improve transfers even if your host has a single-core CPU.

Since the goal is to keep the bandwidth saturated with data while other tasks walk the filesystem, the most suitable number of simultaneous processes depends on the number of small files present.

Here is a sample script using wait -n -p PID (bash 5.1+):

#!/bin/bash

maxProc=3
source=''
destination='rsync://remotehost:12345/data/'
array=("$@")

declare -ai start elap results order
wait4oneTask() {
    wait -np epid
    results[epid]=$?
    elap[epid]=" ${EPOCHREALTIME/.} - ${start[epid]} "
    unset "running[$epid]"
    while [ -v elap[${order[0]}] ]; do
        i=${order[0]}
        printf " - %(%a %d %T)T.%06.0f %-36s %4d %12d\n" "${start[i]:0:-6}" \
            "${start[i]: -6}" "${paths[i]}" "${results[i]}" "${elap[i]}"
        order=(${order[@]:1})
    done
}
printf "   %-22s %-36s %4s %12s\n" Started Path Rslt 'microseconds'
while ((${#array[@]})); do
    path="${array[0]}"
    rsync -axz --zc zstd "$source$path/." "$destination$path/." &
    lpid=$!
    paths[lpid]="$path"
    start[lpid]=${EPOCHREALTIME/.}
    running[lpid]=''
    array=("${array[@]:1}")
    order+=($lpid)
    ((${#running[@]}>=maxProc)) && wait4oneTask
done
for ((;${#running[@]};)){ wait4oneTask ;}

Output could look like:

myRsyncP.sh files/*/*
Started                Path                                 Rslt microseconds
- Fri 03 09:20:44.673637 files/1/343                             0      1186903
- Fri 03 09:20:44.673914 files/1/43                              0      2276767
- Fri 03 09:20:44.674147 files/1/55                              0      2172830
- Fri 03 09:20:45.861041 files/1/772                             0      1279463
- Fri 03 09:20:46.847241 files/2/346                             0      2363101
- Fri 03 09:20:46.951192 files/2/4242                            0      2180573
- Fri 03 09:20:47.140953 files/3/23                              0      1789049
- Fri 03 09:20:48.930306 files/3/2545                            0      3259273
- Fri 03 09:20:49.132076 files/3/4256                            0      2263019

Quick check:

printf "%'d\n" $(( 49132076 + 2263019 - 44673637)) \
$((1186903+2276767+2172830+1279463+2363101+2180573+1789049+3259273+2263019))
6’721’458
18’770’978

That's 6.72 seconds of elapsed wall-clock time to process 18.77 seconds of work, with up to three subprocesses.

Note: you could use musec2str to improve the output, by replacing the first long printf line with:

musec2str -v elapsed "${elap[i]}"
printf " - %(%a %d %T)T.%06.0f %-36s %4d %12s\n" "${start[i]:0:-6}" \
    "${start[i]: -6}" "${paths[i]}" "${results[i]}" "$elapsed"
myRsyncP.sh files/*/*
Started                Path                                 Rslt      Elapsed
- Fri 03 09:27:33.463009 files/1/343                             0   18.249400"
- Fri 03 09:27:33.463264 files/1/43                              0   18.153972"
- Fri 03 09:27:33.463502 files/1/55                             93   10.104106"
- Fri 03 09:27:43.567882 files/1/772                           122   14.748798"
- Fri 03 09:27:51.617515 files/2/346                             0   19.286811"
- Fri 03 09:27:51.715848 files/2/4242                            0    3.292849"
- Fri 03 09:27:55.008983 files/3/23                              0    5.325229"
- Fri 03 09:27:58.317356 files/3/2545                            0   10.141078"
- Fri 03 09:28:00.334848 files/3/4256                            0   15.306145"

Going further: you could add an overall stats line by appending this at the end of the script:

for i in ${!start[@]};do  sortstart[${start[i]}]=$i;done
sortstartstr=${!sortstart[*]}
fstarted=${sortstartstr%% *}
lstarted=${sortstartstr##* }
musec2str -v totElap $((lstarted+${elap[${sortstart[lstarted]}]}-fstarted))
sumElap=${elap[*]}
musec2str -v sumElap $(( ${sumElap// /+} ))


printf " = %(%a %d %T)T.%06.0f %-41s %12s\n" "${fstarted:0:-6}" \
"${fstarted: -6}" "Real: $totElap, Total:" "$sumElap"

Could produce:

   Started                Path                                 Rslt      Elapsed
- Thu 15 08:40:12.121585 files/1/343                             0     1.32067"
- Thu 15 08:40:12.121801 files/1/43                             57    2.108366"
- Thu 15 08:40:12.122007 files/1/56                              0    2.305846"
- Thu 15 08:40:13.443005 files/2/346                             0    2.165281"
- Thu 15 08:40:14.231550 files/2/4342                            0    2.164196"
- Thu 15 08:40:14.428520 files/3/33                              0    2.709537"
- Thu 15 08:40:15.609233 files/3/3345                           61      2.2848"
- Thu 15 08:40:16.396410 files/3/4523                            0    2.808285"
= Thu 15 08:40:12.121585 Real: 7.08311", Total:                      17.866981"

Fake rsync for testing this script

Note: For testing this, I've used a fake rsync:

## Fake rsync: wait 1.0 - 2.99 seconds and return non-zero ~1 time in 10
rsync() { sleep $((RANDOM%2+1)).$RANDOM;exit $(( RANDOM%10==3?RANDOM%128:0));}
export -f rsync