犰狳是否解决()线程安全?

在我的代码中,我有一个循环,在这个循环中,我构造并且超过确定的线性系统,并且尝试解决它:

#pragma omp parallel for
for (int i = 0; i < n[0]+1; i++) {
for (int j = 0; j < n[1]+1; j++) {
for (int k = 0; k < n[2]+1; k++) {
arma::mat A(max_points, 2);
arma::mat y(max_points, 1);
// initialize A and y


arma::vec solution = solve(A,y);
}
}
}

有时,程序随机挂起或解向量中的结果是 NaN。如果我这样写:

arma::vec solution;
#pragma omp critical
{
solution = solve(weights*A,weights*y);
}

那么这些问题就不会再发生了。

当它挂起时,是因为一些线程正在 OpenMP 屏障处等待:

Thread 2 (Thread 0x7fe4325a5700 (LWP 39839)):
#0  0x00007fe44d3c2084 in gomp_team_barrier_wait_end () from /usr/lib64/gcc-4.9.2/lib64/gcc/x86_64-redhat-linux-gnu/4.9.2/libgomp.so.1
#1  0x00007fe44d3bf8c2 in gomp_thread_start () at ../.././libgomp/team.c:118
#2  0x0000003f64607851 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003f642e890d in clone () from /lib64/libc.so.6

其他的线都卡在犰狳里面:

Thread 1 (Thread 0x7fe44afe2e60 (LWP 39800)):
#0  0x0000003ee541f748 in dscal_ () from /usr/lib64/libblas.so.3
#1  0x00007fe44c0d3666 in dlarfp_ () from /usr/lib64/atlas/liblapack.so.3
#2  0x00007fe44c058736 in dgelq2_ () from /usr/lib64/atlas/liblapack.so.3
#3  0x00007fe44c058ad9 in dgelqf_ () from /usr/lib64/atlas/liblapack.so.3
#4  0x00007fe44c059a32 in dgels_ () from /usr/lib64/atlas/liblapack.so.3
#5  0x00007fe44f09fb3d in bool arma::auxlib::solve_ud<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> >(arma::Mat<double>&, arma::Mat<double>&, arma::Base<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> > const&) () at /usr/include/armadillo_bits/lapack_wrapper.hpp:677
#6  0x00007fe44f0a0f87 in arma::Col<double>::Col<arma::Glue<arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::glue_solve> >(arma::Base<double, arma::Glue<arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::glue_solve> > const&) ()
at /usr/include/armadillo_bits/glue_solve_meat.hpp:39

正如你从堆栈跟踪中看到的,我的犰狳版本使用了地图集。根据这份文档,地图集似乎是线程安全的: Ftp://lsec.cc.ac.cn/netlib/atlas/faq.html # tsafe

2015年9月11日更新

我终于有时间运行更多的测试,基于弗拉基米尔 F 的建议。

当我用 ATLAS 的 BLAS 编译犰狳时,我仍然能够复制然后挂起和 NaN。当它挂起时,堆栈跟踪中唯一改变的是对 BLAS 的调用:

#0  0x0000003fa8054718 in ATL_dscal_xp1yp0aXbX@plt () from /usr/lib64/atlas/libatlas.so.3
#1  0x0000003fb05e7666 in dlarfp_ () from /usr/lib64/atlas/liblapack.so.3
#2  0x0000003fb0576a61 in dgeqr2_ () from /usr/lib64/atlas/liblapack.so.3
#3  0x0000003fb0576e06 in dgeqrf_ () from /usr/lib64/atlas/liblapack.so.3
#4  0x0000003fb056d7d1 in dgels_ () from /usr/lib64/atlas/liblapack.so.3
#5  0x00007ff8f3de4c34 in void arma::lapack::gels<double>(char*, int*, int*, int*, double*, int*, double*, int*, double*, int*, int*) () at /usr/include/armadillo_bits/lapack_wrapper.hpp:677
#6  0x00007ff8f3de1787 in bool arma::auxlib::solve_od<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> >(arma::Mat<double>&, arma::Mat<double>&, arma::Base<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> > const&) () at /usr/include/armadillo_bits/auxlib_meat.hpp:3434

在没有 ATLAS、只有 netlib BLAS 和 LAPACK 的情况下进行编译,我能够复制 Nan,但不能复制 hang。

在这两种情况下,周围的 solve()#pragma omp 临界我没有任何问题

3057 次浏览

Are you sure your systems are over determined? solve_ud in your stack trace says otherwise. Though you have solve_od too, and probably that's nothing to do with the issue. But it doesn't hurt to find why that's happening and fix it if you think the systems should be od.

Is armadillo solve() thread safe?

That I think depends on your lapack version, also see this. Looking at the code of solve_od all the variables accessed seem to be local. Note the warning in the code:

NOTE: the dgels() function in the lapack library supplied by ATLAS 3.6 seems to have problems

Thus it seems only lapack::gels can cause trouble for you. If fixing lapack is not possible, a workaround is to stack your systems and solve a single large system. That probably would be even more efficient if your individual systems are small.

The thread safety of Armadillo's solve() function depends (only) on the BLAS library that you use. The LAPACK implementations are thread safe when BLAS is. The Armadillo solve() function is not thread safe when linking to the reference BLAS library. However, it is thread safe when using OpenBLAS. Additionally, ATLAS provides a BLAS implementation that also mentions it is thread safe, and the Intel MKL is thread safe as well, but I have no experience with Armadillo linked to those libraries.

Of course, this only applies when you run solve() from multiple threads with different data.