Java 中使用数组的速度是 C + + 中使用 std: : Vector 的8倍。我做错了什么?

下面的 Java 代码有几个大数组,它们的大小从不改变。它在我的电脑上运行1100毫秒。

我在 C + + 中实现了相同的代码,并使用了 std::vector

在我的计算机上运行完全相同代码的 C + + 实现的时间是8800毫秒。我到底做错了什么,让它跑得这么慢?

该守则基本上做到了以下几点:

for (int i = 0; i < numberOfCells; ++i) {
h[i] =  h[i] + 1;
floodedCells[i] =  !floodedCells[i];
floodedCellsTimeInterval[i] =  !floodedCellsTimeInterval[i];
qInflow[i] =  qInflow[i] + 1;
}

它迭代不同的数组,大小约为20000。

你可在以下连结找到这两个实施方案:

(由于时间限制,我只能运行400次循环,而不是2000次。但即使在这里也有三倍的差别)

8119 次浏览

Yep, the cache in the c++ version takes a hammering. It seems the JIT is better equipped to handle this.

If you change the outer for in isUpdateNeeded() to shorter snippets. The difference goes away.

The sample below produces a 4x speedup.

void isUpdateNeeded() {
for (int i = 0; i < numberOfCells; ++i) {
h[i] =  h[i] + 1;
floodedCells[i] =  !floodedCells[i];
floodedCellsTimeInterval[i] =  !floodedCellsTimeInterval[i];
qInflow[i] =  qInflow[i] + 1;
qStartTime[i] =  qStartTime[i] + 1;
qEndTime[i] =  qEndTime[i] + 1;
}


for (int i = 0; i < numberOfCells; ++i) {
lowerFloorCells[i] =  lowerFloorCells[i] + 1;
cellLocationX[i] =  cellLocationX[i] + 1;
cellLocationY[i] =  cellLocationY[i] + 1;
cellLocationZ[i] =  cellLocationZ[i] + 1;
levelOfCell[i] =  levelOfCell[i] + 1;
valueOfCellIds[i] =  valueOfCellIds[i] + 1;
h0[i] =  h0[i] + 1;
vU[i] =  vU[i] + 1;
vV[i] =  vV[i] + 1;
vUh[i] =  vUh[i] + 1;
vVh[i] =  vVh[i] + 1;
}
for (int i = 0; i < numberOfCells; ++i) {
vUh0[i] =  vUh0[i] + 1;
vVh0[i] =  vVh0[i] + 1;
ghh[i] =  ghh[i] + 1;
sfx[i] =  sfx[i] + 1;
sfy[i] =  sfy[i] + 1;
qIn[i] =  qIn[i] + 1;
for(int j = 0; j < nEdges; ++j) {
neighborIds[i * nEdges + j] = neighborIds[i * nEdges + j] + 1;
}
for(int j = 0; j < nEdges; ++j) {
typeInterface[i * nEdges + j] = typeInterface[i * nEdges + j] + 1;
}
}


}

This shows to a reasonable degree that cache misses are the reason for the slowdown. It is also important to note that the variables are not dependent so a threaded solution is easily created.

Order restored

As per stefans comment I tried grouping them in a struct using the original sizes. This removes the immediate cache pressure in a similar fashion. The result is that the c++ (CCFLAG -O3) version is about 15% faster than the java version.

Varning neither short nor pretty.

#include <vector>
#include <cmath>
#include <iostream>
 

 

 

class FloodIsolation {
struct item{
char floodedCells;
char floodedCellsTimeInterval;
double valueOfCellIds;
double h;
double h0;
double vU;
double vV;
double vUh;
double vVh;
double vUh0;
double vVh0;
double sfx;
double sfy;
double qInflow;
double qStartTime;
double qEndTime;
double qIn;
double nx;
double ny;
double ghh;
double floorLevels;
int lowerFloorCells;
char flagInterface;
char floorCompletelyFilled;
double cellLocationX;
double cellLocationY;
double cellLocationZ;
int levelOfCell;
};
struct inner_item{
int typeInterface;
int neighborIds;
};


std::vector<inner_item> inner_data;
std::vector<item> data;


public:
FloodIsolation() :
numberOfCells(20000), inner_data(numberOfCells * nEdges), data(numberOfCells)
{


}
~FloodIsolation(){
}
 

void isUpdateNeeded() {
for (int i = 0; i < numberOfCells; ++i) {
data[i].h = data[i].h + 1;
data[i].floodedCells = !data[i].floodedCells;
data[i].floodedCellsTimeInterval = !data[i].floodedCellsTimeInterval;
data[i].qInflow = data[i].qInflow + 1;
data[i].qStartTime = data[i].qStartTime + 1;
data[i].qEndTime = data[i].qEndTime + 1;
data[i].lowerFloorCells = data[i].lowerFloorCells + 1;
data[i].cellLocationX = data[i].cellLocationX + 1;
data[i].cellLocationY = data[i].cellLocationY + 1;
data[i].cellLocationZ = data[i].cellLocationZ + 1;
data[i].levelOfCell = data[i].levelOfCell + 1;
data[i].valueOfCellIds = data[i].valueOfCellIds + 1;
data[i].h0 = data[i].h0 + 1;
data[i].vU = data[i].vU + 1;
data[i].vV = data[i].vV + 1;
data[i].vUh = data[i].vUh + 1;
data[i].vVh = data[i].vVh + 1;
data[i].vUh0 = data[i].vUh0 + 1;
data[i].vVh0 = data[i].vVh0 + 1;
data[i].ghh = data[i].ghh + 1;
data[i].sfx = data[i].sfx + 1;
data[i].sfy = data[i].sfy + 1;
data[i].qIn = data[i].qIn + 1;
for(int j = 0; j < nEdges; ++j) {
inner_data[i * nEdges + j].neighborIds = inner_data[i * nEdges + j].neighborIds + 1;
inner_data[i * nEdges + j].typeInterface = inner_data[i * nEdges + j].typeInterface + 1;
}
}
 

}
 

static const int nEdges;
private:
 

const int numberOfCells;


};
 

const int FloodIsolation::nEdges = 6;


int main() {
FloodIsolation isolation;
clock_t start = clock();
for (int i = 0; i < 4400; ++i) {
if(i % 100 == 0) {
std::cout << i << "\n";
}
isolation.isUpdateNeeded();
}


clock_t stop = clock();
std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";
}
                                                                              

My result differs slightly from Jerry Coffins for the original sizes. For me the differences remains. It might well be my java version, 1.7.0_75.

As @Stefan guessed in a comment on @CaptainGiraffe's answer, you gain quite a bit by using a vector of structs instead of a struct of vectors. Corrected code looks like this:

#include <vector>
#include <cmath>
#include <iostream>
#include <time.h>


class FloodIsolation {
public:
FloodIsolation() :
h(0),
floodedCells(0),
floodedCellsTimeInterval(0),
qInflow(0),
qStartTime(0),
qEndTime(0),
lowerFloorCells(0),
cellLocationX(0),
cellLocationY(0),
cellLocationZ(0),
levelOfCell(0),
valueOfCellIds(0),
h0(0),
vU(0),
vV(0),
vUh(0),
vVh(0),
vUh0(0),
vVh0(0),
ghh(0),
sfx(0),
sfy(0),
qIn(0),
typeInterface(nEdges, 0),
neighborIds(nEdges, 0)
{
}


~FloodIsolation(){
}


void Update() {
h =  h + 1;
floodedCells =  !floodedCells;
floodedCellsTimeInterval =  !floodedCellsTimeInterval;
qInflow =  qInflow + 1;
qStartTime =  qStartTime + 1;
qEndTime =  qEndTime + 1;
lowerFloorCells =  lowerFloorCells + 1;
cellLocationX =  cellLocationX + 1;
cellLocationY =  cellLocationY + 1;
cellLocationZ =  cellLocationZ + 1;
levelOfCell =  levelOfCell + 1;
valueOfCellIds =  valueOfCellIds + 1;
h0 =  h0 + 1;
vU =  vU + 1;
vV =  vV + 1;
vUh =  vUh + 1;
vVh =  vVh + 1;
vUh0 =  vUh0 + 1;
vVh0 =  vVh0 + 1;
ghh =  ghh + 1;
sfx =  sfx + 1;
sfy =  sfy + 1;
qIn =  qIn + 1;
for(int j = 0; j < nEdges; ++j) {
++typeInterface[j];
++neighborIds[j];
}
}


private:


static const int nEdges = 6;
bool floodedCells;
bool floodedCellsTimeInterval;


std::vector<int> neighborIds;
double valueOfCellIds;
double h;
double h0;
double vU;
double vV;
double vUh;
double vVh;
double vUh0;
double vVh0;
double ghh;
double sfx;
double sfy;
double qInflow;
double qStartTime;
double qEndTime;
double qIn;
double nx;
double ny;
double floorLevels;
int lowerFloorCells;
bool flagInterface;
std::vector<int> typeInterface;
bool floorCompleteleyFilled;
double cellLocationX;
double cellLocationY;
double cellLocationZ;
int levelOfCell;
};


int main() {
std::vector<FloodIsolation> isolation(20000);
clock_t start = clock();
for (int i = 0; i < 400; ++i) {
if(i % 100 == 0) {
std::cout << i << "\n";
}


for (auto &f : isolation)
f.Update();
}
clock_t stop = clock();
std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";
}

Compiled with the compiler from VC++ 2015 CTP, using -EHsc -O2b2 -GL -Qpar, I get results like:

0
100
200
300
Time: 0.135

Compiling with g++ produces a result that's slightly slower:

0
100
200
300
Time: 0.156

On the same hardware, using the compiler/JVM from Java 8u45, I get results like:

0
100
200
300
Time: 181

This is around 35% slower than the version from VC++, and about 16% slower than the version from g++.

If we increase the number of iterations to the desired 2000, the difference drops to only 3%, suggesting that part of the advantage of C++ in this case is simply faster loading (a perennial problem with Java), not really in execution itself. This doesn't strike me as surprising in this case--the computation being measured (in the posted code) is so trivial that I doubt most compilers can do a whole lot to optimize it.

Here is the C++ version with the per-node data gathered into a structure, and a single vector of that structure used:

#include <vector>
#include <cmath>
#include <iostream>






class FloodIsolation {
public:
FloodIsolation() :
numberOfCells(20000),
data(numberOfCells)
{
}
~FloodIsolation(){
}


void isUpdateNeeded() {
for (int i = 0; i < numberOfCells; ++i) {
data[i].h = data[i].h + 1;
data[i].floodedCells = !data[i].floodedCells;
data[i].floodedCellsTimeInterval = !data[i].floodedCellsTimeInterval;
data[i].qInflow = data[i].qInflow + 1;
data[i].qStartTime = data[i].qStartTime + 1;
data[i].qEndTime = data[i].qEndTime + 1;
data[i].lowerFloorCells = data[i].lowerFloorCells + 1;
data[i].cellLocationX = data[i].cellLocationX + 1;
data[i].cellLocationY = data[i].cellLocationY + 1;
data[i].cellLocationZ = data[i].cellLocationZ + 1;
data[i].levelOfCell = data[i].levelOfCell + 1;
data[i].valueOfCellIds = data[i].valueOfCellIds + 1;
data[i].h0 = data[i].h0 + 1;
data[i].vU = data[i].vU + 1;
data[i].vV = data[i].vV + 1;
data[i].vUh = data[i].vUh + 1;
data[i].vVh = data[i].vVh + 1;
data[i].vUh0 = data[i].vUh0 + 1;
data[i].vVh0 = data[i].vVh0 + 1;
data[i].ghh = data[i].ghh + 1;
data[i].sfx = data[i].sfx + 1;
data[i].sfy = data[i].sfy + 1;
data[i].qIn = data[i].qIn + 1;




for(int j = 0; j < nEdges; ++j) {
data[i].flagInterface[j] = !data[i].flagInterface[j];
data[i].typeInterface[j] = data[i].typeInterface[j] + 1;
data[i].neighborIds[j] = data[i].neighborIds[j] + 1;
}
}


}


private:


const int numberOfCells;
static const int nEdges = 6;
struct data_t {
bool floodedCells = 0;
bool floodedCellsTimeInterval = 0;


double valueOfCellIds = 0;
double h = 0;


double h0 = 0;
double vU = 0;
double vV = 0;
double vUh = 0;
double vVh = 0;
double vUh0 = 0;
double vVh0 = 0;
double ghh = 0;
double sfx = 0;
double sfy = 0;
double qInflow = 0;
double qStartTime = 0;
double qEndTime = 0;
double qIn = 0;
double nx = 0;
double ny = 0;
double floorLevels = 0;
int lowerFloorCells = 0;
bool floorCompleteleyFilled = 0;
double cellLocationX = 0;
double cellLocationY = 0;
double cellLocationZ = 0;
int levelOfCell = 0;
bool flagInterface[nEdges] = {};
int typeInterface[nEdges] = {};
int neighborIds[nEdges] = {};
};
std::vector<data_t> data;


};


int main() {
std::ios_base::sync_with_stdio(false);
FloodIsolation isolation;
clock_t start = clock();
for (int i = 0; i < 400; ++i) {
if(i % 100 == 0) {
std::cout << i << "\n";
}
isolation.isUpdateNeeded();
}
clock_t stop = clock();
std::cout << "Time: " << difftime(stop, start) / 1000 << "\n";
}

live example

The time is now 2x the speed of the Java version. (846 vs 1631).

Odds are the JIT noticed the cache burning of accessing data all over the place, and transformed your code into a logically similar but more efficient order.

I also turned off stdio synchronization, as that is only needed if you mix printf/scanf with C++ std::cout and std::cin. As it happens, you only print out a few values, but C++'s default behavior for printing is overly paranoid and inefficient.

If nEdges is not an actual constant value, then the 3 "array" values will have to be stripped out of the struct. That shouldn't cause a huge performance hit.

You might be able to get another performance boost by sorting the values in that struct by decreasing size, thus reducing memory footprint (and sorting access as well when it doesn't matter). But I am unsure.

A rule of thumb is that a single cache miss is 100x more expensive than an instruction. Arranging your data to have cache coherency has lots of value.

If rearranging the data into a struct is infeasible, you can change your iteration to be over each container in turn.

As an aside, note that the Java and C++ versions had some subtle differences in them. The one I spotted was that the Java version has 3 variables in the "for each edge" loop, while the C++ one only had 2. I made mine match the Java. I don't know if there are others.

I suspect this is about allocation of memory.

I am thinking that Java grabs a large contiguous block at program startup whereas C++ asks the OS for bits and pieces as it goes along.

To put that theory to the test I made one modification to the C++ version and it suddenly started running slightly faster than the Java version:

int main() {
{
// grab a large chunk of contiguous memory and liberate it
std::vector<double> alloc(20000 * 20);
}
FloodIsolation isolation;
clock_t start = clock();
for (int i = 0; i < 400; ++i) {
if(i % 100 == 0) {
std::cout << i << "\n";
}
isolation.isUpdateNeeded();
}
clock_t stop = clock();
std::cout << "Time: " << (1000 * difftime(stop, start) / CLOCKS_PER_SEC) << "\n";
}

Runtime without the preallocating vector:

0
100
200
300
Time: 1250.31

Runtime with the preallocating vector:

0
100
200
300
Time: 331.214

Runtime for Java version:

0
100
200
300
Time: 407