Fastest (low latency) method for inter-process communication between Java and C/C++

I have a Java application that connects via TCP socket to a "server" developed in C/C++.

Both the application and the server run on the same machine, a Solaris box (though we are considering eventually migrating to Linux). The data exchanged consists of simple messages (login, login ACK, then the client asks for something and the server replies). Each message is around 300 bytes long.

Currently we are using sockets and everything works fine, but I am looking for a faster way to exchange data (lower latency) using an IPC method.

I have been researching on the net and came across references to the following technologies:

  • Shared memory
  • Pipes
  • Queues
  • As well as what is referred to as DMA (Direct Memory Access)

However, I could not find a proper analysis of their respective performance, nor how to implement them in both Java and C/C++ (so that they can talk to each other), except perhaps for pipes, which I can imagine how to do.

Can anyone comment on the performance and feasibility of each method in this context? Any pointers/links to useful implementation information?


EDIT / UPDATE

Following the comments and answers I got here, I found information about Unix domain sockets, which seem to be built right over pipes and would save me the whole TCP stack. They are platform specific, so I plan to test them with JNI, or with JUDS or junixsocket.
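(Note: on modern JDKs - 16 and later - Unix domain sockets are also available directly through java.nio.channels, without JNI or third-party libraries. A rough client sketch, where /tmp/server.sock is just a placeholder path the C/C++ server would bind to:)

import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class UdsClient {
    public static void main(String[] args) throws Exception {
        // placeholder path; the C/C++ server would bind an AF_UNIX socket here
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of("/tmp/server.sock");
        try (SocketChannel ch = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            ch.connect(addr);
            ch.write(ByteBuffer.wrap("LOGIN".getBytes()));   // ~300-byte request in the real app
            ByteBuffer reply = ByteBuffer.allocate(300);
            ch.read(reply);                                   // blocking read of the server's answer
        }
    }
}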

Possible next steps would be a direct implementation of pipes, and then shared memory, although I have been warned about the extra level of complexity...


Thanks for your help.


In my former company we used to work with this project, http://remotetea.sourceforge.net/; it is very easy to understand and integrate.

I don't know much about native inter-process communication, but I would guess that you need to communicate using native code, which you can access using JNI mechanisms. So, from Java you would call a native function that talks to the other process.
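For example, the Java side of such a JNI binding can be as small as this (the library and method names are made up for illustration; the actual IPC would happen in the C/C++ implementation):

public class NativeIpc {
    static {
        System.loadLibrary("nativeipc"); // loads libnativeipc.so from java.library.path (hypothetical name)
    }

    // implemented in C/C++, e.g. pushing the request through shared memory or a pipe
    public static native byte[] sendAndReceive(byte[] request);
}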

If you ever consider using native access (since both your application and the "server" are on the same machine), consider JNA; it has less boilerplate code for you to deal with.
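A minimal JNA sketch (assuming JNA 5.x; with older versions the call is Native.loadLibrary). Here it binds a libc function, but binding your own native IPC library works the same way - just swap the library name and declare its functions:

import com.sun.jna.Library;
import com.sun.jna.Native;

public class JnaExample {
    public interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class); // bind the standard C library
        int getpid();                                 // any exported C function can be declared like this
    }

    public static void main(String[] args) {
        System.out.println("pid = " + CLib.INSTANCE.getpid());
    }
}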

DMA is a method by which hardware devices can access physical RAM without interrupting the CPU. A common example is a hard disk controller that can copy bytes straight from disk to RAM. As such, DMA is not applicable to IPC.

Shared memory and pipes are both supported directly by modern OSes. As such, they're quite fast. Queues are typically abstractions, e.g. implemented on top of sockets, pipes and/or shared memory. This may look like a slower mechanism, but the alternative is that you create such an abstraction.
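To illustrate the pipe route: a FIFO created with mkfifo can be used from Java as an ordinary file, so no JNI is needed on that side (the /tmp/ipc_fifo path is a placeholder the C/C++ process would open as well):

import java.io.FileInputStream;

public class FifoReader {
    public static void main(String[] args) throws Exception {
        // create the FIFO beforehand with: mkfifo /tmp/ipc_fifo
        byte[] msg = new byte[300];
        try (FileInputStream in = new FileInputStream("/tmp/ipc_fifo")) {
            int n = in.read(msg); // blocks until the C/C++ writer sends a message
            System.out.println("read " + n + " bytes");
        }
    }
}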

Here's a project containing performance tests for various IPC transports:

http://github.com/rigtorp/ipc-bench

Have you considered keeping the sockets open, so the connections can be reused?
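If you stay with TCP, the usual tweaks are a single long-lived connection plus disabling Nagle's algorithm; roughly (host and port are placeholders):

import java.net.Socket;

public class ReusedConnection {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("localhost", 9000)) { // placeholder port
            s.setTcpNoDelay(true); // disable Nagle so small 300-byte messages are not delayed
            byte[] request = new byte[300];
            byte[] reply = new byte[300];
            for (int i = 0; i < 1000; i++) {             // reuse the same connection for every request
                s.getOutputStream().write(request);
                int n = s.getInputStream().read(reply);  // a real client would loop until all 300 bytes arrive
            }
        }
    }
}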

Oracle bug report on JNI performance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4096069

JNI is a slow interface, so Java TCP sockets are the fastest method for notification between applications; however, that doesn't mean you have to send the payload over a socket. Use LDMA to transfer the payload, but as previous questions have pointed out, Java support for memory mapping is not ideal, so you will want to implement a JNI library to run mmap.

I just tested the latency from Java on my Core i5 2.8 GHz: only a single byte sent/received between 2 freshly spawned Java processes, without assigning specific CPU cores with taskset:

TCP         - 25 microseconds
Named pipes - 15 microseconds

Now explicitly specifying core masks, like taskset 1 java Srv or taskset 2 java Cli:

TCP, same cores:                      30 microseconds
TCP, explicit different cores:        22 microseconds
Named pipes, same core:               4-5 microseconds !!!!
Named pipes, taskset different cores: 7-8 microseconds !!!!

So:

  • the TCP overhead is visible
  • scheduling overhead (or core caches?) is also a culprit

At the same time, Thread.sleep(0) (which, as strace shows, causes a single sched_yield() Linux kernel call to be executed) takes 0.3 microseconds - so named pipes scheduled to a single core still have a lot of overhead.

Some shared memory measurement: September 14, 2009 – Solace Systems announced today that its Unified Messaging Platform API can achieve an average latency of less than 700 nanoseconds using a shared memory transport. http://solacesystems.com/news/fastest-ipc-messaging/

P.S. - I tried shared memory the next day, in the form of memory-mapped files. If busy waiting is acceptable, we can reduce the latency to 0.3 microseconds for passing a single byte with code like this:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedServer {
    public static void main(String[] args) throws Exception {
        // map a single byte shared between the two processes
        MappedByteBuffer mem = new RandomAccessFile("/tmp/mapped.txt", "rw")
                .getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 1);

        while (true) {
            while (mem.get(0) != 5) Thread.sleep(0); // waiting for the client request
            mem.put(0, (byte) 10);                   // sending the reply
        }
    }
}

Notes: Thread.sleep(0) is needed so the 2 processes can see each other's changes (I don't know of another way yet). If the 2 processes are forced onto the same core with taskset, the latency becomes 1.5 microseconds - that's the context switch delay.
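For completeness, a hypothetical client counterpart of the loop above (same mapped file, roles reversed - it writes 5 and spins until the server answers with 10):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedClient {
    public static void main(String[] args) throws Exception {
        MappedByteBuffer mem = new RandomAccessFile("/tmp/mapped.txt", "rw")
                .getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 1);

        long start = System.nanoTime();
        mem.put(0, (byte) 5);                     // send the request byte
        while (mem.get(0) != 10) Thread.sleep(0); // wait for the server's reply byte
        System.out.println("RTT: " + (System.nanoTime() - start) + " ns");
    }
}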

P.P.S - and 0.3 microseconds is a good number! The following code takes exactly 0.1 microseconds, while doing only a primitive string concatenation:

int j=123456789;
String ret = "my-record-key-" + j  + "-in-db";

P.P.P.S - I hope this is not too much off-topic, but finally I tried replacing Thread.sleep(0) with incrementing a static volatile int variable (the JVM happens to flush CPU caches when doing so) and obtained - a record! - 72 nanoseconds latency for java-to-java process communication!

When forced onto the same CPU core, however, the volatile-incrementing JVMs never yield control to each other, producing exactly 10 milliseconds of latency - the Linux time quantum seems to be 5 ms... So this should only be used if there is a spare core - otherwise sleep(0) is safer.
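A rough sketch of that volatile-spin variant (field and class names are arbitrary; the point is that the volatile increment replaces Thread.sleep(0) in the polling loop):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SpinServer {
    static volatile int spin; // the volatile write on every iteration is what replaces Thread.sleep(0)

    public static void main(String[] args) throws Exception {
        MappedByteBuffer mem = new RandomAccessFile("/tmp/mapped.txt", "rw")
                .getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 1);

        while (true) {
            while (mem.get(0) != 5) spin++; // busy-spin instead of yielding to the scheduler
            mem.put(0, (byte) 10);          // reply
        }
    }
}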

The question was asked some time ago, but you might be interested in https://github.com/peter-lawrey/Java-Chronicle which supports typical latencies of 200 ns and throughputs of 20 M messages/second. It uses memory-mapped files shared between processes (it also persists the data, which makes it the fastest way to persist data).

A late arrival, but I wanted to point out an open-source project dedicated to measuring ping latency using Java NIO.

Further explored/explained in this blog post. The results are (RTT in nanos):

Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%,Max
IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

This is along the lines of the accepted answer. The System.nanoTime() error (estimated by measuring nothing) is around 40 nanos, so for the IPC the actual result might be lower. Enjoy.
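The "measuring nothing" error estimate can be reproduced with something like this (just back-to-back System.nanoTime() calls):

public class NanoTimeError {
    public static void main(String[] args) {
        long worst = 0, sum = 0;
        int runs = 1_000_000;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime(); // the gap between back-to-back calls is pure measurement overhead
            long d = t1 - t0;
            sum += d;
            if (d > worst) worst = d;
        }
        System.out.println("avg " + (sum / runs) + " ns, worst " + worst + " ns");
    }
}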