Java 中的稀疏矩阵/数组

小开

这看起来很简单。

可以使用行 * maxcolums + 列作为索引使用数据的二叉树。

要查找 item，只需计算 row * maxcolums + column 并在树中进行二进制搜索即可，如果没有，可以返回 null (它是(log n) ，其中 n 是包含对象的单元格数)。

小开

可能不是最快的运行时解决方案，但是我能想到的最快的解决方案似乎是有效的。创建 Index 类并将其用作 SortedMap 的键，如:

    SortedMap<Index, Object> entries = new TreeMap<Index, Object>();
entries.put(new Index(1, 4), "1-4");
entries.put(new Index(5555555555l, 767777777777l), "5555555555l-767777777777l");
System.out.println(entries.size());
System.out.println(entries.get(new Index(1, 4)));
System.out.println(entries.get(new Index(5555555555l, 767777777777l)));

我的 Index 类如下所示(在 Eclipse 代码生成器的帮助下)。

public static class Index implements Comparable<Index>
{
private long x;
private long y;


public Index(long x, long y)
{
super();
this.x = x;
this.y = y;
}


public int compareTo(Index index)
{
long ix = index.x;
if (ix == x)
{
long iy = index.y;
if (iy == y)
{
return 0;
}
else if (iy < y)
{
return -1;
}
else
{
return 1;
}
}
else if (ix < x)
{
return -1;
}
else
{
return 1;
}
}


public int hashCode()
{
final int PRIME = 31;
int result = 1;
result = PRIME * result + (int) (x ^ (x >>> 32));
result = PRIME * result + (int) (y ^ (y >>> 32));
return result;
}


public boolean equals(Object obj)
{
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
final Index other = (Index) obj;
if (x != other.x)
return false;
if (y != other.y)
return false;
return true;
}


public long getX()
{
return x;
}


public long getY()
{
return y;
}
}

小开

你可以只使用一个嵌套的映射，虽然如果你需要做矩阵微积分，它可能不是最好的选择

 Map<Integer, Map<integer, Object>> matrix;

也许不使用 object，而是使用一些 tuple 来处理实际的数据，这样在提取之后就可以更容易地处理它，比如:

class Tuple<T extends yourDataObject> {
public final int x;
public final int y;
public final T object;
}


class Matrix {
private final Map<Integer, Map<interger, Tupple>> data = new...;


void add(int x, int y, Object object) {
data.get(x).put(new Tupple(x,y,object);
}
}




//etc

为简洁起见省略了空检查等

小开

最佳答案

使用散列映射构建的稀疏数组对于频繁读取的数据非常低效。最有效的实现使用 Trie，它允许访问分布段的单个向量。

Trie 可以通过仅执行只读的 TWO 数组索引来计算表中是否存在元素，以获得元素存储的有效位置，或者知道元素是否在底层存储中缺失。

它还可以为稀疏数组的默认值在后台存储中提供一个默认位置，这样你就不需要对返回的索引进行任何测试，因为 Trie 保证所有可能的源索引都至少映射到后台存储中的默认位置(在这里你经常存储零、空字符串或空对象)。

有一些实现支持快速更新的 Tries，可以使用一个可选的“压缩()”操作来在多个操作结束时优化后备存储区的大小。尝试比散列映射快得多，因为它们不需要任何复杂的散列函数，也不需要为读操作处理冲突(对于散列映射，读和写操作都有冲突，这需要一个循环来跳到下一个候选位置，并对它们进行测试来比较有效的源索引...)

此外，Java Hashmap 只能对 Object 进行索引，并且为每个散列源索引创建一个 Integer 对象(每次读操作都需要创建这个对象，而不仅仅是写操作)在内存操作方面代价高昂，因为它会给垃圾收集器带来压力。

我真的希望 JRE 包含一个 IntegerTrieMap < Object > 作为缓慢的 HashMap < Integer，Object > 的默认实现，或者 LongTrieMap < Object > 作为更慢的 HashMap < Long，Object > 的默认实现... ... 但是情况仍然不是这样。

你可能想知道什么是 Trie？

它只是一个小的整数数组(比矩阵的全部坐标范围小) ，允许将坐标映射到向量中的整数位置。

例如，假设您想要一个1024 * 1024矩阵，其中只包含几个非零值。与其将矩阵存储在一个包含1024 * 1024个元素(超过100万个)的数组中，不如将其分割为大小为16 * 16的子区域，而且只需要64 * 64个这样的子区域。

在这种情况下，Trie 索引只包含64 * 64个整数(4096) ，并且至少有16 * 16个数据元素(包含默认的零，或者在稀疏矩阵中最常见的子区域)。

用于存储值的向量对于彼此相等的子区域只包含1个副本(大多数子区域都是零，它们将由相同的子区域表示)。

因此，与其使用类似 matrix[i][j]的语法，不如使用如下语法:

trie.values[trie.subrangePositions[(i & ~15) + (j >> 4)] +
((i & 15) << 4) + (j & 15)]

使用 trie 对象的访问方法可以更方便地处理。

下面是一个内置在注释类中的示例(我希望它的编译是 OK 的，因为它被简化了; 如果有错误需要更正，请通知我) :

/**
* Implement a sparse matrix. Currently limited to a static size
* (<code>SIZE_I</code>, <code>SIZE_I</code>).
*/
public class DoubleTrie {


/* Matrix logical options */
public static final int SIZE_I = 1024;
public static final int SIZE_J = 1024;
public static final double DEFAULT_VALUE = 0.0;


/* Internal splitting options */
private static final int SUBRANGEBITS_I = 4;
private static final int SUBRANGEBITS_J = 4;


/* Internal derived splitting constants */
private static final int SUBRANGE_I =
1 << SUBRANGEBITS_I;
private static final int SUBRANGE_J =
1 << SUBRANGEBITS_J;
private static final int SUBRANGEMASK_I =
SUBRANGE_I - 1;
private static final int SUBRANGEMASK_J =
SUBRANGE_J - 1;
private static final int SUBRANGE_POSITIONS =
SUBRANGE_I * SUBRANGE_J;


/* Internal derived default values for constructors */
private static final int SUBRANGES_I =
(SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;
private static final int SUBRANGES_J =
(SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;
private static final int SUBRANGES =
SUBRANGES_I * SUBRANGES_J;
private static final int DEFAULT_POSITIONS[] =
new int[SUBRANGES](0);
private static final double DEFAULT_VALUES[] =
new double[SUBRANGE_POSITIONS](DEFAULT_VALUE);


/* Internal fast computations of the splitting subrange and offset. */
private static final int subrangeOf(
final int i, final int j) {
return (i >> SUBRANGEBITS_I) * SUBRANGE_J +
(j >> SUBRANGEBITS_J);
}
private static final int positionOffsetOf(
final int i, final int j) {
return (i & SUBRANGEMASK_I) * MAX_J +
(j & SUBRANGEMASK_J);
}


/**
* Utility missing in java.lang.System for arrays of comparable
* component types, including all native types like double here.
*/
public static final int arraycompare(
final double[] values1, final int position1,
final double[] values2, final int position2,
final int length) {
if (position1 >= 0 && position2 >= 0 && length >= 0) {
while (length-- > 0) {
double value1, value2;
if ((value1 = values1[position1 + length]) !=
(value2 = values2[position2 + length])) {
/* Note: NaN values are different from everything including
* all Nan values; they are are also neigher lower than nor
* greater than everything including NaN. Note that the two
* infinite values, as well as denormal values, are exactly
* ordered and comparable with <, <=, ==, >=, >=, !=. Note
* that in comments below, infinite is considered "defined".
*/
if (value1 < value2)
return -1;        /* defined < defined. */
if (value1 > value2)
return 1;         /* defined > defined. */
if (value1 == value2)
return 0;         /* defined == defined. */
/* One or both are NaN. */
if (value1 == value1) /* Is not a NaN? */
return -1;        /* defined < NaN. */
if (value2 == value2) /* Is not a NaN? */
return 1;         /* NaN > defined. */
/* Otherwise, both are NaN: check their precise bits in
* range 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL
* including the canonical 0x7FF8000000000000L, or in
* range 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL.
* Needed for sort stability only (NaNs are otherwise
* unordered).
*/
long raw1, raw2;
if ((raw1 = Double.doubleToRawLongBits(value1)) !=
(raw2 = Double.doubleToRawLongBits(value2)))
return raw1 < raw2 ? -1 : 1;
/* Otherwise the NaN are strictly equal, continue. */
}
}
return 0;
}
throw new ArrayIndexOutOfBoundsException(
"The positions and length can't be negative");
}


/**
* Utility shortcut for comparing ranges in the same array.
*/
public static final int arraycompare(
final double[] values,
final int position1, final int position2,
final int length) {
return arraycompare(values, position1, values, position2, length);
}


/**
* Utility missing in java.lang.System for arrays of equalizable
* component types, including all native types like double here.
*/
public static final boolean arrayequals(
final double[] values1, final int position1,
final double[] values2, final int position2,
final int length) {
return arraycompare(values1, position1, values2, position2, length) ==
0;
}


/**
* Utility shortcut for identifying ranges in the same array.
*/
public static final boolean arrayequals(
final double[] values,
final int position1, final int position2,
final int length) {
return arrayequals(values, position1, values, position2, length);
}


/**
* Utility shortcut for copying ranges in the same array.
*/
public static final void arraycopy(
final double[] values,
final int srcPosition, final int dstPosition,
final int length) {
arraycopy(values, srcPosition, values, dstPosition, length);
}


/**
* Utility shortcut for resizing an array, preserving values at start.
*/
public static final double[] arraysetlength(
double[] values,
final int newLength) {
final int oldLength =
values.length < newLength ? values.length : newLength;
System.arraycopy(values, 0, values = new double[newLength], 0,
oldLength);
return values;
}


/* Internal instance members. */
private double values[];
private int subrangePositions[];
private bool isSharedValues;
private bool isSharedSubrangePositions;


/* Internal method. */
private final reset(
final double[] values,
final int[] subrangePositions) {
this.isSharedValues =
(this.values = values) == DEFAULT_VALUES;
this.isSharedsubrangePositions =
(this.subrangePositions = subrangePositions) ==
DEFAULT_POSITIONS;
}


/**
* Reset the matrix to fill it with the same initial value.
*
* @param initialValue  The value to set in all cell positions.
*/
public reset(final double initialValue = DEFAULT_VALUE) {
reset(
(initialValue == DEFAULT_VALUE) ? DEFAULT_VALUES :
new double[SUBRANGE_POSITIONS](initialValue),
DEFAULT_POSITIONS);
}


/**
* Default constructor, using single default value.
*
* @param initialValue  Alternate default value to initialize all
*                      positions in the matrix.
*/
public DoubleTrie(final double initialValue = DEFAULT_VALUE) {
this.reset(initialValue);
}


/**
* This is a useful preinitialized instance containing the
* DEFAULT_VALUE in all cells.
*/
public static DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();


/**
* Copy constructor. Note that the source trie may be immutable
* or not; but this constructor will create a new mutable trie
* even if the new trie initially shares some storage with its
* source when that source also uses shared storage.
*/
public DoubleTrie(final DoubleTrie source) {
this.values = (this.isSharedValues =
source.isSharedValues) ?
source.values :
source.values.clone();
this.subrangePositions = (this.isSharedSubrangePositions =
source.isSharedSubrangePositions) ?
source.subrangePositions :
source.subrangePositions.clone());
}


/**
* Fast indexed getter.
*
* @param i  Row of position to set in the matrix.
* @param j  Column of position to set in the matrix.
* @return   The value stored in matrix at that position.
*/
public double getAt(final int i, final int j) {
return values[subrangePositions[subrangeOf(i, j)] +
positionOffsetOf(i, j)];
}


/**
* Fast indexed setter.
*
* @param i      Row of position to set in the sparsed matrix.
* @param j      Column of position to set in the sparsed matrix.
* @param value  The value to set at this position.
* @return       The passed value.
* Note: this does not compact the sparsed matric after setting.
* @see compact(void)
*/
public double setAt(final int i, final int i, final double value) {
final int subrange       = subrangeOf(i, j);
final int positionOffset = positionOffsetOf(i, j);
// Fast check to see if the assignment will change something.
int subrangePosition, valuePosition;
if (Double.compare(
values[valuePosition =
(subrangePosition = subrangePositions[subrange]) +
positionOffset],
value) != 0) {
/* So we'll need to perform an effective assignment in values.
* Check if the current subrange to assign is shared of not.
* Note that we also include the DEFAULT_VALUES which may be
* shared by several other (not tested) trie instances,
* including those instanciated by the copy contructor. */
if (isSharedValues) {
values = values.clone();
isSharedValues = false;
}
/* Scan all other subranges to check if the position in values
* to assign is shared by another subrange. */
for (int otherSubrange = subrangePositions.length;
--otherSubrange >= 0; ) {
if (otherSubrange != subrange)
continue; /* Ignore the target subrange. */
/* Note: the following test of range is safe with future
* interleaving of common subranges (TODO in compact()),
* even though, for now, subranges are sharing positions
* only between their common start and end position, so we
* could as well only perform the simpler test <code>
* (otherSubrangePosition == subrangePosition)</code>,
* instead of testing the two bounds of the positions
* interval of the other subrange. */
int otherSubrangePosition;
if ((otherSubrangePosition =
subrangePositions[otherSubrange]) >=
valuePosition &&
otherSubrangePosition + SUBRANGE_POSITIONS <
valuePosition) {
/* The target position is shared by some other
* subrange, we need to make it unique by cloning the
* subrange to a larger values vector, copying all the
* current subrange values at end of the new vector,
* before assigning the new value. This will require
* changing the position of the current subrange, but
* before doing that, we first need to check if the
* subrangePositions array itself is also shared
* between instances (including the DEFAULT_POSITIONS
* that should be preserved, and possible arrays
* shared by an external factory contructor whose
* source trie was declared immutable in a derived
* class). */
if (isSharedSubrangePositions) {
subrangePositions = subrangePositions.clone();
isSharedSubrangePositions = false;
}
/* TODO: no attempt is made to allocate less than a
* fully independant subrange, using possible
* interleaving: this would require scanning all
* other existing values to find a match for the
* modified subrange of values; but this could
* potentially leave positions (in the current subrange
* of values) unreferenced by any subrange, after the
* change of position for the current subrange. This
* scanning could be prohibitively long for each
* assignement, and for now it's assumed that compact()
* will be used later, after those assignements. */
values = setlengh(
values,
(subrangePositions[subrange] =
subrangePositions = values.length) +
SUBRANGE_POSITIONS);
valuePosition = subrangePositions + positionOffset;
break;
}
}
/* Now perform the effective assignment of the value. */
values[valuePosition] = value;
}
}
return value;
}


/**
* Compact the storage of common subranges.
* TODO: This is a simple implementation without interleaving, which
* would offer a better data compression. However, interleaving with its
* O(N²) complexity where N is the total length of values, should
* be attempted only after this basic compression whose complexity is
* O(n²) with n being SUBRANGE_POSITIIONS times smaller than N.
*/
public void compact() {
final int oldValuesLength = values.length;
int newValuesLength = 0;
for (int oldPosition = 0;
oldPosition < oldValuesLength;
oldPosition += SUBRANGE_POSITIONS) {
int oldPosition = positions[subrange];
bool commonSubrange = false;
/* Scan values for possible common subranges. */
for (int newPosition = newValuesLength;
(newPosition -= SUBRANGE_POSITIONS) >= 0; )
if (arrayequals(values, newPosition, oldPosition,
SUBRANGE_POSITIONS)) {
commonSubrange = true;
/* Update the subrangePositions|] with all matching
* positions from oldPosition to newPosition. There may
* be several index to change, if the trie has already
* been compacted() before, and later reassigned. */
for (subrange = subrangePositions.length;
--subrange >= 0; )
if (subrangePositions[subrange] == oldPosition)
subrangePositions[subrange] = newPosition;
break;
}
if (!commonSubrange) {
/* Move down the non-common values, if some previous
* subranges have been compressed when they were common.
*/
if (!commonSubrange && oldPosition != newValuesLength) {
arraycopy(values, oldPosition, newValuesLength,
SUBRANGE_POSITIONS);
/* Advance compressed values to preserve these new ones. */
newValuesLength += SUBRANGE_POSITIONS;
}
}
}
/* Check the number of compressed values. */
if (newValuesLength < oldValuesLength) {
values = values.arraysetlength(newValuesLength);
isSharedValues = false;
}
}


}

注意: 此代码不完整，因为它处理单个矩阵大小，并且其压缩器仅限于检测公共子区域，而不交错它们。

此外，代码不会根据矩阵的大小来确定将矩阵分割成子区域(x 或 y 坐标)的最佳宽度或高度。它只是使用相同的静态子区域大小16(对于两个坐标) ，但它可以方便地任何其他2的小幂(但是非2的幂会减慢 int indexOf(int, int)和 int offsetOf(int, int)内部方法) ，对于两个坐标独立，直到矩阵的最大宽度或高度。理想情况下，compact()方法应该能够确定最佳的拟合尺寸。

如果这些分割子区域的大小可以变化，那么就需要为这些子区域的大小添加实例成员，而不是静态的 SUBRANGE_POSITIONS，并使静态方法 int subrangeOf(int i, int j)和 int positionOffsetOf(int i, int j)成为非静态的; 初始化数组 DEFAULT_POSITIONS和 DEFAULT_VALUES将需要删除或重新定义不同。

如果你想支持交织，基本上你会开始把现有的值分成两个大小相同的大小(都是最小子区域大小的倍数，第一个子区域可能比第二个子区域多一个子区域) ，你会在所有连续的位置扫描较大的一个，找到一个匹配的交织; 然后你会尝试匹配这些值。然后通过将子集分成两半(也是最小子集大小的倍数)进行递归循环，然后再次扫描以匹配这些子集(这将使子集的数量乘以2: 你必须考虑子集位置指数的倍数值是否值得与现有值的大小相比，看看它是否提供了有效的压缩(如果不是，你就停在那里: 你已经找到了最佳子集大小直接从交错压缩过程)。在这种情况下，在压缩过程中，子区域的大小是可变的。

但是这段代码展示了如何为额外的(非零)子区域分配非零值和重新分配 data数组，以及如何优化(在使用 setAt(int i, int j, double value)方法完成分配后使用 compact())存储数据，当有重复的子区域可能在数据中被统一，并且在 subrangePositions数组的同一位置重新索引时。

无论如何，所有的审判原则都是在那里实施的:

使用单个向量代替双索引数组(每个数组分别分配)表示矩阵总是更快(内存更紧凑，意味着更好的位置)。这种改进在 double getAt(int, int)方法中是显而易见的！
您节省了大量的空间，但是在赋值时，可能需要花费一些时间来重新分配新的子区域。由于这个原因，子区域不应该太小，否则在设置矩阵时会频繁发生重新分配。
通过检测公共子矩阵，可以将初始大矩阵自动转换为更紧凑的矩阵。然后，一个典型的实现将包含一个方法，如上面的 compact()。然而，如果 get ()访问非常快，set ()相当快，那么 compact()可能会非常慢，如果有很多公共子集需要压缩(例如，当减去一个大的非稀疏随机填充矩阵时，或者将其乘以零: 在这种情况下，通过实例化一个新的并删除旧的尝试来重置尝试会更简单和快得多)。
公共子区域使用数据中的公共存储，因此此共享数据必须是只读的。如果必须更改单个值而不更改矩阵的其余部分，则必须首先确保在 subrangePositions索引中只引用该值一次。否则，您将需要在 values向量的任何地方(方便地在末尾)分配一个新的子区域，然后将这个新子区域的位置存储到 subrangePositions索引中。

请注意，通用的 Colt 库虽然非常好，但是在处理稀疏矩阵时就不那么好了，因为它使用了哈希(或行压缩)技术，这些技术暂时不实现对 try 的支持，尽管它是一个出色的优化，它既节省了还有的空间，又节省了时间，特别是对于最频繁的 getAt ()操作。

即使这里描述的用于尝试的 setAt ()操作也节省了很多时间(这里实现的方法是在这里实现的，即在设置之后没有自动压缩，这仍然可以根据需求和估计的时间来实现，压缩仍然会以时间的代价节省大量存储空间) : 节省的时间与子区中的单元数量成正比，节省的空间与每个子区中的单元数量成反比。如果使用子区域大小，如每个子区域的细胞数量是2D 矩阵中细胞总数的平方根(在使用3D 矩阵时将是立方根) ，则是一个很好的折衷方案。

Colt 稀疏矩阵实现中使用的散列技术存在不便之处，它们增加了大量的存储开销，并且由于可能的冲突而导致访问时间变慢。尝试可以避免所有的冲突，然后可以保证在最坏的情况下将线性 O (n)时间节省到 O (1)时间，其中(n)是可能的冲突的数量(在稀疏矩阵的情况下，可能是矩阵中非默认值单元的数量，即矩阵的总大小乘以与散列填充因子成正比的因子，对于非稀疏矩阵即完整矩阵)。

Colt 中使用的 RC (行压缩)技术更接近于 Tries，但这是另一个代价，这里使用的压缩技术，对于最频繁的只读 get ()操作，访问时间非常慢，而对于 setAt ()操作，压缩速度非常慢。此外，所使用的压缩不是正交的，不像在这个呈现的尝试，其中保留了正交性。对于相关的查看操作，如步进、换位(被视为基于整数循环模块操作的步进操作)、子区域(和子选择，包括排序视图) ，尝试也将保持这种正交性。

我只是希望 Colt 将来能够更新，用 Tries 实现另一个实现(即 TrieSparseMatrix，而不仅仅是 HashSparseMatrix 和 RCSparseMatrix)。这些想法都在这篇文章里。

Trove 实现(基于 int-> int 映射)也基于类似于 Colt 的 HashedSparseMatrix 的散列技术，即它们具有相同的不便之处。尝试会快得多，并且会消耗适度的额外空间(但是这个空间可以被优化，并且在延迟的时间内比 Trove 和 Colt 更好，在生成的矩阵/trie 上使用最后的紧凑()离子操作)。

注意: 这个 Trie 实现绑定到一个特定的本机类型(这里是 double)。这是自愿的，因为使用装箱类型的通用实现具有巨大的空间开销(并且访问时间要慢得多)。在这里，它只是使用了 double 的本机一维数组，而不是通用 Vector。不幸的是，Java 仍然不允许编写具有本机类型所有优点的真正的通用类，除非通过编写多个实现(针对一个通用对象类型或每个本机类型) ，并通过类型工厂提供所有这些操作。该语言应该能够自动实例化本机实现并自动构建工厂(目前甚至在 Java7中也没有这种情况，而且这是在。对于真正的泛型类型，Net 仍然保持着与本机类型一样快的优势)。

小开

下面的框架用于测试 Java 矩阵库，也提供了一个很好的列表！ Https://lessthanoptimal.github.io/java-matrix-benchmark/

测试图书馆:

* Colt
* Commons Math
* Efficient Java Matrix Library (EJML)
* Jama
* jblas
* JScience (Older benchmarks only)
* Matrix Toolkit Java (MTJ)
* OjAlgo
* Parallel Colt
* Universal Java Matrix Package (UJMP)

小开

您可以查看 La4j(Java 中的线性代数)库。它支持稀疏矩阵的压缩行存储和压缩柱存储内部表示。因此，对于稀疏数据，这些是最有效和最快速的内部结构。

下面是在 La4j中使用稀疏矩阵的简单例子:

Matrix a = new CRSMatrix(new double[][]{ // 'a' - CRS sparse matrix
{ 1.0, 0.0, 3.0 },
{ 0.0, 5.0, 0.0 },
{ 7.0, 0.0. 9.0 }
});


Matrix b = a.transpose(); // 'b' - CRS sparse matrix


Matrix c = b.multiply(a, Matrices.CCS_FACTORY); // 'c' = 'b' * 'a';
// 'c' - CCS sparse matrix

小开

HashMap 棒极了。只需使用 StringBuilder (非 + 或 String.format)将索引(作为字符串)与分隔符(例如’/’)连接起来，并将其用作键。没有比这更快更节省内存的了。稀疏矩阵太20世纪了。:-)