What is the difference between Dataset.from_tensors and Dataset.from_tensor_slices?

I have a dataset represented as a NumPy matrix of shape (num_features, num_examples), and I would like to convert it to TensorFlow's tf.data.Dataset type.

I am struggling to understand the difference between these two methods: Dataset.from_tensors and Dataset.from_tensor_slices. Which one is correct here, and why?

The TensorFlow documentation (link) says that both methods accept a nested structure of tensors, although when using from_tensor_slices the tensors should all have the same size in the 0th dimension.


from_tensors combines the input and returns a dataset with a single element:

>>> t = tf.constant([[1, 2], [3, 4]])
>>> ds = tf.data.Dataset.from_tensors(t)
>>> [x for x in ds]
[<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[1, 2],
       [3, 4]], dtype=int32)>]

from_tensor_slices creates a dataset with a separate element for each row of the input tensor:

>>> t = tf.constant([[1, 2], [3, 4]])
>>> ds = tf.data.Dataset.from_tensor_slices(t)
>>> [x for x in ds]
[<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>,
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>]

1) The main difference between the two is that the nested elements passed to from_tensor_slices must have the same size in the 0th dimension:

# exception: ValueError: Dimensions 10 and 9 are not compatible
dataset1 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([10, 4]), tf.random_uniform([9])))
# OK: from_tensors imposes no such restriction (here the first dimensions happen to match)
dataset2 = tf.data.Dataset.from_tensors(
    (tf.random_uniform([10, 4]), tf.random_uniform([10])))
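For completeness, here is a minimal sketch (assuming TF 2.x, where eager execution is on by default and tf.random.uniform replaces tf.random_uniform) showing that once the 0th dimensions match, from_tensor_slices pairs the slices up element-wise:

import tensorflow as tf  # assuming TF 2.x

features = tf.random.uniform([10, 4])  # 10 examples, 4 features each
labels = tf.random.uniform([10])       # 10 labels

# Matching 0th dimensions, so from_tensor_slices pairs them up element-wise.
ds = tf.data.Dataset.from_tensor_slices((features, labels))

for x, y in ds.take(2):
    print(x.shape, y.shape)  # (4,) ()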

2) The second difference, explained here, is how the two methods handle a Python list as input. For example:

dataset1 = tf.data.Dataset.from_tensor_slices(
    [tf.random_uniform([2, 3]), tf.random_uniform([2, 3])])

dataset2 = tf.data.Dataset.from_tensors(
    [tf.random_uniform([2, 3]), tf.random_uniform([2, 3])])

print(dataset1)  # shapes: (2, 3)
print(dataset2)  # shapes: (2, 2, 3)

In the above, from_tensors packs the whole list into a single 3-D tensor (one element of shape (2, 2, 3)), while from_tensor_slices stacks the list and then slices it along the first dimension, yielding one (2, 3) element per list entry. This can be handy if you have separate sources for different image channels and want to combine them into a single RGB image tensor.
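As a rough sketch of that channel idea (the r, g, b tensors and the transpose to channels-last are my own illustration, assuming TF 2.x):

import tensorflow as tf  # assuming TF 2.x

# Hypothetical per-channel images, e.g. coming from three separate sources.
r = tf.random.uniform([32, 32])
g = tf.random.uniform([32, 32])
b = tf.random.uniform([32, 32])

# from_tensors stacks the list into one (3, 32, 32) element ...
ds = tf.data.Dataset.from_tensors([r, g, b])
# ... which can be transposed to a channels-last (32, 32, 3) RGB image.
ds = ds.map(lambda img: tf.transpose(img, [1, 2, 0]))

for img in ds:
    print(img.shape)  # (32, 32, 3)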

3) As mentioned in the previous answer, from_tensors converts the input into one big tensor:

import tensorflow as tf

tf.enable_eager_execution()

dataset1 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4, 2]), tf.random_uniform([4])))

dataset2 = tf.data.Dataset.from_tensors(
    (tf.random_uniform([4, 2]), tf.random_uniform([4])))

for i, item in enumerate(dataset1):
    print('element: ' + str(i + 1), item[0], item[1])

print(30 * '-')

for i, item in enumerate(dataset2):
    print('element: ' + str(i + 1), item[0], item[1])

Output:

element: 1 tf.Tensor(... shapes: ((2,), ()))
element: 2 tf.Tensor(... shapes: ((2,), ()))
element: 3 tf.Tensor(... shapes: ((2,), ()))
element: 4 tf.Tensor(... shapes: ((2,), ()))
-------------------------
element: 1 tf.Tensor(... shapes: ((4, 2), (4,)))
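A side note that goes beyond the answer above: you can convert between the two layouts after the fact. Roughly, assuming TF 2.x, where Dataset.unbatch() and Dataset.batch() are available:

import tensorflow as tf  # assuming TF 2.x

features = tf.random.uniform([4, 2])
labels = tf.random.uniform([4])

sliced = tf.data.Dataset.from_tensor_slices((features, labels))  # 4 elements
packed = tf.data.Dataset.from_tensors((features, labels))        # 1 element

# unbatch() splits the single packed element along its first dimension ...
print(len(list(packed.unbatch())))  # 4, like from_tensor_slices
# ... and batch(4) glues the 4 sliced elements back into one.
print(len(list(sliced.batch(4))))   # 1, like from_tensors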

Try this:

import tensorflow as tf  # 1.13.1
tf.enable_eager_execution()

t1 = tf.constant([[11, 22], [33, 44], [55, 66]])

print("\n=========     from_tensors     ===========")
ds = tf.data.Dataset.from_tensors(t1)
print(ds.output_types, end=' : ')
print(ds.output_shapes)
for e in ds:
    print(e)

print("\n=========   from_tensor_slices    ===========")
ds = tf.data.Dataset.from_tensor_slices(t1)
print(ds.output_types, end=' : ')
print(ds.output_shapes)
for e in ds:
    print(e)

Output:

=========     from_tensors     ===========
<dtype: 'int32'> : (3, 2)
tf.Tensor(
[[11 22]
 [33 44]
 [55 66]], shape=(3, 2), dtype=int32)


=========   from_tensor_slices    ===========
<dtype: 'int32'> : (2,)
tf.Tensor([11 22], shape=(2,), dtype=int32)
tf.Tensor([33 44], shape=(2,), dtype=int32)
tf.Tensor([55 66], shape=(2,), dtype=int32)

The output is pretty much self-explanatory, but as you can see, from_tensor_slices() slices (what would be) the output of from_tensors() along its first dimension. You can also try with:

t1 = tf.constant([[[11, 22], [33, 44], [55, 66]],
                  [[110, 220], [330, 440], [550, 660]]])
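The expected result (my own sketch, not part of the original run) is the same pattern one rank higher: from_tensors yields a single element of shape (2, 3, 2), while from_tensor_slices yields two elements of shape (3, 2):

# Assuming the same eager TF 1.13 setup as above.
for e in tf.data.Dataset.from_tensor_slices(t1):
    print(e.shape)  # (3, 2), printed twice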

I think @MatthewScarpino clearly explained the differences between these two methods.

Here I try to describe the typical usage of these two methods:

  • from_tensors wraps its whole input into a single element, so it is typically used (together with Dataset.concatenate) to build up a larger dataset from several small pieces, i.e., the size (length) of the dataset grows;

  • while from_tensor_slices can be used to combine different tensors into one dataset, e.g., combine features and labels into one dataset of (feature, label) pairs (which is also why the 1st dimension of the tensors should be the same). That is, each element of the dataset becomes "wider" (see the sketch after this list).
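A minimal sketch of both usages (the concatenate loop is my own illustration of the first bullet; the answer itself does not spell it out), assuming TF 2.x:

import tensorflow as tf  # assuming TF 2.x

# "Longer": grow a dataset one element at a time with from_tensors + concatenate.
ds = tf.data.Dataset.from_tensors(tf.constant([1, 2]))
for row in ([3, 4], [5, 6]):
    ds = ds.concatenate(tf.data.Dataset.from_tensors(tf.constant(row)))
print(len(list(ds)))  # 3 elements, each of shape (2,)

# "Wider": pair features with labels using from_tensor_slices.
features = tf.random.uniform([5, 3])
labels = tf.constant([0, 1, 0, 1, 1])
pairs = tf.data.Dataset.from_tensor_slices((features, labels))
for x, y in pairs.take(1):
    print(x.shape, y.shape)  # (3,) ()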

In short:

from_tensors()

returns: a single element,
type: TensorDataset

from_tensor_slices()

returns: multiple elements, one per entry along the input's first dimension,
type: TensorSliceDataset

Explanation:

from_tensors()

With 1-D input

import tensorflow as tf
dataset_ft = tf.data.Dataset.from_tensors([1, 2, 3])
type(dataset_ft)


>>> tensorflow.python.data.ops.dataset_ops.TensorDataset

Now, if we loop through this Dataset we will only get one object:

for _ in dataset_ft:
    print(_)


>>> tf.Tensor([1 2 3], shape=(3,), dtype=int32)

What if we provide 2-D or more dimensional input?

With 2-D input

import tensorflow as tf
dataset_ft = tf.data.Dataset.from_tensors([[1, 2, 3], [4, 5, 6]])
type(dataset_ft)


>>> tensorflow.python.data.ops.dataset_ops.TensorDataset

Now, if we loop through this Dataset we will still get only one object:

for _ in dataset_ft:
    print(_)


>>> tf.Tensor(
>>> [[1 2 3]
>>> [4 5 6]], shape=(2, 3), dtype=int32)

As you can see, the shape of the produced tensor is the same as that of the input. There is no change in shape.

from_tensor_slices()

It removes the first dimension and uses it as the dataset dimension.

With 1-D input

import tensorflow as tf
dataset_fts = tf.data.Dataset.from_tensor_slices([1, 2, 3])
type(dataset_fts)


>>> tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

Now, if we loop through this Dataset we will have multiple objects:

for _ in dataset_fts:
    print(_)


>>> tf.Tensor(1, shape=(), dtype=int32)
>>> tf.Tensor(2, shape=(), dtype=int32)
>>> tf.Tensor(3, shape=(), dtype=int32)

What if we provide 2-D or more dimensional input?

With 2-D input

import tensorflow as tf
dataset_fts = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6]])
type(dataset_fts)


>>> tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

If we loop through this 2-D dataset we will have two 1-D elements:

for _ in dataset_fts:
    print(_)


>>> tf.Tensor([1 2 3], shape=(3,), dtype=int32)
>>> tf.Tensor([4 5 6], shape=(3,), dtype=int32)

That's the simplest way I can explain it. To get a better understanding, I would suggest running both functions with different inputs and looking at the shapes of the returned elements.
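For example, coming back to the question's (num_features, num_examples) NumPy matrix, here is a sketch under the assumption that you want one dataset element per example (the variable names are made up), assuming TF 2.x:

import numpy as np
import tensorflow as tf  # assuming TF 2.x

num_features, num_examples = 3, 5
data = np.random.rand(num_features, num_examples)

# from_tensor_slices slices along the first dimension, so transpose first
# if each element should be one example (a column of the original matrix).
ds = tf.data.Dataset.from_tensor_slices(data.T)
for example in ds.take(2):
    print(example.shape)  # (3,) -- one feature vector per element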