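What is the difference between column-oriented NoSQL and document-oriented NoSQL?

The three types of NoSQL database I know of are key-value, column-oriented, and document-oriented.

Key-value is pretty straightforward: a key with a plain value.

I've seen document-oriented databases described as being like key-value, except that the value can be a structure, such as a JSON object. Each "document" can have all, some, or none of the same keys as another.

Column-oriented seems very similar to document-oriented in that you don't specify a structure.

So what is the difference between the two, and why would you use one over the other?

I've specifically looked at MongoDB and Cassandra. I basically need a dynamic structure that can change without affecting other values. At the same time, I need to be able to search/filter on specific keys and run reports. In CAP terms, AP is the most important to me. The data can "eventually" be synced across nodes, as long as there is no conflict or loss of data. Each user would have their own "table".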


In Cassandra, each row (addressed by a key) contains one or more "columns". Columns are themselves key-value pairs. The column names need not be predefined, i.e. the structure isn't fixed. Columns in a row are stored in sorted order according to their keys (names).

In some cases, you may have very large numbers of columns in a row (e.g. to act as an index to enable particular kinds of query). Cassandra can handle such large structures efficiently, and you can retrieve specific ranges of columns.
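As a rough illustration of why sorted columns make range retrieval cheap, here is a minimal in-memory sketch in Python. The `Row` class, its `set` and `slice` methods, and the date-named columns are all hypothetical names made up for this example; this is not Cassandra's API or storage engine, only a toy model of "columns sorted by name".

```python
from bisect import bisect_left, bisect_right


class Row:
    """Toy row whose columns are kept sorted by column name."""

    def __init__(self):
        self._names = []    # column names, always in sorted order
        self._values = {}   # column name -> value

    def set(self, name, value):
        # Insert the name at its sorted position the first time it is seen.
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name <= end."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = Row()
# Columns arrive in arbitrary order; the row keeps them sorted.
for day, hits in [("2011-03-02", 7), ("2011-01-15", 3),
                  ("2011-02-20", 5), ("2011-04-01", 9)]:
    row.set(day, hits)

february_to_march = row.slice("2011-02-01", "2011-03-31")
print(february_to_march)  # [('2011-02-20', 5), ('2011-03-02', 7)]
```

Because the names are sorted, a range query only needs two binary searches and one contiguous read, no matter how many columns the row holds.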

There is a further level of structure (not so commonly used) called super-columns, where a column contains nested (sub)columns.

You can think of the overall structure as a nested hashtable/dictionary, with 2 or 3 levels of key.

Normal column family:

row
col  col  col ...
val  val  val ...

Super column family:

row
supercol                      supercol                     ...
(sub)col  (sub)col  ...       (sub)col  (sub)col  ...
val       val      ...        val       val      ...
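The nested-dictionary picture above can be written out directly in Python. This is only a mental model under made-up names (`users`, `addresses`, `lookup` and the sample data are hypothetical), not how Cassandra is queried.

```python
# Normal column family: row key -> column name -> value (2 levels of key).
users = {
    "alice": {"city": "Paris", "email": "alice@example.com"},
    "bob": {"email": "bob@example.com"},  # rows need not share columns
}

# Super column family: row key -> super column -> (sub)column -> value
# (3 levels of key).
addresses = {
    "alice": {
        "home": {"city": "Paris", "zip": "75001"},
        "work": {"city": "Lyon"},
    },
}


def lookup(family, row_key, *names):
    """Walk the nested dictionaries: row key first, then each column level."""
    node = family[row_key]
    for name in names:
        node = node[name]
    return node


print(lookup(users, "alice", "city"))              # Paris
print(lookup(addresses, "alice", "work", "city"))  # Lyon
```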

There are also higher-level structures - column families and keyspaces - which can be used to divide up or group together your data.

See also this Question: Cassandra: What is a subcolumn

Or the data modelling links from http://wiki.apache.org/cassandra/ArticlesAndPresentations

Re: comparison with document-oriented databases - the latter usually insert whole documents (typically JSON), whereas in Cassandra you can address individual columns or supercolumns, and update these individually, i.e. they work at a different level of granularity. Each column has its own separate timestamp/version (used to reconcile updates across the distributed cluster).
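To make the per-column timestamp idea concrete, here is a simplified last-write-wins merge in Python. It is a sketch under assumptions: the `merge_rows` function and the `(timestamp, value)` tuple layout are invented for this example, and real systems also have to deal with deletes and timestamp ties, which are left out here.

```python
def merge_rows(a, b):
    """Merge two replicas of one row, keeping the newest version per column.

    Each replica maps column name -> (timestamp, value).
    """
    merged = dict(a)
    for name, (timestamp, value) in b.items():
        if name not in merged or timestamp > merged[name][0]:
            merged[name] = (timestamp, value)
    return merged


replica_1 = {"email": (100, "old@example.com"), "city": (105, "Paris")}
replica_2 = {"email": (110, "new@example.com"), "phone": (102, "555-0100")}

merged = merge_rows(replica_1, replica_2)
print(merged["email"])  # (110, 'new@example.com')
print(merged["city"])   # (105, 'Paris')
```

Since every column carries its own version, two clients updating different columns of the same row never overwrite each other, which would not hold if the whole row were replaced as a single document.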

The Cassandra column values are just bytes, but can be typed as ASCII, UTF8 text, numbers, dates etc.

Of course, you could use Cassandra as a primitive document store by inserting columns containing JSON - but you wouldn't get all the features of a real document-oriented store.

For inserts, to use RDBMS terms, document-based stores are more consistent and straightforward. Note that Cassandra lets you achieve consistency through quorum reads and writes, but that does not apply to all column-based systems, and it reduces availability. For a write-once, read-often system, go for MongoDB. Also consider it if you always plan to read the whole structure of the object. A document-based system is designed to return the whole document when you fetch it, and is not very strong at returning only parts of it.
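The quorum trade-off mentioned here comes down to simple arithmetic, sketched below in Python. The function names are made up for this example; `n` is the number of replicas, and `w` and `r` are the number of replicas that must acknowledge a write or a read.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def reads_see_latest_write(n, w, r):
    """Read and write sets are guaranteed to overlap when w + r > n."""
    return w + r > n


def tolerated_failures(n, required):
    """Replicas that may be down while an operation needing `required` acks still succeeds."""
    return n - required


n = 3
print(quorum(n))                                    # 2
print(reads_see_latest_write(n, w=2, r=2))          # True
print(reads_see_latest_write(n, w=1, r=1))          # False
print(tolerated_failures(n, required=quorum(n)))    # 1
print(tolerated_failures(n, required=1))            # 2
```

With three replicas, quorum reads and writes always overlap on at least one replica, but an operation then survives only one replica being down instead of two. That is the loss of availability the answer refers to.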

Column-based systems like Cassandra are much better than document-based ones at updates. You can change the value of a column without even reading the row that contains it. The write doesn't actually need to happen on the same server; a row may be spread across multiple files on multiple servers. For a huge, fast-evolving data system, go for Cassandra. Also consider it if you plan to have very large chunks of data per key and won't need to load all of it on every query. For selects, Cassandra lets you load only the columns you need.

Also consider that MongoDB is written in C++ and is at its second major release, while Cassandra needs to run on a JVM, and its first major release only reached release candidate status yesterday (although the 0.x releases are already running in production at major companies).

On the other hand, Cassandra's design was partly based on Amazon Dynamo, and it is built at its core to be a high-availability solution, but that has nothing to do with the column-based format. MongoDB scales out too, but not as gracefully as Cassandra.

The main difference is that document stores (e.g. MongoDB and CouchDB) allow arbitrarily complex documents, i.e. subdocuments within subdocuments, lists with documents, etc. whereas column stores (e.g. Cassandra and HBase) only allow a fixed format, e.g. strict one-level or two-level dictionaries.
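One way to see the difference in nesting depth is to try to squeeze a nested document into a one-level dictionary. The Python sketch below does this with dotted path names; the `flatten` helper and the dotted naming convention are assumptions made for this example, not something either kind of database does for you.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts and lists into a one-level dict of path -> value.

    Note: empty dicts and empty lists produce no columns in this sketch.
    """
    if isinstance(doc, dict):
        items = doc.items()
    elif isinstance(doc, list):
        items = ((str(index), value) for index, value in enumerate(doc))
    else:
        return {prefix: doc}  # a leaf value becomes one column

    columns = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        columns.update(flatten(value, path))
    return columns


document = {
    "name": "alice",
    "address": {"city": "Paris"},
    "tags": ["a", "b"],
}

columns = flatten(document)
print(columns)
# {'name': 'alice', 'address.city': 'Paris', 'tags.0': 'a', 'tags.1': 'b'}
```

A document store keeps `document` as is, with its nesting intact; a store limited to one or two levels of keys needs some encoding like this, and the application has to rebuild the structure on the way back out.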

I would say that the main difference is the way each of these DB types physically stores the data.
With column types, the data is stored by columns which can enable efficient aggregation operations / queries on a particular column.
With document types, the entire document is logically stored in one place and is generally retrieved as a whole (no efficient aggregation possible on "columns" / "fields").
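The layout difference described in this answer can be illustrated with a toy in-memory model in Python. This only shows the two shapes of data; it makes no claim about how any particular database lays data out on disk, and the sample records and function names are made up. It also assumes every record has the same fields, so the per-field lists stay aligned.

```python
records = [
    {"user": "alice", "amount": 30, "city": "Paris"},
    {"user": "bob", "amount": 12, "city": "Lyon"},
    {"user": "carol", "amount": 8, "city": "Paris"},
]


def total_from_documents(docs, field):
    """Document-style: every whole record is visited to read one field."""
    return sum(doc[field] for doc in docs)


def to_columns(docs):
    """Column-style: regroup the same data as one list of values per field."""
    columns = {}
    for doc in docs:
        for field, value in doc.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(records)

print(total_from_documents(records, "amount"))  # 50
print(sum(columns["amount"]))                   # 50
```

Both give the same total, but the column-style version only touches the `amount` values, while the document-style version has to walk through every complete record.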

The confusing bit is that a wide-column "row" can easily be represented as a document, but, as mentioned, the two are stored differently and optimized for different purposes.