MongoDB schema design: many small documents or fewer large documents?

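Background
I'm prototyping a conversion from an RDBMS database to MongoDB. While denormalizing, I seem to have two choices: one leads to many (millions of) smaller documents, the other to fewer (hundreds of thousands of) larger documents.

If I could boil it down to a simple analogy, it would be the difference between a collection with fewer Customer documents like this (in Java):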

class Customer {
    private String name;
    private Address address;
    // each CreditCard has hundreds of Payment instances
    private Set<CreditCard> creditCards;
}

or a collection with many, many Payment documents like this:

class Payment {
    private Customer customer;
    private CreditCard creditCard;
    private Date payDate;
    private float payAmount;
}

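Question
Is MongoDB designed to prefer many, many small documents or fewer large documents? Does the answer mostly depend on which queries I plan to run (for example, how many credit cards does customer X have, versus what was the average amount all customers paid last month)?

I've looked around a lot, but I haven't come across any MongoDB schema best practices that would help me answer my question.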


You'll definitely need to optimize for the queries you're doing.

Here's my best guess based on your description.

You'll probably want to know all Credit Cards for each Customer, so keep an array of those within the Customer Object. You'll also probably want to have a Customer reference for each Payment. This will keep the Payment document relatively small.

The Payment object will automatically have its own ID and index. You'll probably want to add an index on the Customer reference as well.

This will allow you to quickly search for Payments by Customer without storing the whole customer object every time.
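
To make the idea concrete, here is a minimal in-memory sketch in plain Java, not the MongoDB driver API. The names PaymentRef, PaymentCollection, and findByCustomer are hypothetical, and the map only stands in for what a secondary index on the Customer reference would do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one Payment document.
class PaymentRef {
    final String id;          // plays the role of _id
    final String customerId;  // reference to the Customer, not a copy of it
    final double amount;

    PaymentRef(String id, String customerId, double amount) {
        this.id = id;
        this.customerId = customerId;
        this.amount = amount;
    }
}

// Hypothetical stand-in for the Payment collection.
class PaymentCollection {
    // Plays the role of an index on the customer reference.
    private final Map<String, List<PaymentRef>> byCustomer = new HashMap<>();

    void insert(PaymentRef payment) {
        byCustomer.computeIfAbsent(payment.customerId, k -> new ArrayList<>()).add(payment);
    }

    // Touches only this customer's payments rather than scanning every payment.
    List<PaymentRef> findByCustomer(String customerId) {
        return byCustomer.getOrDefault(customerId, Collections.emptyList());
    }
}
```

Each payment carries only a small customer id, so the payment documents stay small while the lookup by customer stays cheap.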

If you want to answer questions like "What was the average amount all customers paid last month?", you'll instead want a map/reduce job for any sizeable dataset. You won't get that answer in real time. You'll find that storing a reference to Customer is probably good enough for these map/reduce jobs.
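
As a rough illustration of the shape of such a job, the sketch below filters payments to one month, maps each to its amount, and reduces to an average. It is plain Java over an in-memory list, not MongoDB's map/reduce facility, and PaidAmount and MonthlyAverage are hypothetical names.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical stand-in for a Payment document that only references its customer.
class PaidAmount {
    final String customerId;
    final LocalDate payDate;
    final double amount;

    PaidAmount(String customerId, LocalDate payDate, double amount) {
        this.customerId = customerId;
        this.payDate = payDate;
        this.amount = amount;
    }
}

class MonthlyAverage {
    // Visits every payment: this is the "most of the haystack" kind of query.
    static OptionalDouble averageFor(List<PaidAmount> payments, YearMonth month) {
        return payments.stream()
                .filter(p -> YearMonth.from(p.payDate).equals(month)) // keep one month
                .mapToDouble(p -> p.amount)                           // map to amount
                .average();                                           // reduce
    }
}
```

Note that the customer reference is never dereferenced here, which is why a bare reference is enough for this kind of report.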

So to answer your question directly: Is MongoDB designed to prefer many, many small documents or fewer large documents?

MongoDB is designed to find indexed entries very quickly. MongoDB is very good at finding a few needles in a large haystack. MongoDB is not very good at finding most of the needles in the haystack. So build your data around your most common use cases and write map/reduce jobs for the rarer use cases.

Documents that grow substantially over time can be ticking time bombs. Network bandwidth and RAM usage will likely become measurable bottlenecks, forcing you to start over.

First, let's consider two collections: Customer and Payment. Thus, the grain is fairly small: one document per payment.

Next you must decide how to model account information, such as credit cards. Let's consider whether customer documents contain arrays of account information or whether you need a new Account collection.

If account documents are separate from customer documents, loading all of the accounts for one customer into memory requires fetching multiple documents. That might translate into extra memory, I/O, bandwidth, and CPU usage. Does that immediately mean the Account collection is a bad idea?

Your decision affects payment documents. If account information is embedded in a customer document, how would you reference it? Separate account documents have their own _id attribute. With embedded account information, your application would either generate new ids for accounts or use the account's attributes (e.g., account number) for the key.

Could a payment document actually contain all the payments made in a fixed timeframe (e.g., a day)? Such complexity will affect all code that reads and writes payment documents. Premature optimization can be deadly to projects.
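
Here is a small sketch of what that complexity looks like on the read side, assuming a hypothetical DailyPaymentBucket document that holds one customer's payments for one day. It is plain in-memory Java, not driver code, and all names are made up for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical single payment stored inside a bucket.
class BucketedPayment {
    final String paymentId;
    final double amount;

    BucketedPayment(String paymentId, double amount) {
        this.paymentId = paymentId;
        this.amount = amount;
    }
}

// Hypothetical document holding all payments of one customer for one day.
class DailyPaymentBucket {
    final String customerId;
    final LocalDate day;
    final List<BucketedPayment> payments = new ArrayList<>();

    DailyPaymentBucket(String customerId, LocalDate day) {
        this.customerId = customerId;
        this.day = day;
    }
}

class BucketReader {
    // With one document per payment this would be a single lookup by _id.
    // With buckets, every reader has to know about and unpack the nesting.
    static BucketedPayment findPayment(List<DailyPaymentBucket> buckets, String paymentId) {
        for (DailyPaymentBucket bucket : buckets) {
            for (BucketedPayment payment : bucket.payments) {
                if (payment.paymentId.equals(paymentId)) {
                    return payment;
                }
            }
        }
        return null; // not found
    }
}
```

Writers have the mirror-image problem: they must first locate or create the right bucket before appending a payment.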

Like account documents, payments are easily referenced as long as a payment document contains only one payment. A new type of document, credit for example, could reference a payment. But would you create a Credit collection or would you embed credit information inside payment information? What would happen if you later needed to reference a credit?

To summarize, I have been successful with lots of small documents and many collections. I implement references with _id and only with _id. Thus, I don't worry about ever-growing documents destroying my application. The schema is easy to understand and index because each entity has its own collection. Important entities aren't hiding inside other documents.

I'd love to hear about your findings. Good luck!

According to MongoDB's own documentation, it sounds like it's designed for many small documents.

From Performance Best Practices for MongoDB:

The maximum size for documents in MongoDB is 16 MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document.

From 6 Rules of Thumb for MongoDB Schema Design: Part 1:

Modeling One-to-Few

An example of “one-to-few” might be the addresses for a person. This is a good use case for embedding – you’d put the addresses in an array inside of your Person object.

One-to-Many

An example of “one-to-many” might be parts for a product in a replacement parts ordering system. Each product may have up to several hundred replacement parts, but never more than a couple thousand or so. This is a good use case for referencing – you’d put the ObjectIDs of the parts in an array in product document.
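
A minimal sketch of that referencing pattern, written as plain in-memory Java rather than driver code: the product holds only part ids, and the application resolves them in a second step. Part, Product, PartsCatalog, and partsFor are hypothetical names, and the string ids stand in for ObjectIDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document in a parts collection.
class Part {
    final String id;   // plays the role of the part's ObjectID
    final String name;

    Part(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical product document: stores references to parts, not the parts themselves.
class Product {
    final String name;
    final List<String> partIds;

    Product(String name, List<String> partIds) {
        this.name = name;
        this.partIds = partIds;
    }
}

class PartsCatalog {
    private final Map<String, Part> partsById = new HashMap<>();

    void insert(Part part) {
        partsById.put(part.id, part);
    }

    // Second step of an application-level join: resolve each referenced id.
    List<Part> partsFor(Product product) {
        List<Part> result = new ArrayList<>();
        for (String partId : product.partIds) {
            Part part = partsById.get(partId);
            if (part != null) {
                result.add(part);
            }
        }
        return result;
    }
}
```

The trade-off is an extra lookup per read in exchange for parts that can be queried and updated on their own.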

One-to-Squillions

An example of “one-to-squillions” might be an event logging system that collects log messages for different machines. Any given host could generate enough messages to overflow the 16 MB document size, even if all you stored in the array was the ObjectID. This is the classic use case for “parent-referencing” – you’d have a document for the host, and then store the ObjectID of the host in the documents for the log messages.
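
The overflow claim can be checked with some back-of-the-envelope arithmetic. The sketch below assumes the BSON array layout (each element is a type byte, the decimal index as a null-terminated key, and a 12-byte ObjectID) and ignores every other field in the host document, so the real ceiling would be somewhat lower. ObjectIdArrayLimit is a hypothetical helper name.

```java
class ObjectIdArrayLimit {
    static final int MAX_DOCUMENT_BYTES = 16 * 1024 * 1024; // 16 MB document limit
    static final int OBJECT_ID_BYTES = 12;

    // Counts how many ObjectIDs fit in a BSON array that fills a whole document.
    static int maxObjectIdsInArray() {
        long used = 5; // int32 length prefix plus trailing 0x00 terminator
        int count = 0;
        while (true) {
            // Array keys are the decimal index ("0", "1", ...) plus a null byte.
            int keyBytes = Integer.toString(count).length() + 1;
            int elementBytes = 1 + keyBytes + OBJECT_ID_BYTES; // type byte + key + value
            if (used + elementBytes > MAX_DOCUMENT_BYTES) {
                return count;
            }
            used += elementBytes;
            count++;
        }
    }

    public static void main(String[] args) {
        System.out.println("ObjectIDs that fit: " + maxObjectIdsInArray());
    }
}
```

Under these assumptions the array tops out below a million ids, which a busy host's log stream can exceed, so storing the host's id on each log message is the only arrangement that keeps growing safely.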