标准使用‘ Z’代替 NULL 来表示丢失的数据?

除了是否应该使用 NULL 的争论之外: 我负责一个现有的使用 NULL 表示“丢失或从未输入”数据的数据库。它不同于空字符串,空字符串意味着“用户设置此值,并选择‘ em pty’”

该项目的另一个承包商坚定地站在“ NULL 对我来说并不存在; 我从不使用 NULL,其他人也不应该使用 NULL”的立场上。然而,令我困惑的是,由于承包商的团队确实承认“丢失/从未输入”和“故意空白或用户指示为未知”之间的区别,他们在整个代码和存储过程中使用单个字符“ Z”来表示“丢失/从未输入”,在整个数据库中使用与 NULL 相同的含义。

尽管我们的共享客户要求更改这个属性,我也支持这个请求,但是团队将其作为比我先进得多的 DBA 中的“标准实践”; 他们不愿意仅仅因为我的无知请求而更改为使用 NULL。那么,有人能帮我克服我的无知吗?在 SQL 专家中,是否有任何标准,或者一小群人,甚至是一个单独的声音提倡使用“ Z”代替 NULL?

更新

我收到了承包商的回复。下面是当客户要求删除特殊值以允许在没有数据的列中使用 NULL 时他所说的话:

基本上,我设计数据库是为了尽可能避免 NULL:

字符串[ VARCHAR ]字段中不需要 NULL,因为空(零长度)字符串提供完全相同的信息。

整数字段(例如,ID 值)中的 NULL 可以通过使用数据中永远不会出现的值来处理(例如,整数 IDENTITY 字段为 -1)。

日期字段中的 NULL 很容易导致日期计算的复杂性。例如,在计算日期差异的逻辑中,比如[ RecoveryDate ]和[ OnsetDate ]之间的天数差异,如果一个或两个日期都为 NULL,逻辑就会崩溃——除非明确允许两个日期都为 NULL。这是额外的工作和额外的操作。如果对[ RecoveryDate ]和[ OnsetDate ]使用“ default”或“ placeholder”日期(例如,“1/1/1900”) ,数学计算可能会显示“异常”值——但日期逻辑不会崩溃。

传统上,NULL 处理是开发人员在存储过程中犯错误的领域。

在我15年的 DBA 生涯中,我发现尽可能避免 NULL 是最好的。

这似乎验证了对这个问题的大多数消极反应。不是应用公认的6NF 方法来设计 NULL,而是使用特殊值来“尽可能避免 NULL”我以一种开放的心态发布了这个问题,我很高兴我学到了更多关于“ NULL 是有用的/NULL 是邪恶的”的争论,但是我现在非常舒服地将“特殊值”方法标记为完全无意义的。

空(零长度)字符串提供完全相同的信息。

不,它不是; 在我们正在修改的现有数据库中,NULL 表示“从未输入”,空字符串表示“作为空值输入”。

传统上,NULL 处理是开发人员在存储过程中犯错误的领域。

是的,但是这些错误已经被成千上万的开发人员犯过成千上万次了,避免这些错误的教训和警告已经被记录下来了。正如这里提到的: 无论接受还是拒绝 NULL,缺少值的表示都是 解决了问题。没有必要仅仅因为开发人员继续犯易于克服(和易于识别)的错误而发明新的解决方案。


作为一个脚注: 我已经做了20多年的数据库工程师和开发人员(这足以让我知道数据库工程师和数据库管理员之间的区别)。在我的整个职业生涯中,我一直是“ NULL 是有用的”阵营,尽管我知道有几个非常聪明的人不这么认为。我对“特殊价值观”的方法持极度怀疑态度,但对“如何避免正确的方法为零”的学术知识还不够精通,因此不能坚定立场。我总是喜欢学习新的东西ーー而且20年过去了,我还有很多东西要学。感谢所有为这次讨论做出贡献的人。

3565 次浏览

If the domain allows missing values, then using NULL to represent 'undefined' is perfectly OK (that's what it is there for). The only downside is that code that consumes the data has to be written to check for NULLs. This is the way I've always done it.

I have never heard of (or seen in practice) the use of 'Z' to represent missing data. As to "the contractor cites this as 'standard practice' among DBAs", can he provide some evidence of that assertion? As @Dems mentioned, you also need to document that 'Z' doesn't mean 'Z': what about a MiddleInitial column?

Like Aaron Alton and many others, I believe that NULL values are an integral part of database design, and should be used where appropriate.

Sack your contractor.

Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.

I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)

At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, mis-understanding, and in the end increased costs and mistakes.


EDIT

There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.

I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.

This still can change behaviour from what may be expected, and so should be used with care:

  • WHERE end-date IS NULL no longer give data that is still valid
  • You just created your own millennium bug
  • etc.

This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.

Still, sack the contractor.

This is easily one of the weirdest opinions I've ever heard. Using a magic value to represent "no data" rather than NULL means that every piece of code that you have will have to post-process the results to account/discard the "no-data"/"Z" values.

NULL is special because of the way that the database handles it in queries. For instance, take these two simple queries:

select * from mytable where name = 'bob';
select * from mytable where name != 'bob';

If name is ever NULL, it obviously won't show up in the first query's results. More importantly, neither will it show up in the second queries results. NULL doesn't match anything other than an explicit search for NULL, as in:

select * from mytable where name is NULL;

And what happens when the data could have Z as a valid value? Let's say you're storing someone's middle initial? Would Zachary Z Zonkas be lumped in with those people with no middle initial? Or would your contractor come up with yet another magic value to handle this?

Avoid magic values that require you to implement database features in code that the database is already fully capable of handling. This is a solved and well understood problem, and it may just be that your contractor never really grokked the notion of NULL and therefore avoids using it.

While I have never seen 'Z' as a magic value to represent null, I have seen 'X' used to represent a field that has not been filled in. That said, I have only ever seen this in one place, and my interface to it was not a database, but rather an XML file… so I would not be prepared to use this an argument for being common practice.

Note that we do have to handle the 'X' specially, and, as Dems mentioned, we do have to document it, and people have been confused by it. In our defence, this is forced on us by an external supplier, not something that we cooked up ourselves!

Even if you somehow manage to explain to all your current and future developers and DBAs about "Z" instead of NULL, and even if they code everything perfectly, you will still confuse the optimizer because it will not know that you've cooked this up.

Using a special value to represent NULL (which is already a special value to represent NULL) will result in skews in the data. e.g. So many things happened on 1-Jan-1900 that it will throw out the optimizer's ability to understand that actual range of dates that really are relevant to your application.

This is like a manager deciding: "Wearing a tie is bad for productivity, so we're all going to wear masking tape around our necks. Problem solved."

Nothing in principle requires nulls for correct database design. In fact there are plenty of databases designed without using null and there are plenty of very good database designers and whole development teams who design databases without using nulls. In general it's a good thing to be cautious about adding nulls to a database because they inevitably lead to incorrect or ambiguous results later on.

I've not heard of using Z being called "standard practice" as a placeholder value instead of nulls but I expect your contractor is referring to the concept of sentinel values in general, which are sometimes used in database design. However, a much more common and flexible way to avoid nulls without using "dummy" data is simply to design them out. Decompose the table such that each type of fact is recorded in a table that doesn't have "extra", unspecified attributes.

I've never heard about the wide-spread use of 'Z' as a substitute for NULL.

(Incidentally, I'd not particularly like to work with a contractor who tells you in the face that they and other "advanced" DBAs are so much more knowledgeable and better than you.)

 +=================================+
|  FavoriteLetters                |
+=================================+
|  Person      |  FavoriteLetter  |
+--------------+------------------+
|  'Anna'      |  'A'             |
|  'Bob'       |  'B'             |
|  'Claire'    |  'C'             |
|  'Zaphod'    |  'Z'             |
+---------------------------------+

How would your contractor interpret the data from the last row?

Probably he would choose a different "magic value" in this table to avoid collision with the real data 'Z'? Meaning you'd have to remember several magic values and also which one is used where... how is this better than having just one magic token NULL, and having to remember the three-valued logic rules (and pitfalls) that go with it? NULL at least is standardized, unlike your contractor's 'Z'.

I don't particularly like NULL either, but mindlessly substituting it with an actual value (or worse, with several actual values) everywhere is almost definitely worse than NULL.

Let me repeat my above comment here for better visibility: If you want to read something serious and well-grounded by people who are against NULL, I would recommend the short article "How to handle missing information without using NULLs" (links to a PDF from The Third Manifesto homepage).

In reply to contractors comments

  • Empty string <> NULL
  • Empty string requires 2 bytes storage + an offset read
  • NULL uses null bitmap = quicker
  • IDENTITY doesn't always start at 1 (why waste half your range?)

The whole concept is flawed as per most other answers here