And, third, you say you've a divide in your team. My guess is this means different members have already adopted different approaches, and you need to standardise. Ruling that #if is the preferred choice means that code using #ifdef will compile (and run) even when DEBUG_ENABLED is false. And it's much easier to track down and remove debug output that is produced when it shouldn't be than vice versa.

How many database indexes is too many?

Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.

It's a matter of style. But I recommend a more concise way of doing this:

#ifdef USE_DEBUG
#define debug_print printf
#else
/* variadic no-op so call sites still compile when debugging is off (C99) */
#define debug_print(...) ((void)0)
#endif

debug_print("i=%d\n", i);

So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?

If the table is heavily hit by UPDATEs, INSERTs and DELETEs, these operations will become very slow with lots of indexes, since every index has to be modified each time one of them takes place.

Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
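
For example, a minimal sketch with a hypothetical users table:

-- Low selectivity: only two distinct values, so a B-tree index adds little.
CREATE INDEX idx_users_is_active ON users (is_active);

-- High selectivity: near-unique values, so index lookups are very effective.
CREATE INDEX idx_users_email ON users (email);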

For the purposes of performing conditional compilation, #if and #ifdef are almost the same, but not quite. If your conditional compilation depends on two symbols, then #ifdef will not work as well. For example, suppose you have two conditional compilation symbols, PRO_VERSION and TRIAL_VERSION; you might have something like this:

#if defined(PRO_VERSION) && !defined(TRIAL_VERSION)
...
#else
...
#endif

Using #ifdef the above becomes much more complicated, especially getting the #else part to work.

Use #if whenever two or more symbols are being evaluated.

In general, the more inserting you do, the more painful your indexes become. Each time you do an insert, all the indexes on that table have to be updated.

#ifdef just checks if a token is defined, given

#define FOO 0

Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.

then

#ifdef FOO // is true
#if FOO // is false, because it evaluates to "#if 0"

If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious about how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
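
For instance, a minimal T-SQL sketch (table and column names are hypothetical):

-- Cluster on the column the data is most often range-scanned or sorted on,
-- which is not necessarily the surrogate key.
CREATE CLUSTERED INDEX CX_orders_order_date ON orders (order_date);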

But then suppose you forgot to include the header file that defines COOL_FEATURE in file.cpp:

#if COOL_FEATURE()
// definitely awesome stuff here...
#endif

The preprocessor would have errored out because of the use of an undefined function macro.

They're both hideous. Instead, do this:

#ifdef DEBUG
#define D(x) do { x } while(0)   /* debug builds: execute x */
#else
#define D(x) do { } while(0)     /* release builds: x is discarded */
#endif

/* usage: D(printf("i=%d\n", i);) */

One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
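
For instance, a single composite index (hypothetical table name t) covers all three patterns through its leftmost-prefix behaviour:

CREATE INDEX idx_t_c1_c2_c3 ON t (column1, column2, column3);

-- Usable by queries filtering on:
--   column1
--   column1 AND column2
--   column1 AND column2 AND column3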

What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.

That is not a matter of style at all, and unfortunately the question is wrong: you cannot compare these preprocessor directives in terms of better or safer.

#ifdef macro

means "if macro is defined" or "if macro exists". The value of macro does not matter here; it can be whatever.

Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.

#if macro

#if always compares to a value. In the above example it is the standard implicit comparison:

#if macro != 0

example for the usage of #if

#if CFLAG_EDITION == 0
return EDITION_FREE;
#elif CFLAG_EDITION == 1
return EDITION_BASIC;
#else
return EDITION_PRO;
#endif

You can now either put the definition of CFLAG_EDITION in your code

#define CFLAG_EDITION 1

or you can set the macro as a compiler flag, e.g. -DCFLAG_EDITION=1 with gcc.

To paraphrase Einstein on simplicity: add as many indexes as you need, and no more.

Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.

My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.

Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.

There's no static answer in my opinion; this sort of thing falls under 'performance tuning'.

It could be that everything your app does is looked up by a primary key, or it could be the opposite, in that queries are done over unrestricted combinations of fields and any one in particular could be used at any given time.

Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. It's really dependent on your load shapes and query parameters, and on how much/what data 'really' needs to be returned by a query.

means "if macro is defined" or "if macro exists". The value of macro does not matter here. It can be whatever.

#if macro

If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.

For SQL Server I've found the Database Engine Tuning Advisor useful - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.

3. Use a separate table for new/updated data, and run a nightly process which combines the data together (see the sketch after this list). This would require a change in your application logic.

I used to use #ifdef, but when I switched to Doxygen for documentation, I found that commented-out macros cannot be documented (or, at least, Doxygen produces a warning). This means I cannot document the feature-switch macros that are not currently enabled.

4. Switch to an IOT (index-organized table), if your data supports this.

Although it is possible to define the macros only for Doxygen, this means that the macros in the non-active portions of the code will be documented, too. I personally want to show the feature switches and otherwise only document what is currently selected. Furthermore, it makes the code quite messy if there are many macros that have to be defined only when Doxygen processes the file.

Of course there might be many more solutions for such a case. My first suggestion would be to clone the DB to a development environment and run some stress testing against it.
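
A minimal sketch of option 3 above (Oracle-flavoured; table and column names are hypothetical): new rows go to an index-free staging table, and a nightly job merges them into the indexed table.

    -- During the day: cheap inserts, no index maintenance.
    INSERT INTO orders_staging (order_id, customer_id, order_date)
    VALUES (1001, 42, DATE '2009-01-15');

    -- Nightly batch: move the staged rows into the indexed table.
    INSERT INTO orders SELECT * FROM orders_staging;
    TRUNCATE TABLE orders_staging;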

As with many things, the answer depends. #ifdef is great for things that are guaranteed to be defined or not defined in a particular unit. Include guards, for example: if the include file is present at least once, the symbol is guaranteed to be defined, otherwise not.

However, some things don't have that guarantee. Think about the symbol HAS_FEATURE_X. How many states exist?

  1. Undefined
  2. Defined
  3. Defined with a value (say 0 or 1).

So, if you're writing code, especially shared code, where some may #define HAS_FEATURE_X 0 to mean feature X isn't present and others may just not define it, you need to handle all those cases:

    #if !defined(HAS_FEATURE_X) || HAS_FEATURE_X == 1

Using just an #ifdef could allow for a subtle error where something is switched in (or out) unexpectedly because someone or some team has a convention of defining unused things to 0. In some ways, I like this #if approach because it means the programmer actively made a decision. Leaving something undefined is passive and, from an external point of view, it can sometimes be unclear whether that was intentional or an oversight.

  1. Get a log of the real queries run on the data on a typical day.
  2. Add indexes so the most important queries hit the indexes in their execution plan (see the EXPLAIN sketch below).
  3. Try to avoid indexing fields that have a lot of updates or inserts.
  4. After a few indexes, get a new log and repeat.

As with any optimization, I stop when the requested performance is reached (this obviously implies that point 0 would be getting specific performance requirements).
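
For step 2, a hypothetical MySQL check that a frequent query actually hits an index:

    -- Inspect the execution plan of a common query.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- If it shows a full table scan, add an index and re-check.
    CREATE INDEX idx_orders_customer ON orders (customer_id);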

    Can you tolerate the additional time it takes to complete an update?

In MySQL, if I have a list of date ranges (range-start and range-end), e.g.

    10/06/1983 to 14/06/1983
    15/07/1983 to 16/07/1983
    18/07/1983 to 18/07/1983
    

    You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".

    And I want to check if another date range contains ANY of the ranges already in the list, how would I do that?

    There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.

    e.g.

    06/06/1983 to 18/06/1983 = IN LIST
    10/06/1983 to 11/06/1983 = IN LIST
    14/07/1983 to 14/07/1983 = NOT IN LIST
    
    In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.

    Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.

    For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
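
For illustration, an Oracle-flavoured sketch (fact table and columns are hypothetical):

    -- Bitmap index on a low-cardinality dimension key.
    CREATE BITMAP INDEX sales_region_bix ON fact_sales (region_id);

    -- Compressed B-tree index on a high-cardinality column.
    CREATE INDEX sales_customer_ix ON fact_sales (customer_id) COMPRESS;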

    The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
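
For example, Oracle's APPEND hint requests a direct-path insert (the staging table name is hypothetical):

    INSERT /*+ APPEND */ INTO fact_sales
    SELECT * FROM staging_sales;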

    To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.

Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find unused indexes is to monitor index usage. You do this as follows:

    alter index my_index_name monitoring usage;
    

    This is a classical problem, and it's actually easier if you reverse the logic.

    Let me give you an example.

    You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
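
For example, a minimal sketch (the index name is hypothetical):

    SELECT index_name, monitoring, used
    FROM v$object_usage
    WHERE index_name = 'MY_INDEX_NAME';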

    Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.

    I'll post one period of time here, and all the different variations of other periods that overlap in some way.

               |-------------------|          compare to this one
                   |---------|                contained within
               |----------|                   contained within, equal start
                  |----------------|          contained within, equal end
               |-------------------|          contained within, equal start+end
         |------------|                       not fully contained, overlaps start
                      |-----------------|     not fully contained, overlaps end
    |-------------------------|               overlaps start, bigger
                  |-----------------------|   overlaps end, bigger
         |--------------------------------|   overlaps entire period
    

On the other hand, let me post all those that don't overlap:

               |-------------------|          compare to this one
    |---|                                     ends before
                                     |---|    starts after
    

So if you simply reduce the comparison to:

    starts after end
    ends before start
    

then you'll find all those that don't overlap, and from that, all the non-matching periods.

For your final NOT IN LIST example, you can see that it matches those two rules.

You will need to decide whether the following periods are IN or OUTSIDE your ranges:

               |-------------|
      |--------|                              equal end with start of comparison period
                             |--------|       equal start with end of comparison period
    

    If your table has columns called range_end and range_start, here's some simple SQL to retrieve all the matching rows:

    SELECT *
    FROM periods
    WHERE NOT (range_start > @check_period_end
               OR range_end < @check_period_start)
    

    Note the NOT in there. Since the two simple rules finds all the non-matching rows, a simple NOT will reverse it to say: if it's not one of the non-matching rows, it has to be one of the matching ones.

Apply simple reversal logic here to get rid of the NOT and you'll end up with:

    SELECT *
    FROM periods
    WHERE range_start <= @check_period_end
    AND range_end >= @check_period_start
    

    In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
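
For instance (Oracle-style bind syntax; table and column names are hypothetical):

    -- Literal value: each distinct value can trigger a hard parse and a new plan.
    SELECT * FROM orders WHERE customer_id = 42;

    -- Bind variable: one cached plan can be soft-parsed and reused.
    SELECT * FROM orders WHERE customer_id = :customer_id;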

    As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.

    In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.

So is it OK to add many indexes? It depends :) I gave you my results - you decide!

Taking your example range of 06/06/1983 to 18/06/1983 and assuming you have columns called start and end for your ranges, you could use a clause like this (backticks because end is a reserved word in MySQL):

    where ('1983-06-06' <= `end`) and ('1983-06-18' >= `start`)
    
    This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.

    In your expected results you say

    rel="nofollow noreferrer">http://www.cs.arizona.edu/people/rts/tdbbook.pdf) useful: it pre-dates mysql but the concept of time hasn't changed ;-)
