何时在 MySQL 中使用 STRIGHT_JOIN

我只是有一个相当复杂的查询,我的工作,它需要8秒钟运行。EXPLAIN 显示了一个奇怪的表顺序,即使使用 FORCE INDEX 提示,我的索引也没有全部使用。我偶然发现了 STRAIGHT _ JOIN 连接关键字,并开始用它替换一些 INNER JOIN 关键字。我注意到速度有了很大的提高。最后,我只是将这个查询的所有 INNER JOIN 关键字替换为 STRIGHT _ JOIN,现在运行时间为0.01秒。

我的问题是什么时候使用 STRAIGHT _ JOIN,什么时候使用 INNER JOIN?如果您正在编写好的查询,是否有理由不使用 STRIGHT _ JOIN?

60259 次浏览

MySQL isn’t necessarilly good at choosing the join order in complex queries. By specifying a complex query as a straight_join the query executes the joins in the order they’re specified. By placing the table to be the least common denominator first and specifying straight_join you are able to improve the query performance.

From MySQL JOIN reference:

"STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order."

I wouldn't recommend using STRAIGHT_JOIN without a good reason. My own experience is that the MySQL query optimizer chooses a poor query plan more often than I'd like, but not often enough that you should just bypass it in general, which is what you would be doing if you always used STRAIGHT_JOIN.

My recommendation is to leave all queries as regular JOINs. If you discover that one query is using a sub-optimal query plan, I would suggest first trying to rewrite or re-structure the query a bit to see if the optimizer will then pick a better query plan. Also, for innodb at least, make sure it's not just that your index statistics are out-of-date (ANALYZE TABLE). That can cause the optimizer to choose a poor query plan. Optimizer hints should generally be your last resort.

Another reason not to use query hints is that your data distribution may change over time, or your index selectivity may change, etc. as your table grows. Your query hints that are optimal now, may become sub-optimal over time. But the optimizer will be unable to adapt the query plan because of your now outdated hints. You stay more flexible if you allow the optimizer to make the decisions.

STRAIGHT_JOIN, using this clause, you can control the JOIN order: which table is scanned in the outer loop and which one is in the inner loop.

Here's a scenario that came up just recently at work.

Consider three tables, A, B, C.

A has 3,000 rows; B has 300,000,000 rows; and C has 2,000 rows.

Foreign keys are defined: B(a_id), B(c_id).

Suppose you had a query that looks like this:

select a.id, c.id
from a
join b on b.a_id = a.id
join c on c.id = b.c_id

In my experience, MySQL may choose to go C -> B -> A in this case. C is smaller than A and B is enormous, and they're all equijoins.

The trouble is MySQL doesn't necessarily take into account the size of the intersection between (C.id and B.c_id) vs (A.id and B.a_id). If the join between B and C returns just as many rows as B, then it's a very poor choice; if starting with A would have filtered down B to as many rows as A, then it would have been a much better choice. straight_join could be used to force this order like this:

select a.id, c.id
from a
straight_join b on b.a_id = a.id
join c on c.id = b.c_id

Now a must be joined on before b.

Generally you want to do your joins in an order that minimizes the number of rows in the resulting set. So starting with a small table and joining such that the resulting join will also be small, is ideal. Things go pear-shaped if starting with a small table and joining it to a bigger table ends up just as large as the big table.

It's stats dependent though. If the data distribution changes, the calculation may change. It's also dependent on the implementation details of the join mechanism.

The worst cases that I've seen for MySQL that all but required straight_join or aggressive index hinting are queries that paginate over a lot of data in a strict sort order with light filtering. MySQL strongly prefers to use indexes for any filters and joins over sorts; this makes sense because most people aren't trying to sort the whole database but rather have a limited subset of rows that are responsive to the query, and sorting a limited subset is much faster than filtering the whole table, no matter whether it's sorted or not. In this case, putting straight join immediately after the table that had the indexed column I wanted to sort on fixed things.

--use 120s, 18 million data
explain SELECT DISTINCT d.taid
FROM tvassist_recommend_list_everyday_diverse d, tvassist_taid_all t
WHERE d.taid = t.taid
AND t.client_version >= '21004007'
AND t.utdid IS NOT NULL
AND d.recommend_day = '20170403'
LIMIT 0, 10000


--use 3.6s repalce by straight join
explain SELECT DISTINCT d.taid
FROM tvassist_recommend_list_everyday_diverse d
STRAIGHT_JOIN
tvassist_taid_all t on d.taid = t.taid
WHERE
t.client_version >= '21004007'
AND d.recommend_day = '20170403'


AND t.utdid IS NOT NULL
LIMIT 0, 10000

I will tell you why I had to use STRAIGHT_JOIN :

  • I had a performance issue with a query.
  • Simplifying the query, the query was suddently more efficient
  • Trying to figure out which specific part was bringing the issue, I just couldn't. (2 left joins together were slow, and each one was independently fast)
  • I then executed the EXPLAIN with both slow and fast query (addind one of the left joins)
  • Surprisingly, MySQL changed entirely the JOIN orders between the 2 queries.

Therefore I forced one of the joins to be straight_join to FORCE the previous join to be read first. This prevented MySQL to change the execution order and worked like a charm !

In my short experience, one of the situations that STRAIGHT_JOIN has reduced my query from 30 seconds to 100 milliseconds is that the first table in the execution plan was not the table that has the order by columns

-- table sales (45000000) rows
-- table stores (3) rows
SELECT whatever
FROM
sales
INNER JOIN stores ON sales.storeId = stores.id
ORDER BY sales.date, sales.id
LIMIT 50;
-- there is an index on (date, id)

IF the optimizer chooses to hit stores first it will cause Using index; Using temporary; Using filesort because

if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

source

here the optimizer needs a little help by telling him to hit sales first using

sales STRAIGHT_JOIN stores

If your query ends with ORDER BY... LIMIT..., it may be optimal to reformulate the query to trick the optimizer into doing the LIMIT before the JOIN.

(This Answer does not apply only to the original question about STRAIGHT_JOIN, nor does it apply to all cases of STRAIGHT_JOIN.)

Starting with the example by @Accountantم, this should run faster in most situations. (And it avoids needing hints.)

SELECT  whatever
FROM  ( SELECT id FROM sales
ORDER BY  date, id
LIMIT  50
) AS x
JOIN  sales   ON sales.id = x.id
JOIN  stores  ON sales.storeId = stores.id
ORDER BY  sales.date, sales.id;

Notes:

  • First, 50 ids are fetched. This will be especially fast with INDEX(date, id).
  • Then the join back to sales lets you get only 50 "whatevers" without hauling them around in a temp table.
  • since a subquery is, by definition, unordered, the ORDER BY must be repeated in the outer query. (The Optimizer may find a way to avoid actually doing another sort.)
  • Yes, it is messier. But it is usually faster.

I am opposed to using hits because "Even if it is faster today, it may fail to be faster tomorrow."

I know it's a bit old but here's a scenario, I've been doing batch script to populate a certain table. At some point, the query ran very slow. It appears that the join order was incorrect on particular records:

  • In correct order

enter image description here

  • Incrementing the id by 1 messes up the order. Notice the 'Extra' field

enter image description here

  • Using straight_join fixes the issue

enter image description here

Incorrect order runs for about 65 secs while using straight_join runs in milliseconds