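Random record from a database table (T-SQL)

Is there a succinct way to retrieve a random record from a SQL Server table?

I would like to randomize my unit test data, so I am looking for a simple way to select a random id from a table. In plain English, the select would be "Select one id from the table where the id is a random number between the lowest id in the table and the highest id in the table."

I can't figure out a way to do it without having to run the query, test for a null value, then re-run if null.

Any ideas?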

Is there a succinct way to retrieve a random record from a sql server table?

Yes

SELECT TOP 1 * FROM table ORDER BY NEWID()

Explanation

A NEWID() is generated for each row and the table is then sorted by it. The first record is returned (i.e. the record with the "lowest" GUID).

Notes

  1. As of UUID version 4, GUIDs are generated from pseudo-random numbers:

    The version 4 UUID is meant for generating UUIDs from truly-random or pseudo-random numbers.

    The algorithm is as follows:

    • Set the two most significant bits (bits 6 and 7) of the clock_seq_hi_and_reserved to zero and one, respectively.
    • Set the four most significant bits (bits 12 through 15) of the time_hi_and_version field to the 4-bit version number from Section 4.1.3.
    • Set all the other bits to randomly (or pseudo-randomly) chosen values.

    A Universally Unique IDentifier (UUID) URN Namespace - RFC 4122

  2. The alternative SELECT TOP 1 * FROM table ORDER BY RAND() will not work as one would think. RAND() returns one single value per query, thus all rows will share the same value.

  3. While GUID values are pseudo-random, you will need a better PRNG for more demanding applications.

  4. Typical performance is less than 10 seconds for around 1,000,000 rows, depending on the system, of course. Note that the sort cannot use an index, so performance will be relatively limited.
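
Note 2 is easy to check directly. The query below is a minimal sketch, assuming only that the built-in catalog view sys.objects is readable in the current database; the column aliases are made up for illustration.

```sql
-- RAND() without a seed is evaluated once per query,
-- while NEWID() is evaluated once per row.
SELECT TOP (5)
    name,
    RAND()  AS rand_value,   -- the same value repeated on every row
    NEWID() AS newid_value   -- a different GUID on every row
FROM sys.objects;
```

Because rand_value is constant across the result set, sorting by it does not shuffle anything, whereas sorting by newid_value does.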

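You can also use your own method: get a random Id between MIN(Id) and MAX(Id), and then

SELECT TOP 1 * FROM table WHERE Id >= @yourrandomid ORDER BY Id

It will always get you one row, because at least the row with MAX(Id) satisfies the condition. The ORDER BY Id makes TOP 1 return the closest existing Id rather than an arbitrary matching row.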
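
Putting the pieces together might look like the sketch below. It assumes a hypothetical table dbo.YourTable with an int column Id; those names are placeholders, not something from the question. Keep in mind that when the Id values have gaps, a row that follows a gap is picked more often than its neighbours, so the result is not uniformly random.

```sql
-- Assumption: dbo.YourTable is a hypothetical table with an int column Id.
DECLARE @minId int, @maxId int, @yourrandomid int;

SELECT @minId = MIN(Id), @maxId = MAX(Id)
FROM dbo.YourTable;

-- RAND() returns a float in [0, 1), so this yields an integer
-- between @minId and @maxId inclusive.
SET @yourrandomid = @minId + FLOOR(RAND() * (CAST(@maxId AS float) - @minId + 1));

-- ORDER BY Id makes TOP (1) return the first existing Id at or above the random value.
SELECT TOP (1) *
FROM dbo.YourTable
WHERE Id >= @yourrandomid
ORDER BY Id;
```

If the table is empty, both variables stay NULL and the final query simply returns no rows.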

On larger tables you can also use TABLESAMPLE for this to avoid scanning the whole table.

SELECT  TOP 1 *
FROM YourTable
TABLESAMPLE (1000 ROWS)
ORDER BY NEWID()

The ORDER BY NEWID is still required to avoid just returning rows that appear first on the data page.

The number of rows to sample needs to be chosen carefully for the size and definition of the table, and you might consider retry logic if no row is returned. The maths behind this, and why the technique is not suited to small tables, is discussed here
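
The retry logic mentioned above could be sketched roughly as follows. This is only one way to do it, assuming a hypothetical table dbo.YourTable with an int key column Id; the limit of 5 attempts is arbitrary.

```sql
-- Assumptions: dbo.YourTable and its int key column Id are hypothetical;
-- the attempt limit of 5 is arbitrary.
DECLARE @picked TABLE (Id int);
DECLARE @attempt int = 0;

-- TABLESAMPLE works on whole pages, so a sample can come back empty.
-- Try again until a row is found or the attempt limit is reached.
WHILE @attempt < 5 AND NOT EXISTS (SELECT 1 FROM @picked)
BEGIN
    INSERT INTO @picked (Id)
    SELECT TOP (1) Id
    FROM dbo.YourTable TABLESAMPLE (1000 ROWS)
    ORDER BY NEWID();

    SET @attempt += 1;
END;

-- Returns the chosen row, or nothing if every attempt came back empty.
SELECT t.*
FROM dbo.YourTable AS t
JOIN @picked AS p ON p.Id = t.Id;
```

If every attempt comes back empty, falling back to the plain ORDER BY NEWID() query is a reasonable last resort.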

I was looking to improve on the methods I had tried and came across this post. I realize it's old, but this method is not listed. I am creating and applying test data; this shows the method for "address" in a stored procedure called with @st (a two-character state code).

-- Declared up front so that nothing runs between the Insert and the use of @@ROWCOUNT.
Declare @csr Int

Create Table ##TmpAddress (id Int Identity(1,1), street VarChar(50), city VarChar(50), st VarChar(2), zip VarChar(5))

Insert Into ##TmpAddress (street, city, st, zip)
Select street, city, st, zip
From tbl_Address (NOLOCK)
Where st = @st

-- Unseeded RAND() will return the same number when called in rapid succession, so
-- here I seed it with a guaranteed different number each time.
-- @@ROWCOUNT is the count from the most recent table operation (the Insert above).
Set @csr = Ceiling(RAND(Convert(varbinary, NewId())) * @@ROWCOUNT)

-- Pick the row whose identity value matches the random position; pad zip to 5 digits.
Select street, city, st, Right(('00000' + LTrim(zip)), 5) As zip
From ##TmpAddress (NOLOCK)
Where id = @csr

If you want to sample rows from a large table, the best way that I know is:

SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(keycol1, NEWID())) AS int)) % 100) < 10

Source: MSDN

If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)

The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.

Source: http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx

This is further explained below:

How does this work? Let's split out the WHERE clause and explain it.

The CHECKSUM function calculates a checksum over the items in the list. It is arguable whether SalesOrderID is even required, since NEWID() is a function that returns a new random GUID, so combining a random figure with a constant should result in a random figure in any case. Indeed, excluding SalesOrderID seems to make no difference. If you are a keen statistician and can justify the inclusion of this, please use the comments section below and let me know why I'm wrong!

The CHECKSUM function returns an int. Performing a bitwise AND operation with 0x7fffffff, which is a 0 followed by 31 ones in binary, clears the sign bit and yields a non-negative integer that is effectively a representation of a random string of 0s and 1s. Dividing by 0x7fffffff normalizes this figure to a value between 0 and 1. Then, to decide whether each row merits inclusion in the final result set, a threshold of x/100 is used (in this case, 0.01), where x is the percentage of the data to retrieve as a sample.
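
To make the threshold explicit, the same query can be written with the percentage pulled out into a variable. This is a sketch only: @percent is a made-up name, and the table is the Sales.SalesOrderDetail sample table used in the quoted documentation.

```sql
-- @percent is a hypothetical variable: the approximate percentage of rows to keep.
DECLARE @percent float = 1.0;

SELECT *
FROM Sales.SalesOrderDetail
WHERE @percent / 100.0 >=
      -- Clear the sign bit, then scale the non-negative int into the range 0 to 1.
      CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
      / CAST(0x7fffffff AS int);
```

Each row is kept or dropped independently, so the number of rows returned varies from run to run and is only approximately @percent percent of the table.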

Source: https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling