Random sampling in Google BigQuery

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM
(SELECT RAND() AS random, word FROM [publicdata:samples.shakespeare] ORDER BY random)
LIMIT 10

My question is: are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference


For stratified sampling, check https://stackoverflow.com/a/52901452/132438


Good job finding it :). I requested the function recently, but it hasn't made it to documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).
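For comparison, the HASH()-based approach the question refers to looks roughly like this (a sketch in legacy SQL that samples about 10% of rows; the same rows come back on every run):

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE ABS(HASH(word)) % 10 = 0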

In case you want the variability that RAND() brings while still getting consistent results - you can seed it with an integer, as in RAND(3).
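For example, reusing the query from the question with a fixed seed (a sketch in legacy SQL using the seeded form described above), so repeated runs return the same 10 rows:

SELECT word FROM
(SELECT RAND(3) AS random, word FROM [publicdata:samples.shakespeare] ORDER BY random)
LIMIT 10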

Notice though that the example you pasted is doing a full sort of the random values - for sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)


standardSQL update:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)
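
If you need an exact sample size at scale, one common refinement (a sketch, not part of the original answer) is to over-sample with the WHERE filter and then trim with LIMIT, so only the few pre-filtered rows get sorted:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 100/164656  -- over-sample to ~100 rows cheaply
ORDER BY RAND()
LIMIT 10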

Great to know RAND() is available!

In my case I needed a predefined sample size. Instead of needing to know the total number of rows and dividing the sample size by the total, I'm using the following query:

SELECT word, RAND(5) AS rand
FROM [publicdata:samples.shakespeare]
ORDER BY rand
-- Sample size needed = 10
LIMIT 10

Summarizing, I use ORDER BY + LIMIT to randomize and then extract a defined number of samples.

One additional tip to make it even simpler: you can order by the function itself, i.e.:

SELECT x FROM y ORDER BY RAND() LIMIT 100

=> Sample of 100

Once you calculate what percentage of the total you need, you can sample in a couple of ways.

As mentioned before, one way is non-deterministic sampling (every run returns a different sample) using RAND(). For example, if you want 0.1% of your total database sampled, you would do:

SELECT *
FROM `dataset.table`
WHERE RAND() < 0.001

You could make it deterministic by saving the result as a table so you can query it later; you could also select just one key column and save only that to be used in the future.
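
A minimal sketch of that idea (assuming standard SQL, a hypothetical destination table dataset.my_sample, and a hypothetical key column customer_id):

CREATE TABLE `dataset.my_sample` AS
SELECT customer_id  -- hypothetical key column; keep only this to re-join later
FROM `dataset.table`
WHERE RAND() < 0.001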


Another way that gets you the same repeatable random sample is to use a hashing function to generate a fingerprint of your unique identifier column, and then to select rows based on the last two digits of the fingerprint. The query below labels a random sample of about 70% of the total database; afterwards you can filter the table on in_sample = 'True':

SELECT
  *,
  IF(MOD(ABS(FARM_FINGERPRINT(CAST(YOUR_COLUMN AS STRING))), 100) < 70, 'True', 'False')
    AS in_sample
FROM (
  SELECT DISTINCT(YOUR_UNIQUE_IDENTIFIER_COLUMN) AS YOUR_COLUMN
  FROM `dataset.table`)

If you don't have a unique identifier column, you could make one with GENERATE_UUID(), use ROW_NUMBER() OVER (), or concatenate multiple columns into one, as in the sketch below.
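
A minimal sketch of the concatenation option (col_a and col_b are hypothetical columns); unlike GENERATE_UUID(), this key stays stable across runs, which is what makes the sample repeatable:

SELECT
  *,
  CONCAT(CAST(col_a AS STRING), '-', CAST(col_b AS STRING)) AS row_key  -- hypothetical key
FROM `dataset.table`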


A similar way as above, but using the legacy SQL HASH() function. It is repeatable and gets you 70% of your sample; if you want a different percentage, just change the 7 to your desired value:

SELECT
  *
FROM
  [dataset.table]
WHERE
  ABS(HASH(YOUR_COLUMN)) % 10 < 7

I don't know about the scalability of FARM_FINGERPRINT vs. HASH, so I mentioned both; one may work better than the other for some use cases.

Best of luck,

BigQuery has a table sampling option, which lets us query random subsets of data from large BigQuery tables. We can also specify the percentage of records to sample.

Example:

SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)


SELECT *
FROM dataset.table1 T1 TABLESAMPLE SYSTEM (10 PERCENT)
JOIN dataset.table2 T2 TABLESAMPLE SYSTEM (20 PERCENT) USING (customer_id)

https://cloud.google.com/bigquery/docs/table-sampling