Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page through the results. When the user requests a high page number, we get the following error message:

Result window is too large, from + size must be less than or equal to: [10000] but was [10020]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.
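The guard that produces this message is a simple bounds check on the request. A minimal Python sketch of the same validation (the constant and function names are illustrative, not Elasticsearch's actual code):

```python
MAX_RESULT_WINDOW = 10000  # default value of index.max_result_window

def validate_window(from_, size, max_window=MAX_RESULT_WINDOW):
    """Raise if the requested result window exceeds the index-level limit."""
    if from_ + size > max_window:
        raise ValueError(
            f"Result window is too large, from + size must be less than or "
            f"equal to: [{max_window}] but was [{from_ + size}]"
        )

validate_window(0, 20)        # page 1: fine
# validate_window(10000, 20)  # would raise: 10020 > 10000
```

Requesting results 10000-10020 means from=10000, size=20, so from + size = 10020 trips the default limit.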

The Elastic docs say that this is because of high memory consumption and recommend using the scroll API instead:

Values higher than that can consume significant chunks of heap memory per search and per shard executing the search. It's safest to leave this value as it is and use the scroll api for any deep scrolling. https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits

The thing is, I do not want to retrieve large data sets. I only want to retrieve a slice from the result set which is very high up in the result set. The scrolling docs also say:

Scrolling is not intended for real time user requests. https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html

This leaves me with some questions:

1) Would the memory consumption really be lower if I used the scroll API to scroll up to result 10020 (and disregard everything below 10000) instead of doing a "normal" search request for results 10000-10020?

2) It does not seem that the scrolling API is an option for me; instead I would have to increase "index.max_result_window". Does anyone have any experience with this?

3) Are there any other options to solve my problem?

101857 views

The following pages in the elastic documentation talk about deep paging:

https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html

Depending on the size of your documents, the number of shards, and the hardware you are using, paging 10,000 to 50,000 results (1,000 to 5,000 pages) deep should be perfectly doable. But with big-enough from values, the sorting process can become very heavy indeed, using vast amounts of CPU, memory, and bandwidth. For this reason, we strongly advise against deep paging.
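The cost described above comes from every shard having to produce its top from + size candidates, which the coordinating node then merges only to discard everything before the requested offset. A toy Python simulation of that merge (in-memory lists standing in for shards; this models the behavior, it is not Elasticsearch code):

```python
import heapq

def paged_search(shards, from_, size):
    """Toy model of distributed paging: each shard returns its top
    from_ + size hits, and the coordinator merges them to keep just `size`."""
    per_shard = [sorted(shard)[:from_ + size] for shard in shards]
    merged = heapq.merge(*per_shard)          # coordinator-side merge
    page = list(merged)[from_:from_ + size]   # hits before `from_` are thrown away
    transferred = sum(len(p) for p in per_shard)
    return page, transferred

# 3 "shards" of 1000 fake sort values each.
shards = [list(range(i, 3000, 3)) for i in range(3)]
page, moved = paged_search(shards, from_=990, size=10)
# For a 10-hit page at offset 990, each of the 3 shards ships 1000 hits:
# 3000 candidates are moved and merged to keep just 10.
```

As from_ grows, the transferred and merged volume grows with shards × (from + size), which is exactly why deep pages get expensive.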

If you need deep pagination, one possible solution is to increase the value of max_result_window. You can use curl to do this from your shell command line:

curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'

I did not notice increased memory usage, for values of ~ 100k.

Use the Scroll API to get more than 10000 results.

Scroll example in ElasticSearch NEST API

I have used it like this:

private static Customer[] GetCustomers(IElasticClient elasticClient)
{
    var customers = new List<Customer>();

    // Open a scroll context. In Scan mode the initial request returns no
    // documents, only a scroll id that the Scroll calls below consume.
    var searchResult = elasticClient.Search<Customer>(s => s
        .Index(IndexAlias.ForCustomers())
        .Size(10000)
        .SearchType(SearchType.Scan)
        .Scroll("1m"));

    do
    {
        // Fetch the next batch, keeping the scroll context alive for one minute.
        searchResult = elasticClient.Scroll<Customer>("1m", searchResult.ScrollId);
        customers.AddRange(searchResult.Documents);
    } while (searchResult.IsValid && searchResult.Documents.Any());

    return customers.ToArray();
}

If you want more than 10,000 results, memory usage on all the data nodes will be very high, because each node has to return more results for every query request. And if you have more data and more shards, merging those results will be inefficient. Elasticsearch also caches the filter context, which costs additional memory. You have to find out by trial and error how much you can afford. If you are getting many requests in a small window, you should split anything above 10k into multiple queries and merge the results yourself in code, which should take less application memory than increasing the window size.
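One way to read "do multiple queries and merge yourself" is to partition the request into ranges of the sort field, so each individual query stays inside the result window, then concatenate client-side. A hedged sketch against a stand-in search function (search_fn and the range bounds are assumptions for illustration, not a real client call):

```python
def fetch_in_chunks(search_fn, bounds):
    """Run one range-filtered query per (lo, hi) interval of the sort field
    and concatenate the hits, so no single query needs from + size > 10k.

    search_fn(lo, hi) stands in for a real Elasticsearch range query; here
    it is any callable returning hits whose sort value lies in [lo, hi).
    """
    results = []
    for lo, hi in bounds:
        results.extend(search_fn(lo, hi))  # each call stays inside the window
    return results

# Stand-in "index": 25,000 docs keyed by an increasing sort value.
docs = list(range(25_000))
fake_search = lambda lo, hi: [d for d in docs if lo <= d < hi]

# Three range queries instead of one from=0, size=25000 request.
hits = fetch_in_chunks(
    fake_search, [(0, 10_000), (10_000, 20_000), (20_000, 25_000)])
```

This only works when the sort field gives you usable range boundaries; otherwise scroll remains the documented tool for reading a full result set.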

2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?

--> You can define this value in index templates. The template will apply to new indexes only, so you either have to delete old indexes after creating the template or wait for new data to be ingested into Elasticsearch.

{
  "order": 1,
  "template": "index_template*",
  "settings": {
    "index.number_of_replicas": "0",
    "index.number_of_shards": "1",
    "index.max_result_window": 2147483647
  }
}

The right solution would be to use scrolling.
However, if you want to extend the results search returns beyond 10,000 results, you can do it easily with Kibana:

Go to Dev Tools and just post the following to your index (your_index_name), specifying what the new max result window should be:


PUT your_index_name/_settings
{
  "max_result_window": 500000
}

If all goes well, you should see the following success response:

{
  "acknowledged": true
}

In my case it looks like reducing the results via the from and size parameters on the query removes the error, as we don't need all the results:

GET widgets_development/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {}
  },
  "sort": {
    "col_one": "asc"
  }
}
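from and size behave like a slice over the sorted hit list; a small Python equivalent of what the query above asks for (the documents and the col_one field are made-up illustrations):

```python
# from/size map directly onto list slicing over the sorted hits.
docs = [{"col_one": v} for v in [42, 7, 19, 3, 88, 61, 5]]

def search(docs, from_, size, sort_field):
    """Sort ascending on sort_field, then return the [from_, from_ + size) slice."""
    ranked = sorted(docs, key=lambda d: d[sort_field])
    return ranked[from_:from_ + size]

page = search(docs, from_=0, size=5, sort_field="col_one")
# First five documents by ascending col_one: 3, 5, 7, 19, 42
```

Keeping from + size at or below index.max_result_window is what makes such a request pass the check from the original error message.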