You can also do it in Python, which gives you a proper list:
import elasticsearch
es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
It is better to use scroll and scan to get the result list, so that Elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl Python library this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (for example 1 or 2 minutes, which you can change); scan disables sorting. The scan helper function returns a Python generator, which can be safely iterated through.
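To make the note above concrete, here is a rough sketch of what walking a scroll cursor by hand can look like. It is not the helper's actual implementation. The function name `iter_ids_with_scroll` and its defaults are made up for illustration, and it assumes a client version where `search_type=scan` is not available, so it sorts by `_doc` to skip scoring instead:

```python
def iter_ids_with_scroll(es, index, scroll="2m", batch_size=1000):
    """Yield every document _id in `index` by walking a scroll cursor by hand.

    `es` is assumed to be an elasticsearch.Elasticsearch client.
    """
    # The first request opens the cursor. Sorting by _doc avoids ranking,
    # and "_source": False skips the document bodies since only ids are needed.
    page = es.search(
        index=index,
        scroll=scroll,
        body={
            "query": {"match_all": {}},
            "sort": ["_doc"],
            "size": batch_size,
            "_source": False,
        },
    )
    scroll_id = page["_scroll_id"]
    try:
        # An empty batch means the cursor is exhausted.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            # Each scroll call returns the next batch and renews the timeout.
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page["_scroll_id"]
    finally:
        # Release the server-side cursor instead of waiting for it to expire.
        es.clear_scroll(scroll_id=scroll_id)
```

Usage would be something like `ids = list(iter_ids_with_scroll(es, "my_index"))`. In practice the `scan` helper is the more convenient choice; this only shows the batching it hides.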
Inspired by @Aleck-Landgraf's answer, what worked for me was using the scan function from the standard elasticsearch Python API directly:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name", doc_type="your-doc-type"):
    print(dobj["_id"], end=" ")  # Python 3 form of `print dobj["_id"],`
Elaborating on the answers by Robert Lujo and Aleck Landgraf: if you want the IDs in a list from the returned generator, here is what I use:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
es = Elasticsearch(hosts=[YOUR_ES_HOST])
hits = helpers.scan(
    es,
    query={"query": {"match_all": {}}},
    scroll='1m',
    index=INDEX_NAME
)
ids = [hit['_id'] for hit in hits]
"fields" has been deprecated.
(Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")
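Following the hint in that error message, one way to fetch only the IDs without "fields" is to turn off `_source` in the query body. This is a minimal sketch rather than the only option (`stored_fields` is the other route the error mentions). The function names `ids_only_query` and `fetch_all_ids` are made up for illustration:

```python
def ids_only_query():
    """Query body that matches everything but returns no document source."""
    # "_source": False replaces the old `"fields": []` trick for id-only hits.
    return {"query": {"match_all": {}}, "_source": False}


def fetch_all_ids(es, index, scroll="1m"):
    """Return all document ids in `index` as a list.

    `es` is assumed to be an elasticsearch.Elasticsearch client.
    """
    # Imported here so the query builder above is usable without the client library.
    from elasticsearch import helpers

    hits = helpers.scan(es, query=ids_only_query(), index=index, scroll=scroll)
    return [hit["_id"] for hit in hits]
```

Each hit still carries `_id`, `_index` and similar metadata; only the document body is left out.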
I know this post has a lot of answers, but I want to combine several to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers module can be used with sliced scroll, which allows multi-threaded execution. In my case, I also have a high-cardinality field to slice on (acquired_at). You'll see I set max_workers to 14, but you may want to vary this depending on your machine.
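As a rough illustration of the idea only (not the exact code from this answer), a sliced scroll can be driven from a thread pool along these lines. The function names are made up, and the sketch assumes `acquired_at` is a numeric or date field with doc values, which slicing on a custom field requires:

```python
from concurrent.futures import ThreadPoolExecutor


def slice_query(slice_id, num_slices, field=None):
    """Build the body for one slice of a sliced scroll (num_slices must be > 1)."""
    slice_spec = {"id": slice_id, "max": num_slices}
    if field is not None:
        # Optional: partition on this field instead of the default _id-based split.
        slice_spec["field"] = field
    return {"slice": slice_spec, "query": {"match_all": {}}, "_source": False}


def ids_for_slice(es, index, slice_id, num_slices, field=None):
    """Scroll through a single slice and return its document ids."""
    from elasticsearch import helpers  # imported lazily; needs the client library

    hits = helpers.scan(
        es,
        query=slice_query(slice_id, num_slices, field),
        index=index,
        scroll="1m",
    )
    return [hit["_id"] for hit in hits]


def all_ids_sliced(es, index, num_slices=14, field="acquired_at"):
    """Fetch ids from every slice in parallel, one worker thread per slice."""
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        futures = [
            pool.submit(ids_for_slice, es, index, i, num_slices, field)
            for i in range(num_slices)
        ]
        ids = []
        for future in futures:
            # result() re-raises any exception that occurred in the worker.
            ids.extend(future.result())
    return ids
```

Here the number of slices and the number of workers are the same value; whether 14 is a sensible number depends on the machine and on how many shards the index has.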