How do I use Elasticsearch with MongoDB?

小开

Best answer

This answer should be enough to get you set up to follow this tutorial on building a functional search component with MongoDB, Elasticsearch, and AngularJS.

If you're looking to use faceted search with data from an API, then Matthiasn's BirdWatch repo is something you might want to look at.

So here's how you can set up a single-node Elasticsearch "cluster" to index MongoDB for use in a NodeJS, Express app on a fresh EC2 Ubuntu 14.04 instance.

Make sure everything is up to date.

sudo apt-get update

Install NodeJS.

sudo apt-get install nodejs
sudo apt-get install npm

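Install MongoDB. These steps are straight from the MongoDB docs. Choose whatever version you're comfortable with. I'm sticking with v2.4.9 because it seems to be the most recent version MongoDB-River supports without issues.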

Import the MongoDB public GPG key.

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

Update your sources list.

echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list

Get the 10gen package.

sudo apt-get install mongodb-10gen

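Then pick your version if you don't want the most recent. If you are setting your environment up on a Windows 7 or 8 machine, stay away from v2.6 until they work out some bugs with running it as a service.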

sudo apt-get install mongodb-10gen=2.4.9

Prevent the version of your MongoDB installation from being bumped up when you update.

echo "mongodb-10gen hold" | sudo dpkg --set-selections

Start the MongoDB service.

sudo service mongodb start

Your database files default to /var/lib/mongo and your log files to /var/log/mongo.

Create a database through the mongo shell and push some dummy data into it.

mongo YOUR_DATABASE_NAME
db.createCollection("YOUR_COLLECTION_NAME")
for (var i = 1; i <= 25; i++) db.YOUR_COLLECTION_NAME.insert( { x : i } )

Now convert the standalone MongoDB into a replica set.

First shut down the process.

mongo YOUR_DATABASE_NAME
use admin
db.shutdownServer()

Now we're running MongoDB as a service, so we don't pass the "--replSet rs0" option as a command-line argument when we restart the mongod process. Instead, we put it in the mongod.conf file.

vi /etc/mongod.conf

Add these lines, substituting your own db and log paths.

replSet=rs0
dbpath=YOUR_PATH_TO_DATA/DB
logpath=YOUR_PATH_TO_LOG/MONGO.LOG

Now open up the mongo shell again to initialize the replica set.

mongo DATABASE_NAME
config = { "_id" : "rs0", "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017" } ] }
rs.initiate(config)
rs.slaveOk() // allows read operations to run on secondary members.

Now install Elasticsearch. I'm just following this helpful Gist.

Make sure Java is installed.

sudo apt-get install openjdk-7-jre-headless -y

Stick with v1.1.x for now until the Mongo-River plugin bug gets fixed in v1.2.1.

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.deb
sudo dpkg -i elasticsearch-1.1.1.deb


curl -L http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
sudo mv *servicewrapper*/service /usr/local/share/elasticsearch/bin/
sudo rm -Rf *servicewrapper*
sudo /usr/local/share/elasticsearch/bin/service/elasticsearch install
sudo ln -s `readlink -f /usr/local/share/elasticsearch/bin/service/elasticsearch` /usr/local/bin/rcelasticsearch

Make sure /etc/elasticsearch/elasticsearch.yml has the following config options enabled if you're only developing on a single node for now:

cluster.name: "MY_CLUSTER_NAME"
node.local: true

Start the Elasticsearch service.

sudo service elasticsearch start

Verify it's working.

curl http://localhost:9200

If you see something like this then you're good.

{
  "status" : 200,
  "name" : "Chi Demon",
  "version" : {
    "number" : "1.1.2",
    "build_hash" : "e511f7b28b77c4d99175905fac65bffbf4c80cf7",
    "build_timestamp" : "2014-05-22T12:27:39Z",
    "build_snapshot" : false,
    "lucene_version" : "4.7"
  },
  "tagline" : "You Know, for Search"
}
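
If the Node app should refuse to start against an Elasticsearch version the river plugin doesn't handle, the response above can be checked in code. This is a minimal JavaScript sketch that assumes the 1.1.x recommendation from this answer; `isRiverCompatible` is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: inspects the JSON returned by GET http://localhost:9200
// and reports whether the server is on the 1.1.x line recommended above.
function isRiverCompatible(info) {
  const number = info && info.version && info.version.number;
  if (typeof number !== 'string') return false;
  const [major, minor] = number.split('.').map(Number);
  return major === 1 && minor === 1;
}

// Sample taken from the response shown above.
const sample = { status: 200, version: { number: '1.1.2' } };
console.log(isRiverCompatible(sample)); // true
```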

Now install the Elasticsearch plugins so it can play with MongoDB.

bin/plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/1.6.0
bin/plugin --install elasticsearch/elasticsearch-mapper-attachments/1.6.0

These two plugins aren't necessary, but they're good for testing queries and visualizing changes to your indexes.

bin/plugin --install mobz/elasticsearch-head
bin/plugin --install lukas-vlcek/bigdesk

Restart Elasticsearch.

sudo service elasticsearch restart

Finally, index a collection from MongoDB.

curl -XPUT localhost:9200/_river/DATABASE_NAME/_meta -d '{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "127.0.0.1", "port": 27017 }
    ],
    "db": "DATABASE_NAME",
    "collection": "ACTUAL_COLLECTION_NAME",
    "options": { "secondary_read_preference": true },
    "gridfs": false
  },
  "index": {
    "name": "ARBITRARY INDEX NAME",
    "type": "ARBITRARY TYPE NAME"
  }
}'
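
The same river definition can be built inside the Node app rather than typed by hand. The following is only a sketch: `buildRiverRequest` and its parameter names are made up for illustration, and it simply mirrors the JSON body above without sending anything.

```javascript
// Hypothetical helper: returns the path and JSON body for the river _meta PUT.
// The body mirrors the curl example above; host and port default to its values.
function buildRiverRequest({ db, collection, indexName, typeName, host = '127.0.0.1', port = 27017 }) {
  return {
    path: `/_river/${encodeURIComponent(db)}/_meta`,
    body: {
      type: 'mongodb',
      mongodb: {
        servers: [{ host, port }],
        db,
        collection,
        options: { secondary_read_preference: true },
        gridfs: false,
      },
      index: { name: indexName, type: typeName },
    },
  };
}

// Example values are placeholders, not names from the original answer.
const req = buildRiverRequest({
  db: 'mydb',
  collection: 'pages',
  indexName: 'pages_idx',
  typeName: 'page',
});
console.log(req.path);
console.log(JSON.stringify(req.body, null, 2));
```

Elasticsearch index names must be lowercase and cannot contain spaces, so the ARBITRARY INDEX NAME placeholder needs a value along the lines of `pages_idx`.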

Check that your index is in Elasticsearch.

curl -XGET http://localhost:9200/_aliases

Check your cluster health.

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

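It's probably yellow with some unassigned shards. A single node has nowhere to place replica shards, so we have to tell Elasticsearch not to expect any.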

curl -XPUT 'localhost:9200/_settings' -d '{ "index" : { "number_of_replicas" : 0 } }'

Check cluster health again. It should be green now.

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
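
The yellow status follows from the setup: a replica shard is never placed on the same node as its primary, so a one-node cluster leaves every replica unassigned. Here is a small sketch of that check against the `_cluster/health` response. `shouldDropReplicas` is a hypothetical name; the three fields it reads are part of that response.

```javascript
// Hypothetical helper: decides whether setting number_of_replicas to 0 is the
// likely fix, based on a parsed _cluster/health response.
function shouldDropReplicas(health) {
  return health.status === 'yellow'
    && health.number_of_data_nodes === 1
    && health.unassigned_shards > 0;
}

// Illustrative values for a single-node cluster with unassigned replicas.
const health = { status: 'yellow', number_of_data_nodes: 1, unassigned_shards: 5 };
console.log(shouldDropReplicas(health)); // true
```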

Go play.

小开

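Using river can present issues when your operation scales up. River will use a ton of memory under heavy load. I recommend implementing your own Elasticsearch models, or if you're using mongoose you can build your Elasticsearch models right into that, or use mongoosastic, which essentially does this for you.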

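Another disadvantage of MongoDB River is that you'll be stuck on the MongoDB 2.4.x branch and ElasticSearch 0.90.x. You'll start to find that you're missing out on a lot of really nice features, and the MongoDB River project just doesn't produce a usable product fast enough to stay stable. That said, MongoDB River is definitely not something I'd go into production with. It has posed more problems than it's worth. It will randomly drop writes under heavy load, it will consume lots of memory, and there's no setting to cap that. Additionally, river doesn't update in real time. It reads the oplog from MongoDB, and in my experience this can delay updates for as long as 5 minutes.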

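We recently had to rewrite a large portion of our project, because something went wrong with ElasticSearch on a weekly basis. We even went as far as hiring a DevOps consultant, who also agreed that it's best to move away from River.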

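UPDATE: Elasticsearch-mongodb-river now supports ES v1.4.0 and MongoDB v2.6.x. However, you'll still likely run into performance problems on heavy insert/update operations, because this plugin tries to read MongoDB's oplog to sync. If a lot of operations have accumulated by the time the lock (or latch, rather) unlocks, you'll notice extremely high memory usage on your Elasticsearch server. If you plan on having a large operation, river is not a good option. The developers of ElasticSearch still recommend that you manage your own indexes by talking to their API directly through the client library for your language, rather than using river. This isn't really the purpose of river. Twitter-river is a great example of how river should be used. It's essentially a great way to pull in data from outside sources, but it's not very reliable for high traffic or internal use.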

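Also consider that mongodb-river falls behind in version, as it's maintained by a third party rather than by the ElasticSearch organization. Development was stuck on the v0.90 branch for a long time after the release of v1.0, and when a version for v1.0 was released it wasn't stable until Elasticsearch released v1.3.0. MongoDB versions also fall behind. You may find yourself in a tight spot when you want to move to a later version of either, especially with ElasticSearch under such heavy development and many highly anticipated features on the way. Staying on the latest ElasticSearch has been very important for us, as we rely heavily on constantly improving our search functionality, which is a core part of our product.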

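All in all, you'll likely get a better product if you do it yourself. It's not that difficult. It's just another database to manage in your code, and it can easily be dropped into your existing models without major refactoring.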
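
As a rough illustration of what managing the index yourself involves, the sketch below turns MongoDB documents into a request body for Elasticsearch's `_bulk` endpoint. `toBulkBody` is a hypothetical helper. It assumes an Elasticsearch version that still uses mapping types (as the 1.x releases discussed here do) and leaves the HTTP call to whichever client you use.

```javascript
// Hypothetical helper: builds a newline-delimited _bulk body.
// Each document becomes two lines: an action line, then the document source.
// The MongoDB _id is moved into the action metadata and left out of the source.
function toBulkBody(docs, index, type) {
  const lines = [];
  for (const doc of docs) {
    const { _id, ...source } = doc;
    lines.push(JSON.stringify({ index: { _index: index, _type: type, _id: String(_id) } }));
    lines.push(JSON.stringify(source));
  }
  // The bulk API requires the body to end with a newline.
  return lines.join('\n') + '\n';
}

// Placeholder documents and names, for illustration only.
const body = toBulkBody(
  [{ _id: 'a1', title: 'Home' }, { _id: 'b2', title: 'About' }],
  'pages',
  'page'
);
process.stdout.write(body);
```

Calling a function like this from your model's save and remove hooks is one way to keep the index in step with MongoDB without a river.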

小开

Here is how to do this on MongoDB 3.0. I used this nice blog.

Install mongodb.
Create data directories:

$ mkdir RANDOM_PATH/node1
$ mkdir RANDOM_PATH/node2
$ mkdir RANDOM_PATH/node3

Start the mongod instances:

$ mongod --replSet test --port 27021 --dbpath node1
$ mongod --replSet test --port 27022 --dbpath node2
$ mongod --replSet test --port 27023 --dbpath node3

Configure a replica set:

$ mongo --port 27021
config = {_id: 'test', members: [ {_id: 0, host: 'localhost:27021'}, {_id: 1, host: 'localhost:27022'}]};
rs.initiate(config);
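
For more than a couple of members, the config object can be generated instead of typed. This is a sketch in Node.js: `buildReplicaSetConfig` is a hypothetical helper that produces the same shape as the config above, and the third host is included on the assumption that all three started instances are meant to join the set.

```javascript
// Hypothetical helper: builds a replica set config object from a list of
// host:port strings, numbering the members from 0.
function buildReplicaSetConfig(name, hosts) {
  return {
    _id: name,
    members: hosts.map((host, i) => ({ _id: i, host })),
  };
}

const cfg = buildReplicaSetConfig('test', [
  'localhost:27021',
  'localhost:27022',
  'localhost:27023',
]);
console.log(JSON.stringify(cfg, null, 2));
```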

Install Elasticsearch:

a. Download and unzip the latest Elasticsearch distribution.
b. Run bin/elasticsearch to start the ES server.
c. Run curl -XGET http://localhost:9200/ to confirm it is working.

Install and configure the MongoDB River:

$ bin/plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb

$ bin/plugin --install elasticsearch/elasticsearch-mapper-attachments

Create the "river" and the index:

curl -XPUT 'http://localhost:9200/_river/mongodb/_meta' -d '{ "type": "mongodb", "mongodb": { "db": "mydb", "collection": "foo" }, "index": { "name": "name", "type": "random" } }'

Test in a browser:

http://localhost:9200/_search?q=home
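
From code, the same query-string search needs the query URL-encoded. Here is a small sketch; `buildSearchUrl` is a hypothetical helper and `my_index` is a placeholder index name.

```javascript
// Hypothetical helper: builds a URI-search URL, optionally scoped to one index.
function buildSearchUrl(baseUrl, query, index) {
  const path = index ? `/${encodeURIComponent(index)}/_search` : '/_search';
  return `${baseUrl}${path}?q=${encodeURIComponent(query)}`;
}

const allIndexes = buildSearchUrl('http://localhost:9200', 'home');
const oneIndex = buildSearchUrl('http://localhost:9200', 'title:"home page"', 'my_index');
console.log(allIndexes); // http://localhost:9200/_search?q=home
console.log(oneIndex);
```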

小开

I found mongo-connector useful. It is from Mongo Labs (MongoDB Inc.) and can now be used with Elasticsearch 2.x.

Elastic 2.x doc manager: https://github.com/mongodb-labs/elastic2-doc-manager

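mongo-connector creates a pipeline from a MongoDB cluster to one or more target systems, such as Solr, Elasticsearch, or another MongoDB cluster. It synchronizes data in MongoDB to the target, then tails the MongoDB oplog, keeping up with operations in MongoDB in real time. It has been tested with Python 2.6, 2.7, and 3.3+. Detailed documentation is available on the wiki.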

https://github.com/mongodb-labs/mongo-connector
https://github.com/mongodb-labs/mongo-connector/wiki/usage%20with%20elasticsearch

小开

River is a good solution if you want almost real-time synchronization and a general-purpose solution.

If you already have data in MongoDB and want to ship it to Elasticsearch very easily as a "one-shot", you can try the Node.js package https://github.com/itemsapi/elasticbulk.

It uses Node.js streams, so you can import data from anything that supports streams (e.g. MongoDB, PostgreSQL, MySQL, JSON files, etc.).

Example for MongoDB to Elasticsearch:

Install packages:

npm install elasticbulk
npm install mongoose
npm install bluebird

Create a script, e.g. script.js:

const elasticbulk = require('elasticbulk');
const mongoose = require('mongoose');
const Promise = require('bluebird');

mongoose.connect('mongodb://localhost/your_database_name', {
  useMongoClient: true
});

// Use bluebird promises in mongoose.
mongoose.Promise = Promise;

// Model bound explicitly to an existing collection.
var Page = mongoose.model('Page', new mongoose.Schema({
  title: String,
  categories: Array
}), 'your_collection_name');

// Stream the query results instead of loading them all into memory.
var stream = Page.find({
}, {title: 1, _id: 0, categories: 1}).limit(1500000).skip(0).batchSize(500).stream();

// Pipe the stream into Elasticsearch.
elasticbulk.import(stream, {
  index: 'my_index_name',
  type: 'my_type_name',
  host: 'localhost:9200',
})
.then(function(res) {
  console.log('Importing finished');
});

Ship the data:

node script.js

It's not extremely fast, but it works for millions of records (thanks to streams).
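
The `batchSize(500)` call hints at why memory stays flat: documents are consumed in small groups rather than all at once. The generator below sketches that idea on a plain iterable. It is an illustration only, not elasticbulk's actual implementation, and `inBatches` is a made-up name.

```javascript
// Hypothetical helper: yields arrays of at most `size` items, holding only one
// batch in memory at a time.
function* inBatches(iterable, size) {
  let batch = [];
  for (const item of iterable) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  // Emit whatever is left over after the last full batch.
  if (batch.length > 0) yield batch;
}

const batches = [...inBatches([1, 2, 3, 4, 5], 2)];
console.log(JSON.stringify(batches)); // [[1,2],[3,4],[5]]
```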

小开

Since mongo-connector now appears dead, my company decided to build a tool that uses Mongo change streams to output to Elasticsearch.

Our initial results look promising. You can check it out at https://github.com/electionsexperts/mongo-stream. We're still early in development and would welcome suggestions or contributions.
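
To give an idea of what a change-stream bridge has to do, here is a minimal sketch that maps one change event to `_bulk` actions. It is not taken from mongo-stream. `changeEventToBulk` is a hypothetical function, and it assumes the stream was opened with `fullDocument: 'updateLookup'` so that update events carry the whole document.

```javascript
// Hypothetical helper: converts a MongoDB change event into the objects that
// would be serialized as lines of an Elasticsearch _bulk request.
function changeEventToBulk(event, index) {
  const id = event.documentKey ? String(event.documentKey._id) : null;
  switch (event.operationType) {
    case 'insert':
    case 'replace':
    case 'update': {
      // Updates only include fullDocument when updateLookup is enabled.
      if (id === null || !event.fullDocument) return [];
      const { _id, ...source } = event.fullDocument;
      return [{ index: { _index: index, _id: id } }, source];
    }
    case 'delete':
      return id === null ? [] : [{ delete: { _index: index, _id: id } }];
    default:
      // Other event types (drop, invalidate, ...) are ignored in this sketch.
      return [];
  }
}

// Illustrative event, shaped like an insert change event.
const insertEvent = {
  operationType: 'insert',
  documentKey: { _id: 'a1' },
  fullDocument: { _id: 'a1', title: 'Home' },
};
console.log(JSON.stringify(changeEventToBulk(insertEvent, 'pages')));
```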

小开

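Here is another good option I found for migrating your MongoDB data to Elasticsearch: a Go daemon that syncs MongoDB to Elasticsearch in real time. It's called Monstache, and the download link is in the fourth step below.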

Below is the initial setup to configure and use it.

Step 1:

C:\Program Files\MongoDB\Server\4.0\bin>mongod --smallfiles --oplogSize 50 --replSet test

Step 2:

C:\Program Files\MongoDB\Server\4.0\bin>mongo
MongoDB shell version v4.0.2
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 4.0.2
Server has startup warnings:
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten]
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten]
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] ** WARNING: This server is bound to localhost.
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          Remote systems will be unable to connect to this server.
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          Start the server with --bind_ip <address> to specify which IP
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          addresses it should serve responses from, or with --bind_ip_all to
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          bind to all interfaces. If this behavior is desired, start the
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten] **          server with --bind_ip 127.0.0.1 to disable this warning.
2019-01-18T16:56:44.931+0530 I CONTROL  [initandlisten]
MongoDB Enterprise test:PRIMARY>

Step 3: Verify the replication.

MongoDB Enterprise test:PRIMARY> rs.status();
{
"set" : "test",
"date" : ISODate("2019-01-18T11:39:00.380Z"),
"myState" : 1,
"term" : NumberLong(2),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"heartbeatIntervalMillis" : NumberLong(2000),
"optimes" : {
"lastCommittedOpTime" : {
"ts" : Timestamp(1547811537, 1),
"t" : NumberLong(2)
},
"readConcernMajorityOpTime" : {
"ts" : Timestamp(1547811537, 1),
"t" : NumberLong(2)
},
"appliedOpTime" : {
"ts" : Timestamp(1547811537, 1),
"t" : NumberLong(2)
},
"durableOpTime" : {
"ts" : Timestamp(1547811537, 1),
"t" : NumberLong(2)
}
},
"lastStableCheckpointTimestamp" : Timestamp(1547811517, 1),
"members" : [
{
"_id" : 0,
"name" : "localhost:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 736,
"optime" : {
"ts" : Timestamp(1547811537, 1),
"t" : NumberLong(2)
},
"optimeDate" : ISODate("2019-01-18T11:38:57Z"),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"electionTime" : Timestamp(1547810805, 1),
"electionDate" : ISODate("2019-01-18T11:26:45Z"),
"configVersion" : 1,
"self" : true,
"lastHeartbeatMessage" : ""
}
],
"ok" : 1,
"operationTime" : Timestamp(1547811537, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1547811537, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
}
MongoDB Enterprise test:PRIMARY>

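Step 4. Download Monstache from https://github.com/rwynn/monstache/releases. Unzip the download and adjust your PATH variable to include the path to the folder for your platform. Go to cmd and type "monstache -v" (it prints 4.13.1). Monstache uses the TOML format for its configuration.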

Step 5.

My config.toml:

mongo-url = "mongodb://127.0.0.1:27017/?replicaSet=test"
elasticsearch-urls = ["http://localhost:9200"]


direct-read-namespaces = [ "admin.users" ]


gzip = true
stats = true
index-stats = true


elasticsearch-max-conns = 4
elasticsearch-max-seconds = 5
elasticsearch-max-bytes = 8000000


dropped-collections = false
dropped-databases = false


resume = true
resume-write-unsafe = true
resume-name = "default"
index-files = false
file-highlighting = false
verbose = true
exit-after-direct-reads = false


index-as-update=true
index-oplog-time=true

Step 6.

D:\15-1-19>monstache -f config.toml