Which cluster type should I choose for Spark?

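I am new to Apache Spark, and I just learned that Spark supports three types of cluster:

  • Standalone: Spark manages its own cluster
  • YARN: uses Hadoop's YARN resource manager
  • Mesos: Apache's dedicated resource manager project

I think I should try Standalone first. In the future, I will need to build a large cluster (hundreds of instances).

Which cluster type should I choose?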

I think the people best placed to answer this are those who work on Spark. So, from Learning Spark:

Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.

If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.

One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.

In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.
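As a rough illustration, the choice of cluster manager mostly shows up as the master URL passed to `spark-submit --master`. The sketch below is a minimal helper, assuming the default ports (7077 for a standalone master, 5050 for a Mesos master). The function name `master_url` and the host name in the usage note are hypothetical, not part of Spark.

```python
def master_url(manager, host=None, port=None):
    """Return the --master value for a given cluster manager.

    Hypothetical helper for illustration; not part of Spark.
    """
    manager = manager.lower()
    if manager == "yarn":
        # The ResourceManager address is read from the Hadoop configuration
        # (HADOOP_CONF_DIR / YARN_CONF_DIR), so no host is given here.
        return "yarn"
    # (URL scheme, default master port) for the managers that need a host.
    defaults = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if manager not in defaults:
        raise ValueError(f"unknown cluster manager: {manager}")
    if host is None:
        raise ValueError(f"{manager} needs a master host")
    scheme, default_port = defaults[manager]
    return f"{scheme}://{host}:{port or default_port}"
```

For example, `master_url("standalone", "master.example.com")` gives `spark://master.example.com:7077`, while `master_url("yarn")` is simply `yarn`. Moving from Standalone to YARN or Mesos later is therefore mostly a deployment change, not an application rewrite.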

Spark Standalone Manager : A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
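Because a standalone application takes every available core unless told otherwise, a shared standalone cluster usually needs a per-application cap. A minimal sketch, assuming the job is launched through `spark-submit`: `--total-executor-cores` and `--executor-memory` are real `spark-submit` options (the former applies to standalone and Mesos), while the helper name, host, and script name are made up for illustration.

```python
import shlex


def standalone_submit_cmd(master_host, app, total_cores, executor_memory="2g"):
    """Build a spark-submit command that caps an app's share of a standalone cluster.

    Hypothetical helper; it only assembles the argument list.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return [
        "spark-submit",
        "--master", f"spark://{master_host}:7077",   # 7077 is the default master port
        "--total-executor-cores", str(total_cores),  # sets spark.cores.max
        "--executor-memory", executor_memory,
        app,
    ]


cmd = standalone_submit_cmd("master.example.com", "my_job.py", total_cores=8)
print(shlex.join(cmd))
```

The memory value and core count here are placeholders; the right numbers depend on the size of the workers and on how many applications share the cluster.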

A few benefits of YARN over Standalone & Mesos:

  1. YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.

  2. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.

  3. The Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN you choose the number of executors to use.

  4. YARN directly handles rack and machine locality in your requests, which is convenient.

  5. The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos, you get resource "offers" and choose to accept or reject them based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.

  6. If you have a big Hadoop cluster already in place, YARN is the better choice.

  7. The Standalone manager requires the user to configure each of the nodes with the shared secret. Mesos' default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service-level authorization, authentication for Web consoles, and data confidentiality. Hadoop uses Kerberos to verify that each user and service is authenticated.

  8. High availability is offered by all three cluster managers but Hadoop YARN doesn’t need to run a separate ZooKeeper Failover Controller.
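The contrast in point 5 between a request model and an offer model can be sketched with a toy simulation. This is plain Python, not the YARN or Mesos API; the `Node` type and both functions are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int


def request_model(nodes, cpus_needed, preferred=None):
    """YARN-style: the framework states what it wants; the manager picks a node."""
    # Try the preferred node first (locality), then the others in order.
    ordered = sorted(nodes, key=lambda n: n.name != preferred)
    for node in ordered:
        if node.free_cpus >= cpus_needed:
            node.free_cpus -= cpus_needed
            return node.name
    return None  # nothing fits, the request stays pending


def offer_model(nodes, accept):
    """Mesos-style: the manager offers resources; the framework accepts or declines."""
    for node in nodes:
        # `accept` is the framework's own policy: how many offered CPUs to take.
        cpus_taken = min(accept(node.name, node.free_cpus), node.free_cpus)
        if cpus_taken > 0:
            node.free_cpus -= cpus_taken
            return node.name
    return None  # every offer was declined
```

In the first function the placement logic lives in the manager; in the second, the `accept` callback carries the scheduling policy, which is the extra work for the framework author that the point above mentions.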

Useful links:

Spark documentation page

AgilData article

Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.

Between YARN and Mesos, one thing to consider is that, unlike a MapReduce job, a Spark job grabs its executors and holds them for the entire lifetime of the job, whereas a MapReduce job can acquire and release mappers and reducers over its lifetime.

If you have long-running Spark jobs that do not fully use the resources they acquired at startup, you may want to share those resources with other applications. You can do that only through Mesos or through Spark's dynamic allocation. See https://spark.apache.org/docs/2.0.2/job-scheduling.html#scheduling-across-applications for details. So with YARN, the only way to get dynamic allocation for Spark is to use the dynamic allocation that Spark itself provides. YARN won't interfere in that, while Mesos will. Again, this whole point only matters if you have a long-running Spark application and would like to scale it up and down dynamically.
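For reference, Spark's own dynamic allocation is switched on through configuration. A minimal sketch, assuming Spark 2.x as in the linked docs: the property names are real Spark settings, while the helper function and the numeric values are placeholders to adapt.

```python
def dynamic_allocation_conf(min_executors=1, max_executors=20, idle_timeout="60s"):
    """Return spark-submit --conf flags that enable dynamic allocation.

    Hypothetical helper; the values are illustrative, not recommendations.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

The returned list can be appended to a `spark-submit` command line. Note that the external shuffle service also has to be set up on the worker nodes; the flags alone are not enough.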

Mesos has a more sophisticated scheduling design, allowing applications like Spark to negotiate with it. It is more suitable for the diversity of applications today. I found this site really insightful:

https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn

"... YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries (like small and fast Spark jobs), and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model. The resource demands, execution model, and architectural demands of MapReduce are very different from those of long-running services, such as web servers or SOA applications, or real-time workloads like those of Spark or Storm..."

In this case, as with similar dilemmas in data engineering, there are many side questions to answer before choosing one distribution method over another. For example, if you are not running your processing engine on more than 3 nodes, you are usually not facing a problem big enough for the difference in performance tuning between YARN and Spark Standalone to settle your decision (based on experience). In that situation you will usually try to keep your pipeline simple, especially when your services are not managed for you by a cloud provider and bugs and failures happen often.

I choose standalone for relatively small or simple pipelines, but if I'm feeling comfortable and have a Hadoop cluster already in place, I prefer to take advantage of all the extra configuration options that Hadoop (YARN) can give me.