Is it possible to get the current SparkContext settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current sparkContext.

If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?


Not sure if you can get all the default settings easily, but specifically for the worker dir, it's quite straightforward:

from pyspark import SparkFiles
print(SparkFiles.getRootDirectory())

Yes: sc.getConf().getAll()

Which uses the method:

SparkConf.getAll()

as accessed by

SparkContext.sc.getConf()

See it in action:

In [4]: sc.getConf().getAll()
Out[4]:
[(u'spark.master', u'local'),
 (u'spark.rdd.compress', u'True'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.app.name', u'PySparkShell')]
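
If you only need a single property, such as the spark.worker.dir from the question, you can also read it directly from the same SparkConf (a minimal sketch; get() returns None when the key was never set):

worker_dir = sc.getConf().get("spark.worker.dir")  # None unless it was explicitly set
print(worker_dir)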

You can use:

sc.sparkContext.getConf.getAll

For example, I often have the following at the top of my Spark programs:

logger.info(sc.sparkContext.getConf.getAll.mkString("\n"))
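
That snippet is Scala; a rough PySpark equivalent (just a sketch, assuming an existing SparkContext sc and a standard logging logger) would be:

import logging

logger = logging.getLogger(__name__)

# Log every (key, value) pair of the current SparkConf, one per line
logger.info("\n".join("%s=%s" % kv for kv in sc.getConf().getAll()))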

Spark 1.6+

sc.getConf.getAll.foreach(println)

Just for the record, the analogous Java version:

Tuple2<String, String> sc[] = sparkConf.getAll();
for (int i = 0; i < sc.length; i++) {
    System.out.println(sc[i]);
}

Spark 2.1+

spark.sparkContext.getConf().getAll() where spark is your SparkSession (this gives you a list of (key, value) tuples with all configured settings)
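
If a dict is handier, the list of tuples converts directly (a minimal sketch):

# getAll() returns (key, value) tuples; wrap them in dict() for easy lookups
conf_dict = dict(spark.sparkContext.getConf().getAll())
print(conf_dict.get("spark.app.name"))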

For a complete overview of your Spark environment and configuration I found the following code snippets useful:

SparkContext:

for item in sorted(sc._conf.getAll()): print(item)

Hadoop Configuration:

hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()): print(item)

Environment variables:

import os
for item in sorted(os.environ.items()): print(item)
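
Combined into a small helper (the name dump_environment is my own, and it assumes an existing SparkContext sc), the three snippets look like this:

import os

def dump_environment(sc):
    """Print Spark conf, Hadoop conf and OS environment variables, sorted by key."""
    print("== SparkContext configuration ==")
    for item in sorted(sc._conf.getAll()):
        print(item)

    print("== Hadoop configuration ==")
    hadoopConf = {}
    iterator = sc._jsc.hadoopConfiguration().iterator()
    while iterator.hasNext():
        prop = iterator.next()
        hadoopConf[prop.getKey()] = prop.getValue()
    for item in sorted(hadoopConf.items()):
        print(item)

    print("== Environment variables ==")
    for item in sorted(os.environ.items()):
        print(item)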

For Spark 2+, when using Scala, you can also use:

spark.conf.getAll; // spark is your SparkSession
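
In PySpark, the session's runtime config exposes individual keys through spark.conf.get (a minimal sketch; spark.sql.shuffle.partitions is just an example key):

# Read a single runtime setting through the SparkSession's conf interface
print(spark.conf.get("spark.sql.shuffle.partitions"))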

Update configuration in Spark 2.3.1

To change the default spark configurations you can follow these steps:

Import the required classes

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

Get the default configurations

spark.sparkContext._conf.getAll()

Update the default configurations

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.app.name', 'Spark Updated Conf'),
                                        ('spark.executor.cores', '4'),
                                        ('spark.cores.max', '4'),
                                        ('spark.driver.memory', '4g')])

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()
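
Putting those steps together, the whole round trip looks roughly like this (a sketch only; the memory values are arbitrary examples, and spark is assumed to be an already-running SparkSession):

from pyspark.sql import SparkSession

# Take the existing conf and override the properties we care about ...
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.driver.memory', '4g')])

# ... then stop the old context and rebuild the session with the new conf
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Verify the change is visible in the new session's conf
print(dict(spark.sparkContext.getConf().getAll()).get('spark.executor.memory'))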

Unfortunately, no, the Spark platform as of version 2.3.1 does not provide any way to programmatically access the value of every property at run time. It provides several methods to access the values of properties that were explicitly set through a configuration file (like spark-defaults.conf), set through the SparkConf object when you created the session, or set through the command line when you submitted the job, but none of these methods will show the default value for a property that was not explicitly set. For completeness, the best options are:

  • The Spark application’s web UI, usually at http://<driver>:4040, has an “Environment” tab with a property value table.
  • The SparkContext keeps a hidden reference to its configuration in PySpark, and the configuration provides a getAll method: spark.sparkContext._conf.getAll().
  • Spark SQL provides the SET command that will return a table of property values: spark.sql("SET").toPandas(). You can also use SET -v to include a column with the property’s description.

(These three methods all return the same data on my cluster.)
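
For example, the Spark SQL route looks like this in PySpark (a minimal sketch; toPandas() requires pandas on the driver):

# All properties visible to Spark SQL, as a pandas DataFrame with key/value columns
settings = spark.sql("SET").toPandas()
print(settings.head())

# "SET -v" adds a description column for each property
print(spark.sql("SET -v").toPandas().head())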

Simply running

sc.getConf().getAll()

should give you a list with all settings.

Suppose I want to increase the driver memory at runtime using a SparkSession:

s2 = SparkSession.builder.config("spark.driver.memory", "29g").getOrCreate()

Now I want to view the updated settings:

s2.conf.get("spark.driver.memory")

To get all the settings, you can make use of spark.sparkContext._conf.getAll()


Hope this helps

If you want to see the configuration in Databricks, use the command below:

spark.sparkContext._conf.getAll()

I would suggest you try the method below in order to get the current Spark context settings.

SparkConf.getAll()

as accessed by

SparkContext.sc._conf

Get the default configurations (specifically for Spark 2.1+)

spark.sparkContext.getConf().getAll()

Stop the current Spark Session

spark.sparkContext.stop()

Create a Spark Session (reusing a conf object, e.g. one built as sketched below)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
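
Note that conf in the last step has to exist already; a minimal sketch of that missing piece, building a SparkConf from the current settings (mirroring the Spark 2.3.1 answer above):

from pyspark.conf import SparkConf

# Capture the current settings in a reusable SparkConf before stopping the old context
conf = SparkConf().setAll(spark.sparkContext._conf.getAll())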