How do I pass -D parameters or environment variables to a Spark job?

I want to change the Typesafe Config of a Spark job between dev and prod environments. The simplest way to achieve this, it seems to me, is to pass -Dconfig.resource=ENVNAME to the job; the Typesafe Config library will then do the work for me.
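
To illustrate the idea, here is a minimal sketch of what the job code would rely on (the keys and file names below are just examples): ConfigFactory.load() honours the standard config.resource, config.file and config.url system properties, where config.resource expects a full resource name such as dev.conf.

import com.typesafe.config.ConfigFactory

// Minimal sketch (hypothetical keys): with -Dconfig.resource=dev.conf on the JVM,
// ConfigFactory.load() reads dev.conf from the classpath instead of application.conf.
// A typical dev.conf would start with `include "application"` and override a few keys.
object JobConfig {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()
    println(config.getString("my-job.environment")) // hypothetical key
    println(config.getString("my-job.db.url"))      // hypothetical key
  }
}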

Is there a way to pass that option directly to the job? Or is there a better way to change the job configuration at runtime?

Edit:

  • When I add the --conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev" option to the spark-submit command, nothing happens.
  • When I pass -Dconfig.resource=dev to the spark-submit command, I get Error: Unrecognized option '-Dconfig.resource=dev'.

I had a lot of problems passing -D parameters to the Spark executors and the driver, so here is a quote from my blog post about it: the right way to pass the parameters is through the properties spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. I passed both the log4j configuration property and the parameter that I needed for my configuration (to the driver I was only able to pass the log4j configuration). For example, written in a properties file passed to spark-submit with --properties-file:

spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties

You can read my blog post about the overall configuration of Spark. I am running on YARN as well.
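
As a side note, here is a minimal sketch (assuming SparkSession and the key names from the properties file above) of how the driver could read that custom spark.application.properties.file entry; only keys prefixed with spark. survive into SparkConf, which is why the custom key is named that way.

import org.apache.spark.sql.SparkSession

// Minimal sketch: reading the custom key from the properties file above inside the driver.
object ReadCustomProperty {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-custom-property").getOrCreate()

    // The --properties-file entries end up in SparkConf (non-"spark." keys are ignored with a warning).
    val appPropsPath = spark.sparkContext.getConf
      .getOption("spark.application.properties.file")
      .getOrElse(sys.error("spark.application.properties.file not set"))

    println(s"Application properties file: $appPropsPath")
    spark.stop()
  }
}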

I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like

Array(".../spark-submit", ..., "--conf", confValues, ...)

where confValues is:

  • for yarn-cluster mode:
    "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
  • for local[*] mode:
    "run.mode=development"

It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
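
In case it helps, a minimal sketch of the launcher side (the paths, class name and parameter values are hypothetical): building the spark-submit invocation as a Seq and running it with scala.sys.process sidesteps most of the shell-quoting issues, because no shell is involved.

import scala.sys.process._

// Minimal sketch: each element of the Seq becomes exactly one argument of spark-submit,
// so the spaces inside confValue need no extra quoting or escaping.
object SubmitJob {
  def main(args: Array[String]): Unit = {
    val confValue =
      "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=42" // hypothetical values

    val cmd = Seq(
      "/opt/spark/bin/spark-submit",   // assumed install location
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "--conf", confValue,
      "--class", "com.example.MyJob",  // hypothetical main class
      "/opt/jobs/my-job.jar"           // hypothetical jar
    )

    val exitCode = cmd.!               // run and wait; output goes to this process's stdout/stderr
    println(s"spark-submit exited with $exitCode")
  }
}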

Change the spark-submit command line by adding three options:

  • --files <location_to_your_app.conf>
  • --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
  • --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'

Here is how I run my Spark program with additional Java options:

/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log

As you can see:

  • the custom config file is shipped with --files /home/spark/jobs/fact_stats_ad.conf
  • the executor Java options are set with --conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
  • the driver Java options are set with --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'
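
For completeness, a minimal sketch (hypothetical key) of how the job side would pick this up: ConfigFactory.load() honours -Dconfig.file automatically, so the code never hard-codes the path. Note that --files also ships fact_stats_ad.conf into each executor's working directory.

import com.typesafe.config.ConfigFactory

// Minimal sketch: with -Dconfig.file=<path> set on the JVM, load() reads that file
// instead of application.conf.
val config = ConfigFactory.load()
val reportTable = config.getString("tables.ad_cookie_report") // hypothetical key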

Hope it helps.

--files <location_to_your_app.conf> --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'

If you write it this way, a later --conf for the same key will overwrite the previous one; you can verify this by looking at the Environment tab of the Spark UI after the job has started.

So the correct way is to put all the options for one key on the same line, like this: --conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'. If you do this, you will find all your settings shown in the Spark UI.
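
If you prefer to check programmatically rather than via the Environment tab, here is a minimal sketch (the property name is just an example; for the driver this assumes cluster mode, since in client mode the driver JVM is already running and --driver-java-options should be used instead):

import org.apache.spark.sql.SparkSession

// Minimal sketch: confirming that the -D options actually reached the driver and executor JVMs.
object CheckJavaOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("check-java-options").getOrCreate()

    // Driver side: options from spark.driver.extraJavaOptions show up as system properties.
    println(s"driver config.resource = ${sys.props.get("config.resource")}")

    // Executor side: run a tiny job and read the system property inside the tasks.
    val fromExecutors = spark.sparkContext
      .parallelize(1 to 2, 2)
      .map(_ => sys.props.get("config.resource"))
      .collect()
      .toList
    println(s"executor config.resource = $fromExecutors")

    spark.stop()
  }
}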

Using the method in the command below may be helpful to you:

spark-submit --master local[2] --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' --class com.test.spark.application.TestSparkJob target/application-0.0.1-SNAPSHOT-jar-with-dependencies.jar prod

I have tried it and it worked for me. I would also suggest going through the Spark documentation below, which is really helpful: https://spark.apache.org/docs/latest/running-on-yarn.html

I originally had this config file:

my-app {
environment: dev
other: xxx
}

This is how I'm loading my config in my Spark Scala code:

import java.io.File
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load())
  .resolve()
  .getConfig("my-app")

With this setup, despite what the Typesafe Config documentation and all the other answers say, the system property override didn't work for me when I launched my spark job like so:

spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Dmy-app.environment=prod' \
--files my-app.conf \
my-app.jar

To get it to work I had to change my config file to:

my-app {
environment: dev
environment: ${?env.override}
other: xxx
}

and then launch it like so:

spark-submit \
--master yarn \
--deploy-mode cluster \
--name my-app \
--driver-java-options='-XX:MaxPermSize=256M -Denv.override=prod' \
--files my-app.conf \
my-app.jar
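
To see why the ${?env.override} trick works, here is a minimal standalone sketch (no Spark involved; the property is set in code only to stand in for -Denv.override=prod): parseFile/parseString never apply system properties by themselves, they only become visible through the fallback to ConfigFactory.load() when resolve() runs, and the later optional assignment then wins over environment: dev.

import com.typesafe.config.ConfigFactory

object OverrideDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for -Denv.override=prod on the driver JVM command line.
    System.setProperty("env.override", "prod")
    ConfigFactory.invalidateCaches() // make sure load() sees the property set above

    val appConf = ConfigFactory.parseString(
      """my-app {
        |  environment: dev
        |  environment: ${?env.override}  // optional override, ignored when undefined
        |  other: xxx
        |}""".stripMargin)

    val config = appConf
      .withFallback(ConfigFactory.load()) // load() carries the system properties
      .resolve()
      .getConfig("my-app")

    println(config.getString("environment")) // prints "prod"
  }
}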

spark-submit --driver-java-options "-Denv=DEV -Dmode=local" --class co.xxx.datapipeline.jobs.EventlogAggregator target/datapipeline-jobs-1.0-SNAPSHOT.jar

The above command works for me:

-Denv=DEV => to read the DEV environment properties file, and
-Dmode=local => to create the SparkContext locally, i.e. .setMaster("local[*]")
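
A minimal sketch (hypothetical config file names) of driver code that consumes those two system properties:

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object EnvAwareJob {
  def main(args: Array[String]): Unit = {
    val env  = sys.props.getOrElse("env", "DEV")      // e.g. -Denv=DEV
    val mode = sys.props.getOrElse("mode", "cluster") // e.g. -Dmode=local

    // Loads e.g. DEV.conf from the classpath (assumed to exist there).
    val config = ConfigFactory.load(env)
    println(s"loaded config from ${config.origin().description()}")

    val builder = SparkSession.builder().appName(s"my-job-$env")
    val spark =
      if (mode == "local") builder.master("local[*]").getOrCreate()
      else builder.getOrCreate()

    println(s"env=$env master=${spark.sparkContext.master}")
    spark.stop()
  }
}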