How to aggregate custom application logs in Spark on HDInsight?


CONTEXT



I want to configure custom logging in an application written in Python and running on an HDInsight Spark cluster (hence Hortonworks-style).
HDInsight cluster type: Spark 2.2 on Linux (HDI 3.6), Spark version: 2.2.0.2.6.3.2-13



My requirements are as follows:



RESEARCH



I managed to modify log4j.properties, creating a custom log appender and a logger that uses it. It does write to a file, but I'm failing to make YARN aggregate those logs.


log4j.properties
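
Schematically, the appender/logger pair I mean looks something like the sketch below; the appender name, logger name and log file are placeholders rather than my exact configuration.

```
# Placeholder custom appender writing the application's logs to a dedicated file
log4j.appender.customFile=org.apache.log4j.FileAppender
log4j.appender.customFile.File=${spark.yarn.app.container.log.dir}/filename.log
log4j.appender.customFile.layout=org.apache.log4j.PatternLayout
log4j.appender.customFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Placeholder logger routed only to the custom appender
log4j.logger.myapp=INFO, customFile
log4j.additivity.myapp=false
```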



When I tried to use the standard $spark.yarn.app.container.log.dir/filename.log, it got resolved to /filename.log and returned a permission denied error, both in pyspark and with spark-submit. The file filename.log did appear in the RM UI, though it was empty.



The path spark.yarn.app.container.log.dir normally resolves to something like /var/log/hadoop-yarn/container/<applicationId>/<containerId>, e.g. /var/log/hadoop-yarn/container/application_1504924099862_7571/container_e16_1504924099862_7571_01_000005,
so the solution I was considering is to set the appender's destination file from within the application, using either the value of spark.yarn.app.container.log.dir or the applicationId and containerId.
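
To illustrate the second option, this is the kind of lookup I have in mind in PySpark. If I understand correctly, YARN exports CONTAINER_ID and LOG_DIRS as environment variables inside each container, and in client mode the driver itself does not run in a container, so the check has to happen on the executors. This is only a sketch (assuming the usual sc from a pyspark session), not something I have verified on HDInsight:

```python
import os

# applicationId is available directly from the SparkContext
app_id = sc.applicationId

def container_info(_):
    # YARN should export these inside every container;
    # the client-mode driver is not a container, so look on the executors
    return [(os.environ.get("CONTAINER_ID"), os.environ.get("LOG_DIRS"))]

# Collect (containerId, log dirs) pairs from a few executor partitions
info = sc.parallelize(range(4), 4).mapPartitions(container_info).collect()
print(app_id)
print(info)
```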




In both cases I don't know how to do it in Python: spark.yarn.app.container.log.dir looks unset (sc._conf.getAll() doesn't contain it), and I don't know where to look for the containerId, other than extracting it from the spark.yarn.app.container.log.dir path.
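
The closest thing I've found to Scala's System.getProperty is going through py4j, but that only reads the driver JVM's system properties, which is presumably why it comes back empty when the driver isn't inside a YARN container (just a sketch):

```python
# Read the JVM system property from the driver via py4j
log_dir = sc._jvm.System.getProperty("spark.yarn.app.container.log.dir")
print(log_dir)  # expected to be None/empty if the driver is not a YARN container

# The Spark configuration itself doesn't contain it either
print([kv for kv in sc._conf.getAll() if "container" in kv[0]])
```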




I managed to obtain spark.yarn.app.container.log.dir in Scala thanks to "How do I get the YARN ContainerId from inside the container?", but it returns multiple paths, so I'm not sure whether it is usable.
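
If those multiple paths turn out to be a comma-separated list of container log directories, my idea would be to keep the one that belongs to the current application, roughly like this (untested sketch with made-up example values):

```python
# Untested sketch: pick the log dir belonging to the current application
def pick_log_dir(raw_value, app_id):
    dirs = [d.strip() for d in raw_value.split(",") if d.strip()]
    matching = [d for d in dirs if app_id in d]
    return matching[0] if matching else (dirs[0] if dirs else None)

raw = ("/var/log/hadoop-yarn/container/application_1504924099862_7571/"
       "container_e16_1504924099862_7571_01_000005")
print(pick_log_dir(raw, "application_1504924099862_7571"))
```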





QUESTIONS



Is it possible that spark.yarn.app.container.log.dir has different values in the Scala and Python APIs?





How can I read the value of spark.yarn.app.container.log.dir in pyspark, given that in Scala I can do it with System.getProperty("spark.yarn.app.container.log.dir")?





Can I make YARN aggregate logs from a custom appender that does not use spark.yarn.app.container.log.dir?










