Suddenly, my YARN cluster has stopped working; everything I submit fails with "Exit code 1". I want to track down the problem, but as soon as an application fails, YARN deletes its log files. Which configuration setting do I have to adjust so that YARN keeps these log files?
It seems your container is exiting with exit code 1.
You cannot see the logs in the UI because log aggregation is disabled by default. The following parameter controls log aggregation: "yarn.log-aggregation-enable" (a value of "false" means log aggregation is disabled).
If this is set to "false", each Node Manager stores its container logs in a local directory, determined by the following configuration parameter: "yarn.nodemanager.log-dirs".
For example, in my case, this is set to:
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>e:\hdpdata\hadoop\logs</value>
</property>
So all the container logs for a particular application can be found in the folder "e:\hdpdata\hadoop\logs\{application-id}\{container-id}" on the Node Manager machine where the Application Master ran.
Let's assume that my application "application_1443377528298_0010" FAILED. In the YARN RM's UI (whose address is determined by the config parameter yarn.resourcemanager.webapp.address), you can find the node on which the Application Master ran. In the figure below, the Application Master ran on the machine "120243".
If you log in to this machine and look in the folder "e:\hdpdata\hadoop\logs\application_1443377528298_0010\", you can see the logs for all the containers of application "application_1443377528298_0010".
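If the local log folder is already gone by the time you look, one possible approach is to make the Node Managers delay their cleanup. The following is a minimal sketch, not a tested setup: both properties are listed in yarn-default.xml, but the values (600 and 86400 seconds) are illustrative assumptions of mine and should be adjusted for your cluster. They go in yarn-site.xml on each Node Manager, which will most likely need a restart to pick them up.

```xml
<!-- Assumed value: keep an application's local files and log directory
     for 10 minutes after it finishes (the default is 0, i.e. delete at once). -->
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<!-- Assumed value: keep local user logs for one day.
     Only applies while log aggregation is disabled. -->
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>86400</value>
</property>
```

A debug delay like this is meant for troubleshooting; on a busy cluster it can fill the local disks, so consider setting it back once the problem is found.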
If you want to see the logs through the YARN RM web UI instead, you need to enable log aggregation. To do that, set the following parameters in yarn-site.xml:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
With the above settings, my logs are aggregated in HDFS at "/app-logs/{username}/logs/". Under this folder, you can find the logs for all the applications run so far. Log retention is determined by the configuration parameter "yarn.log-aggregation.retain-seconds" (how long to retain the aggregated logs).
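As a rough illustration of the retention setting mentioned above, the fragment below keeps aggregated logs for seven days. The value 604800 is only an assumption for the example; as far as I know, the default in yarn-default.xml is -1, which disables deletion of aggregated logs.

```xml
<!-- Assumed value: retain aggregated logs in HDFS for 7 days (7 * 24 * 3600 seconds). -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
```

With aggregation enabled, the logs of a finished application should also be readable from the command line with "yarn logs -applicationId application_1443377528298_0010".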
While MapReduce applications are running, you can access their logs from YARN's web UI. Once an application has completed, the logs are served through the Job History Server.
In your case, if you want to see the logs in the web UI after the application has terminated, you also need to run the MapReduce Job History Server. To enable it, set the following configuration parameters in mapred-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>{job-history-hostname}:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>{job-history-hostname}:19888</value>
</property>
And set the following configuration parameter in yarn-site.xml:
<property>
<name>yarn.log.server.url</name>
<value>http://{job-history-hostname}:19888/jobhistory/logs</value>
</property>
I have replicated these settings from an HDP installation on Windows, and they work for me. They should work for you as well. For a description of each of the configuration parameters mentioned above, refer to the link below:
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml