Why should we use Temperature in softmax?

Sara · Nov 8, 2019 · Viewed 9.1k times

I'm currently working on a CNN and I want to know: what is the function of temperature in the softmax formula, and why should we use high temperatures to get a softer probability distribution?

Answer

Zain Sarwar · Aug 18, 2020

One reason to use the temperature parameter is to change the output distribution computed by your neural net. It is applied to the logits vector according to this equation: q_i = exp(z_i / T) / Σ_j exp(z_j / T), where T is the temperature parameter.
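
As a quick illustration of the formula (a minimal NumPy sketch of my own; the function name softmax_with_temperature and the example logits are made up, not from the original answer):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))  # standard softmax
print(softmax_with_temperature(logits, T=5.0))  # softer, closer to uniform
```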

You see, what this will do is change the final probabilities. You can choose T to be anything (the higher the T, the 'softer' the distribution will be; if it is 1, the output distribution will be the same as your normal softmax outputs). What I mean by 'softer' is that the model will basically be less confident about its predictions.

a) Sample 'hard' softmax probs: (0.01, 0.01, 0.98)

b) Sample 'soft' softmax probs: (0.2, 0.2, 0.6)
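
For example (hypothetical logits of my own, not from the original answer), both of these distributions can come from the exact same logits, just scored at different temperatures:

```python
import numpy as np

def softmax_t(z, T):
    e = np.exp(np.asarray(z) / T)
    return e / e.sum()

logits = [0.0, 0.0, 4.585]                 # hypothetical logits
print(softmax_t(logits, T=1.0).round(2))   # [0.01 0.01 0.98] -> the 'hard' case (a)
print(softmax_t(logits, T=4.17).round(2))  # [0.2  0.2  0.6 ] -> the 'soft' case (b)
```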

'a' is a 'harder' distribution. Your model is very confident about its predictions. However, in many cases, you don't want your model to do that. For example, if you are using an RNN to generate text, you are basically sampling from your output distribution and choosing the sampled word as your output token (and next input). If your model is extremely confident, it may produce very repetitive and uninteresting text: most of the probability mass is concentrated in a few tokens, so the sampling procedure keeps selecting the same small set of words over and over again. In order to give other words a chance of being sampled as well, you can plug in the temperature variable and produce more diverse text, as in the sketch below.
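
A rough sketch of such a sampling step (the toy vocabulary and logits below are made up for illustration, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, vocab, T=1.0):
    """Sample one token from the temperature-scaled softmax distribution."""
    e = np.exp(np.asarray(logits) / T)
    probs = e / e.sum()
    return rng.choice(vocab, p=probs)

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary
logits = [4.0, 1.0, 0.5, 0.5, 0.2]          # made-up model outputs

# At T=1 the model picks "the" almost every time; at T=2 the other
# words get a realistic chance, which makes the generated text more varied.
print([sample_token(logits, vocab, T=1.0) for _ in range(10)])
print([sample_token(logits, vocab, T=2.0) for _ in range(10)])
```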

With regards to why higher temperatures lead to softer distributions, that has to do with the exponential function. The temperature parameter penalizes bigger logits more than smaller logits. The exponential function is an increasing function, so if a logit is already big, dividing it by T shrinks its exponential by a much larger amount (% wise) than it does for a small logit.

Here's what I mean: exp(6) ≈ 403, exp(3) ≈ 20.

Now let's 'penalize' these terms with a temperature of, say, 1.5: exp(6/1.5) ≈ 54, exp(3/1.5) ≈ 7.4.

You can see that, in % terms, the bigger the term is, the more it shrinks when the temperature is used to penalize it. When the bigger logits shrink more than your smaller logits, more probability mass (to be computed by the softmax) will be assigned to the smaller logits.
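
Running those two logits through the full softmax makes the shift in probability mass explicit (a small sketch assuming a two-element logit vector of [6, 3]):

```python
import numpy as np

def softmax_t(z, T):
    e = np.exp(np.asarray(z) / T)
    return e / e.sum()

logits = [6.0, 3.0]
print(softmax_t(logits, T=1.0))  # approx. [0.953, 0.047]
print(softmax_t(logits, T=1.5))  # approx. [0.881, 0.119] -> mass shifts to the smaller logit
```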