Cross-entropy

By Joannes Vermorel, January 2018

Cross-entropy is a metric that can be used to reflect the accuracy of probabilistic forecasts, and it has strong ties with maximum likelihood estimation. Cross-entropy is of primary importance to modern forecasting systems because it is instrumental in delivering superior forecasts, even when those forecasts are ultimately judged against other metrics. From a supply chain perspective, cross-entropy is particularly important as it supports the estimation of models that are also good at capturing the probabilities of rare events, which frequently happen to be the costliest ones. This metric departs substantially from the intuition that supports simpler accuracy metrics, like the mean square error or the mean absolute percentage error.

Frequentist probability vs Bayesian probability

A common way of understanding statistics is the frequentist probability perspective. When trying to make quantitative sense of an uncertain phenomenon, the frequentist perspective states that measurements should be repeated many times, and that by counting the number of occurrences of the phenomenon of interest, it is possible to estimate the frequency of the phenomenon, i.e. its probability. As the frequency rate converges through many experiments, the probability gets estimated more accurately.

The cross-entropy departs from this perspective by adopting the Bayesian probability perspective. The Bayesian perspective reverses the problem. When trying to make quantitative sense of an uncertain phenomenon, the Bayesian perspective starts with a model that directly gives a probability estimate for the phenomenon. Then, through repeated observations, we assess how the model fares when confronted with the real occurrences of the phenomenon. As the number of occurrences increases, the measurement of the (in)adequacy of the model improves.

The frequentist and the Bayesian perspectives are both valid and useful. From a supply chain perspective, as collecting observations is costly and somewhat inflexible (companies have little control over generating orders for a product), the Bayesian perspective is frequently more tractable.

The intuition of cross-entropy

Before delving into the algebraic formulation of the cross-entropy, let’s try to shed some light on its underlying intuition. Let’s assume that we have a probabilistic model (or just "model" in the following) that is intended to both explain the past and predict the future. For every past observation, this model provides an estimate of the probability that this observation should have happened just like it did. While it is possible to construct a model that simply memorizes all past observations, assigning each of them a probability of exactly 1, such a model would not tell us anything about the future. An interesting model therefore only approximates the past, and delivers probabilities that are less than 1 for past events.

By adopting the Bayesian perspective, we can evaluate the probability that the model would have generated all the observations. If we further assume all observations to be independent (more precisely IID, independent and identically distributed), then the probability that this model would have generated the collection of observations that we have is the product of all the probabilities estimated by the model for every past observation.

The mathematical product of thousands of values that are typically less than 0.5 (assuming that we are dealing with a phenomenon which is quite uncertain) can be expected to be an incredibly small number. For example, even when considering an excellent model to forecast demand, what would be the probability that this model could generate all the sales data that a company has observed over the course of a year? While estimating this number is non-trivial, it is clear that this number would be astoundingly small.

Thus, in order to mitigate this numerical problem known as an arithmetic underflow, logarithms are introduced. Intuitively, logarithms can be used to transform products into sums, which conveniently addresses the arithmetic underflow problem.
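The underflow problem can be seen directly in a few lines of Python. This is only an illustrative sketch: the number of observations (5,000) and the probability assigned to each of them (0.3) are made-up values, not figures from any real model.

```python
import math

# Hypothetical setup: 5,000 past observations, each of which the model
# considered to have a probability of 0.3 (illustrative values only).
probabilities = [0.3] * 5000

# Naive likelihood: the direct product is far below the smallest positive
# double-precision number, so it underflows to exactly 0.0.
likelihood = 1.0
for p in probabilities:
    likelihood *= p

# Log-likelihood: the sum of the logarithms is an ordinary-sized number.
log_likelihood = sum(math.log(p) for p in probabilities)

print(likelihood)      # 0.0
print(log_likelihood)  # roughly -6019.9
```

Once the product has collapsed to zero, two models of very different quality can no longer be told apart, whereas their log-likelihoods remain comparable.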

Formal definition of the cross-entropy

For two discrete probability distributions $P$ and $Q$, with probability mass functions $p$ and $q$, the cross-entropy is defined as: $$H(p, q) = -\sum_x p(x)\, \log q(x).$$ This definition is not symmetric. $P$ is intended as the “true” distribution, only partially observed, while $Q$ is intended as the “unnatural” distribution obtained from a constructed statistical model.
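As a small worked example of this definition, the sketch below computes the cross-entropy of two toy distributions in both directions, which makes the lack of symmetry visible. The function name `cross_entropy`, the dictionary representation and the two distributions are assumptions chosen for illustration; natural logarithms are used, so results are in nats rather than bits.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), for distributions given as dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # usual convention: a term with p(x) = 0 contributes 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # q rules out an outcome that p allows
        total -= px * math.log(qx)
    return total

# Toy distributions over two outcomes (illustrative values only).
p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}

print(cross_entropy(p, q))  # about 1.204
print(cross_entropy(q, p))  # about 0.693, so H(p, q) != H(q, p)
print(cross_entropy(p, p))  # about 0.693, the entropy of p itself
```

Note that `cross_entropy(p, q)` is larger than `cross_entropy(p, p)`: measured against $p$, no other distribution scores better than $p$ itself.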

In information theory, cross-entropy can be interpreted as the expected length in bits for encoding messages, when $Q$ is used instead of $P$. This perspective goes beyond the present discussion and isn’t of primary importance from a supply chain perspective.

In practice, as $P$ isn’t known, the cross-entropy is empirically estimated from the observations, by simply assuming that all the collected observations are equally probable, that is, each of the $N$ observations $x_1, \dots, x_N$ is given a weight of $1/N$. $$H(q) = - \frac{1}{N} \sum_{i=1}^{N} \log q(x_i).$$ Interestingly enough, this formula is, up to its sign, identical to the average log-likelihood. Minimizing the cross-entropy or maximizing the log-likelihood is essentially the same thing, both conceptually and numerically.
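The empirical estimate can be sketched as follows, assuming for the sake of illustration a Poisson model of daily demand. The Poisson choice, the helper names and the demand figures are all hypothetical; any model that returns a probability for each observation could be plugged in instead.

```python
import math

def poisson_log_pmf(k, rate):
    """log q(k) for a Poisson model with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def empirical_cross_entropy(observations, log_q):
    """-(1/N) * sum_i log q(x_i), i.e. the negative average log-likelihood."""
    return -sum(log_q(x) for x in observations) / len(observations)

# Hypothetical daily demand for one product (made-up numbers).
demand = [3, 0, 2, 5, 1, 4, 2, 3]

# Score three candidate models; the lower the cross-entropy, the better.
for rate in (1.0, 2.5, 6.0):
    score = empirical_cross_entropy(demand, lambda k: poisson_log_pmf(k, rate))
    print(rate, round(score, 3))
```

Among the three candidates, the rate of 2.5 gets the lowest score. This is no accident: 2.5 is the average of the demand figures, which is the maximum likelihood estimate for a Poisson rate, so the cross-entropy and the likelihood agree on the best model.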

The superiority of cross-entropy

From the 1990s to the early 2010s, most of the statistical community was convinced that the most efficient way, from a purely numerical perspective, to optimize a given metric, say MAPE (mean absolute percentage error), was to build an optimization algorithm directly geared toward this metric. Yet a critical, counter-intuitive insight achieved by the deep learning community is that this wasn’t the case. Numerical optimization is a very difficult problem, and most metrics are not suitable for efficient, large-scale numerical optimization efforts. During the same period, the data science community at large also came to realize that all forecasting and prediction problems were actually numerical optimization problems.

From a supply chain perspective, the takeaway is that even if the goal of the company is to optimize a forecasting metric like MAPE or MSE (mean square error), then, in practice, the most efficient route is to optimize the cross-entropy. At Lokad, in 2017, we collected a significant amount of empirical evidence supporting this claim. Perhaps more surprisingly, cross-entropy also outperforms CRPS (continuous ranked probability score), another probabilistic accuracy metric, even if the resulting models are ultimately judged against CRPS.

It is not entirely clear what makes cross-entropy such a good metric for numerical optimization. One of the most compelling arguments, detailed in Ian Goodfellow et al., is that cross-entropy provides very large gradient values. These are especially valuable for gradient descent, which happens to be the most successful large-scale optimization method available at the moment.
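The gradient argument can be illustrated on a textbook case. The setting below is an assumption chosen for simplicity and is not necessarily the exact setup discussed in the work cited above: a single sigmoid output, a target of 1, and a comparison of the gradient of the cross-entropy with the gradient of the squared error, both taken with respect to the input $z$ of the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z, y):
    """d/dz of -[y*log(s) + (1-y)*log(1-s)], with s = sigmoid(z)."""
    return sigmoid(z) - y

def grad_squared_error(z, y):
    """d/dz of (s - y)^2, with s = sigmoid(z)."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

y = 1.0  # the event did happen
for z in (-8.0, -4.0, 0.0):  # very negative z: the model is confidently wrong
    print(z, grad_cross_entropy(z, y), grad_squared_error(z, y))
```

When the model is confidently wrong (here `z = -8.0`), the cross-entropy gradient stays close to -1, while the squared-error gradient is smaller than 0.001 in magnitude. Gradient descent therefore receives a strong correction signal from the cross-entropy precisely when the model is most mistaken.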

CRPS vs cross-entropy

As far as supply chain is concerned, cross-entropy largely outperforms CRPS as a metric for probabilistic forecasts simply because it puts a much greater emphasis on rare events. Let’s consider a probabilistic model for demand that has a mean at 1000 units, with the entire mass of the distribution concentrated on the segment 990 to 1010. Let’s further assume that the next quantity observed for the demand is 1011.

From the CRPS perspective, the model is relatively good, as the observed demand is about 10 units away from the mean forecast. In contrast, from the cross-entropy perspective, the model has an infinite error: the model did predict that observing 1011 units of demand had a zero probability – a very strong proposition – which turned out to be factually incorrect, as demonstrated by the fact that 1011 units have just been observed.
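The contrast can be checked numerically. In the sketch below, the narrow model is assumed to spread its mass uniformly over the integers 990 to 1010 (the text only states that the mass is concentrated on that segment). A second, hypothetical model keeps 99% of its mass there and spreads the remaining 1% over 900 to 1100. The per-observation cross-entropy term is computed as $-\log q(x)$, and the CRPS uses its standard form for integer-valued forecasts.

```python
import math

def crps(pmf, observed):
    """CRPS of an integer-valued forecast: sum_x (F(x) - 1[x >= observed])^2."""
    lo = min(min(pmf), observed)
    hi = max(max(pmf), observed)
    cdf, total = 0.0, 0.0
    for x in range(lo, hi + 1):
        cdf += pmf.get(x, 0.0)               # running cumulative distribution F(x)
        step = 1.0 if x >= observed else 0.0
        total += (cdf - step) ** 2
    return total

def log_loss(pmf, observed):
    """-log q(observed); infinite when the model gave it a zero probability."""
    q = pmf.get(observed, 0.0)
    return math.inf if q == 0.0 else -math.log(q)

# Assumption: mass spread uniformly over the integers 990..1010.
narrow = {x: 1 / 21 for x in range(990, 1011)}

# Hypothetical alternative: 99% of the mass as above, 1% spread over 900..1100.
hedged = {x: 0.01 / 201 for x in range(900, 1101)}
for x in range(990, 1011):
    hedged[x] += 0.99 / 21

observed = 1011
print(crps(narrow, observed), log_loss(narrow, observed))  # about 7.5, inf
print(crps(hedged, observed), log_loss(hedged, observed))
```

Under these assumptions, the CRPS barely distinguishes the two models, while the cross-entropy term is infinite for the narrow model and finite for the one that reserved a little mass for the unexpected.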

The propensity of CRPS to favor models that can make absurd claims, like “event XY will never happen” when the event does happen, largely explains, from the supply chain perspective, why cross-entropy delivers better results. Cross-entropy favors models that aren’t caught “off guard”, so to speak, when the improbable happens. In supply chain, the improbable does happen, and when it happens without prior preparation, dealing with it turns out to be very costly.