Robert W. Walker
Multinomial variables are, by definition, multidimensional; we would otherwise call them ordered. But even ordered variables are beset by the problem of metrics. If we have low, medium, and high, should they be 1, 2, 3 or 1, 3, 5 or 1, 3, 4 or perhaps 1, 5, 7? Recall that a slope in a regression is the change in y for a unit change in x. We know things are bigger or smaller but, by definition, we struggle with the question of how much.
Today we will consider two related models of such phenomena. They extend last week’s discussion of models for binary choices to more extensive choice sets and to richer outcomes than the binary case. They are:
There are entire texts on these subjects. For example, Greene and Hensher (2010) devote an entire book to ordered choices; Kenneth E. Train’s Discrete Choice Methods with Simulation examines both.
The AIC [and BIC] are built around the idea of likelihood presented last time. The formal definition, which Wikipedia states correctly, is the following:
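With $k$ estimated parameters and maximized likelihood $\hat{L}$,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

so smaller values are better: improvements in fit lower the AIC while additional parameters raise it.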
There’s a great blog post that I found that details this. In a classic paper for which [among many other contributions] Daniel L. McFadden was awarded the Nobel Prize in Economics, he develops a multinomial/conditional logistic regression model for the study of nominal outcomes. The core statistical result is that, if the random utility of each option is described by a Gumbel/Type I extreme value distribution, then the difference in utilities has a logistic distribution. From this observation, one can develop random utility models for unordered choices that follow from utility maximization. In short, we can use microeconomic foundations to motivate the models that follow.
There are two ways to think about such models. They can be motivated by choice-specific covariates or by chooser-specific covariates [or a combination of both]. In general, if the covariates are chooser-specific, we call it a multinomial logit model, while, if the covariates are choice-specific, we call it conditional logit or conditional logistic regression. McFadden’s paper is built around transportation choices.
The independence of irrelevant alternatives (IIA) assumption contends that, when people choose among a set of alternatives, the odds of choosing option A over option B should not depend on the presence or absence of an unchosen alternative C. Paul Allison, a famous emeritus sociologist at the University of Pennsylvania, has a nice blog post on this that is well worth the time.
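In the multinomial/conditional logit, this property follows directly from the form of the choice probabilities. With systematic utility $V_j$ for option $j$, $\Pr(j) = \exp(V_j)/\sum_k \exp(V_k)$, so the odds of A over B reduce to

$$\frac{\Pr(A)}{\Pr(B)} = \frac{\exp(V_A)}{\exp(V_B)} = \exp(V_A - V_B),$$

which involves nothing about C.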
Imagine the following scenario: you are out to dinner and you are given menus. One of your companions is excited to choose a steak from the menu. The server arrives and announces the specials of the day; your companion then decides that a pork chop option from the menu is preferable. Perhaps they were originally indifferent between the steak and the pork chop. Nevertheless, it would appear as though the presentation of irrelevant alternatives – the specials – had a nontrivial effect on your companion’s choice.
An example: exchange rate regimes and the vanishing middle.
There is a sizable literature on how countries structure markets for currency exchange. There are two polar approaches: fixed (pegged) exchange rates and freely floating rates. But there is also a third: intermediate regimes, things like floating pegs, the ERM snake, and other mixtures of the two. The literature in international monetary economics highlights these intermediate regimes as prone to instability.
regime2aF:    0      1      2
counts:    1016    240    206
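The estimates in the table below come from a multinomial logit of the three-category regime outcome on the irregular-change measures (probirch, dumirch) and the lagged regime indicators (fix_l, float_l). The original fitting call is not shown in the source; a minimal sketch of the sort of call that produces it, reusing the object and data names that appear later (multi_model, EXRT.data), is:

library(nnet)
# Multinomial logit for the three-category regime outcome.
# multinom() uses the first factor level as the reference category.
# (A sketch; the exact call is not shown in the source.)
multi_model <- multinom(regime2aF ~ probirch + dumirch + fix_l + float_l,
                        maxit = 500, data = EXRT.data)
summary(multi_model)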
|                   | Dependent variable: regime2aF |            |
|-------------------|------------------------------:|-----------:|
|                   | 0                             | 2          |
|                   | (1)                           | (2)        |
| probirch          | 18.003***                     | 15.319**   |
|                   | (5.632)                       | (5.982)    |
| dumirch           | -23.831***                    | -9.226     |
|                   | (7.078)                       | (7.483)    |
| fix_l             | 6.076***                      | 3.347***   |
|                   | (0.368)                       | (0.431)    |
| float_l           | 2.763***                      | 4.992***   |
|                   | (0.443)                       | (0.424)    |
| Constant          | -2.945***                     | -3.277***  |
|                   | (0.343)                       | (0.385)    |
| Akaike Inf. Crit. | 967.448                       | 967.448    |

Note: *p<0.1; **p<0.05; ***p<0.01
regime2aF probirch dumirch fix_l
1: 240 Min. :0.00000 Min. :0.00000 Min. :0.0000
0:1016 1st Qu.:0.01000 1st Qu.:0.00000 1st Qu.:0.0000
2: 206 Median :0.04000 Median :0.00000 Median :1.0000
Mean :0.04034 Mean :0.00754 Mean :0.7175
3rd Qu.:0.05000 3rd Qu.:0.01000 3rd Qu.:1.0000
Max. :0.34000 Max. :0.16000 Max. :1.0000
NA's :150 NA's :150
float_l
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1327
3rd Qu.:0.0000
Max. :1.0000
# Baseline scenario: fix_l = 0 and float_l = 0; vary probirch over its observed
# range (0 to 0.34) with dumirch held at 0, and collect predicted probabilities
# for each regime category.
pred.data.I <- data.frame(fix_l = 0, float_l = 0, probirch = seq(0, 0.34, by = 0.01),
    dumirch = 0)
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1$Irregular <- pred.data.I$probirch
Preds.1.df <- data.frame(Preds.1)
Graph.1 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
# Same scenario, but now dumirch increases along with probirch.
pred.data.I <- data.frame(fix_l = 0, float_l = 0, probirch = seq(0, 0.34, by = 0.01),
    dumirch = seq(0, 0.34, by = 0.01))
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1.df <- data.frame(Preds.1)
Preds.1.df$Irregular <- pred.data.I$probirch
Graph.2 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
library(patchwork)
Graph.1 + Graph.2  # display the two panels side by side
# Repeat the exercise with fix_l = 1 and float_l = 0.
pred.data.I <- data.frame(fix_l = 1, float_l = 0, probirch = seq(0, 0.34, by = 0.01),
    dumirch = 0)
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1$Irregular <- pred.data.I$probirch
Preds.1.df <- data.frame(Preds.1)
Graph.1 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
# As above, with dumirch increasing along with probirch.
pred.data.I <- data.frame(fix_l = 1, float_l = 0, probirch = seq(0, 0.34, by = 0.01),
    dumirch = seq(0, 0.34, by = 0.01))
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1.df <- data.frame(Preds.1)
Preds.1.df$Irregular <- pred.data.I$probirch
Graph.2 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
library(patchwork)
Graph.1 + Graph.2  # display the two panels side by side
# Repeat the exercise with fix_l = 0 and float_l = 1.
pred.data.I <- data.frame(fix_l = 0, float_l = 1, probirch = seq(0, 0.34, by = 0.01),
    dumirch = 0)
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1$Irregular <- pred.data.I$probirch
Preds.1.df <- data.frame(Preds.1)
Graph.1 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
# As above, with dumirch increasing along with probirch.
pred.data.I <- data.frame(fix_l = 0, float_l = 1, probirch = seq(0, 0.34, by = 0.01),
    dumirch = seq(0, 0.34, by = 0.01))
Preds.1 <- data.frame(predict(multi_model, newdata = pred.data.I, type = "p"))
Preds.1.df <- data.frame(Preds.1)
Preds.1.df$Irregular <- pred.data.I$probirch
Graph.2 <- Preds.1.df %>%
    pivot_longer(., cols = c(X1, X2, X0)) %>%
    ggplot(.) + aes(x = Irregular, y = value, color = name) + geom_point() + theme_minimal()
library(patchwork)
Graph.1 + Graph.2  # display the two panels side by side
# weights: 12 (6 variable)
initial value 1606.171166
iter 10 value 532.176055
iter 20 value 531.979184
iter 20 value 531.979183
iter 20 value 531.979183
final value 531.979183
converged
# A trimmed specification that drops dumirch
multi_model.3 <- multinom(formula = regime2aF ~ probirch + fix_l + float_l, maxit = 500,
    data = EXRT.data)
# weights: 15 (8 variable)
initial value 1441.379323
iter 10 value 482.863574
iter 20 value 480.237337
iter 30 value 480.208852
final value 480.208838
converged
McFadden CoxSnell Nagelkerke AIC
0.5592956 0.6029024 0.7459794 1075.9583656
McFadden CoxSnell Nagelkerke AIC
0.5617286 0.6086804 0.7497900 976.4176765
Comparing the two sets of fit statistics, the second model minimizes the AIC [976.42 versus 1075.96] and is slightly better on the pseudo-R-squared measures as well.
My preferred method of thinking about ordered regression involves latent variables. So what is a latent variable? It is something that is unobservable, hence latent, and we only observe coarse realizations in the form of qualitative categories. Consider the example from Li in the Journal of Politics.
Li investigates the determinants of tax incentives for Foreign Direct Investment in a sample of countries driven by data availability. There are four distinct claims:
Suppose there is some unobserved continuous variable, call it $y^{*}$, that measures the willingness/utility to be derived from tax incentives to FDI. Unfortunately, this latent quantity is unobservable; we instead observe how many incentives are offered and posit that the number of incentives is a manifestation of increasing utility with unknown points of separation – cutpoints – that separate these latent utilities into a mutually exclusive and exhaustive partition. In a simplified example, consider this.
So anything below -3 is zero incentives; anything between -3 and -2 is one incentive, … , and anything above 4 should be all six incentives.
What we have is a regression problem, but the outcome is unobserved and takes the form of a logistic random variable. Indeed, one could write the equation as

$$y^{*} = X\beta + \epsilon$$

where $\epsilon$ is assumed to have a logistic distribution, but this is otherwise just a linear regression. Indeed, the direct interpretation of the slopes is the effect of a one-unit change in $X$ on that logistic random variable.
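Putting the latent regression and the cutpoints together, the probability of observing category $j$ is the probability that $y^{*}$ lands between the adjacent cutpoints $\tau_{j-1}$ and $\tau_{j}$; with a logistic error and logistic CDF $\Lambda$,

$$\Pr(y = j) = \Lambda(\tau_{j} - X\beta) - \Lambda(\tau_{j-1} - X\beta).$$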
This should give us an idea of what is going on. The data come in Stata format; we can read these via the foreign or haven libraries in R.
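A minimal sketch with haven, assuming a local copy of the replication file (the file name below is a placeholder):

library(haven)
# Read Li's replication data from Stata format; substitute the actual path and
# file name (the one below is hypothetical).
Li.Data <- read_dta("Li_replication.dta")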
First, let us have a look at Model 1.
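Model 1 is an ordered (proportional odds) logit fit with polr from MASS. A sketch of the call that produces the summary below, assuming the object name li.mod1 by analogy with li.mod4 later on:

library(MASS)
# Ordered logit: incentive generosity as a function of logged rule of law and
# a transition indicator (the object name li.mod1 is an assumption).
li.mod1 <- polr(generositygF ~ law00log + transition, data = Li.Data, Hess = TRUE)
summary(li.mod1)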
Call:
polr(formula = generositygF ~ law00log + transition, data = Li.Data)
Coefficients:
Value Std. Error t value
law00log -0.6192 0.4756 -1.3019
transition -0.5161 0.7126 -0.7243
Intercepts:
Value Std. Error t value
0|1 -1.3617 0.3768 -3.6144
1|2 -0.4888 0.3323 -1.4707
2|3 1.1785 0.3668 3.2133
3|4 2.3771 0.5361 4.4339
4|5 3.1160 0.7325 4.2541
5|6 3.8252 1.0183 3.7565
Residual Deviance: 164.3701
AIC: 180.3701
We can read these by stars. There is nothing that is clearly different from zero as a slope or from 1 as an odds ratio. The authors deploy a common strategy for adjusting standard errors that, in this case, is necessary to find a relationship with statistical confidence. That’s a diversion; to the story. In general, the sign of the rule of law indicator is negative, so as rule of law increases, incentives decrease, though we cannot rule out no effect. Transitions also have a negative sign; regime changes have no clear influence on incentives. There is additional information that is commonly given short shrift. What do the cutpoints separating the categories imply? Let’s think this through, recognizing that the estimates have an underlying t/normal distribution. The 4|5 cutpoint is within one standard error of both 3|4 and 5|6; the model cannot really tell these values apart. Things do improve in the lower part of the scale, but we should note that this is where the vast majority of the data are actually observed.
Next, I will turn the estimates into odds ratios by exponentiating them.
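A one-liner does it, again assuming Model 1 is stored as li.mod1 (an assumed name):

# Odds ratios for Model 1: exponentiate the slope coefficients
exp(coef(li.mod1))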
li.mod4 <- polr(generositygF ~ law00log + transition + fdiinf + democfdi + democ +
autocfdi2 + autocfdir + reggengl + reggengl2 + gdppclog + gdplog, data = Li.Data)
summary(li.mod4)
Call:
polr(formula = generositygF ~ law00log + transition + fdiinf +
democfdi + democ + autocfdi2 + autocfdir + reggengl + reggengl2 +
gdppclog + gdplog, data = Li.Data)
Coefficients:
Value Std. Error t value
law00log -0.89148 0.66806 -1.3344
transition -0.57123 0.94945 -0.6016
fdiinf 0.37605 0.18055 2.0828
democfdi -0.39969 0.18228 -2.1927
democ -1.23307 0.77661 -1.5878
autocfdi2 1.24932 1.94198 0.6433
autocfdir -3.17796 1.95351 -1.6268
reggengl 1.81476 0.44754 4.0550
reggengl2 -0.05777 0.01444 -4.0007
gdppclog 0.20891 0.43867 0.4762
gdplog 0.15754 0.18138 0.8686
Intercepts:
Value Std. Error t value
0|1 13.7773 0.1020 135.0542
1|2 14.7314 0.3048 48.3349
2|3 16.9740 0.5230 32.4561
3|4 18.7911 0.8027 23.4107
4|5 20.0387 1.1590 17.2900
5|6 23.2947 4.7216 4.9336
Residual Deviance: 134.2212
AIC: 168.2212
(2 observations deleted due to missingness)
Measured as odds ratios, we can obtain those by exponentiating the estimates:
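One way to do that in R (li.mod4 is the fitted object from above):

# Odds ratios for the full specification
exp(coef(li.mod4))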
Goodness of Fit:
McFadden CoxSnell Nagelkerke AIC
0.01104919 0.03405653 0.03560378 180.37009764
McFadden CoxSnell Nagelkerke AIC
0.1649728 0.4054507 0.4235700 168.2212440
The last model is clearly better than the first by any of these measures. That said, there are a lot of additional predictors that add much complexity to the model and the difference in AIC is not very large.
Let’s do this.
To get the data, we will need the following.
And what does this mean….