Don’t Change The p-value Threshold

Personally, I think the proposal to ‘change’ the threshold for statistical significance from p < .05 to p < .005 is a mistake. The only sense in which this proposal has any real bite is if journals and hiring committees respond by treating research that doesn’t meet p < .005 as less important, and all that does is strengthen the very incentives driving the problematic behavior in the first place.

I’d much rather have a well-designed (ideally pre-registered) trial at p < .05 than a p < .005 result cherry-picked via after-the-fact choice of analysis. Rather than making the distinction between well-designed, appropriate methodology and dangerous, potentially misleading methodology more apparent, this proposal further obscures it. It tells any scientist who was standing on principle to stop hoping their better methodology will be appreciated and to start competing on p-value with papers published using problematic data analysis.

In particular, I think this kind of proposal doesn’t take sufficient account of the economics and incentives of researchers. Yes, p < .005 studies would be more convincing, but they also cost more (both in dollars and in time), so by telling fledgling researchers they need p < .005 you force them to put all their eggs in one basket, making dubious data analysis choices that much more tempting when their study falls short of the threshold.
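To get a rough sense of the added cost, here’s a back-of-the-envelope power calculation (a minimal sketch using statsmodels; the effect size of d = 0.3 and the 80% power target are my illustrative assumptions, not figures from the proposal):

```python
# How much larger must a study be to clear p < .005 instead of p < .05?
# Sketch for a two-sided, two-sample t-test; the effect size (d = 0.3)
# and target power (0.8) are arbitrary illustrative choices.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for alpha in (0.05, 0.005):
    n = power_analysis.solve_power(effect_size=0.3, alpha=alpha,
                                   power=0.8, alternative='two-sided')
    print(f"alpha = {alpha}: ~{n:.0f} subjects per group")
```

This works out to roughly 175 subjects per group at p < .05 versus roughly 295 at p < .005, about a 70% larger sample for the same power. That is exactly the kind of cost increase that pushes a researcher whose study lands at p = .01 toward ‘fixing’ the analysis rather than running a bigger study.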

What we need are more results-blind publication processes, in which journals decide whether to publish based merely on a description of the experimental design, without knowledge of what the results found. That would both help combat many of these biases and truly evaluate researchers on their ability, not their luck. Ideally such studies would be pre-accepted before the results were actually analyzed. Of course there still needs to be a place for merely suggestive work that invites further research, but it should be regarded as such, without any particular importance assigned to p-value.

However, as these are only my brief immediate thoughts I’m quite open to potential counterarguments.

Racial Bias in Police Stops

Bias Science Done Right

In a previous post I was very critical of a study claiming to show gender bias in journal publications in political science. Like too many studies of this kind, its data only supported the judgement of gender bias to the extent one was already inclined to believe gender bias was the appropriate explanation for gender disparities in the field. However, not all studies suffer from these flaws. So when I heard about a recent study in PNAS examining how an individual’s race affects how police treat them at traffic stops, and saw that it was well done, I thought I should post an example of the right way to engage in this kind of research (and of the important, unexpected information one gets when one studies bias rigorously).

What the authors of this paper did was take body camera footage from Oakland police officers in April 2014 and examine the vehicle stops they made. They had human raters (I presume college students) examine transcripts of the interactions (without knowledge of the officer’s or civilian’s race) and rate them for respectfulness, formality, friendliness, politeness and impartiality. After determining that such ratings were repeatable (different raters tended to agree on scoring), they trained a computational model to predict both respect and formality, which they validated against human ratings. I’ll let the paper’s authors speak for themselves about the results.
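The paper’s actual model is built from linguistic features specifically tied to respect and politeness, but the general recipe (fit an interpretable model to human ratings, then check it against held-out annotations) is easy to sketch. Everything below, from the bag-of-words features to the toy transcripts and ratings, is a hypothetical stand-in for the authors’ pipeline, not a reproduction of it:

```python
# Minimal sketch: predict human "respect" ratings of utterances from
# text features, then sanity-check predictions on held-out ratings.
# The utterances and ratings are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

utterances = [
    "sir could I see your license please",
    "license and registration now",
    "thank you for your patience have a good night",
    "step out of the car",
    "sorry to hold you up you're free to go",
    "keep your hands where I can see them",
]
respect_ratings = [4.2, 2.1, 4.8, 2.5, 4.5, 1.9]  # hypothetical human scores

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(utterances)
X_train, X_test, y_train, y_test = train_test_split(
    X, respect_ratings, test_size=0.33, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Compare model predictions to the held-out human ratings.
for pred, human in zip(model.predict(X_test), y_test):
    print(f"model: {pred:.2f}  human: {human:.2f}")
```

With only six toy utterances the fit itself is meaningless, of course; the point is just the shape of the method: human annotation, a model trained on those annotations, and validation against annotations the model never saw.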

Controlling for these contextual factors, utterances spoken by officers to white community members score higher in Respect [β = 0.05 (0.03, 0.08)]. Officer utterances were also higher in Respect when spoken to older [β = 0.07 (0.05, 0.09)] community members and when a citation was issued [β = 0.04 (0.02, 0.06)]; Respect was lower in stops where a search was conducted [β = −0.08 (−0.11, −0.05)]. Officer race did not contribute a significant effect. Furthermore, in an additional model on 965 stops for which geographic information was available, neither the crime rate nor density of businesses in the area of the stop were significant, although a higher crime rate was indicative of increased Formality [β = 0.03 (0.01, 0.05)].

Note that the authors themselves raised the possibility that geographic region might play a confounding role, e.g., people in high-crime areas might be treated more suspiciously, and rejected it. However, one might still worry that any effect we are seeing is a result of minorities being more inclined toward criminal behavior, and thus more frequently pulled over on suspicion of serious infractions, but that too is considered and rejected.

One might consider the hypothesis that officers were less respectful when pulling over community members for more severe offenses. We tested this by running another model on a subset of 869 interactions for which we obtained ratings of offense severity on a four-point Likert scale from Oakland Police Department officers, including these ratings as a covariate in addition to those mentioned above. We found that the offense severity was not predictive of officer respect levels, and did not substantially change the results described above. To consider whether this disparity persists in the most “everyday” interactions, we also reran our analyses on the subset of interactions that did not involve arrests or searches (N = 781), and found the results from our earlier models were fundamentally unchanged.

Finally, the paper’s authors are careful to acknowledge the limitations of their analysis, in particular its limits in identifying the cause of these disparities in treatment and language. With respect to the possibility that differences in the behavior of minority community members themselves cause officers to respond differently, they say:

The racial disparities in officer respect are clear and consistent, yet the causes of these disparities are less clear. It is certainly possible that some of these disparities are prompted by the language and behavior of the community members themselves, particularly as historical tensions in Oakland and preexisting beliefs about the legitimacy of the police may induce fear, anger, or stereotype threat. However, community member speech cannot be the sole cause of these disparities. Study 1 found racial disparities in police language even when annotators judged that language in the context of the community member’s utterances. We observe racial disparities in officer respect even in police utterances from the initial 5% of an interaction, suggesting that officers speak differently to community members of different races even before the driver has had the opportunity to say much at all.

I feel that this analysis considered, and fairly convincingly rejected, all the plausible confounders. Of course others might disagree and suggest some other factor, e.g., the expensiveness of the car, is responsible, but even if you are inclined to take such a line you have to admit that this study provides some pretty damn good evidence by ruling out many other plausible confounding variables.

Having said this, one should still be careful (as the authors of this paper are) in interpreting the results. In particular, we don’t have a good sense of the psychological reason for officers’ different behavior toward minorities. Is it because they judge them to be less deserving of respect? Or do officers expect minorities to be less respectful toward them, and so begin the interaction less respectfully? Or is it some other explanation? If the goal is making the world a better place, and not merely assigning blame, those answers matter, and hopefully more good scientific studies will reveal them.

I’d like to close with what I take to be one of the most important reasons to do this research rigorously. While most people could probably have guessed that officers would be less respectful to minority drivers, it wasn’t at all obvious that officer race wouldn’t be a factor in respectfulness toward minorities. Nor was it obvious that we would see a difference in treatment from a broad swath of police officers, not merely from a few particularly biased officers. The reason this kind of research is important (in addition to validating minority claims about police treatment) is that we need to learn how and why minorities are treated differently if we are going to fix the problem. Without studies like this, I think many people’s natural assumption is that hiring minority officers would address these problems. It doesn’t.

Evaluating Gender Bias Claims In Academia Part 1

Does The Data Support The Interpretation?

For a number of reasons I think it’s vital that we have a good empirical grip on why the genders are over- or under-represented in various disciplines, and at various levels of acclaim within those disciplines. There is the obvious reason, namely that it is only through such an understanding that we can usefully discuss claims of unfairness and evaluate schemes to address those claims. If we get the reasons for under- or over-representation in various areas wrong, we not only risk failing to correct real instances of unfair treatment but also risk undermining the credibility of attempts to address unfair treatment more generally. This isn’t only about avoiding gender-based biases but, more broadly, about identifying ways in which anyone might face unjust hardship in pursuing their chosen career and succeeding at it [1].

Also, even putting questions of fairness and discrimination to the side, there are important social and cultural reasons to care about these outcomes. For instance, the imbalance of men and women in STEM fields both imposes personal hardships on members of both genders in those fields and creates an excuse for dismissing the style of thinking developed by STEM disciplines. As such, identifying simple changes that could substantially increase female participation in STEM subjects is desirable in and of itself, and similar cultural considerations beyond mere fairness extend to other fields. However, I worry that incorrect interpretation of the empirical data could lead us to overlook such changes, especially when they don’t fit nicely into the default cultural narrative [2].

The point is that I genuinely want to accurately identify the causes of gender differences in educational attainment and academic outcomes. One could be forgiven for thinking that we’ve already nailed down these causes. After all, every couple of months one sees a new study touted in the mainstream media claiming to show sexism playing a role in some educational or professional evaluation. Unfortunately, closer examination of the actual studies often reveals that they don’t support the interpretation provided, and everyone suffers from a misleading reading of the empirical data.

So, in an attempt to get a better picture of what the evidence tells us, every time I see a new study claiming to document gender bias, or otherwise explain gender differentials in outcomes, I’m going to dive into the results and see if they support the claims made by the article. While I can’t claim that I’m choosing studies to examine in a representative fashion, I do hope that comparing the stated claims to what the data supports will help uncover the truth.

Gender and Publishing in Political Science

I ran across this claim that there is gender bias against female authors in political science at The Monkey Cage, the Washington Post’s political science blog. For once, the mainstream media deserves credit, because the post accurately conveyed the claims made by the study.

The study claims to show gender bias in political science publication based on an analysis of published papers in political science. By coding the gender of the authors of published papers, the study gives us the following picture of the rate of female publication.

[Figure: rates of female authorship compared against three benchmarks. Line A: the share of women in the ladder faculty at the 20 largest PhD-granting departments in the discipline (27%). Line B: the share of women among all APSA members (31%). Line C: the share of women among all newly minted PhDs, as reported in the NSF’s Survey of Earned Doctorates (40%).]

The paper deserves credit for recognizing that this pattern may reflect some degree of sorting by subfield, which could falsely create the impression of bias even when none was present. However, any credit granted should be immediately revoked on account of the following argument.

What gendered sorting into subfields would not explain, however, is the pattern we observe for the four “generalist” journals in our sample (AJPS, APSR, JOP and POP). These four journals—official journals either of the national association or one of its regional affiliates—are all “generalist” outlets, in that their websites indicate that they are open to submissions across all subfields. Yet, as figure 3 shows, women are underrepresented, against all three benchmarks, in three of those four “generalist” journals.

The mere fact that these are generalist journals in no way means that they are not more likely to publish some kinds of analysis rather than others. As the study goes on to observe, women are substantially underrepresented in quantitative and statistical work while overrepresented (at least as compared to their representation at prestigious institutions) in qualitative work. Despite the study authors’ suggestion to the contrary, choosing, for valid intellectual (or even invalid but gender-unrelated) reasons, to value quantitative work more highly and publish it more readily doesn’t constitute gender bias in journal publication in the sense that their conclusions and ethical interpretations assume.

[Figure 3: rates of female authorship by journal, against the same three benchmarks. Line A: women in the ladder faculty at the top 20 PhD-granting departments (27%). Line B: women among all APSA members (31%). Line C: women among all newly minted PhDs, per the NSF’s Survey of Earned Doctorates (40%).]

Ideally, the authors would have provided some more quantitative evaluation of how much of the observed effect is explained by choice of subfield and mode of analysis. However, I think it’s fair to say, based on the graph above, that women aren’t so overrepresented in qualitative publications that subfield preferences could explain everything. So let’s put the concern about sorting by subfield and analysis type to one side and return to the primary issue.

This paper also deserves praise for recognizing that merely comparing the percentage of women in the field with the percentage of prestigious female publications will partly reflect the fact that past discrimination means the oldest, and most influential, segment of the discipline is disproportionately male. In other words, even assuming that all discrimination and bias magically vanished in the year 2000, one would still expect to find men being published and cited at a greater rate than women, for the simple reason that recently eliminated barriers to female participation skew female representation toward the less experienced parts of the discipline. By breaking down authors by their professorial rank, the study is able to minimize the extent to which this issue affects its conclusions.

[Figure: percentage of female authorship by professorial rank.]
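To see why this stratification matters, consider a toy example (all numbers below are invented, not taken from the study): the aggregate share of female authorship can look low even when, within every rank, women publish exactly in proportion to their headcount, simply because women are concentrated in the junior, less-published ranks.

```python
# Toy illustration (invented numbers) of why breaking authorship down
# by professorial rank matters: within every rank women's share of
# publications matches their share of faculty, yet the aggregate
# numbers make women look underrepresented among authors, because the
# senior ranks publish the most and skew male.
import pandas as pd

df = pd.DataFrame({
    "rank":         ["full", "full", "assoc", "assoc", "asst", "asst"],
    "gender":       ["m", "f"] * 3,
    "faculty":      [80, 20, 60, 40, 50, 50],    # headcount per rank
    "publications": [400, 100, 180, 120, 50, 50],
})

# Within each rank the female publication share equals the female
# headcount share (20%, 40%, 50%): no within-rank disparity.
by_rank = df.pivot(index="rank", columns="gender")
print(by_rank["publications"]["f"] / by_rank["publications"].sum(axis=1))

# In aggregate, women are ~37% of faculty but only 30% of publications.
agg = df.groupby("gender")[["faculty", "publications"]].sum()
print(agg["faculty"] / agg["faculty"].sum())
print(agg["publications"] / agg["publications"].sum())
```

The rank breakdown is exactly what lets the study distinguish this kind of cohort-composition artifact from a genuine within-rank disparity; the aggregate comparison alone cannot.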

Importantly, in the discussion section (and throughout the paper) the study makes it clear that it takes this result to be evidence of bias. The Monkey Cage post was quite right to read the paper as alleging gender bias in publication. Yes, the study doesn’t claim to decide whether this bias results from female authors being rejected more frequently or from female authors being less likely to submit to the most prestigious journals, but in either case it assumes that the ultimate explanation is pernicious gender bias.

The paper also explores gender-based coauthorship and the relative prevalence of papers with all-male authors, mixed-gender authors, etc. These patterns are used to motivate various speculations about the fears women may face in choosing to coauthor. But there is no attempt to determine to what extent these patterns are simply the result of subfield and analysis-type preferences (e.g., quantitative and statistical analysis might lend itself more readily to coauthorship) combined with the relevant percentages of women in those subfields, and that undermines any attempt to use this data to support such speculations. While I believe that female scholars do face real concerns about being insufficiently credited as coauthors, the ways such concerns could play out are so varied that I don’t think we can use this data to draw the conclusion the study’s authors do: that women aren’t benefiting equally from trends toward coauthorship. However, I’m going to set this issue aside.

Political Science Hiring Biased Toward Women?

At this point one might be inclined to think this paper should get pretty good marks. Sure, I’ve identified a few concerns that aren’t fully addressed, but surely it makes a pretty good case for the claim of gender bias in political science? Unfortunately, that’s simply a mirage created by thinking about the data in exactly one way. Notice that one could equally well use the same data and analysis to draw the conclusion: Women Hired in Political Science Despite Fewer Publications. After all, the way one gets professorial jobs is by publishing papers, and this data suggests that women at the same professional level have fewer publications than their male colleagues.

Now I think there are multiple plausible ways of resisting the conclusion that this data shows a bias in favor of women in hiring. For one, if past discrimination means that men and women at the same professional level haven’t had the same amount of time to write papers (e.g., women are more likely to have just gotten the job), then the conclusion is suspect. For another, one might point out that not all the jobs assigned the same professorial rank in the study are really equivalent. There are further reasons to doubt this conclusion, but each and every one of them equally well undermines any support this data provides for claims of gender bias.

Ultimately, I think it’s safe to say that while this study shows that women publish in influential journals at a rate lower than their representation in the political science profession would suggest, it does little to identify a cause. If you came in with the prior that women are underrepresented in political science because they face bias and other obstacles, you’ll explain this effect in terms of bias and obstacles. In contrast, if you came in with the prior that women are still underrepresented in political science because of gender-related differences in ability or interest (which need not be negative; it could just as well be a greater affinity for some rival career option), then the data are perfectly compatible with women gravitating toward the more qualitative, less rigorous aspects of the profession and putting greater focus on teaching and other aspects of the job that don’t result in publications.

Frankly, I don’t know enough about political science to have much of an opinion on this point one way or the other. However, I do think we can safely mark this study down as misleading, at least insofar as it is cited as further evidence of gender bias against women. Don’t get me wrong: I find that a very plausible interpretation of the data. But in saying so I’m just sharing the prior I came in with, not reporting what the evidence has persuaded me of.


  1. For instance, differences in male/female performance might justify studying ways in which standard pedagogy favors/disfavors particular learning styles. Accurate empirical data on this subject, regardless of what it shows about gender, lets us correct the ways in which the current system may be unfair to those with particular learning styles, e.g., consider the recent evidence about how mandated attendance can actually hurt performance by those who don’t find the lecture component of a course useful. 
  2. For instance, I worry that it is precisely women’s better performance in high school mathematics, and their generally greater willingness to approach subjects as their high school (or early college) teachers intend rather than going their own way, that is responsible for some of the observed disaffinity toward studying higher mathematics among women. Ideally, one would teach mathematics by merely communicating the underlying ideas and allowing students to use their conceptual understanding to solve problems. However, few students have the interest and ability to, say, use their conceptual understanding of the derivative to find the maximum value of a given function, and the educational system is unwilling to abandon the idea that almost all college freshmen should be able to solve such problems. As a result, lower-level mathematics is forced to adopt a formulaic approach that favors rote memorization of algorithms, meaning that gaining real insight, and experiencing mathematics as an enjoyable puzzle, often requires rejecting the approach seen in the classroom and working things out on one’s own. I worry that we lose many women who might otherwise be interested in mathematics simply because they are more devoted to working within the framework they are given; because this devotion is largely seen as a positive trait, it gets neglected as a potential problem. If true, it might be that simple interventions, like explicitly encouraging students to deviate from the rote rules being taught if they understand enough to do so, could make a big difference.