CATESOL Book Review: Statistics for Linguists, An Introduction Using R

A Professional Organization Serving Teachers of English to Speakers of Other Languages in California

04/18/2020

Christie Sosa

by Kara Mac Donald and Jose Franco

Statistics for Linguists: An Introduction Using R (2020) by Bodo Winter takes a unique approach to introducing the statistics of linear models for linguistics, in that it builds model-based thinking instead of test-based thinking. Winter explains that he takes this approach to basic linear modeling because it gives researchers a foundation of theoretical understanding of the statistical model they choose to use. He also describes why he structured the book around R, as opposed to SPSS. The text does not concentrate on mathematics, but aims to provide the researcher with practical and relevant information in extremely accessible language.

Chapter 1, Introduction to R, is a very friendly step-by-step explanation of the basics of R. Using simple math functions, the chapter guides the reader through the basic functions needed to write an R script. Everything is presented in small, digestible chunks of information and structured so that the reader can follow along each step of the way. There are explanations of what may have been missed when the outcome isn't as expected, and frequent reminders that learning comes through trial. The chapter closes with sample exercises to practice what was presented.

Chapter 2 introduces the tidyverse, a collection of open-source R packages that share a common design and common data structures, which keeps data analysis tidy. The tidyverse packages can be loaded all at once or individually. The chapter then takes the reader step by step through the function and benefits of each package, starting with tibble and readr, followed by dplyr, magrittr, and ggplot2. The chapter also raises the need for research to be replicable and, even more fundamentally, reproducible. It notes that many fields have experienced a 'replication crisis', in which many prominent research results could not be replicated, and that the field of linguistics will likely soon experience the same. A discussion of the benefits of going open, as other fields have done, wraps up the chapter, and practice exercises close it.

Descriptive statistics, models, and distributions are addressed in Chapter 3, which opens with a straightforward analogy to models in daily life and why they are valuable. The reader is then eased into familiar terms, such as distribution, mean, median, quantile, range, and standard deviation, which are often not fully understood in terms of what they reflect in a data set. In clear, accessible language, these are explained step by step through examples of univariate distributions. As always, exercises follow for applying the concepts presented.

Chapter 4 provides the basis for all the other chapter topics in the book. It begins with a description of word comprehension responses as a function of word frequency, with a corresponding figure from published work. Linear regression is explained using basic concepts, followed by slopes, intercepts, fitted values, and residuals. The reader will wish he/she had had this explanation in high school math and/or a first statistics class. Next, these concepts are applied using R and tidyverse functions, with exercises for application at the end.
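
To make the terms slope, intercept, fitted values, and residuals concrete, here is a minimal sketch. It is not taken from the book, which works in R; it is written in Python, and the data values and the helper name `fit_line` are invented for illustration.

```python
# Hypothetical data (invented for illustration, not from the book):
# x = word frequency on some log scale, y = response time in milliseconds.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [880.0, 800.0, 790.0, 700.0, 680.0]


def fit_line(x, y):
    """Least-squares estimates for the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the predicted y when x is 0.
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(x, y)

# Fitted values are the model's predictions for each observed x.
fitted = [intercept + slope * xi for xi in x]
# Residuals are what the model leaves unexplained: observed minus fitted.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print("intercept:", intercept)
print("slope:", slope)
print("fitted:", fitted)
print("residuals:", residuals)
```

With these made-up numbers the slope is negative, meaning that predicted response time decreases as frequency increases, and the residuals sum to zero, as they do for any least-squares line with an intercept.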

The use of linear transformations is examined in Chapter 5, which starts in an accessible manner by building on the example data analysis from the previous chapter. Correlation and linear and nonlinear transformations are articulately unpacked, followed by application in R and one practice exercise.

Chapter 6 describes models that have not one predictor or factor but several, and leads the reader to an understanding of multiple regression, as opposed to simple (single-predictor) linear regression. The chapter then maps this onto R and ends with one practice exercise.

Next, the possibility that different groups may yield different results is addressed by modeling responses based on categorical predictors. Chapter 7, as is customary, begins by examining an example from published research, this time one of the author's own studies, to walk readers through the process. A step-by-step example in R is then offered, followed by related discussion and, finally, practice exercises.

Chapter 8 eases the reader into the notion of interactions between predictors. It offers an example from work by the author and a co-author to show a context in which various factors influence one another, and how to represent the identified combinations of significant predictors/factors. By now, the reader can likely follow the author with ease through continuous-categorical, categorical-categorical, and continuous-continuous interactions. Of course, a couple of exercises follow.

Chapter 9 is devoted to inferential statistics and the role of sample estimates in making inferences about population parameters, which is in turn linked to null hypothesis significance testing (NHST). The author identifies three factors that crucially affect confidence: the magnitude of an effect, the variability in the data, and the sample size. Standardized effect size measures such as Cohen's d and Pearson's r draw on only two of these factors (the magnitude of an effect and the variability in the data), while significance testing draws on all three. Exercises are provided at the end of the chapter.
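
The contrast between an effect size and a test statistic can be sketched in a few lines. This example is not from the book, which works in R; it is written in Python, the data are invented, and the helper names `pooled_sd`, `cohens_d`, and `t_statistic` are hypothetical. It assumes two independent groups with a pooled standard deviation.

```python
import math
import statistics


def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))


def cohens_d(a, b):
    """Mean difference scaled by variability; sample size does not enter directly."""
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd(a, b)


def t_statistic(a, b):
    """Mean difference scaled by the standard error, which shrinks as n grows."""
    standard_error = pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical scores for two groups (invented for illustration).
small_a, small_b = [10.0, 12.0, 14.0], [8.0, 10.0, 12.0]
# The same values repeated four times: same means, more observations.
large_a, large_b = small_a * 4, small_b * 4

d_small, t_small = cohens_d(small_a, small_b), t_statistic(small_a, small_b)
d_large, t_large = cohens_d(large_a, large_b), t_statistic(large_a, large_b)

print("n = 3 per group:  d =", round(d_small, 3), " t =", round(t_small, 3))
print("n = 12 per group: d =", round(d_large, 3), " t =", round(t_large, 3))
```

In this sketch the effect size changes only slightly between the two runs (through the variance estimate), whereas the t statistic more than doubles, because only the latter is divided by a standard error that depends on sample size.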

Chapter 10, like the chapters before and after it, is concerned with inferential statistics, but it focuses on issues in significance testing. On the one hand, the chapter deals with common misinterpretations of p-values, such as taking them to represent the probability that the null hypothesis is true or the strength of an effect. On the other hand, it lists the types of errors common in significance testing: Type I errors (false positives), Type II errors (false negatives), Type M errors (wrong magnitude of an effect), and Type S errors (wrong sign of an effect). The role of meaningful comparisons and of stopping rules for data collection is also discussed.

Employing data from previous sections, Chapter 11 applies the concepts of null hypothesis significance testing (NHST) to regression modeling in order to communicate uncertainty in parameter estimates and predictions. The chapter is essentially intended as a source of practice with regression models and with plotting intervals.

Chapter 12 features logistic regression as a form of generalized linear model. Hypothetical data provided by the author is analyzed by means of logistic regression to show how it works. Special attention is drawn to log odds, or logits, and how they relate to probabilities.
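
The relationship between probabilities and log odds can be illustrated with a short sketch. It is not taken from the book, which works in R; it is written in Python, and the function names `logit` and `inv_logit` are chosen here for illustration.

```python
import math


def logit(p):
    """Convert a probability (0 < p < 1) into log odds."""
    return math.log(p / (1 - p))


def inv_logit(log_odds):
    """Convert log odds back into a probability (the logistic function)."""
    return 1 / (1 + math.exp(-log_odds))


# Probabilities below 0.5 map to negative log odds, 0.5 maps to zero,
# and probabilities above 0.5 map to positive log odds.
for p in (0.1, 0.5, 0.9):
    log_odds = logit(p)
    print(p, "->", round(log_odds, 3), "->", round(inv_logit(log_odds), 3))
```

Logistic regression estimates its coefficients on the log odds scale, so a conversion like `inv_logit` is what turns a model's prediction back into a probability.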

Chapter 13 analyzes Poisson regression as another type of generalized linear model. The author shows the usefulness of this model for count processes, which are common in linguistics research, most specifically in research on linguistic diversity and word usage, along with exposure variables. Additionally, the chapter summarizes the GLM framework and points out the three components common to linear regression, logistic regression, and Poisson regression.

Chapter 14 is devoted to mixed models, also called multilevel models, which are common in the language sciences and are an extension of regression. The independence assumption is presented and analyzed through diverse examples. The chapter also explains how to deal with non-independence by means of experimental design and averaging, introduces varying intercepts and varying slopes, and covers the interpretation of random effects and random effect correlations.

Chapter 15, like the previous chapter, focuses on mixed models, but it applies them to hypothetical experimental data to show how particular variables are affected by diverse sources of variation. The chapter therefore requires some familiarity with, and adherence to, the sequence of commands needed to extract particular information using the lme4 package. Likelihood ratio comparisons (deviance tests), the difference between probability and likelihood, remaining issues, and mixed logistic regression are also discussed.

As a final resource, Chapter 16 comprises a guide to the central topics of each chapter of the book. It also contains a set of reflections on model choice, that is, on how researchers choose a statistical model for their research. With this in mind, the author discusses, on the one hand, the cookbook approach, which leads students to think about which tests to employ rather than about how to express their hypotheses by means of a statistical model. On the other hand, he discusses stepwise regression, which is seen as a more automatic approach to model selection. The author suggests that researchers should have strong reasons for every predictor they include in a statistical model. In doing so, researchers must also consider their intentions when choosing a given model, which can be confirmatory or exploratory.

In closing, the book is an amazing resource for understanding both statistics and the use of R. The most beneficial aspect of the book is its ordinary, unadorned language and its dialogic relationship with the reader. The reader can feel as if the author, as teacher, is there with him/her, offering guidance and direction through the learning process. The summaries at the end of each chapter bring together what was presented in the chapter and/or show how it relates to coming or previous content. The additional practice exercises at the end of each chapter and the appendices are highly valuable.