STAT 260
Lec 11: Iteration
Owen G. Ward
Goals of this section
• Basic iteration
• apply and map functions
• If we don’t finish the slides, only the material we cover will be examinable
Load packages and datasets
library(tidyverse)
Reading
Reading:
• Iteration: Chapter 26 of the online textbook.
Useful Reference:
• purrr cheatsheet: on the Posit website and on Canvas
Iterating over a vector
• For loops allow iteration.
• A common scenario for iteration is that our data is in a vector
and we want to perform the same operation on each element.
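As a minimal sketch of this scenario (the vector temps_f and the unit conversion are made up for illustration), a for() loop can apply the same operation to each element while filling a pre-allocated output vector:

```r
# Hypothetical data: temperatures in degrees Fahrenheit
temps_f <- c(32, 50, 68, 212)

# Pre-allocate the output, then fill it one element at a time
temps_c <- vector("double", length(temps_f))
for (i in seq_along(temps_f)) {
  temps_c[i] <- (temps_f[i] - 32) * 5 / 9
}
temps_c
```

Using seq_along() rather than 1:length() keeps the loop safe when the input vector is empty.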
Iteration
• Such iteration is so common that special tools have been developed with the aim of
reducing the amount of code (and therefore errors) required for common iterative tasks.
– Tools in base R include the apply() family of functions.
– A tidyverse package called purrr includes more.
Example data
• To illustrate iteration we can simulate data and fit four regression models (you don’t
have to understand them).
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = .5)
y2 <- x1 + x2 + rnorm(n, sd = .5)
y3 <- x2 + rnorm(n, sd = .5)
y4 <- rnorm(n, sd = .5)
rr <- list(
  fit1 = lm(y1 ~ x1 + x2),
  fit2 = lm(y2 ~ x1 + x2),
  fit3 = lm(y3 ~ x1 + x2),
  fit4 = lm(y4 ~ x1 + x2)
)
coef(rr$fit1)
(Intercept) x1 x2
0.0008831357 0.9281453769 0.0426465892
Exercise 1
• The elements of the list rr from the previous slide are lm objects. The function coef() is
generic. Assign class “lm_vec” to rr and write a coef() method for objects of this class.
(Hint: Your function could include a for() loop like that below. The output of coef() taking
rr as input should be the same as the output from the for loop.)
for (i in seq_along(rr)) {
  print(coef(rr[[i]]))
}
(Intercept) x1 x2
0.0008831357 0.9281453769 0.0426465892
(Intercept) x1 x2
0.01572372 1.03114836 1.00306653
(Intercept) x1 x2
-0.06641184 0.04316514 0.93035180
(Intercept) x1 x2
-0.008394232 -0.018428268 -0.116309416
Exercise 1: Solution
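One possible approach, offered as a sketch rather than as the official solution: it rebuilds a smaller stand-in for rr (two fits instead of four, with a seed chosen here for illustration) so that the block runs on its own. Returning the coefficients invisibly is an added design choice, not something the exercise requires.

```r
# Small stand-in for the list rr from the slides (two fits instead of four)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = .5)
y2 <- x1 + x2 + rnorm(n, sd = .5)
rr <- list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2))
class(rr) <- "lm_vec"

# coef() method for class lm_vec: print the coefficients of each fit
coef.lm_vec <- function(object, ...) {
  out <- vector("list", length(object))
  names(out) <- names(object)
  for (i in seq_along(object)) {
    out[[i]] <- coef(object[[i]])
    print(out[[i]])
  }
  invisible(out) # return the coefficients without printing them again
}

coef(rr) # dispatches to coef.lm_vec()
```

Because coef() is generic, defining a function named coef.lm_vec is enough for coef(rr) to find it once rr has class lm_vec.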
Extracting the regression coefficient for x1
• Using a for() loop, we initialize an object to hold the output, loop along a sequence of
values for an index variable, and execute the body for each value of the index variable.
betahat <- vector("double", length(rr))
for (i in seq_along(rr)) {
  betahat[i] <- coef(rr[[i]])["x1"]
}
betahat
[1] 0.92814538 1.03114836 0.04316514 -0.01842827
Looping over elements of a set
• The index set in the for() loop can be general.
• We might use this generality to loop over named components of a list.
fits <- paste0("fit", 1:4)
for (ff in fits) {
  print(coef(rr[[ff]])["x1"])
}
x1
0.9281454
x1
1.031148
x1
0.04316514
x1
-0.01842827
• Looping over a set makes it harder to save the results, though.
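One way around this, sketched under the assumption that the set consists of names, is to pre-allocate a named output vector and assign into it by name. The data and the object names (d, fits_list, fit_names) below are hypothetical stand-ins for rr.

```r
# Hypothetical stand-in for rr: two small regression fits
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y1 <- d$x1 + rnorm(50, sd = 0.5)
d$y2 <- d$x2 + rnorm(50, sd = 0.5)
fits_list <- list(
  fit1 = lm(y1 ~ x1 + x2, data = d),
  fit2 = lm(y2 ~ x1 + x2, data = d)
)

# Pre-allocate a named double vector, then fill it by name
fit_names <- names(fits_list)
betahat <- setNames(vector("double", length(fit_names)), fit_names)
for (ff in fit_names) {
  betahat[[ff]] <- coef(fits_list[[ff]])[["x1"]]
}
betahat
```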
Avoid growing vectors incrementally
means <- [Link](1000)
set.seed(123)
# slow: the output vector is copied each time it grows
system.time({
  output <- double()
  for (i in seq_along(means)) {
    n <- sample(100, 1)
    output <- c(output, rnorm(n, means[[i]]))
  }
})
user system elapsed
0.060 0.024 0.084
# faster: pre-allocate a list, fill it, then flatten once
system.time({
  out <- vector("list", length(means))
  for (i in seq_along(means)) {
    n <- sample(100, 1)
    out[[i]] <- rnorm(n, means[[i]])
  }
  out <- unlist(out)
})
user system elapsed
0.008 0.000 0.008
bind_cols() and bind_rows()
# bind_cols(); recall that the length(means) = 1000
out <- vector("list", length(means))
n <- 100
for (i in seq_along(means)) {
  out[[i]] <- rnorm(n, means[[i]])
}
out <- bind_cols(out)
dim(out)
[1] 100 1000
# bind_rows()
out <- vector("list", length(means))
for (i in seq_along(means)) {
  out[[i]] <- tibble(y = rnorm(n, means[[i]]), x = rnorm(n))
}
out <- bind_rows(out)
dim(out)
[1] 100000 2
The body of a loop can be a small part of the code
• In our examples, most of the code sets up the output and the loop; very little of it is
the body.
• To illustrate, consider a change: suppose instead of the estimated coefficient of x1 we
want that of x2:
betahat <- vector("double", length(rr))
for (i in seq_along(rr)) {
  betahat[i] <- coef(rr[[i]])["x2"]
}
betahat
[1] 0.04264659 1.00306653 0.93035180 -0.11630942
Exercise 2
• Write a for() loop to find the mode() of each column in nycflights13::flights.
Exercise 2: Solution
• Write a for() loop to find the mode() of each column in nycflights13::flights.
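A sketch of one possible approach, not the official solution. To keep the block runnable without the nycflights13 package, it uses a small hypothetical data frame df in place of nycflights13::flights; the same loop should work on the real tibble. Note that mode() reports how R stores an object, not the statistical mode.

```r
# Hypothetical stand-in for nycflights13::flights
df <- data.frame(
  year = c(2013L, 2013L),
  dep_delay = c(2.5, -1),
  carrier = c("UA", "AA"),
  stringsAsFactors = FALSE
)

# Pre-allocate a named character vector, one entry per column
col_modes <- vector("character", ncol(df))
names(col_modes) <- names(df)
for (i in seq_along(df)) {
  col_modes[i] <- mode(df[[i]])
}
col_modes
```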
Using lapply()
• The intent of lapply() is to take care of the output and the loop, allowing us to focus
on the body.
b1fun <- function(fit) {
  coef(fit)["x1"]
} # body
lapply(rr, b1fun) # or sapply(rr,b1fun) or unlist(lapply(rr,b1fun))
$fit1
x1
0.9281454
$fit2
x1
1.031148
$fit3
x1
0.04316514
$fit4
x1
-0.01842827
bfun <- function(fit, cc) {
  coef(fit)[cc]
} # body
lapply(rr, bfun, "x1")
$fit1
x1
0.9281454
$fit2
x1
1.031148
$fit3
x1
0.04316514
$fit4
x1
-0.01842827
Exercise 3
• Re-write your coef() method for objects of class lm_vec to use lapply().
Exercise 3: Solution
• Re-write your coef() method for objects of class lm_vec to use lapply().
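A sketch of one possible approach, not the official solution. It rebuilds a smaller stand-in for rr (two fits, with a seed chosen here for illustration) so that the block is self-contained; calling unclass() before lapply() is an optional step added to make clear that the elements are handled as a plain list.

```r
# Small stand-in for the list rr from the slides (two fits instead of four)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = .5)
y2 <- x1 + x2 + rnorm(n, sd = .5)
rr <- list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2))
class(rr) <- "lm_vec"

# lapply() handles the output and the loop; coef() is the body
coef.lm_vec <- function(object, ...) {
  lapply(unclass(object), coef)
}

coef(rr)
```

Unlike the for() loop version, this returns a named list of coefficient vectors rather than printing each one in turn.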
Iterating with the map() functions from purrr
• The purrr package provides a family of functions map(), map_dbl(), etc. that do the
same thing as lapply() but work better with other tidyverse functions.
– map() returns a list, like lapply().
– map_dbl() returns a double vector, etc.
library(purrr)
map_dbl(rr, b1fun)
fit1 fit2 fit3 fit4
0.92814538 1.03114836 0.04316514 -0.01842827
# or rr |> map_dbl(b1fun) or map_dbl(rr,bfun,"x1")
Exercise 4
• Use map_chr() to return the mode() of each column of the nycflights13::flights
tibble.
• Use map() to return the summary() of each column of the nycflights13::flights
tibble.
Exercise 4: Solution
• Use map_chr() to return the mode() of each column of the nycflights13::flights
tibble.
Exercise 4: Solution
• Use map() to return the summary() of each column of the nycflights13::flights
tibble.
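A sketch covering both parts of Exercise 4, not the official solution. It assumes purrr is installed and uses a small hypothetical data frame df in place of nycflights13::flights so that it runs without that package.

```r
library(purrr)

# Hypothetical stand-in for nycflights13::flights
df <- data.frame(
  year = c(2013L, 2013L),
  dep_delay = c(2.5, -1),
  carrier = c("UA", "AA"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so map functions iterate over columns
col_modes <- map_chr(df, mode)        # named character vector
col_summaries <- map(df, summary)     # list with one summary per column

col_modes
col_summaries$dep_delay
```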
Pipes and map() functions
• Suppose we want to record a model summary returned by the summary() function.
– summary() applied to an lm() object computes regression summaries like standard
errors and model R².
rr |>
  map(summary) |>
  map_dbl(function(ss) {
    ss$r.squared
  })
fit1 fit2 fit3 fit4
0.78845184 0.91430933 0.73684218 0.04087594
• Notice that we can define a function on-the-fly in the call to a map() function.
• map() functions have a short-cut for function definitions.
rr |>
  map(summary) |>
  map_dbl(~ .$r.squared) # or map_dbl("r.squared")
fit1 fit2 fit3 fit4
0.78845184 0.91430933 0.73684218 0.04087594
• In ~ ., read ~ as “define a function” and . as “the argument to the function”.
Exercise 5
• Write a call to map_dbl() that does the same thing as map_dbl(rr,b1fun), but define
the function on the fly, as in the previous slide. You can use multiple calls to map()
functions.
Exercise 5: Solution
• Write a call to map_dbl() that does the same thing as map_dbl(rr,b1fun), but define
the function on the fly, as in the previous slide. You can use multiple calls to map()
functions.
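Two possible answers, sketched here rather than given as the official solution. The block assumes purrr is installed and R 4.1 or later for the |> pipe, and rebuilds a two-fit stand-in for rr with a seed chosen for illustration.

```r
library(purrr)

# Small stand-in for the list rr from the slides (two fits instead of four)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = .5)
y2 <- x1 + x2 + rnorm(n, sd = .5)
rr <- list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2))

# Option 1: a single call with the function defined on the fly
b1_formula <- map_dbl(rr, ~ coef(.)["x1"])

# Option 2: two calls, extracting the element named "x1" in the second
b1_two_step <- rr |>
  map(coef) |>
  map_dbl("x1")

b1_formula
b1_two_step
```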
Detour: The apply family of functions in R
• The “original” apply is apply(), which can be used to apply a function to rows or
columns of a matrix.
mat <- matrix(1:6, ncol = 2, nrow = 3)
mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
apply(mat, 1, sum) # row-wise sums; rowSums() is faster
[1] 5 7 9
apply(mat, 2, sum) # column-wise; colSums() is faster
[1] 6 15
Detour, cont.
• sapply() takes the output of lapply() and simplifies to a vector or matrix.
sapply(rr, coef)
fit1 fit2 fit3 fit4
(Intercept) 0.0008831357 0.01572372 -0.06641184 -0.008394232
x1 0.9281453769 1.03114836 0.04316514 -0.018428268
x2 0.0426465892 1.00306653 0.93035180 -0.116309416
Detour, cont.
• Other apply-like functions include vapply(), mapply(), tapply(), …
• These are less common.
– See their respective help pages for information.
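For orientation, here is a brief sketch of the three functions named above, using made-up inputs. It is only a starting point; the help pages remain the reference for their full behaviour.

```r
# vapply(): like sapply(), but the type and length of each result is declared
x <- list(a = 1:3, b = 4:6)
means_by_elem <- vapply(x, mean, numeric(1))

# mapply(): iterate over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# tapply(): apply a function to groups of values defined by a second vector
values <- c(1, 2, 3, 4)
groups <- c("g1", "g1", "g2", "g2")
group_sums <- tapply(values, groups, sum)

means_by_elem
pair_sums
group_sums
```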
Summary and Recap
Cheat sheets
• I will provide a cheat sheet of the tidyverse functions we have studied in this class
• This will be a mixture of all the Posit cheatsheets, so get used to using this before the
exam!
• Everything on it will be relevant to this course
• Sheet is available on Canvas today
Visualization with ggplot2
• Build a plot with ggplot layers:
– Start with ggplot() to specify default data and aesthetic mapping of x, y, color,
shapes, etc.
– Add geom_()’s, which can override default data and mapping.
– stat_()’s calculate statistical summaries such as smooths that we can add. Rather
than use the stat_()s directly, we tended to add summaries with built-in geoms,
such as geom_smooth().
– Faceting builds multiple plots by values of a faceting variable.
Data import and tidy
• Import with the read_ functions, such as read_csv().
– remember skip and comment arguments.
– read_ functions guess at how to parse columns of input and use parse_ functions.
– Best bet is to specify column types with col_types=cols().
• tibbles are an improved version of the R data.frame.
– Implemented as lists, so subset with [ and extract elements with [[
• Use dplyr’s five key verbs to wrangle:
1. filter() to select subsets of observations
2. arrange() to reorder rows
3. select() to select variables (remember helper functions like starts_with(),
ends_with() and contains())
4. mutate() to create new variables from existing ones, and
5. summarize() to calculate summary statistics (useful with group_by() to do split-
apply-combine)
Tidy Data
• In a tidy dataset,
– each variable has its own column,
– each observation has its own row, and
– each value has its own cell.
• Use pivot_longer() to make data “longer” (i.e., “taller”) and pivot_wider() to make
data “wider”.
Relational data: multiple tables
• Modern data often comes in multiple tables, called relational data.
• Keys are variables present in two tables that can be used to join them.
• The most common type of join is a “mutating join”, such as a left_join().
• semi_join() can be used for a “filtering join” in which we filter one table based on
characteristics of another.
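A small sketch of the two kinds of join, assuming dplyr is installed. The tables flights_small and airlines_small are hypothetical and share the key carrier.

```r
library(dplyr)

# Hypothetical tables sharing the key "carrier"
flights_small <- tibble(carrier = c("AA", "UA", "ZZ"), delay = c(5, 10, 3))
airlines_small <- tibble(carrier = c("AA", "UA"), name = c("American", "United"))

# Mutating join: keeps every row of the first table and adds matching columns
joined <- left_join(flights_small, airlines_small, by = "carrier")

# Filtering join: keeps rows of the first table that have a match, adds no columns
kept <- semi_join(flights_small, airlines_small, by = "carrier")

joined
kept
```

In joined, the unmatched carrier "ZZ" is kept with a missing name, while kept drops it entirely.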
Working with strings
• Fixed, or literal strings, like fish:
– count the number of characters in a string
– detect (yes/no) or find (starting position) substrings
– extract and substitute substrings
– split and combine strings
• Regular expressions specify string patterns, like f[aeiou]sh:
– detect, find, extract and substitute
• Use tools from the stringr package
Factors
• Factors are categorical variables, implemented as an integer vector with levels.
• The forcats package provides tools for working with factor levels.
• Use fct_recode() to rename or collapse factor levels.
• Use fct_relevel() to partially or completely re-order a factor’s levels.
• Use fct_reorder() to reorder levels by a second variable.
Dates and Times
• Moments in time can be dates, times, or date-times.
• The lubridate package contains functions to coerce strings to date, time, or date-time
objects:
– ymd() to coerce data in year-month-day order, mdy() to coerce data in month-day-year
order, ymd_hm() to coerce data in year-month-day-hour-minute order, etc.
• make_datetime() makes a date-time object from components. (make_date() if we do
not want a time component.)
• hour(), minute(), etc. extract components.
• Date-time data include a time zone. To set a time zone with the lubridate time
functions, use the tz argument.
• Easy to summarize and plot date-time objects.
Functions
• The pipe |> is useful for combining a linear sequence of data processing steps, when we
won’t need the intermediate steps and do not want to save the intermediate tibble(s).
• Encapsulating code in a function has several advantages:
– can be used multiple times on different inputs, and
– can compartmentalize computations and give them a name.
• We discussed when to write a function and the components of a function:
– the code inside the function, or body,
– the list of arguments to the function, and
– a data structure called an environment inside the function.
• Generic functions behave differently depending on the class of input. They call a method,
which is itself a function (so a “subfunction”), depending on the class of input.
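As a reminder of how dispatch works, here is a minimal sketch. The generic describe() and its methods are hypothetical names invented for this illustration.

```r
# A generic function: UseMethod() picks a method based on the class of x
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class
describe.default <- function(x, ...) "an object"
describe.lm <- function(x, ...) "a fitted linear model"

describe(1:3)                            # no integer method, so default is used
describe(lm(mpg ~ wt, data = mtcars))    # class "lm", so describe.lm() is used
```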
Vectors and iteration
• Vectors can be either atomic vectors or lists.
– The elements of an atomic vector must be the same type.
– Lists can contain elements of different types.
• Use vector() to create an empty vector, or c() and list() to construct from data.
– Vector elements can be named.
• Subset with [ or by name.
• Extract individual elements with [[, or $ for named objects.
• Combine subsetting and assignment to change the value of vectors.
• Iterate over a vector with a for() loop, lapply() or map() functions.
– Remember shortcuts for specifying a function to use with a map() function.
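The points above can be sketched in a few lines of base R. The objects v and l are made up for illustration.

```r
# A named atomic vector and a list holding different types
v <- c(a = 1, b = 2, c = 3)
l <- list(num = 1:3, chr = "fish")

v[c("a", "c")]  # subset by name: returns a shorter named vector
v[["b"]]        # extract a single element
l$chr           # extract a named element of a list

# Combine subsetting and assignment to change a value
v["b"] <- 20

# Iterate over the vector, declaring that each result is one double
doubled <- vapply(v, function(el) el * 2, numeric(1))
doubled
```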
The end!
• If you learn and are proficient with the tools summarized above, you’re well on your way
to becoming a data scientist!
• I hope that you enjoyed the course! Feedback is of course welcome and please fill out
the Course Experience Survey!
• This course is one of the prerequisites for several upper level courses, including
STAT 452, STAT 475, etc.