R Iteration Techniques in Statistics

STAT 260

Lec 11: Iteration

Owen G. Ward

Goals of this section

• Basic iteration
• apply and map functions
• Only what we cover will be examinable (if we don’t finish the slides)

Load packages and datasets

library(tidyverse)

Reading

Reading:

• Iteration: Chapter 26 of the online textbook.

Useful Reference:

• purrr cheatsheet: available on the Posit website and on Canvas

Iterating over a vector

• For loops allow iteration.


• A common scenario for iteration is that our data is in a vector
and we want to perform the same operation on each element.

Iteration

• Such iteration is so common that special tools have been developed with the aim of
reducing the amount of code (and therefore errors) required for common iterative tasks.

– Tools in base R include the apply() family of functions.


– A tidyverse package called purrr includes more.

Example data

• To illustrate iteration we can simulate data and fit four regression models (you don’t
have to understand them).

set.seed(42) # make the simulated data reproducible
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = .5)
y2 <- x1 + x2 + rnorm(n, sd = .5)
y3 <- x2 + rnorm(n, sd = .5)
y4 <- rnorm(n, sd = .5)
# fit the same two-predictor model to each response
rr <- list(
  fit1 = lm(y1 ~ x1 + x2),
  fit2 = lm(y2 ~ x1 + x2),
  fit3 = lm(y3 ~ x1 + x2),
  fit4 = lm(y4 ~ x1 + x2)
)
coef(rr$fit1)

(Intercept) x1 x2
0.0008831357 0.9281453769 0.0426465892

Exercise 1

• The elements of the list rr from the last slide are lm objects. The function coef() is generic.
Assign class “lm_vec” to rr and write a coef() method for objects of this class.

(Hint: Your function could include a for() loop like the one below. The output of coef() taking
rr as input should be the same as the output from the for loop.)

for (i in seq_along(rr)) {
print(coef(rr[[i]]))
}

(Intercept) x1 x2
0.0008831357 0.9281453769 0.0426465892
(Intercept) x1 x2
0.01572372 1.03114836 1.00306653
(Intercept) x1 x2
-0.06641184 0.04316514 0.93035180
(Intercept) x1 x2
-0.008394232 -0.018428268 -0.116309416

Exercise 1: Solution
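
One possible solution, offered as a sketch rather than the official answer: the method below wraps the hinted for() loop. To keep the block runnable on its own it rebuilds a smaller stand-in for rr (two fits instead of four); the name fits_vec is made up for this illustration.

```r
# Stand-in for rr: two of the simulated regressions (assumption for illustration)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = 0.5)
y2 <- x1 + x2 + rnorm(n, sd = 0.5)
fits_vec <- list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2))

# Assign the new class; the object is still a list underneath
class(fits_vec) <- "lm_vec"

# coef() method for class "lm_vec": print the coefficients of each fit in turn
coef.lm_vec <- function(object, ...) {
  for (i in seq_along(object)) {
    print(coef(object[[i]]))
  }
  invisible(object) # return the input invisibly; the printing is the point
}

coef(fits_vec) # dispatches to coef.lm_vec()
```

Because the elements of the list are still ordinary lm objects, the inner call to coef() uses the usual method for a single fit.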

Extracting the regression coefficient for x1

• Using a for() loop, we initialize an object to hold the output, loop along a sequence of
values for an index variable, and execute the body for each value of the index variable.

betahat <- vector("double", length(rr))


for (i in seq_along(rr)) {
betahat[i] <- coef(rr[[i]])["x1"]
}
betahat

[1] 0.92814538 1.03114836 0.04316514 -0.01842827

Looping over elements of a set

• The index set in the for() loop can be general.


• We might use this generality to loop over named components of a list.

fits <- paste0("fit", 1:4)


for (ff in fits) {
print(coef(rr[[ff]])["x1"])
}

x1
0.9281454
x1
1.031148
x1
0.04316514
x1
-0.01842827

• Looping over a set makes it harder to save the results, though.
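
One way around this, sketched under the assumption that the names in the index set are unique, is to pre-allocate a named output vector and assign into it by name. The objects toy_fits and slopes are stand-ins invented for this illustration.

```r
# Stand-in for rr: two small regressions on simulated data (illustration only)
set.seed(1)
x1 <- rnorm(50)
ya <- x1 + rnorm(50, sd = 0.5)
yb <- rnorm(50, sd = 0.5)
toy_fits <- list(fit1 = lm(ya ~ x1), fit2 = lm(yb ~ x1))

# Pre-allocate the output and name its slots after the index set
fit_names <- names(toy_fits)
slopes <- vector("double", length(fit_names))
names(slopes) <- fit_names

# Loop over the names and store each result in the slot with the same name
for (ff in fit_names) {
  slopes[[ff]] <- coef(toy_fits[[ff]])[["x1"]]
}
slopes
```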

Avoid growing vectors incrementally

means <- [Link](1000)


set.seed(123)
system.time({
  output <- double()
  for (i in seq_along(means)) {
    n <- sample(100, 1)
    output <- c(output, rnorm(n, means[[i]])) # builds a new, longer vector every iteration
  }
})

   user  system elapsed
  0.060   0.024   0.084

system.time({
  out <- vector("list", length(means)) # pre-allocate one slot per mean
  for (i in seq_along(means)) {
    n <- sample(100, 1)
    out[[i]] <- rnorm(n, means[[i]])
  }
  out <- unlist(out) # flatten once, after the loop
})

   user  system elapsed
  0.008   0.000   0.008

bind_cols() and bind_rows()

# bind_cols(); recall that the length(means) = 1000


out <- vector("list", length(means))
n <- 100
for (i in seq_along(means)) {
out[[i]] <- rnorm(n, means[[i]])
}
out <- bind_cols(out)
dim(out)

[1] 100 1000

# bind_rows()
out <- vector("list", length(means))
for (i in seq_along(means)) {
out[[i]] <- tibble(y = rnorm(n, means[[i]]), x = rnorm(n))
}
out <- bind_rows(out)
dim(out)

[1] 100000 2

The body of a loop can be a small part of the code

• In our examples, most of the code is for setting up the output and looping, with very
little to do with the body.
• To illustrate, consider a change: suppose instead of the estimated coefficient of x1 we
want that of x2:

betahat <- vector("double", length(rr))


for (i in seq_along(rr)) {
betahat[i] <- coef(rr[[i]])["x2"]
}
betahat

[1] 0.04264659 1.00306653 0.93035180 -0.11630942

Exercise 2

• Write a for() loop to find the mode() of each column in nycflights13::flights.

Exercise 2: Solution

• Write a for() loop to find the mode() of each column in nycflights13::flights.
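
A sketch of one approach. To keep the block runnable without the nycflights13 package, it uses a small made-up data frame, toy_df, as a stand-in; replacing toy_df with nycflights13::flights should apply the same loop to the real data.

```r
# Made-up stand-in for nycflights13::flights (assumption for illustration)
toy_df <- data.frame(
  flight = c(1545L, 1714L, 1141L),
  carrier = c("UA", "UA", "AA"),
  delayed = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Pre-allocate a character vector with one named slot per column
col_modes <- vector("character", ncol(toy_df))
names(col_modes) <- names(toy_df)

# A data frame is a list of columns, so seq_along() indexes the columns
for (i in seq_along(toy_df)) {
  col_modes[i] <- mode(toy_df[[i]])
}
col_modes
```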

Using lapply()

• The intent of lapply() is to take care of the output and the loop, allowing us to focus
on the body.

b1fun <- function(fit) {


coef(fit)["x1"]
} # body
lapply(rr, b1fun) # or sapply(rr,b1fun) or unlist(lapply(rr,b1fun))

$fit1
x1
0.9281454

$fit2
x1
1.031148

$fit3
x1
0.04316514

$fit4
x1
-0.01842827

bfun <- function(fit, cc) {


coef(fit)[cc]
} # body

lapply(rr, bfun, "x1")

$fit1
x1
0.9281454

$fit2
x1
1.031148

$fit3
x1
0.04316514

$fit4
x1
-0.01842827

Exercise 3

• Re-write your coef() method for objects of class lm_vec to use lapply().

Exercise 3: Solution

• Re-write your coef() method for objects of class lm_vec to use lapply().
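
One possible version, again a sketch rather than the official solution. It rebuilds a two-fit stand-in for rr (named fits_vec here, a made-up name) so the block runs on its own. Unlike the for() loop version, which prints, this method returns the coefficients as a named list.

```r
# Stand-in for rr with class "lm_vec" (assumption for illustration)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = 0.5)
y2 <- x1 + x2 + rnorm(n, sd = 0.5)
fits_vec <- structure(
  list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2)),
  class = "lm_vec"
)

# coef() method for class "lm_vec": lapply() handles the output and the loop
coef.lm_vec <- function(object, ...) {
  # unclass() drops the "lm_vec" class so lapply() sees a plain list
  lapply(unclass(object), coef)
}

coefs <- coef(fits_vec)
coefs$fit1
```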

Iterating with the map() functions from purrr

• The purrr package provides a family of functions map(), map_dbl(), etc. that do the
same thing as lapply() but work better with other tidyverse functions.
– map() returns a list, like lapply().
– map_dbl() returns a double vector, etc.

library(purrr)
map_dbl(rr, b1fun)

       fit1        fit2        fit3        fit4
 0.92814538  1.03114836  0.04316514 -0.01842827

# or rr |> map_dbl(b1fun) or map_dbl(rr,bfun,"x1")

Exercise 4

• Use map_chr() to return the mode() of each column of the nycflights13::flights


tibble.
• Use map() to return the summary() of each column of the nycflights13::flights
tibble.

Exercise 4: Solution

• Use map_chr() to return the mode() of each column of the nycflights13::flights


tibble.

Exercise 4: Solution

• Use map() to return the summary() of each column of the nycflights13::flights


tibble.

Pipes and map() functions

• Suppose we want to record a model summary returned by the summary() function.


– summary() applied to an lm() object computes regression summaries like standard
errors and model R².

rr |>
map(summary) |>
map_dbl(function(ss) {
ss$[Link]
})

      fit1       fit2       fit3       fit4
0.78845184 0.91430933 0.73684218 0.04087594

• Notice that we can define a function on-the-fly in the call to a map() function.
• map() functions have a short-cut for function definitions.

rr |>
map(summary) |>
map_dbl(~ .$[Link]) # or map_dbl("[Link]")

      fit1       fit2       fit3       fit4
0.78845184 0.91430933 0.73684218 0.04087594

• In ~ ., read ~ as “define a function” and . as “the argument to the function”.

Exercise 5

• Write a call to map_dbl() that does the same thing as map_dbl(rr,b1fun), but define
the function on the fly, as in the previous slide. You can use multiple calls to map()
functions.

Exercise 5: Solution

• Write a call to map_dbl() that does the same thing as map_dbl(rr,b1fun), but define
the function on the fly, as in the previous slide. You can use multiple calls to map()
functions.
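
A sketch of three equivalent answers, assuming the purrr package is installed. The list rr_small is a two-fit stand-in for rr, invented so that the block runs on its own.

```r
library(purrr)

# Stand-in for rr: two of the simulated regressions (assumption for illustration)
set.seed(42)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- x1 + rnorm(n, sd = 0.5)
y2 <- x1 + x2 + rnorm(n, sd = 0.5)
rr_small <- list(fit1 = lm(y1 ~ x1 + x2), fit2 = lm(y2 ~ x1 + x2))

# 1. Anonymous function defined in the call
b_anon <- map_dbl(rr_small, function(fit) coef(fit)[["x1"]])

# 2. Formula shortcut: ~ defines the function, . is its argument
b_formula <- map_dbl(rr_small, ~ coef(.)[["x1"]])

# 3. Two map steps: extract all coefficients, then pick out x1
b_two_step <- rr_small |>
  map(coef) |>
  map_dbl(~ .[["x1"]])

b_anon
```

All three keep the names of the input list, so the result is labelled fit1, fit2 just like the output of map_dbl(rr, b1fun).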

Detour: The apply family of functions in R

• The “original” apply is apply(), which can be used to apply a function to rows or
columns of a matrix.

mat <- matrix(1:6, ncol = 2, nrow = 3)


mat

[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

apply(mat, 1, sum) # row-wise sums; rowSums() is faster

[1] 5 7 9

apply(mat, 2, sum) # column-wise; colSums() is faster

[1] 6 15

Detour, cont.

• sapply() takes the output of lapply() and simplifies to a vector or matrix.

sapply(rr, coef)

                    fit1       fit2        fit3         fit4
(Intercept) 0.0008831357 0.01572372 -0.06641184 -0.008394232
x1          0.9281453769 1.03114836  0.04316514 -0.018428268
x2          0.0426465892 1.00306653  0.93035180 -0.116309416

Detour, cont.

• Other apply-like functions include vapply(), mapply(), tapply(), …
• These are less common.
– See their respective help pages for information.
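
For orientation, here is a minimal base R sketch of what each of these three functions does. The toy inputs (scores, group_means and so on) are invented for this illustration.

```r
# vapply(): like sapply(), but the type and length of each result is declared
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))
score_means <- vapply(scores, mean, FUN.VALUE = numeric(1))

# mapply(): iterate over several vectors in parallel
pair_sums <- mapply(function(m, s) m + s, 1:3, 4:6)

# tapply(): apply a function to the values within each group
values <- c(1, 2, 3, 4)
groups <- c("g1", "g1", "g2", "g2")
group_means <- tapply(values, groups, mean)

score_means
pair_sums
group_means
```

Declaring FUN.VALUE makes vapply() fail loudly if a result has an unexpected type or length, which is why it is often preferred to sapply() inside functions.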

Summary and Recap

Cheat sheets

• I will provide a cheat sheet of the tidyverse functions we have studied in this class
• This will be a mixture of all the Posit cheatsheets, so get used to using this before the
exam!
• Everything on it will be relevant to this course
• Sheet is available on Canvas today

Visualization with ggplot2

• Build a plot with ggplot layers:


– Start with ggplot() to specify default data and aesthetic mapping of x, y, color,
shapes, etc.
– Add geom_()’s, which can override default data and mapping.
– stat_()’s calculate statistical summaries such as smooths that we can add. Rather
than use the stat_()s directly, we tended to add summaries with built-in geoms,
such as geom_smooth().
– Faceting builds multiple plots by values of a faceting variable.

Data import and tidy

• Import with the read_ functions, such as read_csv().


– remember skip and comment arguments.
– read_ functions guess at how to parse columns of input and use parse_ functions.
– Best bet is to specify column types with col_types=cols().
• tibbles are an improved version of the R data.frame.
– Implemented as lists, so subset with [ and extract elements with [[.

• Use dplyr’s five key verbs to wrangle:


1. filter() to select subsets of observations
2. arrange() to reorder rows
3. select() to select variables (remember helper functions like starts_with(),
ends_with() and contains())
4. mutate() to create new variables from existing ones, and
5. summarize() to calculate summary statistics (useful with group_by() to do split-
apply-combine)

Tidy Data

• In a tidy dataset,
– each variable has its own column,
– each observation has its own row, and
– each value has its own cell.
• Use pivot_longer() to make data “longer” (i.e., “taller”) and pivot_wider() to make
data “wider”.

Relational data: multiple tables

• Data are often spread across multiple related tables, called relational data.


• Keys are variables present in two tables that can be used to join them.
• The most common type of join is a “mutating join”, such as a left_join().
• semi_join() can be used for a “filtering join” in which we filter one table based on
characteristics of another.

Working with strings

• Fixed, or literal strings, like fish:


– count the number of characters in a string
– detect (yes/no) or find (starting position) substrings
– extract and substitute substrings
– split and combine strings
• Regular expressions specify string patterns, like f[aeiou]sh:
– detect, find, extract and substitute
• Use tools from the stringr package

Factors

• Factors are categorical variables, implemented as an integer vector with levels.


• The forcats package provides tools for working with factor levels.
• Use fct_recode() to rename or collapse factor levels.
• Use fct_relevel() to partially or completely re-order a factor’s levels.
• Use fct_reorder() to reorder levels by a second variable.

Dates and Times

• Moments in time can be dates, times, or date-times.


• The lubridate package contains functions to coerce strings to date, time, or date-time
objects:

– ymd() to coerce data in year-month-day order, mdy() to coerce data in month-day-year
order, ymd_hm() to coerce data in year-month-day-hour-minute order, etc.

• make_datetime() makes a date-time object from components. (make_date() if we do
not want a time component.)

• hour(), minute(), etc. extract components.


• Date-time data include a time zone. To set a time zone with the lubridate time functions, use
the tz argument.
• Easy to summarize and plot date-time objects.

Functions

• The pipe |> is useful for combining a linear sequence of data processing steps, when we
won’t need the intermediate steps and do not want to save the intermediate tibble(s).
• Encapsulating code in a function has several advantages:

– can be used multiple times on different inputs, and


– can compartmentalize computations and give them a name.

• We discussed when to write a function and the components of a function:


– the code inside the function, or body,
– the list of arguments to the function, and
– the function’s environment, a data structure that determines where the function looks up variables.
• Generic functions behave differently depending on the class of input. They call a method,
which is itself a function (so a “subfunction”), depending on the class of input.
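
A minimal sketch of a generic function and its methods in base R. The generic describe() and its methods are hypothetical names made up for this illustration; cars is a dataset that ships with R.

```r
# The generic: UseMethod() picks a method based on the class of x
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no class-specific method exists
describe.default <- function(x, ...) "an object with no special method"

# Method for numeric vectors
describe.numeric <- function(x, ...) {
  paste("a numeric vector of length", length(x))
}

# Method for fitted linear models
describe.lm <- function(x, ...) {
  paste("a linear model with", length(coef(x)), "coefficients")
}

describe(c(1.5, 2.5))                   # uses describe.numeric()
describe("text")                        # falls back to describe.default()
describe(lm(dist ~ speed, data = cars)) # uses describe.lm()
```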

Vectors and iteration

• Vectors can be either atomic vectors or lists.
– The elements of an atomic vector must be the same type.
– Lists can contain elements of different types.
• Use vector() to create an empty vector, or c() and list() to construct from data.
– Vector elements can be named.
• Subset with [ or by name.

• Extract individual elements with [[, or $ for named objects.


• Combine subsetting and assignment to change the value of vectors.
• Iterate over a vector with a for() loop, lapply() or map() functions.

– Remember shortcuts for specifying a function to use with a map() function.
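
The points above can be sketched in a few lines of base R. All object names (v, lst, squares and so on) are invented for this illustration.

```r
# Atomic vector with named elements; a list with elements of different types
v <- c(a = 1, b = 2, c = 3)
lst <- list(num = 1:3, chr = "x")

# Subset with [ : the result has the same kind of container as the input
sub_v <- v[c("a", "c")] # named double vector of length 2
sub_lst <- lst["num"]   # list of length 1

# Extract a single element with [[ or $
one_v <- v[["b"]]       # the bare value 2
el_lst <- lst[["num"]]  # the integer vector 1:3
el_chr <- lst$chr       # the string "x"

# Combine subsetting and assignment to change a value
v[["b"]] <- 20

# Iterate with a for() loop over a pre-allocated output ...
squares <- vector("double", length(v))
for (i in seq_along(v)) {
  squares[i] <- v[[i]]^2
}

# ... or let lapply() handle the output and the loop
squares_l <- unlist(lapply(v, function(el) el^2))
squares
```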

The end!

• If you learn and are proficient with the tools summarized above, you’re well on your way
to becoming a data scientist!
• I hope that you enjoyed the course! Feedback is of course welcome and please fill out
the Course Experience Survey!
• This course is one of the prerequisites for several upper-level courses, including STAT 452, STAT
475, etc.
