21 Pipe

Author

Jarad Niemi

In this chapter, we will discuss the basics of wrangling an individual data set.

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.2          ✔ tibble    3.3.0     
✔ lubridate 1.9.4          ✔ tidyr     1.3.1     
✔ purrr     1.0.4          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Pipe operators allow a data pipeline to be written in R code that is relatively easy to understand, because the steps are carried out sequentially from top to bottom.

For example, can you guess what the following code does?

# Example data pipeline
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n    = n(),
    mean = mean(len),
    sd   = sd(len),
    
    .groups = "drop"
  ) |>
  mutate(
    mean = round(mean, 2),
    sd   = round(sd, 2)
  ) |>
  arrange(mean)

This code is much easier to read than equivalent un-piped R code for two reasons: 1) the functions are written so that the first argument is always a data.frame, and 2) the pipe operator sends the result of the previous operation to the next operation.
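To make the comparison concrete, the sketch below writes a shortened version of the pipeline above both with and without the pipe. It assumes the dplyr package is installed; the object names `piped` and `nested` are introduced here only for illustration.

```r
library("dplyr")

# Piped version: steps are read from top to bottom
piped <- ToothGrowth |>
  group_by(supp, dose) |>
  summarize(mean = mean(len), .groups = "drop") |>
  arrange(mean)

# Un-piped version: the same steps must be read from the inside out
nested <- arrange(
  summarize(
    group_by(ToothGrowth, supp, dose),
    mean    = mean(len),
    .groups = "drop"
  ),
  mean
)

# Check that both versions produce the same result
all.equal(piped, nested)
```

In the nested version, the first operation performed (`group_by()`) is buried in the middle of the expression, which is what makes longer pipelines hard to follow without the pipe.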

The pipe operator is a relatively new feature of R. The base R version |> was introduced in May 2021 (R version 4.1.0), the magrittr version %>% was introduced in Dec 2013, and the very first version was introduced in a Stack Overflow post in Jan 2012.

The idea behind the pipe operator is simple: it passes the value on its left-hand side as the first argument to the function on its right-hand side.

# Calculate mean
c(1, 2, 3, 4) |> mean()

[1] 2.5
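As a small check of this idea, the sketch below compares a piped call with the ordinary function call it stands for. It uses only base R (version 4.1.0 or later); the names `x`, `piped`, `direct`, and `parsed` are introduced here for illustration.

```r
x <- c(1, 2, 3, 4)

# The piped call and the ordinary call give the same value
piped  <- x |> mean()
direct <- mean(x)
identical(piped, direct)

# The base R pipe is rewritten when the code is parsed, so the
# quoted piped expression is already an ordinary function call
parsed <- quote(x |> mean())
print(parsed)
# mean(x)
```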

Because the pipe only fills in the first argument, we can still supply additional arguments to the function on its right-hand side.

# Calculate mean, ignore missing values
c(1, 2, NA, 4) |> mean(na.rm = TRUE)

[1] 2.333333
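Some functions do not take the piped value as their first argument. As a hedged aside, base R 4.2.0 and later provide the `_` placeholder, which sends the left-hand side to a named argument instead. The sketch below assumes R 4.2.0 or later; the model and the names `fit` and `same` are chosen only for illustration.

```r
# lm() takes a formula first, so the data are passed to the
# named argument `data` through the `_` placeholder (R >= 4.2.0)
fit <- ToothGrowth |>
  lm(len ~ dose, data = _)

# Equivalent call without the pipe
same <- lm(len ~ dose, data = ToothGrowth)

# Both calls estimate the same coefficients
all.equal(coef(fit), coef(same))
```

The placeholder must be paired with a named argument, as with `data = _` here.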

Piping is especially useful when combining a series of operations where the pipe is used after each operation.

# Remove missing values, then calculate mean
c(1, 2, NA, 4) |> 
  na.omit() |>
  mean()

[1] 2.333333
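To illustrate why chaining helps, the sketch below writes a three-step calculation in two ways: with the pipe, and with intermediate variables. It uses only base R; the names `piped`, `no_missing`, `avg`, and `stepwise` are introduced here for illustration.

```r
# With the pipe: one step per line, no intermediate names needed
piped <- c(1, 2, NA, 4) |>
  na.omit() |>
  mean() |>
  round(digits = 1)

# Without the pipe: each intermediate result needs its own name
no_missing <- na.omit(c(1, 2, NA, 4))
avg        <- mean(no_missing)
stepwise   <- round(avg, digits = 1)

# Both approaches give the same answer
identical(piped, stepwise)

piped
# [1] 2.3
```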

We will use the pipe operator extensively in data pipelines.