24 Summarize

Author

Jarad Niemi

In this chapter, we will discuss the basics of wrangling an individual data set.

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4.9000     ✔ readr     2.1.5     
✔ forcats   1.0.0          ✔ stringr   1.5.1     
✔ ggplot2   3.5.2          ✔ tibble    3.3.0     
✔ lubridate 1.9.4          ✔ tidyr     1.3.1     
✔ purrr     1.0.4          
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

24.1 Summarize

Previously we saw the summary() function, which provides default summaries for numeric and factor variables. The summarize() function instead calculates user-specified summaries. One commonly used helper is n(), which counts the number of observations in the data.frame (or, for grouped data, in the current group).

ToothGrowth |>
  summarize(
    n        = n(),       # number of observations
    mean_len = mean(len), # sample mean of tooth length
    sdlen    = sd(len)    # sample standard deviation of tooth length
  )

   n mean_len    sdlen
1 60 18.81333 7.649315
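Any function that returns a single value can be used inside summarize() in the same way. The following is a minimal sketch using the same ToothGrowth data; the object name overall and the column names median_len, min_len, and max_len are chosen here for illustration only.

```r
library("dplyr")

# One-row summary of tooth length using several single-value functions
overall <- ToothGrowth |>
  summarize(
    n          = n(),         # number of observations
    median_len = median(len), # sample median
    min_len    = min(len),    # smallest observation
    max_len    = max(len)     # largest observation
  )

overall
```

Because every expression returns exactly one value, the result is a data.frame with a single row.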

24.1.1 Group

Summaries are often most useful within groups. The group_by() function causes the summarization to be performed separately for each combination of the grouping variables.

# Summarize by group
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    n        = n(),
    mean_len = mean(len),
    sdlen    = sd(len),
    
    .groups = "drop"      # removes grouping
  )

# A tibble: 6 × 5
  supp   dose     n mean_len sdlen
  <fct> <dbl> <int>    <dbl> <dbl>
1 OJ      0.5    10    13.2   4.46
2 OJ      1      10    22.7   3.91
3 OJ      2      10    26.1   2.66
4 VC      0.5    10     7.98  2.75
5 VC      1      10    16.8   2.52
6 VC      2      10    26.1   4.80
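As an aside, dplyr 1.1.0 and later also accept a .by argument in summarize() as a per-operation alternative to group_by(). The sketch below assumes such a version is installed; the object name by_group is illustrative. With .by, the result is never grouped, so .groups is not needed, and the groups should appear in the order they first occur in the data rather than sorted.

```r
library("dplyr")

# Per-operation grouping with .by (assumes dplyr >= 1.1.0)
by_group <- ToothGrowth |>
  summarize(
    n        = n(),
    mean_len = mean(len),
    sdlen    = sd(len),
    .by      = c(supp, dose) # grouping applies to this call only
  )

by_group
```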

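The summarize() function expects each expression to return a single value per group. Most of the time this is what you want, but sometimes you want more flexibility. If an expression returns more than one value per group, as quantile() does below for two probabilities, summarize() still returns the rows but issues a deprecation warning.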

# Summarize returns 2 rows per group
p <- c(0.25, 0.75) # Q1 and Q3
ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    prob = p,
    qs   = quantile(len, probs = p) # full argument name; prob only works by partial matching
  )

Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

`summarise()` has grouped output by 'supp', 'dose'. You can override using the
`.groups` argument.

# A tibble: 12 × 4
# Groups:   supp, dose [6]
   supp   dose  prob    qs
   <fct> <dbl> <dbl> <dbl>
 1 OJ      0.5  0.25  9.7 
 2 OJ      0.5  0.75 16.2 
 3 OJ      1    0.25 20.3 
 4 OJ      1    0.75 25.6 
 5 OJ      2    0.25 24.6 
 6 OJ      2    0.75 27.1 
 7 VC      0.5  0.25  5.95
 8 VC      0.5  0.75 10.9 
 9 VC      1    0.25 15.3 
10 VC      1    0.75 17.3 
11 VC      2    0.25 23.4 
12 VC      2    0.75 28.8
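If one row per group is acceptable, the warning can be avoided while staying with summarize() by computing each quantile in its own column. This is a minimal sketch; the object name q_wide and the column names q1 and q3 are illustrative, and names = FALSE is used only to drop the "25%" and "75%" labels that quantile() attaches.

```r
library("dplyr")

# One row per group: each quantile is a separate single-value column
q_wide <- ToothGrowth |>
  group_by(supp, dose) |>
  summarize(
    q1 = quantile(len, probs = 0.25, names = FALSE), # first quartile
    q3 = quantile(len, probs = 0.75, names = FALSE), # third quartile
    .groups = "drop"                                 # removes grouping
  )

q_wide
```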

24.2 Reframe

The reframe() function is an alternative to summarize() that places no requirement on the number of rows returned per group. Unlike summarize(), it always returns an ungrouped data frame.

# Use reframe instead
ToothGrowth |>
  group_by(supp, dose) |>
  reframe(
    prob = p,
    qs   = quantile(len, probs = p)
  )

# A tibble: 12 × 4
   supp   dose  prob    qs
   <fct> <dbl> <dbl> <dbl>
 1 OJ      0.5  0.25  9.7 
 2 OJ      0.5  0.75 16.2 
 3 OJ      1    0.25 20.3 
 4 OJ      1    0.75 25.6 
 5 OJ      2    0.25 24.6 
 6 OJ      2    0.75 27.1 
 7 VC      0.5  0.25  5.95
 8 VC      0.5  0.75 10.9 
 9 VC      1    0.25 15.3 
10 VC      1    0.75 17.3 
11 VC      2    0.25 23.4 
12 VC      2    0.75 28.8
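When the same multi-row summary is needed repeatedly, one option is to wrap it in a helper function that returns a data frame and pass it to reframe() as an unnamed expression, so that its columns are unpacked into the result. The sketch below is one way to do this; the helper quantile_df() and the object name q_long are hypothetical names introduced here, not part of dplyr.

```r
library("dplyr")

# Hypothetical helper: returns one row per requested probability
quantile_df <- function(x, probs = c(0.25, 0.75)) {
  tibble::tibble(
    prob = probs,
    qs   = quantile(x, probs = probs, names = FALSE)
  )
}

# The unnamed data frame returned by the helper is unpacked into columns
q_long <- ToothGrowth |>
  group_by(supp, dose) |>
  reframe(quantile_df(len))

q_long
```

The result should have the same shape as the reframe() output above: two rows per group and no grouping attached.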