In this chapter, we will discuss the basics of wrangling an individual data set.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
24.1 Summarize
Previously we have seen the summary() function which can be used to provide default summaries for numeric and factor variables. The summarize() function can be used to calculate user-determined values. One commonly used function is n() which counts the number of observations in the data.frame.
ToothGrowth |>summarize(n =n(), # number of observationsmean_len =mean(len),sdlen =sd(len) )
n mean_len sdlen
1 60 18.81333 7.649315
24.1.1 Group
Especially when summarizing the data.frame, we can use the group_by() function to allow the summarization to happen within each combination of the group by variables.
The summarize() function requires that each argument returns a single value. Most of the time this is what you want, but sometimes you want more flexibility. If you try to use summarize() you will receive a warning.
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'supp', 'dose'. You can override using the
`.groups` argument.