3  Data Import And The Tidyverse

3.1 Learning Goals

By the end of this chapter, the main goals are:

  • read local data included with the project
  • recognize the difference between a data frame and a tibble
  • inspect a dataset before plotting it
  • distinguish wide data from long data
  • reshape data with pivot_longer() and pivot_wider()
  • use the core tidyverse verbs for common data preparation tasks
  • recognize which import examples should not run automatically because they require credentials or external services

3.2 Load Packages

The tidyverse is a collection of packages for importing, transforming, and visualizing data. Loading it makes packages such as readr, dplyr, tidyr, ggplot2, stringr, and tibble available.

library(tidyverse)
library(skimr)

3.3 Importing Local Data

Many projects begin with a local CSV file. CSV files are plain-text tables where columns are separated by commas. They are common because nearly every spreadsheet, statistical package, and database can export them.

The preferred tidyverse function is read_csv() from the readr package. Compared with base R’s read.csv(), it usually prints more clearly, guesses column types more consistently, and returns a tibble.

The course data live under Data/. The example below reads the Gapminder CSV file included with the project.

gap <- read_csv("Data/gapminder/gapminder.csv", show_col_types = FALSE)

gap

The object name gap now points to a table in memory. The assignment operator <- reads as “store the result on the right in the object named on the left.” Object names should be short enough to type but descriptive enough to understand later.

3.4 Tibbles And First Inspection

A tibble is the tidyverse version of a data frame. It is still a rectangular table with rows and columns, but it prints in a more controlled way. A tibble tells you the number of rows and columns, shows variable types, and avoids flooding the console with every row of a large dataset.

Start every new dataset with a quick inspection:

glimpse(gap)
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
skim(gap)
Data summary
Name gap
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
character 2
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 4 24 0 142 0
continent 0 1 4 8 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1979.50 17.27 1952.00 1965.75 1979.50 1993.25 2007.0 ▇▅▅▅▇
lifeExp 0 1 59.47 12.92 23.60 48.20 60.71 70.85 82.6 ▁▆▇▇▇
pop 0 1 29601212.32 106157896.74 60011.00 2793664.00 7023595.50 19585221.75 1318683096.0 ▇▁▁▁▁
gdpPercap 0 1 7215.33 9857.45 241.17 1202.06 3531.85 9325.46 113523.1 ▇▁▁▁▁

glimpse() comes from dplyr, which is loaded as part of the tidyverse. It gives a compact structural view of the data. skim() comes from the skimr package and gives a broader summary, including missingness and descriptive statistics. These checks are not just housekeeping. They help identify the unit of observation, variable types, unexpected missing values, and suspicious ranges before those problems enter a plot.

3.5 Base R Import And Inspection

Base R can also import data. The syntax is common in older examples and in some online answers:

gap_base <- read.csv("Data/gapminder/gapminder.csv")
str(gap_base)
head(gap_base)

3.6 Other Local File Types

CSV is only one possible format. Social science projects often use Excel, Stata, SPSS, or SAS files. The syntax below should be treated as a template. These chunks are non-executing unless the named files are present and the relevant packages are installed.

3.6.1 Excel

The function read_excel() comes from the readxl package.

library(readxl)

gap_excel <- read_excel(
  path = "Data/gapminder/gapminder.xlsx",
  sheet = "gapminder"
)

gap_excel

3.6.2 Stata

The functions for Stata, SPSS, and SAS files below come from the haven package.

library(haven)

state_legislatures <- read_dta("Data/state_policy/shor_mccarty_state_data.dta")

state_legislatures

3.6.3 SPSS

The example uses a small SPSS file that is installed with the haven package.

library(haven)

path <- system.file("examples", "iris.sav", package = "haven")
survey_data <- read_sav(path)

survey_data

3.6.4 SAS

The example uses a small SAS file that is installed with the haven package.

library(haven)

path <- system.file("examples", "iris.sas7bdat", package = "haven")
admin_data <- read_sas(path)

admin_data

3.7 Wide And Long Data

Data can be organized in more than one shape. A wide dataset has one row per unit and a separate column for each measurement (gdpPercap, lifeExp). A long dataset stacks the measurement names into one column and the values into another. Wide data is often easier for people to read; long data is often easier for ggplot2 and group-wise summaries because the measurement name itself can be mapped to color, facets, or groups.

pivot_longer() and pivot_wider() from tidyr move data between the two shapes.

wide_data <- tibble(
  country = c("Afghanistan", "Brazil"),
  year = c(2007, 2007),
  gdpPercap = c(974.6, 9065.8),
  lifeExp = c(43.8, 72.4)
)

long_data <- wide_data |>
  pivot_longer(
    cols = c(gdpPercap, lifeExp),
    names_to = "variable",
    values_to = "value"
  )

long_data
long_data |>
  pivot_wider(names_from = variable, values_from = value)

The pipe operator |> means “take the result so far and send it into the next function.” It lets a sequence of data steps read in the same order as the logic of the analysis.

3.8 Core Tidyverse Verbs

Most data preparation before plotting uses a small set of verbs:

  • filter() keeps rows
  • select() keeps columns
  • mutate() creates or modifies variables
  • group_by() defines groups
  • summarize() collapses rows into summaries
  • joins combine information from two tables

These verbs come from dplyr, which is loaded as part of the tidyverse. They are useful because they correspond to ordinary analytical questions. Which cases belong in this plot? Which variables matter? Do we need a new rate, logged value, or category? What is the summary by group? Does another table contain information that should be merged in?

3.9 filter(): Keep Rows

filter() keeps observations that satisfy logical conditions. The operator == tests exact equality. The operator %in% tests whether a value belongs to a set.

gap |>
  filter(year == 2007)
gap |>
  filter(continent %in% c("Americas", "Europe"))

Use == for one value. Use %in% when several values are acceptable.

3.10 select(): Keep Columns

select() keeps variables. This is useful when a dataset has many columns but the next step only needs a few.

gap |>
  select(country, continent, year, lifeExp, gdpPercap)

Rows are observations. Columns are variables. filter() works on rows; select() works on columns.

3.11 mutate(): Create Variables

mutate() creates new variables or rewrites existing ones. In the next example, round() makes GDP per capita easier to read, and log10() creates a transformed variable that will be useful for plotting.

gap |>
  mutate(
    gdp_pc_round = round(gdpPercap),
    log_gdp_pc = log10(gdpPercap)
  ) |>
  select(country, gdpPercap, gdp_pc_round, log_gdp_pc)

round(gdpPercap) rounds each value in the gdpPercap column. log10(gdpPercap) takes the base-10 logarithm. The original variable remains unless you explicitly overwrite it.

3.12 group_by() And summarize(): Aggregate Data

Many plots need data at a different level than the imported file. For example, the Gapminder data are country-year observations, but we may want a continent-level summary.

continent_summary <- gap |>
  group_by(continent) |>
  summarize(
    median_life_exp = median(lifeExp, na.rm = TRUE),
    median_gdp_pc = median(gdpPercap, na.rm = TRUE),
    n_countries = n_distinct(country)
  )

continent_summary

na.rm = TRUE tells functions such as median() to ignore missing values. Without it, one missing value can make a summary missing.

3.13 Readable Pipelines

Long pipelines are easier to read when each verb gets its own line and arguments are indented consistently.

gap |>
  filter(year == 2007) |>
  mutate(gdp_pc_round = round(gdpPercap)) |>
  group_by(continent) |>
  summarize(
    median_life_exp = median(lifeExp, na.rm = TRUE),
    median_gdp_pc = median(gdp_pc_round, na.rm = TRUE)
  ) |>
  arrange(desc(median_life_exp))

The line breaks are not just style. They make the data workflow visible. Each line is one transformation.

3.14 Joins

Joins combine two tables using one or more key variables. A common pattern is to keep all rows from the main table and add matching columns from a smaller lookup table. That is a left_join(), another dplyr function.

region_notes <- tibble(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  example_note = c(
    "Many countries, large range of outcomes",
    "Includes North, Central, and South America",
    "Large population range",
    "High median life expectancy in many examples",
    "Small number of countries in many teaching datasets"
  )
)

continent_summary |>
  left_join(region_notes, by = "continent")

Rows without a match in the lookup table would remain, but the new variables would be NA. That behavior is usually safer than silently dropping observations.

3.15 Short Exercise

Use the built-in mtcars dataset. Write a pipeline that:

  1. converts mtcars to a tibble with row names stored as a column named car
  2. groups by cyl
  3. calculates average mpg and maximum wt
  4. returns one row per value of cyl
# Write your code here.

3.16 Extras: Google Sheets

Google Sheets can be useful when data are collaboratively maintained. These examples require authentication and should not run automatically in a rendered course book.

library(googlesheets4)

sheet_data <- read_sheet(
  ss = "PASTE-GOOGLE-SHEET-ID-OR-URL-HERE",
  sheet = "Sheet1"
)

sheet_data
library(googlesheets4)

ss <- gs4_create("Visualization 101 2026 Example Sheet")

mtcars |>
  as_tibble(rownames = "car") |>
  sheet_write(ss = ss, sheet = "mtcars")

3.17 Census Data With tidycensus

Some data sources are best accessed through specialized packages. These examples are templates, not automatically running chunks.

The tidycensus package wraps the U.S. Census Bureau’s APIs. The two most common entry points are the American Community Survey (get_acs()) and the decennial census (get_decennial()). A free Census API key is required. Request one at https://api.census.gov/data/key_signup.html, then register it once on your machine with census_api_key("YOUR_KEY", install = TRUE). After that, the key is available to every R session and the line can be removed from teaching files.

library(tidycensus)

#census_api_key("2c5ba48be91062db570c7fc5fa49dbca03306c33", install = TRUE)

load_variables() lists the variables available in a given dataset and year. Caching the result avoids re-downloading the table.

acs_vars <- load_variables(2020, "acs5", cache = TRUE) |>
  distinct(concept, .keep_all = TRUE)

acs_vars

If load_variables() errors with a JSON parse message like invalid char in json text. <html>, the Census Bureau returned an HTML error page instead of the variable table. That usually means the requested year is not yet published for that dataset, the Census server is temporarily down, or the request was rate-limited. Try a slightly older year (for ACS 5-year, two years behind the current year is usually safe), re-run the call after a minute, or check https://www.census.gov/data/developers/data-sets.html for the dataset’s release schedule.

The variables argument of get_acs() accepts named variable codes. Naming the entries gives readable column names instead of raw codes like B19013_001. output = "wide" puts each variable in its own column, which is usually easier to plot from than the default long format.

acs_raw <- get_acs(
  geography = "congressional district",
  variables = c(
    median_income = "B19013_001",
    total_pop = "B01003_001"
  ),
  year = 2022,
  survey = "acs5",
  output = "wide"
)

acs_raw

The GEOID column is a concatenation of geographic identifiers. For congressional districts it combines the two-digit state FIPS code with the two-digit district number. Splitting it with str_sub() gives joinable columns.

acs <- acs_raw |>
  mutate(
    fips = str_sub(GEOID, 1, 2),
    cd = str_sub(GEOID, 3, 4)
  )

acs

State-level summaries are simpler because the GEOID is just the state FIPS code. The wide format makes it easy to compare one variable against another.

acs_state <- get_acs(
  geography = "state",
  variables = c(
    median_income = "B19013_001",
    median_rent = "B25064_001"
  ),
  year = 2022,
  survey = "acs5",
  output = "wide"
)

acs_state |>
  summarize(
    n_states = n(),
    cor_income_rent = cor(median_incomeE, median_rentE, use = "complete.obs")
  )

Note that the wide variable names get an E suffix for estimates and an M suffix for the margin of error. Both are usually returned. For most teaching purposes the E columns are enough; in published work the M columns should be acknowledged.

The decennial census uses get_decennial() instead. It returns counts rather than estimates and does not need a survey argument.

pop_2020 <- get_decennial(
  geography = "state",
  variables = c(total_pop = "P1_001N"),
  year = 2020,
  sumfile = "pl"
)

pop_2020

3.18 Extras: Qualtrics And Canvas

Survey platforms and learning management systems are also possible sources of data. The main point is that API-based imports should be treated as retrieval steps, not as code that runs every time a teaching document renders.

library(qualtRics)

raw_survey <- fetch_survey(
  surveyID = "YOUR-SURVEY-ID",
  label = FALSE,
  convert = FALSE
)
library(rcanvas)

courses <- get_course_list()

3.19 What Comes Next

The next chapter uses these data ideas to build plots. ggplot2 works best when the data are already in a clear rectangular form and the variables needed for the plot have been inspected, selected, transformed, or summarized.