library(tidyverse)
library(skimr)3 Data Import And The Tidyverse
3.1 Learning Goals
By the end of this chapter, the main goals are:
- read local data included with the project
- recognize the difference between a data frame and a tibble
- inspect a dataset before plotting it
- distinguish wide data from long data
- reshape data with
pivot_longer()andpivot_wider() - use the core tidyverse verbs for common data preparation tasks
- recognize which import examples should not run automatically because they require credentials or external services
3.2 Load Packages
The tidyverse is a collection of packages for importing, transforming, and visualizing data. Loading it makes packages such as readr, dplyr, tidyr, ggplot2, stringr, and tibble available.
3.3 Importing Local Data
Many projects begin with a local CSV file. CSV files are plain-text tables where columns are separated by commas. They are common because nearly every spreadsheet, statistical package, and database can export them.
The preferred tidyverse function is read_csv() from the readr package. Compared with base R’s read.csv(), it usually prints more clearly, guesses column types more consistently, and returns a tibble.
The course data live under Data/. The example below reads the Gapminder CSV file included with the project.
gap <- read_csv("Data/gapminder/gapminder.csv", show_col_types = FALSE)
gapThe object name gap now points to a table in memory. The assignment operator <- reads as “store the result on the right in the object named on the left.” Object names should be short enough to type but descriptive enough to understand later.
3.4 Tibbles And First Inspection
A tibble is the tidyverse version of a data frame. It is still a rectangular table with rows and columns, but it prints in a more controlled way. A tibble tells you the number of rows and columns, shows variable types, and avoids flooding the console with every row of a large dataset.
Start every new dataset with a quick inspection:
glimpse(gap)Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
skim(gap)| Name | gap |
| Number of rows | 1704 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 4 | 24 | 0 | 142 | 0 |
| continent | 0 | 1 | 4 | 8 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
| lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
| pop | 0 | 1 | 29601212.32 | 106157896.74 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
| gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
glimpse() comes from dplyr, which is loaded as part of the tidyverse. It gives a compact structural view of the data. skim() comes from the skimr package and gives a broader summary, including missingness and descriptive statistics. These checks are not just housekeeping. They help identify the unit of observation, variable types, unexpected missing values, and suspicious ranges before those problems enter a plot.
3.5 Base R Import And Inspection
Base R can also import data. The syntax is common in older examples and in some online answers:
gap_base <- read.csv("Data/gapminder/gapminder.csv")
str(gap_base)
head(gap_base)3.6 Other Local File Types
CSV is only one possible format. Social science projects often use Excel, Stata, SPSS, or SAS files. The syntax below should be treated as a template. These chunks are non-executing unless the named files are present and the relevant packages are installed.
3.6.1 Excel
The function read_excel() comes from the readxl package.
library(readxl)
gap_excel <- read_excel(
path = "Data/gapminder/gapminder.xlsx",
sheet = "gapminder"
)
gap_excel3.6.2 Stata
The functions for Stata, SPSS, and SAS files below come from the haven package.
library(haven)
state_legislatures <- read_dta("Data/state_policy/shor_mccarty_state_data.dta")
state_legislatures3.6.3 SPSS
The example uses a small SPSS file that is installed with the haven package.
library(haven)
path <- system.file("examples", "iris.sav", package = "haven")
survey_data <- read_sav(path)
survey_data3.6.4 SAS
The example uses a small SAS file that is installed with the haven package.
library(haven)
path <- system.file("examples", "iris.sas7bdat", package = "haven")
admin_data <- read_sas(path)
admin_data3.7 Wide And Long Data
Data can be organized in more than one shape. A wide dataset has one row per unit and a separate column for each measurement (gdpPercap, lifeExp). A long dataset stacks the measurement names into one column and the values into another. Wide data is often easier for people to read; long data is often easier for ggplot2 and group-wise summaries because the measurement name itself can be mapped to color, facets, or groups.
pivot_longer() and pivot_wider() from tidyr move data between the two shapes.
wide_data <- tibble(
country = c("Afghanistan", "Brazil"),
year = c(2007, 2007),
gdpPercap = c(974.6, 9065.8),
lifeExp = c(43.8, 72.4)
)
long_data <- wide_data |>
pivot_longer(
cols = c(gdpPercap, lifeExp),
names_to = "variable",
values_to = "value"
)
long_datalong_data |>
pivot_wider(names_from = variable, values_from = value)The pipe operator |> means “take the result so far and send it into the next function.” It lets a sequence of data steps read in the same order as the logic of the analysis.
3.8 Core Tidyverse Verbs
Most data preparation before plotting uses a small set of verbs:
filter()keeps rowsselect()keeps columnsmutate()creates or modifies variablesgroup_by()defines groupssummarize()collapses rows into summaries- joins combine information from two tables
These verbs come from dplyr, which is loaded as part of the tidyverse. They are useful because they correspond to ordinary analytical questions. Which cases belong in this plot? Which variables matter? Do we need a new rate, logged value, or category? What is the summary by group? Does another table contain information that should be merged in?
3.9 filter(): Keep Rows
filter() keeps observations that satisfy logical conditions. The operator == tests exact equality. The operator %in% tests whether a value belongs to a set.
gap |>
filter(year == 2007)gap |>
filter(continent %in% c("Americas", "Europe"))Use == for one value. Use %in% when several values are acceptable.
3.10 select(): Keep Columns
select() keeps variables. This is useful when a dataset has many columns but the next step only needs a few.
gap |>
select(country, continent, year, lifeExp, gdpPercap)Rows are observations. Columns are variables. filter() works on rows; select() works on columns.
3.11 mutate(): Create Variables
mutate() creates new variables or rewrites existing ones. In the next example, round() makes GDP per capita easier to read, and log10() creates a transformed variable that will be useful for plotting.
gap |>
mutate(
gdp_pc_round = round(gdpPercap),
log_gdp_pc = log10(gdpPercap)
) |>
select(country, gdpPercap, gdp_pc_round, log_gdp_pc)round(gdpPercap) rounds each value in the gdpPercap column. log10(gdpPercap) takes the base-10 logarithm. The original variable remains unless you explicitly overwrite it.
3.12 group_by() And summarize(): Aggregate Data
Many plots need data at a different level than the imported file. For example, the Gapminder data are country-year observations, but we may want a continent-level summary.
continent_summary <- gap |>
group_by(continent) |>
summarize(
median_life_exp = median(lifeExp, na.rm = TRUE),
median_gdp_pc = median(gdpPercap, na.rm = TRUE),
n_countries = n_distinct(country)
)
continent_summaryna.rm = TRUE tells functions such as median() to ignore missing values. Without it, one missing value can make a summary missing.
3.13 Readable Pipelines
Long pipelines are easier to read when each verb gets its own line and arguments are indented consistently.
gap |>
filter(year == 2007) |>
mutate(gdp_pc_round = round(gdpPercap)) |>
group_by(continent) |>
summarize(
median_life_exp = median(lifeExp, na.rm = TRUE),
median_gdp_pc = median(gdp_pc_round, na.rm = TRUE)
) |>
arrange(desc(median_life_exp))The line breaks are not just style. They make the data workflow visible. Each line is one transformation.
3.14 Joins
Joins combine two tables using one or more key variables. A common pattern is to keep all rows from the main table and add matching columns from a smaller lookup table. That is a left_join(), another dplyr function.
region_notes <- tibble(
continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
example_note = c(
"Many countries, large range of outcomes",
"Includes North, Central, and South America",
"Large population range",
"High median life expectancy in many examples",
"Small number of countries in many teaching datasets"
)
)
continent_summary |>
left_join(region_notes, by = "continent")Rows without a match in the lookup table would remain, but the new variables would be NA. That behavior is usually safer than silently dropping observations.
3.15 Short Exercise
Use the built-in mtcars dataset. Write a pipeline that:
- converts
mtcarsto a tibble with row names stored as a column namedcar - groups by
cyl - calculates average
mpgand maximumwt - returns one row per value of
cyl
# Write your code here.3.16 Extras: Google Sheets
Google Sheets can be useful when data are collaboratively maintained. These examples require authentication and should not run automatically in a rendered course book.
library(googlesheets4)
sheet_data <- read_sheet(
ss = "PASTE-GOOGLE-SHEET-ID-OR-URL-HERE",
sheet = "Sheet1"
)
sheet_datalibrary(googlesheets4)
ss <- gs4_create("Visualization 101 2026 Example Sheet")
mtcars |>
as_tibble(rownames = "car") |>
sheet_write(ss = ss, sheet = "mtcars")3.17 Census Data With tidycensus
Some data sources are best accessed through specialized packages. These examples are templates, not automatically running chunks.
The tidycensus package wraps the U.S. Census Bureau’s APIs. The two most common entry points are the American Community Survey (get_acs()) and the decennial census (get_decennial()). A free Census API key is required. Request one at https://api.census.gov/data/key_signup.html, then register it once on your machine with census_api_key("YOUR_KEY", install = TRUE). After that, the key is available to every R session and the line can be removed from teaching files.
library(tidycensus)
#census_api_key("2c5ba48be91062db570c7fc5fa49dbca03306c33", install = TRUE)load_variables() lists the variables available in a given dataset and year. Caching the result avoids re-downloading the table.
acs_vars <- load_variables(2020, "acs5", cache = TRUE) |>
distinct(concept, .keep_all = TRUE)
acs_varsIf load_variables() errors with a JSON parse message like invalid char in json text. <html>, the Census Bureau returned an HTML error page instead of the variable table. That usually means the requested year is not yet published for that dataset, the Census server is temporarily down, or the request was rate-limited. Try a slightly older year (for ACS 5-year, two years behind the current year is usually safe), re-run the call after a minute, or check https://www.census.gov/data/developers/data-sets.html for the dataset’s release schedule.
The variables argument of get_acs() accepts named variable codes. Naming the entries gives readable column names instead of raw codes like B19013_001. output = "wide" puts each variable in its own column, which is usually easier to plot from than the default long format.
acs_raw <- get_acs(
geography = "congressional district",
variables = c(
median_income = "B19013_001",
total_pop = "B01003_001"
),
year = 2022,
survey = "acs5",
output = "wide"
)
acs_rawThe GEOID column is a concatenation of geographic identifiers. For congressional districts it combines the two-digit state FIPS code with the two-digit district number. Splitting it with str_sub() gives joinable columns.
acs <- acs_raw |>
mutate(
fips = str_sub(GEOID, 1, 2),
cd = str_sub(GEOID, 3, 4)
)
acsState-level summaries are simpler because the GEOID is just the state FIPS code. The wide format makes it easy to compare one variable against another.
acs_state <- get_acs(
geography = "state",
variables = c(
median_income = "B19013_001",
median_rent = "B25064_001"
),
year = 2022,
survey = "acs5",
output = "wide"
)
acs_state |>
summarize(
n_states = n(),
cor_income_rent = cor(median_incomeE, median_rentE, use = "complete.obs")
)Note that the wide variable names get an E suffix for estimates and an M suffix for the margin of error. Both are usually returned. For most teaching purposes the E columns are enough; in published work the M columns should be acknowledged.
The decennial census uses get_decennial() instead. It returns counts rather than estimates and does not need a survey argument.
pop_2020 <- get_decennial(
geography = "state",
variables = c(total_pop = "P1_001N"),
year = 2020,
sumfile = "pl"
)
pop_20203.18 Extras: Qualtrics And Canvas
Survey platforms and learning management systems are also possible sources of data. The main point is that API-based imports should be treated as retrieval steps, not as code that runs every time a teaching document renders.
library(qualtRics)
raw_survey <- fetch_survey(
surveyID = "YOUR-SURVEY-ID",
label = FALSE,
convert = FALSE
)library(rcanvas)
courses <- get_course_list()3.19 What Comes Next
The next chapter uses these data ideas to build plots. ggplot2 works best when the data are already in a clear rectangular form and the variables needed for the plot have been inspected, selected, transformed, or summarized.