Data Wrangling Basics: Transform Your Data Sets for ggplot2
While this course focuses on "making charts look pretty with code", none of it works if your data structure does not match the expected format or if variables aren't behaving themselves.
We will explore the logic of tidy data and learn how to reshape messy spreadsheets to match ggplot2's expectations, modify rows and columns, and use the specialized helpers for working with factors, strings, and dates.
🧹 Tidy House, Tidy Plots
You’ve likely heard the saying "garbage in, garbage out". In data visualization, it is a literal truth: the quality of your plot is directly tied to the quality of your data.
Quality here means not just the correctness and completeness of the values themselves, but also the underlying structure. If your data frame is a mess, ggplot2 will put up a fight.
Most of the time, ggplot2 expects your data to be "tidy." This is not just about being neat: it is a specific structural standard. Data is considered tidy when it follows three simple rules:
- Variables have their own columns.
- Observations have their own rows.
- Values have their own cells.
In a "messy" dataset, you might have years (2022, 2023, 2024) as column headers. While this format is handy for manual data entry (placing the latest value next to the previous one), it makes it hard to map year to a visual aesthetic such as position or color, because the year values are scattered across several column names instead of living in one variable.
In a tidy dataset, those columns would be gathered into a single column named year, with the corresponding values in a second column. This "long" format is exactly what ggplot2 needs to translate data variables into visual properties.
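As a minimal sketch of that reshaping step, the example below uses pivot_longer() from the tidyr package. The sales_wide data frame, its region column, and all values are made up for illustration.

```r
library(tidyr)

# Hypothetical "messy" data: one column per year (values are invented)
sales_wide <- data.frame(
  region = c("North", "South"),
  `2022` = c(10, 20),
  `2023` = c(12, 18),
  `2024` = c(15, 21),
  check.names = FALSE  # keep the year names as-is instead of "X2022"
)

# Gather the three year columns into two columns: year and sales
sales_long <- pivot_longer(
  sales_wide,
  cols = c(`2022`, `2023`, `2024`),
  names_to = "year",
  values_to = "sales"
)

sales_long  # 6 rows, one per region and year
```

Because the years started out as column names, the new year column is stored as text. If you need numbers, converting it with as.integer() afterwards is one option.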
Tidying data might feel like an extra step, but it’s the secret to efficiency and automation. Once your data is tidy, you can swap a scatter plot for a line chart or create small multiples with a single line of code. It also allows you to perform calculations across groups effortlessly and ensures your plots update automatically if new categories are added to the source file.
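To illustrate how little has to change once the data is tidy, here is a small sketch. The sales_long data frame and its values are hypothetical; only the ggplot2 functions are real.

```r
library(ggplot2)

# Hypothetical tidy data: one row per region and year (invented values)
sales_long <- data.frame(
  region = rep(c("North", "South"), each = 3),
  year   = rep(2022:2024, times = 2),
  sales  = c(10, 12, 15, 20, 18, 21)
)

# The aesthetic mappings are defined once and reused for every variant
p <- ggplot(sales_long, aes(x = year, y = sales, color = region))

p + geom_point()                         # scatter plot
p + geom_line()                          # line chart: only the geom changes
p + geom_line() + facet_wrap(~ region)   # small multiples: one extra line
```

If a third region were added to the data, all three plots would pick it up without any change to the plotting code.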
To make our lives easier, we use functions from the tidyverse, a collection of tools that adhere to the same philosophy and follow a shared syntax.
This shared DNA is at the heart of the tidyverse philosophy: every tool in the collection is built to handle tidy data, uses intuitive naming, and is designed to be easily combined.
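A rough sketch of what "easily combined" can look like in practice: the verbs below come from the dplyr package and are chained with R's native pipe (|>, available from R 4.1). The data frame and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical tidy data (invented values)
sales_long <- data.frame(
  region = rep(c("North", "South"), each = 3),
  year   = rep(2022:2024, times = 2),
  sales  = c(10, 12, 15, 20, 18, 21)
)

# Each function takes a data frame and returns a data frame,
# so the steps read top to bottom like a recipe
recent_means <- sales_long |>
  filter(year >= 2023) |>                 # keep the two latest years
  group_by(region) |>                     # one group per region
  summarize(mean_sales = mean(sales))     # one row per group

recent_means
```

The result has one row per region, which is again tidy and can be handed straight to ggplot().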
The tidyverse is a collection of R packages designed specifically for data science. And ggplot2 is actually part of this collection!
In fact, it is the ecosystem's spiritual parent. The principles of "tidy data" were essentially born out of the requirements needed to make the Grammar of Graphics work effectively.
The consistency we see across these tools today exists because every package was built to match the logical flow and structural standards that ggplot2 pioneered.
👯 A Friendlier Data Frame
While the tidyverse functions work perfectly fine with regular R data frames, they are designed to return a modern version called a tibble.
A tibble is still a rectangular data frame, but it is easier to work with, behaves more predictably when wrangling (for example, subsetting with square brackets always returns a tibble, never a bare vector), and prints nicely in the console.
- If you're working with an existing data object, you can upgrade it instantly using as_tibble(your_data).
- If you use functions from the readr package to import your files, like read_csv(), they return a tibble directly.
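As a small sketch of the first option, the example below converts mtcars, a regular data frame that ships with R, using as_tibble() from the tibble package. The name cars_tbl is just a placeholder.

```r
library(tibble)

# mtcars is a built-in, regular data frame
cars_tbl <- as_tibble(mtcars)

class(mtcars)    # "data.frame"
class(cars_tbl)  # "tbl_df" "tbl" "data.frame"

# Selecting a single column with [ , ]:
# a data frame drops to a plain vector, a tibble stays a tibble
class(mtcars[, "mpg"])    # "numeric"
class(cars_tbl[, "mpg"])  # "tbl_df" "tbl" "data.frame"
```

Note that as_tibble() drops row names by default. If they carry information, as the car models do here, as_tibble(mtcars, rownames = "model") keeps them as a regular column.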