Coding in the tidyverse
Best practices for human-readable code
Last updated
The tidyverse is a collection of R packages designed to work together. There are many different ways to code in R, including base R, but this lab uses the tidyverse wherever possible for a few good reasons:
Tidyverse functions are fairly human-readable, which makes troubleshooting a lot easier
Other lab members will understand your code better if everyone commits to the same coding style
There are a ton of resources online to help you understand tidyverse functions
Tidyverse functions expect tidy data, which means that they force you to use good data management practices
The functions in the tidyverse are very powerful and often are designed to facilitate exactly the kinds of transformations we need.
If this is your first time using the tidyverse, you'll need to install all of the packages. Luckily, they all come bundled together and can be installed with one line of code! Simply type
install.packages("tidyverse")
into the console of your RStudio session. Be sure to watch for any additional prompts along the way as you install!
After you've installed the tidyverse, you'll need to load it. Think of installing as screwing in a lightbulb and loading as flipping the light switch. You only need to screw the lightbulb in once, but you need to flip the switch every time you want light. To load the packages, you should add
library(tidyverse)
at the beginning of each script where you use tidyverse functions.
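As a minimal sketch of the install-once, load-every-time pattern described above, a script can check whether the tidyverse is already installed before loading it. The requireNamespace check is an optional convenience and an assumption of this sketch, not something the lab requires.

```r
# Install the tidyverse only if it is not already available (the "lightbulb")
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it at the top of every script that uses tidyverse functions (the "switch")
library(tidyverse)
```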
The pipe: %>%
The pipe can be read as "and then" whenever you encounter it in code.
data %>%
Take the dataframe "data" and then
group_by(id) %>%
group the data by the column id
and then
distinct(media_name)
keep only one row (per group) with each distinct media_name
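The walkthrough above can be tried on a small made-up dataset. The column names id and media_name come from the example; the values are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical data: each participant (id) saw some media more than once
data <- tibble(
  id = c(1, 1, 1, 2, 2),
  media_name = c("video_a", "video_a", "video_b", "video_a", "video_a")
)

unique_media <- data %>%   # take the dataframe "data" and then
  group_by(id) %>%         # group the data by the column id and then
  distinct(media_name)     # keep one row per distinct media_name within each id

unique_media
```

Because the data are grouped by id, the result keeps both the id and media_name columns, with one row for each combination that occurs.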
dplyr is a grammar of data manipulation and wrangling that provides a series of consistent verbs within the tidyverse.
select() allows you to keep or remove columns in your dataframe, or move columns around into a better order.
Helper functions that let you select columns more precisely when using the select function:
starts_with() allows you to select all the columns of a data frame whose names start with a specific substring. For example, "per" selects all the columns whose names start with "per", such as "percentage".
ends_with() allows you to select all the columns of a data frame whose names end with a specific substring. For example, "ratio" selects all the columns whose names end with "ratio".
contains() allows you to select all the columns of a data frame whose names contain a specific substring anywhere. For example, "199" selects all the columns whose names contain a year from the 1990s.
matches() allows you to select all the columns of a data frame whose names match a regular expression (see below). For example, "y|perc" selects all the columns whose names contain either "y" or "perc".
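The selection helpers above can be compared side by side on a small example. This is a sketch only: the survey tibble and its column names are hypothetical, chosen so that each helper picks up a different set of columns.

```r
library(dplyr)

# Hypothetical dataset with column names chosen to exercise each helper
survey <- tibble(
  id = 1:2,
  percentage_correct = c(80, 95),
  percentage_missing = c(20, 5),
  speed_ratio = c(1.2, 0.9),
  year_1998 = c(10, 12),
  year_1999 = c(11, 14)
)

starts_per <- survey %>% select(starts_with("per"))   # the two percentage columns
ends_ratio <- survey %>% select(ends_with("ratio"))   # speed_ratio
has_199    <- survey %>% select(contains("199"))      # year_1998 and year_1999
y_or_perc  <- survey %>% select(matches("y|perc"))    # names containing "y" or "perc"

# select() can also drop a column or move one to the front
no_id       <- survey %>% select(-id)
ratio_first <- survey %>% select(speed_ratio, everything())
```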
Helper functions that let you modify columns more efficiently when using the mutate function:
across() allows you to perform the same calculation across multiple columns when using the mutate function.
sub() allows you to replace the first match of a pattern with a new string. It is a base R function; the stringr equivalent is str_replace.
gsub() allows you to replace all the matches of a pattern with a new string. It is a base R function; the stringr equivalent is str_replace_all.
where() allows you to specify columns for a calculation or a replacement by their type. For example across(.cols = where(is.numeric)). Note that is.numeric is passed without parentheses.
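A short sketch of these helpers inside mutate follows. The scores tibble, its column names, and the division by 100 are hypothetical choices made for illustration.

```r
library(dplyr)

# Hypothetical dataset: one text column and two numeric columns
scores <- tibble(
  participant = c("sub_01_a", "sub_02_b"),
  perc_correct = c(80, 95),
  perc_missing = c(20, 5)
)

# across() + where(): apply the same calculation to every numeric column
proportions <- scores %>%
  mutate(across(.cols = where(is.numeric), .fns = ~ .x / 100))

# sub() replaces only the first underscore; gsub() replaces all of them
renamed <- scores %>%
  mutate(
    first_only  = sub("_", "-", participant),
    every_match = gsub("_", "-", participant)
  )
```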
Helper functions that let you apply one condition to several columns when using the filter function:
if_any() keeps the rows where at least one of the selected columns meets a condition. For example, filter(if_any(.cols = starts_with("perc"), .fns = ~ .x > 50)) keeps rows where any column starting with "perc" is above 50 (the threshold is only an illustration).
if_all() works similarly to the if_any function, but it keeps only the rows where all of the selected columns meet the condition.
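The difference between the two filter helpers is easiest to see on the same data. In this sketch the survey tibble, its column names, and the threshold of 50 are all hypothetical.

```r
library(dplyr)

# Hypothetical dataset with two percentage columns
survey <- tibble(
  id = 1:4,
  perc_a = c(90, 40, 60, 10),
  perc_b = c(80, 70, 30, 20)
)

# Rows where AT LEAST ONE "perc" column is above 50
any_high <- survey %>%
  filter(if_any(.cols = starts_with("perc"), .fns = ~ .x > 50))

# Rows where EVERY "perc" column is above 50
all_high <- survey %>%
  filter(if_all(.cols = starts_with("perc"), .fns = ~ .x > 50))
```

With these values, any_high keeps ids 1, 2 and 3, while all_high keeps only id 1.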
Functions that allow you to combine datasets:
left_join() keeps all the rows from the dataset on the left and adds the columns from the right dataset wherever the rows match. By default, the columns used for joining must have the same names in the left and right dataframes; if the names differ, specify the pairing with the by argument.
inner_join() keeps only the rows that are in common between two datasets.
anti_join() identifies the rows that are present in the first dataset but not in the second one.
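The three joins can be compared on two small tables that share an id column. The participants and responses tibbles are hypothetical and only illustrate which rows each join keeps.

```r
library(dplyr)

# Hypothetical datasets that overlap on ids 2 and 3
participants <- tibble(id = c(1, 2, 3), group = c("control", "treatment", "control"))
responses    <- tibble(id = c(2, 3, 4), score = c(10, 20, 30))

# All rows from participants; score is NA where there is no match (id 1)
left <- left_join(participants, responses, by = "id")

# Only the ids present in both datasets (2 and 3)
inner <- inner_join(participants, responses, by = "id")

# Participants with no matching response (id 1)
unmatched <- anti_join(participants, responses, by = "id")
```

Running anti_join before a join is a handy way to check which rows would be left without a match.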
Regular expressions are tools for describing patterns in strings. They work with the stringr library within the tidyverse.
Alternation: use the token | to specify an "or" condition. For example, "green|blue" matches strings that contain either green or blue.
Anchors: use the token ^ to search for a match at the start of a string (similar to starts_with). For example, "^co" matches strings that start with co. Use the token $ to search for a match at the end of a string (similar to ends_with). For example, "co$" matches strings that end with co.
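These patterns can be checked with str_detect from stringr, which returns TRUE for each string that matches. The example strings below are hypothetical.

```r
library(stringr)

# Alternation: TRUE when the string contains "green" or "blue"
colour_match <- str_detect(c("green apple", "blue sky", "red coat"), "green|blue")

# Anchors: ^ matches at the start of the string, $ at the end
words <- c("cold", "taco", "disco")
starts_with_co <- str_detect(words, "^co")
ends_with_co   <- str_detect(words, "co$")
```

Here only "cold" starts with co, while "taco" and "disco" both end with it.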
Set operations are useful functions for combining two datasets that have the same columns when working in the tidyverse.
intersect() keeps only the rows that exist in both datasets.
union() keeps all the rows from both datasets, removing duplicated rows.
union_all() keeps all the rows from both datasets, including duplicated rows.
setdiff() keeps the rows in the x dataset that do not appear in the y dataset.
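A sketch of the four set operations on two small tables with the same column follows. The x and y tibbles are hypothetical, and the dplyr:: prefix is used here only to make clear that the data frame versions are being called rather than the base R functions of the same name.

```r
library(dplyr)

# Hypothetical datasets with identical columns; ids 2 and 3 appear in both
x <- tibble(id = c(1, 2, 3))
y <- tibble(id = c(2, 3, 4))

in_both   <- dplyr::intersect(x, y)  # rows found in both x and y
combined  <- dplyr::union(x, y)      # all rows, duplicates removed
stacked   <- union_all(x, y)         # all rows, duplicates kept
only_in_x <- dplyr::setdiff(x, y)    # rows of x that are not in y
```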
filter() allows you to remove rows that don't match some criteria you set out. Great for cleaning data.
mutate() allows you to make a new column using previous columns.
pivot_longer() allows you to reshape wide data (e.g. Qualtrics output) into a longer format.
pivot_wider() is the reverse of pivot_longer; it allows you to reshape long data into a wider format.
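The four verbs above often appear together when cleaning survey exports. The sketch below assumes a hypothetical wide tibble with one column per question (q1, q2) on a 5-point scale; none of these names or values come from a real dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per participant, one column per question
wide <- tibble(
  id = c(1, 2, 3),
  q1 = c(5, 3, NA),
  q2 = c(4, 2, 1)
)

# pivot_longer(): one row per participant per question
long <- wide %>%
  pivot_longer(cols = c(q1, q2), names_to = "question", values_to = "response")

# filter(): drop rows with a missing response
complete <- long %>%
  filter(!is.na(response))

# mutate(): make a new column from an existing one
scaled <- complete %>%
  mutate(response_prop = response / 5)

# pivot_wider(): reverse the reshaping, back to one column per question
back_to_wide <- long %>%
  pivot_wider(names_from = question, values_from = response)
```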