Move Seamlessly from Excel to R - 4 (Filtering rows)
This might be the most important video in the series. In this one, I'm going to show you how to filter data in R.
If you're familiar with filtering in programs like Excel, you'll know it can be a bit tedious and manual. But filtering in R is much simpler and more efficient, and I'll walk you through the process.
Before we dive into it, a quick reminder: the text details for all of these videos are posted on Substack. You can access it at moveseamlessly.substack.com, where you'll find the full transcripts of every video. We also have a GitHub page where I share all the data used for these tutorials, as well as other resources. And don’t forget to visit our website at move-seamlessly.com for more information on various topics.
Now, let's get into it. I’ll demonstrate how to use the filter()
function in R by working through an example.
First, I'll open the R window, clean the screen, and show you how it works. The first step is to load the tidyverse
library. I’ll do that by running library(tidyverse)
.
While that’s running, I’m going to load an Excel file named data.xls
. This file contains two sheets:
The first sheet has columns for first name, last name, address, zip code, and the last updated date for each record.
The second sheet has names along with their scores in English and Math.
I’ll load each sheet separately. But first, I need to change my working directory to the folder where the file is located, which is on my desktop. Once I’ve done that, I'll load the readxl
package using library(readxl)
. You don’t need to install this separately, as it gets installed automatically when you install the tidyverse
, but it doesn’t load by default.
Once that’s ready, I’ll read in the first sheet using read_excel()
and store it in a data frame called data1
. This will give me the first sheet. Then I’ll do the same for the second sheet, which I’ll store in data2
.
Now, let’s take a quick look. data1
has 500 rows and 7 columns, while data2
has 50 rows with scores in English and Math.
Let’s start with filtering on the second sheet (data2). There are three main types of data you might filter: numerical, character (or string), and date. I’ll show you how to handle each.
Filtering Numerical Data
To filter by numbers, for example, I’ll show you how to get everyone whose English score is greater than 70. Here's how you can do it:
data2 %>% filter(English > 70)
This gives me 26 results. Since the output only shows the first few rows, if you want to see the entire table, you can convert it to a data frame like this:
data2 %>% filter(English > 70) %>% as.data.frame()
Now, let’s combine filters. If you want to find everyone who has both an English score and a Math score above 70, you can do this:
data2 %>% filter(English > 70, Math > 70)
This reduces the number of results to 20. Alternatively, you can combine multiple conditions using &
(for "and") or |
(for "or"). Let’s try using |
to get students who have either an English score greater than 70 or a Math score greater than 70:
data2 %>% filter(English > 70 | Math > 70)
Now, we get 41 results. This shows how to combine filters using logical operators.
Filtering Character Data
Let’s move on to filtering by character data. For example, if I want to find everyone whose first name is "Justine," I can use:
data1 %>% filter(first == "Justine")
This gives me 5 results. You can also do partial matches. For example, if you want to find people whose first name starts with the letter "J", you can use a regular expression like this:
data1 %>% filter(str_detect(first, "^J"))
This will return 65 names, as there are 65 people with first names starting with "J."
You can also use filtering for specific values in a column. For example, let’s say you want to find people from Indiana, Ohio, or Illinois. You can filter for multiple values like this:
data1 %>% filter(state %in% c("IN", "OH", "IL"))
This will return all 71 people from Indiana, Ohio, or Illinois.
Filtering Date Data
Finally, let’s look at filtering by date. In data1
, there’s a column called "updated" with the date when the record was last updated. To filter records updated after February 1st, 2024, you can use:
data1 %>% filter(updated > "2024-02-01")
This gives me 130 results. In R, date columns are ordered, so you can easily filter by any date.
Conclusion
And that’s how filtering works in R. There’s a lot more you can do with different types of data, and I’ll cover advanced topics like string manipulation and handling date formats in upcoming videos.
In the next video, I’ll dive deeper into handling strings and using regular expressions. I’ll also cover working with dates in more detail, including formatting and using specialized libraries.
Thanks for watching, and make sure to check out the resources on Substack, GitHub, and our website for more tutorials and updates!