Move Seamlessly from Excel to R - 5 (Select or Sort Columns)
In this video, we'll explore various methods for selecting columns and sorting the entries within those columns.
Let me start by showing you how to select a subset of columns from a data frame. First, I will load the necessary libraries to work with Excel files. We'll need the readxl
library to read the file and tidyverse
for data manipulation, so make sure both are loaded with the following commands:
library(tidyverse)
library(readxl)
Next, we’ll read the Excel file, "data.xls", and work with its first sheet. To load the data, use:
x <- read_excel("data.xls", sheet = 1)
Once you run this, type x
and you should see the data frame. This is the same data frame we used in previous examples for filtering and other operations. It includes columns like "first", "last", "address", "city", "state", "zip", and "updated".
Now, let’s clear the screen and focus on selecting a few columns from this data. The function for selecting columns is select()
. For example, if we only want to select the "first" and "last" name columns, we can use the following command:
select(x, first, last)
This command will display only the "first" and "last" columns. It’s similar to creating a new spreadsheet that contains just these two columns, but with one key difference: if you run the select()
command like this, the original data frame x
remains unchanged. The selection only appears in the output.
If you want to save this subset into a new data frame, you can do so like this:
y <- select(x, first, last)
Now, y
contains only the "first" and "last" columns, and x
remains the same as it was initially.
Another way to select columns is by excluding specific ones using the minus sign. For instance, if you want to select all columns except "first" and "last", you can use:
select(x, -first, -last)
This will return the remaining columns, such as "address", "city", "state", "zip", and "updated".
Let’s dive into some more advanced selection techniques. If you have multiple columns with similar names, like "English1", "English2", "Math1", "Math2", etc., you can select them based on their name patterns.
For example, to select all columns starting with "English", use the starts_with()
function:
select(x, starts_with("English"))
Similarly, you can select columns that end with a specific suffix, such as "ST", by using the ends_with()
function:
select(x, ends_with("ST"))
In our case, the columns "first" and "last" both end with "ST", so running ends_with("ST")
will select them.
You can also use regular expressions for more complex patterns with the matches()
function. For example, to select all columns containing the letter "S", you would use:
select(x, matches("S"))
This will select columns like "first", "last", and "address", which contain the letter "S" in their names.
Now, let’s move on to sorting data. In Excel or any other spreadsheet program, you can sort data based on a specific column, and you can do the same in R with ease.
To sort the data frame x
by the "zip" column, use the arrange()
function like this:
arrange(x, zip)
This will sort the rows in ascending order of the zip codes, with the smallest zip codes appearing at the top. Since our data frame has over 500 rows, you’ll only see the first 10 rows by default. To view the entire data frame, you can run:
as.data.frame(x)
This will print all 450+ rows, now sorted by the zip code.
You can also sort by other columns, such as "first" or "last" names. For example, to sort by "first" name:
arrange(x, first)
This will display names in alphabetical order, starting with "Abigail" and "Adam".
To sort by "last" name:
arrange(x, last)
This will show the names sorted by "last", such as "Abraham", "Alan", and so on.
If you need to sort in descending order, use the desc()
function. For instance, to sort by "last" name in reverse order, you would write:
arrange(x, desc(last))
This will show the names in reverse alphabetical order, starting with those that begin with "Y", followed by "Wilson", and so on.