library(tidyr)
library(mdsr)
library(babynames)
library(dplyr)
Some of the code are from book “Modern Data Science with R”.
Here is a list of five important verbs to work with data frame in R.
- select() take a subset of columns
- filter() take a subset of the rows
- mutate() add or modify existing columns
- arrange() sort the rows
- summarize() aggregate the data across rows
- spread() convert a data table from narrow to wide
- gather() convert a data table from wide to narrow
Each of these functions takes a data frame as its first argument, and returns a data frame.
select() and filter()
head(presidential, 2)
## # A tibble: 2 x 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy 1961-01-20 1963-11-22 Democratic
select(presidential, name, party)
## # A tibble: 11 x 2
## name party
## <chr> <chr>
## 1 Eisenhower Republican
## 2 Kennedy Democratic
## 3 Johnson Democratic
## 4 Nixon Republican
## 5 Ford Republican
## 6 Carter Democratic
## 7 Reagan Republican
## 8 Bush Republican
## 9 Clinton Democratic
## 10 Bush Republican
## 11 Obama Democratic
filter(presidential, party == "Republican")
## # A tibble: 6 x 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Nixon 1969-01-20 1974-08-09 Republican
## 3 Ford 1974-08-09 1977-01-20 Republican
## 4 Reagan 1981-01-20 1989-01-20 Republican
## 5 Bush 1989-01-20 1993-01-20 Republican
## 6 Bush 2001-01-20 2009-01-20 Republican
select(filter(presidential, start > 1973 & party == "Democratic"), name)
## # A tibble: 3 x 1
## name
## <chr>
## 1 Carter
## 2 Clinton
## 3 Obama
To get the same result, we can also use the %>% (pipe) operation, which is more readable than above syntax.
presidential %>%
filter(start > 1973 & party == "Democratic") %>%
select(name)
## # A tibble: 3 x 1
## name
## <chr>
## 1 Carter
## 2 Clinton
## 3 Obama
mutate() and rename()
# We are going to use the interval() and eyears() in lubridate to calculate the term
library(lubridate)
1. Add a new column
mypresidents <- presidential %>%
mutate(term.length = interval(start, end) / eyears(1))
head(mypresidents, 2)
## # A tibble: 2 x 5
## name start end party term.length
## <chr> <date> <date> <chr> <dbl>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.005479
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2.838356
2. Change column name
mypresidents <- mypresidents %>% rename(term_length = term.length)
head(mypresidents, 2)
## # A tibble: 2 x 5
## name start end party term_length
## <chr> <date> <date> <chr> <dbl>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.005479
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2.838356
arrange()
sort() will sort a vector and arrange() will sort a data frame.
mypresidents %>% arrange(desc(term_length), party)
## # A tibble: 11 x 5
## name start end party term_length
## <chr> <date> <date> <chr> <dbl>
## 1 Clinton 1993-01-20 2001-01-20 Democratic 8.005479
## 2 Obama 2009-01-20 2017-01-20 Democratic 8.005479
## 3 Eisenhower 1953-01-20 1961-01-20 Republican 8.005479
## 4 Reagan 1981-01-20 1989-01-20 Republican 8.005479
## 5 Bush 2001-01-20 2009-01-20 Republican 8.005479
## 6 Nixon 1969-01-20 1974-08-09 Republican 5.553425
## 7 Johnson 1963-11-22 1969-01-20 Democratic 5.167123
## 8 Carter 1977-01-20 1981-01-20 Democratic 4.002740
## 9 Bush 1989-01-20 1993-01-20 Republican 4.002740
## 10 Kennedy 1961-01-20 1963-11-22 Democratic 2.838356
## 11 Ford 1974-08-09 1977-01-20 Republican 2.452055
summarize() with group_by()
For most of the cases, summarize() is used with group_by(). summarize() collapses a data frame into a single row.
mypresidents %>%
summarize(N = n(), first_year = min(year(start)), last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length))
## # A tibble: 1 x 6
## N first_year last_year num_dems years avg_term_length
## <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 11 1953 2017 5 64.04384 5.822167
To have a summary row for each of the party,
mypresidents %>%
group_by(party) %>%
summarize(
N = n(), first_year = min(year(start)), last_year = max(year(end)),
num_dems = sum(party == "Democratic"),
years = sum(term_length),
avg_term_length = mean(term_length))
## # A tibble: 2 x 7
## party N first_year last_year num_dems years avg_term_length
## <chr> <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Democratic 5 1961 2017 5 28.01918 5.603836
## 2 Republican 6 1953 2009 0 36.02466 6.004110
Combine multiple tables
library(nycflights13)
head(flights, 2)
## # A tibble: 2 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
head(airlines, 2)
## # A tibble: 2 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
1. inner_join()
flightsJoined <- flights %>%
inner_join(airlines, by = c("carrier" = "carrier"))
glimpse(flightsJoined)
## Observations: 336,776
## Variables: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
## $ name <chr> "United Air Lines Inc.", "United Air Lines Inc....
2. left_join()
For left_join(), the rows of the first table are always returned, regardless of whether there is a match in the second table.
Convert wide to narrow and vice versa
The spread(key = , value = ) converts a data table from narrow to wide. The value is the variable in the narrow format that is to be divided up into multiple variables in the resulting side format. The key is the name of the variable in the narrow format that identifies for each case individually which column in the wide format will receive the value. The gather(key = , value = ) converts a data table from wide to narrow.
babynames %>%
group_by(name, sex) %>%
summarise(total = sum(n)) %>%
head(6)
## # A tibble: 6 x 3
## # Groups: name [6]
## name sex total
## <chr> <chr> <int>
## 1 Aaban M 87
## 2 Aabha F 28
## 3 Aabid M 5
## 4 Aabriella F 15
## 5 Aada F 5
## 6 Aadam M 218
Suppose you want to find the most gender-neutral names from all 93,889 names in babynames? For this, it would be useful to have the results in a wide format, like the one shown below.
BabyWide <- babynames %>%
group_by(sex, name) %>%
summarize(total = sum(n)) %>%
spread(key = sex, value = total, fill = 0)
head(BabyWide, 3)
## # A tibble: 3 x 3
## name F M
## <chr> <dbl> <dbl>
## 1 Aaban 0 87
## 2 Aabha 28 0
## 3 Aabid 0 5
Sometimes spread() and gather() need to be used together to create the format you want.
df1 <- data.frame(grp = c("A", "A", "B", "B"), sex = c("F", "M", "F", "M"),
meanL = c(0.22, 0.47, 0.33, 0.55), sdL = c(0.11, 0.33, 0.11, 0.31))
df1
## grp sex meanL sdL
## 1 A F 0.22 0.11
## 2 A M 0.47 0.33
## 3 B F 0.33 0.11
## 4 B M 0.55 0.31
df1 %>% gather(variable, value, -(grp:sex)) %>%
unite(temp, sex, variable, sep = ".") %>%
spread(temp, value)
## grp F.meanL F.sdL M.meanL M.sdL
## 1 A 0.22 0.11 0.47 0.33
## 2 B 0.33 0.11 0.55 0.31