library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
# Load data
data(email50)
# View the structure of the data
str(email50)
## tibble [50 x 21] (S3: tbl_df/tbl/data.frame)
## $ spam : num [1:50] 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int [1:50] 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct[1:50], format: "2012-01-04 08:19:16" "2012-02-16 15:10:06" ...
## $ image : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num [1:50] 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num [1:50] 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num [1:50] 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num [1:50] 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int [1:50] 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num [1:50] 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num [1:50] 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num [1:50] 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
Identify variable types
Recall from the video that the glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.
Let’s have another look at the email50 data, so we can practice identifying variable types.
# Glimpse email50
glimpse(email50)
## Rows: 50
## Columns: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ time <dttm> 2012-01-04 08:19:16, 2012-02-16 15:10:06, 2012-01-04 ...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, ...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, 0, 0,...
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, n...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809, 5.2...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167, 198...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, 10, 0...
## $ number <fct> small, big, none, small, small, small, small, small, s...
Nice! Can you determine the type of each variable?
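One way to double-check your answers is to ask R for the storage class of every column. The sketch below uses base R only and a small made-up data frame (toy_emails and its values are hypothetical stand-ins, not the real email50 data).

```r
# Hypothetical toy data, loosely mimicking a few email50 columns
toy_emails <- data.frame(
  spam     = c(0, 0, 1, 0),
  cc       = c(0L, 4L, 0L, 1L),
  num_char = c(21.705, 7.011, 0.631, 2.454),
  number   = factor(c("small", "big", "none", "small"),
                    levels = c("none", "small", "big"))
)

# Storage class of each column
col_classes <- sapply(toy_emails, class)
col_classes

# Names of the columns stored as factors
factor_cols <- names(toy_emails)[sapply(toy_emails, is.factor)]
factor_cols
```

Keep in mind that the storage class is only a hint: here spam is stored as a number even though it is conceptually a categorical (yes/no) variable.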
Filtering based on a factor
Categorical data are often stored as factors in R. In this exercise, we’ll practice working with a factor variable, number, from the email50 dataset. This variable tells us what type of number (none, small, or big) an email contains.
Recall from the video that the filter() function from dplyr can be used to filter a dataset to create a subset containing only certain levels of a variable. For example, the following code filters the mtcars dataset for cars with 6 cylinders:
mtcars %>%
filter(cyl == 6)
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
filter(number == "big")
# Glimpse the subset
glimpse(email50_big)
## Rows: 7
## Columns: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 15:10:06, 2012-02-04 18:26:09, 2012-01-24 ...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fct> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fct> big, big, big, big, big, big, big
Great work! Seven emails contain big numbers.
Complete filtering based on a factor
The droplevels() function removes unused levels of factor variables from our dataset. As we saw in the video, it’s often useful to determine which levels are unused (i.e., contain zero observations) with the table() function.
In this exercise, we’ll see which levels of the number variable are dropped after applying the droplevels() function.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
filter(number == "big")
# Table of the number variable
table(email50_big$number)
##
## none small big
## 0 0 7
# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)
# Table of the number variable
table(email50_big$number_dropped)
##
## big
## 7
Did you notice that dropping the levels of the number variable gets rid of the levels with counts of zero? This will be useful when you’re creating visualizations later on. Great work!
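droplevels() also works on a whole data frame, cleaning up every factor column at once. Here is a minimal base R sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical toy data frame with a three-level factor
toy <- data.frame(
  id     = 1:5,
  number = factor(c("big", "small", "big", "none", "big"),
                  levels = c("none", "small", "big"))
)

# Subsetting keeps all three levels, even the ones that are now empty
toy_big <- toy[toy$number == "big", ]
levels(toy_big$number)

# droplevels() on a data frame drops unused levels from every factor column
toy_big_clean <- droplevels(toy_big)
levels(toy_big_clean$number)
table(toy_big_clean$number)
```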
Discretize a different variable
In this exercise, we’ll create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. This new variable will have two levels ("below median" and "at or above median") depending on whether an email has fewer characters than the median or at least as many.
The median marks the 50th percentile, or midpoint, of a distribution, so half of the emails should fall in one category and the other half in the other. You will learn more about the median and other measures of center in the next course in this series.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
# Count emails in each category
email50_fortified %>%
count(num_char_cat)
## # A tibble: 2 x 2
## num_char_cat n
## <chr> <int>
## 1 at or above median 25
## 2 below median 25
Great job! As you can see, half of the observations are below the median and half are at or above the median. Makes sense, doesn’t it?
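An even split like this is not guaranteed in general. The small sketch below uses a made-up vector (hypothetical values, not taken from email50) to show that observations tied with the median all land in the "at or above median" group:

```r
# Hypothetical values with several ties at the median
x <- c(1, 2, 3, 3, 3, 3, 4, 5)
med_x <- median(x)  # average of the 4th and 5th sorted values

# Same recoding rule as in the exercise
x_cat <- ifelse(x < med_x, "below median", "at or above median")
table(x_cat)
```

With these values only two observations fall strictly below the median, so the two groups are unequal.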
Combining levels of a different factor
Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels "none", "small", and "big", but suppose we’re only interested in whether an email contains a number. In this exercise, we will create a variable containing this information and also visualize it.
For now, do your best to understand the code we’ve provided to generate the plot. We will go through it in detail in the next video.
library(ggplot2)
# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )
# Visualize the distribution of number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
geom_bar()
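The same recoding can also be done without dplyr. As a sketch, base R lets you assign a named list to levels() in order to merge factor levels, and the result stays a factor rather than becoming a character vector. The number vector below is made up for illustration.

```r
# Hypothetical factor with the same three levels as number
number <- factor(c("small", "big", "none", "small", "none"),
                 levels = c("none", "small", "big"))

# Collapse levels: "none" becomes "no"; "small" and "big" become "yes"
number_yn <- number
levels(number_yn) <- list(no = "none", yes = c("small", "big"))

table(number_yn)
```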
Visualizing numerical and categorical data
In this exercise, we’ll visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. This means that we will use an aspect of the plot (like color or shape) to identify the levels in the spam variable so that we can compare plotted values between them.
Recall that in the ggplot() function, the first argument is the dataset, then we map the aesthetic features of the plot to variables in the dataset, and finally the geom_*() layer informs how data are represented on the plot. In this exercise, we will make a scatterplot by adding a geom_point() layer to the ggplot() call.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
geom_point()
Excellent work! Note how ggplot2 automatically creates a helpful legend for the plot, telling you which color corresponds to each level of the spam variable.
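Wrapping spam in factor() matters because spam is stored as a number. The sketch below, on a made-up data frame (toy is hypothetical), builds both versions of the plot so you can compare them: mapping the raw numeric column gives a continuous color scale, while mapping the factor gives one discrete color per level.

```r
library(ggplot2)

# Hypothetical toy data with a 0/1 numeric indicator, like spam
toy <- data.frame(
  num_char     = c(21.7, 7.0, 0.6, 2.5, 41.6, 10.4),
  exclaim_mess = c(8, 1, 2, 1, 43, 1),
  spam         = c(0, 0, 1, 0, 0, 1)
)

# Numeric mapping: ggplot2 treats spam as a continuous variable
p_continuous <- ggplot(toy, aes(x = num_char, y = exclaim_mess, color = spam)) +
  geom_point()

# Factor mapping: one discrete color per level of spam
p_discrete <- ggplot(toy, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
```

Print p_continuous and p_discrete one after the other to see the difference in the legends.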
Identify type of study: Countries
Next, let’s take a look at data from a different study on country characteristics. First, load the data and view it, then identify the type of study. Remember, an experiment requires random assignment.
library(gapminder)
# Load data
data(gapminder)
# Glimpse data
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
# Identify type of study: observational or experimental
type_of_study <- "observational"
Right! Since there is no way to randomly assign countries to attributes, this is an observational study. Nice work!
Number of males and females admitted
The goal of this exercise is to determine how many male applicants were admitted and how many were rejected, and likewise for female applicants.
To do so we will use the count() function from the dplyr package.
In one step, count() groups the data and then tallies the number of observations in each level of the grouping variable. These counts are available under a new variable called n.
# Load packages
library(dplyr)
load("_data/ucb_admit.RData")
# Count number of male and female applicants admitted
(ucb_admission_counts <- ucb_admit %>%
count(Gender, Admit))
## Gender Admit n
## 1 Male Admitted 1198
## 2 Male Rejected 1493
## 3 Female Admitted 557
## 4 Female Rejected 1278
Cool counting! Passing several arguments to count() gives you the number of rows for each combination of those arguments.
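To make the "groups, then tallies" description concrete, the sketch below compares count() with the longer group_by() plus summarise() form on a made-up data frame (toy_admit is hypothetical, not the real ucb_admit data).

```r
library(dplyr)

# Hypothetical toy applications data
toy_admit <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Male", "Female"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected", "Admitted", "Rejected")
)

# Shorthand: count() groups and tallies in one step
counts_short <- toy_admit %>%
  count(Gender, Admit)

# Long form: group first, then count the rows in each group
counts_long <- toy_admit %>%
  group_by(Gender, Admit) %>%
  summarise(n = n(), .groups = "drop")

counts_short
```

Both versions should give the same n for every Gender and Admit combination.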
Proportion of males admitted overall
Next we’ll calculate the proportion of males and the proportion of females admitted by creating a new variable, called prop (short for proportion), based on the counts calculated in the previous exercise, using the mutate() function from the dplyr package.
Proportions for each row of the data frame we created in the previous exercise can be calculated as n / sum(n). Note that since the data are grouped by gender, sum(n) will be calculated for males and females separately.
ucb_admission_counts %>%
# Group by gender
group_by(Gender) %>%
# Create new variable
mutate(prop = n / sum(n)) %>%
# Filter for admitted
filter(Admit == "Admitted")
## # A tibble: 2 x 4
## # Groups: Gender [2]
## Gender Admit n prop
## <fct> <fct> <int> <dbl>
## 1 Male Admitted 1198 0.445
## 2 Female Admitted 557 0.304
Fantastic! It looks like about 44.5% of males were admitted versus only about 30.4% of females, but as you’ll see in the next exercise, there’s more to the story.
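If the grouped n / sum(n) step feels opaque, here is the same arithmetic spelled out in base R. The counts are copied from the count() output earlier in this exercise; the data frame name ucb_counts is just a stand-in for illustration.

```r
# Counts copied from the earlier count(Gender, Admit) output
ucb_counts <- data.frame(
  Gender = c("Male", "Male", "Female", "Female"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected"),
  n      = c(1198, 1493, 557, 1278)
)

# ave() applies sum() within each Gender, much like group_by() + mutate()
ucb_counts$prop <- ucb_counts$n / ave(ucb_counts$n, ucb_counts$Gender, FUN = sum)

# Keep only the admitted rows
ucb_counts[ucb_counts$Admit == "Admitted", ]
```

For males this is 1198 / (1198 + 1493), and for females 557 / (557 + 1278).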
Proportion of males admitted for each department
Finally we’ll make a table similar to the one we constructed earlier, except we’ll first group the data by department. The goal is to compare the proportion of male applicants admitted across departments.
Proportions for each row of the data frame we create can be calculated as n / sum(n). Note that since the data are grouped by department and gender, sum(n) will be calculated for males and females separately for each department.
ucb_admission_counts <- ucb_admit %>%
# Counts by department, then gender, then admission status
count(Dept, Gender, Admit)
# See the result
ucb_admission_counts
## Dept Gender Admit n
## 1 A Male Admitted 512
## 2 A Male Rejected 313
## 3 A Female Admitted 89
## 4 A Female Rejected 19
## 5 B Male Admitted 353
## 6 B Male Rejected 207
## 7 B Female Admitted 17
## 8 B Female Rejected 8
## 9 C Male Admitted 120
## 10 C Male Rejected 205
## 11 C Female Admitted 202
## 12 C Female Rejected 391
## 13 D Male Admitted 138
## 14 D Male Rejected 279
## 15 D Female Admitted 131
## 16 D Female Rejected 244
## 17 E Male Admitted 53
## 18 E Male Rejected 138
## 19 E Female Admitted 94
## 20 E Female Rejected 299
## 21 F Male Admitted 22
## 22 F Male Rejected 351
## 23 F Female Admitted 24
## 24 F Female Rejected 317
ucb_admission_counts %>%
# Group by department, then gender
group_by(Dept, Gender) %>%
# Create new variable
mutate(prop = n / sum(n)) %>%
# Filter for male and admitted
filter(Gender == "Male", Admit == "Admitted")
## # A tibble: 6 x 5
## # Groups: Dept, Gender [6]
## Dept Gender Admit n prop
## <chr> <fct> <fct> <int> <dbl>
## 1 A Male Admitted 512 0.621
## 2 B Male Admitted 353 0.630
## 3 C Male Admitted 120 0.369
## 4 D Male Admitted 138 0.331
## 5 E Male Admitted 53 0.277
## 6 F Male Admitted 22 0.0590
Amazing admission analyzing! The proportion of males admitted varies wildly between departments.
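The exercise only filters for males, but the same calculation can be run for both genders side by side. The base R sketch below re-enters the counts from the department-level table in this exercise (the names dept_counts and comparison are made up for illustration) and computes the proportion admitted for each department and gender.

```r
# Counts copied from the count(Dept, Gender, Admit) output in this exercise
dept_counts <- data.frame(
  Dept     = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Gender   = rep(c("Male", "Female"), times = 6),
  admitted = c(512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24),
  rejected = c(313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317)
)

# Proportion admitted within each department and gender
dept_counts$prop_admitted <- with(dept_counts, admitted / (admitted + rejected))

# One row per department, one column per gender
male   <- dept_counts[dept_counts$Gender == "Male", ]
female <- dept_counts[dept_counts$Gender == "Female", ]
comparison <- data.frame(
  Dept   = male$Dept,
  male   = round(male$prop_admitted, 3),
  female = round(female$prop_admitted, 3)
)
comparison
```

Comparing the two columns department by department is a good way to see why the overall proportions do not tell the whole story.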
Simple random sample in R
Suppose we want to collect some data from a sample of eight states. A list of all states and the region each belongs to (Northeast, Midwest, South, West) is given in the us_regions data frame.
load("_data/us_regions.RData")
# Simple random sample
states_srs <- us_regions %>%
sample_n(8)
# Count states by region
states_srs %>%
count(region)
## region n
## 1 Midwest 1
## 2 Northeast 2
## 3 South 3
## 4 West 2
Great work! Notice that this strategy may select an unequal number of states from each region. In the next exercise, you’ll implement stratified sampling to be sure to select an equal number of states from each region.
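Because the sample is random, you will get different states each time you run the code. A minimal sketch of how to make the draw repeatable is shown below. It assumes a stand-in data frame, toy_regions, built from R's built-in state.name and state.region vectors (its region labels may differ from those in us_regions), and it uses slice_sample(), which recent versions of dplyr offer as the successor to sample_n().

```r
library(dplyr)

# Hypothetical stand-in for us_regions, built from R's built-in state data
toy_regions <- data.frame(state = state.name, region = state.region)

# Fixing the seed before each draw makes the random sample repeatable
set.seed(42)
srs_a <- toy_regions %>% slice_sample(n = 8)

set.seed(42)
srs_b <- toy_regions %>% slice_sample(n = 8)

# Same seed, same sample
identical(srs_a$state, srs_b$state)
```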
Stratified sample in R
In the previous exercise, we took a simple random sample of eight states. However, we did not have any control over how many states from each region got sampled. The goal of stratified sampling in this context is to have control over the number of states sampled from each region. Our goal for this exercise is to sample an equal number of states from each region.
# Stratified sample
states_str <- us_regions %>%
group_by(region) %>%
sample_n(2)
# Count states by region
states_str %>%
count(region)
## # A tibble: 4 x 2
## # Groups: region [4]
## region n
## <fct> <int>
## 1 Midwest 2
## 2 Northeast 2
## 3 South 2
## 4 West 2
Nice job! In this stratified sample, each stratum (i.e., region) is represented equally.
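To see what group_by() followed by sample_n() is doing, it can help to write the steps out by hand. The base R sketch below splits the data by region, samples two rows from each piece, and stacks the results. It assumes a stand-in data frame, toy_regions, built from R's built-in state.name and state.region vectors rather than the real us_regions data.

```r
# Hypothetical stand-in for us_regions, built from R's built-in state data
toy_regions <- data.frame(state = state.name, region = state.region)

set.seed(1)
# Split into one data frame per region (one per stratum)
strata <- split(toy_regions, toy_regions$region)
# Draw 2 rows at random from each stratum
sampled <- lapply(strata, function(d) d[sample(nrow(d), 2), ])
# Stack the per-stratum samples back into one data frame
states_str_base <- do.call(rbind, sampled)

table(states_str_base$region)
```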
Connect blocking and stratifying
In random sampling, we use stratifying to control for a variable. In random assignment, we use blocking to achieve the same goal.
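As a rough illustration of blocking, the sketch below randomly assigns treatments separately within each block, so that every block ends up with the same mix of treatment and control. The participants data frame, the choice of blocking variable, and the treatment labels are all made up for this example.

```r
# Hypothetical participants, with a two-level blocking variable
set.seed(7)
participants <- data.frame(
  id    = 1:8,
  block = rep(c("female", "male"), each = 4)
)

# Within each block, shuffle an equal number of each treatment label
participants$treatment <- NA_character_
for (b in unique(participants$block)) {
  rows <- which(participants$block == b)
  labels <- rep(c("control", "treatment"), length.out = length(rows))
  participants$treatment[rows] <- sample(labels)
}

table(participants$block, participants$treatment)
```

The structure mirrors the stratified sample: group first, then randomize inside each group.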
Inspect the data
The purpose of this chapter is to give you an opportunity to apply and practice what you’ve learned on a real world dataset. For this reason, we’ll provide a little less guidance than usual.
The data from the study described in the video are available in your workspace as evals. Let’s take a look!
load("_data/evals.RData")
# Inspect evals
glimpse(evals)
## Rows: 463
## Columns: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8...
## $ rank <fct> tenure track, tenure track, tenure track, tenure trac...
## $ ethnicity <fct> minority, minority, minority, minority, not minority,...
## $ gender <fct> female, female, female, female, male, male, male, mal...
## $ language <fct> english, english, english, english, english, english,...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 4...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87....
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, ...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 2...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, uppe...
## $ cls_profs <fct> single, single, single, single, multiple, multiple, m...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi credi...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7,...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9,...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9,...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7,...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6,...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6,...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.33...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not f...
## $ pic_color <fct> color, color, color, color, color, color, color, colo...
# Alternative solutions
dim(evals)
## [1] 463 21
str(evals)
## tibble [463 x 21] (S3: tbl_df/tbl/data.frame)
## $ score : num [1:463] 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
## $ rank : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
## $ ethnicity : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
## $ language : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int [1:463] 36 36 36 36 59 59 59 51 51 40 ...
## $ cls_perc_eval: num [1:463] 55.8 68.8 60.8 62.6 85 ...
## $ cls_did_eval : int [1:463] 24 86 76 77 17 35 39 55 111 40 ...
## $ cls_students : int [1:463] 43 125 125 123 20 40 44 55 195 46 ...
## $ cls_level : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
## $ cls_profs : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
## $ cls_credits : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bty_f1lower : int [1:463] 5 5 5 5 4 4 4 5 5 2 ...
## $ bty_f1upper : int [1:463] 7 7 7 7 4 4 4 2 2 5 ...
## $ bty_f2upper : int [1:463] 6 6 6 6 2 2 2 5 5 4 ...
## $ bty_m1lower : int [1:463] 2 2 2 2 2 2 2 2 2 3 ...
## $ bty_m1upper : int [1:463] 4 4 4 4 3 3 3 3 3 3 ...
## $ bty_m2upper : int [1:463] 6 6 6 6 3 3 3 3 3 2 ...
## $ bty_avg : num [1:463] 5 5 5 5 3 ...
## $ pic_outfit : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
## $ pic_color : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...
Nice work! There are many ways to inspect a data frame in R and to find how many observations and variables it contains.
Identify variable types
It’s always useful to start your exploration of a dataset by identifying variable types. The results from this exercise will help you design appropriate visualizations and calculate useful summary statistics later in your analysis.
# Inspect variable types
glimpse(evals)
## Rows: 463
## Columns: 21
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8...
## $ rank <fct> tenure track, tenure track, tenure track, tenure trac...
## $ ethnicity <fct> minority, minority, minority, minority, not minority,...
## $ gender <fct> female, female, female, female, male, male, male, mal...
## $ language <fct> english, english, english, english, english, english,...
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 4...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87....
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, ...
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 2...
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, uppe...
## $ cls_profs <fct> single, single, single, single, multiple, multiple, m...
## $ cls_credits <fct> multi credit, multi credit, multi credit, multi credi...
## $ bty_f1lower <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7,...
## $ bty_f1upper <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9,...
## $ bty_f2upper <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9,...
## $ bty_m1lower <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7,...
## $ bty_m1upper <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6,...
## $ bty_m2upper <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6,...
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.33...
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not f...
## $ pic_color <fct> color, color, color, color, color, color, color, colo...
str(evals) # Another option
## tibble [463 x 21] (S3: tbl_df/tbl/data.frame)
## $ score : num [1:463] 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
## $ rank : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
## $ ethnicity : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
## $ language : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int [1:463] 36 36 36 36 59 59 59 51 51 40 ...
## $ cls_perc_eval: num [1:463] 55.8 68.8 60.8 62.6 85 ...
## $ cls_did_eval : int [1:463] 24 86 76 77 17 35 39 55 111 40 ...
## $ cls_students : int [1:463] 43 125 125 123 20 40 44 55 195 46 ...
## $ cls_level : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
## $ cls_profs : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
## $ cls_credits : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bty_f1lower : int [1:463] 5 5 5 5 4 4 4 5 5 2 ...
## $ bty_f1upper : int [1:463] 7 7 7 7 4 4 4 2 2 5 ...
## $ bty_f2upper : int [1:463] 6 6 6 6 2 2 2 5 5 4 ...
## $ bty_m1lower : int [1:463] 2 2 2 2 2 2 2 2 2 3 ...
## $ bty_m1upper : int [1:463] 4 4 4 4 3 3 3 3 3 3 ...
## $ bty_m2upper : int [1:463] 6 6 6 6 3 3 3 3 3 2 ...
## $ bty_avg : num [1:463] 5 5 5 5 3 ...
## $ pic_outfit : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
## $ pic_color : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...
# Remove non-factor variables from the vector below
cat_vars <- c("rank", "ethnicity", "gender", "language",
"cls_level", "cls_profs", "cls_credits",
"pic_outfit", "pic_color")
Recode a variable
The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is "small" (18 students or fewer), "midsize" (19 - 59 students), or "large" (60 students or more).
# Recode cls_students as cls_type
evals_fortified <- evals %>%
  mutate(
    cls_type = case_when(
      cls_students <= 18 ~ "small",
      cls_students >= 19 & cls_students <= 59 ~ "midsize",
      cls_students >= 60 ~ "large"
    )
  )
Excellent! The cls_type variable is a categorical variable, stored as a character vector. You could have made it a factor variable by wrapping the case_when() call inside factor(). You don’t have to do that now. Let’s move on!
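If you do want a factor, it is worth setting the level order explicitly so that "small" comes before "midsize" and "large" instead of following alphabetical order. The sketch below shows two options on made-up values (cls_students and cls_type_chr here are hypothetical vectors, not the evals columns): wrapping a character vector in factor(), and binning the numbers directly with base R's cut().

```r
# Hypothetical class sizes, including the boundary values
cls_students <- c(10, 18, 19, 59, 60, 125)

# Option 1: convert already-recoded strings to a factor with an explicit order
cls_type_chr <- c("small", "small", "midsize", "midsize", "large", "large")
cls_type <- factor(cls_type_chr, levels = c("small", "midsize", "large"))

# Option 2: cut() bins the numbers directly; intervals are right-closed by
# default, so the bins are (-Inf, 18], (18, 59], and (59, Inf)
cls_type_cut <- cut(cls_students,
                    breaks = c(-Inf, 18, 59, Inf),
                    labels = c("small", "midsize", "large"))

table(cls_type_cut)
```

For whole-number class sizes the two options agree; cut() would place a fractional value such as 18.5 in "midsize".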
Create a scatterplot
The bty_avg variable shows the average beauty rating of the professor, given by the six students who were asked to rate the attractiveness of these faculty. The score variable shows the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.
# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
geom_point()
Create a scatterplot, with an added layer
Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).
# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals_fortified, aes(x = bty_avg, y = score, color = cls_type)) +
geom_point()