You’ll be assessed on the quality of your peer review, and on your participation in it
Updates based on peer review are due Friday at 5pm
Your proposal will be “graded” after that, by me
HW 2 is due Thursday
Setup
# load packages
library(countdown)
library(tidyverse)
library(glue)
library(lubridate)
library(scales)
library(ggthemes)
library(gt)
library(palmerpenguins)
library(openintro)
library(ggrepel)

# set theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))

# set width of code output
options(width = 65)

# set figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,        # 7" width
  fig.asp = 0.618,      # the golden ratio
  fig.retina = 3,       # dpi multiplier for displaying HTML output on retina
  fig.align = "center", # center align figures
  dpi = 300             # higher dpi, sharper image
)
HW 1 lessons learned
Highlights
Review HW 1 issues, and show us you reviewed them by closing the issue.
DO NOT hard code paths! Use the here package to build relative paths if you need them.
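A minimal sketch of what that could look like, assuming the here package is installed. The folder and file names ("data", "penguins.csv") are hypothetical and only stand in for whatever your repo contains; the path is built but nothing is read.

```r
# Assumes the here package is installed; "data" and "penguins.csv"
# are hypothetical names used only for illustration
library(here)

# Build the path relative to the project root instead of typing out
# something like "C:/Users/me/hw-01/data/penguins.csv"
data_path <- here("data", "penguins.csv")
data_path
```

Because the path is resolved from the project root, the same line should work on anyone's machine once they clone the repo.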
Checks
Go to your HW 02 repo and make sure all your changes are committed. Then pull. You’ll see there were some updates. Fix merge conflicts, if any. Then push. Check that your document renders.
Warnings and messages
You should suppress package loading and data loading messages with message: false as a chunk option
You should also suppress warnings with warning: false after making sure you’re ok with them, or update your code to eliminate warnings
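As a sketch, a package-loading chunk with messages suppressed might look like the following. It assumes Quarto-style `#|` chunk options, and the label is a hypothetical name.

```r
#| label: load-packages
#| message: false

# The #| lines above are chunk options: message: false hides the
# startup messages that library() calls would otherwise print in
# the rendered document. The label is a hypothetical name.
library(dplyr)
```

The option only affects the rendered document; the messages still appear when you run the chunk interactively.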
Missing values I
Is it ok to suppress the following warning? Or should you update your code to eliminate it?
Combine the two position aesthetics (x and y) to produce a 2d position on the plot:
linear coordinate system: horizontal and vertical coordinates
polar coordinate system: angle and radius
maps: latitude and longitude
Draw axes and panel backgrounds, in coordination with the faceter
Coordinate systems: types
Linear coordinate systems: preserve the shape of geoms
coord_cartesian(): the default Cartesian coordinate system, where the 2d position of an element is given by the combination of the x and y positions.
coord_fixed(): Cartesian coordinate system with a fixed aspect ratio. (useful only in limited circumstances)
Non-linear coordinate systems: can change the shape of geoms – a straight line may no longer be straight. The shortest path between two points may no longer be a straight line.
coord_trans(): Apply arbitrary transformations to x and y positions, after the data has been processed by the stat
coord_polar(): Polar coordinates
coord_sf(): Map projections
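To see the difference between linear and non-linear coordinate systems, here is a small sketch that draws the same straight line under three of the coordinate systems listed above. The data frame is made up for illustration.

```r
library(ggplot2)

# Small made-up data frame: a straight line y = 2x
df <- data.frame(x = 1:10, y = 2 * (1:10))

base <- ggplot(df, aes(x = x, y = y)) +
  geom_line()

# Linear: the default Cartesian system, the line stays straight
p_cartesian <- base + coord_cartesian()

# Linear: one unit on x is drawn the same length as one unit on y
p_fixed <- base + coord_fixed(ratio = 1)

# Non-linear: x is mapped to the angle and y to the radius,
# so the straight line is drawn as a curve
p_polar <- base + coord_polar()
```

Printing `p_cartesian`, `p_fixed`, and `p_polar` one after the other shows the same data with three different shapes.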
Setting limits: what the plots say
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Plot 1")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(limits = c(190, 220)) +
  scale_y_continuous(limits = c(4000, 5000)) +
  labs(title = "Plot 2")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth() +
  xlim(190, 220) +
  ylim(4000, 5000) +
  labs(title = "Plot 3")

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth() +
  coord_cartesian(xlim = c(190, 220), ylim = c(4000, 5000)) +
  labs(title = "Plot 4")
Setting limits: what the warnings say
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Plot 1")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
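The four plots apply the same limits in two different ways. Limits set on a scale (scale_x_continuous(limits = ...), xlim(), ylim()) treat values outside the range as missing before the smooth is computed, while coord_cartesian() only zooms in on the plot. The sketch below counts how many penguins fall outside the limits used in Plots 2 to 4, which is a rough way to see how much data the scale-based versions would discard. The object names are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Rows with both variables observed
complete <- penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

# Rows outside the limits used in Plots 2 to 4: scale limits would
# treat these as missing, coord_cartesian() would keep them and
# only hide them from view
outside <- complete |>
  filter(
    !between(flipper_length_mm, 190, 220) |
      !between(body_mass_g, 4000, 5000)
  )

nrow(complete)
nrow(outside)
```

If `nrow(outside)` is large relative to `nrow(complete)`, the smooths in Plots 2 and 3 are fit to a much smaller dataset than the one in Plot 4.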
professions
# A tibble: 10 × 2
name profession
<chr> <chr>
1 Ada Lovelace Mathematician
2 Marie Curie Physicist and Chemist
3 Janaki Ammal Botanist
4 Chien-Shiung Wu Physicist
5 Katherine Johnson Mathematician
6 Rosalind Franklin Chemist
7 Vera Rubin Astronomer
8 Gladys West Mathematician
9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna Biochemist
dates
# A tibble: 8 × 3
name birth_year death_year
<chr> <dbl> <dbl>
1 Janaki Ammal 1897 1984
2 Chien-Shiung Wu 1912 1997
3 Katherine Johnson 1918 2020
4 Rosalind Franklin 1920 1958
5 Vera Rubin 1928 2016
6 Gladys West 1930 NA
7 Flossie Wong-Staal 1947 NA
8 Jennifer Doudna 1964 NA
works
# A tibble: 9 × 2
name known_for
<chr> <chr>
1 Ada Lovelace first computer algorithm
2 Marie Curie theory of radioactivity, first woman Nobel…
3 Janaki Ammal hybrid species, biodiversity protection
4 Chien-Shiung Wu experiment overturning theory of parity
5 Katherine Johnson orbital mechanics critical to sending first…
6 Vera Rubin existence of dark matter
7 Gladys West mathematical modeling of the shape of the E…
8 Flossie Wong-Staal first to clone HIV and map its genes, which…
9 Jennifer Doudna one of the primary developers of CRISPR
Desired output
# A tibble: 10 × 5
name profession birth…¹ death…² known…³
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first …
2 Marie Curie Physicist and Chem… NA NA theory…
3 Janaki Ammal Botanist 1897 1984 hybrid…
4 Chien-Shiung Wu Physicist 1912 1997 experi…
5 Katherine Johnson Mathematician 1918 2020 orbita…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existe…
8 Gladys West Mathematician 1930 NA mathem…
9 Flossie Wong-Staal Virologist and Mol… 1947 NA first …
10 Jennifer Doudna Biochemist 1964 NA one of…
# … with abbreviated variable names ¹birth_year, ²death_year,
# ³known_for
Inputs, reminder
names(professions)
[1] "name" "profession"
names(dates)
[1] "name" "birth_year" "death_year"
names(works)
[1] "name" "known_for"
nrow(professions)
[1] 10
nrow(dates)
[1] 8
nrow(works)
[1] 9
Joining data frames
something_join(x, y)
left_join(): all rows from x
right_join(): all rows from y
full_join(): all rows from both x and y
semi_join(): all rows from x where there are matching values in y, keeping just columns from x
inner_join(): all rows from x where there are matching values in y, returning every combination when there are multiple matches
anti_join(): all rows from x where there are no matching values in y, never duplicating rows of x
…
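The point about multiple matches is easier to see with a small example. This sketch uses two made-up tibbles where id 1 appears twice in the second one; `y_multi` and its values are hypothetical.

```r
library(dplyr)

x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
)

# Hypothetical y in which id 1 appears twice
y_multi <- tibble(
  id = c(1, 1, 2),
  value_y = c("y1a", "y1b", "y2")
)

# inner_join() returns one row per matching pair, so id 1 shows up twice
inner <- inner_join(x, y_multi, by = "id")

# semi_join() only filters x, so each matching row of x appears once
semi <- semi_join(x, y_multi, by = "id")

# anti_join() keeps the rows of x that have no match in y_multi
anti <- anti_join(x, y_multi, by = "id")

nrow(inner) # 3
nrow(semi)  # 2
nrow(anti)  # 1
```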
Setup
For the next few slides…
x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
)
x
# A tibble: 3 × 2
id value_x
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
)
y
# A tibble: 3 × 2
id value_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y4
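As a sketch of the first three joins in the list, here are left_join(), right_join(), and full_join() applied to these two tibbles. x and y are redefined so the block runs on its own, and by = "id" is spelled out to avoid the joining message.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# All rows from x: id 3 has no match, so its value_y is NA
left <- left_join(x, y, by = "id")

# All rows from y: id 4 has no match, so its value_x is NA
right <- right_join(x, y, by = "id")

# All rows from both: ids 1, 2, 3, and 4
full <- full_join(x, y, by = "id")

nrow(left)  # 3
nrow(right) # 3
nrow(full)  # 4
```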
# A tibble: 10 × 4
name profession birth…¹ death…²
<chr> <chr> <dbl> <dbl>
1 Ada Lovelace Mathematician NA NA
2 Marie Curie Physicist and Chemist NA NA
3 Janaki Ammal Botanist 1897 1984
4 Chien-Shiung Wu Physicist 1912 1997
5 Katherine Johnson Mathematician 1918 2020
6 Rosalind Franklin Chemist 1920 1958
7 Vera Rubin Astronomer 1928 2016
8 Gladys West Mathematician 1930 NA
9 Flossie Wong-Staal Virologist and Molecular B… 1947 NA
10 Jennifer Doudna Biochemist 1964 NA
# … with abbreviated variable names ¹birth_year, ²death_year
# A tibble: 10 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biod…
2 Chien-Shiung Wu 1912 1997 experiment overturni…
3 Katherine Johnson 1918 2020 orbital mechanics cr…
4 Rosalind Franklin 1920 1958 <NA>
5 Vera Rubin 1928 2016 existence of dark ma…
6 Gladys West 1930 NA mathematical modelin…
7 Flossie Wong-Staal 1947 NA first to clone HIV a…
8 Jennifer Doudna 1964 NA one of the primary d…
9 Ada Lovelace NA NA first computer algor…
10 Marie Curie NA NA theory of radioactiv…
inner_join()
inner_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 3
id value_x value_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
inner_join()
dates |> inner_join(works)
Joining with `by = join_by(name)`
# A tibble: 7 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodi…
2 Chien-Shiung Wu 1912 1997 experiment overturnin…
3 Katherine Johnson 1918 2020 orbital mechanics cri…
4 Vera Rubin 1928 2016 existence of dark mat…
5 Gladys West 1930 NA mathematical modeling…
6 Flossie Wong-Staal 1947 NA first to clone HIV an…
7 Jennifer Doudna 1964 NA one of the primary de…
semi_join()
semi_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 × 2
id value_x
<dbl> <chr>
1 1 x1
2 2 x2
semi_join()
dates |> semi_join(works)
Joining with `by = join_by(name)`
# A tibble: 7 × 3
name birth_year death_year
<chr> <dbl> <dbl>
1 Janaki Ammal 1897 1984
2 Chien-Shiung Wu 1912 1997
3 Katherine Johnson 1918 2020
4 Vera Rubin 1928 2016
5 Gladys West 1930 NA
6 Flossie Wong-Staal 1947 NA
7 Jennifer Doudna 1964 NA
anti_join()
anti_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 1 × 2
id value_x
<dbl> <chr>
1 3 x3
anti_join()
dates |> anti_join(works)
Joining with `by = join_by(name)`
# A tibble: 1 × 3
name birth_year death_year
<chr> <dbl> <dbl>
1 Rosalind Franklin 1920 1958
Joining with `by = join_by(name)`
Joining with `by = join_by(name)`
scientists
# A tibble: 10 × 5
name profession birth…¹ death…² known…³
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first …
2 Marie Curie Physicist and Chem… NA NA theory…
3 Janaki Ammal Botanist 1897 1984 hybrid…
4 Chien-Shiung Wu Physicist 1912 1997 experi…
5 Katherine Johnson Mathematician 1918 2020 orbita…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existe…
8 Gladys West Mathematician 1930 NA mathem…
9 Flossie Wong-Staal Virologist and Mol… 1947 NA first …
10 Jennifer Doudna Biochemist 1964 NA one of…
# … with abbreviated variable names ¹birth_year, ²death_year,
# ³known_for
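One way to get a result shaped like scientists is to chain two left joins starting from professions, so that every scientist is kept and missing dates or works show up as NA. This is a sketch of that idea, not necessarily the code that produced the output above. To keep it self-contained it uses three-row subsets of the original data frames, with the `_small` names made up for illustration.

```r
library(dplyr)

# Three-row subsets of the slide data, for illustration only
professions_small <- tibble(
  name = c("Ada Lovelace", "Rosalind Franklin", "Vera Rubin"),
  profession = c("Mathematician", "Chemist", "Astronomer")
)
dates_small <- tibble(
  name = c("Rosalind Franklin", "Vera Rubin"),
  birth_year = c(1920, 1928),
  death_year = c(1958, 2016)
)
works_small <- tibble(
  name = c("Ada Lovelace", "Vera Rubin"),
  known_for = c("first computer algorithm", "existence of dark matter")
)

# Keep every row of professions, adding dates and works where they match
scientists_small <- professions_small |>
  left_join(dates_small, by = "name") |>
  left_join(works_small, by = "name")

scientists_small
```

Starting from the data frame with the most complete list of names is what keeps all scientists in the result; starting from dates or works would drop anyone missing from them.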
*_join() functions
From dplyr
Incredibly useful for bringing together datasets with common information (e.g., a unique identifier)
Use the by argument when the columns containing the common information have different names across datasets
Always check that the numbers of rows and columns in the resulting dataset make sense
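A short sketch of the last two points, assuming dplyr 1.1.0 or later for join_by(). The tibble `y_key` and its column `key` are hypothetical, standing in for a dataset whose identifier column has a different name.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))

# Hypothetical dataset whose identifier column is called key, not id
y_key <- tibble(key = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# Tell the join which columns hold the common information
joined <- left_join(x, y_key, by = join_by(id == key))

# Sanity checks: a left join on a unique key should keep exactly
# the rows of x and add only the new columns from y_key
nrow(joined) == nrow(x)
names(joined)
```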