Session 12

Topics

Rinse, repeat to answer new questions
Plotting continuous data against categorical data
Using factor to order categorical variables
Thinking of our data as paired and redefining tidy

Rinse, repeat to answer new questions

Before tacking representation within a specific (or broad) discipline, I’d like to understand representation of students graduating with a Bachelor’s degree in 2018 at different types of institutions. Within the US there are different categories of schools based on their mission. One way of doing this is by using a framework developed by the Carnegie Commission on Higher Education in 1970 and modified since. The data for the 2000 definition are provided in the CARNEGIE column of ipeds/hd2018.csv (there are newer definitions that break down institutions into finer categories). Let’s ease into our analysis by reading in the awards data again finding the total number of Bachelor’s degrees conferred at each institution. We’ll take advantage of the observation that CIP code 99 indicates the total number of degrees awarded at the institution and that AWLEVEL “05” is for the Bachelor’s degree. Then we’ll bring in the institution name and its Carnegie information.

library(tidyverse)

bachelors_degrees <- read_csv("ipeds/c2018_a.csv") %>%
	filter(MAJORNUM == 1 & CIPCODE == "99" & AWLEVEL == "05" & CTOTALT > 0)

To make the data frame easier to look at I’ll remove these columns that we filtered on as well as any column that starts with an “X” (the X indicates how the data were generated).

bachelors_degrees <- read_csv("ipeds/c2018_a.csv") %>%
	filter(MAJORNUM == 1 & CIPCODE == "99" & AWLEVEL == "05" & CTOTALT > 0) %>%
	select(-CIPCODE, -MAJORNUM, -AWLEVEL, -starts_with("x"))

We’re left with the column for the institution identifier (“UNITID”) and the counts of different groups on campus. Let’s bring in the institution name and Carnegie information

inst_carnegie <- read_csv("ipeds/hd2018.csv") %>%
	select(UNITID, INSTNM, CARNEGIE)

Of course, we’d like to bring in the title for each Carnegie code

library(readxl)

carnegie_code <- read_excel("ipeds/hd2018.xlsx", sheet = "Frequencies") %>%
	filter(varname == "CARNEGIE") %>%
	select(codevalue, valuelabel)

Let’s go ahead and merge our data frames

inst_carnegie %>%
	count(CARNEGIE) %>%
	inner_join(., carnegie_code, by=c("CARNEGIE" = "codevalue"))

## Error: Can't join on `x$CARNEGIE` x `y$CARNEGIE` because of incompatible types.
## ℹ `x$CARNEGIE` is of type <double>>.
## ℹ `y$CARNEGIE` is of type <character>>.

Now we have a new error! In carnegie_code, codevalue is a character and in institutions, CARNEGIE is a double or numeric value. We can convert numbers to characters using as.character and if the character is a number (e.g. “-2”) we can convert it to a number with as.numeric. Since we won’t be doing any math on these columns, I’d suggest we convert the numbers to characters.

inst_carnegie %>%
	mutate(CARNEGIE = as.character(CARNEGIE)) %>%
	inner_join(., carnegie_code, by=c("CARNEGIE" = "codevalue"))

## # A tibble: 6,857 x 4
##    UNITID INSTNM               CARNEGIE valuelabel                              
##     <dbl> <chr>                <chr>    <chr>                                   
##  1 100654 Alabama A & M Unive… 16       Doctoral/Research Universities--Intensi…
##  2 100663 University of Alaba… 15       Doctoral/Research Universities--Extensi…
##  3 100690 Amridge University   51       Theological seminaries and other specia…
##  4 100706 University of Alaba… 16       Doctoral/Research Universities--Intensi…
##  5 100724 Alabama State Unive… 21       Masters Colleges and Universities I     
##  6 100733 University of Alaba… -2       Not applicable, not in Carnegie univers…
##  7 100751 The University of A… 15       Doctoral/Research Universities--Extensi…
##  8 100760 Central Alabama Com… 40       Associates Colleges                     
##  9 100812 Athens State Univer… 32       Baccalaureate Colleges--General         
## 10 100830 Auburn University a… 21       Masters Colleges and Universities I     
## # … with 6,847 more rows

Let’s rename the INSTNM and valuelabel columns to make them easier to read, drop the CARNEGIE column, and add it to the previous pipeline for generating inst_carnegie

inst_carnegie <- read_csv("ipeds/hd2018.csv") %>%
	select(UNITID, INSTNM, CARNEGIE) %>%
	mutate(CARNEGIE = as.character(CARNEGIE)) %>%
	inner_join(., carnegie_code, by=c("CARNEGIE" = "codevalue")) %>%
	rename("institution" = "INSTNM", "carnegie"="valuelabel") %>%
	select(-CARNEGIE)

Now we can join inst_carnegie and bachelors_degrees so that we have our demographic information for each institution with the Carnegie group labeled.

bachelor_demographics <- inner_join(inst_carnegie, bachelors_degrees, by="UNITID") %>%
	select(-UNITID)

Let’s see how many institutions we have in each category

bachelor_demographics %>%
	count(carnegie)

## # A tibble: 18 x 2
##    carnegie                                                                    n
##    <chr>                                                                   <int>
##  1 Associates Colleges                                                       254
##  2 Baccalaureate Colleges--General                                           294
##  3 Baccalaureate Colleges--Liberal Arts                                      216
##  4 Baccalaureate/Associates Colleges                                          40
##  5 Doctoral/Research Universities--Extensive                                 149
##  6 Doctoral/Research Universities--Intensive                                 103
##  7 Masters Colleges and Universities I                                       478
##  8 Masters Colleges and Universities II                                      104
##  9 Medical schools and medical centers                                        17
## 10 Not applicable, not in Carnegie universe (not accredited or nondegree-…   515
## 11 Other separate health profession schools                                   41
## 12 Other specialized institutions                                             32
## 13 Schools of art, music, and design                                          62
## 14 Schools of business and management                                         23
## 15 Schools of engineering and technology                                      21
## 16 Teachers colleges                                                           4
## 17 Theological seminaries and other specialized faith-related institutions   129
## 18 Tribal colleges                                                            13

Now let’s see the median fraction of graduates who were women for each type of institution

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	group_by(carnegie) %>%
	summarize(f_women = median(f_women), n=n()) %>%
	arrange(desc(f_women))

## # A tibble: 18 x 3
##    carnegie                                                        f_women     n
##    <chr>                                                             <dbl> <int>
##  1 Other separate health profession schools                          0.838    41
##  2 Medical schools and medical centers                               0.820    17
##  3 Tribal colleges                                                   0.75     13
##  4 Associates Colleges                                               0.646   254
##  5 Baccalaureate/Associates Colleges                                 0.636    40
##  6 Schools of art, music, and design                                 0.620    62
##  7 Masters Colleges and Universities I                               0.610   478
##  8 Masters Colleges and Universities II                              0.610   104
##  9 Not applicable, not in Carnegie universe (not accredited or no…   0.606   515
## 10 Teachers colleges                                                 0.606     4
## 11 Baccalaureate Colleges--General                                   0.602   294
## 12 Baccalaureate Colleges--Liberal Arts                              0.573   216
## 13 Doctoral/Research Universities--Intensive                         0.571   103
## 14 Other specialized institutions                                    0.551    32
## 15 Doctoral/Research Universities--Extensive                         0.527   149
## 16 Schools of business and management                                0.490    23
## 17 Theological seminaries and other specialized faith-related ins…   0.353   129
## 18 Schools of engineering and technology                             0.201    21

These results are pretty interesting! The types of schools that have the most women are health related (nursing?) and Tribal colleges. Those with the poorest representation of women are business, engineering and technology, and theological schools.

Plotting continuous data against categorical data

Believe it or not, we’ve done a lot of analysis so far in this lesson and we haven’t learned any new syntax. Let’s change that and see how we might plot these data. The first type of plot is a strip or jitter plot. Personally, I like these types of plots because they show every data point. This can also be their downfall because the plots can get a bit busy. We will plot the Carnegie category on the x-axis and the fraction of women on the y-axis

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_jitter()

plot of chunk unnamed-chunk-11

Yeah, there are a lot of points there. There are probably too many categories as well making it impossible to resolve the labels along the x-axis. To solve this second point we could switch the variables that we map to the x and y aesthetics

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(y=carnegie, x=f_women)) +
		geom_jitter()

plot of chunk unnamed-chunk-12

Alternatively, we can flip the axes using coord_flip

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_jitter() +
		coord_flip()

plot of chunk unnamed-chunk-13

As far as there being too many points, there are two options - box plots and violin plots. To flip the axes for these plots we have to use coord_flip

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_boxplot() +
		coord_flip()

plot of chunk unnamed-chunk-14

…and…

bachelor_demographics %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_violin() +
		coord_flip()

plot of chunk unnamed-chunk-15

To reduce the number of columns, let’s remove those institutions that are “not in the Carnegie universe” and those categories with fewer than 100 institutions. That will still leave us with 8 categories. I’m not interested in typing those in, so let’s see if we can do this with a join.

carnegie_categories <- bachelor_demographics %>%
	count(carnegie) %>%
	filter(carnegie != "Not applicable, not in Carnegie universe (not accredited or nondegree-granting)" & n > 100)

We can now insert a join to carnegie_categories into our pipeline

bachelor_demographics %>%
	inner_join(., carnegie_categories, by="carnegie") %>%
	mutate(f_women =CTOTALW/CTOTALT) %>%
	select(carnegie, f_women) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_boxplot() +
		coord_flip()

plot of chunk unnamed-chunk-17

Using `factor` to order categorical variables

How about if we want to order the categories by their median values? I would break up the last pipeline into three parts. The first to create a data frame with the fraction of women (i.e. representation). Then I would generate a summary table that is ordered by the median value of f_women (i.e. carnegie_summary). Then I would use the order of the carnegie column ni the carnegie_summary table to set the order of levels of the carnegie column in representation and create the plot. Let’s see how that plan works out

representation <- bachelor_demographics %>%
	inner_join(., carnegie_categories, by="carnegie") %>%
	mutate(f_women = CTOTALW/CTOTALT) %>%
	select(carnegie, f_women)

carnegie_ordered <- representation %>%
	group_by(carnegie) %>%
	summarize(f_women = median(f_women)) %>%
	arrange(desc(f_women)) %>%
	pull(carnegie)

representation %>%
	mutate(carnegie = factor(carnegie, levels=carnegie_ordered)) %>%
	ggplot(aes(x=carnegie, y=f_women)) +
		geom_boxplot() +
		coord_flip()

plot of chunk unnamed-chunk-18

The two “tricks” here are using pull to create a vector, carnegie_ordered, which is the order of the categories as we want them in the plots. Then we used that vector to set the order of levels in factor when building the plot. We could do more to “beautify” the figure, but let’s move on to other questions.

Instead of looking across all types of institutions, Let’s look instead at the representation of women among institutions in the “Doctoral/Research Universities–Extensive” category. We’ll look at the Bachelor’s (AWLEVEL: 05) and Doctoral (Research) (AWLEVEL: 17) graduates from these institutions. As we did earlier, we’ll begin by filtering the data in ipeds/hd2018.csv to obtain data from institutions in the “Doctoral/Research Universities–Extensive” category, which we saw earlier was CARNEGIE codes 15.

library(tidyverse)

doc_institutions <- read_csv("ipeds/hd2018.csv") %>%
	filter(CARNEGIE == "15") %>%
	select(UNITID, INSTNM)

That results in a data frame with 258 institutions. Now we need to get the Bachelor’s, Master’s, and Doctorate awards data from the ipeds/c2018_a.csv data frame and join it with the doc_institutions data frame.

read_csv("ipeds/c2018_a.csv") %>%
	filter(MAJORNUM == 1 & CIPCODE == "99" & CTOTALT > 0 & (AWLEVEL == "05" | AWLEVEL == "17")) %>%
	inner_join(doc_institutions, ., by="UNITID")

## # A tibble: 300 x 65
##    UNITID INSTNM CIPCODE MAJORNUM AWLEVEL XCTOTALT CTOTALT XCTOTALM CTOTALM
##     <dbl> <chr>  <chr>      <dbl> <chr>   <chr>      <dbl> <chr>      <dbl>
##  1 100663 Unive… 99             1 05      R           2463 R            986
##  2 100663 Unive… 99             1 17      R            161 R             66
##  3 100751 The U… 99             1 05      R           6750 R           2877
##  4 100751 The U… 99             1 17      R            232 R            116
##  5 100858 Aubur… 99             1 05      R           4801 R           2348
##  6 100858 Aubur… 99             1 17      R            253 R            147
##  7 104151 Arizo… 99             1 05      R           9994 R           5467
##  8 104151 Arizo… 99             1 17      R            508 R            301
##  9 104179 Unive… 99             1 05      R           7132 R           3267
## 10 104179 Unive… 99             1 17      R            451 R            229
## # … with 290 more rows, and 56 more variables: XCTOTALW <chr>, CTOTALW <dbl>,
## #   XCAIANT <chr>, CAIANT <dbl>, XCAIANM <chr>, CAIANM <dbl>, XCAIANW <chr>,
## #   CAIANW <dbl>, XCASIAT <chr>, CASIAT <dbl>, XCASIAM <chr>, CASIAM <dbl>,
## #   XCASIAW <chr>, CASIAW <dbl>, XCBKAAT <chr>, CBKAAT <dbl>, XCBKAAM <chr>,
## #   CBKAAM <dbl>, XCBKAAW <chr>, CBKAAW <dbl>, XCHISPT <chr>, CHISPT <dbl>,
## #   XCHISPM <chr>, CHISPM <dbl>, XCHISPW <chr>, CHISPW <dbl>, XCNHPIT <chr>,
## #   CNHPIT <dbl>, XCNHPIM <chr>, CNHPIM <dbl>, XCNHPIW <chr>, CNHPIW <dbl>,
## #   XCWHITT <chr>, CWHITT <dbl>, XCWHITM <chr>, CWHITM <dbl>, XCWHITW <chr>,
## #   CWHITW <dbl>, XC2MORT <chr>, C2MORT <dbl>, XC2MORM <chr>, C2MORM <dbl>,
## #   XC2MORW <chr>, C2MORW <dbl>, XCUNKNT <chr>, CUNKNT <dbl>, XCUNKNM <chr>,
## #   CUNKNM <dbl>, XCUNKNW <chr>, CUNKNW <dbl>, XCNRALT <chr>, CNRALT <dbl>,
## #   XCNRALM <chr>, CNRALM <dbl>, XCNRALW <chr>, CNRALW <dbl>

As we did earlier, we can clean this up a bit to remove the extra columns and assign the output to a variable.

doc_degrees <- read_csv("ipeds/c2018_a.csv") %>%
	filter(MAJORNUM == 1 & CIPCODE == "99" & CTOTALT > 0 & (AWLEVEL == "05" | AWLEVEL == "17")) %>%
	inner_join(doc_institutions, ., by="UNITID") %>%
	select(-CIPCODE, -MAJORNUM, -starts_with("X"))

I’m interested in getting a plot for the fraction of women that graduated with bachelor’s or doctorate degrees. We’ve already seen how to calculate the fraction of women, let’s expand that to include these other categories and then simplify our data frame a bit

doc_degree_rates <- doc_degrees %>%
	mutate(f_women = CTOTALW / CTOTALT) %>%
	select(UNITID, INSTNM, AWLEVEL, f_women)

Like before, we can plot these data with a boxplot

doc_degree_rates %>%
	ggplot(aes(x=AWLEVEL, y=f_women)) +
		geom_boxplot()

plot of chunk unnamed-chunk-23

Thinking of our data as paired and redefining tidy

From the previous figure, it appears that the fraction of women who graduate with a doctorate is less than those that graduate with a bachelor’s degree. Alternatively, we could plot the data as a slope plot by connecting the bachelor’s rate to the doctorate rate.

doc_degree_rates %>%
	ggplot(aes(x=AWLEVEL, y=f_women, group=UNITID)) +
		geom_line()

plot of chunk unnamed-chunk-24

It would be good to quantify the difference between the representation at the two award levels. You should be thinking about using group_by and summarize. The challenge is that we would like to group by institution and calculate the difference between the two levels. Unfortunately, we can’t easily calculate the difference between rows; only columns. To pull this off, we need to spread our f_women column across two columns corresponding to the two award levels. Our data frame has been “tidy” for our purposes, but now it is not - we actually need to make it wider to make it tidy! To do this, we will use the pivot_wider function. To demonstrate it, let’s simplify our doc_degree_rates data frame a bit.

doc_degree_rates %>%
	select(INSTNM, AWLEVEL, f_women)

## # A tibble: 300 x 3
##    INSTNM                              AWLEVEL f_women
##    <chr>                               <chr>     <dbl>
##  1 University of Alabama at Birmingham 05        0.600
##  2 University of Alabama at Birmingham 17        0.590
##  3 The University of Alabama           05        0.574
##  4 The University of Alabama           17        0.5  
##  5 Auburn University                   05        0.511
##  6 Auburn University                   17        0.419
##  7 Arizona State University-Tempe      05        0.453
##  8 Arizona State University-Tempe      17        0.407
##  9 University of Arizona               05        0.542
## 10 University of Arizona               17        0.492
## # … with 290 more rows

What we’d like is one line for each institution and three columns - INSTNM, 05, and 17. The 05 and 17 columns would have the f_women data for each institution. We’ll do this with pivot_wider

doc_degree_rates %>%
	select(INSTNM, AWLEVEL, f_women) %>%
	pivot_wider(names_from=AWLEVEL, values_from=f_women)

## # A tibble: 151 x 3
##    INSTNM                               `05`  `17`
##    <chr>                               <dbl> <dbl>
##  1 University of Alabama at Birmingham 0.600 0.590
##  2 The University of Alabama           0.574 0.5  
##  3 Auburn University                   0.511 0.419
##  4 Arizona State University-Tempe      0.453 0.407
##  5 University of Arizona               0.542 0.492
##  6 University of Arkansas              0.548 0.353
##  7 California Institute of Technology  0.404 0.298
##  8 University of California-Berkeley   0.536 0.414
##  9 University of California-Davis      0.582 0.473
## 10 University of California-Irvine     0.561 0.403
## # … with 141 more rows

It worked - great! You should be able to see that the column names camed from AWLEVEL and hte values came from f_women. Now we can calculate the percent drop with a mutate function

doc_degree_rates %>%
	select(INSTNM, AWLEVEL, f_women) %>%
	pivot_wider(names_from=AWLEVEL, values_from=f_women) %>%
	mutate(percent_drop = 100*(`05` - `17`)/`05`)

## # A tibble: 151 x 4
##    INSTNM                               `05`  `17` percent_drop
##    <chr>                               <dbl> <dbl>        <dbl>
##  1 University of Alabama at Birmingham 0.600 0.590         1.60
##  2 The University of Alabama           0.574 0.5          12.9 
##  3 Auburn University                   0.511 0.419        18.0 
##  4 Arizona State University-Tempe      0.453 0.407        10.0 
##  5 University of Arizona               0.542 0.492         9.17
##  6 University of Arkansas              0.548 0.353        35.6 
##  7 California Institute of Technology  0.404 0.298        26.0 
##  8 University of California-Berkeley   0.536 0.414        22.6 
##  9 University of California-Davis      0.582 0.473        18.7 
## 10 University of California-Irvine     0.561 0.403        28.3 
## # … with 141 more rows

Finally, we can summarize the median representation for each award level as well as for the drop in women using median. Notice that we are using summarize without group_by

doc_degree_rates %>%
	select(INSTNM, AWLEVEL, f_women) %>%
	pivot_wider(names_from=AWLEVEL, values_from=f_women) %>%
	mutate(percent_drop = 100*(`05` - `17`)/`05`) %>%
	summarize(med_bachelors = median(`05`, na.rm=T),
		med_doctorate = median(`17`, na.rm=T),
		med_drop = median(percent_drop, na.rm=T))

## # A tibble: 1 x 3
##   med_bachelors med_doctorate med_drop
##           <dbl>         <dbl>    <dbl>
## 1         0.527         0.465     10.7

At the most extensive doctorate institution, the percentage of women earning a doctorate degree after earning a bachelor’s falls by nearly 6 percentage points or more than 10%.

Exercises

1. Using bachelor_demographics, repeat the analysis we did to compare the fraction of women graduating with bachelor’s degrees across Carnegie categories, but use the fraction of “Black or African American” graduates instead. Can you obtain the median rate of “Black or African American” bachelor’s degree students for each category and the number of schools in each category?

2. Generate a slope plot showing the differences in gender representation among graduates awarded bachelors and doctorate degrees using the data from the “Doctoral/Research Universities–Intensive” Carnegie category data. What is the average difference in representation by degree?

3. Repeat the analysis from Question 2, but quantify the difference in representation among “Black or African American” graduates earning a bachelor’s or doctorate degree from institutions in the “Doctoral/Research Universities–Extensive” Carnegie category

4. Using one of the three slope plots we have generated, modify the code to make the line for your institution or your favorite doctorate-awarding institution stand out