How Do Your Scores Compare: An exploration and analysis of socioeconomic factors and the SAT¶

Sydney Feldman and Gracie Dunaway¶

Project Website

Introduction¶

Recentely, the Supreme Court ruled on affirmative action, taking away the ability for a university to use race as a factor in admission decisions. While we hope most universities do not use this as an excuse to limit their diversity on campus, they have to think more creatively about how they decide who to admit into their school.

Standardized testing has been a long standing factor in the college admissions process, and they don't seem to be going away. Students recieve a range of scores on these exams, the SAT and ACT, that contribute to their admissions decisions. When a student submits their score, all the school recieves is a number, not what factors contribute to the score.

This project is going to be a deep dive into this, looking for different factors that contribute to an SAT score. What contributes the most? Can previous academic performace relate?

Our focus is building a tool that colleges and universities can use to help make admissions decisions regarding SAT scores. The first goal is to figure out what factors have a correlation to SAT scores, and the second is to make a system for colleges to use to help keep an equitable admissions process.

Data Sources

We used 3 different data sources in this project.The first two were found on Kaggle. The first is called US School Scores. This dataset has 99 columns of information regarding SAT scores by state and US territory from 2005-2015. Although all the information in this dataset is based on when the SAT was out of 2400 points instead of the current 1600 points, it is still relevant to our goals. The only difference between the two tests is that the written portion became optional but the data set we have does not include the writing section so the score is out of 1600 still. The columns cover a range of topics including scores from different sections of the SAT, subject specific GPA, the average scores based on family income, and many more data points. You can find the dataset used here.

The secondary data set is also from Kaggle, and contains information organized by state and year, like the first dataset. It contains information about state revenue and spending, the number of students in each grade, 4th and 8th grade standardized test scores, and these separated by factors of race and gender. The dataset is linked here.

Our last dataset is from the National Center for Education Statistics. It contains information on average SAT scores by race by year. The dataset is linked here.

Extraction, Transform, and Load¶

Extraction and Load We downloaded the datasets directly from Kaggle to a CSV files and added them to our Google Drives and mounted it. We then imported the Python libraries we will be using in our first set of analysis, and read in the files using pd.read_csv.

In [ ]:
# Connecting Notebook to Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive
In [ ]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
In [ ]:
# Reading in the datasets
df1 = pd.read_csv('/content/drive/MyDrive/Data_science_project/school_scores.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Data_science_project/states_all_extended.csv')
race_dataset = '/content/drive/MyDrive/Data_science_project/tabn226.10.xls'
df3 = pd.read_excel(race_dataset)
In [ ]:
df1.head()
Out[ ]:
Year State.Code State.Name Total.Math Total.Test-takers Total.Verbal Academic Subjects.Arts/Music.Average GPA Academic Subjects.Arts/Music.Average Years Academic Subjects.English.Average GPA Academic Subjects.English.Average Years ... Score Ranges.Between 600 to 700.Math.Total Score Ranges.Between 600 to 700.Verbal.Females Score Ranges.Between 600 to 700.Verbal.Males Score Ranges.Between 600 to 700.Verbal.Total Score Ranges.Between 700 to 800.Math.Females Score Ranges.Between 700 to 800.Math.Males Score Ranges.Between 700 to 800.Math.Total Score Ranges.Between 700 to 800.Verbal.Females Score Ranges.Between 700 to 800.Verbal.Males Score Ranges.Between 700 to 800.Verbal.Total
0 2005 AL Alabama 559 3985 567 3.92 2.2 3.53 3.9 ... 1096 596 613 1209 156 292 448 219 250 469
1 2005 AK Alaska 519 3996 523 3.76 1.9 3.35 3.9 ... 787 423 375 798 47 116 163 109 115 224
2 2005 AZ Arizona 530 18184 526 3.85 2.1 3.45 3.9 ... 3814 1831 1679 3510 327 630 957 412 501 913
3 2005 AR Arkansas 552 1600 563 3.90 2.2 3.61 4.0 ... 443 242 226 468 49 83 132 77 93 170
4 2005 CA California 522 186552 504 3.76 1.8 3.32 3.8 ... 37605 16078 14966 31044 5451 8792 14243 5011 4766 9777

5 rows × 99 columns

In [ ]:
df2.head()
Out[ ]:
PRIMARY_KEY STATE YEAR ENROLL TOTAL_REVENUE FEDERAL_REVENUE STATE_REVENUE LOCAL_REVENUE TOTAL_EXPENDITURE INSTRUCTION_EXPENDITURE ... G08_HI_A_READING G08_HI_A_MATHEMATICS G08_AS_A_READING G08_AS_A_MATHEMATICS G08_AM_A_READING G08_AM_A_MATHEMATICS G08_HP_A_READING G08_HP_A_MATHEMATICS G08_TR_A_READING G08_TR_A_MATHEMATICS
0 1992_ALABAMA ALABAMA 1992 NaN 2678885.0 304177.0 1659028.0 715680.0 2653798.0 1481703.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1992_ALASKA ALASKA 1992 NaN 1049591.0 106780.0 720711.0 222100.0 972488.0 498362.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1992_ARIZONA ARIZONA 1992 NaN 3258079.0 297888.0 1369815.0 1590376.0 3401580.0 1435908.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1992_ARKANSAS ARKANSAS 1992 NaN 1711959.0 178571.0 958785.0 574603.0 1743022.0 964323.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1992_CALIFORNIA CALIFORNIA 1992 NaN 26260025.0 2072470.0 16546514.0 7641041.0 27138832.0 14358922.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 266 columns

In [ ]:
df3.head()
Out[ ]:
Table 226.10. SAT mean scores of college-bound seniors, by race/ethnicity: Selected years, 1986-87 through 2014-15 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20
0 Race/ethnicity 1986- 87 1990- 91 1996- 97 2000- 01 2003- 04 2004- 05 2005-06 2006-07 2007-08 ... 2009-10 2010-11 2011-12 2012-13 NaN NaN Score change NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2010-11 2011-12 NaN 2013-14 2014-15 1990-91 to \n2004-05 2004-05\nto\n2014-15 2009-10\n to\n 2014-15 2013-14 to \n2014-15
2 1 2 3 4 5 6 7 8 9 10 ... 12 13 14 15 16 17 18 19 20 21
3 SAT—Critical reading NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 All students ............... 507 499 505 506 508 508 503 502 502 ... 501 497 496 496 497 495 9 -13 -6 -2

5 rows × 21 columns

Tidy and Transform

There are no null cells in this dataset 1, however there are 1,037 with the value of 0. We have set these cells to contain NaN instead so that calculations are not impacted.

In [ ]:
df1.isnull().sum().sum() #0
(df1==0).sum().sum()
df1.replace(0, np.nan, inplace=True)

Our datatypes are correct.

In [ ]:
df1.dtypes
Out[ ]:
Year                                               int64
State.Code                                        object
State.Name                                        object
Total.Math                                         int64
Total.Test-takers                                  int64
                                                   ...  
Score Ranges.Between 700 to 800.Math.Males         int64
Score Ranges.Between 700 to 800.Math.Total         int64
Score Ranges.Between 700 to 800.Verbal.Females     int64
Score Ranges.Between 700 to 800.Verbal.Males       int64
Score Ranges.Between 700 to 800.Verbal.Total       int64
Length: 99, dtype: object

Now, we will create new dataframes and tidy them that will be used in our analysis.

We focused on changing the names of columns so that we can merge different dataframes, and creating our own columns to use in our analysis. In addition, we only kept columns in these new dataframes that we will be using for our analysis, and got rid of what we didn't need.

In [ ]:
df1['State.Name'] = df1['State.Name'].str.upper()
df_income = df1[['Year', 'State.Name', 'Family Income.Less than 20k.Test-takers', 'Family Income.Less than 20k.Math','Family Income.Less than 20k.Verbal','Family Income.Between 20-40k.Test-takers', 'Family Income.Between 20-40k.Math', 'Family Income.Between 20-40k.Verbal', 'Family Income.Between 40-60k.Test-takers','Family Income.Between 40-60k.Math', 'Family Income.Between 40-60k.Verbal','Family Income.Between 60-80k.Test-takers', 'Family Income.Between 60-80k.Math','Family Income.Between 60-80k.Verbal', 'Family Income.Between 80-100k.Test-takers', 'Family Income.Between 80-100k.Math', 'Family Income.Between 80-100k.Verbal', 'Family Income.More than 100k.Test-takers', 'Family Income.More than 100k.Math', 'Family Income.More than 100k.Verbal', 'Gender.Female.Math', 'Gender.Female.Verbal', 'Gender.Male.Math', 'Gender.Male.Verbal' ]]
df_income["Total.Test-takers"] = df_income["Family Income.Less than 20k.Test-takers"] + df_income["Family Income.Between 20-40k.Test-takers"] +df_income["Family Income.Between 40-60k.Test-takers"] + df_income["Family Income.Between 60-80k.Test-takers"] + df_income["Family Income.Between 80-100k.Test-takers"] + df_income["Family Income.More than 100k.Test-takers"]
df_income["% < 20k"] = df_income["Family Income.Less than 20k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 20-40k"] = df_income["Family Income.Between 20-40k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 40-60k"] = df_income["Family Income.Between 40-60k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 60-80k"] =df_income["Family Income.Between 60-80k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 80-100k"] =df_income["Family Income.Between 80-100k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% > 100k"] =df_income["Family Income.More than 100k.Test-takers"]/df_income["Total.Test-takers"]
In [ ]:
df_income = df_income.rename(columns = { 'State.Name': 'State',
                                      'Family Income.Less than 20k.Test-takers' : 'Family Income < 20k: Number of Test-takers',
                                      'Family Income.Less than 20k.Math' : 'Family Income < 20k: Math Average Score',
                                      'Family Income.Less than 20k.Verbal' : 'Family Income < 20k: Reading Average Score',
                                      'Family Income.Between 20-40k.Test-takers' : 'Family Income 20-40k: Number of Test-takers',
                                      'Family Income.Between 20-40k.Math' : 'Family Income 20-40k: Math Average Score',
                                      'Family Income.Between 20-40k.Verbal' : 'Family Income 20-40k: Reading Average Score',
                                      'Family Income.Between 40-60k.Test-takers' : 'Family Income 40-60k: Number of Test-takers',
                                      'Family Income.Between 40-60k.Math' : 'Family Income 40-60k: Math Average Score',
                                      'Family Income.Between 40-60k.Verbal' : 'Family Income 40-60k: Reading Average Score',
                                      'Family Income.Between 60-80k.Test-takers' : 'Family Income 60-80k: Number of Test-takers',
                                      'Family Income.Between 60-80k.Math' : 'Family Income 60-80k: Math Average Score',
                                      'Family Income.Between 60-80k.Verbal' : 'Family Income 60-80k: Reading Average Score',
                                      'Family Income.Between 80-100k.Test-takers' : 'Family Income 80-100k: Number of Test-takers',
                                      'Family Income.Between 80-100k.Math' : 'Family Income 80-100k: Math Average Score',
                                      'Family Income.Between 80-100k.Verbal' : 'Family Income 80-100k: Reading Average Score',
                                      'Family Income.More than 100k.Test-takers' : 'Family Income > 100k: Number of Test-takers',
                                      'Family Income.More than 100k.Math' : 'Family Income > 100k: Math Average Score',
                                      'Family Income.More than 100k.Verbal' : 'Family Income > 100k: Reading Average Score'
})
df_income.head()
Out[ ]:
Year State Family Income < 20k: Number of Test-takers Family Income < 20k: Math Average Score Family Income < 20k: Reading Average Score Family Income 20-40k: Number of Test-takers Family Income 20-40k: Math Average Score Family Income 20-40k: Reading Average Score Family Income 40-60k: Number of Test-takers Family Income 40-60k: Math Average Score ... Gender.Female.Verbal Gender.Male.Math Gender.Male.Verbal Total.Test-takers % < 20k % 20-40k % 40-60k % 60-80k % 80-100k % > 100k
0 2005 ALABAMA 175 462.0 474.0 324 513.0 527 442 539 ... 561 582 574 2869 0.060997 0.112931 0.154061 0.164866 0.165563 0.341582
1 2005 ALASKA 191 464.0 467.0 401 492.0 500 539 517 ... 521 535 526 2718 0.070272 0.147535 0.198308 0.221854 0.163355 0.198675
2 2005 ARIZONA 891 485.0 474.0 2121 498.0 495 2270 520 ... 522 549 531 12603 0.070697 0.168293 0.180116 0.188209 0.148060 0.244624
3 2005 ARKANSAS 107 489.0 486.0 180 513.0 526 245 543 ... 558 570 570 1220 0.087705 0.147541 0.200820 0.186066 0.120492 0.257377
4 2005 CALIFORNIA 19323 451.0 421.0 26161 477.0 458 18347 506 ... 499 543 510 123872 0.155992 0.211194 0.148113 0.144803 0.113989 0.225911

5 rows × 31 columns

In [ ]:
df_gpa = df1[['Academic Subjects.Arts/Music.Average GPA', 'Academic Subjects.Arts/Music.Average Years', 'Academic Subjects.English.Average GPA',
             'Academic Subjects.English.Average Years', 'Academic Subjects.Foreign Languages.Average GPA', 'Academic Subjects.Foreign Languages.Average Years',
             'Academic Subjects.Mathematics.Average GPA', 'Academic Subjects.Mathematics.Average Years', 'Academic Subjects.Natural Sciences.Average GPA',
             'Academic Subjects.Natural Sciences.Average Years', 'Academic Subjects.Social Sciences/History.Average GPA', 'Academic Subjects.Social Sciences/History.Average Years',
             'GPA.A minus.Math', 'GPA.A minus.Test-takers', 'GPA.A minus.Verbal', 'GPA.A plus.Math', 'GPA.A plus.Test-takers', 'GPA.A plus.Verbal', 'GPA.A.Math', 'GPA.A.Test-takers',
             'GPA.A.Verbal', 'GPA.B.Math', 'GPA.B.Test-takers', 'GPA.B.Verbal', 'GPA.C.Math', 'GPA.C.Test-takers', 'GPA.C.Verbal', 'GPA.D or lower.Math', 'GPA.D or lower.Test-takers',
             'GPA.D or lower.Verbal', 'GPA.No response.Math', 'GPA.No response.Test-takers', 'GPA.No response.Verbal', 'Score Ranges.Between 200 to 300.Math.Total', 'Score Ranges.Between 200 to 300.Verbal.Total',
             'Score Ranges.Between 300 to 400.Math.Total', 'Score Ranges.Between 300 to 400.Verbal.Total', 'Score Ranges.Between 400 to 500.Math.Total', 'Score Ranges.Between 400 to 500.Verbal.Total',
             'Score Ranges.Between 500 to 600.Math.Total', 'Score Ranges.Between 500 to 600.Verbal.Total', 'Score Ranges.Between 600 to 700.Math.Total',
             'Score Ranges.Between 600 to 700.Verbal.Total', 'Score Ranges.Between 700 to 800.Math.Total', 'Score Ranges.Between 700 to 800.Verbal.Total']]
df_gpa = df_gpa.rename(columns = { 'Academic Subjects.Arts/Music.Average GPA': 'Arts/Music GPA',
                                  'Academic Subjects.Arts/Music.Average Years': 'Arts/Music Avg Yrs',
                                  'Academic Subjects.English.Average GPA' : 'English GPA',
                                  'Academic Subjects.English.Average Years' : 'English Avg Yrs',
                                  'Academic Subjects.Foreign Languages.Average GPA' : 'Foreign Language GPA',
                                  'Academic Subjects.Foreign Languages.Average Years' : 'Foreign Language Avg Yrs',
                                  'Academic Subjects.Mathematics.Average GPA' : 'Math GPA',
                                  'Academic Subjects.Mathematics.Average Years' : 'Math Avg Yrs',
                                  'Academic Subjects.Natural Sciences.Average GPA' : 'Science GPA',
                                  'Academic Subjects.Natural Sciences.Average Years' : 'Science Avg Yrs',
                                  'Academic Subjects.Social Sciences/History.Average GPA' : 'Social Science GPA',
                                  'Academic Subjects.Social Sciences/History.Average Years' : 'Social Science Avg Yrs'
                                  })
df_gpa.head()
Out[ ]:
Arts/Music GPA Arts/Music Avg Yrs English GPA English Avg Yrs Foreign Language GPA Foreign Language Avg Yrs Math GPA Math Avg Yrs Science GPA Science Avg Yrs ... Score Ranges.Between 300 to 400.Math.Total Score Ranges.Between 300 to 400.Verbal.Total Score Ranges.Between 400 to 500.Math.Total Score Ranges.Between 400 to 500.Verbal.Total Score Ranges.Between 500 to 600.Math.Total Score Ranges.Between 500 to 600.Verbal.Total Score Ranges.Between 600 to 700.Math.Total Score Ranges.Between 600 to 700.Verbal.Total Score Ranges.Between 700 to 800.Math.Total Score Ranges.Between 700 to 800.Verbal.Total
0 3.92 2.2 3.53 3.9 3.54 2.6 3.41 4.0 3.52 3.9 ... 266.0 207 807.0 762.0 1336 1307 1096 1209 448 469
1 3.76 1.9 3.35 3.9 3.34 2.1 3.06 3.5 3.25 3.2 ... 386.0 389 1181.0 1208.0 1429 1325 787 798 163 224
2 3.85 2.1 3.45 3.9 3.41 2.6 3.25 3.9 3.43 3.4 ... 1331.0 1352 5163.0 5446.0 6728 6762 3814 3510 957 913
3 3.90 2.2 3.61 4.0 3.64 2.6 3.46 4.1 3.55 3.7 ... 99.0 88 347.0 324.0 560 538 443 468 132 170
4 3.76 1.8 3.32 3.8 3.29 2.8 3.05 3.7 3.20 3.2 ... 21754.0 26170 51702.0 58157.0 56817 55589 37605 31044 14243 9777

5 rows × 45 columns

Now we can tidy our second dataset.

Here, we are checking for null values. As you can see, there are many in this dataset, but it does not effect our analysis. One of the reasons for these empty values is due to the fact that certain characteristics, like race, was not collected during testing for every year.

In [ ]:
df2.isna().sum().sort_values()
Out[ ]:
PRIMARY_KEY                0
STATE                      0
YEAR                       0
KG_A_A                    83
G12_A_A                   83
                        ... 
G08_AM_A_MATHEMATICS    1655
G04_HP_A_READING        1699
G04_HP_A_MATHEMATICS    1700
G08_HP_A_READING        1701
G08_HP_A_MATHEMATICS    1702
Length: 266, dtype: int64

The datatypes are correct.

In [ ]:
df2.dtypes
Out[ ]:
PRIMARY_KEY              object
STATE                    object
YEAR                      int64
ENROLL                  float64
TOTAL_REVENUE           float64
                         ...   
G08_AM_A_MATHEMATICS    float64
G08_HP_A_READING        float64
G08_HP_A_MATHEMATICS    float64
G08_TR_A_READING        float64
G08_TR_A_MATHEMATICS    float64
Length: 266, dtype: object

We changed this dataset to only contain data from 2005-2015, like the first dataset. We also changed how some columns were named and got rid of rows that would not merge with the other dataframes.

In [ ]:
#pd.set_option('display.max_rows', None)
df2 = df2[(df2['YEAR'] >= 2005) & (df2['YEAR'] <= 2015)]
df2 = df2[(df2['STATE'] != 'NATIONAL') & (df2['STATE'] != 'DODEA')]
df2.index = range(len(df2.index))
df2 = df2.rename(columns = { 'STATE': 'State',  #renaming columns to match previous dataset for easier future use
                            'YEAR': 'Year'})
df1 = df1.rename(columns = {'State.Name' : 'State'})
df2['State'] = df2['State'].replace('DISTRICT_OF_COLUMBIA', 'DISTRICT OF COLUMBIA')
df2['State'] = df2['State'].replace('NEW_HAMPSHIRE', 'NEW HAMPSHIRE')
df2['State'] = df2['State'].replace('NEW_JERSEY', 'NEW JERSEY')
df2['State'] = df2['State'].replace('NEW_YORK', 'NEW YORK')
df2['State'] = df2['State'].replace('NEW_MEXICO', 'NEW MEXICO')
df2['State'] = df2['State'].replace('NORTH_DAKOTA', 'NORTH DAKOTA')
df2['State'] = df2['State'].replace('NORTH_CAROLINA', 'NORTH CAROLINA')
df2['State'] = df2['State'].replace('RHODE_ISLAND', 'RHODE ISLAND')
df2['State'] = df2['State'].replace('SOUTH_CAROLINA', 'SOUTH CAROLINA')
df2['State'] = df2['State'].replace('SOUTH_DAKOTA', 'SOUTH DAKOTA')
df2['State'] = df2['State'].replace('WEST_VIRGINIA', 'WEST VIRGINIA')


df2.head(5)
Out[ ]:
PRIMARY_KEY State Year ENROLL TOTAL_REVENUE FEDERAL_REVENUE STATE_REVENUE LOCAL_REVENUE TOTAL_EXPENDITURE INSTRUCTION_EXPENDITURE ... G08_HI_A_READING G08_HI_A_MATHEMATICS G08_AS_A_READING G08_AS_A_MATHEMATICS G08_AM_A_READING G08_AM_A_MATHEMATICS G08_HP_A_READING G08_HP_A_MATHEMATICS G08_TR_A_READING G08_TR_A_MATHEMATICS
0 2005_ALABAMA ALABAMA 2005 729342.0 5889966.0 665924.0 3247562.0 1976480.0 5904664.0 3053380.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2005_ALASKA ALASKA 2005 132568.0 1599739.0 302846.0 878731.0 418162.0 1751718.0 829666.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2005_ARIZONA ARIZONA 2005 919875.0 7550109.0 897553.0 3320196.0 3332360.0 7136794.0 3423716.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2005_ARKANSAS ARKANSAS 2005 461667.0 3978240.0 447012.0 2995788.0 535440.0 3999608.0 2118580.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2005_CALIFORNIA CALIFORNIA 2005 6307022.0 62171797.0 7148863.0 34912166.0 20110768.0 66196072.0 31070450.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 266 columns

This third dataset was a lot more complicated to clean. We were able to do it by lots of renaming, transpose, and removing information we didn't need.

In [ ]:
df3.at[0, 'Unnamed: 15'] = "2013-14"
df3.at[0, 'Unnamed: 16'] = "2014-15"
df3.at[0, 'Unnamed: 17'] = "Score change: 1990-91 to 2004-05"
df3.at[0, 'Unnamed: 18'] = "Score change: 2004-05 to 2014-15"
df3.at[0, 'Unnamed: 19'] = "Score change: 2009-10 to 2014-15"
df3.at[0, 'Unnamed: 20'] = "Score change: 2013-14 to 2014-15"

df3.columns = df3.iloc[0]
df3 = df3[2:]
df3.reset_index(drop=True, inplace=True)
df3.at[2, "Race/ethnicity"] = "SAT—Critical reading All students"
df3 = df3.drop(1)

df3.at[9, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(8)


df3.at[11, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(10)

df3.at[15, "Race/ethnicity"] = "SAT—Mathematics All students"
df3 = df3.drop(14)
df3 = df3.drop(13)

df3.at[22, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(21)

df3.at[24, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(23)

df3 = df3.drop(26)
df3 = df3.drop(27)

df3.at[28, "Race/ethnicity"] = "SAT—Writing All students"

df3.at[35, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(34)

df3.at[37, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(36)
df3 = df3.drop(39)
df3 = df3.drop(40)
df3 = df3.drop(41)

df3.at[3, "Race/ethnicity"] = "White"
df3.at[4, "Race/ethnicity"] = "Black"
df3.at[5, "Race/ethnicity"] = "Mexican American"
df3.at[6, "Race/ethnicity"] = "Puerto Rican"
df3.at[7, "Race/ethnicity"] = "Other Hispanic"
df3.at[12, "Race/ethnicity"] = "Other"

df3.at[16, "Race/ethnicity"] = "White"
df3.at[17, "Race/ethnicity"] = "Black"
df3.at[18, "Race/ethnicity"] = "Mexican American"
df3.at[19, "Race/ethnicity"] = "Puerto Rican"
df3.at[20, "Race/ethnicity"] = "Other Hispanic"
df3.at[25, "Race/ethnicity"] = "Other"

df3.at[29, "Race/ethnicity"] = "White"
df3.at[30, "Race/ethnicity"] = "Black"
df3.at[31, "Race/ethnicity"] = "Mexican American"
df3.at[32, "Race/ethnicity"] = "Puerto Rican"
df3.at[33, "Race/ethnicity"] = "Other Hispanic"
df3.at[38, "Race/ethnicity"] = "Other"
df3.reset_index(drop=True, inplace=True)
df3.head()
Out[ ]:
Race/ethnicity 1986- 87 1990- 91 1996- 97 2000- 01 2003- 04 2004- 05 2005-06 2006-07 2007-08 ... 2009-10 2010-11 2011-12 2012-13 2013-14 2014-15 Score change: 1990-91 to 2004-05 Score change: 2004-05 to 2014-15 Score change: 2009-10 to 2014-15 Score change: 2013-14 to 2014-15
0 1 2 3 4 5 6 7 8 9 10 ... 12 13 14 15 16 17 18 19 20 21
1 SAT—Critical reading All students 507 499 505 506 508 508 503 502 502 ... 501 497 496 496 497 495 9 -13 -6 -2
2 White 524 518 526 529 528 532 527 527 528 ... 528 528 527 527 529 529 14 -3 1 0
3 Black 428 427 434 433 430 433 434 433 430 ... 429 428 428 431 431 431 6 -2 2 0
4 Mexican American 457 454 451 451 451 453 454 455 454 ... 454 451 448 449 450 448 -1 -5 -6 -2

5 rows × 21 columns

In [ ]:
df3 = df3.drop(['1986- 87', '1990- 91', '1996- 97', '2000- 01', '2003- 04'], axis=1)
df3 = df3.drop(['Score change: 1990-91 to 2004-05', 'Score change: 2004-05 to 2014-15', 'Score change: 2009-10 to 2014-15', 'Score change: 2013-14 to 2014-15'], axis=1)

df3 = df3.rename(columns = {'2004- 05' : '2005',
                            '2005-06' : '2006',
                            '2006-07' : '2007',
                            '2007-08' : '2008',
                            '2008-09' : '2009',
                            '2009-10' : '2010',
                            '2010-11' : '2011',
                            '2011-12' : '2012',
                            '2012-13' : '2013',
                            '2013-14' : '2014',
                            '2014-15' : '2015'
                            })

We created a separate math and reading dataframe with this dataset to align with our other data and work for our model. We also converted the datatypes of the scores to ints, as they were objects before.

In [ ]:
# Creating the reading dataframe


df_R = df3[(df3['2005'].isin([532,433,453,460,463,511,489,495]))& (df3['2006'].isin([527, 434, 454,459,458,510,487,494]))]
df_R = df_R.rename(columns = {'Race/ethnicity': 'Year'})
df_R = df_R.T
df_R.columns = df_R.iloc[0]
df_R = df_R.drop('Year', axis=0)
df_R.reset_index(inplace =True)
df_R = df_R.rename(columns= {0:'Year'})

convert_dict = {'Year': int,
                'White': int,
                'Black': int,
                'Mexican American': int,
                'Puerto Rican': int,
                'Other Hispanic' : int,
                'Asian/Pacific Islander': int,
                'American Indian/ Alaska Native': int,
                'Other': int}
df_R = df_R.astype(convert_dict)
df_R
Out[ ]:
Year Year White Black Mexican American Puerto Rican Other Hispanic Asian/Pacific Islander American Indian/ Alaska Native Other
0 2005 532 433 453 460 463 511 489 495
1 2006 527 434 454 459 458 510 487 494
2 2007 527 433 455 459 459 514 487 497
3 2008 528 430 454 456 455 513 485 496
4 2009 528 429 453 452 455 516 486 494
5 2010 528 429 454 454 454 519 485 494
6 2011 528 428 451 452 451 517 484 493
7 2012 527 428 448 452 447 518 482 491
8 2013 527 431 449 456 450 521 480 492
9 2014 529 431 450 456 451 523 483 493
10 2015 529 431 448 456 449 525 481 490
In [ ]:
# Creating the math dataframe
df_M = df3[(df3['2005'].isin([536,431,463,457,469,580,493,513]))& (df3['2006'].isin([536, 429, 465,456,463,578,494,513]))]
df_M = df_M.rename(columns = {'Race/ethnicity': 'Year'})
df_M = df_M.T
df_M.columns = df_M.iloc[0]
df_M = df_M.drop('Year', axis=0)
df_M.reset_index(inplace =True)
df_M = df_M.rename(columns= {0:'Year'})
df_M.rename = None
convert_dict = {'Year': int,
                'White': int,
                'Black': int,
                'Mexican American': int,
                'Puerto Rican': int,
                'Other Hispanic' : int,
                'Asian/Pacific Islander': int,
                'American Indian/ Alaska Native': int,
                'Other': int}
df_M = df_M.astype(convert_dict)
df_M
Out[ ]:
Year Year White Black Mexican American Puerto Rican Other Hispanic Asian/Pacific Islander American Indian/ Alaska Native Other
0 2005 536 431 463 457 469 580 493 513
1 2006 536 429 465 456 463 578 494 513
2 2007 534 429 466 454 463 578 494 512
3 2008 537 426 463 453 461 581 491 512
4 2009 536 426 463 450 461 587 493 514
5 2010 536 428 467 452 462 591 492 514
6 2011 535 427 466 452 462 595 488 517
7 2012 536 428 465 452 461 595 489 516
8 2013 534 429 464 453 461 597 486 519
9 2014 534 429 461 450 459 598 484 520
10 2015 534 428 457 449 457 598 482 519

Exploratory Data Analysis¶

1. The Relationship Between Household Income and SAT Scores¶

One of our original hypotheses was that family income has a positive correlation with SAT scores, meaning the higher the family income, the higher the average SAT score. We decided to average the math and reading scores, as we want to look at overall scores for this analysis.

The first table we created includes all the data about how SAT scores relate to family income. We have also added new columns to show the percentage of test-takers in each family income bracket of the total number of test takers in each row. We have also found a total number of test takers.

In [ ]:
df_income.head()
Out[ ]:
Year State Family Income < 20k: Number of Test-takers Family Income < 20k: Math Average Score Family Income < 20k: Reading Average Score Family Income 20-40k: Number of Test-takers Family Income 20-40k: Math Average Score Family Income 20-40k: Reading Average Score Family Income 40-60k: Number of Test-takers Family Income 40-60k: Math Average Score ... Gender.Female.Verbal Gender.Male.Math Gender.Male.Verbal Total.Test-takers % < 20k % 20-40k % 40-60k % 60-80k % 80-100k % > 100k
0 2005 ALABAMA 175 462.0 474.0 324 513.0 527 442 539 ... 561 582 574 2869 0.060997 0.112931 0.154061 0.164866 0.165563 0.341582
1 2005 ALASKA 191 464.0 467.0 401 492.0 500 539 517 ... 521 535 526 2718 0.070272 0.147535 0.198308 0.221854 0.163355 0.198675
2 2005 ARIZONA 891 485.0 474.0 2121 498.0 495 2270 520 ... 522 549 531 12603 0.070697 0.168293 0.180116 0.188209 0.148060 0.244624
3 2005 ARKANSAS 107 489.0 486.0 180 513.0 526 245 543 ... 558 570 570 1220 0.087705 0.147541 0.200820 0.186066 0.120492 0.257377
4 2005 CALIFORNIA 19323 451.0 421.0 26161 477.0 458 18347 506 ... 499 543 510 123872 0.155992 0.211194 0.148113 0.144803 0.113989 0.225911

5 rows × 31 columns

This dataframe enables us to explore the correlation between income and SAT scores across different states. We can also analyze SAT score trends over the years to determine if there has been an increase in scores over time. Once we establish the correlation between SAT scores and income by state, we can further assess the percentage of people from various income brackets residing in each state to assess the potential impact of regional differences on educational outcomes.

Combined Correlation between Scores and Household Income We first want to analyze whether family income has an effect on SAT scores. To start off our analysis, we will create a line graph comparing family income to average SAT score. We decided to take the average of math average score and reading average score and plot it against family income. We predict that there will be a positive correlation between family income and SAT scores. We predict this because those who have families with a higher income can afford more resources, such as SAT prep books, which help their children get a higher score.

In [ ]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) #kept getting warnings using .corr() but we can ignore them for now because it doesn't affect current version of pandas

income_columns = ['< 20k',
       '20-40k',
       '40-60k',
       '60-80k',
       '80-100k',
       '> 100k']
df_avg_0k = (df_income['Family Income < 20k: Math Average Score'].mean() + df_income['Family Income < 20k: Reading Average Score'].mean()) / 2
df_avg_20k = (df_income['Family Income 20-40k: Math Average Score'].mean() + df_income['Family Income 20-40k: Reading Average Score'].mean()) / 2
df_avg_40k = (df_income['Family Income 40-60k: Math Average Score'].mean() + df_income['Family Income 40-60k: Reading Average Score'].mean()) / 2
df_avg_60k = (df_income['Family Income 60-80k: Math Average Score'].mean() + df_income['Family Income 60-80k: Reading Average Score'].mean()) / 2
df_avg_80k = (df_income['Family Income 80-100k: Math Average Score'].mean() + df_income['Family Income 80-100k: Reading Average Score'].mean()) / 2
df_avg_100k = (df_income['Family Income > 100k: Math Average Score'].mean() + df_income['Family Income > 100k: Reading Average Score'].mean()) / 2
df_avg_total = {'score': [df_avg_0k, df_avg_20k, df_avg_40k, df_avg_60k, df_avg_80k, df_avg_100k],
        'income': ['< 20k', '20-40k', '40-60k','60-80k', '80-100k','> 100k'],
          'index': [1, 2, 3, 4, 5, 6]}
df_avg_total = pd.DataFrame(df_avg_total)
df_avg_total.head()
plt.legend()
plt.xlabel('Income')
plt.ylabel('Average Score')
plt.title(f'Avg Score based on income')

plot_income_sat = plt.plot(df_avg_total['income'], df_avg_total['score'])

df_avg_total.corr().loc['score', 'index']


pd.set_option('display.max_colwidth', 800)
df_income.head()
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Out[ ]:
Year State Family Income < 20k: Number of Test-takers Family Income < 20k: Math Average Score Family Income < 20k: Reading Average Score Family Income 20-40k: Number of Test-takers Family Income 20-40k: Math Average Score Family Income 20-40k: Reading Average Score Family Income 40-60k: Number of Test-takers Family Income 40-60k: Math Average Score ... Gender.Female.Verbal Gender.Male.Math Gender.Male.Verbal Total.Test-takers % < 20k % 20-40k % 40-60k % 60-80k % 80-100k % > 100k
0 2005 ALABAMA 175 462.0 474.0 324 513.0 527 442 539 ... 561 582 574 2869 0.060997 0.112931 0.154061 0.164866 0.165563 0.341582
1 2005 ALASKA 191 464.0 467.0 401 492.0 500 539 517 ... 521 535 526 2718 0.070272 0.147535 0.198308 0.221854 0.163355 0.198675
2 2005 ARIZONA 891 485.0 474.0 2121 498.0 495 2270 520 ... 522 549 531 12603 0.070697 0.168293 0.180116 0.188209 0.148060 0.244624
3 2005 ARKANSAS 107 489.0 486.0 180 513.0 526 245 543 ... 558 570 570 1220 0.087705 0.147541 0.200820 0.186066 0.120492 0.257377
4 2005 CALIFORNIA 19323 451.0 421.0 26161 477.0 458 18347 506 ... 499 543 510 123872 0.155992 0.211194 0.148113 0.144803 0.113989 0.225911

5 rows × 31 columns

Based on the graph there is a positive correlation between income and test scores. We then found the correlation coefficient to see how strong this correlation is. The correlation coefficient is .976 which is really strong.

However, this is the correlation for all of the states combined. Later in the project, we want to use state and income as a variable for our predictive model. Therefore, we need to find the correlation coefficient for each state to make sure the model would be accurate for each state when predicting.

To do this, we created a function that could get the correlation coefficient for any state so we could easily iterate through all of the states seamlessly. Then we added it to a dictionary with state as the key and the correlation coefficient as the value. We then sorted the dictionary from smallest to largest correlation coefficient.

Correlation Between Scores and Household Income by State

In [ ]:
def get_correlation(df_income, statename):
  df_state = df_income[(df_income['State'] == statename)]
  income_columns = []

  df_avg_0k = (df_state['Family Income < 20k: Math Average Score'].mean() + df_state['Family Income < 20k: Reading Average Score'].mean()) / 2
  df_avg_20k = (df_state['Family Income 20-40k: Math Average Score'].mean() + df_state['Family Income 20-40k: Reading Average Score'].mean()) / 2
  df_avg_40k = (df_state['Family Income 40-60k: Math Average Score'].mean() + df_state['Family Income 40-60k: Reading Average Score'].mean()) / 2
  df_avg_60k = (df_state['Family Income 60-80k: Math Average Score'].mean() + df_state['Family Income 60-80k: Reading Average Score'].mean()) / 2
  df_avg_80k = (df_state['Family Income 80-100k: Math Average Score'].mean() + df_state['Family Income 80-100k: Reading Average Score'].mean()) / 2
  df_avg_100k = (df_state['Family Income > 100k: Math Average Score'].mean() + df_state['Family Income > 100k: Reading Average Score'].mean()) / 2
  df_avg = {'score': [df_avg_0k, df_avg_20k, df_avg_40k, df_avg_60k, df_avg_80k, df_avg_100k],
        'income': ['< 20k', '20-40k', '40-60k','60-80k', '80-100k','> 100k'],
            'index': [1, 2, 3, 4, 5, 6]}
  df_avg = pd.DataFrame(df_avg)
  correlation_matrix = df_avg.corr()
  r_value = correlation_matrix.loc['score', 'index']

# Extract the correlation coefficient between two specific columns (e.g., Column1 and Column2)
  return r_value

state_correlations = {}

state_names = df_income['State'].unique()
for i in state_names:
  state_correlations[i] = get_correlation(df_income, i)

sorted_correlations = dict(sorted(state_correlations.items(), key=lambda item: item[1]))
sorted_correlations
Out[ ]:
{'WYOMING': 0.6817823026104805,
 'NORTH DAKOTA': 0.7444367294875884,
 'SOUTH DAKOTA': 0.7511933183748691,
 'VIRGIN ISLANDS': 0.9312281762015785,
 'MONTANA': 0.9397737259414765,
 'NEBRASKA': 0.9447349398116189,
 'MISSISSIPPI': 0.9452556846643638,
 'UTAH': 0.9483990565719241,
 'IOWA': 0.9540450456746358,
 'KANSAS': 0.959994646351036,
 'ARKANSAS': 0.9627908180238146,
 'OKLAHOMA': 0.9629782756291895,
 'ALASKA': 0.9634885788666858,
 'MAINE': 0.9655202926950999,
 'WISCONSIN': 0.967455700113321,
 'FLORIDA': 0.9721455941178156,
 'OHIO': 0.9725710757715881,
 'ALABAMA': 0.9742000091476154,
 'MISSOURI': 0.9743170332983074,
 'RHODE ISLAND': 0.9752676496678933,
 'COLORADO': 0.976166583527223,
 'KENTUCKY': 0.9774019928780511,
 'SOUTH CAROLINA': 0.9778913114637013,
 'LOUISIANA': 0.9791280064127692,
 'TENNESSEE': 0.9791381979672444,
 'WASHINGTON': 0.97979829408224,
 'MINNESOTA': 0.980880125149849,
 'MICHIGAN': 0.981669010843995,
 'NEVADA': 0.9824884335167628,
 'NEW HAMPSHIRE': 0.9825814330307877,
 'PENNSYLVANIA': 0.9826012777025478,
 'NEW YORK': 0.9827062212711544,
 'ILLINOIS': 0.9827348970539682,
 'IDAHO': 0.9834113354285738,
 'MASSACHUSETTS': 0.9840192451238341,
 'CONNECTICUT': 0.9848453372345946,
 'HAWAII': 0.9854316568080004,
 'WEST VIRGINIA': 0.985512001183547,
 'NEW JERSEY': 0.9855406973376892,
 'OREGON': 0.9856439572812011,
 'INDIANA': 0.985864808592127,
 'DISTRICT OF COLUMBIA': 0.9861195539957384,
 'TEXAS': 0.9862451287121754,
 'NORTH CAROLINA': 0.9868290906988119,
 'ARIZONA': 0.9872445761934121,
 'VERMONT': 0.9884377674742938,
 'DELAWARE': 0.9884799176055574,
 'GEORGIA': 0.9889991974408391,
 'MARYLAND': 0.9911364667634063,
 'VIRGINIA': 0.9923846879775922,
 'PUERTO RICO': 0.9928976852939522,
 'CALIFORNIA': 0.9929236107247942,
 'NEW MEXICO': 0.9954872028202372}

As you can see by the dictionary, most of the states have a strong correlation except for Wyoming, North Dakota, and South Dakota which all have a correlation coefficient below .9.

Finding Outliers¶

To understand why the states have such a weak correlation we decided to plot sat scores by year for every family income category per state as seen below. We will create a graph for Wyoming which has the weakest correlation. We will also create a graph for New Mexico because it has the strongest correlation and that way we can see what the graph was suppose to look like.

In [ ]:
df_income
df_income_state = df_income[['Year', 'State', 'Family Income < 20k: Math Average Score','Family Income < 20k: Reading Average Score',	'Family Income 20-40k: Math Average Score',	'Family Income 20-40k: Reading Average Score',	'Family Income 40-60k: Math Average Score',	'Family Income 40-60k: Reading Average Score',	'Family Income 60-80k: Math Average Score',	'Family Income 60-80k: Reading Average Score', 'Family Income 80-100k: Math Average Score',	'Family Income 80-100k: Reading Average Score',	'Family Income > 100k: Math Average Score',	'Family Income > 100k: Reading Average Score' ]]
df_income_state
#df_alabama = [(df_income_state['State'] == 'ALABAMA')]

df_income_state_0k = df_income_state[['State','Year', 'Family Income < 20k: Math Average Score', 'Family Income < 20k: Reading Average Score']]
df_income_state_0k['Avg Score'] = (df_income_state_0k['Family Income < 20k: Math Average Score'] + df_income_state_0k['Family Income < 20k: Reading Average Score'])/2

df_income_state_20k = df_income_state[['State','Year', 'Family Income 20-40k: Math Average Score', 'Family Income 20-40k: Reading Average Score']]
df_income_state_20k['Avg Score'] = (df_income_state_20k['Family Income 20-40k: Math Average Score'] + df_income_state_20k['Family Income 20-40k: Reading Average Score'])/2

df_income_state_40k = df_income_state[['State','Year', 'Family Income 40-60k: Math Average Score', 'Family Income 40-60k: Reading Average Score']]
df_income_state_40k['Avg Score'] = (df_income_state_40k['Family Income 40-60k: Math Average Score'] + df_income_state_40k['Family Income 40-60k: Reading Average Score'])/2

df_income_state_60k = df_income_state[['State','Year', 'Family Income 60-80k: Math Average Score', 'Family Income 60-80k: Reading Average Score']]
df_income_state_60k['Avg Score'] = (df_income_state_60k['Family Income 60-80k: Math Average Score'] + df_income_state_60k['Family Income 60-80k: Reading Average Score'])/2

df_income_state_80k = df_income_state[['State','Year', 'Family Income 80-100k: Math Average Score', 'Family Income 80-100k: Reading Average Score']]
df_income_state_80k['Avg Score'] = (df_income_state_80k['Family Income 80-100k: Math Average Score'] + df_income_state_80k['Family Income 80-100k: Reading Average Score'])/2

df_income_state_100k = df_income_state[['State','Year', 'Family Income > 100k: Math Average Score', 'Family Income > 100k: Reading Average Score']]
df_income_state_100k['Avg Score'] = (df_income_state_100k['Family Income > 100k: Math Average Score'] + df_income_state_100k['Family Income > 100k: Reading Average Score'])/2


def get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, statename, subplot_config):
  df_alabama_0k = df_income_state_0k[(df_income_state_0k['State'] == statename)]
  df_alabama_20k = df_income_state_20k[df_income_state_20k['State'] == statename]
  df_alabama_40k = df_income_state_40k[df_income_state_40k['State'] == statename]
  df_alabama_60k = df_income_state_60k[df_income_state_60k['State'] == statename]
  df_alabama_80k = df_income_state_80k[df_income_state_80k['State'] == statename]
  df_alabama_100k = df_income_state_100k[df_income_state_100k['State'] == statename]


  rows, cols, index = subplot_config

  plt.subplot(rows, cols, index)


  plot_income_0k = plt.plot(df_alabama_0k['Year'], df_alabama_0k['Avg Score'], label = '<20k')
  plot_income_20k = plt.plot(df_alabama_20k['Year'], df_alabama_20k['Avg Score'], label = '20k-40k')
  plot_income_40k = plt.plot(df_alabama_40k['Year'], df_alabama_40k['Avg Score'], label = '40k-60k')
  plot_income_60k = plt.plot(df_alabama_60k['Year'], df_alabama_60k['Avg Score'], label = '60k-80k')
  plot_income_80k = plt.plot(df_alabama_80k['Year'], df_alabama_80k['Avg Score'], label = '80k-100k')
  plot_income_100k = plt.plot(df_alabama_100k['Year'], df_alabama_100k['Avg Score'], label = '>100k')


  plt.legend()
  plt.xlabel('Year')
  plt.ylabel('Average Score')
  plt.title(f'Avg. Score vs. year for Incomes in {statename}')

plt.figure(figsize=(8, 6))

get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, 'WYOMING', (2,2,1))
get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, 'NEW MEXICO', (2,2,2))

plt.tight_layout()
plt.show()

The lines in the Wyoming graph mix together and there is not a clear trend and most of the lines are intertwining with each other. Unlike the Wyoming graph, the New Mexico has the its lines clearly stacked on top of one another in order of family income.

2. The Effect of State Revenue and Spending on SAT Scores¶

We also wanted to see whether a state's overall wealth and state spending on education had an effect on student outcomes. Firstly, we are going to start with an analysis on state revenue and SAT scores. To do this we made a scatterplot with all of the observations with state revenue as the independent variable and total math score as the dependent variable.

In [ ]:
df2_revenue = df2[['State',	'Year',	'TOTAL_REVENUE',	'FEDERAL_REVENUE',	'STATE_REVENUE',	'LOCAL_REVENUE', 'INSTRUCTION_EXPENDITURE', 'ENROLL']]
df1_scores = df1[['Year', 'State', 'Total.Math', 'Total.Verbal']]
In [ ]:
df1_scores = df1_scores[(df1_scores['State'] != 'PUERTO RICO') & (df1_scores['State'] != 'VIRGIN ISLANDS')]
merged_df = pd.merge(df2_revenue, df1_scores, left_on=['State', 'Year'], right_on=['State', 'Year'])
math_rev = plt.scatter(merged_df['STATE_REVENUE'], merged_df['Total.Math'])
plt.title("State Revenue vs SAT scores")
plt.xlabel("Average State Revenue from 2005-2015")
plt.ylabel("Average Math SAT score by state from 2005-2015")

verbal_rev = plt.scatter(merged_df['STATE_REVENUE'], merged_df['Total.Verbal'])
plt.legend([math_rev, verbal_rev], ['math', 'verbal'])
Out[ ]:
<matplotlib.legend.Legend at 0x7c12fb0bfaf0>
In [ ]:
merged_df
Out[ ]:
State Year TOTAL_REVENUE FEDERAL_REVENUE STATE_REVENUE LOCAL_REVENUE INSTRUCTION_EXPENDITURE ENROLL Total.Math Total.Verbal
0 ALABAMA 2005 5889966.0 665924.0 3247562.0 1976480.0 3053380.0 729342.0 559 567
1 ALASKA 2005 1599739.0 302846.0 878731.0 418162.0 829666.0 132568.0 519 523
2 ARIZONA 2005 7550109.0 897553.0 3320196.0 3332360.0 3423716.0 919875.0 530 526
3 ARKANSAS 2005 3978240.0 447012.0 2995788.0 535440.0 2118580.0 461667.0 552 563
4 CALIFORNIA 2005 62171797.0 7148863.0 34912166.0 20110768.0 31070450.0 6307022.0 522 504
... ... ... ... ... ... ... ... ... ... ...
554 VIRGINIA 2015 15857524.0 1012205.0 6240349.0 8604970.0 8755896.0 1279867.0 517 518
555 WASHINGTON 2015 13709442.0 1036422.0 8293812.0 4379208.0 6508964.0 1072359.0 511 502
556 WEST VIRGINIA 2015 3478401.0 362959.0 1979466.0 1135976.0 1819903.0 279565.0 500 509
557 WISCONSIN 2015 11637376.0 814385.0 5869265.0 4953726.0 5723474.0 861813.0 606 591
558 WYOMING 2015 1962874.0 120290.0 1116917.0 725667.0 895910.0 93867.0 587 588

559 rows × 10 columns

The scatter plot indicates that there is no strong or visible correlation between revenue and test scores. We were curious about why this was and conducted further research. During our search, we found this article which explains that state revenue or state spending on education does not have as significant impact on student success. You can find it here.

The study indicates that while there may not be an obvious correlation when looking at the data as a whole, low income students are more sensitive to changes in education spending than higher income students. This is probably because lower income students are more reliant on the resources the state provides them. For a future investigation, it would be interesting to look at an additional data set with per pupil spending over the years for each state and compare it with the performance of students with different income brackets to verify this correlation.

However, for now we wanted to verify the information found on the previous article and create a scatterplot of education spending per student vs. sat scores. To do this we divided instruction expenditure by number of students enrolled to get the education spending and we used the average math score as our dependent variable. When plotting, we staggered the variables a few years apart since it takes a few years for the effects of education spending to show. Although we did not do this with the state revenue scatterplot, staggering the years has little effect since unstaggering the scatterplot below produces a very similar graph.

In [ ]:
merged_df.head()
merged_df['education_spending'] = merged_df['INSTRUCTION_EXPENDITURE']/merged_df['ENROLL']
plt.legend()
plt.xlabel('Education Spending (mil)')
plt.ylabel('Average Math Score')
plt.title(f'Education Spending per Student vs. Average Math Score')
rev_expend = plt.scatter(merged_df[merged_df['Year'] == 2007]['education_spending'], merged_df[merged_df['Year'] == 2011]['Total.Math'])
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

3. 8th grade NAEP scores and SAT scores¶

The NAEP or the National Assessment of Educational Progess provide national standardized testing for 4th, 8th, and 12th grade students.

The code below shows the process of creating a data frame of 8th grade NAEP exam scores and SAT scores, with a 3 year difference. We chose to look at this 3 year gap because most students take the SAT their junior year of high school, so it is the same group of students taking both exams.

We wanted to look at 8th grade NAEP scores to see if there is any way to identify early on if a child would be more successful based on previous examination results. If this is true then colleges would be able to use this data to advertise their college to students earlier than the student's sophomore year of highschool.

In [ ]:
df2_8G = df2[['Year', 'State', 'G08_A_M_READING', 'G08_A_M_MATHEMATICS', 'G08_A_F_READING', 'G08_A_F_MATHEMATICS']]
df2_8G['G08 Average Math'] =(df2_8G['G08_A_M_MATHEMATICS'] + df2_8G['G08_A_F_MATHEMATICS']) / 2
df2_8G['G08 Average Reading'] =(df2_8G['G08_A_M_READING'] + df2_8G['G08_A_F_READING']) / 2
df2_8G = df2_8G.rename(columns = { 'G08_A_M_READING': 'G08 Average Reading Male',
                                  'G08_A_M_MATHEMATICS': 'G08 Average Math Male',
                                   'G08_A_F_MATHEMATICS': 'G08 Average Math Female',
                                   'G08_A_F_READING': 'G08 Average Reading Female'})

df1_scores = df1_scores.rename(columns = {'Total.Math': 'Average Math Score',
                                          'Total.Verbal': 'Average Verbal Score'})


# For Math Scores

df2_8G_2007 = df2_8G[(df2_8G['Year'] == 2007)]
df1_2010 = df1_scores[(df1_scores['Year']==2010)]
merged_2007_2010 = pd.merge(df2_8G_2007, df1_2010, left_on=['State'], right_on=['State'])
plot2007_2010 = plt.scatter(merged_2007_2010['G08 Average Math'], merged_2007_2010['Average Math Score'])


df2_8G_2009 = df2_8G[(df2_8G['Year'] == 2009)]
df1_2012 = df1_scores[(df1_scores['Year']==2012)]
merged_2009_2012 = pd.merge(df2_8G_2009, df1_2012, left_on=['State'], right_on=['State'])
merged_2009_2012.describe()
plot2009_2012 = plt.scatter(merged_2009_2012['G08 Average Math'], merged_2009_2012['Average Math Score'])


df2_8G_2011 = df2_8G[(df2_8G['Year'] == 2011)]
df1_2014 = df1_scores[(df1_scores['Year']==2014)]
merged_2011_2014 = pd.merge(df2_8G_2011, df1_2014, left_on=['State'], right_on=['State'])
merged_2011_2014
plot2011_2014 = plt.scatter(merged_2011_2014['G08 Average Math'], merged_2011_2014['Average Math Score'])

plt.legend([plot2007_2010, plot2009_2012, plot2011_2014], ['2007-2010', '2009-2011', '2011-2014'])


plt.title("8th Grade NAEP Math Scores vs SAT Math scores")
plt.xlabel("Average State NAEP Score")
plt.ylabel("Average State SAT Score")
Out[ ]:
Text(0, 0.5, 'Average State SAT Score')

As you can see by the graph, there is a pretty loose correlation between average state NAEP score and SAT math score. Since we do not have data of individual people and their scores, we have to make the assumption that the people taking the 8th grade NAEP is the same group of people who are taking the SAT three years later.

In [ ]:
# For reading/ verbal scores

df2_8G_2007 = df2_8G[(df2_8G['Year'] == 2007)]
df1_2010 = df1_scores[(df1_scores['Year']==2010)]
merged_2007_2010 = pd.merge(df2_8G_2007, df1_2010, left_on=['State'], right_on=['State'])
plot2007_2010r = plt.scatter(merged_2007_2010['G08 Average Reading'], merged_2007_2010['Average Verbal Score'], c='#17becf')


df2_8G_2009 = df2_8G[(df2_8G['Year'] == 2009)]
df1_2012 = df1_scores[(df1_scores['Year']==2012)]
merged_2009_2012 = pd.merge(df2_8G_2009, df1_2012, left_on=['State'], right_on=['State'])
merged_2009_2012.describe()
plot2009_2012r = plt.scatter(merged_2009_2012['G08 Average Reading'], merged_2009_2012['Average Verbal Score'], c='#d62728')


df2_8G_2011 = df2_8G[(df2_8G['Year'] == 2011)]
df1_2014 = df1_scores[(df1_scores['Year']==2014)]
merged_2011_2014 = pd.merge(df2_8G_2011, df1_2014, left_on=['State'], right_on=['State'])
merged_2011_2014
plot2011_2014r = plt.scatter(merged_2011_2014['G08 Average Reading'], merged_2011_2014['Average Verbal Score'], c='#e377c2')


plt.legend([plot2007_2010r, plot2009_2012r, plot2011_2014r], ['2007-2010', '2009-2011', '2011-2014'])
plt.title("8th Grade NAEP Reading Scores vs SAT Verbal scores")
plt.xlabel("Average State NAEP Score")
plt.ylabel("Average State SAT Score")
Out[ ]:
Text(0, 0.5, 'Average State SAT Score')

We wanted to see if reading scores were any different and the graphs look pretty similar, however the correlation seems a little looser.

4. Gender Correlation¶

One feature that we wanted to add to the predictive model is gender. To do this we will see if there is any difference between gender and scores. We did an intitial analysis by comparing the average difference in scores between women and men for math and reading and then both of them together. We foudn that there is on average a larger disparity in test scores between the two genders when it comes to math rather than reading. In total the average difference in the total score is around 42. It is difficult to determine how significant the disparity is because SAT scores are calculated on a scale rather than each question being worth an equal amount of points.

In [ ]:
print('avg difference in scores between women and men in math: ', abs((df_income['Gender.Female.Math']- df_income['Gender.Male.Math']).mean()))
print('avg difference in scores between women and men in verbal: ', abs((df_income['Gender.Female.Verbal']- df_income['Gender.Male.Verbal']).mean()))
print('avg difference in total scores: ', abs(((df_income['Gender.Female.Verbal']+df_income['Gender.Female.Math'])- (df_income['Gender.Male.Verbal']+df_income['Gender.Male.Math'])).mean()))
avg difference in scores between women and men in math:  35.49566724436742
avg difference in scores between women and men in verbal:  6.535528596187175
avg difference in total scores:  42.031195840554595

5. Race Correlation¶

Another feature we want to look at for our predictive model is race and the impact it has on SAT scores. We thought this would be an important feature to look at since historically, people of color have a disadvantage due to systematic issues. To do this, we found the average scores of all the years for each race, and also averaged reading and math.

In [ ]:
df_R_copy = df_R
df_M_copy = df_M

white = df_M_copy['White'].mean()
black = df_M_copy['Black'].mean()
mexican = df_M_copy['Mexican American'].mean()
puerto = df_M_copy['Puerto Rican'].mean()
hispanic = df_M_copy['Other Hispanic'].mean()
asian = df_M_copy['Asian/Pacific Islander'].mean()
AI_AN = df_M_copy['American Indian/ Alaska Native'].mean()
other = df_M_copy['Other'].mean()

data = {
    'race/ethnicity': ['White', 'Black', 'Mexican American',	'Puerto Rican',	'Other Hispanic',	'Asian/Pacific Islander',	'American Indian/ Alaska Native',	'Other'],
    'data': [white, black, mexican, puerto, hispanic, asian, AI_AN, other]
}
df_bar = pd.DataFrame(data)

df_bar.plot(kind='bar', x='race/ethnicity', y='data', legend=False)

# Add labels and title
plt.xlabel('Race/Ethnicity')
plt.ylabel('Avg. Math SAT score')
plt.title('Avg. Math SAT score for each race')

# Show the plot
plt.show()

white = df_R_copy['White'].mean()
black = df_R_copy['Black'].mean()
mexican = df_R_copy['Mexican American'].mean()
puerto = df_R_copy['Puerto Rican'].mean()
hispanic = df_R_copy['Other Hispanic'].mean()
asian = df_R_copy['Asian/Pacific Islander'].mean()
AI_AN = df_R_copy['American Indian/ Alaska Native'].mean()
other = df_R_copy['Other'].mean()

data = {
    'race/ethnicity': ['White', 'Black', 'Mexican American',	'Puerto Rican',	'Other Hispanic',	'Asian/Pacific Islander',	'American Indian/ Alaska Native',	'Other'],
    'data': [white, black, mexican, puerto, hispanic, asian, AI_AN, other]
}
df_bar = pd.DataFrame(data)

df_bar.plot(kind='bar', x='race/ethnicity', y='data', legend=False)

# Add labels and title
plt.xlabel('Race/Ethnicity')
plt.ylabel('Avg. Reading SAT score')
plt.title('Avg. Reading SAT score for each race')

# Show the plot
plt.show()

For math, asian/pacific islanders have the highest average score while black people have the lowest average score. For reading, white people have the highest average score while black people have the lowest average score. The disparities between scores are more apparent in math scores than reading scores which makes sense beceause math is a more technical skill and less intuitive.

Our Model¶

We want to focus on the socioeconomic factors that go into an SAT score. The goal of our tutorial is to make a product that college admission offices can use to compare students based on more qualities than just their test scores. There are many socioeconomic factors that play into a students success in school, so comparing two students' SAT scores who have very different backgrounds may not be equitable to do one to one. With our model, however, colleges can input these different factors for each student they are assessing, and figure out how they compare to students who do have a similar background to them.

Our model will predict and output the average score of someone with the same state/US terriroty, gender, race, and family income bracket for the student they are viewing. The admissions officer can then compare their student to that average, instead of a general/ national one.

For our model we decided that we needed it to be nonlinear, and chose a Random Forest Regressor Model.

Creating our table

To create the table to feed to our model, we had to make 192 separate dataframes, each that contain the average score for someone with each specific combination of socioeconomic factors. With this, we decided to average each of the scores for each of the factors, as they all had a correlation with SAT scores. In these dataframes, we also added columns that contained what category they are representing. For example, Gender=M, Income=40k, etc.

To make this process easier, we created a main function that would create each data frame based on given inputs. This would make it easier to copy and paste a section of dataframes and then manually change some of the inputs.

In [ ]:
def add_race(race, df_income, race_data, gender, income, subject):
  #to code: gender, income subject
  if gender == 'M': #male
    if income == 0: #male, 0
      if subject == 'M': #male, 0, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income < 20k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '<20k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income < 20k: Math Average Score'
      else: #male, 0, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income < 20k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '<20k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income < 20k: Reading Average Score'
    if income == 20:
      if subject == 'M': #male, 20, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income 20-40k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '20-40k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income 20-40k: Math Average Score'
      else: #male, 20, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income 20-40k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '20-40k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income 20-40k: Reading Average Score'
    if income == 40:
      if subject == 'M': #male, 40, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income 40-60k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '40-60k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income 40-60k: Math Average Score'
      else: #male, 40, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income 40-60k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '40-60k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income 40-60k: Reading Average Score'
    if income == 60:
      if subject == 'M': #male, 60, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income 60-80k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '60-80k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income 60-80k: Math Average Score'
      else: #male, 60, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income 60-80k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '60-80k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income 60-80k: Reading Average Score'
    if income == 80:
      if subject == 'M': #male, 80, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income 80-100k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '80-100k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income 80-100k: Math Average Score'
      else: #male, 80, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income 80-100k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '80-100k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income 80-100k: Reading Average Score'
    if income == 100:
      if subject == 'M': #male, 100, math
        main_data = df_income[['Year', 'State', 'Gender.Male.Math', 'Family Income > 100k: Math Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '>100k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Math'
        income_label = 'Family Income > 100k: Math Average Score'
      else: #male, 100, reading
        main_data = df_income[['Year', 'State', 'Gender.Male.Verbal', 'Family Income > 100k: Reading Average Score']]
        main_data['Gender'] = 'M'
        main_data['Income'] = '>100k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Male.Verbal'
        income_label = 'Family Income > 100k: Reading Average Score'
  else:
    if income == 0: #female, 0
      if subject == 'M': #male, 0, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income < 20k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '<20k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income < 20k: Math Average Score'
      else: #female, 0, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income < 20k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '<20k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income < 20k: Reading Average Score'
    if income == 20:
      if subject == 'M': #female, 20, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income 20-40k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '20-40k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income 20-40k: Math Average Score'
      else: #female, 20, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income 20-40k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '20-40k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income 20-40k: Reading Average Score'
    if income == 40:
      if subject == 'M': #female, 40, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income 40-60k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '40-60k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income 40-60k: Math Average Score'
      else: #female, 40, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income 40-60k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '40-60k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income 40-60k: Reading Average Score'
    if income == 60:
      if subject == 'M': #female, 60, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income 60-80k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '60-80k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income 60-80k: Math Average Score'
      else: #female, 60, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income 60-80k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '60-80k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income 60-80k: Reading Average Score'
    if income == 80:
      if subject == 'M': #female, 80, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income 80-100k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '80-100k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income 80-100k: Math Average Score'
      else: #female, 80, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income 80-100k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '80-100k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income 80-100k: Reading Average Score'
    if income == 100:
      if subject == 'M': #female, 100, math
        main_data = df_income[['Year', 'State', 'Gender.Female.Math', 'Family Income > 100k: Math Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '>100k'
        main_data['Subject'] = 'Math'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Math'
        income_label = 'Family Income > 100k: Math Average Score'
      else: #female, 100, reading
        main_data = df_income[['Year', 'State', 'Gender.Female.Verbal', 'Family Income > 100k: Reading Average Score']]
        main_data['Gender'] = 'F'
        main_data['Income'] = '>100k'
        main_data['Subject'] = 'Reading'
        main_data['Race'] = race
        gender_label = 'Gender.Female.Verbal'
        income_label = 'Family Income > 100k: Reading Average Score'


  x = 0
  for i in range(2005, 2016):
    main_data.loc[main_data['Year'] == i, race] = race_data.at[x, race]
    x+=1
  main_data['combined score'] = (main_data[gender_label] + main_data[income_label] +main_data[race])/3
  return main_data

#('race', df_income, df_M/df_R, F/M, income, M/R)

Here is an example of the combination of a white male's math score whos family makes less than $20,000 a year. The combined score represents the average of the male math average score, the less than 20k income bracket, and being white.

In [ ]:
m_math_0k_white = add_race('White', df_income, df_M, 'M', 0, 'M')
m_math_0k_white
Out[ ]:
Year State Gender.Male.Math Family Income < 20k: Math Average Score Gender Income Subject Race White combined score
0 2005 ALABAMA 582 462.0 M <20k Math White 536.0 526.666667
1 2005 ALASKA 535 464.0 M <20k Math White 536.0 511.666667
2 2005 ARIZONA 549 485.0 M <20k Math White 536.0 523.333333
3 2005 ARKANSAS 570 489.0 M <20k Math White 536.0 531.666667
4 2005 CALIFORNIA 543 451.0 M <20k Math White 536.0 510.000000
... ... ... ... ... ... ... ... ... ... ...
572 2015 VIRGINIA 532 441.0 M <20k Math White 534.0 502.333333
573 2015 WASHINGTON 525 449.0 M <20k Math White 534.0 502.666667
574 2015 WEST VIRGINIA 513 448.0 M <20k Math White 534.0 498.333333
575 2015 WISCONSIN 627 573.0 M <20k Math White 534.0 578.000000
576 2015 WYOMING 604 NaN M <20k Math White 534.0 NaN

577 rows × 10 columns

Here is the creation of the rest of the dataframes, calling the above function.

In [ ]:
#male math
m_math_0k_white = add_race('White', df_income, df_M, 'M', 0, 'M')
m_math_0k_black = add_race('Black', df_income, df_M, 'M', 0, 'M')
m_math_0k_mexican = add_race('Mexican American', df_income, df_M, 'M', 0, 'M')
m_math_0k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 0, 'M')
m_math_0k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 0, 'M')
m_math_0k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 0, 'M')
m_math_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 0, 'M')
m_math_0k_other = add_race('Other', df_income, df_M, 'M', 0, 'M')

m_math_20k_white = add_race('White', df_income, df_M, 'M', 20, 'M')
m_math_20k_black = add_race('Black', df_income, df_M, 'M', 20, 'M')
m_math_20k_mexican = add_race('Mexican American', df_income, df_M, 'M', 20, 'M')
m_math_20k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 20, 'M')
m_math_20k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 20, 'M')
m_math_20k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 20, 'M')
m_math_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 20, 'M')
m_math_20k_other = add_race('Other', df_income, df_M, 'M', 20, 'M')

m_math_40k_white = add_race('White', df_income, df_M, 'M', 40, 'M')
m_math_40k_black = add_race('Black', df_income, df_M, 'M', 40, 'M')
m_math_40k_mexican = add_race('Mexican American', df_income, df_M, 'M', 40, 'M')
m_math_40k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 40, 'M')
m_math_40k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 40, 'M')
m_math_40k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 40, 'M')
m_math_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 40, 'M')
m_math_40k_other = add_race('Other', df_income, df_M, 'M', 40, 'M')

m_math_60k_white = add_race('White', df_income, df_M, 'M', 60, 'M')
m_math_60k_black = add_race('Black', df_income, df_M, 'M', 60, 'M')
m_math_60k_mexican = add_race('Mexican American', df_income, df_M, 'M', 60, 'M')
m_math_60k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 60, 'M')
m_math_60k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 60, 'M')
m_math_60k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 60, 'M')
m_math_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 60, 'M')
m_math_60k_other = add_race('Other', df_income, df_M, 'M', 60, 'M')

m_math_80k_white = add_race('White', df_income, df_M, 'M', 80, 'M')
m_math_80k_black = add_race('Black', df_income, df_M, 'M', 80, 'M')
m_math_80k_mexican = add_race('Mexican American', df_income, df_M, 'M', 80, 'M')
m_math_80k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 80, 'M')
m_math_80k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 80, 'M')
m_math_80k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 80, 'M')
m_math_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 80, 'M')
m_math_80k_other = add_race('Other', df_income, df_M, 'M', 80, 'M')

m_math_100k_white = add_race('White', df_income, df_M, 'M', 100, 'M')
m_math_100k_black = add_race('Black', df_income, df_M, 'M', 100, 'M')
m_math_100k_mexican = add_race('Mexican American', df_income, df_M, 'M', 100, 'M')
m_math_100k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 100, 'M')
m_math_100k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 100, 'M')
m_math_100k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 100, 'M')
m_math_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 100, 'M')
m_math_100k_other = add_race('Other', df_income, df_M, 'M', 100, 'M')
In [ ]:
#female, math
f_math_0k_white = add_race('White', df_income, df_M, 'F', 0, 'M')
f_math_0k_black = add_race('Black', df_income, df_M, 'F', 0, 'M')
f_math_0k_mexican = add_race('Mexican American', df_income, df_M, 'F', 0, 'M')
f_math_0k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 0, 'M')
f_math_0k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 0, 'M')
f_math_0k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 0, 'M')
f_math_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 0, 'M')
f_math_0k_other = add_race('Other', df_income, df_M, 'M', 0, 'M')

f_math_20k_white = add_race('White', df_income, df_M, 'F', 20, 'M')
f_math_20k_black = add_race('Black', df_income, df_M, 'F', 20, 'M')
f_math_20k_mexican = add_race('Mexican American', df_income, df_M, 'F', 20, 'M')
f_math_20k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 20, 'M')
f_math_20k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 20, 'M')
f_math_20k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 20, 'M')
f_math_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 20, 'M')
f_math_20k_other = add_race('Other', df_income, df_M, 'F', 20, 'M')

f_math_40k_white = add_race('White', df_income, df_M, 'F', 40, 'M')
f_math_40k_black = add_race('Black', df_income, df_M, 'F', 40, 'M')
f_math_40k_mexican = add_race('Mexican American', df_income, df_M, 'F', 40, 'M')
f_math_40k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 40, 'M')
f_math_40k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 40, 'M')
f_math_40k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 40, 'M')
f_math_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 40, 'M')
f_math_40k_other = add_race('Other', df_income, df_M, 'F', 40, 'M')

f_math_60k_white = add_race('White', df_income, df_M, 'F', 60, 'M')
f_math_60k_black = add_race('Black', df_income, df_M, 'F', 60, 'M')
f_math_60k_mexican = add_race('Mexican American', df_income, df_M, 'F', 60, 'M')
f_math_60k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 60, 'M')
f_math_60k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 60, 'M')
f_math_60k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 60, 'M')
f_math_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 60, 'M')
f_math_60k_other = add_race('Other', df_income, df_M, 'F', 60, 'M')

f_math_80k_white = add_race('White', df_income, df_M, 'F', 80, 'M')
f_math_80k_black = add_race('Black', df_income, df_M, 'F', 80, 'M')
f_math_80k_mexican = add_race('Mexican American', df_income, df_M, 'F', 80, 'M')
f_math_80k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 80, 'M')
f_math_80k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 80, 'M')
f_math_80k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 80, 'M')
f_math_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 80, 'M')
f_math_80k_other = add_race('Other', df_income, df_M, 'F', 80, 'M')

f_math_100k_white = add_race('White', df_income, df_M, 'F', 100, 'M')
f_math_100k_black = add_race('Black', df_income, df_M, 'F', 100, 'M')
f_math_100k_mexican = add_race('Mexican American', df_income, df_M, 'F', 100, 'M')
f_math_100k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 100, 'M')
f_math_100k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 100, 'M')
f_math_100k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 100, 'M')
f_math_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 100, 'M')
f_math_100k_other = add_race('Other', df_income, df_M, 'F', 100, 'M')
In [ ]:
#male reading
m_reading_0k_white = add_race('White', df_income, df_R, 'M', 0, 'R')
m_reading_0k_black = add_race('Black', df_income, df_R, 'M', 0, 'R')
m_reading_0k_mexican = add_race('Mexican American', df_income, df_R, 'M', 0, 'R')
m_reading_0k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 0, 'R')
m_reading_0k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 0, 'R')
m_reading_0k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 0, 'R')
m_reading_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 0, 'R')
m_reading_0k_other = add_race('Other', df_income, df_R, 'M', 0, 'R')

m_reading_20k_white = add_race('White', df_income, df_R, 'M', 20, 'R')
m_reading_20k_black = add_race('Black', df_income, df_R, 'M', 20, 'R')
m_reading_20k_mexican = add_race('Mexican American', df_income, df_R, 'M', 20, 'R')
m_reading_20k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 20, 'R')
m_reading_20k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 20, 'R')
m_reading_20k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 20, 'R')
m_reading_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 20, 'R')
m_reading_20k_other = add_race('Other', df_income, df_R, 'M', 20, 'R')

m_reading_40k_white = add_race('White', df_income, df_R, 'M', 40, 'R')
m_reading_40k_black = add_race('Black', df_income, df_R, 'M', 40, 'R')
m_reading_40k_mexican = add_race('Mexican American', df_income, df_R, 'M', 40, 'R')
m_reading_40k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 40, 'R')
m_reading_40k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 40, 'R')
m_reading_40k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 40, 'R')
m_reading_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 40, 'R')
m_reading_40k_other = add_race('Other', df_income, df_R, 'M', 40, 'R')

m_reading_60k_white = add_race('White', df_income, df_R, 'M', 60, 'R')
m_reading_60k_black = add_race('Black', df_income, df_R, 'M', 60, 'R')
m_reading_60k_mexican = add_race('Mexican American', df_income, df_R, 'M', 60, 'R')
m_reading_60k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 60, 'R')
m_reading_60k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 60, 'R')
m_reading_60k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 60, 'R')
m_reading_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 60, 'R')
m_reading_60k_other = add_race('Other', df_income, df_R, 'M', 60, 'R')

m_reading_80k_white = add_race('White', df_income, df_R, 'M', 80, 'R')
m_reading_80k_black = add_race('Black', df_income, df_R, 'M', 80, 'R')
m_reading_80k_mexican = add_race('Mexican American', df_income, df_R, 'M', 80, 'R')
m_reading_80k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 80, 'R')
m_reading_80k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 80, 'R')
m_reading_80k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 80, 'R')
m_reading_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 80, 'R')
m_reading_80k_other = add_race('Other', df_income, df_R, 'M', 80, 'R')

m_reading_100k_white = add_race('White', df_income, df_R, 'M', 100, 'R')
m_reading_100k_black = add_race('Black', df_income, df_R, 'M', 100, 'R')
m_reading_100k_mexican = add_race('Mexican American', df_income, df_R, 'M', 100, 'R')
m_reading_100k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 100, 'R')
m_reading_100k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 100, 'R')
m_reading_100k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 100, 'R')
m_reading_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 100, 'R')
m_reading_100k_other = add_race('Other', df_income, df_R, 'M', 100, 'R')
In [ ]:
#female reading
f_reading_0k_white = add_race('White', df_income, df_R, 'F', 0, 'R')
f_reading_0k_black = add_race('Black', df_income, df_R, 'F', 0, 'R')
f_reading_0k_mexican = add_race('Mexican American', df_income, df_R, 'F', 0, 'R')
f_reading_0k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 0, 'R')
f_reading_0k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 0, 'R')
f_reading_0k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 0, 'R')
f_reading_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 0, 'R')
f_reading_0k_other = add_race('Other', df_income, df_R, 'F', 0, 'R')

f_reading_20k_white = add_race('White', df_income, df_R, 'F', 20, 'R')
f_reading_20k_black = add_race('Black', df_income, df_R, 'F', 20, 'R')
f_reading_20k_mexican = add_race('Mexican American', df_income, df_R, 'F', 20, 'R')
f_reading_20k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 20, 'R')
f_reading_20k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 20, 'R')
f_reading_20k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 20, 'R')
f_reading_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 20, 'R')
f_reading_20k_other = add_race('Other', df_income, df_R, 'F', 20, 'R')

f_reading_40k_white = add_race('White', df_income, df_R, 'F', 40, 'R')
f_reading_40k_black = add_race('Black', df_income, df_R, 'F', 40, 'R')
f_reading_40k_mexican = add_race('Mexican American', df_income, df_R, 'F', 40, 'R')
f_reading_40k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 40, 'R')
f_reading_40k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 40, 'R')
f_reading_40k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 40, 'R')
f_reading_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 40, 'R')
f_reading_40k_other = add_race('Other', df_income, df_R, 'F', 40, 'R')

f_reading_60k_white = add_race('White', df_income, df_R, 'F', 60, 'R')
f_reading_60k_black = add_race('Black', df_income, df_R, 'F', 60, 'R')
f_reading_60k_mexican = add_race('Mexican American', df_income, df_R, 'F', 60, 'R')
f_reading_60k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 60, 'R')
f_reading_60k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 60, 'R')
f_reading_60k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 60, 'R')
f_reading_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 60, 'R')
f_reading_60k_other = add_race('Other', df_income, df_R, 'F', 60, 'R')

f_reading_80k_white = add_race('White', df_income, df_R, 'F', 80, 'R')
f_reading_80k_black = add_race('Black', df_income, df_R, 'F', 80, 'R')
f_reading_80k_mexican = add_race('Mexican American', df_income, df_R, 'F', 80, 'R')
f_reading_80k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 80, 'R')
f_reading_80k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 80, 'R')
f_reading_80k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 80, 'R')
f_reading_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 80, 'R')
f_reading_80k_other = add_race('Other', df_income, df_R, 'F', 80, 'R')

f_reading_100k_white = add_race('White', df_income, df_R, 'F', 100, 'R')
f_reading_100k_black = add_race('Black', df_income, df_R, 'F', 100, 'R')
f_reading_100k_mexican = add_race('Mexican American', df_income, df_R, 'F', 100, 'R')
f_reading_100k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 100, 'R')
f_reading_100k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 100, 'R')
f_reading_100k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 100, 'R')
f_reading_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 100, 'R')
f_reading_100k_other = add_race('Other', df_income, df_R, 'F', 100, 'R')

Now that we have every data frame that we need, we have to prepare the data for a model. To do this we need to first concatenate every dataframe together so we have one cohesive dataframe we can put into the model.

In [ ]:
df_model_prep = pd.concat([m_math_0k_white, m_math_0k_black, m_math_0k_mexican, m_math_0k_puerto, m_math_0k_hispanic, m_math_0k_asian, m_math_0k_AI_AN, m_math_0k_other, m_math_20k_white, m_math_20k_black, m_math_20k_mexican, m_math_20k_puerto, m_math_20k_hispanic, m_math_20k_asian, m_math_20k_AI_AN, m_math_20k_other, m_math_40k_white, m_math_40k_black, m_math_40k_mexican, m_math_40k_puerto, m_math_40k_hispanic, m_math_40k_asian, m_math_40k_AI_AN, m_math_40k_other, m_math_60k_white, m_math_60k_black, m_math_60k_mexican, m_math_60k_puerto, m_math_60k_hispanic, m_math_60k_asian, m_math_60k_AI_AN, m_math_60k_other, m_math_80k_white, m_math_80k_black, m_math_80k_mexican, m_math_80k_puerto, m_math_80k_hispanic, m_math_80k_asian, m_math_80k_AI_AN, m_math_80k_other, m_math_100k_white, m_math_100k_black, m_math_100k_mexican, m_math_100k_puerto, m_math_100k_hispanic, m_math_100k_asian, m_math_100k_AI_AN, m_math_100k_other,
                     f_math_0k_white, f_math_0k_black, f_math_0k_mexican, f_math_0k_puerto, f_math_0k_hispanic, f_math_0k_asian, f_math_0k_AI_AN, f_math_0k_other, f_math_20k_white, f_math_20k_black, f_math_20k_mexican, f_math_20k_puerto, f_math_20k_hispanic, f_math_20k_asian, f_math_20k_AI_AN, f_math_20k_other, f_math_40k_white, f_math_40k_black, f_math_40k_mexican, f_math_40k_puerto, f_math_40k_hispanic, f_math_40k_asian, f_math_40k_AI_AN, f_math_40k_other, f_math_60k_white, f_math_60k_black, f_math_60k_mexican, f_math_60k_puerto, f_math_60k_hispanic, f_math_60k_asian, f_math_60k_AI_AN, f_math_60k_other, f_math_80k_white, f_math_80k_black, f_math_80k_mexican, f_math_80k_puerto, f_math_80k_hispanic, f_math_80k_asian, f_math_80k_AI_AN, f_math_80k_other, f_math_100k_white, f_math_100k_black, f_math_100k_mexican, f_math_100k_puerto, f_math_100k_hispanic, f_math_100k_asian, f_math_100k_AI_AN, f_math_100k_other,
                      m_reading_0k_white, m_reading_0k_black, m_reading_0k_mexican, m_reading_0k_puerto, m_reading_0k_hispanic, m_reading_0k_asian, m_reading_0k_AI_AN, m_reading_0k_other, m_reading_20k_white, m_reading_20k_black, m_reading_20k_mexican, m_reading_20k_puerto, m_reading_20k_hispanic, m_reading_20k_asian, m_reading_20k_AI_AN, m_reading_20k_other, m_reading_40k_white, m_reading_40k_black, m_reading_40k_mexican, m_reading_40k_puerto, m_reading_40k_hispanic, m_reading_40k_asian, m_reading_40k_AI_AN, m_reading_40k_other, m_reading_60k_white, m_reading_60k_black, m_reading_60k_mexican, m_reading_60k_puerto, m_reading_60k_hispanic, m_reading_60k_asian, m_reading_60k_AI_AN, m_reading_60k_other, m_reading_80k_white, m_reading_80k_black, m_reading_80k_mexican, m_reading_80k_puerto, m_reading_80k_hispanic, m_reading_80k_asian, m_reading_80k_AI_AN, m_reading_80k_other, m_reading_100k_white, m_reading_100k_black, m_reading_100k_mexican, m_reading_100k_puerto, m_reading_100k_hispanic, m_reading_100k_asian, m_reading_100k_AI_AN, m_reading_100k_other,
                     f_reading_0k_white, f_reading_0k_black, f_reading_0k_mexican, f_reading_0k_puerto, f_reading_0k_hispanic, f_reading_0k_asian, f_reading_0k_AI_AN, f_reading_0k_other, f_reading_20k_white, f_reading_20k_black, f_reading_20k_mexican, f_reading_20k_puerto, f_reading_20k_hispanic, f_reading_20k_asian, f_reading_20k_AI_AN, f_reading_20k_other, f_reading_40k_white, f_reading_40k_black, f_reading_40k_mexican, f_reading_40k_puerto, f_reading_40k_hispanic, f_reading_40k_asian, f_reading_40k_AI_AN, f_reading_40k_other, f_reading_60k_white, f_reading_60k_black, f_reading_60k_mexican, f_reading_60k_puerto, f_reading_60k_hispanic, f_reading_60k_asian, f_reading_60k_AI_AN, f_reading_60k_other, f_reading_80k_white, f_reading_80k_black, f_reading_80k_mexican, f_reading_80k_puerto, f_reading_80k_hispanic, f_reading_80k_asian, f_reading_80k_AI_AN, f_reading_80k_other, f_reading_100k_white, f_reading_100k_black, f_reading_100k_mexican, f_reading_100k_puerto, f_reading_100k_hispanic, f_reading_100k_asian, f_reading_100k_AI_AN, f_reading_100k_other],ignore_index=True, axis=0)
df_model_prep
#df_model_math = df_model_math.drop(columns = ['Family Income < 20k: Math Average Score','Family Income 20-40k: Math Average Score', 'Family Income 40-60k: Math Average Score', 'Family Income 60-80k: Math Average Score', 'Family Income 80-100k: Math Average Score', 'Family Income > 100k: Math Average Score', 'Gender.Male.Math', 'Gender.Female.Math' ])
Out[ ]:
Year State Gender.Male.Math Family Income < 20k: Math Average Score Gender Income Subject Race White combined score ... Family Income > 100k: Math Average Score Gender.Female.Math Gender.Male.Verbal Family Income < 20k: Reading Average Score Family Income 20-40k: Reading Average Score Family Income 40-60k: Reading Average Score Family Income 60-80k: Reading Average Score Family Income 80-100k: Reading Average Score Family Income > 100k: Reading Average Score Gender.Female.Verbal
0 2005 ALABAMA 582.0 462.0 M <20k Math White 536.0 526.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2005 ALASKA 535.0 464.0 M <20k Math White 536.0 511.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2005 ARIZONA 549.0 485.0 M <20k Math White 536.0 523.333333 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2005 ARKANSAS 570.0 489.0 M <20k Math White 536.0 531.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2005 CALIFORNIA 543.0 451.0 M <20k Math White 536.0 510.000000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
110779 2015 VIRGINIA NaN NaN F >100k Reading Other NaN 518.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN 550.0 516.0
110780 2015 WASHINGTON NaN NaN F >100k Reading Other NaN 514.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN 552.0 502.0
110781 2015 WEST VIRGINIA NaN NaN F >100k Reading Other NaN 509.333333 ... NaN NaN NaN NaN NaN NaN NaN NaN 532.0 506.0
110782 2015 WISCONSIN NaN NaN F >100k Reading Other NaN 559.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN 603.0 586.0
110783 2015 WYOMING NaN NaN F >100k Reading Other NaN 558.666667 ... NaN NaN NaN NaN NaN NaN NaN NaN 591.0 595.0

110784 rows × 31 columns

Below is the table for our model. It contains a row for each combination of year, state, gender, income, subject, and race that can come from our original datasets. There were originally 288 null values, but because that is only 0.26% of rows, it is okay to drop.

In [ ]:
df_model = df_model_prep[['Year', 'State', 'Gender', 'Income', 'Subject', 'Race', 'combined score']]
df_model
Out[ ]:
Year State Gender Income Subject Race combined score
0 2005 ALABAMA M <20k Math White 526.666667
1 2005 ALASKA M <20k Math White 511.666667
2 2005 ARIZONA M <20k Math White 523.333333
3 2005 ARKANSAS M <20k Math White 531.666667
4 2005 CALIFORNIA M <20k Math White 510.000000
... ... ... ... ... ... ... ...
110779 2015 VIRGINIA F >100k Reading Other 518.666667
110780 2015 WASHINGTON F >100k Reading Other 514.666667
110781 2015 WEST VIRGINIA F >100k Reading Other 509.333333
110782 2015 WISCONSIN F >100k Reading Other 559.666667
110783 2015 WYOMING F >100k Reading Other 558.666667

110784 rows × 7 columns

In [ ]:
df_model.isnull().sum().sum() #288 null values
df_model = df_model.dropna()

Here is the spread of scores. The lowest possible score on a single section of the SAT is 200 and the highest is 800. Here, we can see that most averages of scores lie withing the 480 to 550 range.

In [ ]:
df_model.hist(['combined score'])
plt.xlabel('Score')
plt.ylabel('Count')
Out[ ]:
Text(0, 0.5, 'Count')
The Model¶

The goal of our model is to make a product that college admission offices can use to compare students stest scores based on their socioeconomic qualities. There are many socioeconomic factors in our nation that give people a disadvantage/advantage depending on their background, which is something a person cannot control. Therefore, by being able to get an estimate of what someone in a student's socioeconomic background would get, it increases equity since they are being compared on a smaller, more specific scale.

Below, we are defining our features for the model and then getting the dummy variables from our model to use in our dataframe. With the way our data is organized, since we do not have observations of individual test takers, we have to have each feature be a column in our dataframe.

In [ ]:
features = ['State', 'Gender', 'Income', 'Subject', "Race"]
df_dummies = pd.get_dummies(df_model[features])
df_dummies.head()
Out[ ]:
State_ALABAMA State_ALASKA State_ARIZONA State_ARKANSAS State_CALIFORNIA State_COLORADO State_CONNECTICUT State_DELAWARE State_DISTRICT OF COLUMBIA State_FLORIDA ... Subject_Math Subject_Reading Race_American Indian/ Alaska Native Race_Asian/Pacific Islander Race_Black Race_Mexican American Race_Other Race_Other Hispanic Race_Puerto Rican Race_White
0 1 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
1 0 1 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
2 0 0 1 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
3 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
4 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1

5 rows × 71 columns

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import math


x = df_dummies
y = df_model['combined score']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

regressor = RandomForestRegressor(n_estimators=100, max_features = 61, criterion = 'squared_error',
                                 random_state = 0)
regressor.fit(x,y)

y_pred = regressor.predict(X_test)


mse = mean_squared_error(y_test, y_pred)
print('RMSE:', math.sqrt(mse))
RMSE: 8.141284713443188

Our RSME error is around 8 which is relatively small considering SAT scores range from 200-800. Now we are going to use this model to predict the average math score of two different people. The first student is a white female from Alabama whos household family income is 70k. The second is a black male from California whos family income in 40k.

Unfortunately, there wasn't a more convenient way to input an instance for the model to check so we have to put in a value for every column instead of just the ones that apply to that person.

In [ ]:
# Math, white, female 80-100k, Alabama

new_instance = pd.DataFrame({'State_ALABAMA': [1], 'State_ALASKA' : [0], 'State_ARIZONA' : [0],'State_ARKANSAS': [0], 'State_CALIFORNIA' :[0], 'State_COLORADO':[0],	'State_CONNECTICUT':[0],
          'State_DELAWARE':[0], 'State_DISTRICT OF COLUMBIA': [0],	'State_FLORIDA' :[0], 'State_GEORGIA': [0],	'State_HAWAII' : [0],	'State_IDAHO':[0],'State_ILLINOIS':[0],
          'State_INDIANA':[0],	'State_IOWA':[0],	'State_KANSAS':[0],	'State_KENTUCKY':[0],	'State_KENTUCKY':[0],	'State_LOUISIANA':[0],	'State_MAINE':[0],
          'State_MARYLAND':[0],	'State_MASSACHUSETTS':[0],	'State_MICHIGAN':[0],	'State_MINNESOTA':[0],'State_MISSISSIPPI':[0], 'State_MISSOURI':[0], 'State_MONTANA':[0], 'State_NEBRASKA':[0],
          'State_NEVADA':[0], 'State_NEW HAMPSHIRE':[0], 'State_NEW JERSEY':[0], 'State_NEW MEXICO':[0], 'State_NEW YORK':[0], 'State_NORTH CAROLINA':[0], 'State_NORTH DAKOTA':[0],
          'State_OHIO':[0], 'State_OKLAHOMA':[0], 'State_OREGON':[0], 'State_PENNSYLVANIA':[0], 'State_PUERTO RICO':[0], 'State_RHODE ISLAND':[0], 'State_SOUTH CAROLINA':[0], 'State_SOUTH DAKOTA':[0], 'State_TENNESSEE':[0],
          'State_TEXAS':[0], 'State_UTAH':[0], 'State_VERMONT':[0], 'State_VIRGIN ISLANDS':[0],'State_VIRGINIA':[0], 'State_WASHINGTON':[0], 'State_WEST VIRGINIA':[0], 'State_WISCONSIN':[0], 'State_WYOMING':[0],
          'Gender_F': [1], 'Gender_M': [0], 'Income_20-40k': [0], 'Income_40-60k':[0],'Income_60-80k': [0],'Income_80-100k': [1],'Income_<20k':[0],'Income_>100k': [0], 'Subject_Math': [1], 'Subject_Reading': [0], 'Race_American Indian/ Alaska Native': [0],
          'Race_Asian/Pacific Islander' : [0], 'Race_Black' : [0], 'Race_Mexican American' : [0], 'Race_Other' : [0],'Race_Other Hispanic' : [0], 'Race_Puerto Rican' : [0], 'Race_White' : [1]  })


predicted_score = regressor.predict(new_instance)
print(f'Predicted Average Score: {predicted_score[0]}')
Predicted Average Score: 539.6554671418911

The Predicted Average Score is 539.6554671418911

In [ ]:
# Math, black, male, 20-40k, California

new_instance = pd.DataFrame({'State_ALABAMA': [0], 'State_ALASKA' : [0], 'State_ARIZONA' : [0],'State_ARKANSAS': [0], 'State_CALIFORNIA' :[1], 'State_COLORADO':[0],	'State_CONNECTICUT':[0],
          'State_DELAWARE':[0], 'State_DISTRICT OF COLUMBIA': [0],	'State_FLORIDA' :[0], 'State_GEORGIA': [0],	'State_HAWAII' : [0],	'State_IDAHO':[0],'State_ILLINOIS':[0],
          'State_INDIANA':[0],	'State_IOWA':[0],	'State_KANSAS':[0],	'State_KENTUCKY':[0],	'State_KENTUCKY':[0],	'State_LOUISIANA':[0],	'State_MAINE':[0],
          'State_MARYLAND':[0],	'State_MASSACHUSETTS':[0],	'State_MICHIGAN':[0],	'State_MINNESOTA':[0],'State_MISSISSIPPI':[0], 'State_MISSOURI':[0], 'State_MONTANA':[0], 'State_NEBRASKA':[0],
          'State_NEVADA':[0], 'State_NEW HAMPSHIRE':[0], 'State_NEW JERSEY':[0], 'State_NEW MEXICO':[0], 'State_NEW YORK':[0], 'State_NORTH CAROLINA':[0], 'State_NORTH DAKOTA':[0],
          'State_OHIO':[0], 'State_OKLAHOMA':[0], 'State_OREGON':[0], 'State_PENNSYLVANIA':[0], 'State_PUERTO RICO':[0], 'State_RHODE ISLAND':[0], 'State_SOUTH CAROLINA':[0], 'State_SOUTH DAKOTA':[0], 'State_TENNESSEE':[0],
          'State_TEXAS':[0], 'State_UTAH':[0], 'State_VERMONT':[0], 'State_VIRGIN ISLANDS':[0],'State_VIRGINIA':[0], 'State_WASHINGTON':[0], 'State_WEST VIRGINIA':[0], 'State_WISCONSIN':[0], 'State_WYOMING':[0],
          'Gender_F': [0], 'Gender_M': [1], 'Income_20-40k': [1], 'Income_40-60k':[0],'Income_60-80k': [0],'Income_80-100k': [0],'Income_<20k':[0],'Income_>100k': [0], 'Subject_Math': [1], 'Subject_Reading': [0], 'Race_American Indian/ Alaska Native': [0],
          'Race_Asian/Pacific Islander' : [0], 'Race_Black' : [1], 'Race_Mexican American' : [0], 'Race_Other' : [0],'Race_Other Hispanic' : [0], 'Race_Puerto Rican' : [0], 'Race_White' : [0]  })


predicted_score = regressor.predict(new_instance)
print(f'Predicted Average Score: {predicted_score[0]}')
Predicted Average Score: 481.79334407141005

The Predicted Average Score is 481.79334407141005

Lets say both student 1 and student 2 scored 520 points on their SAT math section. Although these are the same score, for someone with the same background as the second student, this 39 points above the average, but 19 points below the average for someone like the first.

Decision Tree

From our model, we believed that income has the largest impact on the prediced SAT score due to their very high correlation shown earlier in the tutorial.

To look at this hypothesis and see what other features were most important in the predicted average score, we have created a decision tree using Random Forest Regressor. The max depth has been set to 4 and the number of estimators to 2, so that the trees are readable. By changing the depth and number of estimators, more trees with more levels can be printed.

In [ ]:
from sklearn import tree

regressor_tree = RandomForestRegressor(n_estimators=2, max_features = 61, criterion = 'squared_error',
                                 random_state = 0, max_depth=4)
regressor_tree.fit(x,y)
features = x.columns.values
classes = ['0', '1', '2']


for estimator in regressor_tree.estimators_:
    print(estimator)
    plt.figure(figsize=(20,8))
    tree.plot_tree(estimator,
                   feature_names=features,
                   class_names=classes,
                   fontsize=8,
                   filled=True,
                   rounded=True)
    plt.show()
DecisionTreeRegressor(max_depth=4, max_features=61, random_state=209652396)
DecisionTreeRegressor(max_depth=4, max_features=61, random_state=398764591)

The trees above do not support our hypothesis exactly. As shown it seems that being from the Virgin Islands, a US territory, has the most weight when the model is predicting the average score. This makes sense however, as being for the Virgin Islands is a more unique instance.

Conclusion¶

With our data, we were able to successfully create a predictive model that predicts the average SAT score based on socioeconomic factors such as race, income, and state. However, based on the results from our exploratory analysis and lack of individual test scores, we decided not to include 8th grade scores as a feature in our predictive model. In the future however, we would be interesting in making a similar model that uses these past test scores in addition to other academic markers like GPA to predict SAT scores.

A trend we noticed with our data when doing exploratory analysis is that when we would compare math and reading separately, there was always a higher discreptancy with math. This makes sense because math is a more technical subject that is not as intuitive as reading may be.

We also saw when looking at the decision tree regressor that being from the virgin islands was on top, with being asian, being white, and having an income less than 20k following. We reasoned that being from the Virgin Islands was significant because it is a U.S. territory and not a state so the data may be drastically different. From our exploratory analysis, we saw that on average asian/pacific islanders has the highest average SAT scores, followed by white people which may explain why they were put so high up on the tree. Additionally, having an income of less than 20k puts that student below the poverty line, which has a significant impact on a student's performance because their basic needs are not met.

Overall, the model does a fairly accurate prediction of scores based on the data it's being fed. The model is structured in a way where there is a lot of potential for expansion to test other features or to incorporate a more updated dataset to improve accuracy and have scores better match up with today's trends. This model has the potential to be very beneficial to universities who wish to give students an equitable evaluation when looking at their SAT scores and may help balance out some of the disparities in scores due to socioeconomic factors that are outside of a student's control.

In [ ]:
! pwd
/content/drive/MyDrive
In [ ]:
warnings.filterwarnings("ignore", category=FutureWarning)

!jupyter nbconvert --to html /content/drive/MyDrive/Colab_Notebooks/Data_Science_Final.ipynb
[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab_Notebooks/Data_Science_Final.ipynb to html
[NbConvertApp] Writing 2549487 bytes to /content/drive/MyDrive/Colab_Notebooks/Data_Science_Final.html