Recently, the Supreme Court ruled on affirmative action, taking away universities' ability to use race as a factor in admissions decisions. While we hope most universities do not use this as an excuse to limit diversity on campus, they will have to think more creatively about how they decide whom to admit.
Standardized tests have been a long-standing factor in the college admissions process, and they do not seem to be going away. Students receive scores on these exams, the SAT and ACT, that contribute to their admissions decisions. When a student submits a score, all the school receives is a number, not the factors that contributed to it.
This project is a deep dive into exactly that, looking for the different factors that contribute to an SAT score. What contributes the most? Does previous academic performance play a role?
Our focus is building a tool that colleges and universities can use to inform admissions decisions involving SAT scores. The first goal is to figure out which factors correlate with SAT scores, and the second is to build a system colleges can use to keep the admissions process equitable.
Data Sources
We used three different data sources in this project. The first two were found on Kaggle. The first is called US School Scores. This dataset has 99 columns of information on SAT scores by state and US territory from 2005-2015. Although the data come from years when the SAT was scored out of 2400 points instead of the current 1600, they are still relevant to our goals: the only difference between the two versions is that the writing portion became optional, and since our dataset does not include the writing section, its scores are still out of 1600. The columns cover a range of topics, including scores from different sections of the SAT, subject-specific GPA, average scores by family income, and many more data points. You can find the dataset used here.
The second dataset is also from Kaggle and, like the first, is organized by state and year. It contains information about state revenue and spending, the number of students in each grade, and 4th and 8th grade standardized test scores, with breakdowns by race and gender. The dataset is linked here.
Our last dataset is from the National Center for Education Statistics. It contains average SAT scores by race and year. The dataset is linked here.
Extraction and Load
We downloaded the datasets from Kaggle as CSV files, added them to our Google Drive, and mounted the drive. We then imported the Python libraries we will be using in our first set of analyses and read in the files using pd.read_csv (and pd.read_excel for the NCES table).
# Connecting Notebook to Google Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True). /content/drive/MyDrive
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
# Reading in the datasets
df1 = pd.read_csv('/content/drive/MyDrive/Data_science_project/school_scores.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Data_science_project/states_all_extended.csv')
race_dataset = '/content/drive/MyDrive/Data_science_project/tabn226.10.xls'
df3 = pd.read_excel(race_dataset)
df1.head()
Year | State.Code | State.Name | Total.Math | Total.Test-takers | Total.Verbal | Academic Subjects.Arts/Music.Average GPA | Academic Subjects.Arts/Music.Average Years | Academic Subjects.English.Average GPA | Academic Subjects.English.Average Years | ... | Score Ranges.Between 600 to 700.Math.Total | Score Ranges.Between 600 to 700.Verbal.Females | Score Ranges.Between 600 to 700.Verbal.Males | Score Ranges.Between 600 to 700.Verbal.Total | Score Ranges.Between 700 to 800.Math.Females | Score Ranges.Between 700 to 800.Math.Males | Score Ranges.Between 700 to 800.Math.Total | Score Ranges.Between 700 to 800.Verbal.Females | Score Ranges.Between 700 to 800.Verbal.Males | Score Ranges.Between 700 to 800.Verbal.Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | AL | Alabama | 559 | 3985 | 567 | 3.92 | 2.2 | 3.53 | 3.9 | ... | 1096 | 596 | 613 | 1209 | 156 | 292 | 448 | 219 | 250 | 469 |
1 | 2005 | AK | Alaska | 519 | 3996 | 523 | 3.76 | 1.9 | 3.35 | 3.9 | ... | 787 | 423 | 375 | 798 | 47 | 116 | 163 | 109 | 115 | 224 |
2 | 2005 | AZ | Arizona | 530 | 18184 | 526 | 3.85 | 2.1 | 3.45 | 3.9 | ... | 3814 | 1831 | 1679 | 3510 | 327 | 630 | 957 | 412 | 501 | 913 |
3 | 2005 | AR | Arkansas | 552 | 1600 | 563 | 3.90 | 2.2 | 3.61 | 4.0 | ... | 443 | 242 | 226 | 468 | 49 | 83 | 132 | 77 | 93 | 170 |
4 | 2005 | CA | California | 522 | 186552 | 504 | 3.76 | 1.8 | 3.32 | 3.8 | ... | 37605 | 16078 | 14966 | 31044 | 5451 | 8792 | 14243 | 5011 | 4766 | 9777 |
5 rows × 99 columns
df2.head()
PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1992_ALABAMA | ALABAMA | 1992 | NaN | 2678885.0 | 304177.0 | 1659028.0 | 715680.0 | 2653798.0 | 1481703.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1992_ALASKA | ALASKA | 1992 | NaN | 1049591.0 | 106780.0 | 720711.0 | 222100.0 | 972488.0 | 498362.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1992_ARIZONA | ARIZONA | 1992 | NaN | 3258079.0 | 297888.0 | 1369815.0 | 1590376.0 | 3401580.0 | 1435908.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1992_ARKANSAS | ARKANSAS | 1992 | NaN | 1711959.0 | 178571.0 | 958785.0 | 574603.0 | 1743022.0 | 964323.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1992_CALIFORNIA | CALIFORNIA | 1992 | NaN | 26260025.0 | 2072470.0 | 16546514.0 | 7641041.0 | 27138832.0 | 14358922.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 266 columns
df3.head()
Table 226.10. SAT mean scores of college-bound seniors, by race/ethnicity: Selected years, 1986-87 through 2014-15 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Race/ethnicity | 1986- 87 | 1990- 91 | 1996- 97 | 2000- 01 | 2003- 04 | 2004- 05 | 2005-06 | 2006-07 | 2007-08 | ... | 2009-10 | 2010-11 | 2011-12 | 2012-13 | NaN | NaN | Score change | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 2010-11 | 2011-12 | NaN | 2013-14 | 2014-15 | 1990-91 to \n2004-05 | 2004-05\nto\n2014-15 | 2009-10\n to\n 2014-15 | 2013-14 to \n2014-15 |
2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
3 | SAT—Critical reading | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | All students ............... | 507 | 499 | 505 | 506 | 508 | 508 | 503 | 502 | 502 | ... | 501 | 497 | 496 | 496 | 497 | 495 | 9 | -13 | -6 | -2 |
5 rows × 21 columns
Tidy and Transform
There are no null cells in dataset 1; however, there are 1,037 cells with a value of 0. We set these cells to NaN so that calculations are not impacted.
df1.isnull().sum().sum() #0
(df1==0).sum().sum()
df1.replace(0, np.nan, inplace=True)
Our datatypes are correct.
df1.dtypes
Year int64 State.Code object State.Name object Total.Math int64 Total.Test-takers int64 ... Score Ranges.Between 700 to 800.Math.Males int64 Score Ranges.Between 700 to 800.Math.Total int64 Score Ranges.Between 700 to 800.Verbal.Females int64 Score Ranges.Between 700 to 800.Verbal.Males int64 Score Ranges.Between 700 to 800.Verbal.Total int64 Length: 99, dtype: object
Now we will create and tidy the new dataframes that will be used in our analysis.
We focused on renaming columns so that we can merge the different dataframes and on creating our own columns to use in our analysis. In addition, we kept only the columns that we will be using and dropped what we didn't need.
df1['State.Name'] = df1['State.Name'].str.upper()
df_income = df1[['Year', 'State.Name', 'Family Income.Less than 20k.Test-takers', 'Family Income.Less than 20k.Math','Family Income.Less than 20k.Verbal','Family Income.Between 20-40k.Test-takers', 'Family Income.Between 20-40k.Math', 'Family Income.Between 20-40k.Verbal', 'Family Income.Between 40-60k.Test-takers','Family Income.Between 40-60k.Math', 'Family Income.Between 40-60k.Verbal','Family Income.Between 60-80k.Test-takers', 'Family Income.Between 60-80k.Math','Family Income.Between 60-80k.Verbal', 'Family Income.Between 80-100k.Test-takers', 'Family Income.Between 80-100k.Math', 'Family Income.Between 80-100k.Verbal', 'Family Income.More than 100k.Test-takers', 'Family Income.More than 100k.Math', 'Family Income.More than 100k.Verbal', 'Gender.Female.Math', 'Gender.Female.Verbal', 'Gender.Male.Math', 'Gender.Male.Verbal' ]]
df_income["Total.Test-takers"] = df_income["Family Income.Less than 20k.Test-takers"] + df_income["Family Income.Between 20-40k.Test-takers"] +df_income["Family Income.Between 40-60k.Test-takers"] + df_income["Family Income.Between 60-80k.Test-takers"] + df_income["Family Income.Between 80-100k.Test-takers"] + df_income["Family Income.More than 100k.Test-takers"]
df_income["% < 20k"] = df_income["Family Income.Less than 20k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 20-40k"] = df_income["Family Income.Between 20-40k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 40-60k"] = df_income["Family Income.Between 40-60k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 60-80k"] =df_income["Family Income.Between 60-80k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% 80-100k"] =df_income["Family Income.Between 80-100k.Test-takers"]/df_income["Total.Test-takers"]
df_income["% > 100k"] =df_income["Family Income.More than 100k.Test-takers"]/df_income["Total.Test-takers"]
df_income = df_income.rename(columns = { 'State.Name': 'State',
'Family Income.Less than 20k.Test-takers' : 'Family Income < 20k: Number of Test-takers',
'Family Income.Less than 20k.Math' : 'Family Income < 20k: Math Average Score',
'Family Income.Less than 20k.Verbal' : 'Family Income < 20k: Reading Average Score',
'Family Income.Between 20-40k.Test-takers' : 'Family Income 20-40k: Number of Test-takers',
'Family Income.Between 20-40k.Math' : 'Family Income 20-40k: Math Average Score',
'Family Income.Between 20-40k.Verbal' : 'Family Income 20-40k: Reading Average Score',
'Family Income.Between 40-60k.Test-takers' : 'Family Income 40-60k: Number of Test-takers',
'Family Income.Between 40-60k.Math' : 'Family Income 40-60k: Math Average Score',
'Family Income.Between 40-60k.Verbal' : 'Family Income 40-60k: Reading Average Score',
'Family Income.Between 60-80k.Test-takers' : 'Family Income 60-80k: Number of Test-takers',
'Family Income.Between 60-80k.Math' : 'Family Income 60-80k: Math Average Score',
'Family Income.Between 60-80k.Verbal' : 'Family Income 60-80k: Reading Average Score',
'Family Income.Between 80-100k.Test-takers' : 'Family Income 80-100k: Number of Test-takers',
'Family Income.Between 80-100k.Math' : 'Family Income 80-100k: Math Average Score',
'Family Income.Between 80-100k.Verbal' : 'Family Income 80-100k: Reading Average Score',
'Family Income.More than 100k.Test-takers' : 'Family Income > 100k: Number of Test-takers',
'Family Income.More than 100k.Math' : 'Family Income > 100k: Math Average Score',
'Family Income.More than 100k.Verbal' : 'Family Income > 100k: Reading Average Score'
})
df_income.head()
Year | State | Family Income < 20k: Number of Test-takers | Family Income < 20k: Math Average Score | Family Income < 20k: Reading Average Score | Family Income 20-40k: Number of Test-takers | Family Income 20-40k: Math Average Score | Family Income 20-40k: Reading Average Score | Family Income 40-60k: Number of Test-takers | Family Income 40-60k: Math Average Score | ... | Gender.Female.Verbal | Gender.Male.Math | Gender.Male.Verbal | Total.Test-takers | % < 20k | % 20-40k | % 40-60k | % 60-80k | % 80-100k | % > 100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | 175 | 462.0 | 474.0 | 324 | 513.0 | 527 | 442 | 539 | ... | 561 | 582 | 574 | 2869 | 0.060997 | 0.112931 | 0.154061 | 0.164866 | 0.165563 | 0.341582 |
1 | 2005 | ALASKA | 191 | 464.0 | 467.0 | 401 | 492.0 | 500 | 539 | 517 | ... | 521 | 535 | 526 | 2718 | 0.070272 | 0.147535 | 0.198308 | 0.221854 | 0.163355 | 0.198675 |
2 | 2005 | ARIZONA | 891 | 485.0 | 474.0 | 2121 | 498.0 | 495 | 2270 | 520 | ... | 522 | 549 | 531 | 12603 | 0.070697 | 0.168293 | 0.180116 | 0.188209 | 0.148060 | 0.244624 |
3 | 2005 | ARKANSAS | 107 | 489.0 | 486.0 | 180 | 513.0 | 526 | 245 | 543 | ... | 558 | 570 | 570 | 1220 | 0.087705 | 0.147541 | 0.200820 | 0.186066 | 0.120492 | 0.257377 |
4 | 2005 | CALIFORNIA | 19323 | 451.0 | 421.0 | 26161 | 477.0 | 458 | 18347 | 506 | ... | 499 | 543 | 510 | 123872 | 0.155992 | 0.211194 | 0.148113 | 0.144803 | 0.113989 | 0.225911 |
5 rows × 31 columns
df_gpa = df1[['Academic Subjects.Arts/Music.Average GPA', 'Academic Subjects.Arts/Music.Average Years', 'Academic Subjects.English.Average GPA',
'Academic Subjects.English.Average Years', 'Academic Subjects.Foreign Languages.Average GPA', 'Academic Subjects.Foreign Languages.Average Years',
'Academic Subjects.Mathematics.Average GPA', 'Academic Subjects.Mathematics.Average Years', 'Academic Subjects.Natural Sciences.Average GPA',
'Academic Subjects.Natural Sciences.Average Years', 'Academic Subjects.Social Sciences/History.Average GPA', 'Academic Subjects.Social Sciences/History.Average Years',
'GPA.A minus.Math', 'GPA.A minus.Test-takers', 'GPA.A minus.Verbal', 'GPA.A plus.Math', 'GPA.A plus.Test-takers', 'GPA.A plus.Verbal', 'GPA.A.Math', 'GPA.A.Test-takers',
'GPA.A.Verbal', 'GPA.B.Math', 'GPA.B.Test-takers', 'GPA.B.Verbal', 'GPA.C.Math', 'GPA.C.Test-takers', 'GPA.C.Verbal', 'GPA.D or lower.Math', 'GPA.D or lower.Test-takers',
'GPA.D or lower.Verbal', 'GPA.No response.Math', 'GPA.No response.Test-takers', 'GPA.No response.Verbal', 'Score Ranges.Between 200 to 300.Math.Total', 'Score Ranges.Between 200 to 300.Verbal.Total',
'Score Ranges.Between 300 to 400.Math.Total', 'Score Ranges.Between 300 to 400.Verbal.Total', 'Score Ranges.Between 400 to 500.Math.Total', 'Score Ranges.Between 400 to 500.Verbal.Total',
'Score Ranges.Between 500 to 600.Math.Total', 'Score Ranges.Between 500 to 600.Verbal.Total', 'Score Ranges.Between 600 to 700.Math.Total',
'Score Ranges.Between 600 to 700.Verbal.Total', 'Score Ranges.Between 700 to 800.Math.Total', 'Score Ranges.Between 700 to 800.Verbal.Total']]
df_gpa = df_gpa.rename(columns = { 'Academic Subjects.Arts/Music.Average GPA': 'Arts/Music GPA',
'Academic Subjects.Arts/Music.Average Years': 'Arts/Music Avg Yrs',
'Academic Subjects.English.Average GPA' : 'English GPA',
'Academic Subjects.English.Average Years' : 'English Avg Yrs',
'Academic Subjects.Foreign Languages.Average GPA' : 'Foreign Language GPA',
'Academic Subjects.Foreign Languages.Average Years' : 'Foreign Language Avg Yrs',
'Academic Subjects.Mathematics.Average GPA' : 'Math GPA',
'Academic Subjects.Mathematics.Average Years' : 'Math Avg Yrs',
'Academic Subjects.Natural Sciences.Average GPA' : 'Science GPA',
'Academic Subjects.Natural Sciences.Average Years' : 'Science Avg Yrs',
'Academic Subjects.Social Sciences/History.Average GPA' : 'Social Science GPA',
'Academic Subjects.Social Sciences/History.Average Years' : 'Social Science Avg Yrs'
})
df_gpa.head()
Arts/Music GPA | Arts/Music Avg Yrs | English GPA | English Avg Yrs | Foreign Language GPA | Foreign Language Avg Yrs | Math GPA | Math Avg Yrs | Science GPA | Science Avg Yrs | ... | Score Ranges.Between 300 to 400.Math.Total | Score Ranges.Between 300 to 400.Verbal.Total | Score Ranges.Between 400 to 500.Math.Total | Score Ranges.Between 400 to 500.Verbal.Total | Score Ranges.Between 500 to 600.Math.Total | Score Ranges.Between 500 to 600.Verbal.Total | Score Ranges.Between 600 to 700.Math.Total | Score Ranges.Between 600 to 700.Verbal.Total | Score Ranges.Between 700 to 800.Math.Total | Score Ranges.Between 700 to 800.Verbal.Total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.92 | 2.2 | 3.53 | 3.9 | 3.54 | 2.6 | 3.41 | 4.0 | 3.52 | 3.9 | ... | 266.0 | 207 | 807.0 | 762.0 | 1336 | 1307 | 1096 | 1209 | 448 | 469 |
1 | 3.76 | 1.9 | 3.35 | 3.9 | 3.34 | 2.1 | 3.06 | 3.5 | 3.25 | 3.2 | ... | 386.0 | 389 | 1181.0 | 1208.0 | 1429 | 1325 | 787 | 798 | 163 | 224 |
2 | 3.85 | 2.1 | 3.45 | 3.9 | 3.41 | 2.6 | 3.25 | 3.9 | 3.43 | 3.4 | ... | 1331.0 | 1352 | 5163.0 | 5446.0 | 6728 | 6762 | 3814 | 3510 | 957 | 913 |
3 | 3.90 | 2.2 | 3.61 | 4.0 | 3.64 | 2.6 | 3.46 | 4.1 | 3.55 | 3.7 | ... | 99.0 | 88 | 347.0 | 324.0 | 560 | 538 | 443 | 468 | 132 | 170 |
4 | 3.76 | 1.8 | 3.32 | 3.8 | 3.29 | 2.8 | 3.05 | 3.7 | 3.20 | 3.2 | ... | 21754.0 | 26170 | 51702.0 | 58157.0 | 56817 | 55589 | 37605 | 31044 | 14243 | 9777 |
5 rows × 45 columns
Now we can tidy our second dataset.
Here, we check for null values. As you can see, there are many in this dataset, but they do not affect our analysis. One reason for these empty values is that certain characteristics, like race, were not collected during testing in every year.
df2.isna().sum().sort_values()
PRIMARY_KEY 0 STATE 0 YEAR 0 KG_A_A 83 G12_A_A 83 ... G08_AM_A_MATHEMATICS 1655 G04_HP_A_READING 1699 G04_HP_A_MATHEMATICS 1700 G08_HP_A_READING 1701 G08_HP_A_MATHEMATICS 1702 Length: 266, dtype: int64
The datatypes are correct.
df2.dtypes
PRIMARY_KEY object STATE object YEAR int64 ENROLL float64 TOTAL_REVENUE float64 ... G08_AM_A_MATHEMATICS float64 G08_HP_A_READING float64 G08_HP_A_MATHEMATICS float64 G08_TR_A_READING float64 G08_TR_A_MATHEMATICS float64 Length: 266, dtype: object
We filtered this dataset to contain only data from 2005-2015, like the first dataset. We also renamed some columns and removed rows that would not merge with the other dataframes.
#pd.set_option('display.max_rows', None)
df2 = df2[(df2['YEAR'] >= 2005) & (df2['YEAR'] <= 2015)]
df2 = df2[(df2['STATE'] != 'NATIONAL') & (df2['STATE'] != 'DODEA')]
df2.index = range(len(df2.index))
df2 = df2.rename(columns = { 'STATE': 'State', #renaming columns to match previous dataset for easier future use
'YEAR': 'Year'})
df1 = df1.rename(columns = {'State.Name' : 'State'})
df2['State'] = df2['State'].replace({'DISTRICT_OF_COLUMBIA': 'DISTRICT OF COLUMBIA',
'NEW_HAMPSHIRE': 'NEW HAMPSHIRE',
'NEW_JERSEY': 'NEW JERSEY',
'NEW_YORK': 'NEW YORK',
'NEW_MEXICO': 'NEW MEXICO',
'NORTH_DAKOTA': 'NORTH DAKOTA',
'NORTH_CAROLINA': 'NORTH CAROLINA',
'RHODE_ISLAND': 'RHODE ISLAND',
'SOUTH_CAROLINA': 'SOUTH CAROLINA',
'SOUTH_DAKOTA': 'SOUTH DAKOTA',
'WEST_VIRGINIA': 'WEST VIRGINIA'})
df2.head(5)
PRIMARY_KEY | State | Year | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | G08_HI_A_READING | G08_HI_A_MATHEMATICS | G08_AS_A_READING | G08_AS_A_MATHEMATICS | G08_AM_A_READING | G08_AM_A_MATHEMATICS | G08_HP_A_READING | G08_HP_A_MATHEMATICS | G08_TR_A_READING | G08_TR_A_MATHEMATICS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005_ALABAMA | ALABAMA | 2005 | 729342.0 | 5889966.0 | 665924.0 | 3247562.0 | 1976480.0 | 5904664.0 | 3053380.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2005_ALASKA | ALASKA | 2005 | 132568.0 | 1599739.0 | 302846.0 | 878731.0 | 418162.0 | 1751718.0 | 829666.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2005_ARIZONA | ARIZONA | 2005 | 919875.0 | 7550109.0 | 897553.0 | 3320196.0 | 3332360.0 | 7136794.0 | 3423716.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2005_ARKANSAS | ARKANSAS | 2005 | 461667.0 | 3978240.0 | 447012.0 | 2995788.0 | 535440.0 | 3999608.0 | 2118580.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2005_CALIFORNIA | CALIFORNIA | 2005 | 6307022.0 | 62171797.0 | 7148863.0 | 34912166.0 | 20110768.0 | 66196072.0 | 31070450.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 266 columns
This third dataset was much more complicated to clean. We did it with a lot of renaming, transposing, and removing information we didn't need.
df3.at[0, 'Unnamed: 15'] = "2013-14"
df3.at[0, 'Unnamed: 16'] = "2014-15"
df3.at[0, 'Unnamed: 17'] = "Score change: 1990-91 to 2004-05"
df3.at[0, 'Unnamed: 18'] = "Score change: 2004-05 to 2014-15"
df3.at[0, 'Unnamed: 19'] = "Score change: 2009-10 to 2014-15"
df3.at[0, 'Unnamed: 20'] = "Score change: 2013-14 to 2014-15"
df3.columns = df3.iloc[0]
df3 = df3[2:]
df3.reset_index(drop=True, inplace=True)
df3.at[2, "Race/ethnicity"] = "SAT—Critical reading All students"
df3 = df3.drop(1)
df3.at[9, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(8)
df3.at[11, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(10)
df3.at[15, "Race/ethnicity"] = "SAT—Mathematics All students"
df3 = df3.drop(14)
df3 = df3.drop(13)
df3.at[22, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(21)
df3.at[24, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(23)
df3 = df3.drop(26)
df3 = df3.drop(27)
df3.at[28, "Race/ethnicity"] = "SAT—Writing All students"
df3.at[35, "Race/ethnicity"] = "Asian/Pacific Islander"
df3 = df3.drop(34)
df3.at[37, "Race/ethnicity"] = "American Indian/ Alaska Native"
df3 = df3.drop(36)
df3 = df3.drop(39)
df3 = df3.drop(40)
df3 = df3.drop(41)
df3.at[3, "Race/ethnicity"] = "White"
df3.at[4, "Race/ethnicity"] = "Black"
df3.at[5, "Race/ethnicity"] = "Mexican American"
df3.at[6, "Race/ethnicity"] = "Puerto Rican"
df3.at[7, "Race/ethnicity"] = "Other Hispanic"
df3.at[12, "Race/ethnicity"] = "Other"
df3.at[16, "Race/ethnicity"] = "White"
df3.at[17, "Race/ethnicity"] = "Black"
df3.at[18, "Race/ethnicity"] = "Mexican American"
df3.at[19, "Race/ethnicity"] = "Puerto Rican"
df3.at[20, "Race/ethnicity"] = "Other Hispanic"
df3.at[25, "Race/ethnicity"] = "Other"
df3.at[29, "Race/ethnicity"] = "White"
df3.at[30, "Race/ethnicity"] = "Black"
df3.at[31, "Race/ethnicity"] = "Mexican American"
df3.at[32, "Race/ethnicity"] = "Puerto Rican"
df3.at[33, "Race/ethnicity"] = "Other Hispanic"
df3.at[38, "Race/ethnicity"] = "Other"
df3.reset_index(drop=True, inplace=True)
df3.head()
Race/ethnicity | 1986- 87 | 1990- 91 | 1996- 97 | 2000- 01 | 2003- 04 | 2004- 05 | 2005-06 | 2006-07 | 2007-08 | ... | 2009-10 | 2010-11 | 2011-12 | 2012-13 | 2013-14 | 2014-15 | Score change: 1990-91 to 2004-05 | Score change: 2004-05 to 2014-15 | Score change: 2009-10 to 2014-15 | Score change: 2013-14 to 2014-15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
1 | SAT—Critical reading All students | 507 | 499 | 505 | 506 | 508 | 508 | 503 | 502 | 502 | ... | 501 | 497 | 496 | 496 | 497 | 495 | 9 | -13 | -6 | -2 |
2 | White | 524 | 518 | 526 | 529 | 528 | 532 | 527 | 527 | 528 | ... | 528 | 528 | 527 | 527 | 529 | 529 | 14 | -3 | 1 | 0 |
3 | Black | 428 | 427 | 434 | 433 | 430 | 433 | 434 | 433 | 430 | ... | 429 | 428 | 428 | 431 | 431 | 431 | 6 | -2 | 2 | 0 |
4 | Mexican American | 457 | 454 | 451 | 451 | 451 | 453 | 454 | 455 | 454 | ... | 454 | 451 | 448 | 449 | 450 | 448 | -1 | -5 | -6 | -2 |
5 rows × 21 columns
df3 = df3.drop(['1986- 87', '1990- 91', '1996- 97', '2000- 01', '2003- 04'], axis=1)
df3 = df3.drop(['Score change: 1990-91 to 2004-05', 'Score change: 2004-05 to 2014-15', 'Score change: 2009-10 to 2014-15', 'Score change: 2013-14 to 2014-15'], axis=1)
df3 = df3.rename(columns = {'2004- 05' : '2005',
'2005-06' : '2006',
'2006-07' : '2007',
'2007-08' : '2008',
'2008-09' : '2009',
'2009-10' : '2010',
'2010-11' : '2011',
'2011-12' : '2012',
'2012-13' : '2013',
'2013-14' : '2014',
'2014-15' : '2015'
})
We created separate math and reading dataframes from this dataset to align with our other data and work with our model. We also converted the score columns to ints, as they were objects before.
# Creating the reading dataframe
df_R = df3[(df3['2005'].isin([532,433,453,460,463,511,489,495]))& (df3['2006'].isin([527, 434, 454,459,458,510,487,494]))]
df_R = df_R.rename(columns = {'Race/ethnicity': 'Year'})
df_R = df_R.T
df_R.columns = df_R.iloc[0]
df_R = df_R.drop('Year', axis=0)
df_R.reset_index(inplace =True)
df_R = df_R.rename(columns= {0:'Year'})
convert_dict = {'Year': int,
'White': int,
'Black': int,
'Mexican American': int,
'Puerto Rican': int,
'Other Hispanic' : int,
'Asian/Pacific Islander': int,
'American Indian/ Alaska Native': int,
'Other': int}
df_R = df_R.astype(convert_dict)
df_R
Year | Year | White | Black | Mexican American | Puerto Rican | Other Hispanic | Asian/Pacific Islander | American Indian/ Alaska Native | Other |
---|---|---|---|---|---|---|---|---|---|
0 | 2005 | 532 | 433 | 453 | 460 | 463 | 511 | 489 | 495 |
1 | 2006 | 527 | 434 | 454 | 459 | 458 | 510 | 487 | 494 |
2 | 2007 | 527 | 433 | 455 | 459 | 459 | 514 | 487 | 497 |
3 | 2008 | 528 | 430 | 454 | 456 | 455 | 513 | 485 | 496 |
4 | 2009 | 528 | 429 | 453 | 452 | 455 | 516 | 486 | 494 |
5 | 2010 | 528 | 429 | 454 | 454 | 454 | 519 | 485 | 494 |
6 | 2011 | 528 | 428 | 451 | 452 | 451 | 517 | 484 | 493 |
7 | 2012 | 527 | 428 | 448 | 452 | 447 | 518 | 482 | 491 |
8 | 2013 | 527 | 431 | 449 | 456 | 450 | 521 | 480 | 492 |
9 | 2014 | 529 | 431 | 450 | 456 | 451 | 523 | 483 | 493 |
10 | 2015 | 529 | 431 | 448 | 456 | 449 | 525 | 481 | 490 |
# Creating the math dataframe
df_M = df3[(df3['2005'].isin([536,431,463,457,469,580,493,513]))& (df3['2006'].isin([536, 429, 465,456,463,578,494,513]))]
df_M = df_M.rename(columns = {'Race/ethnicity': 'Year'})
df_M = df_M.T
df_M.columns = df_M.iloc[0]
df_M = df_M.drop('Year', axis=0)
df_M.reset_index(inplace =True)
df_M = df_M.rename(columns= {0:'Year'})
convert_dict = {'Year': int,
'White': int,
'Black': int,
'Mexican American': int,
'Puerto Rican': int,
'Other Hispanic' : int,
'Asian/Pacific Islander': int,
'American Indian/ Alaska Native': int,
'Other': int}
df_M = df_M.astype(convert_dict)
df_M
Year | Year | White | Black | Mexican American | Puerto Rican | Other Hispanic | Asian/Pacific Islander | American Indian/ Alaska Native | Other |
---|---|---|---|---|---|---|---|---|---|
0 | 2005 | 536 | 431 | 463 | 457 | 469 | 580 | 493 | 513 |
1 | 2006 | 536 | 429 | 465 | 456 | 463 | 578 | 494 | 513 |
2 | 2007 | 534 | 429 | 466 | 454 | 463 | 578 | 494 | 512 |
3 | 2008 | 537 | 426 | 463 | 453 | 461 | 581 | 491 | 512 |
4 | 2009 | 536 | 426 | 463 | 450 | 461 | 587 | 493 | 514 |
5 | 2010 | 536 | 428 | 467 | 452 | 462 | 591 | 492 | 514 |
6 | 2011 | 535 | 427 | 466 | 452 | 462 | 595 | 488 | 517 |
7 | 2012 | 536 | 428 | 465 | 452 | 461 | 595 | 489 | 516 |
8 | 2013 | 534 | 429 | 464 | 453 | 461 | 597 | 486 | 519 |
9 | 2014 | 534 | 429 | 461 | 450 | 459 | 598 | 484 | 520 |
10 | 2015 | 534 | 428 | 457 | 449 | 457 | 598 | 482 | 519 |
One of our original hypotheses was that family income has a positive correlation with SAT scores, meaning the higher the family income, the higher the average SAT score. We decided to average the math and reading scores, as we want to look at overall scores for this analysis.
The first table we created includes all the data on how SAT scores relate to family income. We also added new columns showing the percentage of test-takers in each family income bracket out of the total number of test-takers in each row, along with a column for the total number of test-takers.
df_income.head()
Year | State | Family Income < 20k: Number of Test-takers | Family Income < 20k: Math Average Score | Family Income < 20k: Reading Average Score | Family Income 20-40k: Number of Test-takers | Family Income 20-40k: Math Average Score | Family Income 20-40k: Reading Average Score | Family Income 40-60k: Number of Test-takers | Family Income 40-60k: Math Average Score | ... | Gender.Female.Verbal | Gender.Male.Math | Gender.Male.Verbal | Total.Test-takers | % < 20k | % 20-40k | % 40-60k | % 60-80k | % 80-100k | % > 100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | 175 | 462.0 | 474.0 | 324 | 513.0 | 527 | 442 | 539 | ... | 561 | 582 | 574 | 2869 | 0.060997 | 0.112931 | 0.154061 | 0.164866 | 0.165563 | 0.341582 |
1 | 2005 | ALASKA | 191 | 464.0 | 467.0 | 401 | 492.0 | 500 | 539 | 517 | ... | 521 | 535 | 526 | 2718 | 0.070272 | 0.147535 | 0.198308 | 0.221854 | 0.163355 | 0.198675 |
2 | 2005 | ARIZONA | 891 | 485.0 | 474.0 | 2121 | 498.0 | 495 | 2270 | 520 | ... | 522 | 549 | 531 | 12603 | 0.070697 | 0.168293 | 0.180116 | 0.188209 | 0.148060 | 0.244624 |
3 | 2005 | ARKANSAS | 107 | 489.0 | 486.0 | 180 | 513.0 | 526 | 245 | 543 | ... | 558 | 570 | 570 | 1220 | 0.087705 | 0.147541 | 0.200820 | 0.186066 | 0.120492 | 0.257377 |
4 | 2005 | CALIFORNIA | 19323 | 451.0 | 421.0 | 26161 | 477.0 | 458 | 18347 | 506 | ... | 499 | 543 | 510 | 123872 | 0.155992 | 0.211194 | 0.148113 | 0.144803 | 0.113989 | 0.225911 |
5 rows × 31 columns
This dataframe enables us to explore the correlation between income and SAT scores across different states. We can also analyze SAT score trends over the years to determine whether scores have increased over time. Once we establish the correlation between SAT scores and income by state, we can further examine the percentage of people from various income brackets residing in each state to assess the potential impact of regional differences on educational outcomes.
Combined Correlation Between Scores and Household Income
We first want to analyze whether family income has an effect on SAT scores. To start, we will create a line graph comparing family income to average SAT score. We take the average of the math and reading average scores and plot it against family income. We predict a positive correlation between family income and SAT scores, because higher-income families can afford more resources, such as SAT prep books, that help their children earn higher scores.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) #kept getting warnings using .corr() but we can ignore them for now because it doesn't affect current version of pandas
income_columns = ['< 20k',
'20-40k',
'40-60k',
'60-80k',
'80-100k',
'> 100k']
df_avg_0k = (df_income['Family Income < 20k: Math Average Score'].mean() + df_income['Family Income < 20k: Reading Average Score'].mean()) / 2
df_avg_20k = (df_income['Family Income 20-40k: Math Average Score'].mean() + df_income['Family Income 20-40k: Reading Average Score'].mean()) / 2
df_avg_40k = (df_income['Family Income 40-60k: Math Average Score'].mean() + df_income['Family Income 40-60k: Reading Average Score'].mean()) / 2
df_avg_60k = (df_income['Family Income 60-80k: Math Average Score'].mean() + df_income['Family Income 60-80k: Reading Average Score'].mean()) / 2
df_avg_80k = (df_income['Family Income 80-100k: Math Average Score'].mean() + df_income['Family Income 80-100k: Reading Average Score'].mean()) / 2
df_avg_100k = (df_income['Family Income > 100k: Math Average Score'].mean() + df_income['Family Income > 100k: Reading Average Score'].mean()) / 2
df_avg_total = {'score': [df_avg_0k, df_avg_20k, df_avg_40k, df_avg_60k, df_avg_80k, df_avg_100k],
'income': ['< 20k', '20-40k', '40-60k','60-80k', '80-100k','> 100k'],
'index': [1, 2, 3, 4, 5, 6]}
df_avg_total = pd.DataFrame(df_avg_total)
df_avg_total.head()
plt.legend()
plt.xlabel('Income')
plt.ylabel('Average Score')
plt.title(f'Avg Score based on income')
plot_income_sat = plt.plot(df_avg_total['income'], df_avg_total['score'])
df_avg_total.corr().loc['score', 'index']
pd.set_option('display.max_colwidth', 800)
df_income.head()
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Year | State | Family Income < 20k: Number of Test-takers | Family Income < 20k: Math Average Score | Family Income < 20k: Reading Average Score | Family Income 20-40k: Number of Test-takers | Family Income 20-40k: Math Average Score | Family Income 20-40k: Reading Average Score | Family Income 40-60k: Number of Test-takers | Family Income 40-60k: Math Average Score | ... | Gender.Female.Verbal | Gender.Male.Math | Gender.Male.Verbal | Total.Test-takers | % < 20k | % 20-40k | % 40-60k | % 60-80k | % 80-100k | % > 100k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | 175 | 462.0 | 474.0 | 324 | 513.0 | 527 | 442 | 539 | ... | 561 | 582 | 574 | 2869 | 0.060997 | 0.112931 | 0.154061 | 0.164866 | 0.165563 | 0.341582 |
1 | 2005 | ALASKA | 191 | 464.0 | 467.0 | 401 | 492.0 | 500 | 539 | 517 | ... | 521 | 535 | 526 | 2718 | 0.070272 | 0.147535 | 0.198308 | 0.221854 | 0.163355 | 0.198675 |
2 | 2005 | ARIZONA | 891 | 485.0 | 474.0 | 2121 | 498.0 | 495 | 2270 | 520 | ... | 522 | 549 | 531 | 12603 | 0.070697 | 0.168293 | 0.180116 | 0.188209 | 0.148060 | 0.244624 |
3 | 2005 | ARKANSAS | 107 | 489.0 | 486.0 | 180 | 513.0 | 526 | 245 | 543 | ... | 558 | 570 | 570 | 1220 | 0.087705 | 0.147541 | 0.200820 | 0.186066 | 0.120492 | 0.257377 |
4 | 2005 | CALIFORNIA | 19323 | 451.0 | 421.0 | 26161 | 477.0 | 458 | 18347 | 506 | ... | 499 | 543 | 510 | 123872 | 0.155992 | 0.211194 | 0.148113 | 0.144803 | 0.113989 | 0.225911 |
5 rows × 31 columns
Based on the graph, there is a positive correlation between income and test scores. We then computed the correlation coefficient to see how strong this correlation is: at .976, it is very strong.
However, this is the correlation for all of the states combined. Later in the project, we want to use state and income as variables in our predictive model. Therefore, we need to find the correlation coefficient for each state to make sure the model would be accurate for each state when predicting.
To do this, we created a function that returns the correlation coefficient for any state, so we could easily iterate through all of the states. We added each result to a dictionary with the state as the key and the correlation coefficient as the value, and then sorted the dictionary from smallest to largest correlation coefficient.
Correlation Between Scores and Household Income by State
def get_correlation(df_income, statename):
df_state = df_income[(df_income['State'] == statename)]
income_columns = []
df_avg_0k = (df_state['Family Income < 20k: Math Average Score'].mean() + df_state['Family Income < 20k: Reading Average Score'].mean()) / 2
df_avg_20k = (df_state['Family Income 20-40k: Math Average Score'].mean() + df_state['Family Income 20-40k: Reading Average Score'].mean()) / 2
df_avg_40k = (df_state['Family Income 40-60k: Math Average Score'].mean() + df_state['Family Income 40-60k: Reading Average Score'].mean()) / 2
df_avg_60k = (df_state['Family Income 60-80k: Math Average Score'].mean() + df_state['Family Income 60-80k: Reading Average Score'].mean()) / 2
df_avg_80k = (df_state['Family Income 80-100k: Math Average Score'].mean() + df_state['Family Income 80-100k: Reading Average Score'].mean()) / 2
df_avg_100k = (df_state['Family Income > 100k: Math Average Score'].mean() + df_state['Family Income > 100k: Reading Average Score'].mean()) / 2
df_avg = {'score': [df_avg_0k, df_avg_20k, df_avg_40k, df_avg_60k, df_avg_80k, df_avg_100k],
'income': ['< 20k', '20-40k', '40-60k','60-80k', '80-100k','> 100k'],
'index': [1, 2, 3, 4, 5, 6]}
df_avg = pd.DataFrame(df_avg)
correlation_matrix = df_avg.corr()
r_value = correlation_matrix.loc['score', 'index']
# Extract the correlation coefficient between two specific columns (e.g., Column1 and Column2)
return r_value
state_correlations = {}
state_names = df_income['State'].unique()
for i in state_names:
state_correlations[i] = get_correlation(df_income, i)
sorted_correlations = dict(sorted(state_correlations.items(), key=lambda item: item[1]))
sorted_correlations
{'WYOMING': 0.6817823026104805, 'NORTH DAKOTA': 0.7444367294875884, 'SOUTH DAKOTA': 0.7511933183748691, 'VIRGIN ISLANDS': 0.9312281762015785, 'MONTANA': 0.9397737259414765, 'NEBRASKA': 0.9447349398116189, 'MISSISSIPPI': 0.9452556846643638, 'UTAH': 0.9483990565719241, 'IOWA': 0.9540450456746358, 'KANSAS': 0.959994646351036, 'ARKANSAS': 0.9627908180238146, 'OKLAHOMA': 0.9629782756291895, 'ALASKA': 0.9634885788666858, 'MAINE': 0.9655202926950999, 'WISCONSIN': 0.967455700113321, 'FLORIDA': 0.9721455941178156, 'OHIO': 0.9725710757715881, 'ALABAMA': 0.9742000091476154, 'MISSOURI': 0.9743170332983074, 'RHODE ISLAND': 0.9752676496678933, 'COLORADO': 0.976166583527223, 'KENTUCKY': 0.9774019928780511, 'SOUTH CAROLINA': 0.9778913114637013, 'LOUISIANA': 0.9791280064127692, 'TENNESSEE': 0.9791381979672444, 'WASHINGTON': 0.97979829408224, 'MINNESOTA': 0.980880125149849, 'MICHIGAN': 0.981669010843995, 'NEVADA': 0.9824884335167628, 'NEW HAMPSHIRE': 0.9825814330307877, 'PENNSYLVANIA': 0.9826012777025478, 'NEW YORK': 0.9827062212711544, 'ILLINOIS': 0.9827348970539682, 'IDAHO': 0.9834113354285738, 'MASSACHUSETTS': 0.9840192451238341, 'CONNECTICUT': 0.9848453372345946, 'HAWAII': 0.9854316568080004, 'WEST VIRGINIA': 0.985512001183547, 'NEW JERSEY': 0.9855406973376892, 'OREGON': 0.9856439572812011, 'INDIANA': 0.985864808592127, 'DISTRICT OF COLUMBIA': 0.9861195539957384, 'TEXAS': 0.9862451287121754, 'NORTH CAROLINA': 0.9868290906988119, 'ARIZONA': 0.9872445761934121, 'VERMONT': 0.9884377674742938, 'DELAWARE': 0.9884799176055574, 'GEORGIA': 0.9889991974408391, 'MARYLAND': 0.9911364667634063, 'VIRGINIA': 0.9923846879775922, 'PUERTO RICO': 0.9928976852939522, 'CALIFORNIA': 0.9929236107247942, 'NEW MEXICO': 0.9954872028202372}
As the dictionary shows, most states have a strong correlation; the exceptions are Wyoming, North Dakota, and South Dakota, which all have correlation coefficients below .9.
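For example, a quick comprehension over the sorted_correlations dictionary built above pulls out just those weaker states (a small sketch; the 0.9 cutoff is our own choice):
# States whose income/score correlation falls below an (arbitrary) 0.9 threshold
weak_states = {state: r for state, r in sorted_correlations.items() if r < 0.9}
print(weak_states)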
To understand why these states have such a weak correlation, we decided to plot SAT scores by year for every family income category per state, as seen below. We create a graph for Wyoming, which has the weakest correlation, and another for New Mexico, which has the strongest correlation, so we can see what the graph is supposed to look like.
df_income
df_income_state = df_income[['Year', 'State', 'Family Income < 20k: Math Average Score','Family Income < 20k: Reading Average Score', 'Family Income 20-40k: Math Average Score', 'Family Income 20-40k: Reading Average Score', 'Family Income 40-60k: Math Average Score', 'Family Income 40-60k: Reading Average Score', 'Family Income 60-80k: Math Average Score', 'Family Income 60-80k: Reading Average Score', 'Family Income 80-100k: Math Average Score', 'Family Income 80-100k: Reading Average Score', 'Family Income > 100k: Math Average Score', 'Family Income > 100k: Reading Average Score' ]]
df_income_state
#df_alabama = [(df_income_state['State'] == 'ALABAMA')]
df_income_state_0k = df_income_state[['State','Year', 'Family Income < 20k: Math Average Score', 'Family Income < 20k: Reading Average Score']]
df_income_state_0k['Avg Score'] = (df_income_state_0k['Family Income < 20k: Math Average Score'] + df_income_state_0k['Family Income < 20k: Reading Average Score'])/2
df_income_state_20k = df_income_state[['State','Year', 'Family Income 20-40k: Math Average Score', 'Family Income 20-40k: Reading Average Score']]
df_income_state_20k['Avg Score'] = (df_income_state_20k['Family Income 20-40k: Math Average Score'] + df_income_state_20k['Family Income 20-40k: Reading Average Score'])/2
df_income_state_40k = df_income_state[['State','Year', 'Family Income 40-60k: Math Average Score', 'Family Income 40-60k: Reading Average Score']]
df_income_state_40k['Avg Score'] = (df_income_state_40k['Family Income 40-60k: Math Average Score'] + df_income_state_40k['Family Income 40-60k: Reading Average Score'])/2
df_income_state_60k = df_income_state[['State','Year', 'Family Income 60-80k: Math Average Score', 'Family Income 60-80k: Reading Average Score']]
df_income_state_60k['Avg Score'] = (df_income_state_60k['Family Income 60-80k: Math Average Score'] + df_income_state_60k['Family Income 60-80k: Reading Average Score'])/2
df_income_state_80k = df_income_state[['State','Year', 'Family Income 80-100k: Math Average Score', 'Family Income 80-100k: Reading Average Score']]
df_income_state_80k['Avg Score'] = (df_income_state_80k['Family Income 80-100k: Math Average Score'] + df_income_state_80k['Family Income 80-100k: Reading Average Score'])/2
df_income_state_100k = df_income_state[['State','Year', 'Family Income > 100k: Math Average Score', 'Family Income > 100k: Reading Average Score']]
df_income_state_100k['Avg Score'] = (df_income_state_100k['Family Income > 100k: Math Average Score'] + df_income_state_100k['Family Income > 100k: Reading Average Score'])/2
def get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, statename, subplot_config):
df_alabama_0k = df_income_state_0k[(df_income_state_0k['State'] == statename)]
df_alabama_20k = df_income_state_20k[df_income_state_20k['State'] == statename]
df_alabama_40k = df_income_state_40k[df_income_state_40k['State'] == statename]
df_alabama_60k = df_income_state_60k[df_income_state_60k['State'] == statename]
df_alabama_80k = df_income_state_80k[df_income_state_80k['State'] == statename]
df_alabama_100k = df_income_state_100k[df_income_state_100k['State'] == statename]
rows, cols, index = subplot_config
plt.subplot(rows, cols, index)
plot_income_0k = plt.plot(df_alabama_0k['Year'], df_alabama_0k['Avg Score'], label = '<20k')
plot_income_20k = plt.plot(df_alabama_20k['Year'], df_alabama_20k['Avg Score'], label = '20k-40k')
plot_income_40k = plt.plot(df_alabama_40k['Year'], df_alabama_40k['Avg Score'], label = '40k-60k')
plot_income_60k = plt.plot(df_alabama_60k['Year'], df_alabama_60k['Avg Score'], label = '60k-80k')
plot_income_80k = plt.plot(df_alabama_80k['Year'], df_alabama_80k['Avg Score'], label = '80k-100k')
plot_income_100k = plt.plot(df_alabama_100k['Year'], df_alabama_100k['Avg Score'], label = '>100k')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Average Score')
plt.title(f'Avg. Score vs. year for Incomes in {statename}')
plt.figure(figsize=(8, 6))
get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, 'WYOMING', (2,2,1))
get_state_plot(df_income_state_0k, df_income_state_20k, df_income_state_40k, df_income_state_60k, df_income_state_80k, df_income_state_100k, 'NEW MEXICO', (2,2,2))
plt.tight_layout()
plt.show()
The lines in the Wyoming graph mix together: there is no clear trend, and most of the lines intertwine with each other. Unlike the Wyoming graph, the New Mexico graph has its lines clearly stacked on top of one another in order of family income.
We also wanted to see whether a state's overall wealth and its spending on education have an effect on student outcomes. First, we start with an analysis of state revenue and SAT scores. To do this, we made a scatterplot of all the observations, with state revenue as the independent variable and average math and verbal scores as the dependent variables.
df2_revenue = df2[['State', 'Year', 'TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE', 'LOCAL_REVENUE', 'INSTRUCTION_EXPENDITURE', 'ENROLL']]
df1_scores = df1[['Year', 'State', 'Total.Math', 'Total.Verbal']]
df1_scores = df1_scores[(df1_scores['State'] != 'PUERTO RICO') & (df1_scores['State'] != 'VIRGIN ISLANDS')]
merged_df = pd.merge(df2_revenue, df1_scores, left_on=['State', 'Year'], right_on=['State', 'Year'])
math_rev = plt.scatter(merged_df['STATE_REVENUE'], merged_df['Total.Math'])
plt.title("State Revenue vs SAT scores")
plt.xlabel("Average State Revenue from 2005-2015")
plt.ylabel("Average Math SAT score by state from 2005-2015")
verbal_rev = plt.scatter(merged_df['STATE_REVENUE'], merged_df['Total.Verbal'])
plt.legend([math_rev, verbal_rev], ['math', 'verbal'])
<matplotlib.legend.Legend at 0x7c12fb0bfaf0>
merged_df
State | Year | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | INSTRUCTION_EXPENDITURE | ENROLL | Total.Math | Total.Verbal | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ALABAMA | 2005 | 5889966.0 | 665924.0 | 3247562.0 | 1976480.0 | 3053380.0 | 729342.0 | 559 | 567 |
1 | ALASKA | 2005 | 1599739.0 | 302846.0 | 878731.0 | 418162.0 | 829666.0 | 132568.0 | 519 | 523 |
2 | ARIZONA | 2005 | 7550109.0 | 897553.0 | 3320196.0 | 3332360.0 | 3423716.0 | 919875.0 | 530 | 526 |
3 | ARKANSAS | 2005 | 3978240.0 | 447012.0 | 2995788.0 | 535440.0 | 2118580.0 | 461667.0 | 552 | 563 |
4 | CALIFORNIA | 2005 | 62171797.0 | 7148863.0 | 34912166.0 | 20110768.0 | 31070450.0 | 6307022.0 | 522 | 504 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
554 | VIRGINIA | 2015 | 15857524.0 | 1012205.0 | 6240349.0 | 8604970.0 | 8755896.0 | 1279867.0 | 517 | 518 |
555 | WASHINGTON | 2015 | 13709442.0 | 1036422.0 | 8293812.0 | 4379208.0 | 6508964.0 | 1072359.0 | 511 | 502 |
556 | WEST VIRGINIA | 2015 | 3478401.0 | 362959.0 | 1979466.0 | 1135976.0 | 1819903.0 | 279565.0 | 500 | 509 |
557 | WISCONSIN | 2015 | 11637376.0 | 814385.0 | 5869265.0 | 4953726.0 | 5723474.0 | 861813.0 | 606 | 591 |
558 | WYOMING | 2015 | 1962874.0 | 120290.0 | 1116917.0 | 725667.0 | 895910.0 | 93867.0 | 587 | 588 |
559 rows × 10 columns
The scatter plot indicates that there is no strong or visible correlation between revenue and test scores. We were curious why and conducted further research. During our search, we found an article which explains that state revenue and state spending on education do not have as significant an impact on student success as one might expect. You can find it here.
The study indicates that while there may not be an obvious correlation when looking at the data as a whole, low-income students are more sensitive to changes in education spending than higher-income students. This is probably because lower-income students are more reliant on the resources the state provides them. For a future investigation, it would be interesting to look at an additional dataset with per-pupil spending over the years for each state and compare it with the performance of students in different income brackets to verify this relationship.
However, for now we wanted to check the article's findings by creating a scatterplot of education spending per student vs. SAT scores. To do this we divided instruction expenditure by the number of students enrolled to get per-student education spending, and we used the average math score as our dependent variable. When plotting, we staggered the variables a few years apart, since it takes a few years for the effects of education spending to show. Although we did not do this for the state revenue scatterplot, staggering the years has little effect; unstaggered, the scatterplot below looks very similar.
merged_df.head()
merged_df['education_spending'] = merged_df['INSTRUCTION_EXPENDITURE']/merged_df['ENROLL']
plt.legend()
plt.xlabel('Education Spending (mil)')
plt.ylabel('Average Math Score')
plt.title(f'Education Spending per Student vs. Average Math Score')
rev_expend = plt.scatter(merged_df[merged_df['Year'] == 2007]['education_spending'], merged_df[merged_df['Year'] == 2011]['Total.Math'])
WARNING:matplotlib.legend:No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
The NAEP, or National Assessment of Educational Progress, provides national standardized testing for 4th, 8th, and 12th grade students.
The code below shows the process of creating a dataframe of 8th grade NAEP exam scores and SAT scores with a three-year offset. We chose this three-year gap because most students take the SAT in their junior year of high school, so roughly the same cohort of students is taking both exams.
We wanted to look at 8th grade NAEP scores to see if there is any way to identify early on whether a child is likely to be more successful based on previous examination results. If so, colleges would be able to use this data to advertise to students earlier than the student's sophomore year of high school.
df2_8G = df2[['Year', 'State', 'G08_A_M_READING', 'G08_A_M_MATHEMATICS', 'G08_A_F_READING', 'G08_A_F_MATHEMATICS']]
df2_8G['G08 Average Math'] =(df2_8G['G08_A_M_MATHEMATICS'] + df2_8G['G08_A_F_MATHEMATICS']) / 2
df2_8G['G08 Average Reading'] =(df2_8G['G08_A_M_READING'] + df2_8G['G08_A_F_READING']) / 2
df2_8G = df2_8G.rename(columns = { 'G08_A_M_READING': 'G08 Average Reading Male',
'G08_A_M_MATHEMATICS': 'G08 Average Math Male',
'G08_A_F_MATHEMATICS': 'G08 Average Math Female',
'G08_A_F_READING': 'G08 Average Reading Female'})
df1_scores = df1_scores.rename(columns = {'Total.Math': 'Average Math Score',
'Total.Verbal': 'Average Verbal Score'})
# For Math Scores
df2_8G_2007 = df2_8G[(df2_8G['Year'] == 2007)]
df1_2010 = df1_scores[(df1_scores['Year']==2010)]
merged_2007_2010 = pd.merge(df2_8G_2007, df1_2010, left_on=['State'], right_on=['State'])
plot2007_2010 = plt.scatter(merged_2007_2010['G08 Average Math'], merged_2007_2010['Average Math Score'])
df2_8G_2009 = df2_8G[(df2_8G['Year'] == 2009)]
df1_2012 = df1_scores[(df1_scores['Year']==2012)]
merged_2009_2012 = pd.merge(df2_8G_2009, df1_2012, left_on=['State'], right_on=['State'])
merged_2009_2012.describe()
plot2009_2012 = plt.scatter(merged_2009_2012['G08 Average Math'], merged_2009_2012['Average Math Score'])
df2_8G_2011 = df2_8G[(df2_8G['Year'] == 2011)]
df1_2014 = df1_scores[(df1_scores['Year']==2014)]
merged_2011_2014 = pd.merge(df2_8G_2011, df1_2014, left_on=['State'], right_on=['State'])
merged_2011_2014
plot2011_2014 = plt.scatter(merged_2011_2014['G08 Average Math'], merged_2011_2014['Average Math Score'])
plt.legend([plot2007_2010, plot2009_2012, plot2011_2014], ['2007-2010', '2009-2012', '2011-2014'])
plt.title("8th Grade NAEP Math Scores vs SAT Math scores")
plt.xlabel("Average State NAEP Score")
plt.ylabel("Average State SAT Score")
Text(0, 0.5, 'Average State SAT Score')
As you can see from the graph, there is a fairly loose correlation between average state NAEP score and SAT math score. Since we do not have data on individual people and their scores, we have to assume that the group taking the 8th grade NAEP is the same group taking the SAT three years later.
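To put a rough number on that visual impression, we can compute the correlation coefficient for one of the merged year pairs (a quick sketch using the merged_2011_2014 frame built above; the other pairs can be checked the same way):
# State-level correlation between 8th grade NAEP math (2011) and SAT math (2014)
r = merged_2011_2014['G08 Average Math'].corr(merged_2011_2014['Average Math Score'])
print('NAEP math vs. SAT math correlation (2011 NAEP, 2014 SAT):', round(r, 3))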
# For reading/ verbal scores
df2_8G_2007 = df2_8G[(df2_8G['Year'] == 2007)]
df1_2010 = df1_scores[(df1_scores['Year']==2010)]
merged_2007_2010 = pd.merge(df2_8G_2007, df1_2010, left_on=['State'], right_on=['State'])
plot2007_2010r = plt.scatter(merged_2007_2010['G08 Average Reading'], merged_2007_2010['Average Verbal Score'], c='#17becf')
df2_8G_2009 = df2_8G[(df2_8G['Year'] == 2009)]
df1_2012 = df1_scores[(df1_scores['Year']==2012)]
merged_2009_2012 = pd.merge(df2_8G_2009, df1_2012, left_on=['State'], right_on=['State'])
merged_2009_2012.describe()
plot2009_2012r = plt.scatter(merged_2009_2012['G08 Average Reading'], merged_2009_2012['Average Verbal Score'], c='#d62728')
df2_8G_2011 = df2_8G[(df2_8G['Year'] == 2011)]
df1_2014 = df1_scores[(df1_scores['Year']==2014)]
merged_2011_2014 = pd.merge(df2_8G_2011, df1_2014, left_on=['State'], right_on=['State'])
merged_2011_2014
plot2011_2014r = plt.scatter(merged_2011_2014['G08 Average Reading'], merged_2011_2014['Average Verbal Score'], c='#e377c2')
plt.legend([plot2007_2010r, plot2009_2012r, plot2011_2014r], ['2007-2010', '2009-2012', '2011-2014'])
plt.title("8th Grade NAEP Reading Scores vs SAT Verbal scores")
plt.xlabel("Average State NAEP Score")
plt.ylabel("Average State SAT Score")
Text(0, 0.5, 'Average State SAT Score')
We wanted to see if reading scores were any different; the graphs look fairly similar, though the correlation seems a little looser.
One feature that we wanted to add to the predictive model is gender. To do this, we will see if there is any relationship between gender and scores. We did an initial analysis comparing the average difference in scores between women and men for math, for reading, and for both combined. We found that, on average, there is a larger disparity in test scores between the two genders in math than in reading. In total, the average difference in total score is around 42 points. It is difficult to determine how significant this disparity is because SAT scores are calculated on a scale rather than each question being worth an equal number of points.
print('avg difference in scores between women and men in math: ', abs((df_income['Gender.Female.Math']- df_income['Gender.Male.Math']).mean()))
print('avg difference in scores between women and men in verbal: ', abs((df_income['Gender.Female.Verbal']- df_income['Gender.Male.Verbal']).mean()))
print('avg difference in total scores: ', abs(((df_income['Gender.Female.Verbal']+df_income['Gender.Female.Math'])- (df_income['Gender.Male.Verbal']+df_income['Gender.Male.Math'])).mean()))
avg difference in scores between women and men in math: 35.49566724436742
avg difference in scores between women and men in verbal: 6.535528596187175
avg difference in total scores: 42.031195840554595
Another feature we want to look at for our predictive model is race and the impact it has on SAT scores. We thought this would be an important feature since, historically, people of color have been disadvantaged by systemic issues. To do this, we found the average score across all the years for each race, for math and for reading.
df_R_copy = df_R
df_M_copy = df_M
white = df_M_copy['White'].mean()
black = df_M_copy['Black'].mean()
mexican = df_M_copy['Mexican American'].mean()
puerto = df_M_copy['Puerto Rican'].mean()
hispanic = df_M_copy['Other Hispanic'].mean()
asian = df_M_copy['Asian/Pacific Islander'].mean()
AI_AN = df_M_copy['American Indian/ Alaska Native'].mean()
other = df_M_copy['Other'].mean()
data = {
'race/ethnicity': ['White', 'Black', 'Mexican American', 'Puerto Rican', 'Other Hispanic', 'Asian/Pacific Islander', 'American Indian/ Alaska Native', 'Other'],
'data': [white, black, mexican, puerto, hispanic, asian, AI_AN, other]
}
df_bar = pd.DataFrame(data)
df_bar.plot(kind='bar', x='race/ethnicity', y='data', legend=False)
# Add labels and title
plt.xlabel('Race/Ethnicity')
plt.ylabel('Avg. Math SAT score')
plt.title('Avg. Math SAT score for each race')
# Show the plot
plt.show()
white = df_R_copy['White'].mean()
black = df_R_copy['Black'].mean()
mexican = df_R_copy['Mexican American'].mean()
puerto = df_R_copy['Puerto Rican'].mean()
hispanic = df_R_copy['Other Hispanic'].mean()
asian = df_R_copy['Asian/Pacific Islander'].mean()
AI_AN = df_R_copy['American Indian/ Alaska Native'].mean()
other = df_R_copy['Other'].mean()
data = {
'race/ethnicity': ['White', 'Black', 'Mexican American', 'Puerto Rican', 'Other Hispanic', 'Asian/Pacific Islander', 'American Indian/ Alaska Native', 'Other'],
'data': [white, black, mexican, puerto, hispanic, asian, AI_AN, other]
}
df_bar = pd.DataFrame(data)
df_bar.plot(kind='bar', x='race/ethnicity', y='data', legend=False)
# Add labels and title
plt.xlabel('Race/Ethnicity')
plt.ylabel('Avg. Reading SAT score')
plt.title('Avg. Reading SAT score for each race')
# Show the plot
plt.show()
For math, asian/pacific islanders have the highest average score while black people have the lowest average score. For reading, white people have the highest average score while black people have the lowest average score. The disparities between scores are more apparent in math scores than reading scores which makes sense beceause math is a more technical skill and less intuitive.
We want to focus on the socioeconomic factors that go into an SAT score. The goal of our tutorial is to make a product that college admissions offices can use to compare students on more qualities than just their test scores. Many socioeconomic factors play into a student's success in school, so comparing the SAT scores of two students with very different backgrounds one to one may not be equitable. With our model, however, colleges can input these factors for each student they are assessing and see how that student compares to students who share a similar background.
Our model will predict and output the average score of someone with the same state/US territory, gender, race, and family income bracket as the student being viewed. The admissions officer can then compare their student to that average instead of to a general national one.
For our model we decided that we needed a nonlinear method, and chose a Random Forest Regressor.
Creating our table
To create the table to feed to our model, we had to make 192 separate dataframes, each containing the average score for a specific combination of socioeconomic factors. We decided to average the scores for each of the factors, as they all correlate with SAT scores. In these dataframes, we also added columns recording which category each one represents, for example Gender=M, Income=40k, and so on.
To make this process easier, we created a main function that builds each dataframe from a given set of inputs. This made it easy to copy and paste a block of calls and then change only the inputs.
def add_race(race, df_income, race_data, gender, income, subject):
    # Resolve the column names for this gender / income bracket / subject combination.
    gender_word = 'Male' if gender == 'M' else 'Female'
    subject_word = 'Math' if subject == 'M' else 'Reading'
    gender_label = 'Gender.' + gender_word + '.' + ('Math' if subject == 'M' else 'Verbal')
    income_brackets = {0: '<20k', 20: '20-40k', 40: '40-60k',
                       60: '60-80k', 80: '80-100k', 100: '>100k'}
    income_columns = {0: 'Family Income < 20k', 20: 'Family Income 20-40k',
                      40: 'Family Income 40-60k', 60: 'Family Income 60-80k',
                      80: 'Family Income 80-100k', 100: 'Family Income > 100k'}
    income_label = income_columns[income] + ': ' + subject_word + ' Average Score'
    # Keep year, state, and the two relevant score columns, then label the combination.
    main_data = df_income[['Year', 'State', gender_label, income_label]].copy()
    main_data['Gender'] = 'M' if gender == 'M' else 'F'
    main_data['Income'] = income_brackets[income]
    main_data['Subject'] = subject_word
    main_data['Race'] = race
    # Attach the yearly national average score for this race from the race dataset.
    x = 0
    for i in range(2005, 2016):
        main_data.loc[main_data['Year'] == i, race] = race_data.at[x, race]
        x += 1
    # The combined score averages the gender, income, and race components.
    main_data['combined score'] = (main_data[gender_label] + main_data[income_label] + main_data[race]) / 3
    return main_data
#('race', df_income, df_M/df_R, F/M, income, M/R)
Here is an example: the combination for a white male whose family makes less than $20,000 a year, looking at math. The combined score is the average of the male math average, the under-$20k income bracket math average, and the white math average; for the 2005 Alabama row, that is (582 + 462 + 536) / 3 ≈ 526.67, matching the table below.
m_math_0k_white = add_race('White', df_income, df_M, 'M', 0, 'M')
m_math_0k_white
Year | State | Gender.Male.Math | Family Income < 20k: Math Average Score | Gender | Income | Subject | Race | White | combined score | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | 582 | 462.0 | M | <20k | Math | White | 536.0 | 526.666667 |
1 | 2005 | ALASKA | 535 | 464.0 | M | <20k | Math | White | 536.0 | 511.666667 |
2 | 2005 | ARIZONA | 549 | 485.0 | M | <20k | Math | White | 536.0 | 523.333333 |
3 | 2005 | ARKANSAS | 570 | 489.0 | M | <20k | Math | White | 536.0 | 531.666667 |
4 | 2005 | CALIFORNIA | 543 | 451.0 | M | <20k | Math | White | 536.0 | 510.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
572 | 2015 | VIRGINIA | 532 | 441.0 | M | <20k | Math | White | 534.0 | 502.333333 |
573 | 2015 | WASHINGTON | 525 | 449.0 | M | <20k | Math | White | 534.0 | 502.666667 |
574 | 2015 | WEST VIRGINIA | 513 | 448.0 | M | <20k | Math | White | 534.0 | 498.333333 |
575 | 2015 | WISCONSIN | 627 | 573.0 | M | <20k | Math | White | 534.0 | 578.000000 |
576 | 2015 | WYOMING | 604 | NaN | M | <20k | Math | White | 534.0 | NaN |
577 rows × 10 columns
Here is the creation of the rest of the dataframes, calling the above function.
#male math
m_math_0k_white = add_race('White', df_income, df_M, 'M', 0, 'M')
m_math_0k_black = add_race('Black', df_income, df_M, 'M', 0, 'M')
m_math_0k_mexican = add_race('Mexican American', df_income, df_M, 'M', 0, 'M')
m_math_0k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 0, 'M')
m_math_0k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 0, 'M')
m_math_0k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 0, 'M')
m_math_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 0, 'M')
m_math_0k_other = add_race('Other', df_income, df_M, 'M', 0, 'M')
m_math_20k_white = add_race('White', df_income, df_M, 'M', 20, 'M')
m_math_20k_black = add_race('Black', df_income, df_M, 'M', 20, 'M')
m_math_20k_mexican = add_race('Mexican American', df_income, df_M, 'M', 20, 'M')
m_math_20k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 20, 'M')
m_math_20k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 20, 'M')
m_math_20k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 20, 'M')
m_math_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 20, 'M')
m_math_20k_other = add_race('Other', df_income, df_M, 'M', 20, 'M')
m_math_40k_white = add_race('White', df_income, df_M, 'M', 40, 'M')
m_math_40k_black = add_race('Black', df_income, df_M, 'M', 40, 'M')
m_math_40k_mexican = add_race('Mexican American', df_income, df_M, 'M', 40, 'M')
m_math_40k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 40, 'M')
m_math_40k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 40, 'M')
m_math_40k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 40, 'M')
m_math_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 40, 'M')
m_math_40k_other = add_race('Other', df_income, df_M, 'M', 40, 'M')
m_math_60k_white = add_race('White', df_income, df_M, 'M', 60, 'M')
m_math_60k_black = add_race('Black', df_income, df_M, 'M', 60, 'M')
m_math_60k_mexican = add_race('Mexican American', df_income, df_M, 'M', 60, 'M')
m_math_60k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 60, 'M')
m_math_60k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 60, 'M')
m_math_60k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 60, 'M')
m_math_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 60, 'M')
m_math_60k_other = add_race('Other', df_income, df_M, 'M', 60, 'M')
m_math_80k_white = add_race('White', df_income, df_M, 'M', 80, 'M')
m_math_80k_black = add_race('Black', df_income, df_M, 'M', 80, 'M')
m_math_80k_mexican = add_race('Mexican American', df_income, df_M, 'M', 80, 'M')
m_math_80k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 80, 'M')
m_math_80k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 80, 'M')
m_math_80k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 80, 'M')
m_math_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 80, 'M')
m_math_80k_other = add_race('Other', df_income, df_M, 'M', 80, 'M')
m_math_100k_white = add_race('White', df_income, df_M, 'M', 100, 'M')
m_math_100k_black = add_race('Black', df_income, df_M, 'M', 100, 'M')
m_math_100k_mexican = add_race('Mexican American', df_income, df_M, 'M', 100, 'M')
m_math_100k_puerto = add_race('Puerto Rican', df_income, df_M, 'M', 100, 'M')
m_math_100k_hispanic = add_race('Other Hispanic', df_income, df_M, 'M', 100, 'M')
m_math_100k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'M', 100, 'M')
m_math_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'M', 100, 'M')
m_math_100k_other = add_race('Other', df_income, df_M, 'M', 100, 'M')
#female, math
f_math_0k_white = add_race('White', df_income, df_M, 'F', 0, 'M')
f_math_0k_black = add_race('Black', df_income, df_M, 'F', 0, 'M')
f_math_0k_mexican = add_race('Mexican American', df_income, df_M, 'F', 0, 'M')
f_math_0k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 0, 'M')
f_math_0k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 0, 'M')
f_math_0k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 0, 'M')
f_math_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 0, 'M')
f_math_0k_other = add_race('Other', df_income, df_M, 'F', 0, 'M')
f_math_20k_white = add_race('White', df_income, df_M, 'F', 20, 'M')
f_math_20k_black = add_race('Black', df_income, df_M, 'F', 20, 'M')
f_math_20k_mexican = add_race('Mexican American', df_income, df_M, 'F', 20, 'M')
f_math_20k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 20, 'M')
f_math_20k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 20, 'M')
f_math_20k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 20, 'M')
f_math_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 20, 'M')
f_math_20k_other = add_race('Other', df_income, df_M, 'F', 20, 'M')
f_math_40k_white = add_race('White', df_income, df_M, 'F', 40, 'M')
f_math_40k_black = add_race('Black', df_income, df_M, 'F', 40, 'M')
f_math_40k_mexican = add_race('Mexican American', df_income, df_M, 'F', 40, 'M')
f_math_40k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 40, 'M')
f_math_40k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 40, 'M')
f_math_40k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 40, 'M')
f_math_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 40, 'M')
f_math_40k_other = add_race('Other', df_income, df_M, 'F', 40, 'M')
f_math_60k_white = add_race('White', df_income, df_M, 'F', 60, 'M')
f_math_60k_black = add_race('Black', df_income, df_M, 'F', 60, 'M')
f_math_60k_mexican = add_race('Mexican American', df_income, df_M, 'F', 60, 'M')
f_math_60k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 60, 'M')
f_math_60k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 60, 'M')
f_math_60k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 60, 'M')
f_math_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 60, 'M')
f_math_60k_other = add_race('Other', df_income, df_M, 'F', 60, 'M')
f_math_80k_white = add_race('White', df_income, df_M, 'F', 80, 'M')
f_math_80k_black = add_race('Black', df_income, df_M, 'F', 80, 'M')
f_math_80k_mexican = add_race('Mexican American', df_income, df_M, 'F', 80, 'M')
f_math_80k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 80, 'M')
f_math_80k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 80, 'M')
f_math_80k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 80, 'M')
f_math_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 80, 'M')
f_math_80k_other = add_race('Other', df_income, df_M, 'F', 80, 'M')
f_math_100k_white = add_race('White', df_income, df_M, 'F', 100, 'M')
f_math_100k_black = add_race('Black', df_income, df_M, 'F', 100, 'M')
f_math_100k_mexican = add_race('Mexican American', df_income, df_M, 'F', 100, 'M')
f_math_100k_puerto = add_race('Puerto Rican', df_income, df_M, 'F', 100, 'M')
f_math_100k_hispanic = add_race('Other Hispanic', df_income, df_M, 'F', 100, 'M')
f_math_100k_asian = add_race('Asian/Pacific Islander', df_income, df_M, 'F', 100, 'M')
f_math_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_M, 'F', 100, 'M')
f_math_100k_other = add_race('Other', df_income, df_M, 'F', 100, 'M')
#male reading
m_reading_0k_white = add_race('White', df_income, df_R, 'M', 0, 'R')
m_reading_0k_black = add_race('Black', df_income, df_R, 'M', 0, 'R')
m_reading_0k_mexican = add_race('Mexican American', df_income, df_R, 'M', 0, 'R')
m_reading_0k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 0, 'R')
m_reading_0k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 0, 'R')
m_reading_0k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 0, 'R')
m_reading_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 0, 'R')
m_reading_0k_other = add_race('Other', df_income, df_R, 'M', 0, 'R')
m_reading_20k_white = add_race('White', df_income, df_R, 'M', 20, 'R')
m_reading_20k_black = add_race('Black', df_income, df_R, 'M', 20, 'R')
m_reading_20k_mexican = add_race('Mexican American', df_income, df_R, 'M', 20, 'R')
m_reading_20k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 20, 'R')
m_reading_20k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 20, 'R')
m_reading_20k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 20, 'R')
m_reading_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 20, 'R')
m_reading_20k_other = add_race('Other', df_income, df_R, 'M', 20, 'R')
m_reading_40k_white = add_race('White', df_income, df_R, 'M', 40, 'R')
m_reading_40k_black = add_race('Black', df_income, df_R, 'M', 40, 'R')
m_reading_40k_mexican = add_race('Mexican American', df_income, df_R, 'M', 40, 'R')
m_reading_40k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 40, 'R')
m_reading_40k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 40, 'R')
m_reading_40k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 40, 'R')
m_reading_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 40, 'R')
m_reading_40k_other = add_race('Other', df_income, df_R, 'M', 40, 'R')
m_reading_60k_white = add_race('White', df_income, df_R, 'M', 60, 'R')
m_reading_60k_black = add_race('Black', df_income, df_R, 'M', 60, 'R')
m_reading_60k_mexican = add_race('Mexican American', df_income, df_R, 'M', 60, 'R')
m_reading_60k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 60, 'R')
m_reading_60k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 60, 'R')
m_reading_60k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 60, 'R')
m_reading_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 60, 'R')
m_reading_60k_other = add_race('Other', df_income, df_R, 'M', 60, 'R')
m_reading_80k_white = add_race('White', df_income, df_R, 'M', 80, 'R')
m_reading_80k_black = add_race('Black', df_income, df_R, 'M', 80, 'R')
m_reading_80k_mexican = add_race('Mexican American', df_income, df_R, 'M', 80, 'R')
m_reading_80k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 80, 'R')
m_reading_80k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 80, 'R')
m_reading_80k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 80, 'R')
m_reading_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 80, 'R')
m_reading_80k_other = add_race('Other', df_income, df_R, 'M', 80, 'R')
m_reading_100k_white = add_race('White', df_income, df_R, 'M', 100, 'R')
m_reading_100k_black = add_race('Black', df_income, df_R, 'M', 100, 'R')
m_reading_100k_mexican = add_race('Mexican American', df_income, df_R, 'M', 100, 'R')
m_reading_100k_puerto = add_race('Puerto Rican', df_income, df_R, 'M', 100, 'R')
m_reading_100k_hispanic = add_race('Other Hispanic', df_income, df_R, 'M', 100, 'R')
m_reading_100k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'M', 100, 'R')
m_reading_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'M', 100, 'R')
m_reading_100k_other = add_race('Other', df_income, df_R, 'M', 100, 'R')
#female reading
f_reading_0k_white = add_race('White', df_income, df_R, 'F', 0, 'R')
f_reading_0k_black = add_race('Black', df_income, df_R, 'F', 0, 'R')
f_reading_0k_mexican = add_race('Mexican American', df_income, df_R, 'F', 0, 'R')
f_reading_0k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 0, 'R')
f_reading_0k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 0, 'R')
f_reading_0k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 0, 'R')
f_reading_0k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 0, 'R')
f_reading_0k_other = add_race('Other', df_income, df_R, 'F', 0, 'R')
f_reading_20k_white = add_race('White', df_income, df_R, 'F', 20, 'R')
f_reading_20k_black = add_race('Black', df_income, df_R, 'F', 20, 'R')
f_reading_20k_mexican = add_race('Mexican American', df_income, df_R, 'F', 20, 'R')
f_reading_20k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 20, 'R')
f_reading_20k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 20, 'R')
f_reading_20k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 20, 'R')
f_reading_20k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 20, 'R')
f_reading_20k_other = add_race('Other', df_income, df_R, 'F', 20, 'R')
f_reading_40k_white = add_race('White', df_income, df_R, 'F', 40, 'R')
f_reading_40k_black = add_race('Black', df_income, df_R, 'F', 40, 'R')
f_reading_40k_mexican = add_race('Mexican American', df_income, df_R, 'F', 40, 'R')
f_reading_40k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 40, 'R')
f_reading_40k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 40, 'R')
f_reading_40k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 40, 'R')
f_reading_40k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 40, 'R')
f_reading_40k_other = add_race('Other', df_income, df_R, 'F', 40, 'R')
f_reading_60k_white = add_race('White', df_income, df_R, 'F', 60, 'R')
f_reading_60k_black = add_race('Black', df_income, df_R, 'F', 60, 'R')
f_reading_60k_mexican = add_race('Mexican American', df_income, df_R, 'F', 60, 'R')
f_reading_60k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 60, 'R')
f_reading_60k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 60, 'R')
f_reading_60k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 60, 'R')
f_reading_60k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 60, 'R')
f_reading_60k_other = add_race('Other', df_income, df_R, 'F', 60, 'R')
f_reading_80k_white = add_race('White', df_income, df_R, 'F', 80, 'R')
f_reading_80k_black = add_race('Black', df_income, df_R, 'F', 80, 'R')
f_reading_80k_mexican = add_race('Mexican American', df_income, df_R, 'F', 80, 'R')
f_reading_80k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 80, 'R')
f_reading_80k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 80, 'R')
f_reading_80k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 80, 'R')
f_reading_80k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 80, 'R')
f_reading_80k_other = add_race('Other', df_income, df_R, 'F', 80, 'R')
f_reading_100k_white = add_race('White', df_income, df_R, 'F', 100, 'R')
f_reading_100k_black = add_race('Black', df_income, df_R, 'F', 100, 'R')
f_reading_100k_mexican = add_race('Mexican American', df_income, df_R, 'F', 100, 'R')
f_reading_100k_puerto = add_race('Puerto Rican', df_income, df_R, 'F', 100, 'R')
f_reading_100k_hispanic = add_race('Other Hispanic', df_income, df_R, 'F', 100, 'R')
f_reading_100k_asian = add_race('Asian/Pacific Islander', df_income, df_R, 'F', 100, 'R')
f_reading_100k_AI_AN = add_race('American Indian/ Alaska Native', df_income, df_R, 'F', 100, 'R')
f_reading_100k_other = add_race('Other', df_income, df_R, 'F', 100, 'R')
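As an aside (a sketch, not how the notebook was actually run), the same 192 dataframes could be generated and collected in a single loop using the add_race function above:
# Illustrative sketch: build every gender x income x race combination for both
# subjects in one loop instead of naming each dataframe by hand.
from itertools import product

races = ['White', 'Black', 'Mexican American', 'Puerto Rican', 'Other Hispanic',
         'Asian/Pacific Islander', 'American Indian/ Alaska Native', 'Other']
frames = []
for subject, race_df in (('M', df_M), ('R', df_R)):
    for gender, income, race in product(('M', 'F'), (0, 20, 40, 60, 80, 100), races):
        frames.append(add_race(race, df_income, race_df, gender, income, subject))
df_model_prep_alt = pd.concat(frames, ignore_index=True)  # 192 frames in total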
Now that we have every dataframe we need, we have to prepare the data for the model. The first step is to concatenate all the dataframes into one cohesive dataframe that we can feed to the model.
df_model_prep = pd.concat([m_math_0k_white, m_math_0k_black, m_math_0k_mexican, m_math_0k_puerto, m_math_0k_hispanic, m_math_0k_asian, m_math_0k_AI_AN, m_math_0k_other, m_math_20k_white, m_math_20k_black, m_math_20k_mexican, m_math_20k_puerto, m_math_20k_hispanic, m_math_20k_asian, m_math_20k_AI_AN, m_math_20k_other, m_math_40k_white, m_math_40k_black, m_math_40k_mexican, m_math_40k_puerto, m_math_40k_hispanic, m_math_40k_asian, m_math_40k_AI_AN, m_math_40k_other, m_math_60k_white, m_math_60k_black, m_math_60k_mexican, m_math_60k_puerto, m_math_60k_hispanic, m_math_60k_asian, m_math_60k_AI_AN, m_math_60k_other, m_math_80k_white, m_math_80k_black, m_math_80k_mexican, m_math_80k_puerto, m_math_80k_hispanic, m_math_80k_asian, m_math_80k_AI_AN, m_math_80k_other, m_math_100k_white, m_math_100k_black, m_math_100k_mexican, m_math_100k_puerto, m_math_100k_hispanic, m_math_100k_asian, m_math_100k_AI_AN, m_math_100k_other,
f_math_0k_white, f_math_0k_black, f_math_0k_mexican, f_math_0k_puerto, f_math_0k_hispanic, f_math_0k_asian, f_math_0k_AI_AN, f_math_0k_other, f_math_20k_white, f_math_20k_black, f_math_20k_mexican, f_math_20k_puerto, f_math_20k_hispanic, f_math_20k_asian, f_math_20k_AI_AN, f_math_20k_other, f_math_40k_white, f_math_40k_black, f_math_40k_mexican, f_math_40k_puerto, f_math_40k_hispanic, f_math_40k_asian, f_math_40k_AI_AN, f_math_40k_other, f_math_60k_white, f_math_60k_black, f_math_60k_mexican, f_math_60k_puerto, f_math_60k_hispanic, f_math_60k_asian, f_math_60k_AI_AN, f_math_60k_other, f_math_80k_white, f_math_80k_black, f_math_80k_mexican, f_math_80k_puerto, f_math_80k_hispanic, f_math_80k_asian, f_math_80k_AI_AN, f_math_80k_other, f_math_100k_white, f_math_100k_black, f_math_100k_mexican, f_math_100k_puerto, f_math_100k_hispanic, f_math_100k_asian, f_math_100k_AI_AN, f_math_100k_other,
m_reading_0k_white, m_reading_0k_black, m_reading_0k_mexican, m_reading_0k_puerto, m_reading_0k_hispanic, m_reading_0k_asian, m_reading_0k_AI_AN, m_reading_0k_other, m_reading_20k_white, m_reading_20k_black, m_reading_20k_mexican, m_reading_20k_puerto, m_reading_20k_hispanic, m_reading_20k_asian, m_reading_20k_AI_AN, m_reading_20k_other, m_reading_40k_white, m_reading_40k_black, m_reading_40k_mexican, m_reading_40k_puerto, m_reading_40k_hispanic, m_reading_40k_asian, m_reading_40k_AI_AN, m_reading_40k_other, m_reading_60k_white, m_reading_60k_black, m_reading_60k_mexican, m_reading_60k_puerto, m_reading_60k_hispanic, m_reading_60k_asian, m_reading_60k_AI_AN, m_reading_60k_other, m_reading_80k_white, m_reading_80k_black, m_reading_80k_mexican, m_reading_80k_puerto, m_reading_80k_hispanic, m_reading_80k_asian, m_reading_80k_AI_AN, m_reading_80k_other, m_reading_100k_white, m_reading_100k_black, m_reading_100k_mexican, m_reading_100k_puerto, m_reading_100k_hispanic, m_reading_100k_asian, m_reading_100k_AI_AN, m_reading_100k_other,
f_reading_0k_white, f_reading_0k_black, f_reading_0k_mexican, f_reading_0k_puerto, f_reading_0k_hispanic, f_reading_0k_asian, f_reading_0k_AI_AN, f_reading_0k_other, f_reading_20k_white, f_reading_20k_black, f_reading_20k_mexican, f_reading_20k_puerto, f_reading_20k_hispanic, f_reading_20k_asian, f_reading_20k_AI_AN, f_reading_20k_other, f_reading_40k_white, f_reading_40k_black, f_reading_40k_mexican, f_reading_40k_puerto, f_reading_40k_hispanic, f_reading_40k_asian, f_reading_40k_AI_AN, f_reading_40k_other, f_reading_60k_white, f_reading_60k_black, f_reading_60k_mexican, f_reading_60k_puerto, f_reading_60k_hispanic, f_reading_60k_asian, f_reading_60k_AI_AN, f_reading_60k_other, f_reading_80k_white, f_reading_80k_black, f_reading_80k_mexican, f_reading_80k_puerto, f_reading_80k_hispanic, f_reading_80k_asian, f_reading_80k_AI_AN, f_reading_80k_other, f_reading_100k_white, f_reading_100k_black, f_reading_100k_mexican, f_reading_100k_puerto, f_reading_100k_hispanic, f_reading_100k_asian, f_reading_100k_AI_AN, f_reading_100k_other],ignore_index=True, axis=0)
df_model_prep
Year | State | Gender.Male.Math | Family Income < 20k: Math Average Score | Gender | Income | Subject | Race | White | combined score | ... | Family Income > 100k: Math Average Score | Gender.Female.Math | Gender.Male.Verbal | Family Income < 20k: Reading Average Score | Family Income 20-40k: Reading Average Score | Family Income 40-60k: Reading Average Score | Family Income 60-80k: Reading Average Score | Family Income 80-100k: Reading Average Score | Family Income > 100k: Reading Average Score | Gender.Female.Verbal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | 582.0 | 462.0 | M | <20k | Math | White | 536.0 | 526.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2005 | ALASKA | 535.0 | 464.0 | M | <20k | Math | White | 536.0 | 511.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2005 | ARIZONA | 549.0 | 485.0 | M | <20k | Math | White | 536.0 | 523.333333 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2005 | ARKANSAS | 570.0 | 489.0 | M | <20k | Math | White | 536.0 | 531.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2005 | CALIFORNIA | 543.0 | 451.0 | M | <20k | Math | White | 536.0 | 510.000000 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
110779 | 2015 | VIRGINIA | NaN | NaN | F | >100k | Reading | Other | NaN | 518.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 550.0 | 516.0 |
110780 | 2015 | WASHINGTON | NaN | NaN | F | >100k | Reading | Other | NaN | 514.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 552.0 | 502.0 |
110781 | 2015 | WEST VIRGINIA | NaN | NaN | F | >100k | Reading | Other | NaN | 509.333333 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 532.0 | 506.0 |
110782 | 2015 | WISCONSIN | NaN | NaN | F | >100k | Reading | Other | NaN | 559.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 603.0 | 586.0 |
110783 | 2015 | WYOMING | NaN | NaN | F | >100k | Reading | Other | NaN | 558.666667 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 591.0 | 595.0 |
110784 rows × 31 columns
Below is the table for our model. It contains a row for each combination of year, state, gender, income, subject, and race that appears in our original datasets. There were originally 288 null values, but because that is only about 0.26% of the rows, it is safe to drop them.
df_model = df_model_prep[['Year', 'State', 'Gender', 'Income', 'Subject', 'Race', 'combined score']]
df_model
Year | State | Gender | Income | Subject | Race | combined score | |
---|---|---|---|---|---|---|---|
0 | 2005 | ALABAMA | M | <20k | Math | White | 526.666667 |
1 | 2005 | ALASKA | M | <20k | Math | White | 511.666667 |
2 | 2005 | ARIZONA | M | <20k | Math | White | 523.333333 |
3 | 2005 | ARKANSAS | M | <20k | Math | White | 531.666667 |
4 | 2005 | CALIFORNIA | M | <20k | Math | White | 510.000000 |
... | ... | ... | ... | ... | ... | ... | ... |
110779 | 2015 | VIRGINIA | F | >100k | Reading | Other | 518.666667 |
110780 | 2015 | WASHINGTON | F | >100k | Reading | Other | 514.666667 |
110781 | 2015 | WEST VIRGINIA | F | >100k | Reading | Other | 509.333333 |
110782 | 2015 | WISCONSIN | F | >100k | Reading | Other | 559.666667 |
110783 | 2015 | WYOMING | F | >100k | Reading | Other | 558.666667 |
110784 rows × 7 columns
df_model.isnull().sum().sum() #288 null values
df_model = df_model.dropna()
Here is the spread of scores. The lowest possible score on a single section of the SAT is 200 and the highest is 800. We can see that most of the averaged scores lie within the 480 to 550 range.
df_model.hist(['combined score'])
plt.xlabel('Score')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
The goal of our model is to make a product that college admissions offices can use to compare students' test scores in light of their socioeconomic backgrounds. Many socioeconomic factors in our nation put people at an advantage or disadvantage depending on their background, which is something a person cannot control. By estimating what someone from a student's socioeconomic background would score, the comparison becomes more equitable, since students are compared on a smaller, more specific scale.
Below, we define the features for the model and then create dummy (one-hot) variables from them. Because our data does not contain observations of individual test takers, each categorical feature has to become its own set of columns in the dataframe.
features = ['State', 'Gender', 'Income', 'Subject', "Race"]
df_dummies = pd.get_dummies(df_model[features])
df_dummies.head()
State_ALABAMA | State_ALASKA | State_ARIZONA | State_ARKANSAS | State_CALIFORNIA | State_COLORADO | State_CONNECTICUT | State_DELAWARE | State_DISTRICT OF COLUMBIA | State_FLORIDA | ... | Subject_Math | Subject_Reading | Race_American Indian/ Alaska Native | Race_Asian/Pacific Islander | Race_Black | Race_Mexican American | Race_Other | Race_Other Hispanic | Race_Puerto Rican | Race_White | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 71 columns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import math
x = df_dummies
y = df_model['combined score']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
regressor = RandomForestRegressor(n_estimators=100, max_features = 61, criterion = 'squared_error',
random_state = 0)
regressor.fit(x,y)  # note: this fits on the full table, so rows in X_test were also seen during training and the RMSE below is optimistic
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('RMSE:', math.sqrt(mse))
RMSE: 8.141284713443188
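For context (an addition, not part of the original notebook), a trivial baseline that always predicts the mean training score gives a point of comparison on the same test split:
# Baseline for comparison (illustrative addition): always predict the mean of
# the training scores and measure the RMSE on the same held-out rows.
baseline_pred = np.full(len(y_test), y_train.mean())
print('Baseline RMSE:', math.sqrt(mean_squared_error(y_test, baseline_pred)))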
Our RMSE is around 8, which is relatively small considering SAT section scores range from 200 to 800. Now we are going to use this model to predict the average math score for two different students. The first student is a white female from Alabama whose household income falls in the 80-100k bracket. The second is a Black male from California whose family income falls in the 20-40k bracket.
Unfortunately, there was not a more convenient way to hand the model a single instance, so we have to put in a value for every column instead of just the ones that apply to that person (a small helper sketch for this appears after the two predictions below).
# Math, white, female 80-100k, Alabama
new_instance = pd.DataFrame({'State_ALABAMA': [1], 'State_ALASKA' : [0], 'State_ARIZONA' : [0],'State_ARKANSAS': [0], 'State_CALIFORNIA' :[0], 'State_COLORADO':[0], 'State_CONNECTICUT':[0],
'State_DELAWARE':[0], 'State_DISTRICT OF COLUMBIA': [0], 'State_FLORIDA' :[0], 'State_GEORGIA': [0], 'State_HAWAII' : [0], 'State_IDAHO':[0],'State_ILLINOIS':[0],
'State_INDIANA':[0], 'State_IOWA':[0], 'State_KANSAS':[0], 'State_KENTUCKY':[0], 'State_LOUISIANA':[0], 'State_MAINE':[0],
'State_MARYLAND':[0], 'State_MASSACHUSETTS':[0], 'State_MICHIGAN':[0], 'State_MINNESOTA':[0],'State_MISSISSIPPI':[0], 'State_MISSOURI':[0], 'State_MONTANA':[0], 'State_NEBRASKA':[0],
'State_NEVADA':[0], 'State_NEW HAMPSHIRE':[0], 'State_NEW JERSEY':[0], 'State_NEW MEXICO':[0], 'State_NEW YORK':[0], 'State_NORTH CAROLINA':[0], 'State_NORTH DAKOTA':[0],
'State_OHIO':[0], 'State_OKLAHOMA':[0], 'State_OREGON':[0], 'State_PENNSYLVANIA':[0], 'State_PUERTO RICO':[0], 'State_RHODE ISLAND':[0], 'State_SOUTH CAROLINA':[0], 'State_SOUTH DAKOTA':[0], 'State_TENNESSEE':[0],
'State_TEXAS':[0], 'State_UTAH':[0], 'State_VERMONT':[0], 'State_VIRGIN ISLANDS':[0],'State_VIRGINIA':[0], 'State_WASHINGTON':[0], 'State_WEST VIRGINIA':[0], 'State_WISCONSIN':[0], 'State_WYOMING':[0],
'Gender_F': [1], 'Gender_M': [0], 'Income_20-40k': [0], 'Income_40-60k':[0],'Income_60-80k': [0],'Income_80-100k': [1],'Income_<20k':[0],'Income_>100k': [0], 'Subject_Math': [1], 'Subject_Reading': [0], 'Race_American Indian/ Alaska Native': [0],
'Race_Asian/Pacific Islander' : [0], 'Race_Black' : [0], 'Race_Mexican American' : [0], 'Race_Other' : [0],'Race_Other Hispanic' : [0], 'Race_Puerto Rican' : [0], 'Race_White' : [1] })
predicted_score = regressor.predict(new_instance)
print(f'Predicted Average Score: {predicted_score[0]}')
Predicted Average Score: 539.6554671418911
The Predicted Average Score is 539.6554671418911
# Math, black, male, 20-40k, California
new_instance = pd.DataFrame({'State_ALABAMA': [0], 'State_ALASKA' : [0], 'State_ARIZONA' : [0],'State_ARKANSAS': [0], 'State_CALIFORNIA' :[1], 'State_COLORADO':[0], 'State_CONNECTICUT':[0],
'State_DELAWARE':[0], 'State_DISTRICT OF COLUMBIA': [0], 'State_FLORIDA' :[0], 'State_GEORGIA': [0], 'State_HAWAII' : [0], 'State_IDAHO':[0],'State_ILLINOIS':[0],
'State_INDIANA':[0], 'State_IOWA':[0], 'State_KANSAS':[0], 'State_KENTUCKY':[0], 'State_LOUISIANA':[0], 'State_MAINE':[0],
'State_MARYLAND':[0], 'State_MASSACHUSETTS':[0], 'State_MICHIGAN':[0], 'State_MINNESOTA':[0],'State_MISSISSIPPI':[0], 'State_MISSOURI':[0], 'State_MONTANA':[0], 'State_NEBRASKA':[0],
'State_NEVADA':[0], 'State_NEW HAMPSHIRE':[0], 'State_NEW JERSEY':[0], 'State_NEW MEXICO':[0], 'State_NEW YORK':[0], 'State_NORTH CAROLINA':[0], 'State_NORTH DAKOTA':[0],
'State_OHIO':[0], 'State_OKLAHOMA':[0], 'State_OREGON':[0], 'State_PENNSYLVANIA':[0], 'State_PUERTO RICO':[0], 'State_RHODE ISLAND':[0], 'State_SOUTH CAROLINA':[0], 'State_SOUTH DAKOTA':[0], 'State_TENNESSEE':[0],
'State_TEXAS':[0], 'State_UTAH':[0], 'State_VERMONT':[0], 'State_VIRGIN ISLANDS':[0],'State_VIRGINIA':[0], 'State_WASHINGTON':[0], 'State_WEST VIRGINIA':[0], 'State_WISCONSIN':[0], 'State_WYOMING':[0],
'Gender_F': [0], 'Gender_M': [1], 'Income_20-40k': [1], 'Income_40-60k':[0],'Income_60-80k': [0],'Income_80-100k': [0],'Income_<20k':[0],'Income_>100k': [0], 'Subject_Math': [1], 'Subject_Reading': [0], 'Race_American Indian/ Alaska Native': [0],
'Race_Asian/Pacific Islander' : [0], 'Race_Black' : [1], 'Race_Mexican American' : [0], 'Race_Other' : [0],'Race_Other Hispanic' : [0], 'Race_Puerto Rican' : [0], 'Race_White' : [0] })
predicted_score = regressor.predict(new_instance)
print(f'Predicted Average Score: {predicted_score[0]}')
Predicted Average Score: 481.79334407141005
The Predicted Average Score is 481.79334407141005
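To avoid typing out every dummy column by hand, a small helper could build the one-hot row from just the five category values; this is a hypothetical convenience (not part of the original notebook), assuming x and regressor from above are still in scope.
# Hypothetical helper: start from an all-zero row with the training columns,
# then switch on only the dummies that describe the student.
def make_instance(state, gender, income, subject, race, columns):
    row = pd.Series(0, index=columns)
    for col in ('State_' + state, 'Gender_' + gender, 'Income_' + income,
                'Subject_' + subject, 'Race_' + race):
        if col not in row.index:
            raise KeyError('unknown category: ' + col)
        row[col] = 1
    return row.to_frame().T  # single-row DataFrame in the training column order

# The same two students as above, expressed more briefly.
student_1 = make_instance('ALABAMA', 'F', '80-100k', 'Math', 'White', x.columns)
student_2 = make_instance('CALIFORNIA', 'M', '20-40k', 'Math', 'Black', x.columns)
print(regressor.predict(student_1)[0], regressor.predict(student_2)[0])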
Let's say both student 1 and student 2 scored 520 on their SAT math section. Although these are the same score, 520 is roughly 38 points above the average for someone with the same background as the second student, but about 20 points below the average for someone with the first student's background.
Decision Tree
Going in, we believed that income would have the largest impact on the predicted SAT score, given the very high correlation with scores shown earlier in the tutorial.
To examine this hypothesis and see which other features matter most for the predicted average score, we created a decision tree visualization using a Random Forest Regressor. The max depth is set to 4 and the number of estimators to 2 so that the trees stay readable. Increasing the depth and the number of estimators prints more, deeper trees.
from sklearn import tree
regressor_tree = RandomForestRegressor(n_estimators=2, max_features = 61, criterion = 'squared_error',
random_state = 0, max_depth=4)
regressor_tree.fit(x,y)
features = x.columns.values
classes = ['0', '1', '2']
for estimator in regressor_tree.estimators_:
    print(estimator)
    plt.figure(figsize=(20,8))
    tree.plot_tree(estimator,
                   feature_names=features,
                   class_names=classes,
                   fontsize=8,
                   filled=True,
                   rounded=True)
    plt.show()
DecisionTreeRegressor(max_depth=4, max_features=61, random_state=209652396)
DecisionTreeRegressor(max_depth=4, max_features=61, random_state=398764591)
The trees above do not exactly support our hypothesis. It appears that being from the Virgin Islands, a US territory, carries the most weight when the model predicts the average score. This makes some sense, however, as being from the Virgin Islands is a more unusual case in the data.
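Beyond reading two shallow trees, the full 100-tree model also exposes impurity-based feature importances, which give another view of which features it leans on (an addition to the original analysis):
# Illustrative addition: rank the dummy features by the fitted forest's
# impurity-based importance scores.
importances = pd.Series(regressor.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))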
With our data, we were able to successfully create a predictive model that estimates the average SAT score from socioeconomic factors such as race, income, and state. However, based on the results of our exploratory analysis and the lack of individual test scores, we decided not to include 8th grade scores as a feature in the predictive model. In the future, though, we would be interested in building a similar model that uses those past test scores, along with other academic markers like GPA, to predict SAT scores.
A trend we noticed during our exploratory analysis is that whenever we compared math and reading separately, there was always a larger discrepancy in math. This makes sense because math is a more technical subject that is not as intuitive as reading may be.
We also saw in the decision tree regressor that being from the Virgin Islands sat at the top, followed by being Asian, being white, and having an income of less than 20k. We reasoned that being from the Virgin Islands was significant because it is a U.S. territory rather than a state, so its data may be drastically different. From our exploratory analysis, we saw that Asian/Pacific Islander students have the highest average SAT scores, followed by white students, which may explain why those features sit so high in the tree. Additionally, a household income of less than 20k puts a student below the poverty line, which has a significant impact on performance because basic needs are not being met.
Overall, the model makes fairly accurate predictions of scores given the data it is fed. It is structured with plenty of room for expansion, whether to test other features or to incorporate a more recent dataset to improve accuracy and better match today's trends. This model could be very beneficial to universities that wish to evaluate students' SAT scores equitably, and it may help balance out some of the disparities in scores caused by socioeconomic factors outside a student's control.