Regression¶

Goal! Goal! Goal!¶

mu2.jpg

Download data

Back to spell book

1. Load Data¶

1.1 Libraries¶

In [1]:
import pandas as pd

Sometimes, we may need to use a specific encoding:

encoding = "ISO-8859-1"

encoding = "utf-8"

1.2 Data¶

In [2]:
football = pd.read_csv("football_2.csv", encoding = "ISO-8859-1")
football.head()
Out[2]:
ID Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 207439 L. Paredes 24 https://cdn.sofifa.org/players/4/19/207439.png Argentina https://cdn.sofifa.org/flags/52.png 80 85 NaN https://cdn.sofifa.org/flags/52.png ... 74.0 73.0 75.0 72.0 9.0 14.0 6.0 9.0 10.0 NaN
1 156713 A. Granqvist 33 https://cdn.sofifa.org/players/4/19/156713.png Sweden https://cdn.sofifa.org/flags/46.png 80 80 NaN https://cdn.sofifa.org/flags/46.png ... 78.0 82.0 83.0 79.0 7.0 9.0 12.0 10.0 15.0 NaN
2 229909 A. Lunev 26 https://cdn.sofifa.org/players/4/19/229909.png Russia https://cdn.sofifa.org/flags/40.png 79 81 NaN https://cdn.sofifa.org/flags/40.png ... 69.0 18.0 20.0 12.0 80.0 73.0 65.0 77.0 85.0 NaN
3 187347 I. Smolnikov 29 https://cdn.sofifa.org/players/4/19/187347.png Russia https://cdn.sofifa.org/flags/40.png 79 79 NaN https://cdn.sofifa.org/flags/40.png ... 73.0 76.0 76.0 80.0 7.0 12.0 10.0 8.0 15.0 NaN
4 153260 Hilton 40 https://cdn.sofifa.org/players/4/19/153260.png Brazil https://cdn.sofifa.org/flags/54.png 78 78 Montpellier HSC https://cdn.sofifa.org/teams/2/light/70.png ... 70.0 83.0 77.0 76.0 12.0 7.0 11.0 12.0 13.0 NaN

5 rows × 88 columns

Variable names. Hard to read without the index.

In [3]:
football.columns.values.tolist()
Out[3]:
['ID',
 'Name',
 'Age',
 'Photo',
 'Nationality',
 'Flag',
 'Overall',
 'Potential',
 'Club',
 'Club Logo',
 'Value',
 'Wage',
 'Special',
 'Preferred Foot',
 'International Reputation',
 'Weak Foot',
 'Skill Moves',
 'Work Rate',
 'Body Type',
 'Real Face',
 'Position',
 'Jersey Number',
 'Joined',
 'Loaned From',
 'Contract Valid Until',
 'Height',
 'Weight',
 'LS',
 'ST',
 'RS',
 'LW',
 'LF',
 'CF',
 'RF',
 'RW',
 'LAM',
 'CAM',
 'RAM',
 'LM',
 'LCM',
 'CM',
 'RCM',
 'RM',
 'LWB',
 'LDM',
 'CDM',
 'RDM',
 'RWB',
 'LB',
 'LCB',
 'CB',
 'RCB',
 'RB',
 'Crossing',
 'Finishing',
 'HeadingAccuracy',
 'ShortPassing',
 'Volleys',
 'Dribbling',
 'Curve',
 'FKAccuracy',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Reactions',
 'Balance',
 'ShotPower',
 'Jumping',
 'Stamina',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Positioning',
 'Vision',
 'Penalties',
 'Composure',
 'Marking',
 'StandingTackle',
 'SlidingTackle',
 'GKDiving',
 'GKHandling',
 'GKKicking',
 'GKPositioning',
 'GKReflexes',
 'Release Clause']
In [4]:
football_variables_df = pd.DataFrame(football.columns.values, columns = ["Variables"])
football_variables_df
Out[4]:
Variables
0 ID
1 Name
2 Age
3 Photo
4 Nationality
... ...
83 GKHandling
84 GKKicking
85 GKPositioning
86 GKReflexes
87 Release Clause

88 rows × 1 columns

Disply all rows

In [5]:
print(football_variables_df.to_string())
                   Variables
0                         ID
1                       Name
2                        Age
3                      Photo
4                Nationality
5                       Flag
6                    Overall
7                  Potential
8                       Club
9                  Club Logo
10                     Value
11                      Wage
12                   Special
13            Preferred Foot
14  International Reputation
15                 Weak Foot
16               Skill Moves
17                 Work Rate
18                 Body Type
19                 Real Face
20                  Position
21             Jersey Number
22                    Joined
23               Loaned From
24      Contract Valid Until
25                    Height
26                    Weight
27                        LS
28                        ST
29                        RS
30                        LW
31                        LF
32                        CF
33                        RF
34                        RW
35                       LAM
36                       CAM
37                       RAM
38                        LM
39                       LCM
40                        CM
41                       RCM
42                        RM
43                       LWB
44                       LDM
45                       CDM
46                       RDM
47                       RWB
48                        LB
49                       LCB
50                        CB
51                       RCB
52                        RB
53                  Crossing
54                 Finishing
55           HeadingAccuracy
56              ShortPassing
57                   Volleys
58                 Dribbling
59                     Curve
60                FKAccuracy
61               LongPassing
62               BallControl
63              Acceleration
64               SprintSpeed
65                   Agility
66                 Reactions
67                   Balance
68                 ShotPower
69                   Jumping
70                   Stamina
71                  Strength
72                 LongShots
73                Aggression
74             Interceptions
75               Positioning
76                    Vision
77                 Penalties
78                 Composure
79                   Marking
80            StandingTackle
81             SlidingTackle
82                  GKDiving
83                GKHandling
84                 GKKicking
85             GKPositioning
86                GKReflexes
87            Release Clause
In [6]:
print(football.dtypes.to_string())
ID                            int64
Name                         object
Age                           int64
Photo                        object
Nationality                  object
Flag                         object
Overall                       int64
Potential                     int64
Club                         object
Club Logo                    object
Value                         int64
Wage                          int64
Special                       int64
Preferred Foot               object
International Reputation    float64
Weak Foot                   float64
Skill Moves                 float64
Work Rate                    object
Body Type                    object
Real Face                    object
Position                     object
Jersey Number               float64
Joined                       object
Loaned From                  object
Contract Valid Until         object
Height                       object
Weight                       object
LS                           object
ST                           object
RS                           object
LW                           object
LF                           object
CF                           object
RF                           object
RW                           object
LAM                          object
CAM                          object
RAM                          object
LM                           object
LCM                          object
CM                           object
RCM                          object
RM                           object
LWB                          object
LDM                          object
CDM                          object
RDM                          object
RWB                          object
LB                           object
LCB                          object
CB                           object
RCB                          object
RB                           object
Crossing                    float64
Finishing                   float64
HeadingAccuracy             float64
ShortPassing                float64
Volleys                     float64
Dribbling                   float64
Curve                       float64
FKAccuracy                  float64
LongPassing                 float64
BallControl                 float64
Acceleration                float64
SprintSpeed                 float64
Agility                     float64
Reactions                   float64
Balance                     float64
ShotPower                   float64
Jumping                     float64
Stamina                     float64
Strength                    float64
LongShots                   float64
Aggression                  float64
Interceptions               float64
Positioning                 float64
Vision                      float64
Penalties                   float64
Composure                   float64
Marking                     float64
StandingTackle              float64
SlidingTackle               float64
GKDiving                    float64
GKHandling                  float64
GKKicking                   float64
GKPositioning               float64
GKReflexes                  float64
Release Clause               object

1.3 Filter required records and variables¶

Filter for strikers only.

In [7]:
football_2 = football[football["Position"] == "ST"] 
football_2.head()
Out[7]:
ID Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
5 187607 A. Dzyuba 29 https://cdn.sofifa.org/players/4/19/187607.png Russia https://cdn.sofifa.org/flags/40.png 78 78 NaN https://cdn.sofifa.org/flags/40.png ... 70.0 21.0 15.0 19.0 15.0 12.0 11.0 11.0 8.0 NaN
8 183389 G. Sio 29 https://cdn.sofifa.org/players/4/19/183389.png Ivory Coast https://cdn.sofifa.org/flags/108.png 77 77 NaN https://cdn.sofifa.org/flags/108.png ... 72.0 40.0 18.0 12.0 15.0 9.0 10.0 15.0 16.0 NaN
18 245683 K. Fofana 26 https://cdn.sofifa.org/players/4/19/245683.png Ivory Coast https://cdn.sofifa.org/flags/108.png 75 75 NaN https://cdn.sofifa.org/flags/108.png ... 83.0 23.0 37.0 46.0 7.0 11.0 7.0 11.0 14.0 NaN
45 190461 B. Sigur̡arson 27 https://cdn.sofifa.org/players/4/19/190461.png Iceland https://cdn.sofifa.org/flags/24.png 73 74 NaN https://cdn.sofifa.org/flags/24.png ... 76.0 31.0 39.0 24.0 9.0 12.0 10.0 15.0 16.0 NaN
65 225900 J. Sambenito 26 https://cdn.sofifa.org/players/4/19/225900.png Paraguay https://cdn.sofifa.org/flags/58.png 71 74 NaN https://cdn.sofifa.org/flags/58.png ... 74.0 15.0 16.0 16.0 15.0 16.0 15.0 7.0 7.0 NaN

5 rows × 88 columns

Filter for the required varables.

In [8]:
import numpy as np
In [9]:
football_3 = football_2.iloc[:, np.r_[2, 13, 18, 67, 68, 73, 75, 78, 11]]
football_3.head()
Out[9]:
Age Preferred Foot Body Type Balance ShotPower Aggression Positioning Composure Wage
5 29 Right Stocky 32.0 78.0 75.0 78.0 70.0 1105
8 29 Left Normal 73.0 77.0 77.0 76.0 72.0 2138
18 26 Right Normal 60.0 78.0 67.0 72.0 83.0 3875
45 27 Right Normal 76.0 68.0 73.0 73.0 76.0 3661
65 26 Right Lean 64.0 73.0 49.0 75.0 74.0 2445
In [10]:
# Or simply (if no ranges are used)

football_2.iloc[:, [2, 13, 18, 67, 68, 73, 75, 78, 11]]
Out[10]:
Age Preferred Foot Body Type Balance ShotPower Aggression Positioning Composure Wage
5 29 Right Stocky 32.0 78.0 75.0 78.0 70.0 1105
8 29 Left Normal 73.0 77.0 77.0 76.0 72.0 2138
18 26 Right Normal 60.0 78.0 67.0 72.0 83.0 3875
45 27 Right Normal 76.0 68.0 73.0 73.0 76.0 3661
65 26 Right Lean 64.0 73.0 49.0 75.0 74.0 2445
... ... ... ... ... ... ... ... ... ...
18181 19 Right Lean 64.0 67.0 38.0 61.0 52.0 3399
18184 21 Right Stocky 70.0 64.0 32.0 56.0 51.0 9389
18188 21 Right Normal 53.0 61.0 62.0 60.0 61.0 10780
18190 19 Right Normal 68.0 61.0 51.0 67.0 62.0 10121
18203 16 Right Lean 60.0 61.0 36.0 62.0 63.0 8358

2152 rows × 9 columns

2. Regression¶

2.1 Training-Validation Split¶

In [11]:
from sklearn.model_selection import train_test_split

Define predictors and target variable.

Creating dummies applies to categorical variables only.

If a prefix is desired:

X = pd.get_dummies(X, prefix_sep = 'dummy', drop_first = True)

In [12]:
X = football_3.drop(columns = ["Wage"])
# Get dummies for the caterogical variables

X = pd.get_dummies(X, drop_first = True)

y = football_3["Wage"]
In [13]:
X
Out[13]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
5 29 32.0 78.0 75.0 78.0 70.0 1 0 0 1
8 29 73.0 77.0 77.0 76.0 72.0 0 0 1 0
18 26 60.0 78.0 67.0 72.0 83.0 1 0 1 0
45 27 76.0 68.0 73.0 73.0 76.0 1 0 1 0
65 26 64.0 73.0 49.0 75.0 74.0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ...
18181 19 64.0 67.0 38.0 61.0 52.0 1 1 0 0
18184 21 70.0 64.0 32.0 56.0 51.0 1 0 0 1
18188 21 53.0 61.0 62.0 60.0 61.0 1 0 1 0
18190 19 68.0 61.0 51.0 67.0 62.0 1 0 1 0
18203 16 60.0 61.0 36.0 62.0 63.0 1 1 0 0

2152 rows × 10 columns

In [14]:
y
Out[14]:
5         1105
8         2138
18        3875
45        3661
65        2445
         ...  
18181     3399
18184     9389
18188    10780
18190    10121
18203     8358
Name: Wage, Length: 2152, dtype: int64

Split the dataset

In [15]:
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.3, random_state = 666)

Check.

In [16]:
train_X.head()
Out[16]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
13946 26 69.0 71.0 64.0 72.0 72.0 1 1 0 0
7711 22 85.0 69.0 63.0 47.0 56.0 1 0 0 1
8402 25 65.0 52.0 22.0 55.0 52.0 1 1 0 0
13651 26 59.0 76.0 80.0 75.0 72.0 0 0 1 0
1625 28 55.0 70.0 55.0 61.0 71.0 0 1 0 0
In [17]:
len(train_X)
Out[17]:
1506
In [18]:
train_y.head()
Out[18]:
13946    22512
7711      6760
8402      5377
13651    13711
1625     10521
Name: Wage, dtype: int64
In [19]:
len(train_y)
Out[19]:
1506
In [20]:
valid_X.head()
Out[20]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
7882 30 59.0 67.0 39.0 63.0 52.0 1 0 1 0
14555 24 55.0 80.0 59.0 71.0 65.0 1 0 0 1
16210 32 61.0 70.0 52.0 75.0 67.0 1 0 0 1
15847 24 60.0 75.0 49.0 71.0 70.0 1 0 1 0
12382 27 67.0 76.0 55.0 78.0 77.0 1 0 1 0
In [21]:
len(valid_X)
Out[21]:
646
In [22]:
valid_y.head()
Out[22]:
7882      5628
14555    28875
16210     6941
15847    10144
12382    45877
Name: Wage, dtype: int64
In [23]:
len(valid_y)
Out[23]:
646

2.2 Training the Regression Model¶

In [24]:
import sklearn
from sklearn.linear_model import LinearRegression
In [25]:
model = LinearRegression()
In [26]:
model.fit(train_X, train_y)
Out[26]:
LinearRegression()
In [27]:
train_y_pred = model.predict(train_X)
train_y_pred
Out[27]:
array([24239.85483562,  3778.75482905, -4114.30802593, ...,
       18813.39087789, 21533.92321643, 22278.55484414])
In [28]:
train_y_pred_df = pd.DataFrame(train_y_pred, columns = ["Training_Prediction"])
train_y_pred_df
Out[28]:
Training_Prediction
0 24239.854836
1 3778.754829
2 -4114.308026
3 29762.024484
4 15704.772118
... ...
1501 25387.715396
1502 21812.772659
1503 18813.390878
1504 21533.923216
1505 22278.554844

1506 rows × 1 columns

In [29]:
print("model intercept: ", model.intercept_)
print("model coefficients: ", model.coef_)
print("Model score: ", model.score(train_X, train_y))
model intercept:  293888.9838689667
model coefficients:  [-9.16152410e+02  4.85334103e+01  4.63702248e+02  4.80950017e+01
  6.32137998e+02  3.74975144e+02 -2.20174023e+03 -3.55489317e+05
 -3.56667999e+05 -3.57613010e+05]
Model score:  0.4586287207175519

Coefficients, easier to read.

In [30]:
print(pd.DataFrame({"Predictor": train_X.columns, "Coefficient": model.coef_}))
              Predictor    Coefficient
0                   Age    -916.152410
1               Balance      48.533410
2             ShotPower     463.702248
3            Aggression      48.095002
4           Positioning     632.137998
5             Composure     374.975144
6  Preferred Foot_Right   -2201.740231
7        Body Type_Lean -355489.317404
8      Body Type_Normal -356667.999143
9      Body Type_Stocky -357613.009856

2.2.1 Model Evaluation on Training¶

Get the RMSE for training set

In [31]:
mse_train = sklearn.metrics.mean_squared_error(train_y, train_y_pred)
mse_train
Out[31]:
261404234.2377431
In [32]:
import math
In [33]:
rmse_train = math.sqrt(mse_train)
rmse_train
Out[33]:
16168.000316605116
In [34]:
train_y.describe()
Out[34]:
count      1506.000000
mean      12698.381142
std       21981.278007
min        1290.000000
25%        4692.500000
50%        6544.000000
75%       12364.250000
max      407609.000000
Name: Wage, dtype: float64

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [35]:
import dmba
from dmba import regressionSummary
In [36]:
regressionSummary(train_y, train_y_pred)
Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 16168.0003
            Mean Absolute Error (MAE) : 8475.7614
          Mean Percentage Error (MPE) : -32.8265
Mean Absolute Percentage Error (MAPE) : 111.6103

Residuals.

In [37]:
train_residuals = train_y - train_y_pred
train_residuals
Out[37]:
13946    -1727.854836
7711      2981.245171
8402      9491.308026
13651   -16051.024484
1625     -5183.772118
             ...     
12759    -3215.715396
17284   -15227.772659
1016     -7666.390878
16984    -9871.923216
16744     4022.445156
Name: Wage, Length: 1506, dtype: float64
In [38]:
type(train_residuals)
Out[38]:
pandas.core.series.Series
In [39]:
import matplotlib.pyplot as plt
plt.hist(train_residuals, bins = 30)
plt.title("Residuals for Training")
plt.show()
In [40]:
train_residuals_df = train_residuals.to_frame(name = "Wage_Residuals")
train_residuals_df
Out[40]:
Wage_Residuals
13946 -1727.854836
7711 2981.245171
8402 9491.308026
13651 -16051.024484
1625 -5183.772118
... ...
12759 -3215.715396
17284 -15227.772659
1016 -7666.390878
16984 -9871.923216
16744 4022.445156

1506 rows × 1 columns

In [41]:
import matplotlib.pyplot as plt

plt.hist(train_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Training")
plt.show()

Normality

In [42]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y)
Out[42]:
ShapiroResult(statistic=0.38412952423095703, pvalue=0.0)
In [43]:
shapiro(train_residuals)
Out[43]:
ShapiroResult(statistic=0.5784118175506592, pvalue=0.0)
In [44]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df = pd.DataFrame()

vif_df["features"] = train_X.columns
vif_df["VIF"] = [variance_inflation_factor(train_X.values, i) for i in range(train_X.shape[1])]

print(vif_df)
               features         VIF
0                   Age   48.741902
1               Balance   33.364691
2             ShotPower  155.270485
3            Aggression   18.482052
4           Positioning  184.558829
5             Composure  115.774481
6  Preferred Foot_Right    7.714600
7        Body Type_Lean   32.118906
8      Body Type_Normal   59.412529
9      Body Type_Stocky   11.108926

2.3 Predict Validation Set¶

In [45]:
valid_y_pred = model.predict(valid_X)
valid_y_pred
Out[45]:
array([ 2.66530045e+03,  2.39448807e+04,  1.52116766e+04,  2.42079729e+04,
        2.96013138e+04,  1.24219313e+04,  7.03030428e+03,  1.95690670e+04,
        1.65262949e+04, -1.47780320e+03,  1.30968036e+04,  1.46015036e+04,
        1.65416124e+04,  1.23451712e+04,  2.41133395e+04,  3.44640274e+04,
        1.67102936e+04,  1.29909076e+04,  3.97181605e+03,  2.20783678e+04,
        5.38322832e+03,  2.09776571e+04,  2.08805388e+04,  2.34148863e+04,
        1.06999140e+04, -5.56234183e+03,  1.17865372e+04,  2.23538133e+04,
        1.43890326e+04,  1.75954292e+04, -1.22354489e+02,  1.29628192e+04,
        8.41552263e+03,  2.27889941e+04,  2.31121184e+03,  1.94168827e+04,
       -6.96486103e+03,  1.37996432e+04,  3.27861195e+03,  2.86742008e+04,
        2.97875564e+04, -6.32773021e+03,  2.31928580e+04,  3.04579799e+03,
        1.54775276e+04,  2.42101352e+04, -5.84285114e+03, -1.29340208e+03,
        4.06247257e+03,  1.80227373e+04,  2.17889372e+04,  9.70873486e+03,
        8.02344053e+03,  1.92883880e+04,  1.77181908e+04,  2.18037493e+04,
       -4.33997863e+03,  1.56144785e+04,  9.71271318e+03,  2.05954475e+04,
        4.75178565e+03,  2.30383707e+04,  1.02677166e+04,  2.35779602e+04,
        9.87039872e+03,  1.68866474e+04, -1.31118203e+03,  8.87430877e+03,
        3.24007684e+03,  3.23350289e+04,  1.53211409e+04,  1.07576794e+03,
        1.62358353e+03,  1.73480299e+04,  1.79438503e+04,  2.49372041e+04,
        2.00529623e+04,  1.68100187e+04,  8.56136513e+03,  3.46264372e+04,
        2.54399862e+04,  4.68426335e+04,  9.91675364e+03,  1.83355436e+04,
        3.54108387e+04,  1.70138955e+04,  1.46236932e+04,  1.02365282e+04,
        1.62242848e+04,  2.37720957e+04,  3.08256560e+04,  1.59745648e+04,
        1.59032676e+04,  3.19791193e+04,  2.04379622e+04,  1.55905542e+04,
        2.37955466e+04,  1.04128105e+04,  9.19329209e+03,  4.21540077e+04,
        1.27777969e+04,  2.41864412e+04,  2.14468738e+04,  8.06144737e+03,
        2.36390151e+04,  1.44857796e+04,  2.78377031e+04,  2.65205841e+04,
        2.38842489e+03,  4.01141506e+04,  1.19248133e+04,  1.74026797e+04,
       -3.55892891e+03,  1.72208335e+04,  1.22610883e+04, -5.54908320e+03,
        9.56791282e+03,  3.64233850e+03,  3.18218976e+04,  1.47101698e+04,
        3.73624918e+03,  5.33297175e+03,  2.29637658e+04,  2.10020910e+03,
        2.39209825e+04,  4.18162608e+04,  3.26198242e+04,  3.84999758e+04,
        2.36460768e+03,  2.00953034e+03,  2.70110133e+04, -2.01391023e+03,
       -2.20258250e+03,  1.69663027e+04,  3.12074612e+04,  2.05483828e+02,
       -3.60379347e+03,  1.35194619e+03, -2.64933746e+03,  8.64132573e+03,
        1.58664227e+04,  1.67561284e+04,  1.34151485e+04,  3.70881499e+03,
        4.13986671e+03,  3.08144727e+04,  1.23279327e+04, -2.22148139e+01,
        5.57685656e+03,  2.56361861e+04,  2.59934302e+04,  1.34265919e+03,
        2.65983638e+04,  1.18370810e+04,  1.09183425e+04,  1.43909810e+04,
        3.88503740e+03, -2.33173138e+03,  2.02760295e+04,  1.87326850e+04,
        1.96038255e+04, -9.93766117e+03,  7.47558360e+03,  1.66588172e+04,
        1.16956311e+04,  3.90959788e+03,  7.51416417e+03,  2.06412445e+04,
        1.75336594e+04,  1.33414610e+04,  1.96836251e+03, -4.82880254e+03,
       -3.71099746e+02,  7.66629848e+03,  6.14881071e+03,  1.23687003e+04,
        1.21231323e+04,  1.35733402e+04,  1.83551007e+04,  1.70539379e+04,
        3.02524002e+04, -1.07159337e+04,  6.01304553e+03,  1.14180403e+04,
       -3.34690338e+03,  1.00622718e+04,  2.06918571e+04,  1.73350758e+04,
        2.29914945e+04, -5.50771205e+03,  2.85330575e+04,  1.98521492e+03,
        1.06942632e+04,  1.73230322e+04,  2.49513789e+03, -2.79106246e+03,
        4.39038348e+04,  1.43537643e+04,  1.85450045e+04, -6.16542967e+03,
        1.63629421e+04,  2.79125830e+04,  1.73431638e+04,  3.14934519e+04,
        4.66349422e+03,  1.16135024e+04,  3.35813201e+04,  4.53795643e+03,
        3.03687846e+04,  5.61531696e+03,  7.07152851e+03,  3.18497487e+04,
        1.53772460e+04,  2.67107658e+03,  2.33955389e+04,  2.40698188e+04,
       -4.75904102e+03, -3.45112153e+03,  2.41953892e+04,  1.23263230e+04,
        2.13734730e+04,  4.57129895e+03,  4.79790380e+03,  3.16235504e+04,
        2.16213142e+04,  1.75585724e+04,  5.23850275e+03,  1.36123705e+04,
        1.30277524e+04,  1.95346600e+04,  2.84475997e+03,  2.44038322e+04,
        7.41182434e+03,  1.37544881e+04, -7.02660112e+02,  1.64984689e+04,
        1.11341685e+04,  7.89961221e+03,  2.46923901e+04,  1.76240587e+04,
        8.96931690e+03,  6.48185568e+03, -1.26222354e+03, -1.69218974e+03,
        2.57214042e+03,  1.61502589e+04,  1.28138050e+04,  9.60814203e+03,
        1.63259753e+04,  6.50911760e+03, -9.95762340e+03,  3.42057372e+02,
        1.02190745e+04,  1.67276920e+04,  2.21600760e+04,  4.65490924e+03,
        1.25402256e+04,  1.61009022e+04,  1.90925692e+04,  1.98924863e+04,
        1.10061114e+04,  5.83648195e+03,  1.70692595e+04,  1.45910951e+04,
        2.66219412e+04,  8.48576063e+03,  1.69487687e+04,  2.99414092e+03,
        1.80086237e+04,  1.87362018e+04,  3.54003722e+04, -3.68164534e+02,
        2.37253161e+04,  2.03710982e+04,  1.20499329e+04,  2.47400389e+04,
       -7.52342461e+03,  2.49023341e+04,  6.50286770e+03,  4.85828603e+03,
        9.87788899e+03,  1.49300957e+04,  9.99666283e+03,  2.42095834e+04,
        2.23392492e+03,  3.87104769e+03,  1.84381152e+04,  1.99189419e+04,
        2.88502679e+04,  2.95576858e+03,  1.02739668e+04,  8.61212517e+03,
        3.79011769e+04,  1.18256155e+04,  6.26209526e+03,  2.04705359e+04,
        8.36627864e+03,  1.49795212e+03,  1.96112157e+04,  1.25041470e+04,
        3.33043822e+03,  8.02272601e+03,  6.30190884e+01,  1.38667489e+04,
       -2.66181757e+03,  2.41768675e+02,  1.19301758e+04,  1.17904093e+04,
        2.05007501e+04, -1.68721187e+04,  2.41243743e+04,  3.72096058e+03,
        6.75666851e+03,  1.58471520e+04,  1.94148247e+04,  1.48026594e+03,
        1.31713425e+04,  2.98162207e+04, -4.06148613e+03,  2.23375583e+04,
        1.19660384e+04,  2.43317954e+04,  5.31000984e+04,  1.79335761e+04,
        1.34151909e+04, -4.84794752e+03,  1.87710521e+04,  1.87750241e+04,
        2.12204070e+04,  2.20818928e+04,  1.15592485e+03, -1.04175449e+04,
       -3.17875690e+03,  2.90112257e+04,  1.69765732e+04,  2.60161382e+04,
        2.43898074e+04,  8.39128773e+02,  1.63886547e+04, -7.37612923e+03,
        5.62485527e+03,  6.74975491e+03,  1.22944936e+04,  1.33810595e+04,
        5.46762848e+03,  1.28159578e+04,  1.08639469e+04,  1.03348099e+03,
        1.63132754e+03,  1.44072810e+04,  1.84246280e+04, -2.47909605e+03,
        1.95474530e+04,  1.99982066e+04,  2.25021690e+04,  1.32872506e+04,
        3.28203363e+03,  9.06686227e+03,  2.45238371e+04,  1.33661644e+04,
       -1.14433324e+03,  1.01026574e+04,  2.07442960e+04,  1.01544218e+04,
        1.47549186e+04,  1.99151083e+04,  1.84707327e+04,  2.91420465e+03,
        1.55694684e+04,  1.98216015e+04,  7.22538464e+03,  1.59688520e+04,
        8.82264506e+03, -5.57106659e+03, -3.61325599e+03,  1.66954298e+04,
        6.04080018e+03,  5.64313195e+03,  2.34846318e+04,  4.95252927e+03,
        1.17609615e+04,  2.73460395e+02,  4.62524003e+04,  1.94219123e+04,
        2.67636958e+03,  7.99161270e+02,  3.45401861e+04,  2.59576995e+03,
        1.89228641e+04,  3.42841898e+04,  4.15955289e+03,  3.50195062e+04,
        2.12903857e+04, -5.06633274e+03,  1.72083042e+04,  1.09837661e+04,
        9.35620710e+03,  1.39696572e+04,  2.02365913e+04,  2.01273692e+04,
        4.92787054e+03,  2.75630232e+04,  1.94245386e+04,  1.78279991e+04,
        2.49990535e+04,  2.22097084e+04,  8.43216129e+03,  8.92326875e+02,
        5.49338576e+03,  1.95493159e+04,  1.06933156e+04,  1.17082833e+04,
        2.25027856e+04,  1.45716041e+04,  4.09754472e+03,  9.19128351e+03,
        7.77620947e+03,  1.90177070e+04,  1.96619616e+04,  4.66622488e+04,
        3.67881827e+03,  5.08111933e+03,  5.38303692e+03,  1.24476702e+04,
        1.84716579e+04, -1.31531118e+03,  2.03243896e+04,  1.86851668e+04,
        3.77256684e+03,  1.01734030e+04,  3.06560978e+04,  2.34528445e+04,
        1.83387120e+04,  1.40144687e+04,  1.65206404e+04,  3.96428102e+04,
        1.04960032e+04, -4.15137281e+03,  1.05653584e+04,  3.39106609e+04,
        2.97230029e+04,  1.53912380e+04,  1.11446062e+04,  2.24223425e+04,
        2.68597465e+04,  1.22405459e+04, -2.66504745e+03,  1.20825230e+04,
        2.03241862e+04, -3.92528017e+03,  5.56118121e+03, -5.96828907e+03,
        1.60675081e+04,  8.47372390e+03,  1.66574753e+04, -1.08819482e+04,
        2.20162527e+04,  1.36954201e+04,  1.52982927e+04,  2.86445902e+04,
        2.15596058e+04,  3.72854768e+03,  6.01809567e+03,  2.62442863e+04,
        1.78847640e+04,  2.60260009e+04,  2.21141255e+04, -2.91738379e+03,
        2.81612491e+04,  4.74778654e+04,  8.43042557e+03,  2.17905824e+04,
        1.78861898e+04,  2.81132855e+04,  2.24379049e+04,  4.87238954e+03,
        3.48894848e+04,  1.06339999e+04,  4.88834577e+03,  2.37027527e+04,
        2.77041377e+04,  1.67256495e+04, -1.23767615e+03,  2.45406129e+04,
       -6.73009778e+03, -3.71221165e+03,  1.87645554e+04,  3.45908317e+03,
        9.69112280e+03,  9.79096864e+03, -6.92458269e+03,  3.78754518e+03,
       -3.03440938e+03,  2.51852102e+04, -4.47402223e+03,  3.98987083e+04,
        2.19649914e+04,  1.91549960e+04,  8.60118620e+03,  2.05626620e+04,
       -3.69035608e+03,  1.48537949e+04,  4.04843946e+03,  3.16774267e+04,
        1.70576259e+04,  1.73557586e+03,  3.20480351e+04,  1.56235213e+04,
        5.63278927e+03,  2.32403127e+04,  2.46781892e+04,  7.00114435e+03,
       -3.58186265e+03,  2.21807504e+04, -6.37439516e+03,  7.92053004e+03,
        5.22713559e+03,  3.85323630e+03,  1.61069767e+04,  1.75678237e+04,
        1.81873006e+04,  1.00464058e+04,  8.84400565e+03,  1.72847394e+04,
        5.17322566e+02,  2.56655118e+04,  2.62752003e+04,  1.07591937e+04,
        1.51229379e+04,  1.93102802e+04,  9.00728050e+03,  2.97264751e+03,
       -1.65027255e+03,  1.94821918e+04, -4.41337726e+03,  1.89499469e+04,
        3.01911889e+01,  1.34545696e+04,  5.57215010e+03,  4.50681337e+02,
        3.92060784e+04,  1.32187052e+04,  1.10678007e+04,  1.06085720e+04,
       -4.72793002e+03,  2.23111111e+04,  4.89426890e+03,  1.58645569e+04,
       -4.78889524e+03,  2.36842533e+03,  3.30683841e+04,  4.89473966e+02,
        1.38929836e+04,  2.39812906e+04, -8.08396068e+03,  1.55943106e+04,
        2.26993605e+04,  3.51540327e+04,  3.59982676e+04,  2.57322782e+04,
        2.85020030e+04,  1.23762047e+04,  1.77077060e+04,  6.05218625e+02,
        2.12468521e+04,  1.98247308e+03,  9.75425736e+03,  1.91635760e+04,
        1.11030216e+04,  2.59148087e+04,  1.04661967e+04,  3.23588749e+04,
        3.13942429e+03, -4.08853810e+03,  1.52620187e+04, -1.58765960e+04,
        1.92577674e+04,  2.37955347e+04,  1.31788779e+04,  1.08961748e+04,
        1.81281065e+04,  1.83373496e+04,  9.36618933e+03,  2.37138764e+04,
        1.44469600e+04,  2.24502896e+02,  1.87964914e+04,  1.71068838e+04,
       -1.06577785e+03,  1.00160136e+04,  1.60681170e+04,  3.22133872e+04,
        2.30064316e+04,  6.75391713e+03,  1.60378128e+04,  1.32996181e+04,
       -1.49122318e+02,  8.22587441e+03,  3.26499372e+04,  1.85186487e+04,
       -4.30577382e+03,  1.16753828e+04,  5.05871265e+03,  4.04057843e+04,
        7.68904389e+03,  1.82256266e+04, -2.70297307e+03,  2.81515606e+04,
       -5.27176698e+03,  1.86762020e+04,  1.06676760e+04,  3.22475566e+04,
       -5.52105925e+03,  2.62373303e+04,  1.44986397e+04, -4.73905063e+03,
        3.65046624e+04,  4.02480572e+03,  8.28482915e+03,  1.95814608e+04,
        8.41202780e+03, -3.84857827e+03,  8.40586509e+03,  1.03176404e+04,
        2.43958443e+04,  1.38968987e+04,  3.67708230e+04,  1.72569153e+04,
        6.03696891e+03, -1.43788106e+03,  3.32488578e+04,  1.62319091e+04,
        1.61211529e+04,  4.90873986e+03,  9.34189590e+03, -5.86994130e+03,
        1.42400588e+04,  3.39109284e+02,  3.07350763e+03,  1.75517998e+04,
        1.60393277e+04,  5.81477513e+02, -1.61141500e+03,  2.06980825e+04,
        1.15261767e+04,  3.04436687e+04])
In [46]:
valid_y_pred_df = pd.DataFrame(valid_y_pred, columns = ["Validation_Prediction"])
valid_y_pred_df
Out[46]:
Validation_Prediction
0 2665.300452
1 23944.880671
2 15211.676647
3 24207.972902
4 29601.313799
... ...
641 581.477513
642 -1611.414995
643 20698.082536
644 11526.176711
645 30443.668736

646 rows × 1 columns

2.3.1 Model Evaluation on Validation¶

Get the RMSE for validation set.

In [47]:
mse_valid = sklearn.metrics.mean_squared_error(valid_y, valid_y_pred)
mse_valid
Out[47]:
380956622.57514906
In [48]:
# As before
# import math

rmse_valid = math.sqrt(mse_valid)
rmse_valid
Out[48]:
19518.110117917386
In [49]:
valid_y.describe()
Out[49]:
count       646.000000
mean      13535.160991
std       23624.770667
min        1105.000000
25%        4708.750000
50%        6750.500000
75%       12827.750000
max      301070.000000
Name: Wage, dtype: float64
In [50]:
# As before:

# If using the dmba package:

# pip install dmba


# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y, valid_y_pred)
Regression statistics

                      Mean Error (ME) : 91.4105
       Root Mean Squared Error (RMSE) : 19518.1101
            Mean Absolute Error (MAE) : 9319.7708
          Mean Percentage Error (MPE) : -43.3987
Mean Absolute Percentage Error (MAPE) : 118.8390

Residuals.

In [51]:
valid_residuals = valid_y - valid_y_pred
valid_residuals.head()
Out[51]:
7882      2962.699548
14555     4930.119329
16210    -8270.676647
15847   -14063.972902
12382    16275.686201
Name: Wage, dtype: float64
In [52]:
import matplotlib.pyplot as plt
plt.hist(valid_residuals, bins = 30)
plt.title("Residuals for Validation")
plt.show()
In [53]:
valid_residuals_df = valid_residuals.to_frame(name = "Wage_Residuals")
valid_residuals_df
Out[53]:
Wage_Residuals
7882 2962.699548
14555 4930.119329
16210 -8270.676647
15847 -14063.972902
12382 16275.686201
... ...
8620 5455.522487
10786 6010.414995
16154 13639.917464
4990 -8092.176711
14654 -14046.668736

646 rows × 1 columns

In [54]:
import matplotlib.pyplot as plt

plt.hist(valid_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Validation")
plt.show()

2.3.2 Traditional model evaluation¶

Scikit-learn does not provide traditional regression model summaries.

Use statsmodels package if desired.

conda install -c conda-forge statsmodels

or

pip install statsmodels

In [55]:
import statsmodels.api as sm
In [56]:
model_statsmodels = sm.OLS(train_y, train_X)
In [57]:
results = model_statsmodels.fit()
print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                   Wage   R-squared (uncentered):                   0.515
Model:                            OLS   Adj. R-squared (uncentered):              0.512
Method:                 Least Squares   F-statistic:                              159.1
Date:                Sat, 18 Feb 2023   Prob (F-statistic):                   4.36e-227
Time:                        13:26:40   Log-Likelihood:                         -16865.
No. Observations:                1506   AIC:                                  3.375e+04
Df Residuals:                    1496   BIC:                                  3.380e+04
Df Model:                          10                                                  
Covariance Type:            nonrobust                                                  
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Age                   -852.0938    127.623     -6.677      0.000   -1102.432    -601.756
Balance                151.9979     40.583      3.745      0.000      72.392     231.604
ShotPower              633.7931     85.723      7.393      0.000     465.642     801.944
Aggression              15.2246     36.329      0.419      0.675     -56.037      86.486
Positioning            716.7647     94.926      7.551      0.000     530.562     902.968
Composure              382.7540     81.366      4.704      0.000     223.151     542.357
Preferred Foot_Right  -493.9671   1360.831     -0.363      0.717   -3163.306    2175.372
Body Type_Lean       -8.622e+04   4538.366    -18.999      0.000   -9.51e+04   -7.73e+04
Body Type_Normal     -8.807e+04   4632.299    -19.011      0.000   -9.72e+04    -7.9e+04
Body Type_Stocky     -8.937e+04   4906.471    -18.216      0.000    -9.9e+04   -7.97e+04
==============================================================================
Omnibus:                     2072.969   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           660076.649
Skew:                           7.548   Prob(JB):                         0.00
Kurtosis:                     104.446   Cond. No.                     2.46e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.46e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

3. New Records¶

New players

In [58]:
new_players_df = pd.read_csv("new_players.csv")
new_players_df
Out[58]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [59]:
new_records_players_pred = model.predict(new_players_df)
new_records_players_pred
Out[59]:
array([29318.86943019, 20847.31938856, 29120.84519698])
In [60]:
# As before
# import pandas as pd

new_records_players_pred_df = pd.DataFrame(new_records_players_pred, columns = ["Prediction"])
new_records_players_pred_df

# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Out[60]:
Prediction
0 29318.869430
1 20847.319389
2 29120.845197
In [61]:
alpha = 0.05
ci = np.quantile(train_residuals, 1 - alpha)
ci
Out[61]:
17225.89938935508
In [62]:
def generate_results_confint(preds, ci):
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci >= 0:
        df["upper"] = preds + ci
        df["lower"] = preds - ci
    else:
        df["upper"] = preds - ci
        df["lower"] = preds + ci
    return df
In [63]:
new_records_players_pred_confint_df = generate_results_confint(new_records_players_pred, ci)
new_records_players_pred_confint_df
Out[63]:
Prediction upper lower
0 29318.869430 46544.768820 12092.970041
1 20847.319389 38073.218778 3621.419999
2 29120.845197 46346.744586 11894.945808

red_devils.jpeg