Regression¶

Goal! Goal! Goal!¶

mu2.jpg

Download data

Back to spell book

1. Load Data¶

1.1 Libraries¶

In [1]:
import pandas as pd

Sometimes, we may need to use a specific encoding:

encoding = "ISO-8859-1"

encoding = "utf-8"
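If the correct encoding is not known in advance, one option is to try a few common candidates and keep the first that works (a minimal sketch; the candidate list is just an example):

for enc in ["utf-8", "ISO-8859-1", "cp1252"]:
    try:
        football = pd.read_csv("football_2.csv", encoding = enc)
        print("Loaded with encoding:", enc)
        break
    except UnicodeDecodeError:
        continue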

1.2 Data¶

In [2]:
football = pd.read_csv("football_2.csv", encoding = "ISO-8859-1")
football.head()
Out[2]:
ID Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 207439 L. Paredes 24 https://cdn.sofifa.org/players/4/19/207439.png Argentina https://cdn.sofifa.org/flags/52.png 80 85 NaN https://cdn.sofifa.org/flags/52.png ... 74.0 73.0 75.0 72.0 9.0 14.0 6.0 9.0 10.0 NaN
1 156713 A. Granqvist 33 https://cdn.sofifa.org/players/4/19/156713.png Sweden https://cdn.sofifa.org/flags/46.png 80 80 NaN https://cdn.sofifa.org/flags/46.png ... 78.0 82.0 83.0 79.0 7.0 9.0 12.0 10.0 15.0 NaN
2 229909 A. Lunev 26 https://cdn.sofifa.org/players/4/19/229909.png Russia https://cdn.sofifa.org/flags/40.png 79 81 NaN https://cdn.sofifa.org/flags/40.png ... 69.0 18.0 20.0 12.0 80.0 73.0 65.0 77.0 85.0 NaN
3 187347 I. Smolnikov 29 https://cdn.sofifa.org/players/4/19/187347.png Russia https://cdn.sofifa.org/flags/40.png 79 79 NaN https://cdn.sofifa.org/flags/40.png ... 73.0 76.0 76.0 80.0 7.0 12.0 10.0 8.0 15.0 NaN
4 153260 Hilton 40 https://cdn.sofifa.org/players/4/19/153260.png Brazil https://cdn.sofifa.org/flags/54.png 78 78 Montpellier HSC https://cdn.sofifa.org/teams/2/light/70.png ... 70.0 83.0 77.0 76.0 12.0 7.0 11.0 12.0 13.0 NaN

5 rows × 88 columns

Variable names. The plain list is hard to read without an index, so we also put the names into a DataFrame below.

In [3]:
football.columns.values.tolist()
Out[3]:
['ID',
 'Name',
 'Age',
 'Photo',
 'Nationality',
 'Flag',
 'Overall',
 'Potential',
 'Club',
 'Club Logo',
 'Value',
 'Wage',
 'Special',
 'Preferred Foot',
 'International Reputation',
 'Weak Foot',
 'Skill Moves',
 'Work Rate',
 'Body Type',
 'Real Face',
 'Position',
 'Jersey Number',
 'Joined',
 'Loaned From',
 'Contract Valid Until',
 'Height',
 'Weight',
 'LS',
 'ST',
 'RS',
 'LW',
 'LF',
 'CF',
 'RF',
 'RW',
 'LAM',
 'CAM',
 'RAM',
 'LM',
 'LCM',
 'CM',
 'RCM',
 'RM',
 'LWB',
 'LDM',
 'CDM',
 'RDM',
 'RWB',
 'LB',
 'LCB',
 'CB',
 'RCB',
 'RB',
 'Crossing',
 'Finishing',
 'HeadingAccuracy',
 'ShortPassing',
 'Volleys',
 'Dribbling',
 'Curve',
 'FKAccuracy',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Reactions',
 'Balance',
 'ShotPower',
 'Jumping',
 'Stamina',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Positioning',
 'Vision',
 'Penalties',
 'Composure',
 'Marking',
 'StandingTackle',
 'SlidingTackle',
 'GKDiving',
 'GKHandling',
 'GKKicking',
 'GKPositioning',
 'GKReflexes',
 'Release Clause']
In [4]:
football_variables_df = pd.DataFrame(football.columns.values, columns = ["Variables"])
football_variables_df
Out[4]:
Variables
0 ID
1 Name
2 Age
3 Photo
4 Nationality
... ...
83 GKHandling
84 GKKicking
85 GKPositioning
86 GKReflexes
87 Release Clause

88 rows × 1 columns

Display all rows.

In [5]:
print(football_variables_df.to_string())
                   Variables
0                         ID
1                       Name
2                        Age
3                      Photo
4                Nationality
5                       Flag
6                    Overall
7                  Potential
8                       Club
9                  Club Logo
10                     Value
11                      Wage
12                   Special
13            Preferred Foot
14  International Reputation
15                 Weak Foot
16               Skill Moves
17                 Work Rate
18                 Body Type
19                 Real Face
20                  Position
21             Jersey Number
22                    Joined
23               Loaned From
24      Contract Valid Until
25                    Height
26                    Weight
27                        LS
28                        ST
29                        RS
30                        LW
31                        LF
32                        CF
33                        RF
34                        RW
35                       LAM
36                       CAM
37                       RAM
38                        LM
39                       LCM
40                        CM
41                       RCM
42                        RM
43                       LWB
44                       LDM
45                       CDM
46                       RDM
47                       RWB
48                        LB
49                       LCB
50                        CB
51                       RCB
52                        RB
53                  Crossing
54                 Finishing
55           HeadingAccuracy
56              ShortPassing
57                   Volleys
58                 Dribbling
59                     Curve
60                FKAccuracy
61               LongPassing
62               BallControl
63              Acceleration
64               SprintSpeed
65                   Agility
66                 Reactions
67                   Balance
68                 ShotPower
69                   Jumping
70                   Stamina
71                  Strength
72                 LongShots
73                Aggression
74             Interceptions
75               Positioning
76                    Vision
77                 Penalties
78                 Composure
79                   Marking
80            StandingTackle
81             SlidingTackle
82                  GKDiving
83                GKHandling
84                 GKKicking
85             GKPositioning
86                GKReflexes
87            Release Clause
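An alternative to to_string() is to lift the row limit temporarily with a pandas display option (a sketch using option_context):

with pd.option_context("display.max_rows", None):
    print(football_variables_df)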
In [6]:
print(football.dtypes.to_string())
ID                            int64
Name                         object
Age                           int64
Photo                        object
Nationality                  object
Flag                         object
Overall                       int64
Potential                     int64
Club                         object
Club Logo                    object
Value                         int64
Wage                          int64
Special                       int64
Preferred Foot               object
International Reputation    float64
Weak Foot                   float64
Skill Moves                 float64
Work Rate                    object
Body Type                    object
Real Face                    object
Position                     object
Jersey Number               float64
Joined                       object
Loaned From                  object
Contract Valid Until         object
Height                       object
Weight                       object
LS                           object
ST                           object
RS                           object
LW                           object
LF                           object
CF                           object
RF                           object
RW                           object
LAM                          object
CAM                          object
RAM                          object
LM                           object
LCM                          object
CM                           object
RCM                          object
RM                           object
LWB                          object
LDM                          object
CDM                          object
RDM                          object
RWB                          object
LB                           object
LCB                          object
CB                           object
RCB                          object
RB                           object
Crossing                    float64
Finishing                   float64
HeadingAccuracy             float64
ShortPassing                float64
Volleys                     float64
Dribbling                   float64
Curve                       float64
FKAccuracy                  float64
LongPassing                 float64
BallControl                 float64
Acceleration                float64
SprintSpeed                 float64
Agility                     float64
Reactions                   float64
Balance                     float64
ShotPower                   float64
Jumping                     float64
Stamina                     float64
Strength                    float64
LongShots                   float64
Aggression                  float64
Interceptions               float64
Positioning                 float64
Vision                      float64
Penalties                   float64
Composure                   float64
Marking                     float64
StandingTackle              float64
SlidingTackle               float64
GKDiving                    float64
GKHandling                  float64
GKKicking                   float64
GKPositioning               float64
GKReflexes                  float64
Release Clause               object

1.3 Filter required records and variables¶

Filter for strikers only.

In [7]:
football_2 = football[football["Position"] == "ST"] 
football_2.head()
Out[7]:
ID Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
5 187607 A. Dzyuba 29 https://cdn.sofifa.org/players/4/19/187607.png Russia https://cdn.sofifa.org/flags/40.png 78 78 NaN https://cdn.sofifa.org/flags/40.png ... 70.0 21.0 15.0 19.0 15.0 12.0 11.0 11.0 8.0 NaN
8 183389 G. Sio 29 https://cdn.sofifa.org/players/4/19/183389.png Ivory Coast https://cdn.sofifa.org/flags/108.png 77 77 NaN https://cdn.sofifa.org/flags/108.png ... 72.0 40.0 18.0 12.0 15.0 9.0 10.0 15.0 16.0 NaN
18 245683 K. Fofana 26 https://cdn.sofifa.org/players/4/19/245683.png Ivory Coast https://cdn.sofifa.org/flags/108.png 75 75 NaN https://cdn.sofifa.org/flags/108.png ... 83.0 23.0 37.0 46.0 7.0 11.0 7.0 11.0 14.0 NaN
45 190461 B. Sigurðarson 27 https://cdn.sofifa.org/players/4/19/190461.png Iceland https://cdn.sofifa.org/flags/24.png 73 74 NaN https://cdn.sofifa.org/flags/24.png ... 76.0 31.0 39.0 24.0 9.0 12.0 10.0 15.0 16.0 NaN
65 225900 J. Sambenito 26 https://cdn.sofifa.org/players/4/19/225900.png Paraguay https://cdn.sofifa.org/flags/58.png 71 74 NaN https://cdn.sofifa.org/flags/58.png ... 74.0 15.0 16.0 16.0 15.0 16.0 15.0 7.0 7.0 NaN

5 rows × 88 columns

Filter for the required variables.

In [8]:
import numpy as np
In [9]:
football_3 = football_2.iloc[:, np.r_[2, 13, 18, 67, 68, 73, 75, 78, 11]]
football_3.head()
Out[9]:
Age Preferred Foot Body Type Balance ShotPower Aggression Positioning Composure Wage
5 29 Right Stocky 32.0 78.0 75.0 78.0 70.0 1105
8 29 Left Normal 73.0 77.0 77.0 76.0 72.0 2138
18 26 Right Normal 60.0 78.0 67.0 72.0 83.0 3875
45 27 Right Normal 76.0 68.0 73.0 73.0 76.0 3661
65 26 Right Lean 64.0 73.0 49.0 75.0 74.0 2445
In [10]:
# Or simply (if no ranges are used)

football_2.iloc[:, [2, 13, 18, 67, 68, 73, 75, 78, 11]]
Out[10]:
Age Preferred Foot Body Type Balance ShotPower Aggression Positioning Composure Wage
5 29 Right Stocky 32.0 78.0 75.0 78.0 70.0 1105
8 29 Left Normal 73.0 77.0 77.0 76.0 72.0 2138
18 26 Right Normal 60.0 78.0 67.0 72.0 83.0 3875
45 27 Right Normal 76.0 68.0 73.0 73.0 76.0 3661
65 26 Right Lean 64.0 73.0 49.0 75.0 74.0 2445
... ... ... ... ... ... ... ... ... ...
18181 19 Right Lean 64.0 67.0 38.0 61.0 52.0 3399
18184 21 Right Stocky 70.0 64.0 32.0 56.0 51.0 9389
18188 21 Right Normal 53.0 61.0 62.0 60.0 61.0 10780
18190 19 Right Normal 68.0 61.0 51.0 67.0 62.0 10121
18203 16 Right Lean 60.0 61.0 36.0 62.0 63.0 8358

2152 rows × 9 columns

2. Regression¶

2.1 Training-Validation Split¶

In [11]:
from sklearn.model_selection import train_test_split

Define predictors and target variable.

Creating dummies applies to categorical variables only.

If a different separator between the column name and the category label is desired (prefix_sep controls it):

X = pd.get_dummies(X, prefix_sep = 'dummy', drop_first = True)
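Depending on the pandas version, the dummy columns may be created as booleans rather than 0/1 integers; the dtype argument can force integers (a sketch, assuming a pandas version that accepts dtype):

X = pd.get_dummies(X, drop_first = True, dtype = int)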

In [12]:
X = football_3.drop(columns = ["Wage"])
# Get dummies for the categorical variables

X = pd.get_dummies(X, drop_first = True)

y = football_3["Wage"]
In [13]:
X
Out[13]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
5 29 32.0 78.0 75.0 78.0 70.0 1 0 0 1
8 29 73.0 77.0 77.0 76.0 72.0 0 0 1 0
18 26 60.0 78.0 67.0 72.0 83.0 1 0 1 0
45 27 76.0 68.0 73.0 73.0 76.0 1 0 1 0
65 26 64.0 73.0 49.0 75.0 74.0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ...
18181 19 64.0 67.0 38.0 61.0 52.0 1 1 0 0
18184 21 70.0 64.0 32.0 56.0 51.0 1 0 0 1
18188 21 53.0 61.0 62.0 60.0 61.0 1 0 1 0
18190 19 68.0 61.0 51.0 67.0 62.0 1 0 1 0
18203 16 60.0 61.0 36.0 62.0 63.0 1 1 0 0

2152 rows × 10 columns

In [14]:
y
Out[14]:
5         1105
8         2138
18        3875
45        3661
65        2445
         ...  
18181     3399
18184     9389
18188    10780
18190    10121
18203     8358
Name: Wage, Length: 2152, dtype: int64

Split the dataset into training and validation sets.

In [15]:
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.3, random_state = 666)

Check.

In [16]:
train_X.head()
Out[16]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
13946 26 69.0 71.0 64.0 72.0 72.0 1 1 0 0
7711 22 85.0 69.0 63.0 47.0 56.0 1 0 0 1
8402 25 65.0 52.0 22.0 55.0 52.0 1 1 0 0
13651 26 59.0 76.0 80.0 75.0 72.0 0 0 1 0
1625 28 55.0 70.0 55.0 61.0 71.0 0 1 0 0
In [17]:
len(train_X)
Out[17]:
1506
In [18]:
train_y.head()
Out[18]:
13946    22512
7711      6760
8402      5377
13651    13711
1625     10521
Name: Wage, dtype: int64
In [19]:
len(train_y)
Out[19]:
1506
In [20]:
valid_X.head()
Out[20]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
7882 30 59.0 67.0 39.0 63.0 52.0 1 0 1 0
14555 24 55.0 80.0 59.0 71.0 65.0 1 0 0 1
16210 32 61.0 70.0 52.0 75.0 67.0 1 0 0 1
15847 24 60.0 75.0 49.0 71.0 70.0 1 0 1 0
12382 27 67.0 76.0 55.0 78.0 77.0 1 0 1 0
In [21]:
len(valid_X)
Out[21]:
646
In [22]:
valid_y.head()
Out[22]:
7882      5628
14555    28875
16210     6941
15847    10144
12382    45877
Name: Wage, dtype: int64
In [23]:
len(valid_y)
Out[23]:
646

2.2 Training the Regression Model¶

In [24]:
import sklearn
from sklearn.linear_model import LinearRegression
In [25]:
model = LinearRegression()
In [26]:
model.fit(train_X, train_y)
Out[26]:
LinearRegression()
In [27]:
train_y_pred = model.predict(train_X)
train_y_pred
Out[27]:
array([24239.85483562,  3778.75482905, -4114.30802593, ...,
       18813.39087789, 21533.92321643, 22278.55484414])
In [28]:
train_y_pred_df = pd.DataFrame(train_y_pred, columns = ["Training_Prediction"])
train_y_pred_df
Out[28]:
Training_Prediction
0 24239.854836
1 3778.754829
2 -4114.308026
3 29762.024484
4 15704.772118
... ...
1501 25387.715396
1502 21812.772659
1503 18813.390878
1504 21533.923216
1505 22278.554844

1506 rows × 1 columns

In [29]:
print("model intercept: ", model.intercept_)
print("model coefficients: ", model.coef_)
print("Model score: ", model.score(train_X, train_y))
model intercept:  293888.9838689667
model coefficients:  [-9.16152410e+02  4.85334103e+01  4.63702248e+02  4.80950017e+01
  6.32137998e+02  3.74975144e+02 -2.20174023e+03 -3.55489317e+05
 -3.56667999e+05 -3.57613010e+05]
Model score:  0.4586287207175519

Display the coefficients in a more readable table.

In [30]:
print(pd.DataFrame({"Predictor": train_X.columns, "Coefficient": model.coef_}))
              Predictor    Coefficient
0                   Age    -916.152410
1               Balance      48.533410
2             ShotPower     463.702248
3            Aggression      48.095002
4           Positioning     632.137998
5             Composure     374.975144
6  Preferred Foot_Right   -2201.740231
7        Body Type_Lean -355489.317404
8      Body Type_Normal -356667.999143
9      Body Type_Stocky -357613.009856

2.2.1 Model Evaluation on Training¶

Get the RMSE for the training set.

In [31]:
mse_train = sklearn.metrics.mean_squared_error(train_y, train_y_pred)
mse_train
Out[31]:
261404234.2377431
In [32]:
import math
In [33]:
rmse_train = math.sqrt(mse_train)
rmse_train
Out[33]:
16168.000316605116
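Depending on the scikit-learn version, the RMSE can also be obtained in one call (a sketch; squared = False works in many releases, while the newest ones provide sklearn.metrics.root_mean_squared_error instead):

# sklearn was imported above
rmse_train_alt = sklearn.metrics.mean_squared_error(train_y, train_y_pred, squared = False)
rmse_train_alt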
In [34]:
train_y.describe()
Out[34]:
count      1506.000000
mean      12698.381142
std       21981.278007
min        1290.000000
25%        4692.500000
50%        6544.000000
75%       12364.250000
max      407609.000000
Name: Wage, dtype: float64

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [35]:
import dmba
from dmba import regressionSummary
In [36]:
regressionSummary(train_y, train_y_pred)
Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 16168.0003
            Mean Absolute Error (MAE) : 8475.7614
          Mean Percentage Error (MPE) : -32.8265
Mean Absolute Percentage Error (MAPE) : 111.6103

Residuals.

In [37]:
train_residuals = train_y - train_y_pred
train_residuals
Out[37]:
13946    -1727.854836
7711      2981.245171
8402      9491.308026
13651   -16051.024484
1625     -5183.772118
             ...     
12759    -3215.715396
17284   -15227.772659
1016     -7666.390878
16984    -9871.923216
16744     4022.445156
Name: Wage, Length: 1506, dtype: float64
In [38]:
type(train_residuals)
Out[38]:
pandas.core.series.Series
In [39]:
import matplotlib.pyplot as plt
plt.hist(train_residuals, bins = 30)
plt.title("Residuals for Training")
plt.show()
In [40]:
train_residuals_df = train_residuals.to_frame(name = "Wage_Residuals")
train_residuals_df
Out[40]:
Wage_Residuals
13946 -1727.854836
7711 2981.245171
8402 9491.308026
13651 -16051.024484
1625 -5183.772118
... ...
12759 -3215.715396
17284 -15227.772659
1016 -7666.390878
16984 -9871.923216
16744 4022.445156

1506 rows × 1 columns

In [41]:
import matplotlib.pyplot as plt

plt.hist(train_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Training")
plt.show()

Check normality with the Shapiro-Wilk test.

In [42]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y)
Out[42]:
ShapiroResult(statistic=0.38412952423095703, pvalue=0.0)
In [43]:
shapiro(train_residuals)
Out[43]:
ShapiroResult(statistic=0.5784118175506592, pvalue=0.0)
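A Q-Q plot is a useful visual complement to the Shapiro-Wilk test (a sketch using scipy.stats.probplot):

from scipy import stats
import matplotlib.pyplot as plt

# Points close to the reference line suggest approximately normal residuals
stats.probplot(train_residuals, dist = "norm", plot = plt)
plt.title("Q-Q Plot of Training Residuals")
plt.show()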
In [44]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df = pd.DataFrame()

vif_df["features"] = train_X.columns
vif_df["VIF"] = [variance_inflation_factor(train_X.values, i) for i in range(train_X.shape[1])]

print(vif_df)
               features         VIF
0                   Age   48.741902
1               Balance   33.364691
2             ShotPower  155.270485
3            Aggression   18.482052
4           Positioning  184.558829
5             Composure  115.774481
6  Preferred Foot_Right    7.714600
7        Body Type_Lean   32.118906
8      Body Type_Normal   59.412529
9      Body Type_Stocky   11.108926
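These VIFs are computed on a design matrix without an intercept column, which tends to inflate them; a sketch that adds a constant term first (assumes statsmodels is installed):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

train_X_const = sm.add_constant(train_X)  # add an intercept column
vif_const_df = pd.DataFrame({
    "features": train_X_const.columns,
    "VIF": [variance_inflation_factor(train_X_const.values, i) for i in range(train_X_const.shape[1])]
})
print(vif_const_df)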

2.3 Predict Validation Set¶

In [45]:
valid_y_pred = model.predict(valid_X)
valid_y_pred
Out[45]:
array([ 2.66530045e+03,  2.39448807e+04,  1.52116766e+04,  2.42079729e+04,
        2.96013138e+04,  1.24219313e+04,  7.03030428e+03,  1.95690670e+04,
        1.65262949e+04, -1.47780320e+03,  1.30968036e+04,  1.46015036e+04,
        1.65416124e+04,  1.23451712e+04,  2.41133395e+04,  3.44640274e+04,
        1.67102936e+04,  1.29909076e+04,  3.97181605e+03,  2.20783678e+04,
        5.38322832e+03,  2.09776571e+04,  2.08805388e+04,  2.34148863e+04,
        1.06999140e+04, -5.56234183e+03,  1.17865372e+04,  2.23538133e+04,
        1.43890326e+04,  1.75954292e+04, -1.22354489e+02,  1.29628192e+04,
        8.41552263e+03,  2.27889941e+04,  2.31121184e+03,  1.94168827e+04,
       -6.96486103e+03,  1.37996432e+04,  3.27861195e+03,  2.86742008e+04,
        2.97875564e+04, -6.32773021e+03,  2.31928580e+04,  3.04579799e+03,
        1.54775276e+04,  2.42101352e+04, -5.84285114e+03, -1.29340208e+03,
        4.06247257e+03,  1.80227373e+04,  2.17889372e+04,  9.70873486e+03,
        8.02344053e+03,  1.92883880e+04,  1.77181908e+04,  2.18037493e+04,
       -4.33997863e+03,  1.56144785e+04,  9.71271318e+03,  2.05954475e+04,
        4.75178565e+03,  2.30383707e+04,  1.02677166e+04,  2.35779602e+04,
        9.87039872e+03,  1.68866474e+04, -1.31118203e+03,  8.87430877e+03,
        3.24007684e+03,  3.23350289e+04,  1.53211409e+04,  1.07576794e+03,
        1.62358353e+03,  1.73480299e+04,  1.79438503e+04,  2.49372041e+04,
        2.00529623e+04,  1.68100187e+04,  8.56136513e+03,  3.46264372e+04,
        2.54399862e+04,  4.68426335e+04,  9.91675364e+03,  1.83355436e+04,
        3.54108387e+04,  1.70138955e+04,  1.46236932e+04,  1.02365282e+04,
        1.62242848e+04,  2.37720957e+04,  3.08256560e+04,  1.59745648e+04,
        1.59032676e+04,  3.19791193e+04,  2.04379622e+04,  1.55905542e+04,
        2.37955466e+04,  1.04128105e+04,  9.19329209e+03,  4.21540077e+04,
        1.27777969e+04,  2.41864412e+04,  2.14468738e+04,  8.06144737e+03,
        2.36390151e+04,  1.44857796e+04,  2.78377031e+04,  2.65205841e+04,
        2.38842489e+03,  4.01141506e+04,  1.19248133e+04,  1.74026797e+04,
       -3.55892891e+03,  1.72208335e+04,  1.22610883e+04, -5.54908320e+03,
        9.56791282e+03,  3.64233850e+03,  3.18218976e+04,  1.47101698e+04,
        3.73624918e+03,  5.33297175e+03,  2.29637658e+04,  2.10020910e+03,
        2.39209825e+04,  4.18162608e+04,  3.26198242e+04,  3.84999758e+04,
        2.36460768e+03,  2.00953034e+03,  2.70110133e+04, -2.01391023e+03,
       -2.20258250e+03,  1.69663027e+04,  3.12074612e+04,  2.05483828e+02,
       -3.60379347e+03,  1.35194619e+03, -2.64933746e+03,  8.64132573e+03,
        1.58664227e+04,  1.67561284e+04,  1.34151485e+04,  3.70881499e+03,
        4.13986671e+03,  3.08144727e+04,  1.23279327e+04, -2.22148139e+01,
        5.57685656e+03,  2.56361861e+04,  2.59934302e+04,  1.34265919e+03,
        2.65983638e+04,  1.18370810e+04,  1.09183425e+04,  1.43909810e+04,
        3.88503740e+03, -2.33173138e+03,  2.02760295e+04,  1.87326850e+04,
        1.96038255e+04, -9.93766117e+03,  7.47558360e+03,  1.66588172e+04,
        1.16956311e+04,  3.90959788e+03,  7.51416417e+03,  2.06412445e+04,
        1.75336594e+04,  1.33414610e+04,  1.96836251e+03, -4.82880254e+03,
       -3.71099746e+02,  7.66629848e+03,  6.14881071e+03,  1.23687003e+04,
        1.21231323e+04,  1.35733402e+04,  1.83551007e+04,  1.70539379e+04,
        3.02524002e+04, -1.07159337e+04,  6.01304553e+03,  1.14180403e+04,
       -3.34690338e+03,  1.00622718e+04,  2.06918571e+04,  1.73350758e+04,
        2.29914945e+04, -5.50771205e+03,  2.85330575e+04,  1.98521492e+03,
        1.06942632e+04,  1.73230322e+04,  2.49513789e+03, -2.79106246e+03,
        4.39038348e+04,  1.43537643e+04,  1.85450045e+04, -6.16542967e+03,
        1.63629421e+04,  2.79125830e+04,  1.73431638e+04,  3.14934519e+04,
        4.66349422e+03,  1.16135024e+04,  3.35813201e+04,  4.53795643e+03,
        3.03687846e+04,  5.61531696e+03,  7.07152851e+03,  3.18497487e+04,
        1.53772460e+04,  2.67107658e+03,  2.33955389e+04,  2.40698188e+04,
       -4.75904102e+03, -3.45112153e+03,  2.41953892e+04,  1.23263230e+04,
        2.13734730e+04,  4.57129895e+03,  4.79790380e+03,  3.16235504e+04,
        2.16213142e+04,  1.75585724e+04,  5.23850275e+03,  1.36123705e+04,
        1.30277524e+04,  1.95346600e+04,  2.84475997e+03,  2.44038322e+04,
        7.41182434e+03,  1.37544881e+04, -7.02660112e+02,  1.64984689e+04,
        1.11341685e+04,  7.89961221e+03,  2.46923901e+04,  1.76240587e+04,
        8.96931690e+03,  6.48185568e+03, -1.26222354e+03, -1.69218974e+03,
        2.57214042e+03,  1.61502589e+04,  1.28138050e+04,  9.60814203e+03,
        1.63259753e+04,  6.50911760e+03, -9.95762340e+03,  3.42057372e+02,
        1.02190745e+04,  1.67276920e+04,  2.21600760e+04,  4.65490924e+03,
        1.25402256e+04,  1.61009022e+04,  1.90925692e+04,  1.98924863e+04,
        1.10061114e+04,  5.83648195e+03,  1.70692595e+04,  1.45910951e+04,
        2.66219412e+04,  8.48576063e+03,  1.69487687e+04,  2.99414092e+03,
        1.80086237e+04,  1.87362018e+04,  3.54003722e+04, -3.68164534e+02,
        2.37253161e+04,  2.03710982e+04,  1.20499329e+04,  2.47400389e+04,
       -7.52342461e+03,  2.49023341e+04,  6.50286770e+03,  4.85828603e+03,
        9.87788899e+03,  1.49300957e+04,  9.99666283e+03,  2.42095834e+04,
        2.23392492e+03,  3.87104769e+03,  1.84381152e+04,  1.99189419e+04,
        2.88502679e+04,  2.95576858e+03,  1.02739668e+04,  8.61212517e+03,
        3.79011769e+04,  1.18256155e+04,  6.26209526e+03,  2.04705359e+04,
        8.36627864e+03,  1.49795212e+03,  1.96112157e+04,  1.25041470e+04,
        3.33043822e+03,  8.02272601e+03,  6.30190884e+01,  1.38667489e+04,
       -2.66181757e+03,  2.41768675e+02,  1.19301758e+04,  1.17904093e+04,
        2.05007501e+04, -1.68721187e+04,  2.41243743e+04,  3.72096058e+03,
        6.75666851e+03,  1.58471520e+04,  1.94148247e+04,  1.48026594e+03,
        1.31713425e+04,  2.98162207e+04, -4.06148613e+03,  2.23375583e+04,
        1.19660384e+04,  2.43317954e+04,  5.31000984e+04,  1.79335761e+04,
        1.34151909e+04, -4.84794752e+03,  1.87710521e+04,  1.87750241e+04,
        2.12204070e+04,  2.20818928e+04,  1.15592485e+03, -1.04175449e+04,
       -3.17875690e+03,  2.90112257e+04,  1.69765732e+04,  2.60161382e+04,
        2.43898074e+04,  8.39128773e+02,  1.63886547e+04, -7.37612923e+03,
        5.62485527e+03,  6.74975491e+03,  1.22944936e+04,  1.33810595e+04,
        5.46762848e+03,  1.28159578e+04,  1.08639469e+04,  1.03348099e+03,
        1.63132754e+03,  1.44072810e+04,  1.84246280e+04, -2.47909605e+03,
        1.95474530e+04,  1.99982066e+04,  2.25021690e+04,  1.32872506e+04,
        3.28203363e+03,  9.06686227e+03,  2.45238371e+04,  1.33661644e+04,
       -1.14433324e+03,  1.01026574e+04,  2.07442960e+04,  1.01544218e+04,
        1.47549186e+04,  1.99151083e+04,  1.84707327e+04,  2.91420465e+03,
        1.55694684e+04,  1.98216015e+04,  7.22538464e+03,  1.59688520e+04,
        8.82264506e+03, -5.57106659e+03, -3.61325599e+03,  1.66954298e+04,
        6.04080018e+03,  5.64313195e+03,  2.34846318e+04,  4.95252927e+03,
        1.17609615e+04,  2.73460395e+02,  4.62524003e+04,  1.94219123e+04,
        2.67636958e+03,  7.99161270e+02,  3.45401861e+04,  2.59576995e+03,
        1.89228641e+04,  3.42841898e+04,  4.15955289e+03,  3.50195062e+04,
        2.12903857e+04, -5.06633274e+03,  1.72083042e+04,  1.09837661e+04,
        9.35620710e+03,  1.39696572e+04,  2.02365913e+04,  2.01273692e+04,
        4.92787054e+03,  2.75630232e+04,  1.94245386e+04,  1.78279991e+04,
        2.49990535e+04,  2.22097084e+04,  8.43216129e+03,  8.92326875e+02,
        5.49338576e+03,  1.95493159e+04,  1.06933156e+04,  1.17082833e+04,
        2.25027856e+04,  1.45716041e+04,  4.09754472e+03,  9.19128351e+03,
        7.77620947e+03,  1.90177070e+04,  1.96619616e+04,  4.66622488e+04,
        3.67881827e+03,  5.08111933e+03,  5.38303692e+03,  1.24476702e+04,
        1.84716579e+04, -1.31531118e+03,  2.03243896e+04,  1.86851668e+04,
        3.77256684e+03,  1.01734030e+04,  3.06560978e+04,  2.34528445e+04,
        1.83387120e+04,  1.40144687e+04,  1.65206404e+04,  3.96428102e+04,
        1.04960032e+04, -4.15137281e+03,  1.05653584e+04,  3.39106609e+04,
        2.97230029e+04,  1.53912380e+04,  1.11446062e+04,  2.24223425e+04,
        2.68597465e+04,  1.22405459e+04, -2.66504745e+03,  1.20825230e+04,
        2.03241862e+04, -3.92528017e+03,  5.56118121e+03, -5.96828907e+03,
        1.60675081e+04,  8.47372390e+03,  1.66574753e+04, -1.08819482e+04,
        2.20162527e+04,  1.36954201e+04,  1.52982927e+04,  2.86445902e+04,
        2.15596058e+04,  3.72854768e+03,  6.01809567e+03,  2.62442863e+04,
        1.78847640e+04,  2.60260009e+04,  2.21141255e+04, -2.91738379e+03,
        2.81612491e+04,  4.74778654e+04,  8.43042557e+03,  2.17905824e+04,
        1.78861898e+04,  2.81132855e+04,  2.24379049e+04,  4.87238954e+03,
        3.48894848e+04,  1.06339999e+04,  4.88834577e+03,  2.37027527e+04,
        2.77041377e+04,  1.67256495e+04, -1.23767615e+03,  2.45406129e+04,
       -6.73009778e+03, -3.71221165e+03,  1.87645554e+04,  3.45908317e+03,
        9.69112280e+03,  9.79096864e+03, -6.92458269e+03,  3.78754518e+03,
       -3.03440938e+03,  2.51852102e+04, -4.47402223e+03,  3.98987083e+04,
        2.19649914e+04,  1.91549960e+04,  8.60118620e+03,  2.05626620e+04,
       -3.69035608e+03,  1.48537949e+04,  4.04843946e+03,  3.16774267e+04,
        1.70576259e+04,  1.73557586e+03,  3.20480351e+04,  1.56235213e+04,
        5.63278927e+03,  2.32403127e+04,  2.46781892e+04,  7.00114435e+03,
       -3.58186265e+03,  2.21807504e+04, -6.37439516e+03,  7.92053004e+03,
        5.22713559e+03,  3.85323630e+03,  1.61069767e+04,  1.75678237e+04,
        1.81873006e+04,  1.00464058e+04,  8.84400565e+03,  1.72847394e+04,
        5.17322566e+02,  2.56655118e+04,  2.62752003e+04,  1.07591937e+04,
        1.51229379e+04,  1.93102802e+04,  9.00728050e+03,  2.97264751e+03,
       -1.65027255e+03,  1.94821918e+04, -4.41337726e+03,  1.89499469e+04,
        3.01911889e+01,  1.34545696e+04,  5.57215010e+03,  4.50681337e+02,
        3.92060784e+04,  1.32187052e+04,  1.10678007e+04,  1.06085720e+04,
       -4.72793002e+03,  2.23111111e+04,  4.89426890e+03,  1.58645569e+04,
       -4.78889524e+03,  2.36842533e+03,  3.30683841e+04,  4.89473966e+02,
        1.38929836e+04,  2.39812906e+04, -8.08396068e+03,  1.55943106e+04,
        2.26993605e+04,  3.51540327e+04,  3.59982676e+04,  2.57322782e+04,
        2.85020030e+04,  1.23762047e+04,  1.77077060e+04,  6.05218625e+02,
        2.12468521e+04,  1.98247308e+03,  9.75425736e+03,  1.91635760e+04,
        1.11030216e+04,  2.59148087e+04,  1.04661967e+04,  3.23588749e+04,
        3.13942429e+03, -4.08853810e+03,  1.52620187e+04, -1.58765960e+04,
        1.92577674e+04,  2.37955347e+04,  1.31788779e+04,  1.08961748e+04,
        1.81281065e+04,  1.83373496e+04,  9.36618933e+03,  2.37138764e+04,
        1.44469600e+04,  2.24502896e+02,  1.87964914e+04,  1.71068838e+04,
       -1.06577785e+03,  1.00160136e+04,  1.60681170e+04,  3.22133872e+04,
        2.30064316e+04,  6.75391713e+03,  1.60378128e+04,  1.32996181e+04,
       -1.49122318e+02,  8.22587441e+03,  3.26499372e+04,  1.85186487e+04,
       -4.30577382e+03,  1.16753828e+04,  5.05871265e+03,  4.04057843e+04,
        7.68904389e+03,  1.82256266e+04, -2.70297307e+03,  2.81515606e+04,
       -5.27176698e+03,  1.86762020e+04,  1.06676760e+04,  3.22475566e+04,
       -5.52105925e+03,  2.62373303e+04,  1.44986397e+04, -4.73905063e+03,
        3.65046624e+04,  4.02480572e+03,  8.28482915e+03,  1.95814608e+04,
        8.41202780e+03, -3.84857827e+03,  8.40586509e+03,  1.03176404e+04,
        2.43958443e+04,  1.38968987e+04,  3.67708230e+04,  1.72569153e+04,
        6.03696891e+03, -1.43788106e+03,  3.32488578e+04,  1.62319091e+04,
        1.61211529e+04,  4.90873986e+03,  9.34189590e+03, -5.86994130e+03,
        1.42400588e+04,  3.39109284e+02,  3.07350763e+03,  1.75517998e+04,
        1.60393277e+04,  5.81477513e+02, -1.61141500e+03,  2.06980825e+04,
        1.15261767e+04,  3.04436687e+04])
In [46]:
valid_y_pred_df = pd.DataFrame(valid_y_pred, columns = ["Validation_Prediction"])
valid_y_pred_df
Out[46]:
Validation_Prediction
0 2665.300452
1 23944.880671
2 15211.676647
3 24207.972902
4 29601.313799
... ...
641 581.477513
642 -1611.414995
643 20698.082536
644 11526.176711
645 30443.668736

646 rows × 1 columns

2.3.1 Model Evaluation on Validation¶

Get the RMSE for the validation set.

In [47]:
mse_valid = sklearn.metrics.mean_squared_error(valid_y, valid_y_pred)
mse_valid
Out[47]:
380956622.57514906
In [48]:
# As before
# import math

rmse_valid = math.sqrt(mse_valid)
rmse_valid
Out[48]:
19518.110117917386
In [49]:
valid_y.describe()
Out[49]:
count       646.000000
mean      13535.160991
std       23624.770667
min        1105.000000
25%        4708.750000
50%        6750.500000
75%       12827.750000
max      301070.000000
Name: Wage, dtype: float64
In [50]:
# As before:

# If using the dmba package:

# pip install dmba


# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y, valid_y_pred)
Regression statistics

                      Mean Error (ME) : 91.4105
       Root Mean Squared Error (RMSE) : 19518.1101
            Mean Absolute Error (MAE) : 9319.7708
          Mean Percentage Error (MPE) : -43.3987
Mean Absolute Percentage Error (MAPE) : 118.8390

Residuals.

In [51]:
valid_residuals = valid_y - valid_y_pred
valid_residuals.head()
Out[51]:
7882      2962.699548
14555     4930.119329
16210    -8270.676647
15847   -14063.972902
12382    16275.686201
Name: Wage, dtype: float64
In [52]:
import matplotlib.pyplot as plt
plt.hist(valid_residuals, bins = 30)
plt.title("Residuals for Validation")
plt.show()
In [53]:
valid_residuals_df = valid_residuals.to_frame(name = "Wage_Residuals")
valid_residuals_df
Out[53]:
Wage_Residuals
7882 2962.699548
14555 4930.119329
16210 -8270.676647
15847 -14063.972902
12382 16275.686201
... ...
8620 5455.522487
10786 6010.414995
16154 13639.917464
4990 -8092.176711
14654 -14046.668736

646 rows × 1 columns

In [54]:
import matplotlib.pyplot as plt

plt.hist(valid_residuals_df["Wage_Residuals"], bins = 30)
plt.title("Residuals for Validation")
plt.show()

2.3.2 Traditional model evaluation¶

Scikit-learn does not provide a traditional regression model summary.

Use the statsmodels package if one is desired.

conda install -c conda-forge statsmodels

or

pip install statsmodels

In [55]:
import statsmodels.api as sm
In [56]:
model_statsmodels = sm.OLS(train_y, train_X)
In [57]:
results = model_statsmodels.fit()
print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                   Wage   R-squared (uncentered):                   0.515
Model:                            OLS   Adj. R-squared (uncentered):              0.512
Method:                 Least Squares   F-statistic:                              159.1
Date:                Sat, 18 Feb 2023   Prob (F-statistic):                   4.36e-227
Time:                        13:25:45   Log-Likelihood:                         -16865.
No. Observations:                1506   AIC:                                  3.375e+04
Df Residuals:                    1496   BIC:                                  3.380e+04
Df Model:                          10                                                  
Covariance Type:            nonrobust                                                  
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Age                   -852.0938    127.623     -6.677      0.000   -1102.432    -601.756
Balance                151.9979     40.583      3.745      0.000      72.392     231.604
ShotPower              633.7931     85.723      7.393      0.000     465.642     801.944
Aggression              15.2246     36.329      0.419      0.675     -56.037      86.486
Positioning            716.7647     94.926      7.551      0.000     530.562     902.968
Composure              382.7540     81.366      4.704      0.000     223.151     542.357
Preferred Foot_Right  -493.9671   1360.831     -0.363      0.717   -3163.306    2175.372
Body Type_Lean       -8.622e+04   4538.366    -18.999      0.000   -9.51e+04   -7.73e+04
Body Type_Normal     -8.807e+04   4632.299    -19.011      0.000   -9.72e+04    -7.9e+04
Body Type_Stocky     -8.937e+04   4906.471    -18.216      0.000    -9.9e+04   -7.97e+04
==============================================================================
Omnibus:                     2072.969   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           660076.649
Skew:                           7.548   Prob(JB):                         0.00
Kurtosis:                     104.446   Cond. No.                     2.46e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.46e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
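Note [1] flags the missing constant; to fit an intercept like the scikit-learn model does, add a constant column before fitting (a sketch):

train_X_const = sm.add_constant(train_X)
results_const = sm.OLS(train_y, train_X_const).fit()
print(results_const.summary())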

3. New Records¶

New players

In [58]:
new_players_df = pd.read_csv("new_players.csv")
new_players_df
Out[58]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [59]:
new_records_players_pred = model.predict(new_players_df)
new_records_players_pred
Out[59]:
array([29318.86943019, 20847.31938856, 29120.84519698])
In [60]:
# As before
# import pandas as pd

new_records_players_pred_df = pd.DataFrame(new_records_players_pred, columns = ["Prediction"])
new_records_players_pred_df

# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Out[60]:
Prediction
0 29318.869430
1 20847.319389
2 29120.845197
In [61]:
alpha = 0.05
ci = np.quantile(train_residuals, 1 - alpha)
ci
Out[61]:
17225.89938935508
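The value above is the 95th percentile of the training residuals, used below as a symmetric half-width. An equal-tailed alternative (a sketch, not used below) would take one quantile from each tail:

lower_q = np.quantile(train_residuals, alpha / 2)
upper_q = np.quantile(train_residuals, 1 - alpha / 2)
lower_q, upper_q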
In [62]:
def generate_results_confint(preds, ci):
    # Return the predictions plus an empirical interval of +/- ci
    # around each one (ci is the residual quantile computed above)
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci >= 0:
        df["upper"] = preds + ci
        df["lower"] = preds - ci
    else:
        # Negative half-width: flip the signs so "upper" stays the larger bound
        df["upper"] = preds - ci
        df["lower"] = preds + ci

    return df
In [63]:
new_records_players_pred_confint_df = generate_results_confint(new_records_players_pred, ci)
new_records_players_pred_confint_df
Out[63]:
Prediction upper lower
0 29318.869430 46544.768820 12092.970041
1 20847.319389 38073.218778 3621.419999
2 29120.845197 46346.744586 11894.945808
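Since wages cannot be negative, any negative lower bound could be clipped at zero (a sketch):

new_records_players_pred_confint_df["lower"] = new_records_players_pred_confint_df["lower"].clip(lower = 0)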

red_devils.jpeg

4. Non-Linear Regression¶

4.1 Log transformation¶

In [64]:
train_X.dtypes
Out[64]:
Age                       int64
Balance                 float64
ShotPower               float64
Aggression              float64
Positioning             float64
Composure               float64
Preferred Foot_Right      uint8
Body Type_Lean            uint8
Body Type_Normal          uint8
Body Type_Stocky          uint8
dtype: object
In [65]:
train_X_variables_df = pd.DataFrame(train_X.columns.values, columns = ["Variables"])
train_X_variables_df
Out[65]:
Variables
0 Age
1 Balance
2 ShotPower
3 Aggression
4 Positioning
5 Composure
6 Preferred Foot_Right
7 Body Type_Lean
8 Body Type_Normal
9 Body Type_Stocky
In [66]:
train_X_log10 = np.log10(train_X.iloc[:,0:6])
train_X_log10.head()
Out[66]:
Age Balance ShotPower Aggression Positioning Composure
13946 1.414973 1.838849 1.851258 1.806180 1.857332 1.857332
7711 1.342423 1.929419 1.838849 1.799341 1.672098 1.748188
8402 1.397940 1.812913 1.716003 1.342423 1.740363 1.716003
13651 1.414973 1.770852 1.880814 1.903090 1.875061 1.857332
1625 1.447158 1.740363 1.845098 1.740363 1.785330 1.851258
In [67]:
train_X2 = pd.concat((train_X_log10, train_X.iloc[:,6:10]), axis = 1)
train_X2.head()
Out[67]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
13946 1.414973 1.838849 1.851258 1.806180 1.857332 1.857332 1 1 0 0
7711 1.342423 1.929419 1.838849 1.799341 1.672098 1.748188 1 0 0 1
8402 1.397940 1.812913 1.716003 1.342423 1.740363 1.716003 1 1 0 0
13651 1.414973 1.770852 1.880814 1.903090 1.875061 1.857332 0 0 1 0
1625 1.447158 1.740363 1.845098 1.740363 1.785330 1.851258 0 1 0 0
In [68]:
valid_X_log10 = np.log10(valid_X.iloc[:,0:6])
valid_X_log10.head()
Out[68]:
Age Balance ShotPower Aggression Positioning Composure
7882 1.477121 1.770852 1.826075 1.591065 1.799341 1.716003
14555 1.380211 1.740363 1.903090 1.770852 1.851258 1.812913
16210 1.505150 1.785330 1.845098 1.716003 1.875061 1.826075
15847 1.380211 1.778151 1.875061 1.690196 1.851258 1.845098
12382 1.431364 1.826075 1.880814 1.740363 1.892095 1.886491
In [69]:
valid_X2 = pd.concat((valid_X_log10, valid_X.iloc[:,6:10]), axis = 1)
valid_X2.head()
Out[69]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
7882 1.477121 1.770852 1.826075 1.591065 1.799341 1.716003 1 0 1 0
14555 1.380211 1.740363 1.903090 1.770852 1.851258 1.812913 1 0 0 1
16210 1.505150 1.785330 1.845098 1.716003 1.875061 1.826075 1 0 0 1
15847 1.380211 1.778151 1.875061 1.690196 1.851258 1.845098 1 0 1 0
12382 1.431364 1.826075 1.880814 1.740363 1.892095 1.886491 1 0 1 0
In [70]:
train_y2 = np.log10(train_y)
train_y2
Out[70]:
13946    4.352414
7711     3.829947
8402     3.730540
13651    4.137069
1625     4.022057
           ...   
12759    4.345805
17284    3.818556
1016     4.047158
16984    4.066773
16744    4.419972
Name: Wage, Length: 1506, dtype: float64
In [71]:
valid_y2 = np.log10(valid_y)
valid_y2
Out[71]:
7882     3.750354
14555    4.460522
16210    3.841422
15847    4.006209
12382    4.661595
           ...   
8620     3.780821
10786    3.643354
16154    4.535775
4990     3.535800
14654    4.214764
Name: Wage, Length: 646, dtype: float64

4.2 Training the Log Regression Model¶

In [72]:
import sklearn
from sklearn.linear_model import LinearRegression
In [73]:
model2 = LinearRegression()
In [74]:
model2.fit(train_X2, train_y2)
Out[74]:
LinearRegression()
In [75]:
train_y2_pred = model2.predict(train_X2)
train_y2_pred
Out[75]:
array([4.16670693, 3.70430495, 3.5309925 , ..., 4.08446948, 4.07971215,
       4.12972782])
In [76]:
train_y2_pred_df = pd.DataFrame(train_y2_pred, columns = ["Training_Prediction"])
train_y2_pred_df
Out[76]:
Training_Prediction
0 4.166707
1 3.704305
2 3.530993
3 4.228996
4 3.970243
... ...
1501 4.194870
1502 4.084830
1503 4.084469
1504 4.079712
1505 4.129728

1506 rows × 1 columns

In [77]:
print("model intercept: ", model2.intercept_)
print("model coefficients: ", model2.coef_)
print("Model score: ", model2.score(train_X2, train_y2))
model intercept:  -2.8669058840786708
model coefficients:  [-0.74070881  0.14474169  1.64201441  0.09668163  1.83662843  1.15205982
  0.00265055 -0.95251776 -0.96819688 -0.9947823 ]
Model score:  0.4981172727732758

Display the coefficients in a more readable table.

In [78]:
print(pd.DataFrame({"Predictor": train_X2.columns, "Coefficient": model2.coef_}))
              Predictor  Coefficient
0                   Age    -0.740709
1               Balance     0.144742
2             ShotPower     1.642014
3            Aggression     0.096682
4           Positioning     1.836628
5             Composure     1.152060
6  Preferred Foot_Right     0.002651
7        Body Type_Lean    -0.952518
8      Body Type_Normal    -0.968197
9      Body Type_Stocky    -0.994782

4.2.1 Model Evaluation on Training (Log Regression)¶

Get the RMSE for the training set.

In [79]:
mse_train_2 = sklearn.metrics.mean_squared_error(train_y2, train_y2_pred)
mse_train_2
Out[79]:
0.06356591642873398
In [80]:
import math
In [81]:
rmse_train_2 = math.sqrt(mse_train_2)
rmse_train_2
Out[81]:
0.25212282012688575
In [82]:
train_y2.describe()
Out[82]:
count    1506.000000
mean        3.904520
std         0.356004
min         3.110590
25%         3.671404
50%         3.815843
75%         4.092168
max         5.610244
Name: Wage, dtype: float64

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [83]:
import dmba
from dmba import regressionSummary
In [84]:
regressionSummary(train_y2, train_y2_pred)
Regression statistics

                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 0.2521
            Mean Absolute Error (MAE) : 0.1955
          Mean Percentage Error (MPE) : -0.3879
Mean Absolute Percentage Error (MAPE) : 4.9901

Check normality with the Shapiro-Wilk test.

In [85]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y2)
Out[85]:
ShapiroResult(statistic=0.9299247860908508, pvalue=5.641337850676267e-26)
In [86]:
train_residuals_2 = train_y2 - train_y2_pred
train_residuals_2
Out[86]:
13946    0.185707
7711     0.125642
8402     0.199548
13651   -0.091927
1625     0.051814
           ...   
12759    0.150935
17284   -0.266274
1016    -0.037311
16984   -0.012939
16744    0.290244
Name: Wage, Length: 1506, dtype: float64
In [87]:
shapiro(train_residuals_2)
Out[87]:
ShapiroResult(statistic=0.9941757917404175, pvalue=1.2548777704068925e-05)
In [88]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df_2 = pd.DataFrame()

vif_df_2["features"] = train_X.columns
vif_df_2["VIF"] = [variance_inflation_factor(train_X2.values, i) for i in range(train_X2.shape[1])]

print(vif_df_2)
               features          VIF
0                   Age   503.047038
1               Balance   392.228433
2             ShotPower  2230.845164
3            Aggression   224.646809
4           Positioning  2688.086015
5             Composure  1657.030037
6  Preferred Foot_Right     7.725654
7        Body Type_Lean   234.209226
8      Body Type_Normal   421.031902
9      Body Type_Stocky    71.646782

4.3 Predict Validation Set (Log Regression)¶

In [89]:
valid_y2_pred = model2.predict(valid_X2)
valid_y2_pred
Out[89]:
array([3.76367355, 4.15529918, 4.02761799, 4.17061134, 4.27664223,
       3.90982955, 3.86919134, 4.05394802, 4.07575135, 3.58962247,
       3.90714578, 3.99290823, 3.95256648, 3.85413803, 4.19055898,
       4.3770946 , 4.06555802, 4.0018328 , 3.71793422, 4.15389889,
       3.75427973, 4.11372724, 4.04790165, 4.16199766, 3.81913817,
       3.45592807, 3.85618089, 4.11258275, 4.01680881, 4.0156488 ,
       3.57791354, 3.92013811, 3.93333019, 4.15243617, 3.64206255,
       4.05268594, 3.55963358, 3.93038411, 3.65549926, 4.26277204,
       4.2957239 , 3.407724  , 4.0802973 , 3.71756546, 4.01320061,
       4.15470805, 3.43053436, 3.58982582, 3.77744561, 4.05297243,
       4.1264634 , 3.86412457, 3.80646696, 4.10663088, 4.0277011 ,
       4.10682818, 3.4907613 , 3.9930389 , 3.7844055 , 4.07632993,
       3.74574599, 4.13168887, 3.93609012, 4.15447129, 3.8351393 ,
       3.98164635, 3.70036723, 3.83283241, 3.6957984 , 4.32483198,
       3.97680842, 3.64890098, 3.68376159, 4.02159597, 4.04864317,
       4.17927691, 4.07174931, 3.99436296, 3.89363006, 4.35326354,
       4.1946218 , 4.55669742, 3.83271191, 4.02022325, 4.38496503,
       3.99722499, 3.93647576, 3.89577843, 3.99995635, 4.17175375,
       4.29133548, 4.0160841 , 3.98917806, 4.31295093, 4.10110026,
       4.01077971, 4.15171613, 3.93406065, 3.92381907, 4.47762954,
       3.86056946, 4.14341326, 4.10795381, 3.78488352, 4.13184288,
       3.96830151, 4.22809601, 4.17370611, 3.70893568, 4.41445846,
       3.9860071 , 4.03624456, 3.54130701, 3.97681342, 3.93321275,
       3.40274183, 3.77377519, 3.7484716 , 4.31232063, 3.99361937,
       3.709269  , 3.76580361, 4.12616882, 3.62756573, 4.10717141,
       4.4300846 , 4.26733575, 4.42559001, 3.68110372, 3.63376597,
       4.20690577, 3.55348096, 3.58817486, 4.0028743 , 4.31354532,
       3.60417179, 3.46791211, 3.63460305, 3.52797995, 3.80240164,
       3.99804681, 4.07003803, 3.96740854, 3.67782452, 3.7088948 ,
       4.29154294, 3.91117274, 3.68003573, 3.72960344, 4.20021854,
       4.16794103, 3.68047068, 4.24551816, 3.89801283, 3.85297611,
       3.93996911, 3.79674516, 3.5260676 , 4.08968162, 4.04203417,
       4.08326129, 3.31249389, 3.78856385, 4.03608251, 3.97750166,
       3.73639092, 3.89686649, 4.08641956, 4.02202326, 3.95785434,
       3.66121658, 3.47142139, 3.68216025, 3.79270249, 3.74367342,
       3.90196004, 3.99075541, 3.9180192 , 4.06335476, 4.00007676,
       4.27009768, 3.31072324, 3.74364683, 3.93199629, 3.66915503,
       3.86109118, 4.12719138, 4.02302088, 4.12066631, 3.46212247,
       4.24516929, 3.60772261, 3.86266376, 4.0068775 , 3.64991768,
       3.57821803, 4.51164804, 3.86884302, 4.02278378, 3.40892806,
       4.00929465, 4.27441749, 4.04650643, 4.31664087, 3.79570229,
       3.87387915, 4.33204612, 3.73025151, 4.26612984, 3.739114  ,
       3.79062207, 4.35863673, 3.9667509 , 3.65394184, 4.08544516,
       4.21615412, 3.61137378, 3.47775978, 4.18257474, 3.94734872,
       4.09660859, 3.86340944, 3.76653697, 4.28916253, 4.12961482,
       4.01531056, 3.73242103, 3.92353868, 4.01528881, 4.05693391,
       3.64922279, 4.13251124, 3.78851204, 3.96812818, 3.60234792,
       4.02649726, 3.87836136, 3.78527889, 4.17130734, 4.01839531,
       3.87736521, 3.84803085, 3.56809734, 3.53356624, 3.74931027,
       3.93071361, 3.92688701, 3.82006385, 3.99352542, 3.78762902,
       3.3088708 , 3.6389334 , 3.85878766, 3.99660732, 4.15405735,
       3.77264981, 3.96080938, 4.00249334, 4.06460802, 4.07503455,
       3.92041065, 3.77078081, 4.00051175, 4.01299821, 4.22032504,
       3.83147516, 4.01829502, 3.77217185, 4.04027499, 4.04894302,
       4.37009593, 3.66198416, 4.16404301, 4.07537538, 3.81702852,
       4.18393764, 3.38560675, 4.15056587, 3.76938473, 3.66133542,
       3.83298873, 3.96379047, 3.88700295, 4.11912182, 3.76007694,
       3.6960155 , 4.1211959 , 4.04400082, 4.21272665, 3.69404658,
       3.87553024, 3.82367299, 4.44392643, 3.90035329, 3.82670037,
       4.09651587, 3.81539836, 3.62901484, 4.12364698, 3.87491952,
       3.60137225, 3.8373793 , 3.60737549, 4.02177599, 3.50538178,
       3.6237138 , 3.95906308, 3.93927344, 4.0979651 , 3.06117595,
       4.11487814, 3.71943094, 3.7607101 , 4.00325951, 4.07500449,
       3.63075039, 3.90162744, 4.24153272, 3.55020114, 4.11677891,
       3.9519506 , 4.19331594, 4.64906177, 4.08705043, 3.91322188,
       3.4643572 , 4.07389286, 4.01613118, 4.13090872, 4.12330112,
       3.67620387, 3.3069908 , 3.50331655, 4.22838468, 4.06350805,
       4.23418359, 4.16447788, 3.6789202 , 3.97233518, 3.39431118,
       3.83794578, 3.83635319, 3.94286729, 3.9122974 , 3.85503556,
       3.92286221, 3.86449427, 3.61902976, 3.64006414, 3.9710172 ,
       4.06949047, 3.51857609, 4.06396503, 4.08485153, 4.16137828,
       3.96409872, 3.72730472, 3.92930897, 4.18743676, 3.9411608 ,
       3.66906524, 3.85038249, 4.04403355, 3.91918519, 4.06720527,
       4.12111189, 4.01383475, 3.61697527, 3.97950546, 4.08334525,
       3.78934118, 3.959101  , 3.80638364, 3.45757572, 3.49770466,
       4.0626396 , 3.71218504, 3.8107466 , 4.14436282, 3.74999602,
       3.92760265, 3.58786307, 4.53483917, 4.02076808, 3.79226698,
       3.70145343, 4.35696259, 3.60565745, 4.05148097, 4.35182707,
       3.69722441, 4.35512404, 4.10090967, 3.44589242, 3.94972836,
       3.88857535, 3.93084888, 3.98319886, 4.12587376, 4.06091435,
       3.76180862, 4.22510036, 4.04137816, 4.0553892 , 4.19899674,
       4.09319953, 3.78428264, 3.64991702, 3.8155647 , 4.08733891,
       3.9064268 , 3.90981766, 4.119122  , 3.95709316, 3.80575735,
       3.83523119, 3.79857318, 4.045406  , 4.0474044 , 4.56175315,
       3.75433912, 3.82384601, 3.81895492, 3.98409788, 4.04380715,
       3.56992826, 4.08171437, 4.09281733, 3.62415787, 3.89357785,
       4.32357572, 4.16051063, 4.0979448 , 3.97747216, 3.96721426,
       4.43209841, 3.89480226, 3.49403115, 3.85021414, 4.31085473,
       4.26492488, 3.96203775, 3.96303963, 4.13080105, 4.2010063 ,
       3.9434954 , 3.55218052, 3.90594563, 4.11490189, 3.53312359,
       3.86513143, 3.46451959, 3.95465407, 3.80188125, 4.02308478,
       3.27226357, 4.15753532, 3.94023949, 4.01913372, 4.22553447,
       4.07815829, 3.74204974, 3.81970664, 4.2039179 , 4.02616824,
       4.13984176, 4.1229543 , 3.55495757, 4.21420044, 4.56258001,
       3.88078398, 4.1247032 , 4.04834385, 4.27668304, 4.12747814,
       3.76683153, 4.33579598, 3.93189263, 3.72548763, 4.18447111,
       4.23916186, 4.03404372, 3.61674952, 4.17929048, 3.41112525,
       3.49051379, 4.09164834, 3.77737006, 3.82334394, 3.8455094 ,
       3.40492441, 3.70119257, 3.53003948, 4.18349422, 3.47598818,
       4.44548399, 4.07921488, 4.03500231, 3.78440745, 4.10491587,
       3.47807912, 3.92702743, 3.612196  , 4.24948279, 4.05430833,
       3.63814312, 4.31071604, 4.02682925, 3.79727886, 4.17466371,
       4.1356379 , 3.72467379, 3.67319298, 4.14294685, 3.40578843,
       3.8218298 , 3.80856988, 3.69074869, 3.97580986, 4.05361426,
       4.06683376, 3.93669446, 3.90456332, 3.99560591, 3.66866854,
       4.13639514, 4.1891806 , 3.89513874, 4.00730782, 4.06946615,
       3.86813897, 3.75452134, 3.55861187, 4.12137531, 3.44527625,
       4.08483195, 3.5823102 , 4.00056343, 3.81471346, 3.59462776,
       4.41776481, 3.90334739, 3.94811114, 3.93948265, 3.37279971,
       4.09378912, 3.71299135, 3.94671423, 3.49139588, 3.66989535,
       4.34275382, 3.58419407, 3.98424773, 4.15901486, 3.38451984,
       3.92587981, 4.11555836, 4.35077569, 4.33447151, 4.18984644,
       4.20444188, 3.98738318, 4.02013942, 3.6510333 , 4.14290566,
       3.63184601, 3.90346708, 4.05734256, 3.89546729, 4.20709072,
       3.91355501, 4.30099848, 3.65052262, 3.45561065, 3.93542772,
       3.1519519 , 4.09416062, 4.12392262, 3.90775453, 3.85970599,
       4.02417249, 3.98409722, 3.85601473, 4.18887768, 3.9512208 ,
       3.63514546, 4.04710009, 3.94807166, 3.5565326 , 3.84958316,
       3.98819984, 4.3147352 , 4.14870834, 3.79284242, 3.9958641 ,
       3.95117854, 3.69148335, 3.82822223, 4.30014425, 4.06913287,
       3.50836167, 3.8211041 , 3.8298556 , 4.42267047, 3.84580223,
       4.06574794, 3.47712171, 4.18324774, 3.55245306, 4.05191393,
       3.91601428, 4.32217236, 3.42756651, 4.19186723, 3.97225306,
       3.55904222, 4.39020702, 3.78896984, 3.87030635, 4.06807547,
       3.82900839, 3.41678228, 3.79877734, 3.886422  , 4.17168873,
       3.89690427, 4.38305566, 4.01866806, 3.78008935, 3.62017101,
       4.3320427 , 3.96860992, 4.02460072, 3.73670569, 3.83703631,
       3.42525497, 3.93023985, 3.58773105, 3.68050587, 4.01015867,
       4.00341998, 3.66600645, 3.52794768, 4.10312619, 3.88325128,
       4.26631878])
In [90]:
valid_y2_pred_df = pd.DataFrame(valid_y2_pred, columns = ["Validation_Prediction"])
valid_y2_pred_df
Out[90]:
Validation_Prediction
0 3.763674
1 4.155299
2 4.027618
3 4.170611
4 4.276642
... ...
641 3.666006
642 3.527948
643 4.103126
644 3.883251
645 4.266319

646 rows × 1 columns

4.3.1 Model Evaluation on Validation (Log Regression)¶

Get the RMSE for the validation set.

In [91]:
mse_valid_2 = sklearn.metrics.mean_squared_error(valid_y2, valid_y2_pred)
mse_valid_2
Out[91]:
0.06790509555931858
In [92]:
# As before
# import math

rmse_valid_2 = math.sqrt(mse_valid_2)
rmse_valid_2
Out[92]:
0.2605860617134358
In [93]:
valid_y2.describe()
Out[93]:
count    646.000000
mean       3.915194
std        0.373620
min        3.043362
25%        3.672905
50%        3.829336
75%        4.108147
max        5.478667
Name: Wage, dtype: float64
In [94]:
# As before:

# If using the dmba package:

# pip install dmba


# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y2, valid_y2_pred)
Regression statistics

                      Mean Error (ME) : -0.0105
       Root Mean Squared Error (RMSE) : 0.2606
            Mean Absolute Error (MAE) : 0.2020
          Mean Percentage Error (MPE) : -0.7122
Mean Absolute Percentage Error (MAPE) : 5.1776
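To compare this model with the first (untransformed) one on the same footing, the log-scale predictions can be back-transformed to the original wage scale before computing the RMSE (a sketch):

# Back-transform log10 predictions to wages and evaluate on the original scale
valid_y_pred_back = 10 ** valid_y2_pred
rmse_valid_back = math.sqrt(sklearn.metrics.mean_squared_error(valid_y, valid_y_pred_back))
rmse_valid_back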

4.4 Predict New Records (Log Regression)¶

In [95]:
new_players_df
Out[95]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [96]:
new_players_df_log10 = np.log10(new_players_df.iloc[:,0:6])
new_players_df_log10.head()
Out[96]:
Age Balance ShotPower Aggression Positioning Composure
0 1.431364 1.770852 1.875061 1.832509 1.903090 1.880814
1 1.322219 1.623249 1.851258 1.716003 1.778151 1.880814
2 1.278754 1.880814 1.903090 1.342423 1.875061 1.748188
In [97]:
new_players_df_2 = pd.concat((new_players_df_log10, new_players_df.iloc[:,6:10]), axis = 1)
new_players_df_2
Out[97]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 1.431364 1.770852 1.875061 1.832509 1.903090 1.880814 1 0 0 1
1 1.322219 1.623249 1.851258 1.716003 1.778151 1.880814 1 1 0 0
2 1.278754 1.880814 1.903090 1.342423 1.875061 1.748188 0 0 0 1
In [98]:
new_records_players_pred_2 = model2.predict(new_players_df_2)
new_records_players_pred_2
Out[98]:
array([4.25518124, 4.07711107, 4.17585671])
In [99]:
# As before
# import pandas as pd

new_records_players_pred_df_2 = pd.DataFrame(new_records_players_pred_2, columns = ["Prediction"])
new_records_players_pred_df_2

# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Out[99]:
Prediction
0 4.255181
1 4.077111
2 4.175857
In [100]:
alpha = 0.05
ci_2 = np.quantile(train_residuals_2, 1 - alpha)
ci_2
Out[100]:
0.43633371119162057
In [101]:
# as before

def generate_results_confint_2(preds, ci_2):
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci_2 >= 0:
        df["upper"] = preds + ci_2
        df["lower"] = preds - ci_2
    else:
        df["upper"] = preds - ci_2
        df["lower"] = preds + ci_2
    return df
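
As an aside, the interval above uses a single upper-tail quantile of the training residuals, so the band is symmetric around each prediction. A two-sided variant (just a sketch, reusing train_residuals_2 and alpha from the cells above; the function name is illustrative) would take one quantile from each tail:

lower_q, upper_q = np.quantile(train_residuals_2, [alpha / 2, 1 - alpha / 2])

def generate_results_confint_two_sided(preds, lower_q, upper_q):
    df = pd.DataFrame()
    df["Prediction"] = preds
    df["lower"] = preds + lower_q   # lower_q is usually negative
    df["upper"] = preds + upper_q
    return df
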
In [102]:
new_records_players_pred_2_confint_df = generate_results_confint_2(new_records_players_pred_2, ci_2)
new_records_players_pred_2_confint_df
Out[102]:
Prediction upper lower
0 4.255181 4.691515 3.818848
1 4.077111 4.513445 3.640777
2 4.175857 4.612190 3.739523
In [103]:
def exp10(x):
    return 10**x
 
# execute the function
new_records_players_pred_df_2_exp = new_records_players_pred_2_confint_df.apply(exp10)
new_records_players_pred_df_2_exp
Out[103]:
Prediction upper lower
0 17996.217938 49149.030458 6589.425204
1 11942.935088 32617.057784 4372.978686
2 14991.900984 40944.013939 5489.376187

red_devils.jpeg

5. Ridge Regression¶

5.1 Transformation¶

In [104]:
train_X3 = train_X.copy()
train_y3 = train_y.copy()
valid_X3 = valid_X.copy()
valid_y3 = valid_y.copy()

5.2 Training the Ridge Regression Model¶

In [106]:
from sklearn.linear_model import Ridge

model_ridge = Ridge(alpha = 1.0)
model_ridge.fit(train_X3, train_y3)
Out[106]:
Ridge()
In [107]:
train_y3_pred = model_ridge.predict(train_X3)
train_y3_pred
Out[107]:
array([25188.51155671,  3860.14487559, -4188.44855596, ...,
       19245.95968925, 21417.50420002, 23004.19110446])
In [108]:
train_y3_pred_df = pd.DataFrame(train_y3_pred, columns = ["Training_Prediction"])
train_y3_pred_df
Out[108]:
Training_Prediction
0 25188.511557
1 3860.144876
2 -4188.448556
3 29995.100230
4 16109.083754
... ...
1501 26829.388840
1502 22783.529768
1503 19245.959689
1504 21417.504200
1505 23004.191104

1506 rows × 1 columns

In [109]:
print("model intercept: ", model_ridge.intercept_)
print("model coefficients: ", model_ridge.coef_)
print("Model score: ", model_ridge.score(train_X3, train_y3))
model intercept:  20081.83125341736
model coefficients:  [-9.27897306e+02  5.45457726e+01  5.02883492e+02  2.41877047e+01
  6.53156518e+02  4.20617179e+02 -1.86591933e+03 -8.72301759e+04
 -8.86049391e+04 -8.90541217e+04]
Model score:  0.3608967032383109

The coefficients, in an easier-to-read format.

In [110]:
print(pd.DataFrame({"Predictor": train_X3.columns, "Coefficient": model_ridge.coef_}))
              Predictor   Coefficient
0                   Age   -927.897306
1               Balance     54.545773
2             ShotPower    502.883492
3            Aggression     24.187705
4           Positioning    653.156518
5             Composure    420.617179
6  Preferred Foot_Right  -1865.919332
7        Body Type_Lean -87230.175924
8      Body Type_Normal -88604.939142
9      Body Type_Stocky -89054.121691
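
Here alpha = 1.0 is simply the scikit-learn default. Ridge penalises the coefficient magnitudes directly, so it is sensitive to the very different scales of the predictors (ratings on a 0-100 scale next to 0/1 dummies), and the penalty strength is usually tuned rather than fixed. A minimal sketch, under those assumptions, of standardising the features and letting cross-validation choose alpha (this pipeline is not one of the notebook's fitted models):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

ridge_cv = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas = [0.01, 0.1, 1.0, 10.0, 100.0], cv = 5))
ridge_cv.fit(train_X3, train_y3)
ridge_cv.named_steps["ridgecv"].alpha_   # alpha selected by cross-validation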

5.2.1 Model Evaluation on Training (Ridge Regression)¶

Get the RMSE for the training set.

In [111]:
mse_train_3 = sklearn.metrics.mean_squared_error(train_y3, train_y3_pred)
mse_train_3
Out[111]:
308594700.68349236
In [112]:
import math
In [113]:
rmse_train_3 = math.sqrt(mse_train_3)
rmse_train_3
Out[113]:
17566.86371221375
In [114]:
train_y3.describe()
Out[114]:
count      1506.000000
mean      12698.381142
std       21981.278007
min        1290.000000
25%        4692.500000
50%        6544.000000
75%       12364.250000
max      407609.000000
Name: Wage, dtype: float64

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [115]:
import dmba
from dmba import regressionSummary
In [116]:
regressionSummary(train_y3, train_y3_pred)
Regression statistics

                      Mean Error (ME) : -0.0000
       Root Mean Squared Error (RMSE) : 17566.8637
            Mean Absolute Error (MAE) : 8938.7057
          Mean Percentage Error (MPE) : -31.3821
Mean Absolute Percentage Error (MAPE) : 117.8144

Normality

In [117]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y3)
Out[117]:
ShapiroResult(statistic=0.38412952423095703, pvalue=0.0)
In [118]:
train_residuals_3 = train_y3 - train_y3_pred
train_residuals_3
Out[118]:
13946    -2676.511557
7711      2899.855124
8402      9565.448556
13651   -16284.100230
1625     -5588.083754
             ...     
12759    -4657.388840
17284   -16198.529768
1016     -8098.959689
16984    -9755.504200
16744     3296.808896
Name: Wage, Length: 1506, dtype: float64
In [119]:
shapiro(train_residuals_3)
Out[119]:
ShapiroResult(statistic=0.5465858578681946, pvalue=0.0)
In [120]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df_3 = pd.DataFrame()

vif_df_3["features"] = train_X3.columns
vif_df_3["VIF"] = [variance_inflation_factor(train_X3.values, i) for i in range(train_X3.shape[1])]

print(vif_df_3)
               features         VIF
0                   Age   48.741902
1               Balance   33.364691
2             ShotPower  155.270485
3            Aggression   18.482052
4           Positioning  184.558829
5             Composure  115.774481
6  Preferred Foot_Right    7.714600
7        Body Type_Lean   32.118906
8      Body Type_Normal   59.412529
9      Body Type_Stocky   11.108926

5.3 Predict Validation Set (Ridge Regression)¶

In [121]:
valid_y3_pred = model_ridge.predict(valid_X3)
valid_y3_pred
Out[121]:
array([ 2.64972256e+03,  2.52642557e+04,  1.64240635e+04,  2.53329585e+04,
        3.10955126e+04,  1.29528286e+04,  6.70941044e+03,  2.04919113e+04,
        1.67847671e+04, -1.77608909e+03,  1.37232569e+04,  1.48205527e+04,
        1.60211714e+04,  1.10493542e+04,  2.52134500e+04,  3.59934836e+04,
        1.78365347e+04,  1.32628212e+04,  4.48094205e+03,  2.29869164e+04,
        4.96584554e+03,  2.16229676e+04,  2.09228408e+04,  2.42567025e+04,
        1.03654164e+04, -6.30245310e+03,  1.23663735e+04,  2.36282687e+04,
        1.49069805e+04,  1.72923603e+04, -1.41187798e+03,  1.21934953e+04,
        8.99116266e+03,  2.36012248e+04,  2.03620098e+03,  1.97430366e+04,
       -7.88156562e+03,  1.42296144e+04,  3.02349161e+03,  2.98615135e+04,
        3.17155746e+04, -7.45185605e+03,  2.36998503e+04,  2.10088321e+03,
        1.55835468e+04,  2.53623215e+04, -6.50981532e+03, -1.76196097e+03,
        3.62986525e+03,  1.82776299e+04,  2.19955432e+04,  9.56082725e+03,
        7.90816856e+03,  1.95834541e+04,  1.78544369e+04,  2.23809073e+04,
       -4.90366027e+03,  1.54642694e+04,  9.31753268e+03,  2.15480020e+04,
        4.57990856e+03,  2.33541602e+04,  1.11657541e+04,  2.46551380e+04,
        9.84817425e+03,  1.63822774e+04, -1.80807030e+03,  8.69464783e+03,
        2.72496530e+03,  3.43697545e+04,  1.59330718e+04,  9.17012681e+02,
        1.66581487e+03,  1.76629155e+04,  1.83800182e+04,  2.65577927e+04,
        2.01509977e+04,  1.71244262e+04,  8.74705677e+03,  3.58738523e+04,
        2.65122746e+04,  5.00604302e+04,  9.64618370e+03,  1.90684929e+04,
        3.70800559e+04,  1.66337474e+04,  1.47011831e+04,  1.01626727e+04,
        1.62589786e+04,  2.47699032e+04,  3.16840751e+04,  1.74982126e+04,
        1.63608869e+04,  3.31509430e+04,  2.07902205e+04,  1.64597306e+04,
        2.47251371e+04,  1.04628862e+04,  9.57437075e+03,  4.42431055e+04,
        1.26561720e+04,  2.55199260e+04,  2.23256682e+04,  8.14743820e+03,
        2.45764795e+04,  1.48054501e+04,  2.85598619e+04,  2.72748441e+04,
        1.65894359e+03,  4.17885584e+04,  1.22392050e+04,  1.80790225e+04,
       -4.47457671e+03,  1.78499306e+04,  1.31001658e+04, -6.64508399e+03,
        9.19121426e+03,  3.15130742e+03,  3.35840483e+04,  1.53441534e+04,
        3.51981089e+03,  5.40674384e+03,  2.42962197e+04,  1.77612088e+03,
        2.46176244e+04,  4.35989119e+04,  3.41348306e+04,  4.02437061e+04,
        1.66983814e+03,  2.16856079e+03,  2.83205082e+04, -2.41088911e+03,
       -2.36682760e+03,  1.76805901e+04,  3.30181153e+04, -3.41788605e+02,
       -4.42985144e+03,  9.78417778e+02, -3.41201914e+03,  8.68924567e+03,
        1.65537818e+04,  1.72906353e+04,  1.33560660e+04,  3.40153805e+03,
        4.03253553e+03,  3.19964770e+04,  1.20998236e+04, -8.26064462e+02,
        5.55617913e+03,  2.71488508e+04,  2.68272814e+04,  1.14432826e+03,
        2.83124624e+04,  1.15858075e+04,  1.08250398e+04,  1.51075765e+04,
        3.30687121e+03, -2.76195568e+03,  2.12094054e+04,  1.88687090e+04,
        2.04936705e+04, -1.09126099e+04,  7.30930490e+03,  1.67882716e+04,
        1.15685907e+04,  3.14702931e+03,  7.35938167e+03,  2.13262354e+04,
        1.84566066e+04,  1.42514599e+04,  1.65441942e+03, -5.65924431e+03,
       -1.00778426e+03,  7.51549323e+03,  5.67264855e+03,  1.22640084e+04,
        1.33564333e+04,  1.36739531e+04,  1.90346104e+04,  1.79189151e+04,
        3.18978213e+04, -1.20856433e+04,  5.73937028e+03,  1.16894482e+04,
       -3.87423563e+03,  1.00288060e+04,  2.20562707e+04,  1.74938502e+04,
        2.37843794e+04, -6.23145484e+03,  2.94267502e+04,  1.48092211e+03,
        9.80861441e+03,  1.81335960e+04,  1.85554394e+03, -3.94264372e+03,
        4.56839001e+04,  1.39029242e+04,  1.91594303e+04, -7.49391130e+03,
        1.66476248e+04,  2.94697109e+04,  1.76200685e+04,  3.27957846e+04,
        4.31614870e+03,  1.19215192e+04,  3.51546196e+04,  4.07900156e+03,
        3.19670958e+04,  5.19281830e+03,  7.22590148e+03,  3.36253366e+04,
        1.55881255e+04,  2.49167421e+03,  2.48504689e+04,  2.55159442e+04,
       -4.77474929e+03, -4.42846792e+03,  2.59056546e+04,  1.26388624e+04,
        2.15742548e+04,  4.08688167e+03,  5.14702418e+03,  3.31563915e+04,
        2.27265145e+04,  1.79822987e+04,  4.88434587e+03,  1.41505327e+04,
        1.41902561e+04,  2.02603488e+04,  2.05241854e+03,  2.55923095e+04,
        7.44893187e+03,  1.39030195e+04, -2.08543587e+03,  1.76049436e+04,
        1.16541566e+04,  7.37981695e+03,  2.52979520e+04,  1.82726281e+04,
        8.90597823e+03,  6.23015687e+03, -1.55508696e+03, -2.63008981e+03,
        2.13902693e+03,  1.61152051e+04,  1.22182585e+04,  9.64623645e+03,
        1.63859443e+04,  6.14276618e+03, -1.11170706e+04,  2.10258398e+02,
        9.68578502e+03,  1.69246037e+04,  2.31236071e+04,  3.64414735e+03,
        1.25187206e+04,  1.70574873e+04,  1.90115826e+04,  2.04687259e+04,
        1.10294579e+04,  6.15333543e+03,  1.71130713e+04,  1.49853247e+04,
        2.73762381e+04,  8.27010205e+03,  1.71023730e+04,  2.89285921e+03,
        1.92188863e+04,  1.92121483e+04,  3.67284627e+04, -6.07554785e+02,
        2.50715536e+04,  2.16535634e+04,  1.15711762e+04,  2.61352627e+04,
       -8.60774818e+03,  2.55153591e+04,  6.52033108e+03,  3.83335571e+03,
        1.01184551e+04,  1.49659750e+04,  1.01356377e+04,  2.50175606e+04,
        1.87688328e+03,  3.39055929e+03,  1.94403783e+04,  2.04072294e+04,
        2.95313635e+04,  1.93304405e+03,  1.03166764e+04,  8.90465883e+03,
        4.03368608e+04,  1.15774038e+04,  6.91783988e+03,  2.08803031e+04,
        7.72041122e+03,  7.47676798e+02,  2.04555233e+04,  1.13323091e+04,
        2.19823569e+03,  8.45970953e+03, -1.60356018e+02,  1.51640672e+04,
       -3.23890555e+03, -5.41129809e+02,  1.18993186e+04,  1.17328722e+04,
        2.14902835e+04, -1.84628171e+04,  2.48351686e+04,  3.78260306e+03,
        6.02224315e+03,  1.64688353e+04,  1.97621129e+04,  9.78325175e+02,
        1.26317064e+04,  3.04878019e+04, -5.12399165e+03,  2.35210556e+04,
        1.21487080e+04,  2.50761478e+04,  5.54846852e+04,  1.81841465e+04,
        1.40340286e+04, -5.67563080e+03,  1.93854989e+04,  1.91248164e+04,
        2.23340938e+04,  2.27299834e+04,  5.01194389e+02, -1.15661438e+04,
       -3.77023194e+03,  3.00930391e+04,  1.72765208e+04,  2.69024012e+04,
        2.54266342e+04,  7.74975266e+02,  1.64742338e+04, -8.23238191e+03,
        4.84693210e+03,  6.87596180e+03,  1.32836570e+04,  1.37999954e+04,
        5.47063742e+03,  1.32036417e+04,  1.09850059e+04,  7.52434673e+02,
        1.40420709e+03,  1.45256380e+04,  1.92371576e+04, -2.95672651e+03,
        1.96692843e+04,  2.09792274e+04,  2.30787686e+04,  1.31057062e+04,
        2.81278489e+03,  9.19501376e+03,  2.54078773e+04,  1.39056907e+04,
       -1.48998032e+03,  9.43590431e+03,  2.08441471e+04,  1.08298218e+04,
        1.57316997e+04,  2.06514061e+04,  1.89700762e+04,  1.97035528e+03,
        1.53088095e+04,  2.05240124e+04,  6.72708804e+03,  1.60869486e+04,
        8.65837159e+03, -7.18089848e+03, -4.26694157e+03,  1.76287093e+04,
        5.62502207e+03,  4.82475782e+03,  2.42722486e+04,  3.59591071e+03,
        1.17751784e+04, -2.93829377e+02,  4.80477917e+04,  1.93483751e+04,
        2.38544972e+03,  6.92000223e+02,  3.60260896e+04,  1.49467166e+03,
        2.00469931e+04,  3.58091380e+04,  3.22001833e+03,  3.60920606e+04,
        2.20140015e+04, -6.37074328e+03,  1.72028120e+04,  1.08869315e+04,
        9.15671520e+03,  1.41010720e+04,  2.11844206e+04,  2.12523380e+04,
        5.05986592e+03,  2.87168385e+04,  1.98271066e+04,  1.78028470e+04,
        2.55563086e+04,  2.26937924e+04,  7.65953819e+03,  1.14027031e+03,
        4.95056974e+03,  2.05747082e+04,  1.06161080e+04,  1.19372254e+04,
        2.38570744e+04,  1.47834347e+04,  3.92652328e+03,  9.07837512e+03,
        7.70710391e+03,  1.95896382e+04,  1.98514394e+04,  4.87556054e+04,
        3.65470215e+03,  5.10325452e+03,  5.10605410e+03,  1.25633366e+04,
        1.87506241e+04, -2.57565109e+03,  2.06597667e+04,  1.95280430e+04,
        2.73315802e+03,  1.03630664e+04,  3.26718614e+04,  2.44039167e+04,
        1.88165479e+04,  1.41204917e+04,  1.67221228e+04,  4.06206105e+04,
        1.07958563e+04, -4.61966557e+03,  1.04215163e+04,  3.49053711e+04,
        3.06430420e+04,  1.54659743e+04,  1.22507073e+04,  2.29838211e+04,
        2.80689352e+04,  1.24090610e+04, -3.49672644e+03,  1.25655413e+04,
        2.05060764e+04, -4.41494270e+03,  6.23406765e+03, -6.70792866e+03,
        1.56835455e+04,  8.79905672e+03,  1.70954798e+04, -1.20006299e+04,
        2.32764489e+04,  1.39722455e+04,  1.62556336e+04,  2.90585260e+04,
        2.22614474e+04,  3.44931672e+03,  6.56518945e+03,  2.80309304e+04,
        1.89065022e+04,  2.66251273e+04,  2.24008904e+04, -3.46664344e+03,
        2.98405605e+04,  4.99986035e+04,  8.12192118e+03,  2.32108496e+04,
        1.87207048e+04,  2.94864225e+04,  2.30635700e+04,  4.03739664e+03,
        3.64064534e+04,  1.05239389e+04,  4.49418032e+03,  2.46542703e+04,
        2.88372647e+04,  1.76145821e+04, -1.20502662e+03,  2.61923258e+04,
       -7.61830669e+03, -4.50469220e+03,  1.91039876e+04,  2.80859861e+03,
        9.02589549e+03,  9.54643401e+03, -7.87782051e+03,  2.77553648e+03,
       -3.96550627e+03,  2.63663828e+04, -5.16826783e+03,  4.16443749e+04,
        2.25535536e+04,  2.04084092e+04,  8.41693418e+03,  2.05492947e+04,
       -4.22402137e+03,  1.48464457e+04,  3.01046448e+03,  3.28672702e+04,
        1.73061251e+04,  4.24882811e+02,  3.43642557e+04,  1.57630928e+04,
        5.52028075e+03,  2.43579326e+04,  2.52115136e+04,  6.67538351e+03,
       -4.19654223e+03,  2.29247760e+04, -7.01731014e+03,  7.45820243e+03,
        5.35308176e+03,  2.97321929e+03,  1.56642176e+04,  1.85091797e+04,
        1.85282979e+04,  1.03320994e+04,  8.86594198e+03,  1.79533622e+04,
       -3.54957681e+01,  2.62175580e+04,  2.70939411e+04,  1.06984674e+04,
        1.61454223e+04,  1.97848288e+04,  9.28741712e+03,  2.35441972e+03,
       -2.59330284e+03,  2.06180656e+04, -5.18959023e+03,  1.93485397e+04,
       -9.71407235e+02,  1.40957274e+04,  5.84901372e+03,  1.73548488e+02,
        4.09526861e+04,  1.36427385e+04,  1.16803204e+04,  1.11871214e+04,
       -5.65502163e+03,  2.31504982e+04,  3.75366145e+03,  1.57885881e+04,
       -6.34163254e+03,  2.26687996e+03,  3.40943947e+04,  9.97980536e+00,
        1.50604487e+04,  2.44857791e+04, -9.35765870e+03,  1.52885189e+04,
        2.41150212e+04,  3.73041735e+04,  3.70898660e+04,  2.74715032e+04,
        2.87226250e+04,  1.26741468e+04,  1.79040878e+04,  4.60183007e+02,
        2.23109943e+04,  1.78538625e+03,  9.02557010e+03,  2.01956331e+04,
        1.05164678e+04,  2.69007396e+04,  1.03669063e+04,  3.37218711e+04,
        2.89384174e+03, -4.98294404e+03,  1.62006981e+04, -1.68883107e+04,
        2.00918359e+04,  2.40758274e+04,  1.38033618e+04,  1.06915559e+04,
        1.88552352e+04,  1.85701688e+04,  8.58962927e+03,  2.44692265e+04,
        1.44986629e+04, -8.81666640e+02,  1.98617818e+04,  1.72889990e+04,
       -2.38132445e+03,  1.00666818e+04,  1.61490980e+04,  3.37592177e+04,
        2.35518465e+04,  6.69464349e+03,  1.62140691e+04,  1.33486771e+04,
       -7.37100276e+02,  7.64119851e+03,  3.38947073e+04,  1.90228970e+04,
       -5.14680564e+03,  1.16282656e+04,  5.24410508e+03,  4.22266653e+04,
        7.46513445e+03,  1.92853443e+04, -2.96787795e+03,  2.83142670e+04,
       -5.67270244e+03,  1.87006172e+04,  1.06228110e+04,  3.40462412e+04,
       -6.17229472e+03,  2.76193489e+04,  1.46349742e+04, -5.40404007e+03,
        3.87923821e+04,  4.26609772e+03,  9.17012026e+03,  1.98957880e+04,
        8.87239181e+03, -4.78964782e+03,  8.28162370e+03,  1.00649189e+04,
        2.54524277e+04,  1.36966510e+04,  3.83869347e+04,  1.74591469e+04,
        5.91810624e+03, -2.20336374e+03,  3.44185013e+04,  1.68850384e+04,
        1.63163285e+04,  4.39716151e+03,  9.59128260e+03, -7.16391412e+03,
        1.45834243e+04, -9.00687687e+02,  2.64116608e+03,  1.75427015e+04,
        1.66920621e+04,  5.42307771e+02, -3.37066104e+03,  2.15759581e+04,
        1.15776922e+04,  3.14873465e+04])
In [122]:
valid_y3_pred_df = pd.DataFrame(valid_y3_pred, columns = ["Validation_Prediction"])
valid_y3_pred_df
Out[122]:
Validation_Prediction
0 2649.722558
1 25264.255715
2 16424.063481
3 25332.958518
4 31095.512607
... ...
641 542.307771
642 -3370.661042
643 21575.958067
644 11577.692233
645 31487.346512

646 rows × 1 columns

5.3.1 Model Evaluation on Validation (Ridge Regression)¶

Get the RMSE for the validation set.

In [123]:
mse_valid_3 = sklearn.metrics.mean_squared_error(valid_y3, valid_y3_pred)
mse_valid_3
Out[123]:
377856956.3335032
In [124]:
# As before
# import math

rmse_valid_3 = math.sqrt(mse_valid_3)
rmse_valid_3
Out[124]:
19438.543060978187
In [125]:
valid_y3.describe()
Out[125]:
count       646.000000
mean      13535.160991
std       23624.770667
min        1105.000000
25%        4708.750000
50%        6750.500000
75%       12827.750000
max      301070.000000
Name: Wage, dtype: float64
In [126]:
# As before:

# If using the dmba package:

# pip install dmba


# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y3, valid_y3_pred)
Regression statistics

                      Mean Error (ME) : -146.0444
       Root Mean Squared Error (RMSE) : 19438.5431
            Mean Absolute Error (MAE) : 9585.2762
          Mean Percentage Error (MPE) : -42.4877
Mean Absolute Percentage Error (MAPE) : 124.7749

5.4 Predict New Records (Ridge Regression)¶

In [127]:
new_players_df
Out[127]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [128]:
new_records_players_pred_3 = model_ridge.predict(new_players_df)
new_records_players_pred_3
Out[128]:
array([30907.21640178, 21909.60027727, 30847.24918278])
In [129]:
# As before
# import pandas as pd

new_records_players_pred_df_3 = pd.DataFrame(new_records_players_pred_3, columns = ["Prediction"])
new_records_players_pred_df_3

# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Out[129]:
Prediction
0 30907.216402
1 21909.600277
2 30847.249183
In [130]:
alpha = 0.05
ci_3 = np.quantile(train_residuals_3, 1 - alpha)
ci_3
Out[130]:
17045.478790700115
In [131]:
def generate_results_confint_3(preds, ci_3):
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci_3 >= 0:
        df["upper"] = preds + ci_3
        df["lower"] = preds - ci_3
    else:
        df["upper"] = preds - ci_3
        df["lower"] = preds + ci_3
    return df
In [132]:
new_records_players_pred_confint_df_3 = generate_results_confint_3(new_records_players_pred_3, ci_3)
new_records_players_pred_confint_df_3
Out[132]:
Prediction upper lower
0 30907.216402 47952.695192 13861.737611
1 21909.600277 38955.079068 4864.121487
2 30847.249183 47892.727973 13801.770392

red_devils.jpeg

6. Lasso Regression¶

6.1 Transformation¶

In [133]:
train_X4 = train_X.copy()
train_y4 = train_y.copy()
valid_X4 = valid_X.copy()
valid_y4 = valid_y.copy()

6.2 Training the Lasso Regression Model¶

In [134]:
from sklearn import linear_model

model_lasso = linear_model.Lasso(alpha = 0.5, tol = 2, max_iter = 10)
model_lasso.fit(train_X4, train_y4)
Out[134]:
Lasso(alpha=0.5, max_iter=10, tol=2)
In [135]:
train_y4_pred = model_lasso.predict(train_X4)
train_y4_pred
Out[135]:
array([23664.75129721,  7309.43845012,  1682.14259148, ...,
       25528.99628982, 23094.83450666, 23317.79242772])
In [136]:
train_y4_pred_df = pd.DataFrame(train_y4_pred, columns = ["Training_Prediction"])
train_y4_pred_df
Out[136]:
Training_Prediction
0 23664.751297
1 7309.438450
2 1682.142591
3 25612.713391
4 22900.953559
... ...
1501 28429.116142
1502 4783.812410
1503 25528.996290
1504 23094.834507
1505 23317.792428

1506 rows × 1 columns

In [137]:
print("model intercept: ", model_lasso.intercept_)
print("model coefficients: ", model_lasso.coef_)
print("Model score: ", model_lasso.score(train_X4, train_y4))
model intercept:  -86996.1156468611
model coefficients:  [  878.5716549     73.05446438   860.5312622    -13.47353083
   257.41231334    32.58020897 -2105.20659376  3767.57721845
  -518.43939572 -1578.14026859]
Model score:  0.18302583344714407

The coefficients, in an easier-to-read format.

In [138]:
print(pd.DataFrame({"Predictor": train_X4.columns, "Coefficient": model_lasso.coef_}))
              Predictor  Coefficient
0                   Age   878.571655
1               Balance    73.054464
2             ShotPower   860.531262
3            Aggression   -13.473531
4           Positioning   257.412313
5             Composure    32.580209
6  Preferred Foot_Right -2105.206594
7        Body Type_Lean  3767.577218
8      Body Type_Normal  -518.439396
9      Body Type_Stocky -1578.140269
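
With alpha = 0.5 and the very loose stopping settings above (tol = 2, max_iter = 10), the coordinate-descent solver stops very early and none of the coefficients are shrunk exactly to zero, which is normally the main attraction of the lasso. A minimal sketch (assuming the same train_X4 / train_y4 split) of letting cross-validation pick alpha with the default convergence settings:

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(cv = 5, random_state = 666, max_iter = 10000)
lasso_cv.fit(train_X4, train_y4)

print("alpha selected by cross-validation:", lasso_cv.alpha_)
print("coefficients set exactly to zero:", (lasso_cv.coef_ == 0).sum())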

6.2.1 Model Evaluation on Training (Lasso Regression)¶

Get the RMSE for the training set.

In [139]:
mse_train_4 = sklearn.metrics.mean_squared_error(train_y4, train_y4_pred)
mse_train_4
Out[139]:
394480672.6408317
In [140]:
import math
In [141]:
rmse_train_4 = math.sqrt(mse_train_4)
rmse_train_4
Out[141]:
19861.537519558544
In [142]:
train_y4.describe()
Out[142]:
count      1506.000000
mean      12698.381142
std       21981.278007
min        1290.000000
25%        4692.500000
50%        6544.000000
75%       12364.250000
max      407609.000000
Name: Wage, dtype: float64

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [143]:
import dmba
from dmba import regressionSummary
In [144]:
regressionSummary(train_y4, train_y4_pred)
Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 19861.5375
            Mean Absolute Error (MAE) : 10701.9033
          Mean Percentage Error (MPE) : -31.7885
Mean Absolute Percentage Error (MAPE) : 145.9077

Normality

In [145]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y4)
Out[145]:
ShapiroResult(statistic=0.38412952423095703, pvalue=0.0)
In [146]:
train_residuals_4 = train_y4 - train_y4_pred
train_residuals_4
Out[146]:
13946    -1152.751297
7711      -549.438450
8402      3694.857409
13651   -11901.713391
1625    -12379.953559
             ...     
12759    -6257.116142
17284     1801.187590
1016    -14381.996290
16984   -11432.834507
16744     2983.207572
Name: Wage, Length: 1506, dtype: float64
In [147]:
shapiro(train_residuals_4)
Out[147]:
ShapiroResult(statistic=0.5732489824295044, pvalue=0.0)
In [148]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_df_4 = pd.DataFrame()

vif_df_4["features"] = train_X4.columns
vif_df_4["VIF"] = [variance_inflation_factor(train_X4.values, i) for i in range(train_X4.shape[1])]

print(vif_df_4)
               features         VIF
0                   Age   48.741902
1               Balance   33.364691
2             ShotPower  155.270485
3            Aggression   18.482052
4           Positioning  184.558829
5             Composure  115.774481
6  Preferred Foot_Right    7.714600
7        Body Type_Lean   32.118906
8      Body Type_Normal   59.412529
9      Body Type_Stocky   11.108926

6.3 Predict Validation Set (Lasso Regression)¶

In [149]:
valid_y4_pred = model_lasso.predict(valid_X4)
valid_y4_pred
Out[149]:
array([ 1.60888749e+04,  2.28658032e+04,  2.29165150e+04,  2.02857565e+04,
        2.62424904e+04,  6.67204751e+03,  1.81177179e+04,  6.41889523e+03,
        2.69140122e+04, -7.34359467e+02,  5.62294937e+03,  1.95665783e+04,
        9.67738388e+03,  9.83296597e+03,  2.64396039e+04,  4.10728682e+04,
        2.39865085e+04,  2.42018735e+04,  3.60388275e+03,  2.61979797e+04,
       -3.97585033e+02,  2.19805171e+04,  1.57147085e+04,  2.50387186e+04,
        3.52740625e+03, -1.12435316e+04,  1.47258031e+04,  1.39643707e+04,
        2.22710897e+04,  1.26309930e+04,  6.54265365e+03,  1.94652956e+04,
        2.87796041e+04,  2.27009351e+04,  8.02049536e+03,  1.78417813e+04,
        6.09700927e+03,  2.07105507e+03, -5.74940302e+03,  2.73190544e+04,
        3.13813566e+04, -1.75414133e+04,  1.87993583e+04,  9.10826540e+03,
        1.60578541e+04,  2.73955332e+04, -4.58333255e+03,  7.19872200e+03,
        8.04411300e+03,  2.02694381e+04,  2.27172953e+04,  6.53840561e+03,
        2.07694645e+03,  2.72765691e+04,  1.09033157e+04,  1.36896612e+04,
        4.27639965e+02,  1.41412401e+04,  2.05011614e+03,  1.40974560e+04,
        1.93406089e+04,  2.99152387e+04,  2.31520588e+04,  2.47631286e+04,
        3.43348748e+03,  2.36939432e+04,  1.36181766e+04,  2.76957840e+03,
       -2.03776570e+03,  3.47129067e+04,  1.89990075e+04, -1.75023659e+03,
        3.01520525e+02,  1.04627448e+04,  2.62403265e+04,  2.66132919e+04,
        1.41224816e+04,  9.59158193e+03,  1.85603403e+04,  2.66950059e+04,
        2.84043835e+04,  4.37907666e+04,  2.37083249e+03,  1.77106111e+04,
        3.50975309e+04,  1.32232907e+04,  6.92081485e+03,  1.71187991e+04,
        1.76364337e+04,  2.49008126e+04,  2.65661978e+04,  1.80388153e+04,
        1.04781396e+04,  2.81083442e+04,  2.38637678e+04,  1.87402022e+04,
        1.39745881e+04,  2.37620749e+04,  2.04038056e+04,  3.04591318e+04,
        1.59192615e+04,  1.39363677e+04,  3.32839738e+04,  5.53410122e+03,
        3.21413922e+04,  1.80982290e+04,  1.87306118e+04,  2.29560836e+04,
        6.99944165e+03,  4.06076710e+04,  2.33133879e+04,  1.35916222e+04,
        8.33711061e+03,  2.26958161e+04,  1.23702955e+04, -1.27709895e+04,
        5.75901385e+03,  4.88970659e+03,  2.78235806e+04,  1.96004174e+04,
        4.59367928e+02,  3.03766150e+03,  1.88247776e+04, -5.49801173e+03,
        2.54392756e+04,  3.50776464e+04,  2.84189046e+04,  3.34495332e+04,
       -3.74478093e+02,  7.17734196e+03,  2.04475163e+04, -3.90257396e+03,
        8.51705656e+03,  1.20419618e+04,  3.12936781e+04, -1.77652605e+03,
       -8.15480142e+03,  2.55030846e+02, -1.07168959e+04, -3.92066826e+03,
        1.78793280e+04,  2.65900919e+04,  1.79857707e+04, -1.88812413e+03,
       -9.21307823e+01,  2.66310360e+04,  1.42427135e+04,  8.90525482e+03,
        5.06681050e+02,  2.36865409e+04,  2.85556930e+04,  6.71464963e+03,
        3.38685951e+04,  5.50140701e+03,  1.44291766e+04,  2.44361591e+03,
        1.10923943e+04, -4.79461466e+03,  1.73182301e+04,  9.19798343e+03,
        1.98289424e+04, -1.67570552e+04,  4.75537493e+03,  2.17511075e+04,
        3.07270871e+04,  7.54793161e+03,  2.42095051e+04,  2.55495164e+04,
        2.34050760e+04,  1.55194262e+04, -2.44420941e+03, -1.38997798e+04,
        1.51027652e+04, -1.57547089e+03,  5.54208650e+03,  7.37557579e+02,
        2.25591142e+04,  1.15972177e+04,  2.25169608e+04,  1.48101737e+04,
        2.60619906e+04, -7.11529825e+03,  4.41655556e+02,  1.90016427e+04,
        2.06300472e+04,  4.45581728e+03,  2.72346581e+04,  2.26692972e+04,
        1.66070388e+04, -3.34656643e+03,  2.37028391e+04, -6.99913025e+03,
        6.39434938e+03,  4.35908400e+03, -2.06729861e+03, -5.58996134e+02,
        3.86144201e+04,  2.78935579e+03,  9.37102135e+03, -6.97122364e+03,
        1.37569537e+04,  3.73297127e+04,  2.42269396e+04,  3.06369603e+04,
        1.84434748e+04,  1.17156680e+04,  3.35720256e+04, -1.91862641e+03,
        2.16452938e+04, -4.88260091e+03,  4.62975862e+03,  4.09837158e+04,
        1.79503901e+04,  3.64415514e+02,  1.66633135e+04,  3.32423824e+04,
        1.13084337e+04, -1.24435223e+04,  2.96966921e+04,  1.85483604e+04,
        2.14978785e+04,  2.60371649e+04,  1.67770907e+04,  2.17863454e+04,
        2.80850844e+04,  2.05488151e+04,  2.84596132e+03,  1.56514499e+04,
        2.50457505e+04,  1.53038368e+04, -2.10420470e+03,  1.04771486e+04,
        1.08202903e+04,  1.64081414e+04,  5.38762075e+02,  1.94600342e+04,
        1.33156332e+04, -6.62094416e+03,  1.69535139e+04,  1.18997505e+04,
        1.65717592e+04,  1.82224667e+04, -7.01880524e+03, -8.02125460e+03,
        1.12112209e+04,  1.01939541e+04,  1.53144871e+04, -3.68156802e+03,
        6.95021475e+03,  1.07493781e+04, -1.55063158e+04,  7.23664071e+03,
        4.68077893e+03,  1.14209803e+04,  2.44929255e+04,  1.98843194e+04,
        1.74738019e+04,  1.25251769e+04,  1.79250166e+04,  1.97538995e+04,
        1.87939890e+04,  9.63343492e+03,  1.09965060e+04,  2.35214579e+04,
        2.41829656e+04,  2.43581552e+03,  1.64577593e+04,  1.27118398e+04,
        1.63961883e+04,  1.32278591e+04,  2.88379301e+04,  8.91057395e+03,
        2.35511830e+04,  1.76595572e+04,  6.99332839e+02,  2.18978660e+04,
       -1.42876612e+04,  1.58988435e+04,  4.42680677e+03,  3.91184834e+03,
        1.22576103e+04,  7.34186242e+03,  1.09420656e+04,  1.76738855e+04,
        1.33831337e+04, -6.37160171e+03,  3.69193289e+04,  2.15356035e+04,
        2.54768158e+04,  4.52041538e+03,  8.74892408e+03,  9.59495112e+03,
        4.62705394e+04,  1.37422401e+04,  1.28107742e+04,  2.11039601e+04,
        3.17852399e+03, -7.60302984e+03,  2.68399444e+04, -7.46115997e+02,
       -8.61116274e+03,  8.46258726e+03,  3.18954628e+03,  2.44497263e+04,
       -9.00470385e+03,  2.75658141e+03,  1.88641332e+04,  1.48955244e+04,
        1.62578564e+04, -1.98085164e+04,  2.51306804e+04, -3.47888573e+02,
       -3.83135545e+03,  1.40263929e+04,  1.52814340e+04, -6.69429127e+03,
        1.29184076e+04,  3.20941758e+04,  4.06613399e+01,  1.25995833e+04,
        1.91431660e+04,  2.63589862e+04,  3.81945715e+04,  2.34658540e+04,
        5.49099238e+03, -1.40751599e+04,  1.85026469e+04,  1.97590221e+04,
        2.31917144e+04,  1.58724974e+04, -7.87105665e+02, -1.45681988e+04,
       -6.66176802e+03,  3.21785679e+04,  2.45184042e+04,  3.24873271e+04,
        2.56065571e+04,  6.87115371e+03,  5.40968391e+03, -9.39165633e+03,
        1.41752250e+04,  1.72407072e+04,  1.89420324e+04,  5.72073480e+03,
        2.14517630e+04,  7.61933028e+03,  4.86244567e+03,  6.05864416e+03,
        6.23672050e+03,  1.35262705e+04,  2.10523162e+04, -2.44342898e+03,
        3.05601701e+04,  1.53326377e+04,  2.74806751e+04,  1.44554675e+04,
        3.73210805e+03,  2.26703966e+04,  2.43727356e+04,  1.48885122e+04,
        1.10605836e+04,  2.51943385e+03,  1.98587479e+04,  1.67562321e+04,
        3.32498668e+04,  2.52278618e+04,  6.87409114e+03,  6.15607398e+03,
        1.86627647e+04,  1.84280274e+04,  2.28444296e+03,  1.58605452e+04,
        6.46186369e+02, -8.97251493e+03, -3.72358569e+03,  2.27231350e+04,
        4.88534470e+03,  1.11998861e+04,  1.82165976e+04,  1.32147369e+04,
        1.36947600e+04, -5.80173138e+03,  2.69399308e+04,  1.91517860e+04,
        1.93484166e+04,  1.54226866e+04,  3.60286200e+04,  4.47914169e+02,
        1.37536816e+04,  3.46808670e+04, -6.93837922e+03,  2.30809413e+04,
        2.01956160e+04, -1.63418067e+04,  1.01373639e+04,  6.34659526e+03,
        2.18460500e+04,  2.07755790e+04,  2.92644227e+04,  1.63887211e+04,
        4.28153755e+03,  2.67536540e+04,  9.02755879e+03,  2.38623679e+04,
        2.64816130e+04,  2.39299875e+04,  2.00292678e+03,  2.06038174e+03,
        1.12373667e+04,  2.36371608e+04,  1.32021048e+04,  1.51747075e+04,
        1.99552447e+04,  6.79833775e+03,  2.18167688e+04,  9.01418537e+03,
       -9.99352663e+02,  1.67641319e+04,  1.16662632e+04,  4.24323517e+04,
        1.12985080e+04,  1.60572403e+04,  2.44525141e+04,  2.27511222e+04,
        2.08737879e+04, -5.82400428e+03,  2.82174294e+04,  3.11551420e+04,
       -4.69698911e+03,  1.19888693e+04,  3.73228426e+04,  2.13593987e+04,
        2.55976493e+04,  2.24854620e+04,  5.41969148e+03,  2.81381425e+04,
        1.15023507e+04, -3.86262187e+03, -1.19446767e+03,  3.45223441e+04,
        2.58357274e+04,  9.23895335e+03,  2.11859025e+04,  2.86370512e+04,
        1.97167704e+04,  1.40977758e+04,  8.05747461e+03,  1.57975645e+04,
        2.52698900e+04, -2.99220337e+03,  2.27021369e+04, -7.34827734e+03,
        9.42477370e+03,  7.44978350e+03,  1.58720826e+04, -1.95381434e+04,
        2.68251362e+04,  9.47770452e+03,  2.32776944e+04,  3.04295550e+04,
        8.78853438e+03,  7.18708126e+03,  1.62088849e+04,  2.37940551e+04,
        8.85774288e+03,  1.25139539e+04,  2.21013450e+04, -1.78983235e+03,
        2.39546348e+04,  3.58780279e+04,  1.61462498e+04,  1.79346638e+04,
        1.52933406e+04,  3.30664727e+04,  1.53642916e+04,  6.37359615e+03,
        2.86878854e+04,  2.37880945e+04, -4.94550903e+03,  3.00543561e+04,
        2.90182003e+04,  2.52515789e+04,  4.78106890e+03,  3.20769640e+04,
       -7.30460403e+03, -1.25622268e+04,  2.29174987e+04,  1.39012875e+04,
       -3.06930042e+03,  1.11624990e+04, -1.48666371e+04,  3.06589123e+03,
        6.72357884e+02,  2.47464704e+04, -6.77626185e+03,  3.14378530e+04,
        2.07278930e+04,  2.72607806e+04,  1.04486112e+04,  2.56922422e+04,
       -1.02766329e+04,  2.35137439e+04,  9.62746024e+02,  2.79983466e+04,
        2.21216593e+04,  9.20928565e+03,  3.03490668e+04,  1.99912055e+04,
        1.20438670e+04,  3.00985832e+04,  2.28445373e+04,  2.53193368e+03,
        1.28598758e+04,  2.14288904e+04, -1.23016386e+04,  1.16536355e+04,
        1.51943815e+04, -6.95738781e+03,  1.94049429e+04,  1.82144135e+04,
        2.10293876e+04,  2.42393706e+04,  1.88700518e+04,  3.12614580e+03,
        6.68480328e+03,  2.11872995e+04,  1.88734202e+04,  8.00493699e+03,
        1.93347662e+04,  1.96450720e+04,  1.34064991e+04,  1.64068990e+04,
       -6.94743474e+03,  2.56211052e+04, -8.41223183e+03,  2.58562216e+04,
       -6.85420116e+03,  2.64068791e+04,  1.73265562e+04, -4.01980007e+03,
        3.76516850e+04,  1.56277151e+04,  2.01996049e+04,  2.65081750e+04,
       -1.05868509e+04,  2.59287853e+04, -4.96806088e+03,  1.32000523e+04,
       -7.10351398e+03,  2.88477413e+03,  4.24803109e+04, -6.84345017e+03,
        1.98806772e+04,  2.45388299e+04, -8.16594535e+03,  8.36178352e+03,
        1.78555867e+04,  2.28804969e+04,  2.42634875e+04,  2.74984963e+04,
        2.46513003e+04,  2.17666350e+04,  1.44914746e+04,  3.12145127e+03,
        3.04987095e+04, -2.02211535e+03,  1.98677231e+04,  1.96423504e+04,
        8.90696985e+03,  3.07026025e+04,  1.48978513e+04,  2.84487445e+04,
       -4.97523053e+03, -8.66521484e+02,  1.29763444e+04, -7.99536222e+03,
        2.75359156e+04,  2.86476834e+04,  7.53286174e+03,  6.11551173e+03,
        1.73268105e+04,  1.96363681e+04,  1.78118381e+04,  2.83903425e+04,
        4.65014871e+03,  1.02605521e+03,  2.17578611e+04,  5.42991734e+03,
       -1.24515281e+04,  1.86704096e+04,  1.35216557e+04,  3.17605101e+04,
        1.94115740e+04,  9.94811501e+03,  1.81044663e+04,  1.13553424e+04,
        1.05495804e+04,  8.14530718e+03,  2.54853869e+04,  3.39594266e+04,
        3.59410210e+03,  6.61478785e+03,  1.69307324e+04,  4.05745556e+04,
        1.83727980e+04,  2.44659292e+04, -1.38422062e+04,  2.39812839e+04,
        6.30206872e+03,  1.70565923e+04,  1.52304237e+04,  3.31359512e+04,
       -1.06537963e+04,  2.39127251e+04,  1.32528237e+04,  6.41677453e+03,
        2.82429089e+04,  1.84240516e+04,  2.39529817e+04,  1.49468820e+04,
        1.41015024e+04, -3.23017246e+03, -5.59277267e+03,  1.63212296e+04,
        2.42808217e+04,  1.24670307e+04,  3.15458260e+04,  2.17338233e+04,
        8.28983452e+03,  1.41617551e+04,  2.69616326e+04,  1.10601573e+04,
        1.65777186e+04,  5.50576865e+03,  5.76146552e+03, -9.76252428e+03,
        1.17118783e+04,  1.21399049e+03, -1.46113543e+03,  4.59854240e+03,
        1.31693546e+04,  4.99593646e+03, -1.02712169e+04,  1.74376599e+04,
        1.21050714e+04,  1.22957991e+04])
In [150]:
valid_y4_pred_df = pd.DataFrame(valid_y4_pred, columns = ["Validation_Prediction"])
valid_y4_pred_df
Out[150]:
Validation_Prediction
0 16088.874881
1 22865.803237
2 22916.515027
3 20285.756474
4 26242.490423
... ...
641 4995.936456
642 -10271.216895
643 17437.659872
644 12105.071401
645 12295.799079

646 rows × 1 columns

6.3.1 Model Evaluation on Validation (Lasso Regression)¶

Get the RMSE for the validation set.

In [151]:
mse_valid_4 = sklearn.metrics.mean_squared_error(valid_y4, valid_y4_pred)
mse_valid_4
Out[151]:
445185876.78789973
In [152]:
# As before
# import math

rmse_valid_4 = math.sqrt(mse_valid_4)
rmse_valid_4
Out[152]:
21099.428352159204
In [153]:
valid_y4.describe()
Out[153]:
count       646.000000
mean      13535.160991
std       23624.770667
min        1105.000000
25%        4708.750000
50%        6750.500000
75%       12827.750000
max      301070.000000
Name: Wage, dtype: float64
In [154]:
# As before:

# If using the dmba package:

# pip install dmba


# Done earlier. Just for illustration
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y4, valid_y4_pred)
Regression statistics

                      Mean Error (ME) : -213.0518
       Root Mean Squared Error (RMSE) : 21099.4284
            Mean Absolute Error (MAE) : 11072.2334
          Mean Percentage Error (MPE) : -48.3380
Mean Absolute Percentage Error (MAPE) : 152.1841
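
For a quick side-by-side of the wage-scale models evaluated so far, a small sketch that collects the validation RMSEs computed above (rmse_valid_3 for ridge and rmse_valid_4 for lasso; the log-regression RMSE from section 4 is in log10 units, so it is left out here):

pd.DataFrame({
    "Model": ["Ridge", "Lasso"],
    "Validation_RMSE": [rmse_valid_3, rmse_valid_4]
})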

6.4 Predict New Records (Lasso Regression)¶

In [155]:
new_players_df
Out[155]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [156]:
new_records_players_pred_4 = model_lasso.predict(new_players_df)
new_records_players_pred_4
Out[156]:
array([24044.91108935, 14502.47793021, 23347.2433213 ])
In [157]:
# As before
# import pandas as pd

new_records_players_pred_df_4 = pd.DataFrame(new_records_players_pred_4, columns = ["Prediction"])
new_records_players_pred_df_4

# to export
# new_records_players_pred_df.to_csv("whatever_name.csv")
Out[157]:
Prediction
0 24044.911089
1 14502.477930
2 23347.243321
In [158]:
alpha = 0.05
ci_4 = np.quantile(train_residuals_4, 1 - alpha)
ci_4
Out[158]:
19771.787699952376
In [159]:
def generate_results_confint_4(preds, ci_4):
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci_4 >= 0:
        df["upper"] = preds + ci_4
        df["lower"] = preds - ci_4
    else:
        df["upper"] = preds - ci_4
        df["lower"] = preds + ci_4
    return df
In [160]:
new_records_players_pred_confint_df_4 = generate_results_confint_4(new_records_players_pred_4, ci_4)
new_records_players_pred_confint_df_4
Out[160]:
Prediction upper lower
0 24044.911089 43816.698789 4273.123389
1 14502.477930 34274.265630 -5269.309770
2 23347.243321 43119.031021 3575.455621

red_devils.jpeg

7. Elastic Net¶

7.1 Transformation¶

In [161]:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
In [162]:
train_X5 = train_X.copy()
train_y5 = train_y.copy()
valid_X5 = valid_X.copy()
valid_y5 = valid_y.copy()

7.2 Training the Elastic Net Model¶

In [163]:
# Note: make_regression replaces the copied football data with a synthetic
# 100-row dataset (a plain NumPy array with no column names), so the model
# below is fitted on synthetic features and a synthetic target rather than Wage.
train_X5, train_y5 = make_regression(n_features = 10, random_state = 666)
model_elastic = ElasticNet(random_state = 666)
model_elastic.fit(train_X5, train_y5)
Out[163]:
ElasticNet(random_state=666)
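
Note that the make_regression call above discards the copied football data: the model is fitted on a synthetic 100-row problem, and in sections 7.3 and 7.4 it is then asked to predict on the real football features, which is why the feature-name warnings appear there and why the training and validation errors sit on such different scales. For contrast, a minimal sketch of fitting an elastic net on the same football split used by the ridge and lasso models (model_elastic_wage is only an illustrative name, and alpha / l1_ratio are written out at their scikit-learn defaults):

from sklearn.linear_model import ElasticNet

model_elastic_wage = ElasticNet(alpha = 1.0, l1_ratio = 0.5, random_state = 666)
model_elastic_wage.fit(train_X, train_y)
model_elastic_wage.predict(valid_X)   # predictions on the real validation features
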
In [164]:
type(train_X5)
Out[164]:
numpy.ndarray
In [165]:
type(train_y5)
Out[165]:
numpy.ndarray
In [166]:
train_y5_pred = model_elastic.predict(train_X5)
train_y5_pred
Out[166]:
array([-239.89702287,   30.20471618,  146.41102948,   62.35336983,
        177.08753921, -209.59344487, -159.0879793 , -242.38244857,
         54.87544413,  -26.34258741,  231.11945687,  309.32913489,
          5.68659987, -102.16230273,    1.22670383,  -59.4133525 ,
        -85.15133799,  -98.99056433,  -87.74329659,   44.51773042,
        159.3243627 ,  117.5443711 , -164.62941892,  125.34439908,
          8.90255763, -114.7913873 ,  291.75693957, -202.28642913,
        171.71712251,   -3.92022359, -143.08624391,  -89.08740894,
       -106.63397392, -194.63807216, -244.85791217,  122.01268078,
         57.49586543,  175.80148181, -154.49993083,  -41.72967953,
        239.68753594,    8.39230607,   67.89458582,  203.52636571,
        -15.51052836,   97.53529516,   56.9366598 , -368.15724976,
        -26.46591475, -116.38192221, -129.1303425 ,  277.99750842,
       -217.03245809,  110.77806504,  -98.47472604,  -37.8287712 ,
        211.12887163,  -41.54899521, -114.55327138,   33.64785209,
         -3.58899371,   34.09284041,   60.20734999, -252.26679574,
        154.37071339,  -62.95584088,  127.81457702, -220.9094447 ,
       -287.99704049,   42.56795837, -160.11760685,   42.74309451,
         42.01724616,   44.95060293, -171.09356266, -143.00605263,
       -128.79839338, -158.04057611,   23.3411471 , -208.12593485,
        -60.49186376,  -97.30339631,  254.14383238,  -93.8974994 ,
        301.25260918,    9.89642346,  -62.83757572, -130.10642815,
         89.3330917 ,   19.50624336,   -2.98545981,  -72.02696057,
        -39.62413746,  -93.51527208,  141.07669442,   91.93363588,
         51.80867706,  -34.02927417,  -75.6741745 ,   95.76213508])
In [167]:
train_y5_pred_df = pd.DataFrame(train_y5_pred, columns = ["Training_Prediction"])
train_y5_pred_df
Out[167]:
Training_Prediction
0 -239.897023
1 30.204716
2 146.411029
3 62.353370
4 177.087539
... ...
95 91.933636
96 51.808677
97 -34.029274
98 -75.674174
99 95.762135

100 rows × 1 columns

In [168]:
print("model intercept: ", model_elastic.intercept_)
print("model coefficients: ", model_elastic.coef_)
print("Model score: ", model_elastic.score(train_X5, train_y5))
model intercept:  -6.133468616184231
model coefficients:  [24.3065105   1.72597924 32.79853103 11.29839994 57.71459241 60.51954808
 50.97607045 46.07791909 55.01341304 60.31821326]
Model score:  0.8825434233397067
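
For reference, scikit-learn's ElasticNet minimises

    (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2

so with the defaults used here (alpha = 1.0, l1_ratio = 0.5) the penalty is an even blend of the lasso (L1) and ridge (L2) penalties.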

7.2.1 Model Evaluation on Training (Elastic Regression)¶

Get the RMSE for the training set.

In [169]:
mse_train_5 = sklearn.metrics.mean_squared_error(train_y5, train_y5_pred)
mse_train_5
Out[169]:
5476.117116636182
In [170]:
import math
In [171]:
rmse_train_5 = math.sqrt(mse_train_5)
rmse_train_5
Out[171]:
74.00079132439181
In [172]:
np.std(train_y5)
Out[172]:
215.9223978114257

If using the dmba package:

pip install dmba

or

conda install -c conda-forge dmba

Then load the library

import dmba

from dmba import regressionSummary

In [173]:
import dmba
from dmba import regressionSummary
In [174]:
regressionSummary(train_y5, train_y5_pred)
Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 74.0008
            Mean Absolute Error (MAE) : 59.4893
          Mean Percentage Error (MPE) : -12.6902
Mean Absolute Percentage Error (MAPE) : 82.0362

Normality

In [175]:
import numpy as np
from scipy.stats import shapiro

shapiro(train_y5)
Out[175]:
ShapiroResult(statistic=0.9933440089225769, pvalue=0.907919704914093)
In [176]:
train_residuals_5 = train_y5 - train_y5_pred
train_residuals_5
Out[176]:
array([-115.87355838,   20.55225498,  105.29541848,   70.33789465,
         85.10534445, -119.32838286,  -90.17104944,  -94.61626251,
         49.49931428,  -13.57318615,  134.61766684,  121.55259874,
          6.61723475,  -38.86744747,   -8.05409341,   15.02212719,
        -43.36458565,   -2.26900164,  -44.19868389,   31.22462816,
         83.89791053,   62.4307288 ,  -81.24968022,   65.54643487,
          7.05336542,  -61.84098657,  168.23587577,  -74.16622347,
        116.40139242,   22.32224043,  -57.88069139,  -14.05995462,
        -10.96375877,  -95.15789269, -151.64687109,   53.23292199,
         35.31233236,   90.27628713,  -94.06304233,   25.63635158,
        112.03495283,    3.70977227,   33.8162091 ,   88.91654268,
        -31.254995  ,   39.51105913,   29.7090374 , -167.38446484,
         25.89299222,  -48.92758299,  -84.58876275,  148.84402504,
       -114.49031964,   76.21986696,  -62.0456898 ,  -20.23421487,
         89.04950568,   -1.57350087,  -56.13256276,    1.78988291,
        -26.7977081 ,   44.23729077,    0.6689099 , -131.99698805,
         77.07876543,  -25.64571667,   48.77518018,  -94.01834119,
       -157.90006355,   23.29664651,  -77.19246555,   55.45884587,
          5.61274092,   33.90949488,  -80.64159432,  -67.74133301,
        -52.99278782,  -43.69628765,   12.47426646, -108.35823508,
        -17.70919494,  -57.74881675,  135.72993831,  -51.58580114,
        185.07822914,    8.27717559,  -31.0038929 ,  -66.45850356,
         27.76413165,   36.8337016 ,   -6.4878831 ,  -19.69525644,
         -9.7356234 ,  -32.31802903,   81.7506021 ,   51.2448407 ,
         36.8203488 ,  -16.76202468,   14.50928694,   75.27942723])
In [177]:
shapiro(train_residuals_5)
Out[177]:
ShapiroResult(statistic=0.9953042268753052, pvalue=0.9822211861610413)

VIF is not computed here: the make_regression step replaced the football predictors with synthetic features, so the multicollinearity diagnostics used for the earlier models would not carry over.

7.3 Predict Validation Set (Elastic Regression)¶

In [178]:
valid_y5_pred = model_elastic.predict(valid_X5)
valid_y5_pred
C:\Users\byeo\Anaconda3\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but ElasticNet was fitted without feature names
  warnings.warn(
Out[178]:
array([10352.05910353, 12105.44069022, 12255.07200558, 12134.3868721 ,
       13147.61617235, 10243.06386701, 11313.32224233, 11032.09564747,
       12791.92844967,  8908.74659791, 10387.15719227, 11464.07362857,
       10906.60311722, 10608.97750665, 12831.67428293, 14471.71294576,
       12320.35842574, 12052.28358989,  9515.58499339, 12535.33345669,
       10151.37624276, 12170.51937152, 11334.90749268, 12147.7654329 ,
        9794.85597288,  8467.82894156, 10686.93646591, 12472.92965584,
       11986.96867404, 11006.64046664,  9411.96610806, 11355.46901679,
       12196.38602963, 12709.13983713,  8970.73975657, 11031.00275534,
       10464.23712765, 10808.36750752,  8957.50677167, 13141.91778899,
       13576.62603385,  8444.99670939, 11176.32407544,  9902.04169514,
       12083.37100625, 12948.20034059,  8122.52822429,  9119.78689569,
       10840.40275917, 11571.32309035, 12138.15395455, 10521.20361808,
        9795.8311031 , 12483.43309484, 11418.35024528, 11595.3768299 ,
        8463.55143622, 11243.80020398,  9690.64626627, 11292.38087084,
       10254.85268662, 12809.3242474 , 11854.75627534, 12182.38536325,
        9764.61970382, 11595.34411481, 10976.90641622, 10068.74833515,
        9401.38304107, 13545.93970676, 10717.11283639,  9233.12389415,
       10440.76178304, 11229.62273501, 12312.2758096 , 12193.72842343,
       11397.17414624, 10716.94611653, 11713.08899622, 13630.14063311,
       13003.82337793, 15368.58879639,  9761.59161119, 12149.79714464,
       14078.54221453, 11085.17724553, 10721.85414481, 11104.99277646,
       12101.6610874 , 12476.05627164, 13347.31622935, 11723.9698162 ,
       10614.75795541, 13080.32887597, 11901.4709434 , 11840.69140093,
       11466.01742385, 12048.83126325, 11536.50027683, 14044.74049154,
        9899.55146143, 11766.23338592, 12373.55297498,  9485.01408948,
       12152.60672717, 11011.72078307, 12035.81933079, 12523.09261077,
       10044.86099962, 13679.61907932, 12119.19426514, 11555.1113625 ,
        9195.81690137, 11059.13270076, 11291.64505669,  8129.2754332 ,
        9453.67669073, 10438.16486942, 13496.484549  , 11713.15135056,
        9565.65875894, 10127.63828516, 11211.25847167,  8730.46393144,
       11464.48238118, 13992.42592753, 12680.9219006 , 13920.38074777,
        9548.5274636 ,  8889.91531046, 11755.34749597,  9125.06800726,
        9121.30409183, 10794.18810957, 13345.27496844,  8959.37264987,
        8194.45885751,  9090.65510799,  8559.99299338,  9895.62335948,
       11191.77979577, 12314.54463745, 11448.29771524,  9160.53378102,
        9156.10331445, 12909.91829704, 10859.31073101, 10070.28886388,
        9204.80176842, 12178.93635149, 12830.08269192,  9876.61311764,
       13882.44657459, 10469.0559072 , 10375.89184879, 10857.74933212,
       11568.49556747,  8363.05241204, 11656.17277285, 11071.47688265,
       11524.56418679,  7851.60146357,  9697.27927879, 12111.67380638,
       12002.93630653, 10001.97324197, 11498.83815181, 12221.96067723,
       10924.12307689, 11378.00042588,  9283.17597152,  8606.10583163,
        9978.55170431,  9736.45350326,  9331.35800275, 10215.8221707 ,
       12561.04023405, 11107.05957564, 11581.62588877, 10522.64186496,
       12798.60222989,  8118.80733405,  9598.93135701, 11417.7759794 ,
       10414.19744897, 10118.95520032, 12950.08051071, 12077.26462182,
       11444.33982293,  8538.05715103, 13014.41184658,  9024.42345157,
       10577.58323199, 10824.73004598,  8959.15622165,  9534.11628987,
       14733.93676839,  9631.46855787, 10754.48023068,  8142.09307727,
       11148.02085278, 13941.64646087, 11565.43884066, 13632.35722122,
       10554.12980061, 10130.09110264, 12728.87927866, 10110.23841177,
       12104.07799694,  9800.3472128 ,  9850.34126804, 14501.1381244 ,
       11385.27675579,  8917.82199634, 11768.23753761, 13311.39581984,
        9936.86455837,  9051.59327363, 12573.05898577, 11060.07161974,
       11961.10514464, 11593.11161466,  9832.93259552, 12492.82402189,
       12478.95319134, 11784.66170994,  9689.1675903 , 10556.50773932,
       12768.75650148, 11101.65276867,  8914.20026887, 11882.03457769,
        9807.98637367, 11884.85828607,  9353.19048249, 11777.48423306,
       10074.44938971, 10161.97810891, 12287.35960953, 10605.23222895,
       10756.1904017 , 11152.64050382,  8884.06056659,  8762.31273865,
       10444.39992637, 10519.90297765, 10906.78681814,  9985.65571416,
       11319.41523747, 10209.02622837,  7658.66885234,  9269.95118605,
       10448.99818541, 10603.75092298, 13021.66467132, 10954.2872683 ,
       12007.53026353, 11106.66541944, 12126.05183304, 11685.5686955 ,
       10998.1895253 ,  9623.88504223, 10508.47416779, 11800.44219586,
       12881.95959803, 10789.33550312, 11631.22698844, 10747.60354427,
       11463.68774696, 11661.68780507, 13924.20457518,  9646.73076185,
       13060.49409883, 11382.35314513,  9819.18444803, 12535.12146765,
        8264.88828232, 12032.72088507,  9717.07893981,  9153.073164  ,
       10643.48540969, 11153.4817259 , 10863.82417413, 11734.51629157,
       10654.21132183,  9483.72026011, 12745.64816369, 11552.51173636,
       12392.31644634,  9622.34634275, 10327.2845096 ,  9968.3212924 ,
       14767.08367069, 10554.27373735, 10679.54042274, 12297.78539478,
       10002.89591453,  8926.61180675, 12718.1118005 , 10785.41060887,
        8502.21359245, 10366.58046846,  8884.93706053, 12147.61982551,
        8540.44564116,  9568.2296874 , 11746.79516293, 11430.09696906,
       12074.74703264,  6932.90638676, 11701.53820379,  9766.42011926,
        9550.75765528, 11329.59590192, 12041.51349855,  8855.99875912,
       10934.37560517, 12682.62925959,  9339.65202606, 11163.52806206,
       11360.37063843, 13075.13477091, 15304.40279194, 12964.4931213 ,
       10251.0757177 ,  8353.81395642, 11892.10702604, 12101.18330051,
       12283.95447085, 12177.44514571, 10274.42117972,  7787.09898337,
        8360.20426381, 13258.87581233, 12259.53250973, 13261.76167787,
       13009.05294426,  9592.56570422, 10496.55630757,  8346.13613016,
       11649.80655946, 10817.12275202, 11465.9587795 , 10268.55390299,
       11161.97842404, 10441.96006572,  9975.27498703,  8804.15908119,
        8864.31669839, 11066.63683321, 12374.1176849 ,  8449.18106432,
       12293.82925208, 11690.10907326, 12722.85412665, 11682.91566374,
       10062.56357869, 11832.5551472 , 13531.39222571, 11224.24260704,
        9967.5888109 , 10225.76056374, 12254.50939371, 11769.97854703,
       13159.58425392, 12658.84164177, 10618.01908829,  9012.0754734 ,
       11188.27492229, 11814.17928184, 10130.3801664 , 10889.78883617,
        9807.04788674,  8530.3089394 ,  8433.50363483, 12328.56321367,
        9267.9742972 , 11014.41671047, 12236.77969102, 10760.96746641,
       11495.89420143,  8713.55480494, 14191.9125366 , 11861.89064184,
       10929.37940198,  9852.93814648, 13532.24549513,  8975.74784107,
       11950.50064859, 13922.89647561,  9206.669179  , 13480.56012939,
       11824.1702701 ,  8428.99332366, 10280.62786165, 10761.27083749,
       11888.92838727, 11404.13078614, 12723.24363395, 10964.34822915,
        9976.53263002, 12442.64820484, 11015.48251927, 11818.12250712,
       13363.67455221, 12625.19385989,  9711.13491179,  9743.68379298,
       11080.78110568, 11987.06609374, 11063.27187731, 10935.10091144,
       11435.63005305, 10717.67824628, 10668.92756333,  9858.02067596,
        9759.03261078, 11732.14060299, 10895.69565095, 15186.2229698 ,
       10398.41151905, 10895.39018223, 11743.04593773, 12011.69427789,
       11413.31377267,  8920.81218442, 12100.64853017, 12142.685605  ,
        8669.94306172, 10863.2632574 , 13842.42029289, 12295.14895607,
       12776.83092699, 11786.50872889, 11064.71718986, 13794.76181894,
       10717.36420635,  8572.96565498,  9687.30808491, 13931.27732276,
       12540.69681497, 11384.08531118, 11878.75612326, 11876.75124707,
       11945.06229399, 11271.79918663,  8897.90743476, 10371.35173223,
       12415.0924782 ,  8863.66537364, 11616.70456178,  8607.00880655,
       10581.63405677,  9719.38670152, 11385.44425347,  7473.04793819,
       13022.48744777, 11396.35349915, 11882.66746664, 13018.94540275,
       10950.96681061, 10055.39106399, 10465.06848606, 13096.17154433,
       10664.92517367, 11846.69946251, 11920.63555508,  8848.83944817,
       11676.45409336, 14680.38429109, 11238.51192127, 12484.13530186,
       11623.46332236, 13694.25618208, 11908.87620961, 10457.15701494,
       12367.81699053, 11605.92559126,  9909.44232993, 12650.86097834,
       12503.35314765, 11754.6511254 ,  9521.76644665, 12951.99522976,
        8300.87555145,  8129.5321238 , 12482.49265728, 10729.17565091,
        9704.17957748, 10108.01493561,  8040.68551843,  9459.58537563,
        8875.6548931 , 12545.83598965,  8120.71763799, 13381.77370333,
       11457.21936677, 12149.08982721, 10114.53871164, 12604.93300858,
        8408.49356248, 10492.56132372,  8800.45969315, 12572.50219925,
       12052.52606993,  9558.83283931, 13302.16550826, 12140.71994083,
       10561.94991113, 13252.63075741, 12361.08207867,  9293.17586843,
       10773.29517317, 12575.22191801,  8150.31728555, 10158.31730872,
       10498.21211411,  9294.11718509, 11512.22492837, 12086.95191814,
       11905.58951226, 12056.21979514, 11783.55043798, 10381.87249236,
        9614.51471077, 11982.33598525, 11563.37704148, 10979.23136041,
       11899.11945894, 12037.11036324, 11125.77367072, 10586.4840021 ,
        8706.95283703, 13711.34439512,  8070.54339278, 12049.8503929 ,
        9027.25424165, 12085.50782813, 10705.38046147,  8717.6062918 ,
       14370.0903199 , 11037.30558497, 11487.37476506, 11322.71420903,
        8033.28239732, 12002.20514018,  9264.05699996, 11466.35549296,
        8942.4513159 ,  9025.74462322, 13902.44326097,  8893.78707802,
       11567.77257344, 12551.09105077,  8402.89606438, 10930.78799594,
       11648.91542521, 12917.56854583, 12713.15335777, 11814.65224847,
       12958.12124841, 12158.03067397, 11179.37678291,  9278.12924163,
       12640.02543852,  8546.49534972, 11348.10962545, 11072.99129439,
       11023.97942909, 12601.76271511, 11570.97552015, 13563.39314908,
        8976.9451948 ,  8887.12914342, 11164.12997477,  7349.59801436,
       12297.6377403 , 12003.38087139, 11077.40788284, 10127.2837807 ,
       10759.46387563, 10649.04178383, 10796.07837745, 13048.04448806,
       10464.6499887 ,  9455.36973093, 11106.23404978, 10223.40089743,
        8676.97838846, 10877.06833973, 10786.72331017, 13232.87131753,
       12560.65966486, 10129.57959853, 11762.86563682, 11213.62427294,
       10723.83478156, 10323.19110693, 12887.51244029, 12165.12147262,
        9067.67034924,  9815.91069961, 10922.34608037, 14307.36191062,
       10816.18660139, 12202.81147321,  8499.88212335, 12350.43882535,
        9230.93735655, 11370.61611376, 11200.67622136, 13770.79491507,
        8330.06371747, 11742.99846727, 11046.85775595,  9297.32581794,
       12799.89408687, 10574.64858958, 11423.39836545, 11315.90253714,
       10307.65262234,  7893.2361795 ,  9870.0870919 , 10938.20075926,
       12948.14320036, 10814.20903472, 14047.97609045, 11302.41790933,
        9946.63323734,  9500.22293708, 14010.0340521 , 10282.6559869 ,
       11970.80623827,  9874.12347864,  9971.93112213,  8493.84065595,
       10388.7205472 ,  8966.89519204,  9195.00539556, 10897.77393112,
       11058.80252548,  9508.46221552,  8710.25832147, 12042.05594272,
       10221.49974833, 12455.93087572])
In [179]:
valid_y5_pred_df = pd.DataFrame(valid_y5_pred, columns = ["Validation_Prediction"])
valid_y5_pred_df
Out[179]:
Validation_Prediction
0 10352.059104
1 12105.440690
2 12255.072006
3 12134.386872
4 13147.616172
... ...
641 9508.462216
642 8710.258321
643 12042.055943
644 10221.499748
645 12455.930876

646 rows × 1 columns

7.3.1 Model Evaluation on Validation (Elastic Regression)¶

Get the RMSE for the validation set.

In [180]:
mse_valid_5 = sklearn.metrics.mean_squared_error(valid_y5, valid_y5_pred)
mse_valid_5
Out[180]:
529003242.8305753
In [181]:
# As before
# import math

rmse_valid_5 = math.sqrt(mse_valid_5)
rmse_valid_5
Out[181]:
23000.070496208817
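As a quick cross-check, the same RMSE can be computed by hand from the residuals. A minimal sketch, assuming valid_y5 and valid_y5_pred as above and numpy imported earlier as np (residuals_valid_5 and rmse_manual_5 are illustrative names):

# Manual RMSE: mean of the squared residuals, then the square root
residuals_valid_5 = np.asarray(valid_y5) - np.asarray(valid_y5_pred)
rmse_manual_5 = np.sqrt(np.mean(residuals_valid_5 ** 2))
rmse_manual_5  # should match rmse_valid_5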
In [182]:
np.std(valid_y5)
Out[182]:
23606.478158841604
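The standard deviation of the validation target is a useful benchmark: a naive model that always predicts the mean has an RMSE equal to this value. With an RMSE of about 23000 against a standard deviation of about 23606, the elastic net appears to improve only modestly on that baseline for this validation set. A minimal sketch of the baseline comparison, assuming valid_y5 as above (baseline_pred_5 and rmse_baseline_5 are illustrative names):

# Naive baseline: predict the validation mean for every record
baseline_pred_5 = np.repeat(np.mean(valid_y5), len(valid_y5))
rmse_baseline_5 = math.sqrt(sklearn.metrics.mean_squared_error(valid_y5, baseline_pred_5))
rmse_baseline_5  # equals np.std(valid_y5); compare against rmse_valid_5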
In [183]:
# As before: regressionSummary comes from the dmba package
# (install with pip install dmba if needed).
# Imported earlier; repeated here for reference:
# import dmba
# from dmba import regressionSummary

regressionSummary(valid_y5, valid_y5_pred)
Regression statistics

                      Mean Error (ME) : 2498.2800
       Root Mean Squared Error (RMSE) : 23000.0705
            Mean Absolute Error (MAE) : 9245.0033
          Mean Percentage Error (MPE) : -69.6001
Mean Absolute Percentage Error (MAPE) : 93.7624
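For reference, the statistics reported above can be reproduced directly with NumPy using the usual actual-minus-predicted error convention. A minimal sketch, assuming valid_y5 and valid_y5_pred as above and no zero values in valid_y5 (errors_5 is an illustrative name):

errors_5 = np.asarray(valid_y5) - np.asarray(valid_y5_pred)

np.mean(errors_5)                                        # Mean Error (ME)
np.sqrt(np.mean(errors_5 ** 2))                          # Root Mean Squared Error (RMSE)
np.mean(np.abs(errors_5))                                # Mean Absolute Error (MAE)
np.mean(errors_5 / np.asarray(valid_y5)) * 100           # Mean Percentage Error (MPE)
np.mean(np.abs(errors_5) / np.asarray(valid_y5)) * 100   # Mean Absolute Percentage Error (MAPE)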

7.4 Predict New Records (Elastic Regression)¶

In [184]:
new_players_df
Out[184]:
Age Balance ShotPower Aggression Positioning Composure Preferred Foot_Right Body Type_Lean Body Type_Normal Body Type_Stocky
0 27 59 75 68 80 76 1 0 0 1
1 21 42 71 52 60 76 1 1 0 0
2 19 76 80 22 75 56 0 0 0 1
In [185]:
#new_players_df = make_regression(n_features = 1, random_state = 666)
In [186]:
new_records_players_pred_5 = model_elastic.predict(new_players_df)
new_records_players_pred_5
C:\Users\byeo\Anaconda3\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but ElasticNet was fitted without feature names
  warnings.warn(
Out[186]:
array([13308.1034435 , 11652.42206796, 11237.31927071])
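The UserWarning above is harmless: it appears because the elastic net was fitted on data without feature names (typically a plain NumPy array), while the prediction here is made on a DataFrame that carries column names. One way to keep the inputs consistent and avoid the warning, sketched here as an optional step rather than a required one:

# Hypothetical re-run with the underlying array; same predictions, no feature-name warning
model_elastic.predict(new_players_df.values)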
In [187]:
new_records_players_pred_5_df = pd.DataFrame(new_records_players_pred_5, columns = ["Prediction"])
new_records_players_pred_5_df
Out[187]:
Prediction
0 13308.103443
1 11652.422068
2 11237.319271
In [188]:
alpha = 0.05

# 95th percentile of the training residuals, used below as a symmetric half-width around each prediction
ci_5 = np.quantile(train_residuals_5, 1 - alpha)
ci_5
Out[188]:
122.20585214240019
In [189]:
def generate_results_confint_5(preds, ci_5):
    # Return the point predictions with a symmetric interval of +/- ci_5 around each one
    df = pd.DataFrame()
    df["Prediction"] = preds
    if ci_5 >= 0:
        df["upper"] = preds + ci_5
        df["lower"] = preds - ci_5
    else:
        df["upper"] = preds - ci_5
        df["lower"] = preds + ci_5
    return df
In [190]:
new_records_players_pred_confint_df_5 = generate_results_confint_5(new_records_players_pred_5, ci_5)
new_records_players_pred_confint_df_5
Out[190]:
Prediction upper lower
0 13308.103443 13430.309296 13185.897591
1 11652.422068 11774.627920 11530.216216
2 11237.319271 11359.525123 11115.113419
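The interval above is symmetric, applying the single 95th-percentile offset ci_5 on both sides of each prediction. A possible alternative, sketched here rather than taken from the analysis above, is a two-sided empirical interval that uses both tails of the training-residual distribution and can therefore be asymmetric (assuming train_residuals_5, alpha and new_records_players_pred_5 as above; lower_off_5 and upper_off_5 are illustrative names):

# Offsets from both tails of the training residuals (lower_off_5 is typically negative)
lower_off_5, upper_off_5 = np.quantile(train_residuals_5, [alpha / 2, 1 - alpha / 2])

pd.DataFrame({
    "Prediction": new_records_players_pred_5,
    "lower": new_records_players_pred_5 + lower_off_5,
    "upper": new_records_players_pred_5 + upper_off_5,
})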

red_devils.jpeg