
I'd like to choose the best algorithm for my problem. I found some solutions, but I don't understand which R-squared value is correct.

For this, I split my data into training and test sets, and I printed the two different R-squared values below.

```python
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lineer = LinearRegression()
lineer.fit(x_train, y_train)
lineerPredict = lineer.predict(x_test)

scoreLineer = r2_score(y_test, lineerPredict)  # First R-squared

model = sm.OLS(lineerPredict, y_test)
print(model.fit().summary())                   # Second R-squared
```

The first R-squared result is -4.28.

The second R-squared result is 0.84.

But I don't understand which value is correct.

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don't. Our best friend is always the relevant documentation, combined with simple experiments. So...

Although scikit-learn's `LinearRegression()` (i.e. your 1st R-squared) is fitted by default with `fit_intercept=True` (docs), this is **not** the case with statsmodels' `OLS` (your 2nd R-squared); quoting from the docs:

> An intercept is not included by default and should be added by the user. See `statsmodels.tools.add_constant`.

Keeping this important detail in mind, let's run some simple experiments with dummy data:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

# dummy data:
y = np.array([1, 3, 4, 5, 2, 3, 4])
X = np.array(range(1, 8)).reshape(-1, 1)  # reshape to column

# scikit-learn:
lr = LinearRegression()
lr.fit(X, y)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
#                  normalize=False)

lr.score(X, y)
# 0.16118421052631582

y_pred = lr.predict(X)
r2_score(y, y_pred)
# 0.16118421052631582

# statsmodels
# first artificially add intercept to X, as advised in the docs:
X_ = sm.add_constant(X)
model = sm.OLS(y, X_)  # X_ here
results = model.fit()
results.rsquared
# 0.16118421052631593
```

For all practical purposes, these two values of R-squared produced by scikit-learn and statsmodels are **identical**.

Let's go a step further, and try a scikit-learn model without an intercept, but using the artificially "intercepted" data `X_` we have already built for use with statsmodels:

```python
lr2 = LinearRegression(fit_intercept=False)
lr2.fit(X_, y)  # X_ here
# LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
#                  normalize=False)

lr2.score(X_, y)
# 0.16118421052631593

y_pred2 = lr2.predict(X_)
r2_score(y, y_pred2)
# 0.16118421052631593
```

Again, the R-squared is **identical** with the previous values.

So, what happens when we "accidentally" forget to account for the fact that statsmodels' `OLS` is fitted without an intercept? Let's see:

```python
model3 = sm.OLS(y, X)  # X here, i.e. no intercept
results3 = model3.fit()
results3.rsquared
# 0.8058035714285714
```

Well, an R-squared of 0.80 is indeed very far from the 0.16 returned by a model *with* an intercept, and arguably this is exactly what has happened in your case.

So far so good, and I could easily finish the answer here; but there is indeed a point where this harmonious world breaks down: let's see what happens when we fit both models without an intercept and with the initial data `X`, to which we have not artificially added any intercept. We have already fitted the `OLS` model above and got an R-squared of 0.80; what about a similar model from scikit-learn?

```python
# scikit-learn
lr3 = LinearRegression(fit_intercept=False)
lr3.fit(X, y)  # X here
lr3.score(X, y)
# -0.4309210526315792

y_pred3 = lr3.predict(X)
r2_score(y, y_pred3)
# -0.4309210526315792
```

Ooops...! What the heck??

It seems that scikit-learn, when computing the `r2_score`, always *assumes* an intercept, either explicitly in the model (`fit_intercept=True`) or implicitly in the data (the way we have produced `X_` from `X` above, using statsmodels' `add_constant`); digging a little online reveals a Github thread (closed without a remedy) where it is confirmed that the situation is indeed like that.
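We can verify that assumption numerically by reproducing `r2_score` by hand: it is always 1 - SS_res/SS_tot, where SS_tot is taken around the *mean* of `y` (the baseline of an intercept-only model), so a no-intercept fit can easily do worse than that baseline and go negative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

y = np.array([1, 3, 4, 5, 2, 3, 4])
X = np.arange(1, 8).reshape(-1, 1)

lr3 = LinearRegression(fit_intercept=False).fit(X, y)
y_pred3 = lr3.predict(X)

# r2_score's baseline is the mean of y, i.e. an intercept-only model:
ss_res = np.sum((y - y_pred3) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)  # ~-0.43, matching r2_score(y, y_pred3)
```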

Let me clarify that the discrepancy I have described above has **nothing** to do with your issue: in your case, the real issue is that you are actually comparing apples (a model with intercept) with oranges (a model without intercept).

So, why does scikit-learn not only fail in such an (admittedly *edge*) case, but, even when the fact emerges in a Github issue, treat it with *indifference*? (Notice also that the scikit-learn core developer who replies in the above thread casually admits that "*I'm not super familiar with stats*"...)

The answer goes a little beyond coding issues, such as the ones SO is mainly about, but it may be worth elaborating a little here.

Arguably, the reason is that the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on *interpretative* models, and it has little use in machine learning contexts, where the emphasis is clearly on *predictive* models; at least AFAIK, and beyond some very introductory courses, I have never (I mean *never*...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular *machine learning* introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):

> In particular when using a **test set**, it's a bit unclear to me what the R^2 means.

with which I certainly concur.

As for the edge case discussed above (to include or not an intercept term?), I suspect it would sound really irrelevant to modern deep learning practitioners, where the equivalent of an intercept (bias parameters) is always included by default in neural network models...

See the accepted (and highly upvoted) answer in the Cross Validated question *Difference between statsmodel OLS and scikit linear regression* for a more detailed discussion along these last lines...