Using Standardization in sklearn pipeline
I am using StandardScaler to normalize my dataset; that is, I turn each feature into a z-score by subtracting the mean and dividing by the standard deviation.
I would like to use StandardScaler within sklearn's Pipeline, and I am wondering how exactly the transformation is applied to X_test. In the code below, when I run pipeline.predict(X_test), it is my understanding that StandardScaler and SVC() are run on X_test. But what exactly does StandardScaler use as the mean and the standard deviation: the ones from X_train, or does it compute them from X_test alone? If, for instance, X_test consisted of only 2 samples, the normalization would look a lot different than if I had normalized X_train and X_test together, right?
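from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features first, then fit the classifier on the scaled data.
steps = [('scaler', StandardScaler()), ('model', SVC())]
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)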
scikit-learn normalization pipeline
edited Jan 4 at 14:19 by desertnaut
asked Jan 4 at 7:52 by Tartaglia
1 Answer
Sklearn's Pipeline applies transformer.fit_transform() when pipeline.fit() is called and transformer.transform() when pipeline.predict() is called. So in your case, StandardScaler is fitted to X_train, and the mean and standard deviation learned from X_train are then used to scale X_test.
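One way to check this behaviour is to inspect the fitted scaler inside the pipeline. The sketch below is only an illustration: the synthetic dataset from make_classification and the variable names stats_from_train and same_predictions are assumptions, not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the asker's dataset (an assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", SVC())])
pipeline.fit(X_train, y_train)

scaler = pipeline.named_steps["scaler"]
model = pipeline.named_steps["model"]

# The fitted scaler stores the per-feature mean and standard deviation of X_train.
stats_from_train = (
    np.allclose(scaler.mean_, X_train.mean(axis=0))
    and np.allclose(scaler.scale_, X_train.std(axis=0))
)

# pipeline.predict(X_test) matches transforming X_test with those
# training statistics and then calling the fitted model directly.
manual_pred = model.predict(scaler.transform(X_test))
same_predictions = np.array_equal(pipeline.predict(X_test), manual_pred)

print(stats_from_train, same_predictions)
```

Both checks should come out true, since predict() only calls transform() on the scaler and never refits it on X_test.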
Scaling with the statistics of X_train alone would indeed give different results from scaling with the statistics of X_train and X_test combined. How large the difference is depends on how much the distribution of X_train differs from that of the combined data. However, if X_train and X_test are randomly partitioned from the same original dataset and are of a reasonable size, their distributions will probably be similar.
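As a rough illustration of both points, the sketch below compares a scaler fitted on the training split with one fitted on all the data, and then shows what happens if a scaler is fitted on a 2-sample test set by itself. The random data and the names mean_gap, tiny_test and tiny_scaled are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 1000 samples, 3 features, drawn from one distribution.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Statistics from the training split versus statistics from all the data.
train_scaler = StandardScaler().fit(X_train)
all_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
mean_gap = np.abs(train_scaler.mean_ - all_scaler.mean_).max()
print("largest difference in fitted means:", mean_gap)

# Fitting a scaler on a 2-sample test set alone is degenerate: within each
# feature, the two values are always mapped to -1 and +1.
tiny_test = X_test[:2]
tiny_scaled = StandardScaler().fit_transform(tiny_test)
print(tiny_scaled)

# Using the training statistics keeps the tiny test set on the training scale.
print(train_scaler.transform(tiny_test))
```

With a random split of this size the fitted means differ only slightly, while the 2-sample case shows why statistics computed from the test set alone would be uninformative.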
Regardless, it is important to treat X_test as though it is out of sample, so that performance on it is a (hopefully) reliable estimate of performance on unseen data. Since you don't know the distribution of unseen data, you should pretend you don't know the distribution of X_test either, including its mean and standard deviation.
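The same principle carries over to cross-validation: if the whole pipeline is passed to cross_val_score, the scaler is refitted on the training part of each fold, so the held-out part never contributes to the mean or standard deviation. This is a minimal sketch on synthetic data (an assumption), not something taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("model", SVC())])

# Each of the 5 folds clones the pipeline and fits it on that fold's
# training part only; the held-out part is scaled with those statistics.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```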
Very happy to hear that, that makes perfect sense. Thank you so much for the explanation Chris!! – Tartaglia, Jan 4 at 19:55
@Tartaglia glad to be able to help. – Chris, Jan 4 at 20:20
answered Jan 4 at 16:59 by Chris, edited Jan 4 at 17:05