Is expanding a dataset an efficient way to improve the performance of a machine learning algorithm?
I am currently working on a machine learning project, and I am a newbie.
My dataset has around 30k rows, and I am not sure that is enough for the model to perform well.
These 30k rows were collected for one particular type of product, but my database contains several products.
My question: if I collect data for all the products and include it in my dataset, will my model be more accurate, or will it just add wasted time to the process?
For example, if I want to predict whether an email sent by a certain type of person is spam, is it efficient to collect all the emails in my inbox from every type of person, or should I only collect the emails from that type of person?
Thanks a lot for any answers!
database machine-learning performance-testing data-science database-performance
Post your code and your current algorithm to get help in the right direction.
– Ali
Jan 4 at 0:52
I haven't decided on an algorithm yet, because I will test all the classifiers in scikit-learn and choose the most efficient one. For information: my dataset is loaded from Oracle into a CSV. I have around 200 columns mixing numeric and categorical values. My target is binary, like Spam/NotSpam. Do you think I should try both kinds of datasets, the small one and a larger one?
– Katy
Jan 4 at 7:19
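For a table like the one described (roughly 200 columns mixing numeric and categorical values, with a Spam/NotSpam target), a common scikit-learn pattern is to put a `ColumnTransformer` in front of the classifier. A hypothetical sketch; every column name and value below is invented purely to illustrate the pattern:

```python
# Hypothetical sketch: scale numeric columns, one-hot encode categorical
# ones, then fit any scikit-learn classifier on the combined features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "length": [120, 45, 300, 80],
    "num_links": [5, 0, 12, 1],
    "sender_type": ["external", "internal", "external", "internal"],
    "label": ["Spam", "NotSpam", "Spam", "NotSpam"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["length", "num_links"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sender_type"]),
])
clf = make_pipeline(pre, LogisticRegression())
clf.fit(df.drop(columns="label"), df["label"])
print(clf.predict(df.drop(columns="label")))
```

The same pipeline object can then be dropped into cross-validation when comparing classifiers.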
asked Jan 3 at 20:15
Katy
1 Answer
It really depends on the dataset, the model, the actual problem/question, and the desired accuracy/error. Taking your spam email example, naive Bayes is a very common method for this type of problem and requires a relatively small amount of data to obtain 'reasonable' accuracy (reasonable being customer/stakeholder defined).
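To show how little code such a naive Bayes baseline takes, here is a minimal sketch with scikit-learn; the six messages and their labels are invented stand-ins, not real data:

```python
# Hypothetical mini spam example: bag-of-words counts fed into
# multinomial naive Bayes via a single pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "limited offer click here",
    "meeting moved to friday", "lunch tomorrow?",
    "free money guaranteed", "project status update",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["click here for a free prize"])[0])  # 1 (spam)
```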
30,000 samples is a pretty decent-sized dataset for shallow-learning classifiers (although of course that does depend on the quality of the data, e.g. the number of missing values, errors, outliers, etc.). I wouldn't worry about adding additional products until you've seen how the classifier performs with the data you already have.
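Before worrying about dataset size, a quick data-quality audit with pandas is cheap. A sketch, using a tiny invented frame as a stand-in for the real ~30k-row table:

```python
# Sketch of a quick data-quality audit: missing values, outliers,
# and duplicate rows. The frame below is invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "length": [120.0, np.nan, 300.0, 80.0],  # one missing value
    "num_links": [5, 0, 120000, 1],          # one suspicious outlier
})

print(df.isna().sum())        # missing values per column
print(df.describe())          # min/max help spot outliers and data errors
print(df.duplicated().sum())  # exact duplicate rows
```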
So I would start with the single product and try models which work well with smaller amounts of data, such as naive Bayes (NB) and support vector classifiers (SVCs), and see whether the resulting accuracy is adequate for your application. If it isn't, you have two options: more data, or other modelling approaches. For more data, you could try adding other products incrementally and assessing the resulting accuracy. If it isn't apparent which products would be most useful to add, you could use a clustering model, e.g. k-means, to select (samples from) the products. You could also try simulating more data for the product of interest, perhaps using the other products as the basis for the simulation. The main thing is to make sure you have a baseline accuracy from the single product, so you can assess whether the additional data is actually helping. For other modelling approaches, you could try ensembling (just weighted-averaging your SVC and NB models would be a good place to start) or different algorithms entirely.
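That workflow (score NB and an SVC individually as baselines, then a weighted soft-voting ensemble of the two) can be sketched as follows; `make_classification` is a synthetic stand-in for the real single-product data:

```python
# Sketch: baseline NB and SVC scores via cross-validation, then a
# weighted soft-voting ensemble combining their predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

nb = GaussianNB()
svc = SVC(probability=True, random_state=0)  # probabilities needed for soft voting
ensemble = VotingClassifier([("nb", nb), ("svc", svc)],
                            voting="soft", weights=[1, 2])

results = {}
for name, clf in [("NB", nb), ("SVC", svc), ("NB+SVC", ensemble)]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

The weights here are arbitrary; in practice you would tune them (or use a held-out validation set) rather than guess.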
Remember that for smaller datasets the risk of over-fitting and the susceptibility to outliers are increased, so careful feature selection/engineering and good discipline with dev/test and validation sets are all the more important.
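One common way to keep that discipline is to hold the test set out first and only then split off a validation set, so the test set is touched exactly once at the end. A sketch on synthetic data:

```python
# Sketch: 60/20/20 train/validation/test split in two steps, so the
# test set is never used during model selection.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)            # 20% final test set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 20%

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```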
Thank you Chris for your very clear answer. I'll follow your advice: start with the smallest dataset and either enlarge it with the other products or change the modelling approach. This really confirms that size alone is not a good measure.
– Katy
Jan 4 at 8:59
edited Jan 4 at 8:30
answered Jan 4 at 8:23
Chris