Is expanding a dataset an efficient way to improve the performance of a Machine Learning algorithm?





I am currently working on a Machine Learning project, and I am a newbie.

My dataset has around 30k rows, and I am not sure that is enough for the model to perform well.

These 30k rows were collected for one particular type of product, but my database contains several products.

My question: if I collect data for all the products and include it in my dataset, will my model be more accurate? Or will it just add wasted time to the process?

For example, if I want to predict whether an email sent by a particular type of person is spam, is it more effective to collect all the emails in my inbox from all types of people, or only the emails from that type of person?

Thanks a lot for any answers!

































  • Post your code and your current algorithm so we can point you in the right direction.

    – Ali
    Jan 4 at 0:52











  • I haven’t decided on an algorithm yet, because I will test all the classifiers in scikit-learn and choose the most efficient one. For information, my dataset is loaded from Oracle into a CSV. I have around 200 columns mixing numeric and categorical values. My target is Spam/NotSpam. Do you think I should try both kinds of datasets, the small one and a larger one?

    – Katy
    Jan 4 at 7:19
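The comparison described in the comment above (several scikit-learn classifiers scored the same way on mixed numeric/categorical columns) could be sketched roughly as follows. The column names and values here are made up, and only two classifiers are shown for brevity:

```python
# Sketch: score several classifiers identically on mixed numeric/categorical
# data. The DataFrame below is a stand-in for the real 200-column CSV.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "num_a": [1.0, 2.5, 0.3, 4.1] * 25,           # hypothetical numeric column
    "cat_b": ["x", "y", "x", "z"] * 25,           # hypothetical categorical column
    "target": ["Spam", "NotSpam", "Spam", "NotSpam"] * 25,
})
X, y = df.drop(columns="target"), df["target"]

# Scale numeric columns, one-hot encode categorical ones.
prep = ColumnTransformer([
    ("num", StandardScaler(), ["num_a"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
])

scores = {}
for clf in [LogisticRegression(), DecisionTreeClassifier(random_state=0)]:
    score = cross_val_score(make_pipeline(prep, clf), X, y, cv=5).mean()
    scores[type(clf).__name__] = score
    print(type(clf).__name__, round(score, 3))
```

Putting the preprocessing inside the pipeline ensures each cross-validation fold fits the scaler and encoder only on its own training split.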




















database machine-learning performance-testing data-science database-performance
















asked Jan 3 at 20:15









Katy



























1 Answer
































It really depends on the dataset, the model, the actual problem/question, and the desired accuracy/error. Taking your spam email example, naive Bayes is a very common method for this type of problem and requires a relatively small amount of data to obtain 'reasonable' accuracies (reasonable being customer/stakeholder-defined).
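As a concrete illustration of the naive Bayes approach, here is a minimal sketch (the emails below are made up; a real dataset is needed for meaningful accuracy figures):

```python
# Minimal naive Bayes spam sketch with toy, made-up emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap meds click here",          # spam examples
    "meeting at 10 tomorrow", "please review the attached report",  # ham examples
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize click now"]))  # -> ['spam']
```

Even with tiny samples like this, naive Bayes produces sensible decisions, which is why it is a common first baseline for text classification.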



30,000 samples is a pretty decent-sized dataset for shallow-learning classifiers (although that does, of course, depend on the quality of the data, e.g. the number of missing values, errors, outliers, etc.). I wouldn't worry about adding additional products until you've seen how the classifier performs with the data you already have.



So I would start with the single product, try models which work well with smaller amounts of data, such as Naive Bayes (NB) and Support Vector Classifiers (SVCs), and see whether the resulting accuracy is adequate for your application. If it isn't, you have two options: more data, or other modelling approaches. For more data, you could try adding other products incrementally and assessing the resulting accuracy. If it isn't apparent which products would be most useful to add, you could use a clustering model, e.g. K-means, to select products (or samples from them). You could also try simulating more data for the product of interest, perhaps using the other products as the basis for the simulation. The main thing is to establish a baseline accuracy from the single product, so you can assess whether the additional data is helping. For other modelling approaches, you could try ensembling – a simple weighted average of your SVC and NB models would be a good place to start – or different algorithms entirely.
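The baseline-then-augment loop and the NB+SVC ensemble can be sketched as follows, with synthetic data standing in for the single-product and other-product rows:

```python
# Sketch: keep a baseline score from the "single product" data, then check
# whether extra rows (other products) and an ensemble actually help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in: 30% plays the role of the single-product dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_base, X_extra, y_base, y_extra = train_test_split(
    X, y, train_size=0.3, random_state=0)

nb = GaussianNB()
svc = SVC(probability=True, random_state=0)

# 1. Baseline: NB on the original data only.
baseline = cross_val_score(nb, X_base, y_base, cv=5).mean()

# 2. "More data": append the extra rows and re-score against the baseline.
X_all = np.vstack([X_base, X_extra])
y_all = np.concatenate([y_base, y_extra])
augmented = cross_val_score(nb, X_all, y_all, cv=5).mean()

# 3. "Other models": soft voting averages the NB and SVC class probabilities.
ensemble = VotingClassifier([("nb", nb), ("svc", svc)], voting="soft")
ensemble_score = cross_val_score(ensemble, X_base, y_base, cv=5).mean()

print(f"baseline={baseline:.3f} augmented={augmented:.3f} "
      f"ensemble={ensemble_score:.3f}")
```

The comparison against the fixed baseline is the important part: if `augmented` does not beat `baseline`, the extra products are adding noise rather than signal.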



Remember, for smaller datasets, the risk of over-fitting and the susceptibility to outliers are increased, so careful feature selection/engineering and good discipline with dev/test and validation sets are all the more important.
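One concrete form that discipline can take: hold out a final test set up front, and do any feature selection inside the cross-validation pipeline so the held-out folds never leak into the selection step. A sketch on synthetic data:

```python
# Sketch: small dataset with many features (like the ~200-column case above).
# Feature selection lives *inside* the pipeline to avoid leakage.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

# Hold out a final test set first; tune and compare only on the dev rows.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# SelectKBest is refit per fold, so selection never sees the validation fold.
model = make_pipeline(SelectKBest(f_classif, k=20), SVC())
cv_score = cross_val_score(model, X_dev, y_dev, cv=5).mean()

model.fit(X_dev, y_dev)
test_score = model.score(X_test, y_test)
print(f"cv={cv_score:.3f} test={test_score:.3f}")
```

If the cross-validation score is much higher than the test score, that gap is itself a warning sign of over-fitting on the small dataset.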
































    Thank you Chris for your very clear answer. I'll follow your advice and start with the smallest dataset and either enlarge it with the other products or change the modeling approach. This really confirms that size is not a good measure.

    – Katy
    Jan 4 at 8:59












edited Jan 4 at 8:30

























answered Jan 4 at 8:23









Chris



























