Is expanding a dataset an efficient way to improve the performance of a Machine Learning algorithm?





I am currently working on a Machine Learning project, and I am a newbie.

My dataset has around 30k rows, and I am not sure that is enough for the model to perform well.

These 30k rows were collected for one particular type of product, but my database contains several products.

My question: if I collect data for all the products and include it in my dataset, will my model be more accurate? Or will it just add wasted time to the process?

For example, if I want to predict whether an email sent by a particular type of person is spam, is it more effective to collect all the emails in my inbox from all types of people, or only the emails from that type of person?

Thanks a lot for any answers!

































  • Post your code and your current algorithm so we can point you in the right direction.

    – Ali
    Jan 4 at 0:52











  • I haven’t decided on an algorithm yet, because I will test all the classifiers in scikit-learn and choose the most efficient one. For information, my dataset is loaded from Oracle into a CSV. I have around 200 columns mixing numeric and categorical values. My target is Spam/NotSpam. Do you think I should try both kinds of datasets, the small one and a larger one?

    – Katy
    Jan 4 at 7:19
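The comparison described in the comment above (several scikit-learn classifiers scored the same way on mixed numeric/categorical columns) could be sketched roughly as follows. The column names and values here are made up, and only two classifiers are shown for brevity:

```python
# Sketch: score several classifiers identically on mixed numeric/categorical
# data. The DataFrame below is a stand-in for the real 200-column CSV.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "num_a": [1.0, 2.5, 0.3, 4.1] * 25,           # hypothetical numeric column
    "cat_b": ["x", "y", "x", "z"] * 25,           # hypothetical categorical column
    "target": ["Spam", "NotSpam", "Spam", "NotSpam"] * 25,
})
X, y = df.drop(columns="target"), df["target"]

# Scale numeric columns, one-hot encode categorical ones.
prep = ColumnTransformer([
    ("num", StandardScaler(), ["num_a"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_b"]),
])

scores = {}
for clf in [LogisticRegression(), DecisionTreeClassifier(random_state=0)]:
    score = cross_val_score(make_pipeline(prep, clf), X, y, cv=5).mean()
    scores[type(clf).__name__] = score
    print(type(clf).__name__, round(score, 3))
```

Putting the preprocessing inside the pipeline ensures each cross-validation fold fits the scaler and encoder only on its own training split.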




















database machine-learning performance-testing data-science database-performance
















asked Jan 3 at 20:15









Katy



























1 Answer
































It really depends on the dataset, the model, the actual problem/question, and the desired accuracy/error. Taking your spam email example, naive Bayes is a very common method for this type of problem and requires a relatively small amount of data to obtain 'reasonable' accuracies (reasonable being customer/stakeholder-defined).
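As a concrete illustration of the naive Bayes approach, here is a minimal sketch (the emails below are made up; a real dataset is needed for meaningful accuracy figures):

```python
# Minimal naive Bayes spam sketch with toy, made-up emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap meds click here",          # spam examples
    "meeting at 10 tomorrow", "please review the attached report",  # ham examples
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize click now"]))  # -> ['spam']
```

Even with tiny samples like this, naive Bayes produces sensible decisions, which is why it is a common first baseline for text classification.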



30,000 samples is a pretty decent-sized dataset for shallow-learning classifiers (although that does, of course, depend on the quality of the data, e.g. the number of missing values, errors, outliers, etc.). I wouldn't worry about adding additional products until you've seen how the classifier performs with the data you already have.



So I would start with the single product, try models which work well with smaller amounts of data, such as Naive Bayes (NB) and Support Vector Classifiers (SVCs), and see whether the resulting accuracy is adequate for your application. If it isn't, you have two options: more data, or other modelling approaches. For more data, you could try adding other products incrementally and assessing the resulting accuracy. If it isn't apparent which products would be most useful to add, you could use a clustering model, e.g. K-means, to select products (or samples from them). You could also try simulating more data for the product of interest, perhaps using the other products as the basis for the simulation. The main thing is to establish a baseline accuracy from the single product, so you can assess whether the additional data is helping. For other modelling approaches, you could try ensembling – a simple weighted average of your SVC and NB models would be a good place to start – or different algorithms entirely.
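The baseline-then-augment loop and the NB+SVC ensemble can be sketched as follows, with synthetic data standing in for the single-product and other-product rows:

```python
# Sketch: keep a baseline score from the "single product" data, then check
# whether extra rows (other products) and an ensemble actually help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in: 30% plays the role of the single-product dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_base, X_extra, y_base, y_extra = train_test_split(
    X, y, train_size=0.3, random_state=0)

nb = GaussianNB()
svc = SVC(probability=True, random_state=0)

# 1. Baseline: NB on the original data only.
baseline = cross_val_score(nb, X_base, y_base, cv=5).mean()

# 2. "More data": append the extra rows and re-score against the baseline.
X_all = np.vstack([X_base, X_extra])
y_all = np.concatenate([y_base, y_extra])
augmented = cross_val_score(nb, X_all, y_all, cv=5).mean()

# 3. "Other models": soft voting averages the NB and SVC class probabilities.
ensemble = VotingClassifier([("nb", nb), ("svc", svc)], voting="soft")
ensemble_score = cross_val_score(ensemble, X_base, y_base, cv=5).mean()

print(f"baseline={baseline:.3f} augmented={augmented:.3f} "
      f"ensemble={ensemble_score:.3f}")
```

The comparison against the fixed baseline is the important part: if `augmented` does not beat `baseline`, the extra products are adding noise rather than signal.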



Remember, for smaller datasets, the risk of over-fitting and the susceptibility to outliers are increased, so careful feature selection/engineering and good discipline with dev/test and validation sets are all the more important.
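One concrete form that discipline can take: hold out a final test set up front, and do any feature selection inside the cross-validation pipeline so the held-out folds never leak into the selection step. A sketch on synthetic data:

```python
# Sketch: small dataset with many features (like the ~200-column case above).
# Feature selection lives *inside* the pipeline to avoid leakage.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=200, n_informative=10,
                           random_state=0)

# Hold out a final test set first; tune and compare only on the dev rows.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# SelectKBest is refit per fold, so selection never sees the validation fold.
model = make_pipeline(SelectKBest(f_classif, k=20), SVC())
cv_score = cross_val_score(model, X_dev, y_dev, cv=5).mean()

model.fit(X_dev, y_dev)
test_score = model.score(X_test, y_test)
print(f"cv={cv_score:.3f} test={test_score:.3f}")
```

If the cross-validation score is much higher than the test score, that gap is itself a warning sign of over-fitting on the small dataset.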
































    Thank you Chris for your very clear answer. I'll follow your advice and start with the smallest dataset and either enlarge it with the other products or change the modeling approach. This really confirms that size is not a good measure.

    – Katy
    Jan 4 at 8:59












edited Jan 4 at 8:30

























answered Jan 4 at 8:23









Chris



























