(Scraping) Handling products with varying identifiers/numbers

I'm scraping a set of products from various retail sites, but roughly 10% of the products have mismatched product model numbers across sites. This is a problem because I need an automated way to compare product information and pricing between these sites.



The information I can collect from each site:

  • Product Brand

  • Product Name <-- these sometimes vary slightly between sites

  • Product Model Number <-- this is what I use as the "unique" key

  • Misc. Product Information (descriptions, ratings, etc.)


One retail site may list a product as ABC123-123 while another lists it as ABC123123 (no dash). This accounts for the bulk of the issues. Sometimes the variation is larger, such as ABC123 123 B vs. Brand ABC123123: the second site joins the whole string and puts the literal brand name in front instead of appending the brand initial.



My proposal is to clean every incoming model number with a regex that removes spaces and dashes.



var cleanedProduct = product.replace(/[-\s]/g, "");
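As a minimal sketch, the same cleanup could be widened to fold case, strip any non-alphanumeric separator, and drop a literal leading brand name. The function name `normalizeModel` and its `brand` parameter are assumptions for illustration, not part of the original code.

```javascript
// Hypothetical helper: normalizes a scraped model number for comparison.
// `brand` is optional; when given, a literal leading brand name is dropped.
function normalizeModel(model, brand) {
  let s = String(model).toUpperCase();
  if (brand) {
    const b = String(brand).toUpperCase();
    // Handles listings such as "Brand ABC123123".
    if (s.startsWith(b + " ")) s = s.slice(b.length);
  }
  // Remove everything that is not a letter or digit (spaces, dashes, dots, ...).
  return s.replace(/[^A-Z0-9]/g, "");
}

console.log(normalizeModel("ABC123-123"));               // ABC123123
console.log(normalizeModel("Brand ABC123123", "Brand")); // ABC123123
```

A variant like ABC123 123 B still normalizes to ABC123123B, so a trailing brand initial is not resolved by this step and would need the fuzzy or manual handling.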


This should reduce the number of mismatches as much as possible. For the remaining cases I have two ideas:





  • Semi-automated: Fuzzy search new products by product name; if a match exists, store the new model number as a backup (alias) for the existing product. I can then review and approve each such exception when it is found.


  • Manual: Check new products by hand and add them to an exception list so that they are matched to the correct (baseline) model number.
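The two ideas above can be combined, roughly as in the sketch below. Everything here is an assumption for illustration: the names `resolve` and `approve`, the data shapes (`catalog` and `aliases` as `Map`s), the word-token Jaccard similarity, and the 0.6 threshold. A real fuzzy-search library could replace `nameSimilarity`.

```javascript
// Sketch only: names, threshold, and data shapes are assumptions.
function tokens(name) {
  return new Set(name.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity on word tokens: |A ∩ B| / |A ∪ B|.
function nameSimilarity(a, b) {
  const ta = tokens(a), tb = tokens(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

// catalog: Map from baseline model number to { brand, name }.
// aliases: Map from a cleaned variant model number to its baseline (the exception list).
function resolve(product, catalog, aliases, threshold = 0.6) {
  const key = product.model.toUpperCase().replace(/[^A-Z0-9]/g, "");
  if (catalog.has(key)) return { status: "matched", model: key };
  if (aliases.has(key)) return { status: "matched", model: aliases.get(key) };

  // No exact or alias hit: look for the best same-brand product by name.
  let best = null;
  for (const [model, known] of catalog) {
    if (known.brand.toLowerCase() !== product.brand.toLowerCase()) continue;
    const score = nameSimilarity(known.name, product.name);
    if (score >= threshold && (!best || score > best.score)) best = { model, score };
  }
  return best
    ? { status: "review", model: best.model, candidate: key, score: best.score }
    : { status: "new", model: key };
}

// Called after a human approves a "review" result; later scrapes then match directly.
function approve(result, aliases) {
  aliases.set(result.candidate, result.model);
}
```

With about 5000 products and 80-120 new ones per year, a linear scan per unmatched product is cheap, and the review queue stays small enough to approve by hand.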


In production I expect about 5000 products, with only 80-120 new products added annually.



Just curious whether anyone has run into the same issue. Ideally I would match on UPC codes, but I am not given that information.










  • Your solution looks good enough. Are there other cases where something other than '-' and spaces is involved?

    – Aditya Gupta
    Jan 2 at 8:23













  • The second example, such as ABC123 123 B vs Brand ABC123123. But these make up a VERY small number of products.

    – Miles Collier
    Jan 2 at 8:40











  • Oh. So is removing 'Brand' like the other characters not appropriate, or is it just an example to make your point?

    – Aditya Gupta
    Jan 2 at 9:16
















javascript web-scraping
















asked Jan 2 at 4:21









Miles Collier












