(Scraping) Handling products with varying identifiers/numbers

I'm scraping a set of products from various retail sites, but unfortunately ~10% of the products have mismatched product model numbers. This is a problem because I need an automated way of comparing product information and pricing between these sites.
The information I am able to obtain:
- Product Brand
- Product Name <-- These have minor variations at times also
- Product Model Number <-- This is what I use as the "unique"
- Misc. Product Information (Descriptions, ratings, etc)
One retail site may list a product as ABC123-123 while another lists it as ABC123123 (without the dash). This accounts for the bulk of the issues, but sometimes the variation is larger, such as ABC123 123 B vs Brand ABC123123 (the second form combines the entire string and puts the literal brand name at the front instead of the initial).
My proposal is to normalize every incoming model number with a regex that removes spaces and dashes.
var cleanedProduct = product.replace(/-|\s/g, "");
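A slightly fuller sketch of that normalization step, for illustration only. The helper name `normalizeModel`, the optional `brand` argument, and the uppercasing are assumptions that are not part of the original one-liner. It strips a leading literal brand name only when a separator follows it, so the "Brand ABC123123" form collapses to the same key as "ABC123-123". A trailing brand initial such as the "B" in "ABC123 123 B" is deliberately left alone, since removing it blindly could merge genuinely different models.

```javascript
// Hypothetical helper: normalizes a scraped model number for comparison.
// `brand` is optional; when given, a leading literal brand name is stripped.
function normalizeModel(model, brand) {
  // Assumption: model numbers can be compared case-insensitively.
  let s = String(model).trim().toUpperCase();
  if (brand) {
    const prefix = String(brand).trim().toUpperCase();
    // Only strip the brand when a space or dash follows it, so a model
    // that merely starts with the same letters is left untouched.
    const rest = s.slice(prefix.length);
    if (prefix && s.startsWith(prefix) && /^[\s-]/.test(rest)) {
      s = rest;
    }
  }
  // Remove every dash and whitespace character.
  return s.replace(/[-\s]/g, "");
}

console.log(normalizeModel("ABC123-123"));               // "ABC123123"
console.log(normalizeModel("Brand ABC123123", "Brand")); // "ABC123123"
console.log(normalizeModel("ABC123 123 B", "Brand"));    // "ABC123123B" (still needs review)
```

Whatever is still unmatched after this step is what the exception handling has to cover.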
This should reduce the number of mismatches as much as possible. For the remaining ones I have two ideas:
Semi-automated: Fuzzy search new products by product name; if a match exists, store the new model number as a backup model number for that product. I can then review and approve the exception when it is found.
Manual: Check new products and add them to an exception list so that they are matched to the correct/baseline model number.
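The two ideas above could be combined roughly as follows. This is only a sketch under stated assumptions: the product shape (`brand`, `name`, `model`), the `exceptions` map, the function names, and the 0.85 threshold are all hypothetical, and Levenshtein distance is just one possible fuzzy measure. Model numbers are assumed to be already normalized. A product whose name is close to a known one is flagged for review rather than matched automatically.

```javascript
// Classic dynamic-programming Levenshtein edit distance between two strings.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means identical after trimming and lowercasing.
function nameSimilarity(a, b) {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - levenshtein(x, y) / longest;
}

// Hypothetical exception list: alternate model number -> baseline model number.
const exceptions = new Map([["ABC123123B", "ABC123123"]]);

// Resolve a scraped product against the known catalog.
// Returns status "matched", "review" (needs manual approval), or "new".
function resolveProduct(product, catalog, threshold = 0.85) {
  const model = exceptions.get(product.model) || product.model;
  if (catalog.some((p) => p.model === model)) {
    return { status: "matched", baseline: model };
  }
  // No model match: look for the closest product name within the same brand.
  let best = null;
  for (const p of catalog) {
    if (p.brand !== product.brand) continue;
    const score = nameSimilarity(p.name, product.name);
    if (!best || score > best.score) best = { baseline: p.model, score };
  }
  if (best && best.score >= threshold) {
    return { status: "review", baseline: best.baseline, score: best.score };
  }
  return { status: "new", baseline: null };
}

const catalog = [{ brand: "Brand", name: "Widget Pro 3000", model: "ABC123123" }];
console.log(resolveProduct({ brand: "Brand", name: "Widget Pro-3000", model: "XYZ9" }, catalog).status);
```

Once a "review" result is approved, adding the pair to `exceptions` means the same variant resolves directly the next time it is scraped.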
In production I expect ~5000 products, with only 80-120 new products added annually.
I'm curious whether anyone has run into the same issue. Ideally I would match on UPC codes, but I am not given that information.
javascript web-scraping
Your solution looks good enough. Are there other cases where something other than '-' and spaces is involved?
– Aditya Gupta
Jan 2 at 8:23
The second example, such as ABC123 123 B vs Brand ABC123123. But these make up a VERY small number of products.
– Miles Collier
Jan 2 at 8:40
Oh. So is removing 'Brand' like the other characters not appropriate, or is it just an example to make your point?
– Aditya Gupta
Jan 2 at 9:16
asked Jan 2 at 4:21
Miles Collier