Need help understanding unexpected behavior using LINQ Join with HashSet
I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.
I have the following:
private HashSet<MyClass> _mySet; // module level
IEnumerable<ISearchKey> searchKeys; // parameter.
// Partial key searches are allowed.
private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
// Compares instances of MyClass and ISearchKey to determine
// if they match.
Given that
- There is a 1-to-many relationship between searchKeys and _mySet.
- MyClass implements interface IPartialKey and ICoreKey.
- ISearchKey inherits from IPartialKey and ICoreKey.
- MyClass and ISearchKey instance both override the GetHashCode method.
- MyClass's hash code value is based on its full key value, which
includes its ICoreKey and IPartialKey values plus other fields. - The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
- ISearchKey's hash code value is based only on its ICoreKey and
IPartialKey values. i.e. The ISearchKey hash code might differ from
the hash code for a matching MyClass instance. (Side note: in the case where I first
encountered the problem, the ISearchKey's IPartialKey values match
the MyClass full key, so the GetHashCode methods would return the
same value for both ISearchKey and MyClass. I included the
additional complexity to better illustrate the underlying logic on
what I am doing.) - The _coreKeyComparer.GetHashCode method returns
the same value for matching instances of ISearchKey and MyClass using
only their ICoreKey values. - The _coreKeyComparer.Equals method cast
the parameters to MyClass and ISearchKey respectively and returns
true if their IPartialKey values match. (Side note:
_coreKeyComparer has been HEAVILY tested and works correctly.)
I expected a join between the two collections should result in something like:
{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....
i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.
But when I do a join from searchKeys to _mySet:
var matchedPairs = searchKeys
.Join(
_mySet,
searchKey => searchKey,
myClass => myClass,
(searchKey, myClass) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:
{searchKey_a, myClass_a1},
{searchKey_b, myClass_b1},
{searchKey_c, myClass_c1},
etc....
However if I reverse the join, go from _mySet to searchKeys:
var matchedPairs = _mySet
.Join(
searchKeys,
myClass => myClass,
searchKey => searchKey,
(myClass, searchKey) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.
I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.
(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)
Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.
If the former, I need clarification so I don't make mistakes using Join in the future.
If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.
c# linq join inner-join hashset
|
show 9 more comments
I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.
I have the following:
private HashSet<MyClass> _mySet; // module level
IEnumerable<ISearchKey> searchKeys; // parameter.
// Partial key searches are allowed.
private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
// Compares instances of MyClass and ISearchKey to determine
// if they match.
Given that
- There is a 1-to-many relationship between searchKeys and _mySet.
- MyClass implements interface IPartialKey and ICoreKey.
- ISearchKey inherits from IPartialKey and ICoreKey.
- MyClass and ISearchKey instance both override the GetHashCode method.
- MyClass's hash code value is based on its full key value, which
includes its ICoreKey and IPartialKey values plus other fields. - The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
- ISearchKey's hash code value is based only on its ICoreKey and
IPartialKey values. i.e. The ISearchKey hash code might differ from
the hash code for a matching MyClass instance. (Side note: in the case where I first
encountered the problem, the ISearchKey's IPartialKey values match
the MyClass full key, so the GetHashCode methods would return the
same value for both ISearchKey and MyClass. I included the
additional complexity to better illustrate the underlying logic on
what I am doing.) - The _coreKeyComparer.GetHashCode method returns
the same value for matching instances of ISearchKey and MyClass using
only their ICoreKey values. - The _coreKeyComparer.Equals method cast
the parameters to MyClass and ISearchKey respectively and returns
true if their IPartialKey values match. (Side note:
_coreKeyComparer has been HEAVILY tested and works correctly.)
I expected a join between the two collections should result in something like:
{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....
i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.
But when I do a join from searchKeys to _mySet:
var matchedPairs = searchKeys
.Join(
_mySet,
searchKey => searchKey,
myClass => myClass,
(searchKey, myClass) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:
{searchKey_a, myClass_a1},
{searchKey_b, myClass_b1},
{searchKey_c, myClass_c1},
etc....
However if I reverse the join, go from _mySet to searchKeys:
var matchedPairs = _mySet
.Join(
searchKeys,
myClass => myClass,
searchKey => searchKey,
(myClass, searchKey) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.
I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.
(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)
Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.
If the former, I need clarification so I don't make mistakes using Join in the future.
If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.
c# linq join inner-join hashset
4
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
1
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
2
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
4
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
1
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14
|
show 9 more comments
I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.
I have the following:
private HashSet<MyClass> _mySet; // module level
IEnumerable<ISearchKey> searchKeys; // parameter.
// Partial key searches are allowed.
private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
// Compares instances of MyClass and ISearchKey to determine
// if they match.
Given that
- There is a 1-to-many relationship between searchKeys and _mySet.
- MyClass implements interface IPartialKey and ICoreKey.
- ISearchKey inherits from IPartialKey and ICoreKey.
- MyClass and ISearchKey instance both override the GetHashCode method.
- MyClass's hash code value is based on its full key value, which
includes its ICoreKey and IPartialKey values plus other fields. - The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
- ISearchKey's hash code value is based only on its ICoreKey and
IPartialKey values. i.e. The ISearchKey hash code might differ from
the hash code for a matching MyClass instance. (Side note: in the case where I first
encountered the problem, the ISearchKey's IPartialKey values match
the MyClass full key, so the GetHashCode methods would return the
same value for both ISearchKey and MyClass. I included the
additional complexity to better illustrate the underlying logic on
what I am doing.) - The _coreKeyComparer.GetHashCode method returns
the same value for matching instances of ISearchKey and MyClass using
only their ICoreKey values. - The _coreKeyComparer.Equals method cast
the parameters to MyClass and ISearchKey respectively and returns
true if their IPartialKey values match. (Side note:
_coreKeyComparer has been HEAVILY tested and works correctly.)
I expected a join between the two collections should result in something like:
{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....
i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.
But when I do a join from searchKeys to _mySet:
var matchedPairs = searchKeys
.Join(
_mySet,
searchKey => searchKey,
myClass => myClass,
(searchKey, myClass) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:
{searchKey_a, myClass_a1},
{searchKey_b, myClass_b1},
{searchKey_c, myClass_c1},
etc....
However if I reverse the join, go from _mySet to searchKeys:
var matchedPairs = _mySet
.Join(
searchKeys,
myClass => myClass,
searchKey => searchKey,
(myClass, searchKey) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.
I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.
(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)
Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.
If the former, I need clarification so I don't make mistakes using Join in the future.
If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.
c# linq join inner-join hashset
I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.
I have the following:
private HashSet<MyClass> _mySet; // module level
IEnumerable<ISearchKey> searchKeys; // parameter.
// Partial key searches are allowed.
private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
// Compares instances of MyClass and ISearchKey to determine
// if they match.
Given that
- There is a 1-to-many relationship between searchKeys and _mySet.
- MyClass implements interface IPartialKey and ICoreKey.
- ISearchKey inherits from IPartialKey and ICoreKey.
- MyClass and ISearchKey instance both override the GetHashCode method.
- MyClass's hash code value is based on its full key value, which
includes its ICoreKey and IPartialKey values plus other fields. - The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
- ISearchKey's hash code value is based only on its ICoreKey and
IPartialKey values. i.e. The ISearchKey hash code might differ from
the hash code for a matching MyClass instance. (Side note: in the case where I first
encountered the problem, the ISearchKey's IPartialKey values match
the MyClass full key, so the GetHashCode methods would return the
same value for both ISearchKey and MyClass. I included the
additional complexity to better illustrate the underlying logic on
what I am doing.) - The _coreKeyComparer.GetHashCode method returns
the same value for matching instances of ISearchKey and MyClass using
only their ICoreKey values. - The _coreKeyComparer.Equals method cast
the parameters to MyClass and ISearchKey respectively and returns
true if their IPartialKey values match. (Side note:
_coreKeyComparer has been HEAVILY tested and works correctly.)
I expected a join between the two collections should result in something like:
{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....
i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.
But when I do a join from searchKeys to _mySet:
var matchedPairs = searchKeys
.Join(
_mySet,
searchKey => searchKey,
myClass => myClass,
(searchKey, myClass) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:
{searchKey_a, myClass_a1},
{searchKey_b, myClass_b1},
{searchKey_c, myClass_c1},
etc....
However if I reverse the join, go from _mySet to searchKeys:
var matchedPairs = _mySet
.Join(
searchKeys,
myClass => myClass,
searchKey => searchKey,
(myClass, searchKey) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.
I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.
(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)
Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.
If the former, I need clarification so I don't make mistakes using Join in the future.
If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.
c# linq join inner-join hashset
c# linq join inner-join hashset
edited Jan 3 at 18:18
Eric Lippert
547k14610691953
547k14610691953
asked Jan 3 at 15:47
RB DavidsonRB Davidson
435515
435515
4
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
1
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
2
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
4
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
1
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14
|
show 9 more comments
4
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
1
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
2
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
4
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
1
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14
4
4
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
1
1
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
2
2
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
4
4
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
1
1
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14
|
show 9 more comments
1 Answer
1
active
oldest
votes
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
- An "outer" collection -- the receiver of the
Join
. - An "inner" collection -- the first argument of the extension method
- Two key extractors, that extract a key from the outer and inner collections
- A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
- A comparison operation that compares two keys for equality.
Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
- Replace all your
GetHashCode
implementations so that they return0
, and run all your tests. They should pass! It is always legal to return zero fromGetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value ofGetHashCode
, then you have a bug. - Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of
A
andB
must be the same asB
andA
, (3) transitivity: ifA
equalsB
andB
equalsC
thenA
must equalC
. If these rules are not met thenJoin
can behave weirdly.
Replace your
Join
with aSelectMany
and aWhere
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in thelookup[outerKey]
in your code?
– Hogan
Jan 3 at 19:02
1
@EricLippert So the multi-dictionary maps from key to unique values? TheIEnumerable<TInner>
won't have duplicates? I don't see whereGrouping.Add
does that?
– NetMage
Jan 3 at 19:37
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
|
show 7 more comments
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54025578%2fneed-help-understanding-unexpected-behavior-using-linq-join-with-hashsett%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
- An "outer" collection -- the receiver of the
Join
. - An "inner" collection -- the first argument of the extension method
- Two key extractors, that extract a key from the outer and inner collections
- A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
- A comparison operation that compares two keys for equality.
Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
- Replace all your
GetHashCode
implementations so that they return0
, and run all your tests. They should pass! It is always legal to return zero fromGetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value ofGetHashCode
, then you have a bug. - Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of
A
andB
must be the same asB
andA
, (3) transitivity: ifA
equalsB
andB
equalsC
thenA
must equalC
. If these rules are not met thenJoin
can behave weirdly.
Replace your
Join
with aSelectMany
and aWhere
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in thelookup[outerKey]
in your code?
– Hogan
Jan 3 at 19:02
1
@EricLippert So the multi-dictionary maps from key to unique values? TheIEnumerable<TInner>
won't have duplicates? I don't see whereGrouping.Add
does that?
– NetMage
Jan 3 at 19:37
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
|
show 7 more comments
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
- An "outer" collection -- the receiver of the
Join
. - An "inner" collection -- the first argument of the extension method
- Two key extractors, that extract a key from the outer and inner collections
- A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
- A comparison operation that compares two keys for equality.
Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
- Replace all your
GetHashCode
implementations so that they return0
, and run all your tests. They should pass! It is always legal to return zero fromGetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value ofGetHashCode
, then you have a bug. - Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of
A
andB
must be the same asB
andA
, (3) transitivity: ifA
equalsB
andB
equalsC
thenA
must equalC
. If these rules are not met thenJoin
can behave weirdly.
Replace your
Join
with aSelectMany
and aWhere
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in thelookup[outerKey]
in your code?
– Hogan
Jan 3 at 19:02
1
@EricLippert So the multi-dictionary maps from key to unique values? TheIEnumerable<TInner>
won't have duplicates? I don't see whereGrouping.Add
does that?
– NetMage
Jan 3 at 19:37
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
|
show 7 more comments
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
- An "outer" collection -- the receiver of the
Join
. - An "inner" collection -- the first argument of the extension method
- Two key extractors, that extract a key from the outer and inner collections
- A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
- A comparison operation that compares two keys for equality.
Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
- Replace all your
GetHashCode
implementations so that they return0
, and run all your tests. They should pass! It is always legal to return zero fromGetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value ofGetHashCode
, then you have a bug. - Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of
A
andB
must be the same asB
andA
, (3) transitivity: ifA
equalsB
andB
equalsC
thenA
must equalC
. If these rules are not met thenJoin
can behave weirdly.
Replace your
Join
with aSelectMany
and aWhere
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
- An "outer" collection -- the receiver of the
Join
. - An "inner" collection -- the first argument of the extension method
- Two key extractors, that extract a key from the outer and inner collections
- A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
- A comparison operation that compares two keys for equality.
Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
- Replace all your
GetHashCode
implementations so that they return0
, and run all your tests. They should pass! It is always legal to return zero fromGetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value ofGetHashCode
, then you have a bug. - Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of
A
andB
must be the same asB
andA
, (3) transitivity: ifA
equalsB
andB
equalsC
thenA
must equalC
. If these rules are not met thenJoin
can behave weirdly.
Replace your
Join
with aSelectMany
and aWhere
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
edited Jan 3 at 18:56
answered Jan 3 at 18:49
Eric LippertEric Lippert
547k14610691953
547k14610691953
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in thelookup[outerKey]
in your code?
– Hogan
Jan 3 at 19:02
1
@EricLippert So the multi-dictionary maps from key to unique values? TheIEnumerable<TInner>
won't have duplicates? I don't see whereGrouping.Add
does that?
– NetMage
Jan 3 at 19:37
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
|
show 7 more comments
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in thelookup[outerKey]
in your code?
– Hogan
Jan 3 at 19:02
1
@EricLippert So the multi-dictionary maps from key to unique values? TheIEnumerable<TInner>
won't have duplicates? I don't see whereGrouping.Add
does that?
– NetMage
Jan 3 at 19:37
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
1
1
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in the
lookup[outerKey]
in your code?– Hogan
Jan 3 at 19:02
Thanks for this amazing clear answer Eric -- it shows how my answer was wrong since I assumed you did not iterate over the inner selector. I do have one question -- HashSet (at docs.microsoft.com/en-us/dotnet/api/…) says HashSet cannot contain duplicate elements. Are duplicate elements determined by equity operator? Could that have an effect? Or an effect in the
lookup[outerKey]
in your code?– Hogan
Jan 3 at 19:02
1
1
@EricLippert So the multi-dictionary maps from key to unique values? The
IEnumerable<TInner>
won't have duplicates? I don't see where Grouping.Add
does that?– NetMage
Jan 3 at 19:37
@EricLippert So the multi-dictionary maps from key to unique values? The
IEnumerable<TInner>
won't have duplicates? I don't see where Grouping.Add
does that?– NetMage
Jan 3 at 19:37
2
2
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
@EricLippert, I now realize I didn't read your initial answer closely enough. The bug was in my IEqualityComparer, which I wrongly insisted was correct. The need for partial matches caused transitivity to fail, which meant the initial the loop over the inner hashset returned false when it should have returned true. Rule 1: Never swear the problem isn't in a particular chunk of code, because it will always be in that code. I couldn't see a way to fix the comparer to handle partial matches against a set of searchkeys, so I refactored code to avoid join. Thanks for your help.
– RB Davidson
Jan 7 at 19:06
1
1
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
@EricLippert Regarding my claim that "Either I don't understand how the Join method is supposed to work or...." Well, I didn't understand how the Join method worked. I assumed Join only compared items in the outer collection against those in the inner. I didn't know it looped over the inner collection first, comparing those elements against each other. When I confirmed the behavior matched your explanation, I realized there would be cases where ISearchKey would equal two different MyClass instances which would not equal each other. Thanks for enlightening me on the Join method's logic.
– RB Davidson
Jan 7 at 19:21
1
1
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
@RBDavidson: Further reading: If the subject of "ways people implement comparisons wrong" interests you, see ericlippert.com/2011/01/20/bad-comparisons-part-one. If the subject "ways people implement GetHashCode wrong" interests you, see blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
– Eric Lippert
Jan 7 at 19:33
|
show 7 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54025578%2fneed-help-understanding-unexpected-behavior-using-linq-join-with-hashsett%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
4
So far you've stated "my comparison code is correct", "either I don't understand how join works, or Join works differently with hash sets, or .NET has a bug". Well, you understand how joins work, Join does not work differently with hash sets, and .NET does not have a bug, so what you have here is a collection of statements which contradict each other, and one of them has to be wrong. My money is on "my comparison code is correct" being the false statement.
– Eric Lippert
Jan 3 at 19:15
1
Well then, my advice continues to be: create the minimal code that reproduces the defect. You should be able to get this down to a few dozen lines of extremely clear code that we all can run on our machines. When you do so, either you'll find your mistake, or you will have a program that someone else can find the mistake, or, you'll have a clear repro for your bug report to Microsoft.
– Eric Lippert
Jan 3 at 19:35
2
Also, use science. You have a hypothesis: joins are wrong with hash sets. OK, so start trying stuff. Make a copy of the hash set into a list and run the test again. Did you get a different result? That's evidence for your hypothesis. But if you got the same result then your hypothesis is false and should be abandoned.
– Eric Lippert
Jan 3 at 19:36
4
It's not the expected behaviour. That would be seriously broken. It is not observed to happen in correct code. That sort of thing is observed to happen in code where the comparison logic violates the rules of equality. Again: make a small reproducer that clearly demonstrates the behaviour, and you'll find your bug.
– Eric Lippert
Jan 3 at 20:07
1
Another way you might try to find your bug is to attack the equality problem directly. For example: make your comparison mechanism accumulate a list of every input it receives. When that list gets to be, say, 1000 items, then go into a verification mode. For every item in the list, verify that A = A is true. For every pair of items in the list, verify that A = B is the same as B = A. For all of those where it is true, check whether A = C matches B = C for every C. Worst case, that's only a billion checks. Then reset and accumulate another 1000.
– Eric Lippert
Jan 3 at 23:14