Doing a “fuzzy” and non-fuzzy, many to 1 merge with data.table
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
Lets assume I have two databases dfA
and dfB
. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)
Mostly when I merge these datasets I simply do:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
The problem is that sometimes the years do not completely match. So I tried:
dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]
But this reduces the amount of observations to 11.
# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012
The preferred output would be as follows:
# A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
Additional Sources:
1. The previous try
2. The try before that
r merge data.table fuzzyjoin
add a comment |
Lets assume I have two databases dfA
and dfB
. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)
Mostly when I merge these datasets I simply do:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
The problem is that sometimes the years do not completely match. So I tried:
dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]
But this reduces the amount of observations to 11.
# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012
The preferred output would be as follows:
# A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
Additional Sources:
1. The previous try
2. The try before that
r merge data.table fuzzyjoin
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?
– mt1022
Jan 4 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can setnomatch = NA
in join and examine what happens to those rows that get NA values.
– mt1022
Jan 4 at 12:12
add a comment |
Lets assume I have two databases dfA
and dfB
. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)
Mostly when I merge these datasets I simply do:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
The problem is that sometimes the years do not completely match. So I tried:
dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]
But this reduces the amount of observations to 11.
# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012
The preferred output would be as follows:
# A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
Additional Sources:
1. The previous try
2. The try before that
r merge data.table fuzzyjoin
Lets assume I have two databases dfA
and dfB
. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)
Mostly when I merge these datasets I simply do:
dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
The problem is that sometimes the years do not completely match. So I tried:
dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]
But this reduces the amount of observations to 11.
# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012
The preferred output would be as follows:
# A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
Additional Sources:
1. The previous try
2. The try before that
r merge data.table fuzzyjoin
r merge data.table fuzzyjoin
asked Jan 4 at 11:42
TomTom
355213
355213
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?
– mt1022
Jan 4 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can setnomatch = NA
in join and examine what happens to those rows that get NA values.
– mt1022
Jan 4 at 12:12
add a comment |
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?
– mt1022
Jan 4 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can setnomatch = NA
in join and examine what happens to those rows that get NA values.
– mt1022
Jan 4 at 12:12
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:
dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?– mt1022
Jan 4 at 11:57
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:
dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?– mt1022
Jan 4 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can set
nomatch = NA
in join and examine what happens to those rows that get NA values.– mt1022
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can set
nomatch = NA
in join and examine what happens to those rows that get NA values.– mt1022
Jan 4 at 12:12
add a comment |
1 Answer
1
active
oldest
votes
Hers is my (default) approach for a join like this, using data.table
code
library( data.table )
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
result
# A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
sample data
dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart frommatchcodeB
, alsoyear
andiso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leaveiso
andyear
fromdfA
intact?
– Tom
Jan 5 at 2:32
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in thecolorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove thei.
and.join
from the other databases?
– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from eachdfB
as well before doing the second merging. That appears to work.
– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54038311%2fdoing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Hers is my (default) approach for a join like this, using data.table
code
library( data.table )
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
result
# A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
sample data
dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart frommatchcodeB
, alsoyear
andiso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leaveiso
andyear
fromdfA
intact?
– Tom
Jan 5 at 2:32
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in thecolorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove thei.
and.join
from the other databases?
– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from eachdfB
as well before doing the second merging. That appears to work.
– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
add a comment |
Hers is my (default) approach for a join like this, using data.table
code
library( data.table )
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
result
# A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
sample data
dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart frommatchcodeB
, alsoyear
andiso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leaveiso
andyear
fromdfA
intact?
– Tom
Jan 5 at 2:32
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in thecolorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove thei.
and.join
from the other databases?
– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from eachdfB
as well before doing the second merging. That appears to work.
– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
add a comment |
Hers is my (default) approach for a join like this, using data.table
code
library( data.table )
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
result
# A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
sample data
dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
Hers is my (default) approach for a join like this, using data.table
code
library( data.table )
#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))
#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)
#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]
#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]
#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]
#set column order
setcolorder(result, colorder)
result
# A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
sample data
dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
edited Jan 5 at 6:53
answered Jan 4 at 12:40
WimpelWimpel
6,362323
6,362323
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart frommatchcodeB
, alsoyear
andiso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leaveiso
andyear
fromdfA
intact?
– Tom
Jan 5 at 2:32
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in thecolorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove thei.
and.join
from the other databases?
– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from eachdfB
as well before doing the second merging. That appears to work.
– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
add a comment |
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart frommatchcodeB
, alsoyear
andiso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leaveiso
andyear
fromdfA
intact?
– Tom
Jan 5 at 2:32
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in thecolorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove thei.
and.join
from the other databases?
– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from eachdfB
as well before doing the second merging. That appears to work.
– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from
matchcodeB
, also year
and iso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso
and year
from dfA
intact?– Tom
Jan 5 at 2:32
I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from
matchcodeB
, also year
and iso
a are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso
and year
from dfA
intact?– Tom
Jan 5 at 2:32
1
1
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
@Tom Sure just give them unique names. See updated answer
– Wimpel
Jan 5 at 6:53
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the
colorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i.
and .join
from the other databases?– Tom
Jan 5 at 7:13
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the
colorder
. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i.
and .join
from the other databases?– Tom
Jan 5 at 7:13
Nevermind, I fixed it by dropping all variables which are not needed from each
dfB
as well before doing the second merging. That appears to work.– Tom
Jan 5 at 7:29
Nevermind, I fixed it by dropping all variables which are not needed from each
dfB
as well before doing the second merging. That appears to work.– Tom
Jan 5 at 7:29
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
@Tom glad you got it to work!
– Wimpel
Jan 5 at 8:40
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54038311%2fdoing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
It seems that you want to match each row from dfA in dfB. Does this give you the desired output:
dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]
?– mt1022
Jan 4 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..
– Tom
Jan 4 at 12:08
Which is already way less than before by the way!
– Tom
Jan 4 at 12:12
I guess some code in dfA are not present in B. You can set
nomatch = NA
in join and examine what happens to those rows that get NA values.– mt1022
Jan 4 at 12:12