Doing a “fuzzy” and non-fuzzy, many to 1 merge with data.table





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







2















Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.



   dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)

dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)


Mostly when I merge these datasets I simply do:



dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)


The problem is that sometimes the years do not completely match. So I tried:



dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]


But this reduces the amount of observations to 11.



# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012


The preferred output would be as follows:



#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009


Additional Sources:



1. The previous try



2. The try before that










share|improve this question























  • It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

    – mt1022
    Jan 4 at 11:57













  • Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

    – Tom
    Jan 4 at 12:08











  • Which is already way less than before by the way!

    – Tom
    Jan 4 at 12:12











  • I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

    – mt1022
    Jan 4 at 12:12


















2















Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.



   dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)

dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)


Mostly when I merge these datasets I simply do:



dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)


The problem is that sometimes the years do not completely match. So I tried:



dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]


But this reduces the amount of observations to 11.



# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012


The preferred output would be as follows:



#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009


Additional Sources:



1. The previous try



2. The try before that










share|improve this question























  • It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

    – mt1022
    Jan 4 at 11:57













  • Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

    – Tom
    Jan 4 at 12:08











  • Which is already way less than before by the way!

    – Tom
    Jan 4 at 12:12











  • I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

    – mt1022
    Jan 4 at 12:12














2












2








2








Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.



   dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)

dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)


Mostly when I merge these datasets I simply do:



dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)


The problem is that sometimes the years do not completely match. So I tried:



dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]


But this reduces the amount of observations to 11.



# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012


The preferred output would be as follows:



#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009


Additional Sources:



1. The previous try



2. The try before that










share|improve this question














Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.



   dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)

dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)


Mostly when I merge these datasets I simply do:



dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)


The problem is that sometimes the years do not completely match. So I tried:



dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]


But this reduces the amount of observations to 11.



# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012


The preferred output would be as follows:



#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009


Additional Sources:



1. The previous try



2. The try before that







r merge data.table fuzzyjoin






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Jan 4 at 11:42









TomTom

355213




355213













  • It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

    – mt1022
    Jan 4 at 11:57













  • Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

    – Tom
    Jan 4 at 12:08











  • Which is already way less than before by the way!

    – Tom
    Jan 4 at 12:12











  • I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

    – mt1022
    Jan 4 at 12:12



















  • It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

    – mt1022
    Jan 4 at 11:57













  • Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

    – Tom
    Jan 4 at 12:08











  • Which is already way less than before by the way!

    – Tom
    Jan 4 at 12:12











  • I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

    – mt1022
    Jan 4 at 12:12

















It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57







It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57















Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08





Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08













Which is already way less than before by the way!

– Tom
Jan 4 at 12:12





Which is already way less than before by the way!

– Tom
Jan 4 at 12:12













I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12





I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12












1 Answer
1






active

oldest

votes


















1














Hers is my (default) approach for a join like this, using data.table



code



library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)


result



#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009


sample data



dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)


dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)





share|improve this answer


























  • I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

    – Tom
    Jan 5 at 2:32






  • 1





    @Tom Sure just give them unique names. See updated answer

    – Wimpel
    Jan 5 at 6:53











  • In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

    – Tom
    Jan 5 at 7:13











  • Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

    – Tom
    Jan 5 at 7:29











  • @Tom glad you got it to work!

    – Wimpel
    Jan 5 at 8:40












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54038311%2fdoing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














Hers is my (default) approach for a join like this, using data.table



code



library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)


result



#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009


sample data



dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)


dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)





share|improve this answer


























  • I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

    – Tom
    Jan 5 at 2:32






  • 1





    @Tom Sure just give them unique names. See updated answer

    – Wimpel
    Jan 5 at 6:53











  • In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

    – Tom
    Jan 5 at 7:13











  • Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

    – Tom
    Jan 5 at 7:29











  • @Tom glad you got it to work!

    – Wimpel
    Jan 5 at 8:40
















1














Hers is my (default) approach for a join like this, using data.table



code



library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)


result



#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009


sample data



dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)


dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)





share|improve this answer


























  • I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

    – Tom
    Jan 5 at 2:32






  • 1





    @Tom Sure just give them unique names. See updated answer

    – Wimpel
    Jan 5 at 6:53











  • In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

    – Tom
    Jan 5 at 7:13











  • Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

    – Tom
    Jan 5 at 7:29











  • @Tom glad you got it to work!

    – Wimpel
    Jan 5 at 8:40














1












1








1







Hers is my (default) approach for a join like this, using data.table



code



library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)


result



#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009


sample data



dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)


dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)





share|improve this answer















Hers is my (default) approach for a join like this, using data.table



code



library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)


result



#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
# 1: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 2: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 3: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 4: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 5: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 6: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 7: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 8: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 9: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 10: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009
# 11: 0 1 1 1 0 1 0 NLD 2010 NLD2010 0 1 0 NLD 2009 NLD2009
# 12: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD 2014 NLD2014
# 13: 0 0 0 1 1 0 0 AUS 2010 AUS2010 1 0 0 AUS 2011 AUS2011
# 14: 1 0 1 0 0 1 0 AUS 2006 AUS2006 0 1 0 AUS 2007 AUS2007
# 15: 0 1 0 1 0 1 1 USA 2008 USA2008 0 1 1 USA 2007 USA2007
# 16: 0 0 1 0 0 0 1 USA 2010 USA2010 0 0 1 USA 2011 USA2010
# 17: 0 0 1 0 0 0 0 USA 2012 USA2012 0 0 1 USA 2011 USA2010
# 18: 1 0 1 0 0 1 0 BLG 2008 BLG2008 0 1 0 BLG 2007 BLG2007
# 19: 0 1 0 1 1 0 1 BEL 2008 BEL2008 1 0 1 BEL 2009 BEL2009
# 20: 0 1 0 1 0 1 0 BEL 2010 BEL2010 1 0 1 BEL 2009 BEL2009


sample data



dfA <- fread(
"A B C D E F G iso year matchcode
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010
0 1 1 1 0 1 0 NLD 2010 NLD2010
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2010 AUS2010
1 0 1 0 0 1 0 AUS 2006 AUS2006
0 1 0 1 0 1 1 USA 2008 USA2008
0 0 1 0 0 0 1 USA 2010 USA2010
0 1 0 1 0 0 0 USA 2012 USA2012
1 0 1 0 0 1 0 BLG 2008 BLG2008
0 1 0 1 1 0 1 BEL 2008 BEL2008
1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)


dfB <- fread(
"A B C D H I J iso year matchcode
0 1 1 1 0 1 0 NLD 2009 NLD2009
1 0 0 0 1 0 1 NLD 2014 NLD2014
0 0 0 1 1 0 0 AUS 2011 AUS2011
1 0 1 0 0 1 0 AUS 2007 AUS2007
0 1 0 1 0 1 1 USA 2007 USA2007
0 0 1 0 0 0 1 USA 2011 USA2010
0 1 0 1 0 0 0 USA 2013 USA2013
1 0 1 0 0 1 0 BLG 2007 BLG2007
0 1 0 1 1 0 1 BEL 2009 BEL2009
1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 5 at 6:53

























answered Jan 4 at 12:40









WimpelWimpel

6,362323




6,362323













  • I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

    – Tom
    Jan 5 at 2:32






  • 1





    @Tom Sure just give them unique names. See updated answer

    – Wimpel
    Jan 5 at 6:53











  • In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

    – Tom
    Jan 5 at 7:13











  • Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

    – Tom
    Jan 5 at 7:29











  • @Tom glad you got it to work!

    – Wimpel
    Jan 5 at 8:40



















  • I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

    – Tom
    Jan 5 at 2:32






  • 1





    @Tom Sure just give them unique names. See updated answer

    – Wimpel
    Jan 5 at 6:53











  • In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

    – Tom
    Jan 5 at 7:13











  • Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

    – Tom
    Jan 5 at 7:29











  • @Tom glad you got it to work!

    – Wimpel
    Jan 5 at 8:40

















I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32





I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32




1




1





@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53





@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53













In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13





In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13













Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29





Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29













@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40





@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54038311%2fdoing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Monofisismo

Angular Downloading a file using contenturl with Basic Authentication

Olmecas