Doing a “fuzzy” and non-fuzzy, many to 1 merge with data.table

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.

   dfA <- read.table(

  text = "A   B   C   D   E   F   G   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2010   NLD2010

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2010   AUS2010

  4   1   0   1   0   0   1   0   AUS   2006   AUS2006

  5   0   1   0   1   0   1   1   USA   2008   USA2008

  6   0   0   1   0   0   0   1   USA   2010   USA2010

  7   0   1   0   1   0   0   0   USA   2012   USA2012

  8   1   0   1   0   0   1   0   BLG   2008   BLG2008

  9   0   1   0   1   1   0   1   BEL   2008   BEL2008

  10  1   0   1   0   0   1   0   BEL   2010   BEL2010

  11  0   1   1   1   0   1   0   NLD   2010   NLD2010

  12  1   0   0   0   1   0   1   NLD   2014   NLD2014

  13  0   0   0   1   1   0   0   AUS   2010   AUS2010

  14  1   0   1   0   0   1   0   AUS   2006   AUS2006

  15  0   1   0   1   0   1   1   USA   2008   USA2008

  16  0   0   1   0   0   0   1   USA   2010   USA2010

  17  0   1   0   1   0   0   0   USA   2012   USA2012

  18  1   0   1   0   0   1   0   BLG   2008   BLG2008

  19  0   1   0   1   1   0   1   BEL   2008   BEL2008

  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)



   dfB <- read.table(

  text = "A   B   C   D   H   I   J   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2009   NLD2009

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2011   AUS2011

  4   1   0   1   0   0   1   0   AUS   2007   AUS2007

  5   0   1   0   1   0   1   1   USA   2007   USA2007

  6   0   0   1   0   0   0   1   USA   2011   USA2010

  7   0   1   0   1   0   0   0   USA   2013   USA2013

  8   1   0   1   0   0   1   0   BLG   2007   BLG2007

  9   0   1   0   1   1   0   1   BEL   2009   BEL2009

  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)



library(data.table)

setDT(dfA)

setDT(dfB)

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the amount of observations to 11.

# A tibble: 11 x 18

       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode

   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      

 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    

 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    

 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    

 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    

 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    

 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    

 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    

10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    

11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB

# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that

asked Jan 4 at 11:42

Tom

355213

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57

Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08

Which is already way less than before by the way!

– Tom
Jan 4 at 12:12

I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12

add a comment |

   dfA <- read.table(

  text = "A   B   C   D   E   F   G   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2010   NLD2010

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2010   AUS2010

  4   1   0   1   0   0   1   0   AUS   2006   AUS2006

  5   0   1   0   1   0   1   1   USA   2008   USA2008

  6   0   0   1   0   0   0   1   USA   2010   USA2010

  7   0   1   0   1   0   0   0   USA   2012   USA2012

  8   1   0   1   0   0   1   0   BLG   2008   BLG2008

  9   0   1   0   1   1   0   1   BEL   2008   BEL2008

  10  1   0   1   0   0   1   0   BEL   2010   BEL2010

  11  0   1   1   1   0   1   0   NLD   2010   NLD2010

  12  1   0   0   0   1   0   1   NLD   2014   NLD2014

  13  0   0   0   1   1   0   0   AUS   2010   AUS2010

  14  1   0   1   0   0   1   0   AUS   2006   AUS2006

  15  0   1   0   1   0   1   1   USA   2008   USA2008

  16  0   0   1   0   0   0   1   USA   2010   USA2010

  17  0   1   0   1   0   0   0   USA   2012   USA2012

  18  1   0   1   0   0   1   0   BLG   2008   BLG2008

  19  0   1   0   1   1   0   1   BEL   2008   BEL2008

  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)



   dfB <- read.table(

  text = "A   B   C   D   H   I   J   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2009   NLD2009

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2011   AUS2011

  4   1   0   1   0   0   1   0   AUS   2007   AUS2007

  5   0   1   0   1   0   1   1   USA   2007   USA2007

  6   0   0   1   0   0   0   1   USA   2011   USA2010

  7   0   1   0   1   0   0   0   USA   2013   USA2013

  8   1   0   1   0   0   1   0   BLG   2007   BLG2007

  9   0   1   0   1   1   0   1   BEL   2009   BEL2009

  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)



library(data.table)

setDT(dfA)

setDT(dfB)

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the amount of observations to 11.

# A tibble: 11 x 18

       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode

   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      

 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    

 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    

 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    

 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    

 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    

 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    

 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    

10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    

11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB

# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that

asked Jan 4 at 11:42

Tom

355213

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57

Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08

Which is already way less than before by the way!

– Tom
Jan 4 at 12:12

I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12

add a comment |

   dfA <- read.table(

  text = "A   B   C   D   E   F   G   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2010   NLD2010

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2010   AUS2010

  4   1   0   1   0   0   1   0   AUS   2006   AUS2006

  5   0   1   0   1   0   1   1   USA   2008   USA2008

  6   0   0   1   0   0   0   1   USA   2010   USA2010

  7   0   1   0   1   0   0   0   USA   2012   USA2012

  8   1   0   1   0   0   1   0   BLG   2008   BLG2008

  9   0   1   0   1   1   0   1   BEL   2008   BEL2008

  10  1   0   1   0   0   1   0   BEL   2010   BEL2010

  11  0   1   1   1   0   1   0   NLD   2010   NLD2010

  12  1   0   0   0   1   0   1   NLD   2014   NLD2014

  13  0   0   0   1   1   0   0   AUS   2010   AUS2010

  14  1   0   1   0   0   1   0   AUS   2006   AUS2006

  15  0   1   0   1   0   1   1   USA   2008   USA2008

  16  0   0   1   0   0   0   1   USA   2010   USA2010

  17  0   1   0   1   0   0   0   USA   2012   USA2012

  18  1   0   1   0   0   1   0   BLG   2008   BLG2008

  19  0   1   0   1   1   0   1   BEL   2008   BEL2008

  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)



   dfB <- read.table(

  text = "A   B   C   D   H   I   J   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2009   NLD2009

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2011   AUS2011

  4   1   0   1   0   0   1   0   AUS   2007   AUS2007

  5   0   1   0   1   0   1   1   USA   2007   USA2007

  6   0   0   1   0   0   0   1   USA   2011   USA2010

  7   0   1   0   1   0   0   0   USA   2013   USA2013

  8   1   0   1   0   0   1   0   BLG   2007   BLG2007

  9   0   1   0   1   1   0   1   BEL   2009   BEL2009

  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)



library(data.table)

setDT(dfA)

setDT(dfB)

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the amount of observations to 11.

# A tibble: 11 x 18

       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode

   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      

 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    

 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    

 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    

 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    

 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    

 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    

 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    

10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    

11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB

# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that

asked Jan 4 at 11:42

Tom

355213

   dfA <- read.table(

  text = "A   B   C   D   E   F   G   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2010   NLD2010

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2010   AUS2010

  4   1   0   1   0   0   1   0   AUS   2006   AUS2006

  5   0   1   0   1   0   1   1   USA   2008   USA2008

  6   0   0   1   0   0   0   1   USA   2010   USA2010

  7   0   1   0   1   0   0   0   USA   2012   USA2012

  8   1   0   1   0   0   1   0   BLG   2008   BLG2008

  9   0   1   0   1   1   0   1   BEL   2008   BEL2008

  10  1   0   1   0   0   1   0   BEL   2010   BEL2010

  11  0   1   1   1   0   1   0   NLD   2010   NLD2010

  12  1   0   0   0   1   0   1   NLD   2014   NLD2014

  13  0   0   0   1   1   0   0   AUS   2010   AUS2010

  14  1   0   1   0   0   1   0   AUS   2006   AUS2006

  15  0   1   0   1   0   1   1   USA   2008   USA2008

  16  0   0   1   0   0   0   1   USA   2010   USA2010

  17  0   1   0   1   0   0   0   USA   2012   USA2012

  18  1   0   1   0   0   1   0   BLG   2008   BLG2008

  19  0   1   0   1   1   0   1   BEL   2008   BEL2008

  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)



   dfB <- read.table(

  text = "A   B   C   D   H   I   J   iso   year   matchcode

  1   0   1   1   1   0   1   0   NLD   2009   NLD2009

  2   1   0   0   0   1   0   1   NLD   2014   NLD2014

  3   0   0   0   1   1   0   0   AUS   2011   AUS2011

  4   1   0   1   0   0   1   0   AUS   2007   AUS2007

  5   0   1   0   1   0   1   1   USA   2007   USA2007

  6   0   0   1   0   0   0   1   USA   2011   USA2010

  7   0   1   0   1   0   0   0   USA   2013   USA2013

  8   1   0   1   0   0   1   0   BLG   2007   BLG2007

  9   0   1   0   1   1   0   1   BEL   2009   BEL2009

  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)



library(data.table)

setDT(dfA)

setDT(dfB)

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the amount of observations to 11.

# A tibble: 11 x 18

       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode

   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      

 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    

 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    

 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    

 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    

 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    

 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    

 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    

 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    

10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    

11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB

# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014

#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011

#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007

#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010

#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013

#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009

#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009

#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007

#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010

#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that

r merge data.table fuzzyjoin

asked Jan 4 at 11:42

Tom

355213

asked Jan 4 at 11:42

Tom

355213

asked Jan 4 at 11:42

Tom

355213

asked Jan 4 at 11:42

Tom

355213

asked Jan 4 at 11:42

Tom

355213

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57

Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08

Which is already way less than before by the way!

– Tom
Jan 4 at 12:12

I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12

add a comment |

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57

Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08

Which is already way less than before by the way!

– Tom
Jan 4 at 12:12

I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]?

– mt1022
Jan 4 at 11:57

Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably..

– Tom
Jan 4 at 12:08

Which is already way less than before by the way!

– Tom
Jan 4 at 12:12

I guess some code in dfA are not present in B. You can set nomatch = NA in join and examine what happens to those rows that get NA values.

– mt1022
Jan 4 at 12:12

add a comment |

1 Answer
1

active

oldest

votes

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )



#change the name of the matchcode-column

setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))

setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))



#store column-order for in the end

namesA <- as.character( names( dfA ) )

namesB <- as.character( setdiff( names(dfB), names(dfA) ) )

colorder <- c(namesA, namesB)



#create columns to join on

dfA[, `:=`(iso.join = isoA, year.join = yearA)]

dfB[, `:=`(iso.join = isoB, year.join = yearB)]



#perform left join

result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]



#drop columns that are not needed

result[, grep("^i\.", names(result)) := NULL ]

result[, grep("join$", names(result)) := NULL ]



#set column order

setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB

#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(

  "A   B   C   D   E   F   G   iso   year   matchcode

  0   1   1   1   0   1   0   NLD   2010   NLD2010

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2010   AUS2010

     1   0   1   0   0   1   0   AUS   2006   AUS2006

     0   1   0   1   0   1   1   USA   2008   USA2008

     0   0   1   0   0   0   1   USA   2010   USA2010

     0   1   0   1   0   0   0   USA   2012   USA2012

     1   0   1   0   0   1   0   BLG   2008   BLG2008

     0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010

    0   1   1   1   0   1   0   NLD   2010   NLD2010

    1   0   0   0   1   0   1   NLD   2014   NLD2014

    0   0   0   1   1   0   0   AUS   2010   AUS2010

    1   0   1   0   0   1   0   AUS   2006   AUS2006

    0   1   0   1   0   1   1   USA   2008   USA2008

    0   0   1   0   0   0   1   USA   2010   USA2010

    0   1   0   1   0   0   0   USA   2012   USA2012

    1   0   1   0   0   1   0   BLG   2008   BLG2008

    0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)





dfB <- fread(

  "A   B   C   D   H   I   J   iso   year   matchcode

     0   1   1   1   0   1   0   NLD   2009   NLD2009

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2011   AUS2011

     1   0   1   0   0   1   0   AUS   2007   AUS2007

     0   1   0   1   0   1   1   USA   2007   USA2007

     0   0   1   0   0   0   1   USA   2011   USA2010

     0   1   0   1   0   0   0   USA   2013   USA2013

     1   0   1   0   0   1   0   BLG   2007   BLG2007

     0   1   0   1   1   0   1   BEL   2009   BEL2009

     1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

1

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54038311%2fdoing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )



#change the name of the matchcode-column

setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))

setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))



#store column-order for in the end

namesA <- as.character( names( dfA ) )

namesB <- as.character( setdiff( names(dfB), names(dfA) ) )

colorder <- c(namesA, namesB)



#create columns to join on

dfA[, `:=`(iso.join = isoA, year.join = yearA)]

dfB[, `:=`(iso.join = isoB, year.join = yearB)]



#perform left join

result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]



#drop columns that are not needed

result[, grep("^i\.", names(result)) := NULL ]

result[, grep("join$", names(result)) := NULL ]



#set column order

setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB

#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(

  "A   B   C   D   E   F   G   iso   year   matchcode

  0   1   1   1   0   1   0   NLD   2010   NLD2010

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2010   AUS2010

     1   0   1   0   0   1   0   AUS   2006   AUS2006

     0   1   0   1   0   1   1   USA   2008   USA2008

     0   0   1   0   0   0   1   USA   2010   USA2010

     0   1   0   1   0   0   0   USA   2012   USA2012

     1   0   1   0   0   1   0   BLG   2008   BLG2008

     0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010

    0   1   1   1   0   1   0   NLD   2010   NLD2010

    1   0   0   0   1   0   1   NLD   2014   NLD2014

    0   0   0   1   1   0   0   AUS   2010   AUS2010

    1   0   1   0   0   1   0   AUS   2006   AUS2006

    0   1   0   1   0   1   1   USA   2008   USA2008

    0   0   1   0   0   0   1   USA   2010   USA2010

    0   1   0   1   0   0   0   USA   2012   USA2012

    1   0   1   0   0   1   0   BLG   2008   BLG2008

    0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)





dfB <- fread(

  "A   B   C   D   H   I   J   iso   year   matchcode

     0   1   1   1   0   1   0   NLD   2009   NLD2009

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2011   AUS2011

     1   0   1   0   0   1   0   AUS   2007   AUS2007

     0   1   0   1   0   1   1   USA   2007   USA2007

     0   0   1   0   0   0   1   USA   2011   USA2010

     0   1   0   1   0   0   0   USA   2013   USA2013

     1   0   1   0   0   1   0   BLG   2007   BLG2007

     0   1   0   1   1   0   1   BEL   2009   BEL2009

     1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

1

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

add a comment |

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )



#change the name of the matchcode-column

setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))

setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))



#store column-order for in the end

namesA <- as.character( names( dfA ) )

namesB <- as.character( setdiff( names(dfB), names(dfA) ) )

colorder <- c(namesA, namesB)



#create columns to join on

dfA[, `:=`(iso.join = isoA, year.join = yearA)]

dfB[, `:=`(iso.join = isoB, year.join = yearB)]



#perform left join

result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]



#drop columns that are not needed

result[, grep("^i\.", names(result)) := NULL ]

result[, grep("join$", names(result)) := NULL ]



#set column order

setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB

#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(

  "A   B   C   D   E   F   G   iso   year   matchcode

  0   1   1   1   0   1   0   NLD   2010   NLD2010

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2010   AUS2010

     1   0   1   0   0   1   0   AUS   2006   AUS2006

     0   1   0   1   0   1   1   USA   2008   USA2008

     0   0   1   0   0   0   1   USA   2010   USA2010

     0   1   0   1   0   0   0   USA   2012   USA2012

     1   0   1   0   0   1   0   BLG   2008   BLG2008

     0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010

    0   1   1   1   0   1   0   NLD   2010   NLD2010

    1   0   0   0   1   0   1   NLD   2014   NLD2014

    0   0   0   1   1   0   0   AUS   2010   AUS2010

    1   0   1   0   0   1   0   AUS   2006   AUS2006

    0   1   0   1   0   1   1   USA   2008   USA2008

    0   0   1   0   0   0   1   USA   2010   USA2010

    0   1   0   1   0   0   0   USA   2012   USA2012

    1   0   1   0   0   1   0   BLG   2008   BLG2008

    0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)





dfB <- fread(

  "A   B   C   D   H   I   J   iso   year   matchcode

     0   1   1   1   0   1   0   NLD   2009   NLD2009

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2011   AUS2011

     1   0   1   0   0   1   0   AUS   2007   AUS2007

     0   1   0   1   0   1   1   USA   2007   USA2007

     0   0   1   0   0   0   1   USA   2011   USA2010

     0   1   0   1   0   0   0   USA   2013   USA2013

     1   0   1   0   0   1   0   BLG   2007   BLG2007

     0   1   0   1   1   0   1   BEL   2009   BEL2009

     1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

1

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

add a comment |

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )



#change the name of the matchcode-column

setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))

setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))



#store column-order for in the end

namesA <- as.character( names( dfA ) )

namesB <- as.character( setdiff( names(dfB), names(dfA) ) )

colorder <- c(namesA, namesB)



#create columns to join on

dfA[, `:=`(iso.join = isoA, year.join = yearA)]

dfB[, `:=`(iso.join = isoB, year.join = yearB)]



#perform left join

result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]



#drop columns that are not needed

result[, grep("^i\.", names(result)) := NULL ]

result[, grep("join$", names(result)) := NULL ]



#set column order

setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB

#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(

  "A   B   C   D   E   F   G   iso   year   matchcode

  0   1   1   1   0   1   0   NLD   2010   NLD2010

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2010   AUS2010

     1   0   1   0   0   1   0   AUS   2006   AUS2006

     0   1   0   1   0   1   1   USA   2008   USA2008

     0   0   1   0   0   0   1   USA   2010   USA2010

     0   1   0   1   0   0   0   USA   2012   USA2012

     1   0   1   0   0   1   0   BLG   2008   BLG2008

     0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010

    0   1   1   1   0   1   0   NLD   2010   NLD2010

    1   0   0   0   1   0   1   NLD   2014   NLD2014

    0   0   0   1   1   0   0   AUS   2010   AUS2010

    1   0   1   0   0   1   0   AUS   2006   AUS2006

    0   1   0   1   0   1   1   USA   2008   USA2008

    0   0   1   0   0   0   1   USA   2010   USA2010

    0   1   0   1   0   0   0   USA   2012   USA2012

    1   0   1   0   0   1   0   BLG   2008   BLG2008

    0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)





dfB <- fread(

  "A   B   C   D   H   I   J   iso   year   matchcode

     0   1   1   1   0   1   0   NLD   2009   NLD2009

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2011   AUS2011

     1   0   1   0   0   1   0   AUS   2007   AUS2007

     0   1   0   1   0   1   1   USA   2007   USA2007

     0   0   1   0   0   0   1   USA   2011   USA2010

     0   1   0   1   0   0   0   USA   2013   USA2013

     1   0   1   0   0   1   0   BLG   2007   BLG2007

     0   1   0   1   1   0   1   BEL   2009   BEL2009

     1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )



#change the name of the matchcode-column

setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))

setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))



#store column-order for in the end

namesA <- as.character( names( dfA ) )

namesB <- as.character( setdiff( names(dfB), names(dfA) ) )

colorder <- c(namesA, namesB)



#create columns to join on

dfA[, `:=`(iso.join = isoA, year.join = yearA)]

dfB[, `:=`(iso.join = isoB, year.join = yearB)]



#perform left join

result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]



#drop columns that are not needed

result[, grep("^i\.", names(result)) := NULL ]

result[, grep("join$", names(result)) := NULL ]



#set column order

setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB

#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009

# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014

# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011

# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007

# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007

# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010

# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010

# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007

# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009

# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(

  "A   B   C   D   E   F   G   iso   year   matchcode

  0   1   1   1   0   1   0   NLD   2010   NLD2010

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2010   AUS2010

     1   0   1   0   0   1   0   AUS   2006   AUS2006

     0   1   0   1   0   1   1   USA   2008   USA2008

     0   0   1   0   0   0   1   USA   2010   USA2010

     0   1   0   1   0   0   0   USA   2012   USA2012

     1   0   1   0   0   1   0   BLG   2008   BLG2008

     0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010

    0   1   1   1   0   1   0   NLD   2010   NLD2010

    1   0   0   0   1   0   1   NLD   2014   NLD2014

    0   0   0   1   1   0   0   AUS   2010   AUS2010

    1   0   1   0   0   1   0   AUS   2006   AUS2006

    0   1   0   1   0   1   1   USA   2008   USA2008

    0   0   1   0   0   0   1   USA   2010   USA2010

    0   1   0   1   0   0   0   USA   2012   USA2012

    1   0   1   0   0   1   0   BLG   2008   BLG2008

    0   1   0   1   1   0   1   BEL   2008   BEL2008

    1   0   1   0   0   1   0   BEL   2010   BEL2010",

  header = TRUE

)





dfB <- fread(

  "A   B   C   D   H   I   J   iso   year   matchcode

     0   1   1   1   0   1   0   NLD   2009   NLD2009

     1   0   0   0   1   0   1   NLD   2014   NLD2014

     0   0   0   1   1   0   0   AUS   2011   AUS2011

     1   0   1   0   0   1   0   AUS   2007   AUS2007

     0   1   0   1   0   1   1   USA   2007   USA2007

     0   0   1   0   0   0   1   USA   2011   USA2010

     0   1   0   1   0   0   0   USA   2013   USA2013

     1   0   1   0   0   1   0   BLG   2007   BLG2007

     0   1   0   1   1   0   1   BEL   2009   BEL2009

     1   0   1   0   0   1   0  BEL   2012   BEL2012",

  header = TRUE

)

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

edited Jan 5 at 6:53

answered Jan 4 at 12:40

Wimpel

6,362323

answered Jan 4 at 12:40

Wimpel

6,362323

answered Jan 4 at 12:40

Wimpel

6,362323

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

1

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

add a comment |

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

1

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from matchcodeB, also year and isoa are empty (this is my confusion arose). Is it possible to a adapt the code to leave iso and year from dfA intact?

– Tom
Jan 5 at 2:32

@Tom Sure just give them unique names. See updated answer

– Wimpel
Jan 5 at 6:53

In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the colorder. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the i. and .join from the other databases?

– Tom
Jan 5 at 7:13

Nevermind, I fixed it by dropping all variables which are not needed from each dfB as well before doing the second merging. That appears to work.

– Tom
Jan 5 at 7:29

@Tom glad you got it to work!

– Wimpel
Jan 5 at 8:40

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk