How to split a pandas string to extract middle names?

I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.

The data would look like this:

ID| Complete_Name               | Type

1 | JERRY, Ben                  | "I"

2 | VON HELSINKI, Olga          | "I"

3 | JENSEN, James Goodboy Dean  | "I"

4 | THE COMPANY                 | "C"

5 | CRUZ, Juan S. de la         | "I"

Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.

df = pd.read_csv("list.pip", sep="|")

df["First Name"] = 

np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])

df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])

The desired results should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name

1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY

2 | VON HELSINKI, Olga          | "I"  | Olga       |             |

3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| VON HELSINKI

4 | THE COMPANY                 | "C"  |            |             |

5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13

^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34

add a comment |

The data would look like this:

ID| Complete_Name               | Type

1 | JERRY, Ben                  | "I"

2 | VON HELSINKI, Olga          | "I"

3 | JENSEN, James Goodboy Dean  | "I"

4 | THE COMPANY                 | "C"

5 | CRUZ, Juan S. de la         | "I"

df = pd.read_csv("list.pip", sep="|")

df["First Name"] = 

np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])

df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])

The desired results should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name

1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY

2 | VON HELSINKI, Olga          | "I"  | Olga       |             |

3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| VON HELSINKI

4 | THE COMPANY                 | "C"  |            |             |

5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13

^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34

add a comment |

The data would look like this:

ID| Complete_Name               | Type

1 | JERRY, Ben                  | "I"

2 | VON HELSINKI, Olga          | "I"

3 | JENSEN, James Goodboy Dean  | "I"

4 | THE COMPANY                 | "C"

5 | CRUZ, Juan S. de la         | "I"

df = pd.read_csv("list.pip", sep="|")

df["First Name"] = 

np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])

df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])

The desired results should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name

1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY

2 | VON HELSINKI, Olga          | "I"  | Olga       |             |

3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| VON HELSINKI

4 | THE COMPANY                 | "C"  |            |             |

5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

The data would look like this:

ID| Complete_Name               | Type

1 | JERRY, Ben                  | "I"

2 | VON HELSINKI, Olga          | "I"

3 | JENSEN, James Goodboy Dean  | "I"

4 | THE COMPANY                 | "C"

5 | CRUZ, Juan S. de la         | "I"

df = pd.read_csv("list.pip", sep="|")

df["First Name"] = 

np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])

df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])

The desired results should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name

1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY

2 | VON HELSINKI, Olga          | "I"  | Olga       |             |

3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| VON HELSINKI

4 | THE COMPANY                 | "C"  |            |             |

5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ

python regex string pandas split

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

edited Jan 19 at 0:13

coldspeed

128k23129216

edited Jan 19 at 0:13

coldspeed

128k23129216

edited Jan 19 at 0:13

coldspeed

128k23129216

asked Dec 30 '18 at 17:14

mrPy

857

asked Dec 30 '18 at 17:14

mrPy

857

asked Dec 30 '18 at 17:14

mrPy

857

Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13

^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34

add a comment |

Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13

^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34

Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13

^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34

add a comment |

3 Answers
3

active

oldest

votes

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 

u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)

pd.concat([df, u], axis=1).fillna('')



   ID               Complete_Name Type     Last_Name First_Name   Middle_Name

0   1                  JERRY, Ben    I         JERRY        Ben              

1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              

2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start-of-line

(?P<Last_Name>   # First named capture group - Last Name

    .*           # Match anything until...

)

,                # ...we see a comma

s               # whitespace 

(?P<First_Name>  # Second capture group - First Name

    S+          # Match all non-whitespace characters

)

b               # Word boundary 

s*              # Optional whitespace chars (mostly housekeeping) 

(?P<Middle_Name> # Third capture group - Zero of more middle names 

    .*           # Match everything till the end of string

)

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

add a comment |

I think you can do:

# take the complete_name column and split it multiple times

df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str

       .split(',', expand=True)

       .fillna(''))



# remove extra spaces 

for x in df2.columns:

    df2[x] = [x.strip() for x in df2[x]]



# split the name on first space and join it

df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)

df2.columns = ['last','first','middle']



# join the data frames

df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)



# rearrange columns - not necessary though

df = df[['ID','Complete_Name','first','middle','last']]



# remove none values

df = df.replace([None], '')



   ID                  Complete_Name Type  first        middle          last

0   1   JERRY, Ben                      I    Ben                       JERRY

1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI

2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN

3   4   THE COMPANY                     C                                   

4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

add a comment |

Here's another answer that uses some simple lambda functionality.

import numpy as np

import pandas as pd





""" Create data and data frame """



info_dict = {

    'ID': [1,2,3,4,5,],

    'Complete_Name':[

        'JERRY, Ben',

        'VON HELSINKI, Olga',

        'JENSEN, James Goodboy Dean',

        'THE COMPANY',

        'CRUZ, Juan S. de la',

        ],

    'Type':['I','I','I','C','I',],

    }



data = pd.DataFrame(info_dict, columns = info_dict.keys())





""" List of columns to add """

name_cols = [

    'First Name',

    'Middle Name',

    'Last Name',

    ]



"""

Use partition() to separate first and middle names into Pandas series.

Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the

values that we want.

"""

NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())

LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())



# We can use index positions to quickly add columns to the dataframe.

# The partition() function will keep the delimited value in the 1 index, so we'll use

# the 0 and 2 index positions for first and middle names.

data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]

data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]



# Finally, we'll add our Last Names column

data[name_cols[2]] = LAST_NAMES



# Optional: We can replace all blank values with numpy.NaN values using regular expressions.

data = data.replace(r'^$', np.NaN, regex=True)

Then you should end up with something like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben           NaN         JERRY

1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C        NaN           NaN           NaN

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, replace NaN values with with blank strings:

data = data.replace(np.NaN, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben                       JERRY

1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

answered Dec 30 '18 at 19:12

Mark Moretto

10615

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53979750%2fhow-to-split-a-pandas-string-to-extract-middle-names%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 

u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)

pd.concat([df, u], axis=1).fillna('')



   ID               Complete_Name Type     Last_Name First_Name   Middle_Name

0   1                  JERRY, Ben    I         JERRY        Ben              

1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              

2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start-of-line

(?P<Last_Name>   # First named capture group - Last Name

    .*           # Match anything until...

)

,                # ...we see a comma

s               # whitespace 

(?P<First_Name>  # Second capture group - First Name

    S+          # Match all non-whitespace characters

)

b               # Word boundary 

s*              # Optional whitespace chars (mostly housekeeping) 

(?P<Middle_Name> # Third capture group - Zero of more middle names 

    .*           # Match everything till the end of string

)

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

add a comment |

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 

u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)

pd.concat([df, u], axis=1).fillna('')



   ID               Complete_Name Type     Last_Name First_Name   Middle_Name

0   1                  JERRY, Ben    I         JERRY        Ben              

1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              

2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start-of-line

(?P<Last_Name>   # First named capture group - Last Name

    .*           # Match anything until...

)

,                # ...we see a comma

s               # whitespace 

(?P<First_Name>  # Second capture group - First Name

    S+          # Match all non-whitespace characters

)

b               # Word boundary 

s*              # Optional whitespace chars (mostly housekeeping) 

(?P<Middle_Name> # Third capture group - Zero of more middle names 

    .*           # Match everything till the end of string

)

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

add a comment |

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 

u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)

pd.concat([df, u], axis=1).fillna('')



   ID               Complete_Name Type     Last_Name First_Name   Middle_Name

0   1                  JERRY, Ben    I         JERRY        Ben              

1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              

2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start-of-line

(?P<Last_Name>   # First named capture group - Last Name

    .*           # Match anything until...

)

,                # ...we see a comma

s               # whitespace 

(?P<First_Name>  # Second capture group - First Name

    S+          # Match all non-whitespace characters

)

b               # Word boundary 

s*              # Optional whitespace chars (mostly housekeeping) 

(?P<Middle_Name> # Third capture group - Zero of more middle names 

    .*           # Match everything till the end of string

)

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 

u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)

pd.concat([df, u], axis=1).fillna('')



   ID               Complete_Name Type     Last_Name First_Name   Middle_Name

0   1                  JERRY, Ben    I         JERRY        Ben              

1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              

2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start-of-line

(?P<Last_Name>   # First named capture group - Last Name

    .*           # Match anything until...

)

,                # ...we see a comma

s               # whitespace 

(?P<First_Name>  # Second capture group - First Name

    S+          # Match all non-whitespace characters

)

b               # Word boundary 

s*              # Optional whitespace chars (mostly housekeeping) 

(?P<Middle_Name> # Third capture group - Zero of more middle names 

    .*           # Match everything till the end of string

)

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

edited Dec 31 '18 at 10:03

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

answered Dec 30 '18 at 18:00

coldspeed

128k23129216

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

add a comment |

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

– mrPy
Dec 31 '18 at 10:00

@mrPy Edited, hope that helps.

– coldspeed
Dec 31 '18 at 10:03

add a comment |

I think you can do:

# take the complete_name column and split it multiple times

df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str

       .split(',', expand=True)

       .fillna(''))



# remove extra spaces 

for x in df2.columns:

    df2[x] = [x.strip() for x in df2[x]]



# split the name on first space and join it

df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)

df2.columns = ['last','first','middle']



# join the data frames

df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)



# rearrange columns - not necessary though

df = df[['ID','Complete_Name','first','middle','last']]



# remove none values

df = df.replace([None], '')



   ID                  Complete_Name Type  first        middle          last

0   1   JERRY, Ben                      I    Ben                       JERRY

1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI

2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN

3   4   THE COMPANY                     C                                   

4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

add a comment |

I think you can do:

# take the complete_name column and split it multiple times

df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str

       .split(',', expand=True)

       .fillna(''))



# remove extra spaces 

for x in df2.columns:

    df2[x] = [x.strip() for x in df2[x]]



# split the name on first space and join it

df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)

df2.columns = ['last','first','middle']



# join the data frames

df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)



# rearrange columns - not necessary though

df = df[['ID','Complete_Name','first','middle','last']]



# remove none values

df = df.replace([None], '')



   ID                  Complete_Name Type  first        middle          last

0   1   JERRY, Ben                      I    Ben                       JERRY

1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI

2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN

3   4   THE COMPANY                     C                                   

4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

add a comment |

I think you can do:

# take the complete_name column and split it multiple times

df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str

       .split(',', expand=True)

       .fillna(''))



# remove extra spaces 

for x in df2.columns:

    df2[x] = [x.strip() for x in df2[x]]



# split the name on first space and join it

df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)

df2.columns = ['last','first','middle']



# join the data frames

df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)



# rearrange columns - not necessary though

df = df[['ID','Complete_Name','first','middle','last']]



# remove none values

df = df.replace([None], '')



   ID                  Complete_Name Type  first        middle          last

0   1   JERRY, Ben                      I    Ben                       JERRY

1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI

2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN

3   4   THE COMPANY                     C                                   

4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

I think you can do:

# take the complete_name column and split it multiple times

df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str

       .split(',', expand=True)

       .fillna(''))



# remove extra spaces 

for x in df2.columns:

    df2[x] = [x.strip() for x in df2[x]]



# split the name on first space and join it

df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)

df2.columns = ['last','first','middle']



# join the data frames

df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)



# rearrange columns - not necessary though

df = df[['ID','Complete_Name','first','middle','last']]



# remove none values

df = df.replace([None], '')



   ID                  Complete_Name Type  first        middle          last

0   1   JERRY, Ben                      I    Ben                       JERRY

1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI

2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN

3   4   THE COMPANY                     C                                   

4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

edited Dec 30 '18 at 23:33

answered Dec 30 '18 at 17:25

YOLO

5,4521324

answered Dec 30 '18 at 17:25

YOLO

5,4521324

answered Dec 30 '18 at 17:25

YOLO

5,4521324

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

add a comment |

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

– mrPy
Dec 30 '18 at 17:47

@mrPy sorry i missed that part, fixing it now.

– YOLO
Dec 30 '18 at 17:54

add a comment |

Here's another answer that uses some simple lambda functionality.

import numpy as np

import pandas as pd





""" Create data and data frame """



info_dict = {

    'ID': [1,2,3,4,5,],

    'Complete_Name':[

        'JERRY, Ben',

        'VON HELSINKI, Olga',

        'JENSEN, James Goodboy Dean',

        'THE COMPANY',

        'CRUZ, Juan S. de la',

        ],

    'Type':['I','I','I','C','I',],

    }



data = pd.DataFrame(info_dict, columns = info_dict.keys())





""" List of columns to add """

name_cols = [

    'First Name',

    'Middle Name',

    'Last Name',

    ]



"""

Use partition() to separate first and middle names into Pandas series.

Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the

values that we want.

"""

NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())

LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())



# We can use index positions to quickly add columns to the dataframe.

# The partition() function will keep the delimited value in the 1 index, so we'll use

# the 0 and 2 index positions for first and middle names.

data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]

data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]



# Finally, we'll add our Last Names column

data[name_cols[2]] = LAST_NAMES



# Optional: We can replace all blank values with numpy.NaN values using regular expressions.

data = data.replace(r'^$', np.NaN, regex=True)

Then you should end up with something like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben           NaN         JERRY

1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C        NaN           NaN           NaN

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, replace NaN values with with blank strings:

data = data.replace(np.NaN, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben                       JERRY

1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

answered Dec 30 '18 at 19:12

Mark Moretto

10615

add a comment |

Here's another answer that uses some simple lambda functionality.

import numpy as np

import pandas as pd





""" Create data and data frame """



info_dict = {

    'ID': [1,2,3,4,5,],

    'Complete_Name':[

        'JERRY, Ben',

        'VON HELSINKI, Olga',

        'JENSEN, James Goodboy Dean',

        'THE COMPANY',

        'CRUZ, Juan S. de la',

        ],

    'Type':['I','I','I','C','I',],

    }



data = pd.DataFrame(info_dict, columns = info_dict.keys())





""" List of columns to add """

name_cols = [

    'First Name',

    'Middle Name',

    'Last Name',

    ]



"""

Use partition() to separate first and middle names into Pandas series.

Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the

values that we want.

"""

NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())

LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())



# We can use index positions to quickly add columns to the dataframe.

# The partition() function will keep the delimited value in the 1 index, so we'll use

# the 0 and 2 index positions for first and middle names.

data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]

data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]



# Finally, we'll add our Last Names column

data[name_cols[2]] = LAST_NAMES



# Optional: We can replace all blank values with numpy.NaN values using regular expressions.

data = data.replace(r'^$', np.NaN, regex=True)

Then you should end up with something like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben           NaN         JERRY

1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C        NaN           NaN           NaN

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, replace NaN values with with blank strings:

data = data.replace(np.NaN, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben                       JERRY

1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

answered Dec 30 '18 at 19:12

Mark Moretto

10615

add a comment |

Here's another answer that uses some simple lambda functionality.

import numpy as np

import pandas as pd





""" Create data and data frame """



info_dict = {

    'ID': [1,2,3,4,5,],

    'Complete_Name':[

        'JERRY, Ben',

        'VON HELSINKI, Olga',

        'JENSEN, James Goodboy Dean',

        'THE COMPANY',

        'CRUZ, Juan S. de la',

        ],

    'Type':['I','I','I','C','I',],

    }



data = pd.DataFrame(info_dict, columns = info_dict.keys())





""" List of columns to add """

name_cols = [

    'First Name',

    'Middle Name',

    'Last Name',

    ]



"""

Use partition() to separate first and middle names into Pandas series.

Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the

values that we want.

"""

NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())

LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())



# We can use index positions to quickly add columns to the dataframe.

# The partition() function will keep the delimited value in the 1 index, so we'll use

# the 0 and 2 index positions for first and middle names.

data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]

data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]



# Finally, we'll add our Last Names column

data[name_cols[2]] = LAST_NAMES



# Optional: We can replace all blank values with numpy.NaN values using regular expressions.

data = data.replace(r'^$', np.NaN, regex=True)

Then you should end up with something like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben           NaN         JERRY

1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C        NaN           NaN           NaN

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, replace NaN values with with blank strings:

data = data.replace(np.NaN, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben                       JERRY

1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

answered Dec 30 '18 at 19:12

Mark Moretto

10615

Here's another answer that uses some simple lambda functionality.

import numpy as np

import pandas as pd





""" Create data and data frame """



info_dict = {

    'ID': [1,2,3,4,5,],

    'Complete_Name':[

        'JERRY, Ben',

        'VON HELSINKI, Olga',

        'JENSEN, James Goodboy Dean',

        'THE COMPANY',

        'CRUZ, Juan S. de la',

        ],

    'Type':['I','I','I','C','I',],

    }



data = pd.DataFrame(info_dict, columns = info_dict.keys())





""" List of columns to add """

name_cols = [

    'First Name',

    'Middle Name',

    'Last Name',

    ]



"""

Use partition() to separate first and middle names into Pandas series.

Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the

values that we want.

"""

NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())

LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())



# We can use index positions to quickly add columns to the dataframe.

# The partition() function will keep the delimited value in the 1 index, so we'll use

# the 0 and 2 index positions for first and middle names.

data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]

data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]



# Finally, we'll add our Last Names column

data[name_cols[2]] = LAST_NAMES



# Optional: We can replace all blank values with numpy.NaN values using regular expressions.

data = data.replace(r'^$', np.NaN, regex=True)

Then you should end up with something like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben           NaN         JERRY

1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C        NaN           NaN           NaN

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, replace NaN values with with blank strings:

data = data.replace(np.NaN, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name

0   1                  JERRY, Ben    I        Ben                       JERRY

1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI

2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN

3   4                 THE COMPANY    C                                       

4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

answered Dec 30 '18 at 19:12

Mark Moretto

10615

answered Dec 30 '18 at 19:12

Mark Moretto

10615

answered Dec 30 '18 at 19:12

Mark Moretto

10615

answered Dec 30 '18 at 19:12

Mark Moretto

10615

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk