How to split a pandas string to extract middle names?












4















I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.



The data would look like this:



ID| Complete_Name               | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"


Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.



df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])


The desired results should look like this:



ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ









share|improve this question

























  • Possible duplicate of splitting a column by delimiter pandas python

    – Nitin Sharma
    Dec 30 '18 at 18:13











  • ^ Nope, not that one.

    – coldspeed
    Dec 30 '18 at 18:34
















4















I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.



The data would look like this:



ID| Complete_Name               | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"


Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.



df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])


The desired results should look like this:



ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ









share|improve this question

























  • Possible duplicate of splitting a column by delimiter pandas python

    – Nitin Sharma
    Dec 30 '18 at 18:13











  • ^ Nope, not that one.

    – coldspeed
    Dec 30 '18 at 18:34














4












4








4


1






I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.



The data would look like this:



ID| Complete_Name               | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"


Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.



df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])


The desired results should look like this:



ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ









share|improve this question
















I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.



The data would look like this:



ID| Complete_Name               | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"


Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.



df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])


The desired results should look like this:



ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ






python regex string pandas split






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 19 at 0:13









coldspeed

128k23129216




128k23129216










asked Dec 30 '18 at 17:14









mrPymrPy

857




857













  • Possible duplicate of splitting a column by delimiter pandas python

    – Nitin Sharma
    Dec 30 '18 at 18:13











  • ^ Nope, not that one.

    – coldspeed
    Dec 30 '18 at 18:34



















  • Possible duplicate of splitting a column by delimiter pandas python

    – Nitin Sharma
    Dec 30 '18 at 18:13











  • ^ Nope, not that one.

    – coldspeed
    Dec 30 '18 at 18:34

















Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13





Possible duplicate of splitting a column by delimiter pandas python

– Nitin Sharma
Dec 30 '18 at 18:13













^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34





^ Nope, not that one.

– coldspeed
Dec 30 '18 at 18:34












3 Answers
3






active

oldest

votes


















5














A single str.extract call will work here:



p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')

ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la


Regex Breakdown



^                # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)





share|improve this answer


























  • Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

    – mrPy
    Dec 31 '18 at 10:00











  • @mrPy Edited, hope that helps.

    – coldspeed
    Dec 31 '18 at 10:03



















3














I think you can do:



# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))

# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]

# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']

# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)

# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]

# remove none values
df = df.replace([None], '')

ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





share|improve this answer


























  • Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

    – mrPy
    Dec 30 '18 at 17:47













  • @mrPy sorry i missed that part, fixing it now.

    – YOLO
    Dec 30 '18 at 17:54



















1














Here's another answer that uses some simple lambda functionality.



import numpy as np
import pandas as pd


""" Create data and data frame """

info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}

data = pd.DataFrame(info_dict, columns = info_dict.keys())


""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]

"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES

# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)


Then you should end up with something like this:



   ID               Complete_Name Type First Name   Middle Name     Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ


Or, replace NaN values with with blank strings:



data = data.replace(np.NaN, r'', regex=False)


Then you have:



   ID               Complete_Name Type First Name   Middle Name     Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53979750%2fhow-to-split-a-pandas-string-to-extract-middle-names%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    5














    A single str.extract call will work here:



    p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 
    u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
    pd.concat([df, u], axis=1).fillna('')

    ID Complete_Name Type Last_Name First_Name Middle_Name
    0 1 JERRY, Ben I JERRY Ben
    1 2 VON HELSINKI, Olga I VON HELSINKI Olga
    2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la


    Regex Breakdown



    ^                # Start-of-line
    (?P<Last_Name> # First named capture group - Last Name
    .* # Match anything until...
    )
    , # ...we see a comma
    s # whitespace
    (?P<First_Name> # Second capture group - First Name
    S+ # Match all non-whitespace characters
    )
    b # Word boundary
    s* # Optional whitespace chars (mostly housekeeping)
    (?P<Middle_Name> # Third capture group - Zero of more middle names
    .* # Match everything till the end of string
    )





    share|improve this answer


























    • Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

      – mrPy
      Dec 31 '18 at 10:00











    • @mrPy Edited, hope that helps.

      – coldspeed
      Dec 31 '18 at 10:03
















    5














    A single str.extract call will work here:



    p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 
    u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
    pd.concat([df, u], axis=1).fillna('')

    ID Complete_Name Type Last_Name First_Name Middle_Name
    0 1 JERRY, Ben I JERRY Ben
    1 2 VON HELSINKI, Olga I VON HELSINKI Olga
    2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la


    Regex Breakdown



    ^                # Start-of-line
    (?P<Last_Name> # First named capture group - Last Name
    .* # Match anything until...
    )
    , # ...we see a comma
    s # whitespace
    (?P<First_Name> # Second capture group - First Name
    S+ # Match all non-whitespace characters
    )
    b # Word boundary
    s* # Optional whitespace chars (mostly housekeeping)
    (?P<Middle_Name> # Third capture group - Zero of more middle names
    .* # Match everything till the end of string
    )





    share|improve this answer


























    • Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

      – mrPy
      Dec 31 '18 at 10:00











    • @mrPy Edited, hope that helps.

      – coldspeed
      Dec 31 '18 at 10:03














    5












    5








    5







    A single str.extract call will work here:



    p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 
    u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
    pd.concat([df, u], axis=1).fillna('')

    ID Complete_Name Type Last_Name First_Name Middle_Name
    0 1 JERRY, Ben I JERRY Ben
    1 2 VON HELSINKI, Olga I VON HELSINKI Olga
    2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la


    Regex Breakdown



    ^                # Start-of-line
    (?P<Last_Name> # First named capture group - Last Name
    .* # Match anything until...
    )
    , # ...we see a comma
    s # whitespace
    (?P<First_Name> # Second capture group - First Name
    S+ # Match all non-whitespace characters
    )
    b # Word boundary
    s* # Optional whitespace chars (mostly housekeeping)
    (?P<Middle_Name> # Third capture group - Zero of more middle names
    .* # Match everything till the end of string
    )





    share|improve this answer















    A single str.extract call will work here:



    p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)' 
    u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
    pd.concat([df, u], axis=1).fillna('')

    ID Complete_Name Type Last_Name First_Name Middle_Name
    0 1 JERRY, Ben I JERRY Ben
    1 2 VON HELSINKI, Olga I VON HELSINKI Olga
    2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la


    Regex Breakdown



    ^                # Start-of-line
    (?P<Last_Name> # First named capture group - Last Name
    .* # Match anything until...
    )
    , # ...we see a comma
    s # whitespace
    (?P<First_Name> # Second capture group - First Name
    S+ # Match all non-whitespace characters
    )
    b # Word boundary
    s* # Optional whitespace chars (mostly housekeeping)
    (?P<Middle_Name> # Third capture group - Zero of more middle names
    .* # Match everything till the end of string
    )






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 31 '18 at 10:03

























    answered Dec 30 '18 at 18:00









    coldspeedcoldspeed

    128k23129216




    128k23129216













    • Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

      – mrPy
      Dec 31 '18 at 10:00











    • @mrPy Edited, hope that helps.

      – coldspeed
      Dec 31 '18 at 10:03



















    • Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

      – mrPy
      Dec 31 '18 at 10:00











    • @mrPy Edited, hope that helps.

      – coldspeed
      Dec 31 '18 at 10:03

















    Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

    – mrPy
    Dec 31 '18 at 10:00





    Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?

    – mrPy
    Dec 31 '18 at 10:00













    @mrPy Edited, hope that helps.

    – coldspeed
    Dec 31 '18 at 10:03





    @mrPy Edited, hope that helps.

    – coldspeed
    Dec 31 '18 at 10:03













    3














    I think you can do:



    # take the complete_name column and split it multiple times
    df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
    .split(',', expand=True)
    .fillna(''))

    # remove extra spaces
    for x in df2.columns:
    df2[x] = [x.strip() for x in df2[x]]

    # split the name on first space and join it
    df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
    df2.columns = ['last','first','middle']

    # join the data frames
    df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)

    # rearrange columns - not necessary though
    df = df[['ID','Complete_Name','first','middle','last']]

    # remove none values
    df = df.replace([None], '')

    ID Complete_Name Type first middle last
    0 1 JERRY, Ben I Ben JERRY
    1 2 VON HELSINKI, Olga I Olga VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





    share|improve this answer


























    • Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

      – mrPy
      Dec 30 '18 at 17:47













    • @mrPy sorry i missed that part, fixing it now.

      – YOLO
      Dec 30 '18 at 17:54
















    3














    I think you can do:



    # take the complete_name column and split it multiple times
    df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
    .split(',', expand=True)
    .fillna(''))

    # remove extra spaces
    for x in df2.columns:
    df2[x] = [x.strip() for x in df2[x]]

    # split the name on first space and join it
    df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
    df2.columns = ['last','first','middle']

    # join the data frames
    df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)

    # rearrange columns - not necessary though
    df = df[['ID','Complete_Name','first','middle','last']]

    # remove none values
    df = df.replace([None], '')

    ID Complete_Name Type first middle last
    0 1 JERRY, Ben I Ben JERRY
    1 2 VON HELSINKI, Olga I Olga VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





    share|improve this answer


























    • Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

      – mrPy
      Dec 30 '18 at 17:47













    • @mrPy sorry i missed that part, fixing it now.

      – YOLO
      Dec 30 '18 at 17:54














    3












    3








    3







    I think you can do:



    # take the complete_name column and split it multiple times
    df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
    .split(',', expand=True)
    .fillna(''))

    # remove extra spaces
    for x in df2.columns:
    df2[x] = [x.strip() for x in df2[x]]

    # split the name on first space and join it
    df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
    df2.columns = ['last','first','middle']

    # join the data frames
    df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)

    # rearrange columns - not necessary though
    df = df[['ID','Complete_Name','first','middle','last']]

    # remove none values
    df = df.replace([None], '')

    ID Complete_Name Type first middle last
    0 1 JERRY, Ben I Ben JERRY
    1 2 VON HELSINKI, Olga I Olga VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





    share|improve this answer















    I think you can do:



    # take the complete_name column and split it multiple times
    df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
    .split(',', expand=True)
    .fillna(''))

    # remove extra spaces
    for x in df2.columns:
    df2[x] = [x.strip() for x in df2[x]]

    # split the name on first space and join it
    df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
    df2.columns = ['last','first','middle']

    # join the data frames
    df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)

    # rearrange columns - not necessary though
    df = df[['ID','Complete_Name','first','middle','last']]

    # remove none values
    df = df.replace([None], '')

    ID Complete_Name Type first middle last
    0 1 JERRY, Ben I Ben JERRY
    1 2 VON HELSINKI, Olga I Olga VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 30 '18 at 23:33

























    answered Dec 30 '18 at 17:25









    YOLOYOLO

    5,4521324




    5,4521324













    • Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

      – mrPy
      Dec 30 '18 at 17:47













    • @mrPy sorry i missed that part, fixing it now.

      – YOLO
      Dec 30 '18 at 17:54



















    • Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

      – mrPy
      Dec 30 '18 at 17:47













    • @mrPy sorry i missed that part, fixing it now.

      – YOLO
      Dec 30 '18 at 17:54

















    Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

    – mrPy
    Dec 30 '18 at 17:47







    Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.

    – mrPy
    Dec 30 '18 at 17:47















    @mrPy sorry i missed that part, fixing it now.

    – YOLO
    Dec 30 '18 at 17:54





    @mrPy sorry i missed that part, fixing it now.

    – YOLO
    Dec 30 '18 at 17:54











    1














    Here's another answer that uses some simple lambda functionality.



    import numpy as np
    import pandas as pd


    """ Create data and data frame """

    info_dict = {
    'ID': [1,2,3,4,5,],
    'Complete_Name':[
    'JERRY, Ben',
    'VON HELSINKI, Olga',
    'JENSEN, James Goodboy Dean',
    'THE COMPANY',
    'CRUZ, Juan S. de la',
    ],
    'Type':['I','I','I','C','I',],
    }

    data = pd.DataFrame(info_dict, columns = info_dict.keys())


    """ List of columns to add """
    name_cols = [
    'First Name',
    'Middle Name',
    'Last Name',
    ]

    """
    Use partition() to separate first and middle names into Pandas series.
    Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
    values that we want.
    """
    NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
    LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

    # We can use index positions to quickly add columns to the dataframe.
    # The partition() function will keep the delimited value in the 1 index, so we'll use
    # the 0 and 2 index positions for first and middle names.
    data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
    data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

    # Finally, we'll add our Last Names column
    data[name_cols[2]] = LAST_NAMES

    # Optional: We can replace all blank values with numpy.NaN values using regular expressions.
    data = data.replace(r'^$', np.NaN, regex=True)


    Then you should end up with something like this:



       ID               Complete_Name Type First Name   Middle Name     Last Name
    0 1 JERRY, Ben I Ben NaN JERRY
    1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C NaN NaN NaN
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ


    Or, replace NaN values with with blank strings:



    data = data.replace(np.NaN, r'', regex=False)


    Then you have:



       ID               Complete_Name Type First Name   Middle Name     Last Name
    0 1 JERRY, Ben I Ben JERRY
    1 2 VON HELSINKI, Olga I Olga VON HELSINKI
    2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
    3 4 THE COMPANY C
    4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





    share|improve this answer




























      1














      Here's another answer that uses some simple lambda functionality.



      import numpy as np
      import pandas as pd


      """ Create data and data frame """

      info_dict = {
      'ID': [1,2,3,4,5,],
      'Complete_Name':[
      'JERRY, Ben',
      'VON HELSINKI, Olga',
      'JENSEN, James Goodboy Dean',
      'THE COMPANY',
      'CRUZ, Juan S. de la',
      ],
      'Type':['I','I','I','C','I',],
      }

      data = pd.DataFrame(info_dict, columns = info_dict.keys())


      """ List of columns to add """
      name_cols = [
      'First Name',
      'Middle Name',
      'Last Name',
      ]

      """
      Use partition() to separate first and middle names into Pandas series.
      Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
      values that we want.
      """
      NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
      LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

      # We can use index positions to quickly add columns to the dataframe.
      # The partition() function will keep the delimited value in the 1 index, so we'll use
      # the 0 and 2 index positions for first and middle names.
      data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
      data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

      # Finally, we'll add our Last Names column
      data[name_cols[2]] = LAST_NAMES

      # Optional: We can replace all blank values with numpy.NaN values using regular expressions.
      data = data.replace(r'^$', np.NaN, regex=True)


      Then you should end up with something like this:



         ID               Complete_Name Type First Name   Middle Name     Last Name
      0 1 JERRY, Ben I Ben NaN JERRY
      1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
      2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
      3 4 THE COMPANY C NaN NaN NaN
      4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ


      Or, replace NaN values with with blank strings:



      data = data.replace(np.NaN, r'', regex=False)


      Then you have:



         ID               Complete_Name Type First Name   Middle Name     Last Name
      0 1 JERRY, Ben I Ben JERRY
      1 2 VON HELSINKI, Olga I Olga VON HELSINKI
      2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
      3 4 THE COMPANY C
      4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





      share|improve this answer


























        1












        1








        1







        Here's another answer that uses some simple lambda functionality.



        import numpy as np
        import pandas as pd


        """ Create data and data frame """

        info_dict = {
        'ID': [1,2,3,4,5,],
        'Complete_Name':[
        'JERRY, Ben',
        'VON HELSINKI, Olga',
        'JENSEN, James Goodboy Dean',
        'THE COMPANY',
        'CRUZ, Juan S. de la',
        ],
        'Type':['I','I','I','C','I',],
        }

        data = pd.DataFrame(info_dict, columns = info_dict.keys())


        """ List of columns to add """
        name_cols = [
        'First Name',
        'Middle Name',
        'Last Name',
        ]

        """
        Use partition() to separate first and middle names into Pandas series.
        Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
        values that we want.
        """
        NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
        LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

        # We can use index positions to quickly add columns to the dataframe.
        # The partition() function will keep the delimited value in the 1 index, so we'll use
        # the 0 and 2 index positions for first and middle names.
        data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
        data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

        # Finally, we'll add our Last Names column
        data[name_cols[2]] = LAST_NAMES

        # Optional: We can replace all blank values with numpy.NaN values using regular expressions.
        data = data.replace(r'^$', np.NaN, regex=True)


        Then you should end up with something like this:



           ID               Complete_Name Type First Name   Middle Name     Last Name
        0 1 JERRY, Ben I Ben NaN JERRY
        1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
        2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
        3 4 THE COMPANY C NaN NaN NaN
        4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ


        Or, replace NaN values with with blank strings:



        data = data.replace(np.NaN, r'', regex=False)


        Then you have:



           ID               Complete_Name Type First Name   Middle Name     Last Name
        0 1 JERRY, Ben I Ben JERRY
        1 2 VON HELSINKI, Olga I Olga VON HELSINKI
        2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
        3 4 THE COMPANY C
        4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ





        share|improve this answer













        Here's another answer that uses some simple lambda functionality.



        import numpy as np
        import pandas as pd


        """ Create data and data frame """

        info_dict = {
        'ID': [1,2,3,4,5,],
        'Complete_Name':[
        'JERRY, Ben',
        'VON HELSINKI, Olga',
        'JENSEN, James Goodboy Dean',
        'THE COMPANY',
        'CRUZ, Juan S. de la',
        ],
        'Type':['I','I','I','C','I',],
        }

        data = pd.DataFrame(info_dict, columns = info_dict.keys())


        """ List of columns to add """
        name_cols = [
        'First Name',
        'Middle Name',
        'Last Name',
        ]

        """
        Use partition() to separate first and middle names into Pandas series.
        Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
        values that we want.
        """
        NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
        LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

        # We can use index positions to quickly add columns to the dataframe.
        # The partition() function will keep the delimited value in the 1 index, so we'll use
        # the 0 and 2 index positions for first and middle names.
        data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
        data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

        # Finally, we'll add our Last Names column
        data[name_cols[2]] = LAST_NAMES

        # Optional: We can replace all blank values with numpy.NaN values using regular expressions.
        data = data.replace(r'^$', np.NaN, regex=True)


        Then you should end up with something like this:



           ID               Complete_Name Type First Name   Middle Name     Last Name
        0 1 JERRY, Ben I Ben NaN JERRY
        1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
        2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
        3 4 THE COMPANY C NaN NaN NaN
        4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ


        Or, replace NaN values with with blank strings:



        data = data.replace(np.NaN, r'', regex=False)


        Then you have:



           ID               Complete_Name Type First Name   Middle Name     Last Name
        0 1 JERRY, Ben I Ben JERRY
        1 2 VON HELSINKI, Olga I Olga VON HELSINKI
        2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
        3 4 THE COMPANY C
        4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Dec 30 '18 at 19:12









        Mark MorettoMark Moretto

        10615




        10615






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53979750%2fhow-to-split-a-pandas-string-to-extract-middle-names%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Monofisismo

            Angular Downloading a file using contenturl with Basic Authentication

            Olmecas