How to split a pandas string to extract middle names?
I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
python regex string pandas split
add a comment |
I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
python regex string pandas split
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34
add a comment |
I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
python regex string pandas split
I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
python regex string pandas split
python regex string pandas split
edited Jan 19 at 0:13
coldspeed
128k23129216
128k23129216
asked Dec 30 '18 at 17:14
mrPymrPy
857
857
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34
add a comment |
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34
add a comment |
3 Answers
3
active
oldest
votes
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
add a comment |
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
add a comment |
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with with blank strings:
data = data.replace(np.NaN, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53979750%2fhow-to-split-a-pandas-string-to-extract-middle-names%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
add a comment |
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
add a comment |
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>S+)bs*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
s # whitespace
(?P<First_Name> # Second capture group - First Name
S+ # Match all non-whitespace characters
)
b # Word boundary
s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
edited Dec 31 '18 at 10:03
answered Dec 30 '18 at 18:00
coldspeedcoldspeed
128k23129216
128k23129216
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
add a comment |
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
Pretty neat, but can you explain the regex in more detail for people who will visit in the future so that they can adapt the regex to their own needs?
– mrPy
Dec 31 '18 at 10:00
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
@mrPy Edited, hope that helps.
– coldspeed
Dec 31 '18 at 10:03
add a comment |
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
add a comment |
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
add a comment |
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
edited Dec 30 '18 at 23:33
answered Dec 30 '18 at 17:25
YOLOYOLO
5,4521324
5,4521324
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
add a comment |
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
Seems to work YOLO, except for the fact that you did not take into account the np.where(df.Type=="I"). So it also splits the "C" types.
– mrPy
Dec 30 '18 at 17:47
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
@mrPy sorry i missed that part, fixing it now.
– YOLO
Dec 30 '18 at 17:54
add a comment |
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with with blank strings:
data = data.replace(np.NaN, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
add a comment |
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with with blank strings:
data = data.replace(np.NaN, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
add a comment |
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with with blank strings:
data = data.replace(np.NaN, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.NaN, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with with blank strings:
data = data.replace(np.NaN, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
answered Dec 30 '18 at 19:12
Mark MorettoMark Moretto
10615
10615
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53979750%2fhow-to-split-a-pandas-string-to-extract-middle-names%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Possible duplicate of splitting a column by delimiter pandas python
– Nitin Sharma
Dec 30 '18 at 18:13
^ Nope, not that one.
– coldspeed
Dec 30 '18 at 18:34