How can I efficiently extract hours and minutes from a pandas column that has integer values in format HHMM,...
I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.
I tried using the pandas, "to_datetime" function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
Which was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.
python pandas
add a comment |
I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.
I tried using the pandas, "to_datetime" function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
Which was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.
python pandas
Yes, that returnsValueError: day is out of range for month
.
– evankielley
Jan 3 at 14:28
1
can you printdf2
or atleast a sample of it and an expected output. It will help us reproduce the issue.
– anky_91
Jan 3 at 14:32
Sure. This should cover all cases:import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.
– evankielley
Jan 3 at 14:44
add a comment |
I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.
I tried using the pandas, "to_datetime" function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
Which was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.
python pandas
I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.
I tried using the pandas, "to_datetime" function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
Which was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.
python pandas
python pandas
asked Jan 3 at 14:21
evankielleyevankielley
183
183
Yes, that returnsValueError: day is out of range for month
.
– evankielley
Jan 3 at 14:28
1
can you printdf2
or atleast a sample of it and an expected output. It will help us reproduce the issue.
– anky_91
Jan 3 at 14:32
Sure. This should cover all cases:import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.
– evankielley
Jan 3 at 14:44
add a comment |
Yes, that returnsValueError: day is out of range for month
.
– evankielley
Jan 3 at 14:28
1
can you printdf2
or atleast a sample of it and an expected output. It will help us reproduce the issue.
– anky_91
Jan 3 at 14:32
Sure. This should cover all cases:import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.
– evankielley
Jan 3 at 14:44
Yes, that returns
ValueError: day is out of range for month
.– evankielley
Jan 3 at 14:28
Yes, that returns
ValueError: day is out of range for month
.– evankielley
Jan 3 at 14:28
1
1
can you print
df2
or atleast a sample of it and an expected output. It will help us reproduce the issue.– anky_91
Jan 3 at 14:32
can you print
df2
or atleast a sample of it and an expected output. It will help us reproduce the issue.– anky_91
Jan 3 at 14:32
Sure. This should cover all cases:
import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.– evankielley
Jan 3 at 14:44
Sure. This should cover all cases:
import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.– evankielley
Jan 3 at 14:44
add a comment |
1 Answer
1
active
oldest
votes
No need for if statements. Series.str.zfill
will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime
, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with apd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date
– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54024179%2fhow-can-i-efficiently-extract-hours-and-minutes-from-a-pandas-column-that-has-in%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
No need for if statements. Series.str.zfill
will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime
, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with apd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date
– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
add a comment |
No need for if statements. Series.str.zfill
will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime
, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with apd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date
– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
add a comment |
No need for if statements. Series.str.zfill
will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime
, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
No need for if statements. Series.str.zfill
will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime
, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
edited Jan 3 at 15:27
answered Jan 3 at 14:45
ALollzALollz
15.8k31738
15.8k31738
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with apd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date
– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
add a comment |
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with apd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date
– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
2
2
This is great, thank you!
– evankielley
Jan 3 at 14:52
This is great, thank you!
– evankielley
Jan 3 at 14:52
@evankielley Thanks! I updated it with a
pd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date– ALollz
Jan 3 at 14:56
@evankielley Thanks! I updated it with a
pd.to_datetime
option, which I believe is probably more concise and performant. You just need to subtract off the dummy date– ALollz
Jan 3 at 14:56
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
Awesome, thanks again!
– evankielley
Jan 3 at 15:08
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54024179%2fhow-can-i-efficiently-extract-hours-and-minutes-from-a-pandas-column-that-has-in%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Yes, that returns
ValueError: day is out of range for month
.– evankielley
Jan 3 at 14:28
1
can you print
df2
or atleast a sample of it and an expected output. It will help us reproduce the issue.– anky_91
Jan 3 at 14:32
Sure. This should cover all cases:
import pandas as pd;
sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]});
sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M')
It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.– evankielley
Jan 3 at 14:44