How can I efficiently extract hours and minutes from a pandas column that has integer values in format HHMM,...












1















I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.



I tried using the pandas, "to_datetime" function to operate on the whole column.



from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')


The result I got was:



ValueError: time data '0' does not match format '%H%M' (match). 


Which was as expected.



I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.










share|improve this question























  • Yes, that returns ValueError: day is out of range for month.

    – evankielley
    Jan 3 at 14:28






  • 1





    can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

    – anky_91
    Jan 3 at 14:32













  • Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

    – evankielley
    Jan 3 at 14:44


















1















I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.



I tried using the pandas, "to_datetime" function to operate on the whole column.



from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')


The result I got was:



ValueError: time data '0' does not match format '%H%M' (match). 


Which was as expected.



I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.










share|improve this question























  • Yes, that returns ValueError: day is out of range for month.

    – evankielley
    Jan 3 at 14:28






  • 1





    can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

    – anky_91
    Jan 3 at 14:32













  • Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

    – evankielley
    Jan 3 at 14:44
















1












1








1








I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.



I tried using the pandas, "to_datetime" function to operate on the whole column.



from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')


The result I got was:



ValueError: time data '0' does not match format '%H%M' (match). 


Which was as expected.



I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.










share|improve this question














I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.



I tried using the pandas, "to_datetime" function to operate on the whole column.



from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
for column_name in column_names:
try:
df[column_name] = df[column_name].astype(int)
except ValueError:
df = df.dropna(subset=[column_name]).reset_index(drop=True)
df[column_name] = df[column_name].astype(int)
return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')


The result I got was:



ValueError: time data '0' does not match format '%H%M' (match). 


Which was as expected.



I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.







python pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Jan 3 at 14:21









evankielleyevankielley

183




183













  • Yes, that returns ValueError: day is out of range for month.

    – evankielley
    Jan 3 at 14:28






  • 1





    can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

    – anky_91
    Jan 3 at 14:32













  • Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

    – evankielley
    Jan 3 at 14:44





















  • Yes, that returns ValueError: day is out of range for month.

    – evankielley
    Jan 3 at 14:28






  • 1





    can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

    – anky_91
    Jan 3 at 14:32













  • Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

    – evankielley
    Jan 3 at 14:44



















Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28





Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28




1




1





can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32







can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32















Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44







Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44














1 Answer
1






active

oldest

votes


















1














No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:



Input Data



import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234


pd.to_datetime



df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 
- pd.to_datetime('1900-01-01'))

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]


pd.to_timedelta



Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:



df['Time'] = df.Time.astype(str).str.zfill(4)

# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]

df['Time'] = pd.to_timedelta(df.Time)

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]





share|improve this answer





















  • 2





    This is great, thank you!

    – evankielley
    Jan 3 at 14:52











  • @evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

    – ALollz
    Jan 3 at 14:56











  • Awesome, thanks again!

    – evankielley
    Jan 3 at 15:08












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54024179%2fhow-can-i-efficiently-extract-hours-and-minutes-from-a-pandas-column-that-has-in%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:



Input Data



import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234


pd.to_datetime



df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 
- pd.to_datetime('1900-01-01'))

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]


pd.to_timedelta



Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:



df['Time'] = df.Time.astype(str).str.zfill(4)

# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]

df['Time'] = pd.to_timedelta(df.Time)

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]





share|improve this answer





















  • 2





    This is great, thank you!

    – evankielley
    Jan 3 at 14:52











  • @evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

    – ALollz
    Jan 3 at 14:56











  • Awesome, thanks again!

    – evankielley
    Jan 3 at 15:08
















1














No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:



Input Data



import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234


pd.to_datetime



df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 
- pd.to_datetime('1900-01-01'))

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]


pd.to_timedelta



Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:



df['Time'] = df.Time.astype(str).str.zfill(4)

# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]

df['Time'] = pd.to_timedelta(df.Time)

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]





share|improve this answer





















  • 2





    This is great, thank you!

    – evankielley
    Jan 3 at 14:52











  • @evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

    – ALollz
    Jan 3 at 14:56











  • Awesome, thanks again!

    – evankielley
    Jan 3 at 15:08














1












1








1







No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:



Input Data



import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234


pd.to_datetime



df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 
- pd.to_datetime('1900-01-01'))

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]


pd.to_timedelta



Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:



df['Time'] = df.Time.astype(str).str.zfill(4)

# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]

df['Time'] = pd.to_timedelta(df.Time)

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]





share|improve this answer















No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:



Input Data



import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234


pd.to_datetime



df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 
- pd.to_datetime('1900-01-01'))

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]


pd.to_timedelta



Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:



df['Time'] = df.Time.astype(str).str.zfill(4)

# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]

df['Time'] = pd.to_timedelta(df.Time)

#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 3 at 15:27

























answered Jan 3 at 14:45









ALollzALollz

15.8k31738




15.8k31738








  • 2





    This is great, thank you!

    – evankielley
    Jan 3 at 14:52











  • @evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

    – ALollz
    Jan 3 at 14:56











  • Awesome, thanks again!

    – evankielley
    Jan 3 at 15:08














  • 2





    This is great, thank you!

    – evankielley
    Jan 3 at 14:52











  • @evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

    – ALollz
    Jan 3 at 14:56











  • Awesome, thanks again!

    – evankielley
    Jan 3 at 15:08








2




2





This is great, thank you!

– evankielley
Jan 3 at 14:52





This is great, thank you!

– evankielley
Jan 3 at 14:52













@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56





@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56













Awesome, thanks again!

– evankielley
Jan 3 at 15:08





Awesome, thanks again!

– evankielley
Jan 3 at 15:08




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54024179%2fhow-can-i-efficiently-extract-hours-and-minutes-from-a-pandas-column-that-has-in%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Monofisismo

Angular Downloading a file using contenturl with Basic Authentication

Olmecas