How can I efficiently extract hours and minutes from a pandas column that has integer values in format HHMM,...

I have a csv file that contains a column of data where each value is an integer meant to represent the hour and minute in a day. The problem is that each value does not follow the same format. If it is between 12:00 AM and 12:10 AM the value will just be one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value will have to digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value will have three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM, the value will have four digits, again the hour and minute.

I tried using the pandas, "to_datetime" function to operate on the whole column.

from pandas import read_csv, to_datetime

url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'

df = read_csv(url(2011))

def convert_float_column_to_int_column(df, *column_names):

    for column_name in column_names:

        try:

            df[column_name] = df[column_name].astype(int)

        except ValueError:

            df = df.dropna(subset=[column_name]).reset_index(drop=True)

            df[column_name] = df[column_name].astype(int)

    return df

df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')

df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')

The result I got was:

ValueError: time data '0' does not match format '%H%M' (match).

Which was as expected.

I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.

asked Jan 3 at 14:21

evankielley

183

Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28

1

can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32

Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44

add a comment |

I tried using the pandas, "to_datetime" function to operate on the whole column.

from pandas import read_csv, to_datetime

url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'

df = read_csv(url(2011))

def convert_float_column_to_int_column(df, *column_names):

    for column_name in column_names:

        try:

            df[column_name] = df[column_name].astype(int)

        except ValueError:

            df = df.dropna(subset=[column_name]).reset_index(drop=True)

            df[column_name] = df[column_name].astype(int)

    return df

df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')

df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')

The result I got was:

ValueError: time data '0' does not match format '%H%M' (match).

Which was as expected.

asked Jan 3 at 14:21

evankielley

183

Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28

1

can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32

Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44

add a comment |

I tried using the pandas, "to_datetime" function to operate on the whole column.

from pandas import read_csv, to_datetime

url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'

df = read_csv(url(2011))

def convert_float_column_to_int_column(df, *column_names):

    for column_name in column_names:

        try:

            df[column_name] = df[column_name].astype(int)

        except ValueError:

            df = df.dropna(subset=[column_name]).reset_index(drop=True)

            df[column_name] = df[column_name].astype(int)

    return df

df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')

df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')

The result I got was:

ValueError: time data '0' does not match format '%H%M' (match).

Which was as expected.

asked Jan 3 at 14:21

evankielley

183

I tried using the pandas, "to_datetime" function to operate on the whole column.

from pandas import read_csv, to_datetime

url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'

df = read_csv(url(2011))

def convert_float_column_to_int_column(df, *column_names):

    for column_name in column_names:

        try:

            df[column_name] = df[column_name].astype(int)

        except ValueError:

            df = df.dropna(subset=[column_name]).reset_index(drop=True)

            df[column_name] = df[column_name].astype(int)

    return df

df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')

df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')

The result I got was:

ValueError: time data '0' does not match format '%H%M' (match).

Which was as expected.

python pandas

asked Jan 3 at 14:21

evankielley

183

asked Jan 3 at 14:21

evankielley

183

asked Jan 3 at 14:21

evankielley

183

asked Jan 3 at 14:21

evankielley

183

asked Jan 3 at 14:21

evankielley

183

Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28

1

can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32

Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44

add a comment |

Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28

1

can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32

Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44

Yes, that returns ValueError: day is out of range for month.

– evankielley
Jan 3 at 14:28

can you print df2 or atleast a sample of it and an expected output. It will help us reproduce the issue.

– anky_91
Jan 3 at 14:32

Sure. This should cover all cases: import pandas as pd; sample_df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [1, 10, 100, 1000]}); sample_df['B'] = pd.to_datetime(sample_df['B'].astype(str), format='%H%M') It produces the same error. What I'm looking for is a convenient way to know that the first value in B is meant to represent 12:01 AM. Then 12:10 AM, 1:00 AM, and 10:00 AM.

– evankielley
Jan 3 at 14:44

add a comment |

1 Answer
1

active

oldest

votes

No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:

Input Data

import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

#   Time

#0     1

#1    12

#2   123

#3  1234

`pd.to_datetime`

df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 

              - pd.to_datetime('1900-01-01'))



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

`pd.to_timedelta`

Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:

df['Time'] = df.Time.astype(str).str.zfill(4)



# Pandas .str methods are slow, use a list comprehension to speed it up

#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'    

csize=2

df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]



df['Time'] = pd.to_timedelta(df.Time)



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

2

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54024179%2fhow-can-i-efficiently-extract-hours-and-minutes-from-a-pandas-column-that-has-in%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Input Data

import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

#   Time

#0     1

#1    12

#2   123

#3  1234

`pd.to_datetime`

df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 

              - pd.to_datetime('1900-01-01'))



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

`pd.to_timedelta`

Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:

df['Time'] = df.Time.astype(str).str.zfill(4)



# Pandas .str methods are slow, use a list comprehension to speed it up

#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'    

csize=2

df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]



df['Time'] = pd.to_timedelta(df.Time)



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

2

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

add a comment |

Input Data

import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

#   Time

#0     1

#1    12

#2   123

#3  1234

`pd.to_datetime`

df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 

              - pd.to_datetime('1900-01-01'))



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

`pd.to_timedelta`

Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:

df['Time'] = df.Time.astype(str).str.zfill(4)



# Pandas .str methods are slow, use a list comprehension to speed it up

#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'    

csize=2

df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]



df['Time'] = pd.to_timedelta(df.Time)



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

2

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

add a comment |

Input Data

import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

#   Time

#0     1

#1    12

#2   123

#3  1234

`pd.to_datetime`

df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 

              - pd.to_datetime('1900-01-01'))



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

`pd.to_timedelta`

Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:

df['Time'] = df.Time.astype(str).str.zfill(4)



# Pandas .str methods are slow, use a list comprehension to speed it up

#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'    

csize=2

df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]



df['Time'] = pd.to_timedelta(df.Time)



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

Input Data

import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

#   Time

#0     1

#1    12

#2   123

#3  1234

`pd.to_datetime`

df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M') 

              - pd.to_datetime('1900-01-01'))



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

`pd.to_timedelta`

Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:

df['Time'] = df.Time.astype(str).str.zfill(4)



# Pandas .str methods are slow, use a list comprehension to speed it up

#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'    

csize=2

df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]



df['Time'] = pd.to_timedelta(df.Time)



#0   00:01:00

#1   00:12:00

#2   01:23:00

#3   12:34:00

#Name: Time, dtype: timedelta64[ns]

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

edited Jan 3 at 15:27

answered Jan 3 at 14:45

ALollz

15.8k31738

answered Jan 3 at 14:45

ALollz

15.8k31738

answered Jan 3 at 14:45

ALollz

15.8k31738

2

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

add a comment |

2

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

This is great, thank you!

– evankielley
Jan 3 at 14:52

@evankielley Thanks! I updated it with a pd.to_datetime option, which I believe is probably more concise and performant. You just need to subtract off the dummy date

– ALollz
Jan 3 at 14:56

Awesome, thanks again!

– evankielley
Jan 3 at 15:08

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk