Automating interpolation of missing values in pandas dataframe
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in data. In the current example, we have about 85 days of year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
python pandas time-series interpolation missing-data
add a comment |
I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in data. In the current example, we have about 85 days of year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
python pandas time-series interpolation missing-data
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40
add a comment |
I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in data. In the current example, we have about 85 days of year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
python pandas time-series interpolation missing-data
I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in data. In the current example, we have about 85 days of year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
python pandas time-series interpolation missing-data
python pandas time-series interpolation missing-data
edited Jan 4 at 10:44
Georgy
2,34241729
2,34241729
asked Jan 4 at 10:22
rahulrahul
767
767
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40
add a comment |
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40
add a comment |
1 Answer
1
active
oldest
votes
This resample for each day. Then fill the gaps.
dataFrame['bookings'].resample('D').pad()
You can have more resampler idea on this page (so you can select the one that fit the best with your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54037000%2fautomating-interpolation-of-missing-values-in-pandas-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
This resample for each day. Then fill the gaps.
dataFrame['bookings'].resample('D').pad()
You can have more resampler idea on this page (so you can select the one that fit the best with your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
add a comment |
This resample for each day. Then fill the gaps.
dataFrame['bookings'].resample('D').pad()
You can have more resampler idea on this page (so you can select the one that fit the best with your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
add a comment |
This resample for each day. Then fill the gaps.
dataFrame['bookings'].resample('D').pad()
You can have more resampler idea on this page (so you can select the one that fit the best with your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
This resample for each day. Then fill the gaps.
dataFrame['bookings'].resample('D').pad()
You can have more resampler idea on this page (so you can select the one that fit the best with your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
answered Jan 4 at 14:24
Ludo SchmidtLudo Schmidt
1438
1438
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54037000%2fautomating-interpolation-of-missing-values-in-pandas-dataframe%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I doubt an interpolation would be accurate...
– BlackBear
Jan 4 at 10:39
That's true.. it's only for some testing purpose. For now, I need to know how to prepare the dataFrame by including missing dates and NaN values in bookings column. There seems to be many methods to estimate missing time series data.
– rahul
Jan 4 at 10:40