Filter a dictionary based on the value of its date keys
I want to import articles from as many sources around the world as possible, starting from a certain date.
import requests
import pandas as pd

url = ('https://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=de9e19b7547e44c4983ad761c104278f')
response = requests.get(url)
response_dataframe = pd.DataFrame(response.json())
articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z'}
print(articles)
But I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-0f21f2f50907> in <module>
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
<ipython-input-84-0f21f2f50907> in <setcomp>(.0)
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
TypeError: unhashable type: 'dict'
How can I select a range of articles by filtering on these date keys?
The expected output is a dataframe with the articles sorted by day and by newspaper.
The New York Times The Washington Post The Financial Times
2007-01-01 . What Sticks from '06. Somalia Orders Islamis... New Ebola Vaccine Gives 100 Percent Protecti...
2007-01-02 . Heart Health: Vitamin Does Not Prevent Death... Flurry of Settlements Over Toxic Mortgages M...
2007-01-03 . Google Answer to Filling Jobs Is an Algorith... Jason Miller Backs Out of White House Commun...
2007-01-04 . Helping Make the Shift From Combat to Commer... Wielding Claims of ‘Fake News,’ Conservative...
2007-01-05 . Rise in Ethanol Raises Concerns About Corn a... When One Party Has the Governor’s Mansion an
...
My Python version is 3.6.6
python python-3.x api date filter
It's not clear what you are trying to do. Why are you putting the JSON response in a dataframe? Why is a set comprehension used, are you trying to avoid duplicates of some kind?
– Martijn Pieters♦
Jan 4 at 17:23
Put differently, can you clearly illustrate what your expected outcome is here?
– Martijn Pieters♦
Jan 4 at 17:23
The code executes without any errors for me. When I print(response), I get <Response [200]>. I am not sure if that's the expected response, but I can't reproduce the error. @MartijnPieters: What version are you using?
– Sheldore
Jan 4 at 17:24
Ah, I see. The code posted produces an empty set (no articles with that publication date). The traceback doesn't use the same criteria, however; it uses >= to filter with, producing a non-empty match, and then you try to put those dictionaries in a set. I was using the line from the traceback.
– Martijn Pieters♦
Jan 4 at 17:27
At any rate, [a for a in response.json()['articles'] if a['publishedAt'] >= '2018-01-04T11:30:00Z'] works and produces a list with 20 dictionaries, all with unique titles and urls, so I don't know why a set would be required here.
– Martijn Pieters♦
Jan 4 at 17:29
edited Jan 5 at 16:30 by Martijn Pieters♦
asked Jan 4 at 17:14 by ThePassenger
1 Answer
You are filtering dictionaries and then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap the {...} curly braces for [...] square brackets:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better off using the pandas.io.json.json_normalize() function; it can produce a dataframe for you from the lists-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe, and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date column, derived from the publishedAt information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
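As a minimal, self-contained sketch of those steps (the sample articles below are made up, and pd.DataFrame is applied directly to an article list rather than an HTTP response, so it runs without a network call):

```python
import pandas as pd

# Hypothetical sample mimicking the 'articles' list in the API response
articles = [
    {"source": {"id": "cnn", "name": "CNN"},
     "title": "Example headline",
     "publishedAt": "2019-01-05T12:00:00Z"},
    {"source": {"id": None, "name": "Reuters"},
     "title": "Another headline",
     "publishedAt": "2019-01-05T08:30:00Z"},
]

df = pd.DataFrame(articles)  # the 'source' column still holds dicts

# make the datetime column a native type, and add a date-only column
df["publishedAt"] = pd.to_datetime(df["publishedAt"])
df["date"] = df["publishedAt"].dt.date

# expand the source dicts into source_id / source_name columns
source_columns = df["source"].apply(pd.Series).add_prefix("source_")
df = pd.concat([df.drop(["source"], axis=1), source_columns], axis=1)

print(sorted(df.columns))
```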
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author, content, description, publishedAt, date, title, url, urlToImage, plus the source_id and source_name columns from the source mapping.
I note that the API already allows you to filter by date; I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
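For example, a request for server-side date filtering could be built like this; the from/to and sortBy parameter names follow the News API documentation for the everything endpoint, and the API key is a placeholder you would need to replace:

```python
from urllib.parse import urlencode

# Build the query string locally; no request is actually sent here.
params = {
    "q": "news",
    "from": "2019-01-04T11:30:00",   # only articles published after this
    "sortBy": "publishedAt",          # let the API do the sorting too
    "apiKey": "YOUR_API_KEY",         # placeholder
}
url = "https://newsapi.org/v2/everything?" + urlencode(params)
print(url)
# response = requests.get(url)  # the actual call; needs a valid key
```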
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
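The contrast between the failing pivot() and the working pivot_table() can be reproduced with a toy frame (the article titles and sources below are invented):

```python
import pandas as pd

# Two same-day articles from one source: exactly the duplicate
# (date, source) pair that makes a plain pivot() impossible
df = pd.DataFrame({
    "date": ["2019-01-05", "2019-01-05", "2019-01-05"],
    "source_name": ["CNN", "CNN", "Reuters"],
    "title": ["Story A", "Story B", "Story C"],
})

try:
    df.pivot(index="date", columns="source_name", values="title")
except ValueError as exc:
    print(exc)  # duplicate (date, source) entries cannot be reshaped

# pivot_table with aggfunc=list collects the duplicates instead
wide = pd.pivot_table(df, index="date", columns="source_name",
                      values="title", aggfunc=list)
print(wide)
```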
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multiple titles from the same source are grouped.
Many thanks! And how do I filter by date from the API, to save time and bandwidth by having the API give me a smaller dataset? Would I always get 20 results, or could I get more?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation; it lists what parameters are acceptable. The top-headlines path accepts a pageSize parameter, for example. The everything path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You are filtering dictionaries then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap out {...}
curly braces for square braces:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better of with using the pandas.io.json.json_normalize()
function; it can produce a dataframe for you from list-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date
column, derived from the publishAt
information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author
, content
, description
, publishedAt
, date
, title
, url
, urlToImage
and the source_id
and source_name
columns from the source
mapping.
I note that the API allows you to filter by date already, I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multilpe titles from the same source are grouped.
Many thanks ! And how to filter by date from the API to save time and bandwidth by having the API give you a smaller dataset ? Would I always get 20 results or may I get some more ?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation, it lists what parameters are acceptable. Thetop-headlines
path accepts apageSize
parameter, for example. Theeverything
path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
add a comment |
You are filtering dictionaries then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap out {...}
curly braces for square braces:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better of with using the pandas.io.json.json_normalize()
function; it can produce a dataframe for you from list-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date
column, derived from the publishAt
information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author
, content
, description
, publishedAt
, date
, title
, url
, urlToImage
and the source_id
and source_name
columns from the source
mapping.
I note that the API allows you to filter by date already, I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multilpe titles from the same source are grouped.
Many thanks ! And how to filter by date from the API to save time and bandwidth by having the API give you a smaller dataset ? Would I always get 20 results or may I get some more ?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation, it lists what parameters are acceptable. Thetop-headlines
path accepts apageSize
parameter, for example. Theeverything
path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
add a comment |
edited Jan 5 at 18:12
answered Jan 5 at 17:14
Martijn Pieters♦
It's not clear what you are trying to do. Why are you putting the JSON response in a dataframe? Why is a set comprehension used, are you trying to avoid duplicates of some kind?
– Martijn Pieters♦
Jan 4 at 17:23
Put differently, can you clearly illustrate what your expected outcome is here?
– Martijn Pieters♦
Jan 4 at 17:23
The code executes without any errors for me. When I print(response), I get <Response [200]>. I am not sure if that's the expected response, but I can't reproduce the error. @MartijnPieters: What version are you using?
– Sheldore
Jan 4 at 17:24
Ah, I see. The code posted produces an empty set (no articles with that publication date). The traceback doesn't use the same criteria, however: it uses >= to filter, producing a non-empty match, and then you try to put those dictionaries in a set. I was using the line from the traceback.
– Martijn Pieters♦
Jan 4 at 17:27
At any rate, [a for a in response.json()['articles'] if a['publishedAt'] >= '2018-01-04T11:30:00Z'] works and produces a list with 20 dictionaries, all with unique titles and urls, so I don't know why a set would be required here.
– Martijn Pieters♦
Jan 4 at 17:29
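The failure mode and the fix discussed in these comments can be shown with a couple of hypothetical article dicts standing in for the API response (dicts are unhashable, so a set comprehension over them raises TypeError, while a list comprehension works; ISO-8601 timestamps in a uniform format also compare correctly as plain strings):

```python
# Hypothetical stand-ins for entries of response.json()['articles'].
articles = [
    {'title': 'A', 'publishedAt': '2019-01-04T12:00:00Z'},
    {'title': 'B', 'publishedAt': '2018-12-31T09:00:00Z'},
]

cutoff = '2019-01-04T11:30:00Z'

# A set comprehension fails: dicts are mutable, hence unhashable.
try:
    {a for a in articles if a['publishedAt'] >= cutoff}
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'

# A list comprehension needs no hashing, so it works fine.
recent = [a for a in articles if a['publishedAt'] >= cutoff]
print([a['title'] for a in recent])  # ['A']
```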