Filter a dictionary based on the value of its date keys



























I want to import articles from as many sources around the world as possible, starting from a certain date.



import requests
import pandas as pd

url = ('https://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=de9e19b7547e44c4983ad761c104278f')
response = requests.get(url)

response_dataframe = pd.DataFrame(response.json())

articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z'}
print(articles)


But I get:



---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-0f21f2f50907> in <module>
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)

<ipython-input-84-0f21f2f50907> in <setcomp>(.0)
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)

TypeError: unhashable type: 'dict'


So how can I select a range of articles by filtering on these date keys?
The expected output is a dataframe organizing the articles by day and by newspaper.



              The New York Times                                The Washington Post                                The Financial Times  
2007-01-01 . What Sticks from '06. Somalia Orders Islamis... New Ebola Vaccine Gives 100 Percent Protecti...
2007-01-02 . Heart Health: Vitamin Does Not Prevent Death... Flurry of Settlements Over Toxic Mortgages M...
2007-01-03 . Google Answer to Filling Jobs Is an Algorith... Jason Miller Backs Out of White House Commun...
2007-01-04 . Helping Make the Shift From Combat to Commer... Wielding Claims of ‘Fake News,’ Conservative...
2007-01-05 . Rise in Ethanol Raises Concerns About Corn a... When One Party Has the Governor’s Mansion an
...


My Python version is 3.6.6



































  • It's not clear what you are trying to do. Why are you putting the JSON response in a dataframe? Why is a set comprehension used, are you trying to avoid duplicates of some kind?

    – Martijn Pieters
    Jan 4 at 17:23











  • Put differently, can you clearly illustrate what your expected outcome is here?

    – Martijn Pieters
    Jan 4 at 17:23











  • The code executes without any errors for me. When I print(response), I get <Response [200]>. I am not sure if that's the expected response, but I can't reproduce the error. @MartijnPieters: What version are you using?

    – Sheldore
    Jan 4 at 17:24








  • Ah, I see. The code posted produces an empty set (no articles with that publication date). The traceback doesn't use the same criteria, however: it uses >= to filter, producing a non-empty match, and then you try to put those dictionaries in a set. I was using the line from the traceback.

    – Martijn Pieters
    Jan 4 at 17:27








  • At any rate, [a for a in response.json()['articles'] if a['publishedAt'] >= '2018-01-04T11:30:00Z'] works and produces a list with 20 dictionaries, all with unique titles and urls, so I don't know why a set would be required here.

    – Martijn Pieters
    Jan 4 at 17:29


















python python-3.x api date filter






edited Jan 5 at 16:30









Martijn Pieters

asked Jan 4 at 17:14









ThePassenger

1 Answer






You are filtering dictionaries and then trying to put them in a set. Your expected outcome does not require de-duplicating anything, so the easiest way past the error is to use a list comprehension instead; just swap the {...} curly braces for square brackets:



articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']


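The TypeError itself is easy to reproduce: a dict is mutable and defines no hash, so it cannot be a set element. A minimal sketch:

```python
# Putting a dict into a set literal raises TypeError, which is exactly
# what the set comprehension over article dictionaries runs into.
try:
    {dict(a=1)}
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)

print(message)  # unhashable type: 'dict'
```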
However, if you are going to put the data into a dataframe for processing, you would be much better off using the pandas.io.json.json_normalize() function; it can produce a dataframe for you from the list-and-dictionary structures typically loaded from a JSON source.



Start by loading just the article data you want into a dataframe; you can filter and re-arrange from there. The following code loads all data into a single dataframe with a new date column, derived from the publishedAt information:



import pandas as pd
from pandas.io.json import json_normalize

df = json_normalize(response.json(), 'articles')

# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date

# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)


That gives you a complete dataframe with native types and all the article information: the columns author, content, description, publishedAt, date, title, url and urlToImage, plus the source_id and source_name columns from the source mapping.
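With publishedAt as a native datetime column, the original string comparison becomes a proper timestamp comparison. A small sketch with inline sample rows (the data here is hypothetical, standing in for the normalized API response):

```python
import pandas as pd

# Hypothetical rows shaped like the normalized NewsAPI articles
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'publishedAt': pd.to_datetime([
        '2019-01-03T09:00:00Z',
        '2019-01-04T12:00:00Z',
        '2019-01-05T08:30:00Z',
    ]),
})

# Timezone-aware comparison against a cutoff timestamp
cutoff = pd.Timestamp('2019-01-04T11:30:00Z')
recent = df[df['publishedAt'] >= cutoff]
print(recent['title'].tolist())  # ['B', 'C']
```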



Note that the API already lets you filter by date; I'd rely on that instead of filtering locally, since you save time and bandwidth by having the API return a smaller dataset. The API also lets you apply sorting, which is again a good idea.
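For instance, the everything endpoint documents from, to, sortBy and pageSize parameters (parameter names per the NewsAPI documentation at the time; the key is the one from the question), so the request itself can do the filtering. A sketch of building such a request:

```python
from urllib.parse import urlencode

# Let the API filter and sort server-side instead of doing it locally.
# Parameter names follow the NewsAPI 'everything' endpoint documentation.
params = {
    'q': 'news',                    # this endpoint requires a query term
    'from': '2019-01-04T11:30:00',  # only articles published at/after this time
    'sortBy': 'publishedAt',        # newest first
    'pageSize': 100,                # more than the default 20 results
    'apiKey': 'de9e19b7547e44c4983ad761c104278f',
}
url = 'https://newsapi.org/v2/everything?' + urlencode(params)
print(url)
# requests.get(url).json()['articles'] would then already be filtered
```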



To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:



df.pivot(index='date', columns='source_name', values='title')


This fails however, because this format does not have space for more than one title per source per day:



ValueError: Index contains duplicate entries, cannot reshape


In the JSON data served to me, there are multiple CNN and Fox News articles just for today.



You could aggregate multiple titles into lists:



pd.pivot_table(df,
               index='date', columns='source_name', values='title',
               aggfunc=list)


For the default 20 results for 'today' this gives me:



>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...

[1 rows x 18 columns]


Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:



>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...


The above is sorted by date and by source, so multiple titles from the same source are grouped together.






  • Many thanks! And how do I filter by date through the API, to save time and bandwidth by having it return a smaller dataset? Would I always get 20 results, or can I get more?

    – ThePassenger
    Jan 7 at 9:41











  • @ThePassenger: you already found the documentation; it lists what parameters are acceptable. The top-headlines path accepts a pageSize parameter, for example. The everything path lets you query for a range of dates.

    – Martijn Pieters
    Jan 7 at 16:33


























1












1








1







You are filtering dictionaries and then trying to put them in a set, but dictionaries are not hashable, hence the TypeError. Your expected outcome does not require de-duplicating anything, so the easiest path away from the error is to use a list comprehension instead; just swap the {...} curly braces for [...] square brackets:



articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']


However, if you are going to put the data into a dataframe for processing, you would be much better off using the pandas.io.json.json_normalize() function; it produces a flat dataframe from the list-and-dictionary structures typically loaded from a JSON source.



Start by loading just the article data you want into a dataframe; you can filter and re-arrange from there. The following code loads all data into a single dataframe with a new date column, derived from the publishedAt information:



import pandas as pd
from pandas.io.json import json_normalize

df = json_normalize(response.json(), 'articles')

# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date

# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)


That gives you a single dataframe with all the article information as native types, with the columns author, content, description, publishedAt, date, title, url and urlToImage, plus the source_id and source_name columns expanded from the source mapping.
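With publishedAt converted to a native datetime column, filtering by date becomes a straightforward comparison against a timezone-aware Timestamp. A minimal sketch, using synthetic data in place of the dataframe json_normalize() would produce:

```python
import pandas as pd

# synthetic stand-in for the dataframe json_normalize() would produce
df = pd.DataFrame({
    'title': ['older article', 'newer article'],
    'publishedAt': ['2019-01-03T09:00:00Z', '2019-01-05T12:00:00Z'],
})
df['publishedAt'] = pd.to_datetime(df['publishedAt'])  # tz-aware (UTC) from the 'Z' suffix

# compare against a tz-aware Timestamp so the comparison is well-defined
cutoff = pd.Timestamp('2019-01-04T11:30:00Z')
recent = df[df['publishedAt'] >= cutoff]
```

Here `recent` keeps only the rows published at or after the cutoff, without any string comparison tricks.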



Note that the API already lets you filter by date; I'd rely on that instead of filtering locally, as you save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
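As a sketch of what an API-side filter could look like: the NewsAPI everything endpoint documents from, to, sortBy and pageSize parameters (check the documentation for the exact names and limits your plan supports; the key below is a placeholder):

```python
from urllib.parse import urlencode

# hypothetical query: let the API do the date filtering and sorting;
# parameter names follow the NewsAPI /v2/everything documentation,
# and the apiKey value is a placeholder
params = {
    'q': 'news',
    'from': '2019-01-04T11:30:00',
    'to': '2019-01-05T11:30:00',
    'sortBy': 'publishedAt',
    'pageSize': 100,
    'apiKey': 'YOUR_API_KEY',
}
url = 'https://newsapi.org/v2/everything?' + urlencode(params)
# response = requests.get(url)  # then json_normalize(response.json(), 'articles')
```

The API then only returns articles inside the requested window, so no local filtering is needed.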



To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:



df.pivot(index='date', columns='source_name', values='title')


This fails, however, because this format has no room for more than one title per source per day:



ValueError: Index contains duplicate entries, cannot reshape


In the JSON data served to me, there are multiple CNN and Fox News articles just for today.



You could aggregate multiple titles into lists:



pd.pivot_table(df,
               index='date', columns='source_name', values='title',
               aggfunc=list)


For the default 20 results for 'today' this gives me:



>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...

[1 rows x 18 columns]


Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:



>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...


The above is sorted by date and by source, so multiple titles from the same source are grouped together.
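If you only need a per-day overview rather than the titles themselves, a groupby count sidesteps the duplicate-entry problem entirely. A sketch with synthetic data standing in for the date / source_name columns:

```python
import pandas as pd

# synthetic stand-in for the date, source_name and title columns
df = pd.DataFrame({
    'date': ['2019-01-05', '2019-01-05', '2019-01-05'],
    'source_name': ['CNN', 'CNN', 'Fox News'],
    'title': ['story a', 'story b', 'story c'],
})

# one row per date, one column per source, cells = number of articles
counts = df.groupby(['date', 'source_name']).size().unstack(fill_value=0)
```

Each cell of `counts` is the number of articles that source published that day, which pivots cleanly even with duplicates.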






share|improve this answer















You are filtering dictionaries then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap out {...} curly braces for square braces:



articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']


However, if you are going to put the data into a dataframe for processing, you would be much better of with using the pandas.io.json.json_normalize() function; it can produce a dataframe for you from list-and-dictionaries structure typically loaded from a JSON source.



Start with loading just the article data you want into a dataframe and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date column, derived from the publishAt information:



import pandas as pd
from pandas.io.json import json_normalize

df = json_normalize(response.json(), 'articles')

# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date

# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)


That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author, content, description, publishedAt, date, title, url, urlToImage and the source_id and source_name columns from the source mapping.



I note that the API allows you to filter by date already, I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.



To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:



df.pivot(index='date', columns='source_name', values='title')


This fails however, because this format does not have space for more than one title per source per day:



ValueError: Index contains duplicate entries, cannot reshape


In the JSON data served to me, there are multiple CNN and Fox News articles just for today.



You could aggregate multiple titles into lists:



pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)


For the default 20 results for 'today' this gives me:



>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...

[1 rows x 18 columns]


Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:



>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...


The above is sorted by date and by source, so multilpe titles from the same source are grouped.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 5 at 18:12
answered Jan 5 at 17:14
Martijn Pieters













  • Many thanks! And how do I filter by date from the API, to save time and bandwidth by having the API give me a smaller dataset? Would I always get 20 results, or can I get more?

    – ThePassenger
    Jan 7 at 9:41

  • @ThePassenger: you already found the documentation; it lists what parameters are acceptable. The top-headlines path accepts a pageSize parameter, for example. The everything path lets you query for a range of dates.

    – Martijn Pieters
    Jan 7 at 16:33


















