Filter a dictionary based on the value of its date keys
I want to import articles from as many sources around the world as possible, starting from a certain date.
import requests
import pandas as pd

url = ('https://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=de9e19b7547e44c4983ad761c104278f')
response = requests.get(url)
response_dataframe = pd.DataFrame(response.json())
articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z'}
print(articles)
But I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-0f21f2f50907> in <module>
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
<ipython-input-84-0f21f2f50907> in <setcomp>(.0)
2 response_dataframe['articles'][1]['publishedAt']
3
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
5 print(articles)
TypeError: unhashable type: 'dict'
How can I select a range of articles by filtering on these date keys?
The expected output is a dataframe with the articles sorted by day and by newspaper.
The New York Times The Washington Post The Financial Times
2007-01-01 . What Sticks from '06. Somalia Orders Islamis... New Ebola Vaccine Gives 100 Percent Protecti...
2007-01-02 . Heart Health: Vitamin Does Not Prevent Death... Flurry of Settlements Over Toxic Mortgages M...
2007-01-03 . Google Answer to Filling Jobs Is an Algorith... Jason Miller Backs Out of White House Commun...
2007-01-04 . Helping Make the Shift From Combat to Commer... Wielding Claims of ‘Fake News,’ Conservative...
2007-01-05 . Rise in Ethanol Raises Concerns About Corn a... When One Party Has the Governor’s Mansion an
...
My Python version is 3.6.6
python python-3.x api date filter
It's not clear what you are trying to do. Why are you putting the JSON response in a dataframe? Why is a set comprehension used, are you trying to avoid duplicates of some kind?
– Martijn Pieters♦
Jan 4 at 17:23
Put differently, can you clearly illustrate what your expected outcome is here?
– Martijn Pieters♦
Jan 4 at 17:23
The code executes without any errors for me. When I print(response), I get <Response [200]>. I am not sure if that's the expected response, but I can't reproduce the error. @MartijnPieters: What version are you using?
– Sheldore
Jan 4 at 17:24
Ah, I see. The code posted produces an empty set (no articles with that publication date). The traceback doesn't use the same criteria, however; it uses >= to filter with, producing a non-empty match, and then you try to put those dictionaries in a set. I was using the line from the traceback.
– Martijn Pieters♦
Jan 4 at 17:27
At any rate, [a for a in response.json()['articles'] if a['publishedAt'] >= '2018-01-04T11:30:00Z'] works and produces a list with 20 dictionaries, all with unique titles and urls, so I don't know why a set would be required here.
– Martijn Pieters♦
Jan 4 at 17:29
edited Jan 5 at 16:30 by Martijn Pieters♦
asked Jan 4 at 17:14 by ThePassenger
1 Answer
You are filtering dictionaries and then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap the {...} curly braces for [...] square brackets:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better off using the pandas.io.json.json_normalize() function; it can produce a dataframe for you from the lists-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe, and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date column, derived from the publishedAt information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
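As a minimal, self-contained sketch of those steps (the sample articles below are made up, and pd.DataFrame is applied directly to an article list rather than an HTTP response, so it runs without a network call):

```python
import pandas as pd

# Hypothetical sample mimicking the 'articles' list in the API response
articles = [
    {"source": {"id": "cnn", "name": "CNN"},
     "title": "Example headline",
     "publishedAt": "2019-01-05T12:00:00Z"},
    {"source": {"id": None, "name": "Reuters"},
     "title": "Another headline",
     "publishedAt": "2019-01-05T08:30:00Z"},
]

df = pd.DataFrame(articles)  # the 'source' column still holds dicts

# make the datetime column a native type, and add a date-only column
df["publishedAt"] = pd.to_datetime(df["publishedAt"])
df["date"] = df["publishedAt"].dt.date

# expand the source dicts into source_id / source_name columns
source_columns = df["source"].apply(pd.Series).add_prefix("source_")
df = pd.concat([df.drop(["source"], axis=1), source_columns], axis=1)

print(sorted(df.columns))
```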
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author, content, description, publishedAt, date, title, url, urlToImage, plus the source_id and source_name columns from the source mapping.
I note that the API already allows you to filter by date; I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
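For example, a request for server-side date filtering could be built like this; the from/to and sortBy parameter names follow the News API documentation for the everything endpoint, and the API key is a placeholder you would need to replace:

```python
from urllib.parse import urlencode

# Build the query string locally; no request is actually sent here.
params = {
    "q": "news",
    "from": "2019-01-04T11:30:00",   # only articles published after this
    "sortBy": "publishedAt",          # let the API do the sorting too
    "apiKey": "YOUR_API_KEY",         # placeholder
}
url = "https://newsapi.org/v2/everything?" + urlencode(params)
print(url)
# response = requests.get(url)  # the actual call; needs a valid key
```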
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
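The contrast between the failing pivot() and the working pivot_table() can be reproduced with a toy frame (the article titles and sources below are invented):

```python
import pandas as pd

# Two same-day articles from one source: exactly the duplicate
# (date, source) pair that makes a plain pivot() impossible
df = pd.DataFrame({
    "date": ["2019-01-05", "2019-01-05", "2019-01-05"],
    "source_name": ["CNN", "CNN", "Reuters"],
    "title": ["Story A", "Story B", "Story C"],
})

try:
    df.pivot(index="date", columns="source_name", values="title")
except ValueError as exc:
    print(exc)  # duplicate (date, source) entries cannot be reshaped

# pivot_table with aggfunc=list collects the duplicates instead
wide = pd.pivot_table(df, index="date", columns="source_name",
                      values="title", aggfunc=list)
print(wide)
```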
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multiple titles from the same source are grouped.
Many thanks! And how do I filter by date from the API, to save time and bandwidth by having the API give me a smaller dataset? Would I always get 20 results, or could I get more?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation; it lists what parameters are acceptable. The top-headlines path accepts a pageSize parameter, for example. The everything path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
You are filtering dictionaries then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap out {...}
curly braces for square braces:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better of with using the pandas.io.json.json_normalize()
function; it can produce a dataframe for you from list-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date
column, derived from the publishAt
information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author
, content
, description
, publishedAt
, date
, title
, url
, urlToImage
and the source_id
and source_name
columns from the source
mapping.
I note that the API allows you to filter by date already, I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multilpe titles from the same source are grouped.
Many thanks ! And how to filter by date from the API to save time and bandwidth by having the API give you a smaller dataset ? Would I always get 20 results or may I get some more ?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation, it lists what parameters are acceptable. Thetop-headlines
path accepts apageSize
parameter, for example. Theeverything
path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
add a comment |
You are filtering dictionaries then trying to put them in a set. Your expected outcome does not require you to de-duplicate anything, so the easiest path away from the error is to use a list comprehension instead; just swap out {...}
curly braces for square braces:
articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
However, if you are going to put the data into a dataframe for processing, you would be much better of with using the pandas.io.json.json_normalize()
function; it can produce a dataframe for you from list-and-dictionaries structure typically loaded from a JSON source.
Start with loading just the article data you want into a dataframe and you can filter and re-arrange from there; the following code loads all data into a single dataframe with a new date
column, derived from the publishAt
information:
import pandas as pd
from pandas.io.json import json_normalize
df = json_normalize(response.json(), 'articles')
# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date
# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)
That gives you a dataframe with all the article information, as a complete dataframe with native types, with the columns author
, content
, description
, publishedAt
, date
, title
, url
, urlToImage
and the source_id
and source_name
columns from the source
mapping.
I note that the API allows you to filter by date already, I'd rely on that instead of filtering locally, as you can save time and bandwidth by having the API give you a smaller dataset. The API also lets you apply sorting, again a good idea.
To group the rows by date and source name, you would have to pivot the dataframe; the dates should be the index, the columns the name of the source, and the titles as the values:
df.pivot(index='date', columns='source_name', values='title')
This fails however, because this format does not have space for more than one title per source per day:
ValueError: Index contains duplicate entries, cannot reshape
In the JSON data served to me, there are multiple CNN and Fox News articles just for today.
You could aggregate multiple titles into lists:
pd.pivot_table(df,
index='date', columns='source_name', values='title',
aggfunc=list)
For the default 20 results for 'today' this gives me:
>>> pd.pivot_table(
... df, index='date', columns='source_name', values='title',
... aggfunc=list
... )
source_name Bbc.com ... Youtube.com
date ...
2019-01-05 [Paul Whelan: Russia rules out prisoner swap f... ... [Bears Buzz: Eagles at Bears - Wildcard Round ...
[1 rows x 18 columns]
Personally, I'd just keep the dataframe limited to dates, titles and source names, with a date index:
>>> df[['date', 'source_name', 'title']].set_index('date').sort_values(['date', 'source_name'])
source_name title
date
2019-01-05 Bbc.com Paul Whelan: Russia rules out prisoner swap fo...
2019-01-05 Bloomberg Russia Says FBI Arrested Russian Citizen on Pa...
2019-01-05 CNN Pay raises frozen for Pence, Cabinet members u...
2019-01-05 CNN 16 big questions on Robert Mueller's Russia in...
2019-01-05 Colts.com news What They're Saying: Colts/Texans, Wild C...
2019-01-05 Engadget Pandora iOS update adds offline playback for A...
2019-01-05 Espn.com Roger Federer wins Hopman Cup with Switzerland...
2019-01-05 Fox News Japanese 'Tuna King' pays record $3M for prize...
2019-01-05 Fox News Knicks' Turkish star Enes Kanter to skip Londo...
2019-01-05 Latimes.com Flu toll mounts in California, with 42 deaths ...
2019-01-05 NBC News After the fire: Blazes pose hidden threat to t...
2019-01-05 Newser.com After Backlash, Ellen Not Ditching Support for...
2019-01-05 Npr.org Three Dead After Fight Escalates Into Shooting...
2019-01-05 Reuters French 'yellow vests' rail against unrepentant...
2019-01-05 The Hill Trump: 'I don’t care' that most federal employ...
2019-01-05 The Huffington Post 5 Children Dead After Church Van Crashes On Wa...
2019-01-05 The Verge Apple seeks to end bent iPad Pro controversy w...
2019-01-05 Thisisinsider.com Kanye West surprised Kim Kardashian with a $14...
2019-01-05 USA Today See 'Mean Girls' co-stars Lindsay Lohan and Jo...
2019-01-05 Youtube.com Bears Buzz: Eagles at Bears - Wildcard Round -...
The above is sorted by date and by source, so multilpe titles from the same source are grouped.
Many thanks ! And how to filter by date from the API to save time and bandwidth by having the API give you a smaller dataset ? Would I always get 20 results or may I get some more ?
– ThePassenger
Jan 7 at 9:41
@ThePassenger: you already found the documentation, it lists what parameters are acceptable. Thetop-headlines
path accepts apageSize
parameter, for example. Theeverything
path lets you query for a range of dates.
– Martijn Pieters♦
Jan 7 at 16:33
add a comment |
edited Jan 5 at 18:12
answered Jan 5 at 17:14
Martijn Pieters♦
It's not clear what you are trying to do. Why are you putting the JSON response in a dataframe? Why is a set comprehension used, are you trying to avoid duplicates of some kind?
– Martijn Pieters♦
Jan 4 at 17:23
Put differently, can you clearly illustrate what your expected outcome is here?
– Martijn Pieters♦
Jan 4 at 17:23
The code executes without any errors for me. When I print(response), I get <Response [200]>. I am not sure if that's the expected response, but I can't reproduce the error. @MartijnPieters: What version are you using?
– Sheldore
Jan 4 at 17:24
Ah, I see. The code posted produces an empty set (no articles with that publication date). The traceback doesn't use the same criteria, however: it uses >= to filter, producing a non-empty match, and then you try to put those dictionaries in a set. I was using the line from the traceback.
– Martijn Pieters♦
Jan 4 at 17:27
At any rate, [a for a in response.json()['articles'] if a['publishedAt'] >= '2018-01-04T11:30:00Z'] works and produces a list with 20 dictionaries, all with unique titles and urls, so I don't know why a set would be required here.
– Martijn Pieters♦
Jan 4 at 17:29
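The failure mode and the fix discussed in these comments can be shown with a couple of hypothetical article dicts standing in for the API response (dicts are unhashable, so a set comprehension over them raises TypeError, while a list comprehension works; ISO-8601 timestamps in a uniform format also compare correctly as plain strings):

```python
# Hypothetical stand-ins for entries of response.json()['articles'].
articles = [
    {'title': 'A', 'publishedAt': '2019-01-04T12:00:00Z'},
    {'title': 'B', 'publishedAt': '2018-12-31T09:00:00Z'},
]

cutoff = '2019-01-04T11:30:00Z'

# A set comprehension fails: dicts are mutable, hence unhashable.
try:
    {a for a in articles if a['publishedAt'] >= cutoff}
except TypeError as exc:
    print(exc)  # unhashable type: 'dict'

# A list comprehension needs no hashing, so it works fine.
recent = [a for a in articles if a['publishedAt'] >= cutoff]
print([a['title'] for a in recent])  # ['A']
```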