.. _quickstart:

Quickstart
==========

Eager to get started? This page gives a good introduction to getting started
with newspaper. It assumes you already have newspaper installed. If you do not,
head over to the :ref:`Installation <install>` section.

Building a news source
----------------------

Source objects are an abstraction of online news media websites like CNN or ESPN.
You can initialize them in two *different* ways: with ``newspaper.build()``, as shown
here, or with the lower level ``Source`` object.

Building a ``Source`` will extract its categories, feeds, articles, brand, and description for you.

You may also provide configuration parameters such as ``language`` and ``browser_user_agent``. Navigate to the :ref:`advanced <advanced>` section for details.

.. code-block:: pycon

    >>> import newspaper
    >>> cnn_paper = newspaper.build('http://cnn.com')

    >>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

However, if needed, you may also play with the lower level ``Source`` object as described
in the :ref:`advanced <advanced>` section.

Extracting articles
-------------------

Every news source has a set of *recent* articles.

The following examples assume that a news source has been
initialized and built.

.. code-block:: pycon

    >>> for article in cnn_paper.articles:
    ...     print(article.url)
    ...
    http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
    http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
    ...

    >>> print(cnn_paper.size())  # cnn has 3100 articles
    3100

Article caching
---------------

By default, newspaper caches all previously extracted articles and **eliminates any
article which it has already extracted**.

This feature exists to prevent duplicate articles and to increase extraction speed.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    2

The return value of ``cbs_paper.size()`` changes from 1030 to 2 because the first
crawl of cbs found 1030 articles, while the second crawl eliminates every
article that has already been crawled.

This means **2** new articles have been published since our first extraction.

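The effect of caching can be pictured with a small standard-library sketch. This is
only an illustration of the idea, not newspaper's implementation: ``drop_seen`` and
the example URLs are hypothetical, and the record of seen URLs lives in memory only.

.. code-block:: python

    def drop_seen(urls, seen):
        """Return the URLs that are not yet in ``seen`` and record them."""
        fresh = []
        for url in urls:
            if url not in seen:
                seen.add(url)      # remember the URL for later crawls
                fresh.append(url)
        return fresh


    seen = set()

    # First crawl: nothing has been seen yet, so every URL is kept.
    first = drop_seen(['http://cbs.com/a', 'http://cbs.com/b'], seen)

    # Second crawl: only the URL that was not seen before survives.
    second = drop_seen(
        ['http://cbs.com/a', 'http://cbs.com/b', 'http://cbs.com/c'], seen)

    print(first)   # ['http://cbs.com/a', 'http://cbs.com/b']
    print(second)  # ['http://cbs.com/c']

The second call returns a single URL for the same reason the second
``cbs_paper.size()`` above returns 2: everything found earlier is filtered out.
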
You may opt out of this feature with the ``memoize_articles`` parameter.

You may also pass in the lower level ``Config`` objects as covered in the :ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> import newspaper

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

Extracting Source categories
----------------------------

.. code-block:: pycon

    >>> for category in cnn_paper.category_urls():
    ...     print(category)
    ...
    http://lifestyle.cnn.com
    http://cnn.com/world
    http://tech.cnn.com
    ...

Extracting Source feeds
-----------------------

.. code-block:: pycon

    >>> for feed_url in cnn_paper.feed_urls():
    ...     print(feed_url)
    ...
    http://rss.cnn.com/rss/cnn_crime.rss
    http://rss.cnn.com/rss/cnn_tech.rss
    ...

Extracting Source brand & description
-------------------------------------

.. code-block:: pycon

    >>> print(cnn_paper.brand)
    cnn

    >>> print(cnn_paper.description)
    CNN.com delivers the latest breaking news and information on the latest...

News Articles
-------------

Article objects are abstractions of news articles. For example, a news ``Source``
would be CNN while a news ``Article`` would be a specific CNN article.
You may reference an ``Article`` from an existing news ``Source`` or initialize
one by itself.

Referencing it from a ``Source``.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

Initializing an ``Article`` by itself.

.. code-block:: pycon

    >>> from newspaper import Article
    >>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')

Note the similar ``language=`` named parameter above. All the config parameters described for ``Source`` objects also apply to ``Article`` objects! **Source and Article objects have a very similar API**.

There are endless possibilities for how we can manipulate and build articles.

Downloading an Article
----------------------

We begin by calling ``download()`` on an article. If you are interested in how to
quickly download articles concurrently with multi-threading, check out the
:ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print(first_article.html)
    <!DOCTYPE HTML><html itemscope itemtype="http://...

    >>> cnn_paper.articles[7].html  # empty, not downloaded yet
    ''

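newspaper's own support for concurrent downloads is covered in the advanced section.
As a generic alternative, a standard-library thread pool can call ``download()`` on
several articles at once. The sketch below is not the library's mechanism:
``download_all`` is a hypothetical helper, and ``StubArticle`` is a stand-in that only
mimics ``download()`` so that the example runs without network access.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    class StubArticle:
        """Hypothetical stand-in for an article; only mimics download()."""

        def __init__(self, url):
            self.url = url
            self.html = ''

        def download(self):
            # A real article would fetch self.url over the network here.
            self.html = '<html>stub for %s</html>' % self.url


    def download_all(articles, max_workers=4):
        """Call download() on every article using a pool of worker threads."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # list() waits for all calls and re-raises any worker exception.
            list(pool.map(lambda article: article.download(), articles))
        return articles


    articles = [StubArticle('http://example.com/%d' % i) for i in range(3)]
    download_all(articles)
    print(all(article.html for article in articles))  # True

Any object with a ``download()`` method can be passed in place of the stub.
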
Parsing an Article
------------------

You may also extract meaningful content from the HTML, such as authors and body text.
You **must** have called ``download()`` on an article before calling ``parse()``.

.. code-block:: pycon

    >>> first_article.parse()

    >>> print(first_article.text)
    Three sisters who were imprisoned for possibly...

    >>> print(first_article.top_image)
    http://some.cdn.com/3424hfd4565sdfgdg436/

    >>> print(first_article.authors)
    ['Eliott C. McLaughlin', 'Some CoAuthor']

    >>> print(first_article.title)
    Police: 3 sisters imprisoned in Tucson home

    >>> print(first_article.images)
    ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

    >>> print(first_article.movies)
    ['url_to_youtube_link_1', ...] # youtube, vimeo, etc

Performing NLP on an Article
----------------------------

Finally, you may extract natural language properties from the text.
You **must** have called both ``download()`` and ``parse()`` on the article
before calling ``nlp()``.

**As of the current build, nlp() features only work on western languages.**

.. code-block:: pycon

    >>> first_article.nlp()

    >>> print(first_article.summary)
    ...imprisoned for possibly a constant barrage...

    >>> print(first_article.keywords)
    ['music', 'Tucson', ... ]

    >>> print(cnn_paper.articles[100].nlp())  # fail, not been downloaded yet
    Traceback (...
    ArticleException: You must parse an article before you try to..

``nlp()`` is expensive, as is ``parse()``. Make sure you actually need them before calling them on
all of your articles! In some cases, if you just need URLs, even ``download()`` is not necessary.

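One way to avoid paying for steps you do not need is a small helper that runs only
the required calls, always in ``download()``, ``parse()``, ``nlp()`` order. This is a
minimal sketch: ``process`` and its ``need_text`` and ``need_nlp`` flags are
hypothetical names, and ``StubArticle`` is a stand-in that merely records which
methods were called so that the example runs on its own.

.. code-block:: python

    class StubArticle:
        """Hypothetical stand-in that records which steps were called."""

        def __init__(self):
            self.calls = []

        def download(self):
            self.calls.append('download')

        def parse(self):
            self.calls.append('parse')

        def nlp(self):
            self.calls.append('nlp')


    def process(article, need_text=False, need_nlp=False):
        """Run only the steps required, in download -> parse -> nlp order."""
        if need_text or need_nlp:
            article.download()
            article.parse()    # parse() requires a prior download()
        if need_nlp:
            article.nlp()      # nlp() requires download() and parse()
        return article


    url_only = process(StubArticle())
    text_only = process(StubArticle(), need_text=True)
    full = process(StubArticle(), need_nlp=True)

    print(url_only.calls)   # []
    print(text_only.calls)  # ['download', 'parse']
    print(full.calls)       # ['download', 'parse', 'nlp']

With both flags left at ``False`` nothing is called at all, which matches the case
where only the article URLs are needed.
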
Easter Eggs
-----------

Here are some random but hopefully useful features! ``hot()`` returns a list of the top
trending terms on Google using a public API. ``popular_urls()`` returns a list
of popular news source URLs, in case you need help choosing a news source!
``languages()`` prints the languages that are available.

.. code-block:: pycon

    >>> import newspaper

    >>> newspaper.hot()
    ['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]

    >>> newspaper.popular_urls()
    ['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]

    >>> newspaper.languages()

    Your available languages are:
    input code      full name

      ar              Arabic
      de              German
      en              English
      es              Spanish
      fr              French
      it              Italian
      ko              Korean
      no              Norwegian
      pt              Portuguese
      sv              Swedish
      zh              Chinese