newspaper/docs/user_guide/quickstart.rst
2014-06-01 11:06:38 +02:00


.. _quickstart:

Quickstart
==========

Eager to get started? This page gives a quick introduction to using newspaper.
It assumes you already have newspaper installed. If you do not,
head over to the :ref:`Installation <install>` section.

Building a news source
----------------------

Source objects are an abstraction of online news media websites like CNN or ESPN.
You can initialize them in two *different* ways.

Building a ``Source`` will extract its categories, feeds, articles, brand, and description for you.

You may also provide configuration parameters such as ``language`` and ``browser_user_agent``. Navigate to the :ref:`advanced <advanced>` section for details.

.. code-block:: pycon

    >>> import newspaper
    >>> cnn_paper = newspaper.build('http://cnn.com')
    >>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

However, if needed, you may also work with the lower level ``Source`` object as described
in the :ref:`advanced <advanced>` section.

Extracting articles
-------------------

Every news source has a set of *recent* articles.

The following examples assume that a news source has been
initialized and built.

.. code-block:: pycon

    >>> for article in cnn_paper.articles:
    ...     print article.url
    u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
    u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
    ...

    >>> print cnn_paper.size()  # cnn has 3100 articles
    3100

Article caching
---------------

By default, newspaper caches all previously extracted articles and **eliminates any
article which it has already extracted**.

This feature exists to prevent duplicate articles and to increase extraction speed.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    2

The return value of ``cbs_paper.size()`` changes from 1030 to 2 because our first
crawl of CBS found 1030 articles, and the second crawl eliminates every
article that has already been crawled.

This means **2** new articles have been published since our first extraction.

You may opt out of this feature with the ``memoize_articles`` parameter.

You may also pass in the lower level ``Config`` objects as covered in the :ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> import newspaper

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030
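
The filtering described in this section can be pictured as a set of already-seen URLs that survives between builds. The sketch below only illustrates that idea and is not newspaper's actual implementation: ``filter_new_urls`` and the in-memory ``seen`` set are hypothetical names, and a real cache would have to persist between runs.

```python
def filter_new_urls(urls, seen, memoize=True):
    """Return the URLs not crawled before and record them in `seen`.

    With memoize=False the cache is bypassed and every URL is returned,
    loosely mirroring what memoize_articles=False does for a Source.
    """
    if not memoize:
        return list(urls)
    fresh = [url for url in urls if url not in seen]
    seen.update(fresh)
    return fresh


seen = set()  # hypothetical stand-in for the persistent article cache
first_crawl = ['http://example.com/a', 'http://example.com/b']
second_crawl = first_crawl + ['http://example.com/c']

new_first = filter_new_urls(first_crawl, seen)    # both URLs are new
new_second = filter_new_urls(second_crawl, seen)  # only the third URL is new
everything = filter_new_urls(second_crawl, seen, memoize=False)
```

The second call returns a single URL for the same reason ``cbs_paper.size()`` drops on a repeated build: everything already seen is filtered out.
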

Extracting Source categories
----------------------------

.. code-block:: pycon

    >>> for category in cnn_paper.category_urls():
    ...     print category
    u'http://lifestyle.cnn.com'
    u'http://cnn.com/world'
    u'http://tech.cnn.com'
    ...

Extracting Source feeds
-----------------------

.. code-block:: pycon

    >>> for feed_url in cnn_paper.feed_urls():
    ...     print feed_url
    u'http://rss.cnn.com/rss/cnn_crime.rss'
    u'http://rss.cnn.com/rss/cnn_tech.rss'
    ...

Extracting Source brand & description
-------------------------------------

.. code-block:: pycon

    >>> print cnn_paper.brand
    u'cnn'

    >>> print cnn_paper.description
    u'CNN.com delivers the latest breaking news and information on the latest...'

News Articles
-------------

Article objects are abstractions of news articles. For example, a news ``Source``
would be CNN while a news ``Article`` would be a specific CNN article.
You may reference an ``Article`` from an existing news ``Source`` or initialize
one by itself.

Referencing it from a ``Source``.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

Initializing an ``Article`` by itself.

.. code-block:: pycon

    >>> from newspaper import Article
    >>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')

Note the similar ``language=`` named parameter above. All the config parameters described for ``Source`` objects also apply to ``Article`` objects! **Source and Article objects have a very similar API**.

There are many ways to build and manipulate articles.

Downloading an Article
----------------------

We begin by calling ``download()`` on an article. If you are interested in how to
quickly download articles concurrently with multi-threading, check out the
:ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print first_article.html
    u'<!DOCTYPE HTML><html itemscope itemtype="http://...'

    >>> print cnn_paper.articles[7].html
    u''  # fail, not downloaded yet
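
As a rough picture of the multi-threaded downloading mentioned in this section, and independent of whatever helper newspaper itself provides (see the advanced section), independent downloads can be fanned out over a thread pool. This sketch assumes Python 3's ``concurrent.futures``; ``fake_download`` and ``download_all`` are hypothetical names, and ``fake_download`` performs no network I/O. With real articles, the mapped callable would call each article's ``download()``.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for Article.download(); performs no network I/O."""
    return '<html>' + url + '</html>'


def download_all(urls, max_workers=4):
    """Run fake_download for every URL on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_download, urls))


urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
pages = download_all(urls)
```

Threads suit this kind of work because downloading is dominated by waiting on the network rather than by computation.
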

Parsing an Article
------------------

You may also extract meaningful content from the html, like authors and body-text.
You **must** have called ``download()`` on an article before calling ``parse()``.

.. code-block:: pycon

    >>> first_article.parse()

    >>> print first_article.text
    u'Three sisters who were imprisoned for possibly...'

    >>> print first_article.top_image
    u'http://some.cdn.com/3424hfd4565sdfgdg436/'

    >>> print first_article.authors
    [u'Eliott C. McLaughlin', u'Some CoAuthor']

    >>> print first_article.title
    u'Police: 3 sisters imprisoned in Tucson home'

    >>> print first_article.images
    ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

    >>> print first_article.movies
    ['url_to_youtube_link_1', ...]  # youtube, vimeo, etc

Performing NLP on an Article
----------------------------

Finally, you may extract natural language properties from the text.
You **must** have called both ``download()`` and ``parse()`` on the article
before calling ``nlp()``.

**As of the current build, nlp() features only work on western languages.**

.. code-block:: pycon

    >>> first_article.nlp()

    >>> print first_article.summary
    u'...imprisoned for possibly a constant barrage...'

    >>> print first_article.keywords
    [u'music', u'Tucson', ... ]

    >>> cnn_paper.articles[100].nlp()  # fail, not downloaded or parsed yet
    Traceback (...
    ArticleException: You must parse an article before you try to..

Both ``nlp()`` and ``parse()`` are expensive, so make sure you actually need them before calling them on
all of your articles! In some cases, if you just need urls, even ``download()`` is not necessary.
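
The required order of calls can be modeled with two flags. The class below is a toy written for this page, not newspaper's ``Article``: ``ToyArticle``, its canned HTML, its keyword rule, and the use of ``RuntimeError`` are all assumptions made for illustration. It shows why ``url`` is usable straight away while ``nlp()`` fails until ``download()`` and ``parse()`` have run.

```python
class ToyArticle(object):
    """Hypothetical model of the download -> parse -> nlp ordering."""

    def __init__(self, url):
        self.url = url  # available immediately, no download needed
        self.html = ''
        self.text = ''
        self.keywords = []
        self.is_downloaded = False
        self.is_parsed = False

    def download(self):
        # A real article would fetch self.url here; this fakes the response.
        self.html = '<p>Three sisters were found in a Tucson home</p>'
        self.is_downloaded = True

    def parse(self):
        if not self.is_downloaded:
            raise RuntimeError('You must download() before parse()')
        self.text = self.html.replace('<p>', '').replace('</p>', '')
        self.is_parsed = True

    def nlp(self):
        if not self.is_parsed:
            raise RuntimeError('You must parse() before nlp()')
        # Toy keyword extraction: keep words longer than five characters.
        self.keywords = [w for w in self.text.split() if len(w) > 5]


article = ToyArticle('http://example.com/story')
try:
    article.nlp()  # too early: nothing has been downloaded or parsed
    failed_early = False
except RuntimeError:
    failed_early = True

article.download()
article.parse()
article.nlp()
```
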

Easter Eggs
-----------

Here are random but hopefully useful features! ``hot()`` returns a list of the top
trending terms on Google using a public API. ``popular_urls()`` returns a list
of popular news source URLs, in case you need help choosing a news source!
``languages()`` prints the languages that are available.

.. code-block:: pycon

    >>> import newspaper

    >>> newspaper.hot()
    ['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]

    >>> newspaper.popular_urls()
    ['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]

    >>> newspaper.languages()
    Your available languages are:

    input code      full name

      ar              Arabic
      de              German
      en              English
      es              Spanish
      fr              French
      it              Italian
      ko              Korean
      no              Norwegian
      pt              Portuguese
      sv              Swedish
      zh              Chinese
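
If you want to check a code before passing it as ``language=``, the table printed by ``languages()`` can be mirrored in a small lookup. This is a hand-copied sketch of the output shown on this page, not a call into newspaper: ``SUPPORTED_LANGUAGES`` and ``language_name`` are hypothetical names, and the real list may differ between versions.

```python
# Codes and names copied by hand from the languages() output on this page.
SUPPORTED_LANGUAGES = {
    'ar': 'Arabic', 'de': 'German', 'en': 'English', 'es': 'Spanish',
    'fr': 'French', 'it': 'Italian', 'ko': 'Korean', 'no': 'Norwegian',
    'pt': 'Portuguese', 'sv': 'Swedish', 'zh': 'Chinese',
}


def language_name(code):
    """Return the full name for a two-letter code, or None if it is not listed."""
    return SUPPORTED_LANGUAGES.get(code.lower())
```
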