
.. _quickstart:

Quickstart
==========

Eager to get started? This page introduces the basics of working with newspaper.
It assumes you already have newspaper installed. If you do not,
head over to the :ref:`Installation <install>` section.

Building a news source
----------------------

Source objects are an abstraction of online news media websites like CNN or ESPN.
You can initialize them in two *different* ways.

Building a ``Source`` will extract its categories, feeds, articles, brand, and description for you.

You may also pass configuration parameters such as ``language`` and ``browser_user_agent``. See the :ref:`advanced <advanced>` section for details.

.. code-block:: pycon

    >>> import newspaper
    >>> cnn_paper = newspaper.build('http://cnn.com')

    >>> lemonde_paper = newspaper.build('http://www.lemonde.fr/', language='fr')

However, if needed, you may also play with the lower level ``Source`` object as described
in the :ref:`advanced <advanced>` section.

Extracting articles
-------------------

Every news source has a set of *recent* articles.

The following examples assume that a news source has been
initialized and built.

.. code-block:: pycon

    >>> for article in cnn_paper.articles:
    ...     print(article.url)

    http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
    http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
    ...

    >>> print(cnn_paper.size())  # cnn has 3100 articles
    3100

Article caching
---------------

By default, newspaper caches all previously extracted articles and **eliminates any
article which it has already extracted**.

This feature exists to prevent duplicate articles and to increase extraction speed.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    2

The return value of ``cbs_paper.size()`` changes from 1030 to 2 because the first
crawl of CBS found 1030 articles, while the second crawl eliminated every
article that had already been crawled.

This means **2** new articles have been published since our first extraction.

You may opt out of this feature with the ``memoize_articles`` parameter.

You may also pass in a lower level ``Config`` object, as covered in the :ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> import newspaper

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030

    >>> cbs_paper = newspaper.build('http://cbs.com', memoize_articles=False)
    >>> cbs_paper.size()
    1030
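
The caching behavior described above can be pictured with a small standalone sketch.
It does not use newspaper and is not how newspaper implements its cache; the function
``filter_new_urls`` and the in-memory ``seen`` set are hypothetical names chosen for
illustration, and only the ``memoize_articles`` flag name is borrowed from the text.

```python
def filter_new_urls(crawled_urls, seen_urls, memoize_articles=True):
    """Return the URLs to keep from one crawl.

    seen_urls is a set that persists between crawls and is updated in place.
    """
    if not memoize_articles:
        # Caching disabled: keep every crawled URL and record nothing.
        return list(crawled_urls)
    fresh = []
    for url in crawled_urls:
        if url not in seen_urls:
            seen_urls.add(url)
            fresh.append(url)
    return fresh


seen = set()  # hypothetical in-memory cache; a real cache would persist between runs
first_crawl = ["http://example.com/a", "http://example.com/b"]
second_crawl = first_crawl + ["http://example.com/c"]

print(len(filter_new_urls(first_crawl, seen)))   # 2
print(len(filter_new_urls(second_crawl, seen)))  # 1
```

With ``memoize_articles=False`` the same call returns every URL on every crawl, which
mirrors the unchanged ``size()`` in the example above.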


Extracting Source categories
----------------------------

.. code-block:: pycon

    >>> for category in cnn_paper.category_urls():
    ...     print(category)

    http://lifestyle.cnn.com
    http://cnn.com/world
    http://tech.cnn.com
    ...

Extracting Source feeds
-----------------------

.. code-block:: pycon

    >>> for feed_url in cnn_paper.feed_urls():
    ...     print(feed_url)

    http://rss.cnn.com/rss/cnn_crime.rss
    http://rss.cnn.com/rss/cnn_tech.rss
    ...
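
For readers curious where feed URLs can come from, here is a minimal sketch of one
common technique: scanning a page for ``<link rel="alternate">`` tags that advertise
RSS or Atom feeds. This is an assumption about how feeds might be discovered in
general, not a description of what newspaper does internally. ``FeedLinkParser``,
``find_feed_urls``, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime_type = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel and mime_type in FEED_TYPES and href:
            # Resolve relative links against the page URL.
            self.feed_urls.append(urljoin(self.base_url, href))


def find_feed_urls(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feed_urls


sample_html = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss/tech.rss">
<link rel="stylesheet" href="/site.css">
</head><body></body></html>
"""
print(find_feed_urls(sample_html, "http://news.example.com"))
```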

Extracting Source brand & description
-------------------------------------

.. code-block:: pycon

    >>> print(cnn_paper.brand)
    cnn

    >>> print(cnn_paper.description)
    CNN.com delivers the latest breaking news and information on the latest...

News Articles
-------------

Article objects are abstractions of news articles. For example, a news ``Source``
would be CNN while a news ``Article`` would be a specific CNN article.
You may reference an ``Article`` from an existing news ``Source`` or initialize
one by itself.

Referencing it from a ``Source``.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

Initializing an ``Article`` by itself.

.. code-block:: pycon

    >>> from newspaper import Article
    >>> first_article = Article(url="http://www.lemonde.fr/...", language='fr')


Note the similar ``language=`` named parameter above. All the config parameters described for ``Source`` objects also apply to ``Article`` objects! **Source and Article objects have a very similar API**.

There are endless possibilities on how we can manipulate and build articles.

Downloading an Article
----------------------

We begin by calling ``download()`` on an article. If you are interested in how to
quickly download articles concurrently with multi-threading check out the
:ref:`advanced <advanced>` section.

.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print(first_article.html)
    <!DOCTYPE HTML><html itemscope itemtype="http://...

    >>> cnn_paper.articles[7].html  # empty: this article has not been downloaded yet
    ''
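
The idea of downloading several articles concurrently can be sketched with the
standard library alone. This is a generic thread-pool pattern, not newspaper's own
multi-threading helper (see the :ref:`advanced <advanced>` section for that).
``download_all`` and ``fake_fetch`` are hypothetical names, and ``fake_fetch`` stands
in for a real HTTP request so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return {url: html}; failures map to None."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # One failed download should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(safe_fetch, urls))  # map preserves input order
    return dict(zip(urls, pages))


def fake_fetch(url):
    # Stand-in for a network call; raises for one URL to show error handling.
    if url.endswith("/broken"):
        raise IOError("simulated network failure")
    return "<html>%s</html>" % url


pages = download_all(["http://example.com/a", "http://example.com/broken"], fake_fetch)
print(pages)
```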

Parsing an Article
------------------

You may also extract meaningful content from the HTML, like authors and body text.
You **must** have called ``download()`` on an article before calling ``parse()``.

.. code-block:: pycon

    >>> first_article.parse()

    >>> print(first_article.text)
    Three sisters who were imprisoned for possibly...

    >>> print(first_article.top_image)
    http://some.cdn.com/3424hfd4565sdfgdg436/

    >>> print(first_article.authors)
    ['Eliott C. McLaughlin', 'Some CoAuthor']

    >>> print(first_article.title)
    Police: 3 sisters imprisoned in Tucson home

    >>> print(first_article.images)
    ['url_to_img_1', 'url_to_img_2', 'url_to_img_3', ...]

    >>> print(first_article.movies)
    ['url_to_youtube_link_1', ...] # youtube, vimeo, etc
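
To give a feel for what pulling a title, body text, and images out of raw HTML
involves, here is a deliberately naive sketch built on the standard library's
``html.parser``. It is an illustration only: newspaper's real extraction is far more
involved, and ``SimpleArticleParser`` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class SimpleArticleParser(HTMLParser):
    """Collect the <title>, the text of <p> tags, and <img> sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.images = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            # Text inside nested inline tags (e.g. <b>) lands here too.
            self.paragraphs[-1] += data

    @property
    def text(self):
        return "\n\n".join(p.strip() for p in self.paragraphs if p.strip())


sample_html = (
    "<html><head><title>Police: 3 sisters imprisoned in Tucson home</title></head>"
    "<body><img src='http://example.com/photo.jpg'>"
    "<p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p></body></html>"
)
parser = SimpleArticleParser()
parser.feed(sample_html)
print(parser.title)
print(parser.text)
print(parser.images)
```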


Performing NLP on an Article
----------------------------

Finally, you may extract natural language properties from the text.
You **must** have called both ``download()`` and ``parse()`` on the article
before calling ``nlp()``.

**As of the current build, nlp() features only work on Western languages.**

.. code-block:: pycon

    >>> first_article.nlp()

    >>> print(first_article.summary)
    ...imprisoned for possibly a constant barrage...

    >>> print(first_article.keywords)
    ['music', 'Tucson', ... ]

    >>> cnn_paper.articles[100].nlp()  # fails: not downloaded or parsed yet
    Traceback (...
    ArticleException: You must parse an article before you try to..
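
As a rough intuition for keyword extraction, the sketch below ranks words by how
often they appear after dropping a few common stopwords. This is a simplification
chosen for illustration and should not be read as the algorithm behind ``nlp()``.
``top_keywords``, the stopword list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "for", "to", "were", "who", "was", "is"}


def top_keywords(text, limit=3):
    """Return the most frequent non-stopword terms, ties broken alphabetically."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


sample_text = (
    "Police in Tucson said three sisters were held captive. "
    "The sisters escaped and Tucson police were called."
)
print(top_keywords(sample_text))  # ['police', 'sisters', 'tucson']
```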


``nlp()`` is expensive, as is ``parse()``, so make sure you actually need them before calling them on
all of your articles! In some cases, if you just need URLs, even ``download()`` is not necessary.
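
The ordering rule (``download()``, then ``parse()``, then ``nlp()``) and the point
about skipping expensive steps can be sketched with a toy class. ``ToyArticle`` is a
hypothetical stand-in, not newspaper's ``Article``: it takes its fetch function as an
argument so it runs offline, and it raises ``RuntimeError`` where newspaper raises its
own exception type.

```python
import re


class ToyArticle:
    """Hypothetical stand-in that mimics the download -> parse -> nlp ordering."""

    def __init__(self, url):
        self.url = url  # available immediately, before any download
        self.html = ""
        self.text = ""
        self.keywords = []
        self._downloaded = False
        self._parsed = False

    def download(self, fetch):
        self.html = fetch(self.url)
        self._downloaded = True

    def parse(self):
        if not self._downloaded:
            raise RuntimeError("You must download() an article before parse()")
        # Crude tag stripping, good enough for the illustration.
        self.text = " ".join(re.sub(r"<[^>]+>", " ", self.html).split())
        self._parsed = True

    def nlp(self):
        if not self._parsed:
            raise RuntimeError("You must parse() an article before nlp()")
        self.keywords = sorted(set(self.text.lower().split()))


article = ToyArticle("http://example.com/story")
print(article.url)  # the URL needs no download at all

try:
    article.nlp()  # too early: nothing downloaded or parsed yet
except RuntimeError as error:
    print(error)

article.download(lambda url: "<p>Three sisters</p>")
article.parse()
article.nlp()
print(article.keywords)  # ['sisters', 'three']
```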

Easter Eggs
-----------

Here are some random but hopefully useful features! ``hot()`` returns a list of the top
trending terms on Google using a public API. ``popular_urls()`` returns a list
of popular news source URLs, in case you need help choosing a news source!
``languages()`` prints the language codes that newspaper supports.

.. code-block:: pycon

    >>> import newspaper

    >>> newspaper.hot()
    ['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex & Sierra', ... ]

    >>> newspaper.popular_urls()
    ['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ... ]

    >>> newspaper.languages()

    Your available languages are:
    input code      full name

      ar              Arabic
      de              German
      en              English
      es              Spanish
      fr              French
      it              Italian
      ko              Korean
      no              Norwegian
      pt              Portuguese
      sv              Swedish
      zh              Chinese
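
If you want to validate a language code before passing it as ``language=``, one
option is a small lookup like the sketch below. The table is copied from the output
above and may differ in your installed version. ``check_language`` is a hypothetical
helper, not part of newspaper, and lowercasing the input is an assumption made here
for convenience.

```python
# Codes copied from the languages() listing above; treat as a snapshot.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "ko": "Korean",
    "no": "Norwegian",
    "pt": "Portuguese",
    "sv": "Swedish",
    "zh": "Chinese",
}


def check_language(code):
    """Return the normalized two-letter code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError("Unsupported language code: %r" % code)
    return normalized


print(check_language("FR"))  # fr
```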