
Newspaper: Article scraping & curation
======================================

.. image:: https://badge.fury.io/py/newspaper.png
    :target: http://badge.fury.io/py/newspaper
    :alt: Latest version

.. .. image:: https://pypip.in/d/newspaper/badge.png
    :target: https://crate.io/packages/newspaper/
        :alt: Number of PyPI downloads


Inspired by ``requests`` for its simplicity and powered by ``lxml`` for its speed, **newspaper**
is a Python 2 library for extracting & curating articles from the web.

Newspaper wants to change the way people handle article extraction by offering a new, more precise
layer of abstraction. Visit our homepage at `Newspaper Docs`_.

Newspaper utilizes lxml and caching for speed. *Also, everything is in unicode.*

.. code-block:: pycon

    >>> import newspaper

    >>> cnn_paper = newspaper.build('http://cnn.com') # ~15 seconds 

    >>> for article in cnn_paper.articles: # filters urls
    ...     print article.url

    u'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
    u'http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html'
    u'http://www.cnn.com/2013/12/07/us/life-pearl-harbor/'
    ...

    >>> print cnn_paper.size() # number of articles
    3100 

    >>> print cnn_paper.category_urls() 
    [u'http://lifestyle.cnn.com', u'http://cnn.com/world', u'http://tech.cnn.com' ...]

    >>> print cnn_paper.feed_urls() 
    [u'http://rss.cnn.com/rss/cnn_crime.rss', u'http://rss.cnn.com/rss/cnn_tech.rss', ...] 

Note that the above category urls & feeds are cached for speed.

The first step is to ``download()`` an article.    
    
.. code-block:: pycon

    >>> first_article = cnn_paper.articles[0]

    >>> first_article.download()

    >>> print first_article.html # html fetched from download()
    u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
    
    # this article has not been downloaded yet, so its html is empty
    >>> print cnn_paper.articles[7].html 
    u'' 

You may also extract meaningful content from the html, such as authors and body text.
You must call ``download()`` on an article before calling ``parse()``.

.. code-block:: pycon

    >>> first_article.parse()  

    >>> print first_article.text
    u'Three sisters who were imprisoned for possibly...'

    >>> print first_article.top_img  
    u'http://some.cdn.com/3424hfd4565sdfgdg436/'

    >>> print first_article.authors
    [u'Eliott C. McLaughlin', u'Some CoAuthor']
    
    >>> print first_article.title
    u'Police: 3 sisters imprisoned in Tucson home'

Finally, you may extract natural language properties from the text. You must have
called both ``download()`` and ``parse()`` on the article before calling ``nlp()``.

.. code-block:: pycon

    >>> first_article.nlp() # must be on an already parse()'ed article

    >>> print first_article.summary
    u'...imprisoned for possibly a constant barrage...'

    >>> print first_article.keywords
    [u'music', u'Tucson', ... ]

    >>> print cnn_paper.articles[100].nlp() # fails: not downloaded or parsed yet
    Traceback (...
       ...
    ArticleException: You must parse an article before you try to..

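The ordering rules above (``download()`` before ``parse()``, ``parse()`` before ``nlp()``)
can be pictured as a small state machine. The sketch below is purely illustrative and uses no
newspaper code: ``SketchArticle``, ``NotReadyError``, the ``fetch`` argument, and the placeholder
"extraction" logic are hypothetical names invented here, and raising on an early ``parse()`` is
an assumption of the sketch.

.. code-block:: python

    class NotReadyError(Exception):
        """Raised when a step runs before its prerequisite (hypothetical)."""


    class SketchArticle(object):
        """Toy stand-in that only models the download -> parse -> nlp order."""

        def __init__(self, url):
            self.url = url
            self.html = u''      # stays empty until download() runs
            self.text = u''
            self.keywords = []
            self.is_downloaded = False
            self.is_parsed = False

        def download(self, fetch=None):
            # `fetch` is injectable so the sketch never touches the network.
            if fetch is None:
                fetch = lambda url: u'<html><body>placeholder</body></html>'
            self.html = fetch(self.url)
            self.is_downloaded = True

        def parse(self):
            if not self.is_downloaded:
                raise NotReadyError('call download() before parse()')
            # Placeholder extraction: strip the wrapper tags used in this toy.
            body = self.html.replace(u'<html><body>', u'')
            self.text = body.replace(u'</body></html>', u'')
            self.is_parsed = True

        def nlp(self):
            if not self.is_parsed:
                raise NotReadyError('call parse() before nlp()')
            # Placeholder keywords: unique lowercase words, sorted.
            self.keywords = sorted(set(self.text.lower().split()))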

**Downloading articles one at a time is slow.** But spamming a single news source
like cnn.com with many threads or with async I/O will get you rate limited,
and it is also inconsiderate.

We solve this problem by allocating 1-2 threads per news source, which greatly
speeds up downloads while remaining respectful.

.. code-block:: pycon

    >>> import newspaper
    >>> from newspaper import news_pool

    >>> slate_paper = newspaper.build('http://slate.com')
    >>> tc_paper = newspaper.build('http://techcrunch.com')
    >>> espn_paper = newspaper.build('http://espn.com')

    >>> papers = [slate_paper, tc_paper, espn_paper]
    >>> news_pool.set(papers, threads_per_source=2) # (3*2) = 6 threads total
    >>> news_pool.join()

    # At this point, you can safely assume that download() has been
    # called on every single article for all 3 sources.
    
    >>> print slate_paper.articles[10].html
    u'<html> ...' 

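To make the "few threads per source" idea concrete, here is a minimal standard-library sketch.
It is not how ``news_pool`` is implemented. ``download_sources``, the ``sources`` mapping, and
the ``fetch`` callable are hypothetical names introduced only for this illustration.

.. code-block:: python

    import threading


    def download_sources(sources, fetch, threads_per_source=2):
        """Fetch every url with `threads_per_source` threads per source.

        `sources` maps a source name to its list of article urls, and
        `fetch` is any callable that takes a url and returns its html.
        Both are assumptions of this sketch.
        """
        results = {}
        lock = threading.Lock()

        def worker(urls):
            for url in urls:
                html = fetch(url)
                with lock:           # guard the shared results dict
                    results[url] = html

        threads = []
        for urls in sources.values():
            # Deal this source's urls round-robin, one slice per thread.
            for i in range(threads_per_source):
                chunk = urls[i::threads_per_source]
                if chunk:
                    thread = threading.Thread(target=worker, args=(chunk,))
                    threads.append(thread)

        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()            # block until every download is done
        return results

With three sources and the default setting, this starts at most six threads, matching the
``(3*2)`` arithmetic above, and no single site is hit by more than two threads at once.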

**IMPORTANT**
Unless told not to in the configs via ``is_memoize_articles`` (default ``True``),
newspaper automatically caches all article urls, both for speed and to skip duplicates.

.. code-block:: pycon

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    1030 # cbs has 1030 articles as of now

    >>> cbs_paper = newspaper.build('http://cbs.com')
    >>> cbs_paper.size()
    60 # since we last ran build(), cbs published 60 new articles
       # we ignore old articles

    # If you'd like to opt out of memoization, modify the Config object.

    >>> from newspaper import Config

    >>> config = Config()
    >>> config.is_memoize_articles = False
    >>> cbs_paper = newspaper.build('http://cbs.com', config)
    >>> cbs_paper.size()
    1030
    >>> cbs_paper = newspaper.build('http://cbs.com', config)
    >>> cbs_paper.size()
    1030

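The memoization behavior shown above can be sketched with a per-source set of already-seen urls.
This is an assumption-laden illustration rather than newspaper's real cache: ``UrlMemo`` and
``filter_new`` are hypothetical names, and the sketch keeps everything in memory only.

.. code-block:: python

    class UrlMemo(object):
        """Remembers which article urls were already seen for each source."""

        def __init__(self):
            self._seen = {}   # source url -> set of article urls seen so far

        def filter_new(self, source_url, article_urls, memoize=True):
            """Return unseen urls in order; return all if memoize is False."""
            if not memoize:
                return list(article_urls)
            seen = self._seen.setdefault(source_url, set())
            fresh = []
            for url in article_urls:
                if url not in seen:
                    seen.add(url)     # also drops repeats within one batch
                    fresh.append(url)
            return fresh

Calling ``filter_new`` twice with the same urls returns them only the first time, which mirrors
the drop from 1030 to 60 articles in the session above.
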
Some other useful functionality at the news-source level:

.. code-block:: pycon

    >>> cnn_paper = newspaper.build('http://cnn.com')
    >>> print cnn_paper.brand
    u'cnn'

    >>> print cnn_paper.description
    u'CNN.com delivers the latest breaking news and information on the latest...'

    >>> newspaper.hot()[:5] # top google trending terms
    ['Ned Vizzini', 'Brian Boitano', 'Crossword Inventor', 'Alex and Sierra', 'Claire Davis']

    >>> newspaper.popular_urls() 
    ['http://slate.com', 'http://cnn.com', 'http://huffingtonpost.com', ...]


You may also customize how newspaper extracts articles at a much deeper level
via config objects. View newspaper/configuration.py for details.

.. code-block:: pycon

    >>> import newspaper
    >>> from newspaper import Config

    >>> config = Config()
    >>> config.verbose = True
    >>> config.MAX_KEYWORDS = 10
    >>> config.MAX_AUTHORS = 2
    >>> config.browser_user_agent = 'new dude'
    >>> config.number_threads = 2
    >>> config.request_timeout = 5

    >>> espn = newspaper.build('http://espn.com', config=config)
    
    # However, config objects are still under heavy development!


Config objects are highly flexible: you can pass them into
``newspaper.build(..)`` methods, ``Source(..)`` constructors, and also
``Article(..)`` constructors.

A Config object passed into a Source is also passed
into all of that Source's child Articles.

.. code-block:: pycon

    >>> import newspaper
    >>> from newspaper import Config, Article, Source
    >>> config = Config()

    >>> a = Article(url='http://..', config=config)
    >>> a.download()
    >>> a.parse()

    >>> s = Source('http://latimes.com', config)
    >>> s.build()

TADA :D

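The way a Config handed to a Source reaches that Source's Articles can be pictured with three
toy classes. Everything here is hypothetical and written only to show the hand-off:
``SketchConfig``, ``SketchSource``, ``SketchChildArticle``, ``add_article``, and the default
values are not taken from newspaper's code.

.. code-block:: python

    class SketchConfig(object):
        """Toy settings holder; the default values are made up."""

        def __init__(self):
            self.request_timeout = 7
            self.verbose = False


    class SketchChildArticle(object):
        def __init__(self, url, config=None):
            self.url = url
            # Fall back to a fresh default config when none is given.
            self.config = config or SketchConfig()


    class SketchSource(object):
        def __init__(self, url, config=None):
            self.url = url
            self.config = config or SketchConfig()
            self.articles = []

        def add_article(self, article_url):
            # Hand the *same* config object down to every child article.
            article = SketchChildArticle(article_url, config=self.config)
            self.articles.append(article)
            return article

Because each child holds a reference to the parent's config object, a setting changed on that
object is visible from every article created through ``add_article``.
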
Alternatively, you may use newspaper's lower-level Article API.

.. code-block:: pycon

    >>> from newspaper import Article

    >>> article = Article('http://cnn.com/2013/11/27/travel/weather-thanksgiving/index.html')
    >>> article.download()

    >>> print article.html 
    u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
    
    >>> article.parse()

    >>> print article.text
    u'The purpose of this article is to introduce...'

    >>> print article.authors
    [u'Martha Stewart', u'Bob Smith']

    >>> print article.top_img
    u'http://some.cdn.com/3424hfd4565sdfgdg436/'

    >>> print article.title
    u'Thanksgiving Weather Guide Travel ...'

    >>> article.nlp()
           
    >>> print article.summary
    u'...and so that's how a Thanksgiving meal is cooked...'

    >>> print article.keywords
    [u'Thanksgiving', u'holiday', u'Walmart', ...]


``nlp()`` is expensive, as is ``parse()``, so make sure you actually need them before calling them on
all of your articles! In some cases, if you just need urls, even ``download()`` is not necessary.

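One way to act on that advice is to wrap the three steps in a helper that runs only what the
caller asks for. The sketch below assumes an article object exposing ``url``, ``download()``,
``parse()``, ``nlp()``, ``text``, and ``keywords`` as in the examples above. ``collect_fields``
and its flags are hypothetical names, not part of newspaper.

.. code-block:: python

    def collect_fields(article, need_text=False, need_keywords=False):
        """Run only the required steps on `article` and return a dict."""
        info = {'url': article.url}      # the url needs no download at all
        if need_text or need_keywords:
            article.download()
            article.parse()              # parse() requires download()
            if need_text:
                info['text'] = article.text
        if need_keywords:
            article.nlp()                # nlp() requires parse()
            info['keywords'] = article.keywords
        return info

Called with both flags left at ``False``, the helper touches neither the network nor the parser,
so collecting urls from a large source stays cheap.
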
Newspaper stands on the giant shoulders of `lxml`_, `nltk`_, and `requests`_. Newspaper also uses chunks of
`goose`_'s code internally. 

.. _`lxml`: http://lxml.de/
.. _`nltk`: http://nltk.org/
.. _`requests`: http://docs.python-requests.org/en/latest/
.. _`goose`: https://github.com/grangier/python-goose

Features
--------

- News url identification
- Text extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Top Image & All image extraction from html
- Top Google trending terms 
- News article extraction from news domain
- Quick html downloads via multithreading

Get it now
----------
::

    $ pip install newspaper

    # IMPORTANT
    # If you know for sure that you will use the natural language features, nlp(), you must
    # download some separate nltk corpora with the command below. It may take a while!

    $ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7

Todo List
---------
X means done

- X Fully integrate the goose library into our own article class
- X Add multiple article download (concurrently with gevent or multithread) example
- Add a "follow_robots.txt" option in the config object.
- Bake in the CSSSelect and BeautifulSoup dependencies
- Add in an examples section on README
- Make the documentation much better, still learning how to use sphinx docs!


.. Examples TODO
.. -------------

.. See more examples at the `Quickstart guide`_.


Documentation
-------------

Full documentation is available at `Newspaper Docs`_.

Requirements
------------

- Python >= 2.6 and <= 2.7*

License
-------

MIT licensed. View our LICENSE file in the top level directory.

.. _`Quickstart guide`: https://newspaper.readthedocs.org/en/latest/
.. _`Newspaper Docs`: http://newspaper.readthedocs.org