mirrors/newspaper

mirror of https://github.com/codelucas/newspaper.git synced 2025-12-23 05:36:50 +00:00

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs: https://goo.gl/VX41yK


Lucas Ou-Yang 15bbd8a9db added complete test cases for config setting		2014-01-09 00:28:14 -08:00

Newspaper: Article scraping & curation
======================================

.. image:: https://badge.fury.io/py/newspaper.png
    :target: http://badge.fury.io/py/newspaper
    :alt: Latest version

Inspired by `requests`_ for its **simplicity** and powered by `lxml`_ for its **speed**, *newspaper*
is a Python 2 library for extracting & curating articles from the web.

Newspaper wants to change the way people handle article extraction with a new, more precise
layer of abstraction. Newspaper caches whatever it can for speed. *Also, everything is in Unicode*.

Please refer to `The Documentation`_ for a quickstart tutorial!

A Glance:
---------

.. code-block:: pycon

    >>> import newspaper

    >>> cnn_paper = newspaper.build('http://cnn.com')

    >>> for article in cnn_paper.articles:
    ...     print(article.url)
    ...
    http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
    http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
    ...

    >>> for category in cnn_paper.category_urls():
    ...     print(category)
    ...
    http://lifestyle.cnn.com
    http://cnn.com/world
    http://tech.cnn.com
    ...

.. code-block:: pycon

    >>> article = cnn_paper.articles[0]

.. code-block:: pycon

    >>> article.download()

    >>> article.html
    u'<!DOCTYPE HTML><html itemscope itemtype="http://...'

.. code-block:: pycon

    >>> article.parse()

    >>> article.authors
    [u'Leigh Ann Caldwell', u'John Honway']

    >>> article.text
    u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

.. code-block:: pycon

    >>> article.nlp()

    >>> article.keywords
    ['New Years', 'resolution', ...]

    >>> article.summary
    u'The study shows that 93% of people ...'
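
The download, parse, and nlp steps above can be chained in a small helper. This is a minimal sketch, not part of newspaper: ``summarize_article`` is a hypothetical name, and it assumes only an object exposing the ``download()``, ``parse()``, and ``nlp()`` methods and the attributes shown in the glance above.

```python
def summarize_article(article):
    """Run download -> parse -> nlp on an article and collect the results.

    `summarize_article` is a hypothetical helper, not a newspaper API. It
    relies only on the methods and attributes shown in the glance above.
    """
    article.download()  # fetch the raw HTML
    article.parse()     # extract authors and body text from the HTML
    article.nlp()       # compute keywords and summary (needs the corpora)
    return {
        "url": article.url,
        "authors": article.authors,
        "text": article.text,
        "keywords": article.keywords,
        "summary": article.summary,
    }
```

With the ``cnn_paper`` object from above, ``summarize_article(cnn_paper.articles[0])`` would return a plain dict that is easy to store or serialize.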

Documentation
-------------

Check out `The Documentation`_ for full and detailed guides using newspaper.

Features
--------

- Works in 10+ languages (English, Chinese, German, Arabic, ...)
- Multi-threaded article download framework
- News URL identification
- Text extraction from HTML
- Top image extraction from HTML
- All image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction

Get it now
----------

Installing newspaper is simple with `pip <http://www.pip-installer.org/>`_.
On Ubuntu, however, the install needs a few extra steps.

**If you are not using Ubuntu**, install with the following:

::

    $ pip install newspaper

    $ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7


**If you are using Ubuntu**, install using the following:

::

    $ apt-get install libxml2-dev libxslt-dev

    $ easy_install lxml  # use easy_install here, NOT pip

    $ pip install newspaper

    $ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7


Note that the corpora download step

::

    $ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7


is only needed for the natural language (``nlp()``) features, such as keyword extraction and summarization.

If you are using Ubuntu and still run into gcc compile errors when installing lxml, try installing
``libxslt1-dev`` instead of ``libxslt-dev``.
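
After installing, a quick way to confirm that the packages import cleanly is sketched below. ``missing_modules`` is a hypothetical helper name, not something newspaper ships; ``lxml`` and ``newspaper`` are the import names of the packages installed above.

```python
def missing_modules(names=("lxml", "newspaper")):
    """Return the subset of `names` that cannot be imported.

    Hypothetical helper for illustration only.
    """
    missing = []
    for name in names:
        try:
            __import__(name)  # import by name without binding it
        except ImportError:
            missing.append(name)
    return missing

# Usage: an empty list means both packages were found.
# print(missing_modules())
```

If ``lxml`` shows up in the result on Ubuntu, the development headers mentioned above are the first thing to check.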

Todo List
---------

- Add a "follow_robots.txt" option in the config object.
- Bake in the CSSSelect and BeautifulSoup dependencies
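
As a rough idea of what honoring ``robots.txt`` could involve, here is a minimal sketch built on the standard library's robots parser. It is not an existing newspaper option: ``allowed_by_robots`` is a hypothetical function, and the example rules and URLs used with it are made up for illustration.

```python
try:
    from urllib import robotparser  # Python 3
except ImportError:
    import robotparser  # Python 2


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if `url` may be fetched under the given robots.txt text.

    Hypothetical helper: `robots_txt` is the already-downloaded file content,
    so no network access happens here.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, given the rules ``User-agent: *`` and ``Disallow: /private/``, a URL under ``/private/`` would be rejected while other paths on the same site would be allowed.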

.. _`Quickstart guide`: https://newspaper.readthedocs.org/en/latest/
.. _`The Documentation`: http://newspaper.readthedocs.org
.. _`lxml`: http://lxml.de/
.. _`requests`: http://docs.python-requests.org/en/latest/