350 commits

Author SHA1 Message Date
codelucas
f07c83230f [bugfix] UnicodeDammit errors when given NoneType 2015-01-22 06:17:31 -08:00
codelucas
bbcae32715 Remove /packages, feedparser, tldextract are now installed via pip 2015-01-22 06:13:40 -08:00
codelucas
d7c28ca09c Fix tons of encoding bugs, refactor /utils, adhere py-2 branch to work for all test cases 2015-01-22 06:02:17 -08:00
codelucas
7a6afc27e3 Improve full-text extraction #1
post_cleanup more lenient, `<li>` => newlines, less strict outputformatting, remove trailing media after article
2015-01-16 15:55:26 -08:00
codelucas
7c6ce8deb0 [bugfix] UnicodeDammit in python2 returns None and prints a warning if the input is already Unicode 2015-01-16 15:32:07 -08:00
codelucas
3a7ee562a7 Deprecate parser_class config option
After integrating UnicodeDammit there is no need to let users customize parsers to BeautifulSoup over lxml, lxml is faster and UnicodeDammit gives us the encoding recognition win from BeautifulSoup
2015-01-16 14:34:59 -08:00
codelucas
a50b8aa122 Merge branch 'python-2-head' of https://github.com/codelucas/newspaper into python-2-head
Conflicts:
	newspaper/parsers.py
	newspaper/urls.py
	tests/unit_tests.py
2015-01-16 14:29:18 -08:00
codelucas
97d4e30751 Integrate UnicodeDammit 2015-01-16 14:04:42 -08:00
codelucas
8ab1699095 Modify article.download(..) to take HTML for mocking 2015-01-16 13:48:25 -08:00
codelucas
f759da0545 Don't filter url GET params, it may identify the article
e.g. http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp?rubrik=36094&key=standard_document_53891717
2015-01-16 13:30:51 -08:00
codelucas
0fd1a1e73b BeautifulSoup 3 => BeautifulSoup 4, remove responses from requirements.txt 2015-01-16 13:27:14 -08:00
codelucas
e8c006b427 Refactor /tests to prep for mocking all HTML 2015-01-16 13:24:34 -08:00
codelucas
f12adfce7b [bugfix] Issue #103, metatag to dict extraction 2014-12-29 18:09:26 -08:00
codelucas
67abe78c15 Update setup.py 2014-12-29 04:06:47 -08:00
codelucas
8a9515e3ac Handle exception logging using tracebacks 2014-12-29 04:03:08 -08:00
codelucas
7f793db882 Update version to 0.0.9.1 2014-12-29 03:53:38 -08:00
codelucas
2a4912fd49 Don't filter url GET params, it may ID the article
Also includes unit test updates. Example article: http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp?rubrik=36094&key=standard_document_53891717
2014-12-29 03:46:20 -08:00
codelucas
d8f836de95 [bugfix] meta/og "top image" extractions should not be filtered by size/dimensions as the server specifies that it's the "top image" 2014-12-29 03:03:04 -08:00
codelucas
3030608014 [bugfix] spelling correction 2014-12-29 02:38:26 -08:00
codelucas
90ce10904a Update to version 0.0.9 2014-12-29 02:35:34 -08:00
codelucas
e78da26408 [bugfix] broken unittest for summary extraction 2014-12-29 02:09:36 -08:00
codelucas
7fb90f3067 [bugfix] nltk strongly expects unicode strings now, fix nlp.py 2014-12-29 02:07:37 -08:00
Lucas Ou-Yang
8477d340ea Merge pull request #97 from mhall1/ticket_78_python_2_head
Fixed #78: Remove encoding tag because lxml won't accept it for unicode
2014-12-27 02:10:28 -08:00
Michael Hall
81c2075b44 Fixed #78: Remove encoding tag because lxml won't accept it for unicode objects
lxml apparently doesn't accept unicode objects with both an encoding specified and an encoding tag in the HTML document, because they're paranoid that the two won't match. Stripped encoding tag before calling lxml.html.fromstring()
2014-12-20 05:04:12 -07:00
Lucas Ou-Yang
70f6866b11 Tidy up setup.py to import requirements.txt 2014-12-17 10:55:46 -08:00
Lucas Ou-Yang
8aee0272a7 Tidy up setup.py before push to pip 2014-12-17 02:22:47 -08:00
Lucas Ou-Yang
a862d2d30d Merge branch 'WingGao-wing' 2014-12-16 09:54:21 -08:00
Wing Gao
20ebe1f466 update jieba to 0.35 2014-12-16 00:22:52 +08:00
Lucas Ou-Yang
94be062981 Merge branch 'igor-shevchenko-slash-splitter' 2014-11-25 01:19:10 -08:00
Lucas Ou-Yang
662255664f Merge branch 'slash-splitter' of https://github.com/igor-shevchenko/newspaper into igor-shevchenko-slash-splitter
Conflicts:
	newspaper/extractors.py
2014-11-25 01:18:15 -08:00
Lucas Ou-Yang
4d964d0351 Fix merge conflicts with adding slash splitting 2014-11-25 01:09:33 -08:00
Igor Shevchenko
6f5a372773 Remove space sandwiching for slash splitter 2014-11-25 12:22:15 +05:00
Lucas Ou-Yang
493a8a7052 Merge pull request #87 from deweydu/master
split title with _
2014-11-18 20:40:33 -08:00
Lucas Ou-Yang
fba75837ab Merge pull request #88 from phoenixwizard/master
Parse was breaking in the method clean_article_html when keep_article_ht...
2014-11-17 23:44:45 -08:00
Lucas Ou-Yang
259caca4c4 Merge pull request #83 from iwasrobbed/master
Added link to basic demo
2014-11-17 09:23:06 -08:00
Aram
d0726cb48b Parse was breaking in the method clean_article_html when keep_article_html=True was used. On adding the import the problem was resolved 2014-11-17 19:26:38 +05:30
duw
f5a6c03744 split title with _ 2014-11-17 16:36:02 +08:00
rob phillips
2ec7f24c6d Added link to basic demo 2014-10-29 18:59:17 -04:00
Lucas Ou-Yang
3573217c74 Bugfix in deployment process 2014-10-12 17:11:46 -07:00
Lucas Ou-Yang
b9e4c877e2 Modify install.rst on web site docs 2014-10-12 17:05:42 -07:00
Lucas Ou-Yang
0ad9bae0d1 Merge branch 'master' of https://github.com/codelucas/newspaper 2014-10-12 17:03:32 -07:00
Lucas Ou-Yang
7fdd861018 Merge branch 'master' of https://github.com/codelucas/newspaper 2014-10-12 17:03:20 -07:00
Lucas Ou-Yang
f5f5b8a227 Merge branch 'master' of https://github.com/codelucas/newspaper 2014-10-12 17:01:32 -07:00
Lucas Ou-Yang
3c164f2301 Make README.rst more readible 2014-10-12 17:01:10 -07:00
Lucas Ou-Yang
949916490f Make README.rst more readible 2014-10-12 16:59:32 -07:00
Lucas Ou-Yang
bf36c6ab36 Add development installation instructions to README.rst 2014-10-12 16:51:47 -07:00
Lucas Ou-Yang
00a481af1b Reflect new installation directions in the site user docs 2014-10-12 16:45:59 -07:00
Lucas Ou-Yang
04a8881c95 Update installation instructions in README.rst 2014-10-12 16:40:04 -07:00
Lucas Ou-Yang
b03912e9a4 Remove CHANGES.txt and reformat CONTRIBUTORS.md 2014-10-12 16:16:58 -07:00
Lucas Ou-Yang
eec112770e Reformat setup.py, remove unused code 2014-10-12 16:14:51 -07:00