codelucas
f07c83230f
[bugfix] UnicodeDammit errors when given NoneType
2015-01-22 06:17:31 -08:00
codelucas
bbcae32715
Remove /packages, feedparser, tldextract are now installed via pip
2015-01-22 06:13:40 -08:00
codelucas
d7c28ca09c
Fix many encoding bugs, refactor /utils, get the py-2 branch passing all test cases
2015-01-22 06:02:17 -08:00
codelucas
7a6afc27e3
Improve full-text extraction #1
post_cleanup more lenient, `<li>` => newlines, less strict output formatting, remove trailing media after article
2015-01-16 15:55:26 -08:00
codelucas
7c6ce8deb0
[bugfix] UnicodeDammit in python2 returns None and prints a warning if the input is already Unicode
2015-01-16 15:32:07 -08:00
codelucas
3a7ee562a7
Deprecate parser_class config option
After integrating UnicodeDammit, there is no need to let users choose BeautifulSoup over lxml as the parser: lxml is faster, and UnicodeDammit gives us the encoding recognition win from BeautifulSoup.
2015-01-16 14:34:59 -08:00
codelucas
a50b8aa122
Merge branch 'python-2-head' of https://github.com/codelucas/newspaper into python-2-head
Conflicts:
newspaper/parsers.py
newspaper/urls.py
tests/unit_tests.py
2015-01-16 14:29:18 -08:00
codelucas
97d4e30751
Integrate UnicodeDammit
2015-01-16 14:04:42 -08:00
codelucas
8ab1699095
Modify article.download(..) to take HTML for mocking
2015-01-16 13:48:25 -08:00
codelucas
f759da0545
Don't filter url GET params, it may identify the article
e.g. http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp?rubrik=36094&key=standard_document_53891717
2015-01-16 13:30:51 -08:00
codelucas
0fd1a1e73b
BeautifulSoup 3 => BeautifulSoup 4, remove responses from requirements.txt
2015-01-16 13:27:14 -08:00
codelucas
e8c006b427
Refactor /tests to prep for mocking all HTML
2015-01-16 13:24:34 -08:00
codelucas
f12adfce7b
[bugfix] Issue #103, metatag to dict extraction
2014-12-29 18:09:26 -08:00
codelucas
67abe78c15
Update setup.py
2014-12-29 04:06:47 -08:00
codelucas
8a9515e3ac
Handle exception logging using tracebacks
2014-12-29 04:03:08 -08:00
codelucas
7f793db882
Update version to 0.0.9.1
2014-12-29 03:53:38 -08:00
codelucas
2a4912fd49
Don't filter url GET params, it may ID the article
Also includes unit test updates. Example article: http://www.hr-online.de/website/rubriken/nachrichten/indexhessen34938.jsp?rubrik=36094&key=standard_document_53891717
2014-12-29 03:46:20 -08:00
codelucas
d8f836de95
[bugfix] meta/og "top image" extractions should not be filtered by size/dimensions as the server specifies that it's the "top image"
2014-12-29 03:03:04 -08:00
codelucas
3030608014
[bugfix] spelling correction
2014-12-29 02:38:26 -08:00
codelucas
90ce10904a
Update to version 0.0.9
2014-12-29 02:35:34 -08:00
codelucas
e78da26408
[bugfix] broken unittest for summary extraction
2014-12-29 02:09:36 -08:00
codelucas
7fb90f3067
[bugfix] nltk strongly expects unicode strings now, fix nlp.py
2014-12-29 02:07:37 -08:00
Lucas Ou-Yang
8477d340ea
Merge pull request #97 from mhall1/ticket_78_python_2_head
Fixed #78: Remove encoding tag because lxml won't accept it for unicode
2014-12-27 02:10:28 -08:00
Michael Hall
81c2075b44
Fixed #78: Remove encoding tag because lxml won't accept it for unicode objects
lxml apparently doesn't accept unicode objects with both an encoding specified and an encoding tag in the HTML document, because its developers are wary that the two won't match. Stripped the encoding tag before calling lxml.html.fromstring().
2014-12-20 05:04:12 -07:00
Lucas Ou-Yang
70f6866b11
Tidy up setup.py to import requirements.txt
2014-12-17 10:55:46 -08:00
Lucas Ou-Yang
8aee0272a7
Tidy up setup.py before push to pip
2014-12-17 02:22:47 -08:00
Lucas Ou-Yang
a862d2d30d
Merge branch 'WingGao-wing'
2014-12-16 09:54:21 -08:00
Wing Gao
20ebe1f466
update jieba to 0.35
2014-12-16 00:22:52 +08:00
Lucas Ou-Yang
94be062981
Merge branch 'igor-shevchenko-slash-splitter'
2014-11-25 01:19:10 -08:00
Lucas Ou-Yang
662255664f
Merge branch 'slash-splitter' of https://github.com/igor-shevchenko/newspaper into igor-shevchenko-slash-splitter
Conflicts:
newspaper/extractors.py
2014-11-25 01:18:15 -08:00
Lucas Ou-Yang
4d964d0351
Fix merge conflicts with adding slash splitting
2014-11-25 01:09:33 -08:00
Igor Shevchenko
6f5a372773
Remove space sandwiching for slash splitter
2014-11-25 12:22:15 +05:00
Lucas Ou-Yang
493a8a7052
Merge pull request #87 from deweydu/master
split title with _
2014-11-18 20:40:33 -08:00
Lucas Ou-Yang
fba75837ab
Merge pull request #88 from phoenixwizard/master
Parse was breaking in the method clean_article_html when keep_article_ht...
2014-11-17 23:44:45 -08:00
Lucas Ou-Yang
259caca4c4
Merge pull request #83 from iwasrobbed/master
Added link to basic demo
2014-11-17 09:23:06 -08:00
Aram
d0726cb48b
Parse was breaking in the method clean_article_html when keep_article_html=True was used. Adding the import resolved the problem.
2014-11-17 19:26:38 +05:30
duw
f5a6c03744
split title with _
2014-11-17 16:36:02 +08:00
rob phillips
2ec7f24c6d
Added link to basic demo
2014-10-29 18:59:17 -04:00
Lucas Ou-Yang
3573217c74
Bugfix in deployment process
2014-10-12 17:11:46 -07:00
Lucas Ou-Yang
b9e4c877e2
Modify install.rst on web site docs
2014-10-12 17:05:42 -07:00
Lucas Ou-Yang
0ad9bae0d1
Merge branch 'master' of https://github.com/codelucas/newspaper
2014-10-12 17:03:32 -07:00
Lucas Ou-Yang
7fdd861018
Merge branch 'master' of https://github.com/codelucas/newspaper
2014-10-12 17:03:20 -07:00
Lucas Ou-Yang
f5f5b8a227
Merge branch 'master' of https://github.com/codelucas/newspaper
2014-10-12 17:01:32 -07:00
Lucas Ou-Yang
3c164f2301
Make README.rst more readable
2014-10-12 17:01:10 -07:00
Lucas Ou-Yang
949916490f
Make README.rst more readable
2014-10-12 16:59:32 -07:00
Lucas Ou-Yang
bf36c6ab36
Add development installation instructions to README.rst
2014-10-12 16:51:47 -07:00
Lucas Ou-Yang
00a481af1b
Reflect new installation directions in the site user docs
2014-10-12 16:45:59 -07:00
Lucas Ou-Yang
04a8881c95
Update installation instructions in README.rst
2014-10-12 16:40:04 -07:00
Lucas Ou-Yang
b03912e9a4
Remove CHANGES.txt and reformat CONTRIBUTORS.md
2014-10-12 16:16:58 -07:00
Lucas Ou-Yang
eec112770e
Reformat setup.py, remove unused code
2014-10-12 16:14:51 -07:00