newspaper

mirror of https://github.com/codelucas/newspaper.git synced 2025-12-23 05:36:50 +00:00

Author	SHA1	Message	Date
Lucas Ou-Yang	38fdfc0d48	fixed docs, added link to examples on how to add non-latin languages	2014-06-17 03:32:02 -07:00
Lucas Ou-Yang	d5d532cbcd	fix spacing in docs	2014-06-17 03:27:17 -07:00
Lucas Ou-Yang	f961091679	fix docs	2014-06-17 03:18:48 -07:00
Lucas Ou-Yang	d0081bee68	update and finalize docs	2014-06-17 02:58:20 -07:00
Lucas Ou-Yang	2a7488c977	remove history	2014-06-17 01:02:54 -07:00
Lucas Ou-Yang	d67be411c3	commented out attempted hints feature, update version 0.0.7	2014-06-17 01:02:08 -07:00
Lucas Ou-Yang	ae84fa2c21	fix bug in meta img extraction	2014-06-16 03:12:13 -07:00
Lucas Ou-Yang	fe7c9cf8bf	refactor url extraction	2014-06-16 03:01:32 -07:00
Lucas Ou-Yang	69641d0435	refactor feed extraction	2014-06-16 02:43:57 -07:00
Lucas Ou-Yang	7019cd2f9b	refactor img extraction	2014-06-16 02:10:01 -07:00
Lucas Ou-Yang	7c8cf2dcbb	refactor meta img extraction	2014-06-16 01:49:35 -07:00
Lucas Ou-Yang	f1d489e84f	refactor move meta_type extraction	2014-06-16 01:09:04 -07:00
Lucas Ou-Yang	55efa5f4e1	added stopwords and support for indonesian and vietnamese languages	2014-06-15 02:32:00 -07:00
Lucas Ou-Yang	dc6b2f42cb	fix remove all span tags below p tags	2014-06-15 01:55:50 -07:00
Lucas Ou-Yang	fd21e9d7cd	added clean_body_classes to ensure body tag is not removed, this is from goose upstream	2014-06-15 01:52:45 -07:00
Lucas Ou	6d3a486063	Merge pull request #54 from jacquerie/master Fix typo in code and documentation	2014-06-01 02:26:45 -07:00
Jacopo Notarstefano	e92c5319f9	Fix small typo in documentation about newspaper.languages()	2014-06-01 11:06:38 +02:00
Jacopo Notarstefano	b989781ca4	Fix small typo when calling newspaper.languages()	2014-06-01 10:59:09 +02:00
Lucas Ou	1fcb9934c2	Merge pull request #53 from awesomejie/jiepatch removed quotes of 'filename' in utils\__init__.py	2014-05-31 11:09:00 -07:00
awesomejie	ac1c306efe	removed quotes of 'filename' in utils\__init__.py	2014-05-31 08:24:45 -04:00
Lucas Ou	856346cf73	Merge pull request #49 from jeffnappi/patch-1 Fixed long-form article issue w/ calculate_best_node	2014-05-01 10:27:00 -07:00
Jeffrey Nappi	ba7d2726d1	Fixed long-form article issue w/ calculate_best_node See https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L227	2014-04-29 16:23:46 -07:00
Lucas Ou-Yang	e0718be363	added filetype extractor in urls.py, more modular, fixed bug where our image processor would error out on .ico images (very small images), added more clear ubuntu install instructions	2014-03-28 01:00:03 -07:00
Lucas Ou-Yang	f5b5cebcfe	Merge pull request #35 from otemnov/original Use first image from article top_node	2014-02-06 21:46:31 -08:00
Oleg Temnov	d77c3a3434	Fix for stopwords - wrongly checks bytestring vs unicode set Example: http://top.rbc.ru/economics/04/02/2014/903127.shtml	2014-02-05 04:41:57 +04:00
Oleg Temnov	b0ec588877	Resolve relative image urls	2014-02-03 12:03:10 +04:00
Oleg Temnov	e278edf704	Include meta image url to images collection	2014-02-03 12:03:10 +04:00
Oleg Temnov	1f134ac632	Initialize images and top_mage	2014-02-03 12:03:10 +04:00
Oleg Temnov	0997aeaa81	Check image requirements before set	2014-02-03 09:41:41 +04:00
Oleg Temnov	6aa62d096c	Use first image from article top_node	2014-02-03 09:41:24 +04:00
Lucas Ou-Yang	c7fce705c6	add new languages to readme	2014-02-02 14:10:34 -08:00
Lucas Ou-Yang	1eaa45217c	added languages into api, fixed bug in readme link	2014-02-02 14:07:07 -08:00
Lucas Ou-Yang	b6455ae64b	completely revamped docs, added 6 new languages	2014-02-02 14:04:23 -08:00
Lucas Ou-Yang	3219d39280	added arabic urls to testing suite, fixed broken link to boilerpipe	2014-02-02 13:05:49 -08:00
Lucas Ou-Yang	3eaa8d135e	added many more file types to allowed in urls.py, added more related projects	2014-02-02 13:04:35 -08:00
Lucas Ou-Yang	491a91cbf3	Merge pull request #33 from cantino/master Add a section with links to related projects	2014-02-02 12:37:50 -08:00
Andrew Cantino	4871668cde	Fix formatting	2014-02-01 15:05:31 -08:00
Andrew Cantino	1e2ec8ef9b	Add a section with references to related projects	2014-02-01 15:03:12 -08:00
Lucas Ou-Yang	f9aafc4a0c	updated contributors, added goose licensing info	2014-01-31 12:01:18 -08:00
Lucas Ou-Yang	50530561e2	change type of ratio into double from int: dividing two integers in python yields an integer, we need a double	2014-01-29 16:17:49 -08:00
Lucas Ou-Yang	b78529f913	Merge pull request #30 from otemnov/original Original	2014-01-29 16:14:16 -08:00
Oleg Temnov	393601dc4a	Configuration for ignore excessively long/wide images ratio	2014-01-30 03:40:48 +04:00
Oleg Temnov	b94e1347bc	Remove duplicate image urls	2014-01-30 03:40:48 +04:00
Lucas Ou-Yang	42d113fb28	removed auto test-run	2014-01-27 14:32:32 -08:00
Lucas Ou-Yang	89179e0cb7	remove un-needed tests	2014-01-27 14:31:33 -08:00
Lucas Ou-Yang	9c4b9403a9	added slight ease to testing reddit img extract	2014-01-27 14:30:21 -08:00
Lucas Ou-Yang	adba1d0860	Merge pull request #29 from otemnov/original Fix reddit top image	2014-01-27 14:23:23 -08:00
Oleg Temnov	84f41fbd38	fix reddit top image	2014-01-28 01:46:52 +04:00
Lucas Ou-Yang	f0d5fae5b2	Merge pull request #28 from voidfiles/master Extract Meta Tags in structured way	2014-01-26 16:39:56 -08:00
Alex Kessinger	0c117721fe	Extract Meta Tags in structured way	2014-01-26 16:28:34 -08:00

There is a lot of meta data that people put in meta tags. Especially when you realize that Open Graph, and Twitter Cards are very popular. Therefore I suggest there should be a way to extract the tags from an article. Here is an example:

```sh
import newspaper
article = newspaper.build_article('http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos')
article.download()
article.parse()
print pprint.pformat(dict(article.meta_data))
{'article': {'publisher': 'https://www.facebook.com/BuzzFeed',
             'section': 'DIY',
             'tag': 'Tattoos'},
 'og': {'description': 'All is fair in love and ink.',
        'image': {'height': 625,
                  'url': 'http://s3-ak.buzzfeed.com/static/2014-01/enhanced/webdr06/22/16/enhanced-buzz-8304-1390426280-19.jpg',
                  'width': 625},
        'site_name': 'BuzzFeed',
        'title': '23 Epic Literary Love Tattoos',
        'type': 'article',
        'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'},
 'twitter': {'app': {'id': {'googleplay': 'com.buzzfeed.android',
                            'ipad': 'id352969997',
                            'iphone': 'id352969997'},
                     'url': {'googleplay': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}},
             'card': 'summary_large_image',
             'creator': '@tabooradley',
             'description': 'All is fair in love and ink.',
             'image': 'http://s3-ak.buzzfeed.com/static/2014-01/campaign_images/webdr06/26/14/23-epic-literary-love-tattoos-1-1410-1390763440-3_big.jpg',
             'site': '@buzzfeed',
             'title': '23 Epic Literary Love Tattoos',
             'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}}
```

350 commits