Commit graph

350 commits

Author SHA1 Message Date
Lucas Ou-Yang
38fdfc0d48 fixed docs, added link to examples on how to add non-latin languages 2014-06-17 03:32:02 -07:00
Lucas Ou-Yang
d5d532cbcd fix spacing in docs 2014-06-17 03:27:17 -07:00
Lucas Ou-Yang
f961091679 fix docs 2014-06-17 03:18:48 -07:00
Lucas Ou-Yang
d0081bee68 update and finalize docs 2014-06-17 02:58:20 -07:00
Lucas Ou-Yang
2a7488c977 remove history 2014-06-17 01:02:54 -07:00
Lucas Ou-Yang
d67be411c3 commented out attempted hints feature, update version 0.0.7 2014-06-17 01:02:08 -07:00
Lucas Ou-Yang
ae84fa2c21 fix bug in meta img extraction 2014-06-16 03:12:13 -07:00
Lucas Ou-Yang
fe7c9cf8bf refactor url extraction 2014-06-16 03:01:32 -07:00
Lucas Ou-Yang
69641d0435 refactor feed extraction 2014-06-16 02:43:57 -07:00
Lucas Ou-Yang
7019cd2f9b refactor img extraction 2014-06-16 02:10:01 -07:00
Lucas Ou-Yang
7c8cf2dcbb refactor meta img extraction 2014-06-16 01:49:35 -07:00
Lucas Ou-Yang
f1d489e84f refactor move meta_type extraction 2014-06-16 01:09:04 -07:00
Lucas Ou-Yang
55efa5f4e1 added stopwords and support for indonesian and vietnamese languages 2014-06-15 02:32:00 -07:00
Lucas Ou-Yang
dc6b2f42cb fix remove all span tags below p tags 2014-06-15 01:55:50 -07:00
Lucas Ou-Yang
fd21e9d7cd added clean_body_classes to ensure body tag is not removed, this is from goose upstream 2014-06-15 01:52:45 -07:00
Lucas Ou
6d3a486063 Merge pull request #54 from jacquerie/master
Fix typo in code and documentation
2014-06-01 02:26:45 -07:00
Jacopo Notarstefano
e92c5319f9 Fix small typo in documentation about newspaper.languages() 2014-06-01 11:06:38 +02:00
Jacopo Notarstefano
b989781ca4 Fix small typo when calling newspaper.languages() 2014-06-01 10:59:09 +02:00
Lucas Ou
1fcb9934c2 Merge pull request #53 from awesomejie/jiepatch
removed quotes of 'filename' in utils\__init__.py
2014-05-31 11:09:00 -07:00
awesomejie
ac1c306efe removed quotes of 'filename' in utils\__init__.py 2014-05-31 08:24:45 -04:00
Lucas Ou
856346cf73 Merge pull request #49 from jeffnappi/patch-1
Fixed long-form article issue w/ calculate_best_node
2014-05-01 10:27:00 -07:00
Jeffrey Nappi
ba7d2726d1 Fixed long-form article issue w/ calculate_best_node
See https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L227
2014-04-29 16:23:46 -07:00
Lucas Ou-Yang
e0718be363 added filetype extractor in urls.py, more modular, fixed bug where our image processor would error out on .ico images (very small images), added more clear ubuntu install instructions 2014-03-28 01:00:03 -07:00
Lucas Ou-Yang
f5b5cebcfe Merge pull request #35 from otemnov/original
Use first image from article top_node
2014-02-06 21:46:31 -08:00
Oleg Temnov
d77c3a3434 Fix for stopwords - wrongly checks bytestring vs unicode set
Example: http://top.rbc.ru/economics/04/02/2014/903127.shtml
2014-02-05 04:41:57 +04:00
Oleg Temnov
b0ec588877 Resolve relative image urls 2014-02-03 12:03:10 +04:00
Oleg Temnov
e278edf704 Include meta image url to images collection 2014-02-03 12:03:10 +04:00
Oleg Temnov
1f134ac632 Initialize images and top_mage 2014-02-03 12:03:10 +04:00
Oleg Temnov
0997aeaa81 Check image requirements before set 2014-02-03 09:41:41 +04:00
Oleg Temnov
6aa62d096c Use first image from article top_node 2014-02-03 09:41:24 +04:00
Lucas Ou-Yang
c7fce705c6 add new languages to readme 2014-02-02 14:10:34 -08:00
Lucas Ou-Yang
1eaa45217c added languages into api, fixed bug in readme link 2014-02-02 14:07:07 -08:00
Lucas Ou-Yang
b6455ae64b completely revamped docs, added 6 new languages 2014-02-02 14:04:23 -08:00
Lucas Ou-Yang
3219d39280 added arabic urls to testing suite, fixed broken link to boilerpipe 2014-02-02 13:05:49 -08:00
Lucas Ou-Yang
3eaa8d135e added many more file types to allowed in urls.py, added more related projects 2014-02-02 13:04:35 -08:00
Lucas Ou-Yang
491a91cbf3 Merge pull request #33 from cantino/master
Add a section with links to related projects
2014-02-02 12:37:50 -08:00
Andrew Cantino
4871668cde Fix formatting 2014-02-01 15:05:31 -08:00
Andrew Cantino
1e2ec8ef9b Add a section with references to related projects 2014-02-01 15:03:12 -08:00
Lucas Ou-Yang
f9aafc4a0c updated contributors, added goose licensing info 2014-01-31 12:01:18 -08:00
Lucas Ou-Yang
50530561e2 change type of ratio from int to double
dividing two integers in Python 2 yields an integer, we need a float
2014-01-29 16:17:49 -08:00
Lucas Ou-Yang
b78529f913 Merge pull request #30 from otemnov/original
Original
2014-01-29 16:14:16 -08:00
Oleg Temnov
393601dc4a Configuration for ignoring images with an excessively long/wide ratio 2014-01-30 03:40:48 +04:00
Oleg Temnov
b94e1347bc Remove duplicate image urls 2014-01-30 03:40:48 +04:00
Lucas Ou-Yang
42d113fb28 removed auto test-run 2014-01-27 14:32:32 -08:00
Lucas Ou-Yang
89179e0cb7 remove un-needed tests 2014-01-27 14:31:33 -08:00
Lucas Ou-Yang
9c4b9403a9 added slight ease to testing reddit img extract 2014-01-27 14:30:21 -08:00
Lucas Ou-Yang
adba1d0860 Merge pull request #29 from otemnov/original
Fix reddit top image
2014-01-27 14:23:23 -08:00
Oleg Temnov
84f41fbd38 fix reddit top image 2014-01-28 01:46:52 +04:00
Lucas Ou-Yang
f0d5fae5b2 Merge pull request #28 from voidfiles/master
Extract Meta Tags in structured way
2014-01-26 16:39:56 -08:00
Alex Kessinger
0c117721fe Extract Meta Tags in structured way
There is a lot of metadata that people put in meta tags, especially
now that Open Graph and Twitter Cards are very popular.
Therefore I suggest there should be a way to extract these tags from
an article.

Here is an example:

```python
import pprint

import newspaper

# Download and parse the article, then pretty-print its structured meta tags.
article = newspaper.build_article('http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos')
article.download()
article.parse()
print(pprint.pformat(dict(article.meta_data)))

{'article': {'publisher': 'https://www.facebook.com/BuzzFeed',
             'section': 'DIY',
             'tag': 'Tattoos'},
 'og': {'description': 'All is fair in love and ink.',
        'image': {'height': 625,
                  'url': 'http://s3-ak.buzzfeed.com/static/2014-01/enhanced/webdr06/22/16/enhanced-buzz-8304-1390426280-19.jpg',
                  'width': 625},
        'site_name': 'BuzzFeed',
        'title': '23 Epic Literary Love Tattoos',
        'type': 'article',
        'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'},
 'twitter': {'app': {'id': {'googleplay': 'com.buzzfeed.android',
                            'ipad': 'id352969997',
                            'iphone': 'id352969997'},
                     'url': {'googleplay': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}},
             'card': 'summary_large_image',
             'creator': '@tabooradley',
             'description': 'All is fair in love and ink.',
             'image': 'http://s3-ak.buzzfeed.com/static/2014-01/campaign_images/webdr06/26/14/23-epic-literary-love-tattoos-1-1410-1390763440-3_big.jpg',
             'site': '@buzzfeed',
             'title': '23 Epic Literary Love Tattoos',
             'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}}
```
2014-01-26 16:28:34 -08:00