Lucas Ou-Yang
|
38fdfc0d48
|
fixed docs, added link to examples on how to add non-latin languages
|
2014-06-17 03:32:02 -07:00 |
|
Lucas Ou-Yang
|
d5d532cbcd
|
fix spacing in docs
|
2014-06-17 03:27:17 -07:00 |
|
Lucas Ou-Yang
|
f961091679
|
fix docs
|
2014-06-17 03:18:48 -07:00 |
|
Lucas Ou-Yang
|
d0081bee68
|
update and finalize docs
|
2014-06-17 02:58:20 -07:00 |
|
Lucas Ou-Yang
|
2a7488c977
|
remove history
|
2014-06-17 01:02:54 -07:00 |
|
Lucas Ou-Yang
|
d67be411c3
|
commented out attempted hints feature, update version 0.0.7
|
2014-06-17 01:02:08 -07:00 |
|
Lucas Ou-Yang
|
ae84fa2c21
|
fix bug in meta img extraction
|
2014-06-16 03:12:13 -07:00 |
|
Lucas Ou-Yang
|
fe7c9cf8bf
|
refactor url extraction
|
2014-06-16 03:01:32 -07:00 |
|
Lucas Ou-Yang
|
69641d0435
|
refactor feed extraction
|
2014-06-16 02:43:57 -07:00 |
|
Lucas Ou-Yang
|
7019cd2f9b
|
refactor img extraction
|
2014-06-16 02:10:01 -07:00 |
|
Lucas Ou-Yang
|
7c8cf2dcbb
|
refactor meta img extraction
|
2014-06-16 01:49:35 -07:00 |
|
Lucas Ou-Yang
|
f1d489e84f
|
refactor move meta_type extraction
|
2014-06-16 01:09:04 -07:00 |
|
Lucas Ou-Yang
|
55efa5f4e1
|
added stopwords and support for indonesian and vietnamese languages
|
2014-06-15 02:32:00 -07:00 |
|
Lucas Ou-Yang
|
dc6b2f42cb
|
fix remove all span tags below p tags
|
2014-06-15 01:55:50 -07:00 |
|
Lucas Ou-Yang
|
fd21e9d7cd
|
added clean_body_classes to ensure body tag is not removed, this is from goose upstream
|
2014-06-15 01:52:45 -07:00 |
|
Lucas Ou
|
6d3a486063
|
Merge pull request #54 from jacquerie/master
Fix typo in code and documentation
|
2014-06-01 02:26:45 -07:00 |
|
Jacopo Notarstefano
|
e92c5319f9
|
Fix small typo in documentation about newspaper.languages()
|
2014-06-01 11:06:38 +02:00 |
|
Jacopo Notarstefano
|
b989781ca4
|
Fix small typo when calling newspaper.languages()
|
2014-06-01 10:59:09 +02:00 |
|
Lucas Ou
|
1fcb9934c2
|
Merge pull request #53 from awesomejie/jiepatch
removed quotes of 'filename' in utils\__init__.py
|
2014-05-31 11:09:00 -07:00 |
|
awesomejie
|
ac1c306efe
|
removed quotes of 'filename' in utils\__init__.py
|
2014-05-31 08:24:45 -04:00 |
|
Lucas Ou
|
856346cf73
|
Merge pull request #49 from jeffnappi/patch-1
Fixed long-form article issue w/ calculate_best_node
|
2014-05-01 10:27:00 -07:00 |
|
Jeffrey Nappi
|
ba7d2726d1
|
Fixed long-form article issue w/ calculate_best_node
See https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L227
|
2014-04-29 16:23:46 -07:00 |
|
Lucas Ou-Yang
|
e0718be363
|
added filetype extractor in urls.py, more modular, fixed bug where our image processor would error out on .ico images (very small images), added clearer ubuntu install instructions
|
2014-03-28 01:00:03 -07:00 |
|
Lucas Ou-Yang
|
f5b5cebcfe
|
Merge pull request #35 from otemnov/original
Use first image from article top_node
|
2014-02-06 21:46:31 -08:00 |
|
Oleg Temnov
|
d77c3a3434
|
Fix for stopwords - wrongly checks bytestring vs unicode set
Example: http://top.rbc.ru/economics/04/02/2014/903127.shtml
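A minimal sketch of the kind of mismatch described above, not the project's actual code: the stopword set and token below are hypothetical. A UTF-8 encoded byte string never matches a unicode entry in the set, so the token has to be decoded before the lookup.

```python
# Hypothetical stopword set holding text (unicode) strings.
stopwords = {u"и", u"в", u"на"}

# A token still in its raw UTF-8 byte form, as it might come from an undecoded page.
raw_token = u"и".encode("utf-8")

# Bytes are compared against unicode entries, so the lookup misses.
found_raw = raw_token in stopwords

# Decoding first makes the lookup compare like with like.
found_decoded = raw_token.decode("utf-8") in stopwords
```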
|
2014-02-05 04:41:57 +04:00 |
|
Oleg Temnov
|
b0ec588877
|
Resolve relative image urls
|
2014-02-03 12:03:10 +04:00 |
|
Oleg Temnov
|
e278edf704
|
Include meta image url to images collection
|
2014-02-03 12:03:10 +04:00 |
|
Oleg Temnov
|
1f134ac632
|
Initialize images and top_image
|
2014-02-03 12:03:10 +04:00 |
|
Oleg Temnov
|
0997aeaa81
|
Check image requirements before set
|
2014-02-03 09:41:41 +04:00 |
|
Oleg Temnov
|
6aa62d096c
|
Use first image from article top_node
|
2014-02-03 09:41:24 +04:00 |
|
Lucas Ou-Yang
|
c7fce705c6
|
add new languages to readme
|
2014-02-02 14:10:34 -08:00 |
|
Lucas Ou-Yang
|
1eaa45217c
|
added languages into api, fixed bug in readme link
|
2014-02-02 14:07:07 -08:00 |
|
Lucas Ou-Yang
|
b6455ae64b
|
completely revamped docs, added 6 new languages
|
2014-02-02 14:04:23 -08:00 |
|
Lucas Ou-Yang
|
3219d39280
|
added arabic urls to testing suite, fixed broken link to boilerpipe
|
2014-02-02 13:05:49 -08:00 |
|
Lucas Ou-Yang
|
3eaa8d135e
|
added many more file types to allowed in urls.py, added more related projects
|
2014-02-02 13:04:35 -08:00 |
|
Lucas Ou-Yang
|
491a91cbf3
|
Merge pull request #33 from cantino/master
Add a section with links to related projects
|
2014-02-02 12:37:50 -08:00 |
|
Andrew Cantino
|
4871668cde
|
Fix formatting
|
2014-02-01 15:05:31 -08:00 |
|
Andrew Cantino
|
1e2ec8ef9b
|
Add a section with references to related projects
|
2014-02-01 15:03:12 -08:00 |
|
Lucas Ou-Yang
|
f9aafc4a0c
|
updated contributors, added goose licensing info
|
2014-01-31 12:01:18 -08:00 |
|
Lucas Ou-Yang
|
50530561e2
|
change type of ratio into double from int
dividing two integers in Python 2 yields an integer, we need a double
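A rough illustration of the difference, assuming hypothetical image dimensions (the names below are not taken from the codebase). Floor division `//` stands in for what Python 2's `/` did with two ints, so the sketch behaves the same on any Python version.

```python
# Hypothetical image dimensions, chosen only for illustration.
width, height = 1000, 300

# Floor division mirrors Python 2's `/` on two ints: the fraction is dropped.
int_ratio = width // height

# Converting one operand to float keeps the fractional part of the ratio.
float_ratio = float(width) / height
```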
|
2014-01-29 16:17:49 -08:00 |
|
Lucas Ou-Yang
|
b78529f913
|
Merge pull request #30 from otemnov/original
Original
|
2014-01-29 16:14:16 -08:00 |
|
Oleg Temnov
|
393601dc4a
|
Configuration to ignore images with an excessively long/wide ratio
|
2014-01-30 03:40:48 +04:00 |
|
Oleg Temnov
|
b94e1347bc
|
Remove duplicate image urls
|
2014-01-30 03:40:48 +04:00 |
|
Lucas Ou-Yang
|
42d113fb28
|
removed auto test-run
|
2014-01-27 14:32:32 -08:00 |
|
Lucas Ou-Yang
|
89179e0cb7
|
remove un-needed tests
|
2014-01-27 14:31:33 -08:00 |
|
Lucas Ou-Yang
|
9c4b9403a9
|
added slight ease to testing reddit img extract
|
2014-01-27 14:30:21 -08:00 |
|
Lucas Ou-Yang
|
adba1d0860
|
Merge pull request #29 from otemnov/original
Fix reddit top image
|
2014-01-27 14:23:23 -08:00 |
|
Oleg Temnov
|
84f41fbd38
|
fix reddit top image
|
2014-01-28 01:46:52 +04:00 |
|
Lucas Ou-Yang
|
f0d5fae5b2
|
Merge pull request #28 from voidfiles/master
Extract Meta Tags in structured way
|
2014-01-26 16:39:56 -08:00 |
|
Alex Kessinger
|
0c117721fe
|
Extract Meta Tags in structured way
There is a lot of metadata that people put in meta tags, especially
now that Open Graph and Twitter Cards are very popular.
Therefore I suggest that there should be a way to extract the tags from
an article.
Here is an example:
```python
import pprint

import newspaper

article = newspaper.build_article('http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos')
article.download()
article.parse()
# Pretty-print the nested meta tag dictionary.
print(pprint.pformat(dict(article.meta_data)))
# Example output:
{'article': {'publisher': 'https://www.facebook.com/BuzzFeed',
             'section': 'DIY',
             'tag': 'Tattoos'},
 'og': {'description': 'All is fair in love and ink.',
        'image': {'height': 625,
                  'url': 'http://s3-ak.buzzfeed.com/static/2014-01/enhanced/webdr06/22/16/enhanced-buzz-8304-1390426280-19.jpg',
                  'width': 625},
        'site_name': 'BuzzFeed',
        'title': '23 Epic Literary Love Tattoos',
        'type': 'article',
        'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'},
 'twitter': {'app': {'id': {'googleplay': 'com.buzzfeed.android',
                            'ipad': 'id352969997',
                            'iphone': 'id352969997'},
                     'url': {'googleplay': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}},
             'card': 'summary_large_image',
             'creator': '@tabooradley',
             'description': 'All is fair in love and ink.',
             'image': 'http://s3-ak.buzzfeed.com/static/2014-01/campaign_images/webdr06/26/14/23-epic-literary-love-tattoos-1-1410-1390763440-3_big.jpg',
             'site': '@buzzfeed',
             'title': '23 Epic Literary Love Tattoos',
             'url': 'http://www.buzzfeed.com/tabirakhter/epic-literary-love-tattoos'}}
```
|
2014-01-26 16:28:34 -08:00 |
|