Commit graph

653 commits

Author SHA1 Message Date
ljluestc
dd61ba794f Add restrict_to_homepage_urls option to limit scraping to homepage links (#134) 2025-06-22 09:47:18 -07:00
Lucas Ou-Yang
ba8d2f41be Update README.rst 2025-03-06 17:44:47 -08:00
Lucas Ou-Yang
f622011177 Update README.rst 2020-09-01 23:54:25 -07:00
Lucas Ou-Yang
5af1bea20f Update README.rst 2020-07-12 18:16:14 -07:00
Lucas Ou-Yang
b0cc1278c4 Update README.rst 2020-07-04 19:34:46 -07:00
Lucas Ou-Yang
4b35117e7e Update README.rst 2020-07-03 12:10:45 -07:00
Lucas Ou-Yang
2f6ca8fa63 Update README.rst 2020-07-03 12:09:51 -07:00
Lucas Ou-Yang
db81b55aab Update README.rst 2020-07-03 12:04:24 -07:00
Lucas Ou-Yang
1c27e6da19 Update README.rst
Project support
2020-06-27 19:49:08 -07:00
Bachstelze
837bd13e96 changed 404 url (#819) 2020-06-25 22:47:35 -07:00
Lucas Ou-Yang
56de65af9e Update README.rst 2020-06-22 16:36:27 -07:00
Lucas Ou-Yang
cba0658011 Update README.rst 2020-06-22 16:35:16 -07:00
Kyle Jones
a0f725333a Dropping python 3.4 support (#768)
* Dropping python 3.4 support

* Fixing build issues

* Changed version number - incremented major version due to breaking change

* Removing pandas dependency
2020-06-22 13:38:44 -07:00
Lucas Ou-Yang
1c7feb1c55 Update README, gitad setup 2020-06-20 11:59:42 -07:00
Lucas Ou-Yang
9b89046d07 Add donations links in readme 2019-04-12 08:23:13 -04:00
Lucas Ou-Yang
cf85a7eadf Modify readme 2019-04-07 16:31:52 -04:00
Lucas Ou-Yang
2788a2fdcd Replace patreon with consulting 2019-04-07 16:00:31 -04:00
Akash Nidhi P S
c258db1e54 Added more stopwords for stopwords-hi.txt (#675) 2019-03-16 20:58:27 -04:00
Guy Rosin
069a437920 Update extractors.py (#688)
Added a date tag
2019-03-16 20:56:20 -04:00
bact
4c9cde0749 Add Thai stopwords (#669)
* Add Thai stopwords from stopwordsiso

* add "th" to language_dict

* add unit test and test data files for Thai language

* - add pythainlp to requirements.txt
- sort requirements.txt

* Update and sort supported language list

* sort the language list

* update language list in docs/index.rst
2019-03-16 20:53:04 -04:00
Lucas Ou-Yang
1cb6a1b143 Another edit to the Patreon change 2019-03-10 21:26:19 -04:00
Lucas Ou-Yang
0deaaa1ec5 Add Patreon support page. 2019-03-10 21:22:26 -04:00
danieleago
1ee2fdbfa5 Update stopwords-it.txt (#660)
fix strip word
2019-01-04 20:30:09 -08:00
ekaterinasmarp
11cbf3a303 Ignoring http pages depending on their content-type (#658)
* Ignoring http pages depending on their content-type, PDFs are ignored by default

* Code review fixes

* Code review fixes

* Code review fixes

* Code review fixes
2018-12-27 07:06:06 -08:00
Agnel Vishal
162c168e8d Updated comments. (#659)
* Updated comments.

The previous comment was difficult to understand.

* Changed as requested.

* Removed space
2018-12-05 00:02:26 +09:00
Torben Brodt
e84b666136 Fix extracting proper author information with nested tags (#651)
tested with 7392243

where author is in nested structure
```
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
                        <span itemprop="name">
                                Klaus Egelund</span>
                    </span>
```
2018-12-04 00:48:31 +09:00
Piotr Grzesik
c8a0455b81 Add extraction of meta_site_name (#630) 2018-10-28 13:25:14 -07:00
Evaldas Kazlauskis
7f388b37a7 Adding lithuanian language support (#639) 2018-10-27 15:09:09 -07:00
Piotr Grzesik
4013b6ad04 Skip removing last diff if it's known that it has non-media class (#633) 2018-10-26 11:37:14 -07:00
Piotr Grzesik
1d095200b1 Fix broadcasting typo in cleaners (#634) 2018-10-11 18:50:14 +07:00
Dan Robertson
4a540cbcd9 Handle file scheme in Article.download (#598)
- Update the Article download function to handle the file scheme.
- Add test cases for using Article.download with a file url
2018-10-04 12:11:50 +07:00
Lucas Ou-Yang
d1766a8b84 Change package management script to use twine 2018-09-27 22:03:59 -07:00
Lucas Ou-Yang
7dc200fa31 Version bump: 0.2.7 => 0.2.8 2018-09-27 21:53:09 -07:00
Nuno Pinheiro
9af47d1e25 Added div_to_para transformation for section tags (#627) 2018-09-22 14:53:29 -07:00
Lucas Ou-Yang
8fa5ae1546 Add Article constructor param sanitization guards (#623) 2018-09-05 00:34:07 -07:00
Sam Fonseca
146c4fd304 add "byl" attr val to byline parsing (#619) 2018-09-04 23:50:55 -07:00
Lucas Ou-Yang
7a39f9d717 Kill print(..) statements and replace with logging or exceptions (#622) 2018-09-04 23:47:05 -07:00
Lucas Ou-Yang
2f7fc40ac9 Improve mthreading.py code, add override threads option, remove unused (#618) 2018-09-02 23:21:22 -07:00
Shevchenko Vitaliy
beacce0e16 Use try except on creating dirs to avoid FileExistsError because of race condition (#617) 2018-08-31 18:01:35 -07:00
Lev E. Givon
f8b8e52d14 Enable async downloading of individually specified articles. (#548) 2018-08-30 01:04:28 -07:00
Lucas Ou-Yang
92d7f290aa Bump version from 0.2.6 => 0.2.7 enabling new pypi push 2018-08-28 00:44:10 -07:00
sfi-dannybrady
2dea0097c0 Added Japanese language support. (#584)
* Added Japanese language support.

* Investigate tinysegmenter requirement
2018-08-27 00:38:04 -07:00
Itamar Eduardo Gonçalves de Oliveira
9f8fa28f13 Update stopwords-pt.txt (#597)
Update list with pt stopwords from: https://github.com/stopwords-iso/stopwords-pt
2018-08-26 11:55:23 -07:00
viymak
c09da44c9e Add Belarusian support (#607)
* added belarusian stopwords

* added Belarusian to README.rst, index.rst, quickstart.rst, utils.py
2018-08-26 11:39:02 -07:00
Riccardo Padovani
e34c777e60 Allow .shtml pages (#570)
Fix #569
2018-06-21 00:05:15 -07:00
Nazarii Piontko
b0c13a068c Fix get_html_2XX_only exceptions masking (#573)
* Fix get_html_2XX_only exceptions masking

* Fix non-informative exceptions in article.py
2018-06-21 00:04:20 -07:00
Pok Wai Wong
c521057b20 Update README.rst (#545)
* Update README.rst

- "Newspaper has *seamless* language extraction and detection."
to
- "Newspaper can extract and detect languages *seamlessly*."

2018-04-05 01:14:36 -07:00
yongja216
5e91da7024 StopWordsKorean class bug (#540)
* bug fix: StopWordsKorean class

Fixed a bug in the StopWordsKorean class where the original code did not find stopwords in words.
The StopWordsHindi class seems to have the same problem, but it is not fixed here.

* enhancement: finding Korean stopwords

Korean words are composed of 'noun'+'postposition'.
Postpositions act like stopwords in English.
For example, 'to him' is '그에게' (그 (he) + 에게 (to)) in Korean. 에게 (to) is a stopword.
So we can detect most stopwords by checking whether the last characters of a Korean word are a stopword.
2018-04-01 21:06:49 -07:00
Daniel van Flymen
61b2bf6c75 Use Temp Dir instead of Home Dir (#354) 2018-03-06 17:53:19 -08:00
froessler
ed09bb7e49 added estonian stopwords (#523) 2018-02-28 01:57:14 -08:00
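
Note on 5e91da7024 (StopWordsKorean): the message describes detecting Korean stopwords by checking whether the last characters of a word form a postposition. The snippet below is a minimal sketch of that idea, not the code from the commit. The function name `count_suffix_stopwords` and the small suffix set are hypothetical and chosen only for illustration; the real stopword list lives in the project's stopwords files.

```python
# Hypothetical sample of Korean postpositions treated as stopwords.
# This is NOT the project's actual stopword list.
KOREAN_STOPWORD_SUFFIXES = frozenset({"에게", "은", "는", "이", "가", "을", "를"})


def count_suffix_stopwords(words, suffixes=KOREAN_STOPWORD_SUFFIXES):
    """Count words whose trailing characters match a stopword suffix."""
    max_len = max(map(len, suffixes), default=0)
    count = 0
    for word in words:
        # Try trailing substrings of length 1..max_len. A word that is
        # itself a stopword also matches, since the whole word is a suffix.
        for n in range(1, min(len(word), max_len) + 1):
            if word[-n:] in suffixes:
                count += 1
                break  # count each word at most once
    return count


# '그에게' ("to him") ends with the postposition '에게'; '학교' ("school") does not.
matches = count_suffix_stopwords(["그에게", "학교"])
```

Each word is counted at most once, so a word ending in several overlapping suffixes does not inflate the total.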