ljluestc
dd61ba794f
Add restrict_to_homepage_urls option to limit scraping to homepage links ( #134 )
2025-06-22 09:47:18 -07:00
Lucas Ou-Yang
ba8d2f41be
Update README.rst
2025-03-06 17:44:47 -08:00
Lucas Ou-Yang
f622011177
Update README.rst
2020-09-01 23:54:25 -07:00
Lucas Ou-Yang
5af1bea20f
Update README.rst
2020-07-12 18:16:14 -07:00
Lucas Ou-Yang
b0cc1278c4
Update README.rst
2020-07-04 19:34:46 -07:00
Lucas Ou-Yang
4b35117e7e
Update README.rst
2020-07-03 12:10:45 -07:00
Lucas Ou-Yang
2f6ca8fa63
Update README.rst
2020-07-03 12:09:51 -07:00
Lucas Ou-Yang
db81b55aab
Update README.rst
2020-07-03 12:04:24 -07:00
Lucas Ou-Yang
1c27e6da19
Update README.rst
...
Project support
2020-06-27 19:49:08 -07:00
Bachstelze
837bd13e96
changed 404 url ( #819 )
2020-06-25 22:47:35 -07:00
Lucas Ou-Yang
56de65af9e
Update README.rst
2020-06-22 16:36:27 -07:00
Lucas Ou-Yang
cba0658011
Update README.rst
2020-06-22 16:35:16 -07:00
Kyle Jones
a0f725333a
Dropping python 3.4 support ( #768 )
...
* Dropping python 3.4 support
* Fixing build issues
* Changed version number - incremented major version due to breaking change
* Removing pandas dependency
2020-06-22 13:38:44 -07:00
Lucas Ou-Yang
1c7feb1c55
Update README, gitad setup
2020-06-20 11:59:42 -07:00
Lucas Ou-Yang
9b89046d07
Add donations links in readme
2019-04-12 08:23:13 -04:00
Lucas Ou-Yang
cf85a7eadf
Modify readme
2019-04-07 16:31:52 -04:00
Lucas Ou-Yang
2788a2fdcd
Replace patreon with consulting
2019-04-07 16:00:31 -04:00
Akash Nidhi P S
c258db1e54
Added more stopwords for stopwords-hi.txt ( #675 )
2019-03-16 20:58:27 -04:00
Guy Rosin
069a437920
Update extractors.py ( #688 )
...
Added a date tag
2019-03-16 20:56:20 -04:00
bact
4c9cde0749
Add Thai stopwords ( #669 )
...
* Add Thai stopwords from stopwordsiso
* add "th" to language_dict
* add unit test and test data files for Thai language
* add pythainlp to requirements.txt
* sort requirements.txt
* Update and sort supported language list
* sort the language list
* update language list in docs/index.rst
2019-03-16 20:53:04 -04:00
Lucas Ou-Yang
1cb6a1b143
Another edit to the Patreon change
2019-03-10 21:26:19 -04:00
Lucas Ou-Yang
0deaaa1ec5
Add Patreon support page.
2019-03-10 21:22:26 -04:00
danieleago
1ee2fdbfa5
Update stopwords-it.txt ( #660 )
...
fix strip word
2019-01-04 20:30:09 -08:00
ekaterinasmarp
11cbf3a303
Ignoring http pages depending on their content-type ( #658 )
...
* Ignoring http pages depending on their content-type, PDFs are ignored by default
* Code review fixes
2018-12-27 07:06:06 -08:00
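A minimal sketch of the content-type check described in the commit above, assuming the decision is made from the response's Content-Type header. The names `IGNORED_CONTENT_TYPES` and `should_ignore` are hypothetical and are not taken from the project's code.

```python
# Hypothetical default: the commit says PDFs are ignored by default.
IGNORED_CONTENT_TYPES = {"application/pdf"}


def should_ignore(content_type_header, ignored=IGNORED_CONTENT_TYPES):
    """Return True if a response with this Content-Type should be skipped."""
    if not content_type_header:
        # No header: nothing to match against, so do not ignore.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ignored
```

With this sketch, `should_ignore("text/html; charset=utf-8")` is false and `should_ignore("application/pdf")` is true.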
Agnel Vishal
162c168e8d
Updated comments. ( #659 )
...
* Updated comments.
The previous comment was difficult to understand.
* Changed as requested.
* Removed space
2018-12-05 00:02:26 +09:00
Torben Brodt
e84b666136
Fix extracting proper author information with nested tags ( #651 )
...
tested with 7392243
where author is in nested structure
```
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">
Klaus Egelund</span>
</span>
```
2018-12-04 00:48:31 +09:00
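The nested-author case in the commit above can be illustrated with a standard-library sketch. This is not the project's extractor (which is not shown here); `AuthorParser` and `extract_author` are hypothetical names, and the sketch assumes every tag inside the author element has a matching close tag.

```python
from html.parser import HTMLParser


class AuthorParser(HTMLParser):
    """Collect all text nested inside an element with itemprop="author"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 means we are outside the author element
        self.chunks = []  # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the author element: track nesting.
            self.depth += 1
        elif dict(attrs).get("itemprop") == "author":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_author(html):
    parser = AuthorParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace and newlines left over from the markup.
    return " ".join("".join(parser.chunks).split())
```

Fed the snippet from the commit message, `extract_author` returns the name from the inner `itemprop="name"` span rather than an empty string.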
Piotr Grzesik
c8a0455b81
Add extraction of meta_site_name ( #630 )
2018-10-28 13:25:14 -07:00
Evaldas Kazlauskis
7f388b37a7
Adding lithuanian language support ( #639 )
2018-10-27 15:09:09 -07:00
Piotr Grzesik
4013b6ad04
Skip removing last diff if it's known that it has non-media class ( #633 )
2018-10-26 11:37:14 -07:00
Piotr Grzesik
1d095200b1
Fix broadcasting typo in cleaners ( #634 )
2018-10-11 18:50:14 +07:00
Dan Robertson
4a540cbcd9
Handle file scheme in Article.download ( #598 )
...
- Update the Article download function to handle the file scheme.
- Add test cases for using Article.download with a file url
2018-10-04 12:11:50 +07:00
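A rough sketch of what handling the file scheme can look like, using only the standard library. The helper name `read_if_file_url` is hypothetical, and this is not the actual body of `Article.download`; it assumes the local file is UTF-8 text.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname


def read_if_file_url(url):
    """Return the file's text for a file:// URL, or None for other schemes."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return None
    # Convert the URL path (possibly percent-encoded) to a local path.
    path = url2pathname(parsed.path)
    with open(path, encoding="utf-8") as f:
        return f.read()
```

A caller would fall back to the normal HTTP download whenever the helper returns `None`.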
Lucas Ou-Yang
d1766a8b84
Change package management script to use twine
2018-09-27 22:03:59 -07:00
Lucas Ou-Yang
7dc200fa31
Version bump: 0.2.7 => 0.2.8
2018-09-27 21:53:09 -07:00
Nuno Pinheiro
9af47d1e25
Added div_to_para transformation for section tags ( #627 )
2018-09-22 14:53:29 -07:00
Lucas Ou-Yang
8fa5ae1546
Add Article constructor param sanitization guards ( #623 )
2018-09-05 00:34:07 -07:00
Sam Fonseca
146c4fd304
add "byl" attr val to byline parsing ( #619 )
2018-09-04 23:50:55 -07:00
Lucas Ou-Yang
7a39f9d717
Kill print(..) statements and replace with logging or exceptions ( #622 )
2018-09-04 23:47:05 -07:00
Lucas Ou-Yang
2f7fc40ac9
Improve mthreading.py code, add override threads option, remove unused ( #618 )
2018-09-02 23:21:22 -07:00
Shevchenko Vitaliy
beacce0e16
Use try except on creating dirs to avoid FileExistsError because of race condition ( #617 )
2018-08-31 18:01:35 -07:00
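The race the commit above refers to arises when two threads or processes both see that a directory is missing and both try to create it. A minimal sketch of the try/except pattern, with a hypothetical helper name `ensure_dir`:

```python
import os


def ensure_dir(path):
    """Create path (and parents), tolerating a concurrent creator."""
    try:
        os.makedirs(path)
    except FileExistsError:
        # Someone else created it first. Re-raise only if the existing
        # entry is not a directory (for example, a regular file).
        if not os.path.isdir(path):
            raise
```

On Python 3.2 and later, `os.makedirs(path, exist_ok=True)` covers the same case in one call.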
Lev E. Givon
f8b8e52d14
Enable async downloading of individually specified articles. ( #548 )
2018-08-30 01:04:28 -07:00
Lucas Ou-Yang
92d7f290aa
Bump version from 0.2.6 => 0.2.7 enabling new pypi push
2018-08-28 00:44:10 -07:00
sfi-dannybrady
2dea0097c0
Added Japanese language support. ( #584 )
...
* Added Japanese language support.
* Investigate tinysegmenter requirement
2018-08-27 00:38:04 -07:00
Itamar Eduardo Gonçalves de Oliveira
9f8fa28f13
Update stopwords-pt.txt ( #597 )
...
Update list with pt stopwords from: https://github.com/stopwords-iso/stopwords-pt
2018-08-26 11:55:23 -07:00
viymak
c09da44c9e
Add Belarusian support ( #607 )
...
* added belarusian stopwords
* added Belarusian to README.rst, index.rst, quickstart.rst, utils.py
2018-08-26 11:39:02 -07:00
Riccardo Padovani
e34c777e60
Allow .shtml pages ( #570 )
...
Fix #569
2018-06-21 00:05:15 -07:00
Nazarii Piontko
b0c13a068c
Fix get_html_2XX_only exceptions masking ( #573 )
...
* Fix get_html_2XX_only exceptions masking
* Fix non-informative exceptions in article.py
2018-06-21 00:04:20 -07:00
Pok Wai Wong
c521057b20
Update README.rst ( #545 )
...
* Update README.rst
- "Newspaper has *seamless* language extraction and detection."
to
- "Newspaper can extract and detect languages *seamlessly*."
2018-04-05 01:14:36 -07:00
yongja216
5e91da7024
StopWordsKorean class bug ( #540 )
...
* bug fix: StopWordsKorean class
Fixed a bug in the StopWordsKorean class where the original code did not find stopwords in words.
The StopWordsHindi class seems to have the same problem, but it is not fixed here.
* enhancement: finding Korean stopwords
Korean words are composed of 'noun'+'postposition'.
Postpositions act like stopwords in English.
For example, 'to him' is '그에게' (그(he)+에게(to)) in Korean. 에게(to) is a stopword.
So, we can detect most stopwords by checking whether the last characters of a Korean word are a stopword.
2018-04-01 21:06:49 -07:00
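The suffix check described in the commit above can be sketched as follows. The stopword set and the function name are hypothetical and kept tiny for illustration; the project's real list lives in its stopword files, and its actual implementation may differ.

```python
# Hypothetical sample of Korean postpositions, for illustration only.
KOREAN_POSTPOSITIONS = {"에게", "은", "는", "을", "를"}


def count_suffix_stopwords(words, stopwords=KOREAN_POSTPOSITIONS):
    """Count words whose trailing characters form a stopword."""
    count = 0
    for word in words:
        # A word counts if it ends with any known postposition.
        if any(word.endswith(sw) for sw in stopwords):
            count += 1
    return count
```

For the example in the commit message, '그에게' ends with '에게' and is counted, while a bare noun such as '책' is not.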
Daniel van Flymen
61b2bf6c75
Use Temp Dir instead of Home Dir ( #354 )
2018-03-06 17:53:19 -08:00
froessler
ed09bb7e49
added estonian stopwords ( #523 )
2018-02-28 01:57:14 -08:00