diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md deleted file mode 100644 index 85567507..00000000 --- a/.github/ISSUE_TEMPLATE.md +++ /dev/null @@ -1,39 +0,0 @@ -Please make sure these boxes are checked before submitting your issue – thank you! - -- [ ] You can actually watch the video in your browser or mobile application, but not download them with `you-get`. -- [ ] Your `you-get` is up-to-date. -- [ ] I have read and tried to do so. -- [ ] The issue is not yet reported on or . If so, please add your comments under the existing issue. -- [ ] The issue (or question) is really about `you-get`, not about some other code or project. - -Run the command with the `--debug` option, and paste the full output inside the fences: - -``` -[PASTE IN ME] -``` - -If there's anything else you would like to say (e.g. in case your issue is not about downloading a specific video; it might as well be a general discussion or proposal for a new feature), fill in the box below; otherwise, you may want to post an emoji or meme instead: - -> [WRITE SOMETHING] -> [OR HAVE SOME :icecream:!] - -汉语翻译最终日期:2016年02月26日 - -在提交前,请确保您已经检查了以下内容! - -- [ ] 你可以在浏览器或移动端中观看视频,但不能使用`you-get`下载. -- [ ] 您的`you-get`为最新版. -- [ ] 我已经阅读并按 中的指引进行了操作. -- [ ] 您的问题没有在 , 报告,否则请在原有issue下报告. -- [ ] 本问题确实关于`you-get`, 而不是其他项目. - -请使用`--debug`运行,并将输出粘贴在下面: - -``` -[在这里粘贴完整日志] -``` - -如果您有其他附言,例如问题只在某个视频发生,或者是一般性讨论或者提出新功能,请在下面添加;或者您可以卖个萌: - -> [您的内容] -> [舔 :icecream:!] diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md deleted file mode 100644 index 79a43f6b..00000000 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ /dev/null @@ -1,48 +0,0 @@ -**(PLEASE DELETE ALL THESE AFTER READING)** - -Thank you for the pull request! `you-get` is a growing open source project, which would not have been possible without contributors like you. - -Here are some simple rules to follow, please recheck them before sending the pull request: - -- [ ] If you want to propose two or more unrelated patches, please open separate pull requests for them, instead of one; -- [ ] All pull requests should be based upon the latest `develop` branch; -- [ ] Name your branch (from which you will send the pull request) properly; use a meaningful name like `add-this-shining-feature` rather than just `develop`; -- [ ] All commit messages, as well as comments in code, should be written in understandable English. - -As a contributor, you must be aware that - -- [ ] You agree to contribute your code to this project, under the terms of the MIT license, so that any person may freely use or redistribute them; of course, you will still reserve the copyright for your own authorship. -- [ ] You may not contribute any code not authored by yourself, unless they are licensed under either public domain or the MIT license, literally. - -Not all pull requests can eventually be merged. I consider merged / unmerged patches as equally important for the community: as long as you think a patch would be helpful, someone else might find it helpful, too, therefore they could take your fork and benefit in some way. In any case, I would like to thank you in advance for taking your time to contribute to this project. - -Cheers, -Mort - -**(PLEASE REPLACE ALL ABOVE WITH A DETAILED DESCRIPTION OF YOUR PULL REQUEST)** - - -汉语翻译最后日期:2016年02月26日 - -**(阅读后请删除所有内容)** - -感谢您的pull request! `you-get`是稳健成长的开源项目,感谢您的贡献. 
- -以下简单检查项目望您复查: - -- [ ] 如果您预计提出两个或更多不相关补丁,请为每个使用不同的pull requests,而不是单一; -- [ ] 所有的pull requests应基于最新的`develop`分支; -- [ ] 您预计提出pull requests的分支应有有意义名称,例如`add-this-shining-feature`而不是`develop`; -- [ ] 所有的提交信息与代码中注释应使用可理解的英语. - -作为贡献者,您需要知悉 - -- [ ] 您同意在MIT协议下贡献代码,以便任何人自由使用或分发;当然,你仍旧保留代码的著作权 -- [ ] 你不得贡献非自己编写的代码,除非其属于公有领域或使用MIT协议. - -不是所有的pull requests都会被合并,然而我认为合并/不合并的补丁一样重要:如果您认为补丁重要,其他人也有可能这么认为,那么他们可以从你的fork中提取工作并获益。无论如何,感谢您费心对本项目贡献. - -祝好, -Mort - -**(请将本内容完整替换为PULL REQUEST的详细内容)** diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml new file mode 100644 index 00000000..b3d50ff7 --- /dev/null +++ b/.github/workflows/python-package.yml @@ -0,0 +1,39 @@ +# This workflow will install Python dependencies, run tests and lint with a variety of Python versions +# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions + +name: develop + +on: + push: + branches: [ develop ] + pull_request: + branches: [ develop ] + +jobs: + build: + + runs-on: ubuntu-latest + strategy: + matrix: + python-version: [3.5, 3.6, 3.7, 3.8, pypy3] + + steps: + - uses: actions/checkout@v2 + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install flake8 pytest + if [ -f requirements.txt ]; then pip install -r requirements.txt; fi + - name: Lint with flake8 + run: | + # stop the build if there are Python syntax errors or undefined names + flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics + # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide + flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics + - name: Test with unittest + run: | + make test diff --git a/.gitignore b/.gitignore index d22d3afe..99b18775 100644 --- a/.gitignore +++ b/.gitignore @@ -81,3 +81,11 @@ _* *.xml /.env /.idea +*.m4a +*.DS_Store +*.txt + +*.zip + +.vscode + diff --git a/.travis.yml b/.travis.yml deleted file mode 100644 index 9b73708d..00000000 --- a/.travis.yml +++ /dev/null @@ -1,18 +0,0 @@ -# https://travis-ci.org/soimort/you-get -language: python -python: - - "3.2" - - "3.3" - - "3.4" - - "3.5" - - "nightly" - - "pypy3" -script: make test -sudo: false -notifications: - webhooks: - urls: - - https://webhooks.gitter.im/e/43cd57826e88ed8f2152 - on_success: change # options: [always|never|change] default: always - on_failure: always # options: [always|never|change] default: always - on_start: never # options: [always|never|change] default: always diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..36816948 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,27 @@ +# How to Report an Issue + +If you would like to report a problem you find when using `you-get`, please open a [Pull Request](https://github.com/soimort/you-get/pulls), which should include: + +1. A detailed description of the encountered problem; +2. At least one commit, addressing the problem through some unit test(s). + * Examples of good commits: [#2675](https://github.com/soimort/you-get/pull/2675/files), [#2680](https://github.com/soimort/you-get/pull/2680/files), [#2685](https://github.com/soimort/you-get/pull/2685/files) + +PRs that fail to meet the above criteria may be closed summarily with no further action. + +A valid PR will remain open until its addressed problem is fixed. 
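For orientation, here is the shape of the unit test that the new CONTRIBUTING.md asks every problem report to carry, and that the workflow above runs via `make test`. This is only a sketch in the style of you-get's `tests/test.py`; the extractor module and sample URL are illustrative assumptions, not part of this diff:

```python
# Sketch of the kind of unit test CONTRIBUTING.md asks for (assumed
# style of tests/test.py). Calling an extractor's download() with
# info_only=True probes metadata only and writes nothing to disk.
import unittest

from you_get.extractors import youtube


class YouGetTests(unittest.TestCase):
    def test_youtube(self):
        # Passes if extraction still works; raises if the site broke.
        youtube.download(
            'https://www.youtube.com/watch?v=jNQXAC9IVRw', info_only=True
        )


if __name__ == '__main__':
    unittest.main()
```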
+ + + +# 如何汇报问题 + +为了防止对 GitHub Issues 的滥用,本项目不接受一般的 Issue。 + +如您在使用 `you-get` 的过程中发现任何问题,请开启一个 [Pull Request](https://github.com/soimort/you-get/pulls)。该 PR 应当包含: + +1. 详细的问题描述; +2. 至少一个 commit,其内容是**与问题相关的**单元测试。**不要通过随意修改无关文件的方式来提交 PR!** + * 有效的 commit 示例:[#2675](https://github.com/soimort/you-get/pull/2675/files), [#2680](https://github.com/soimort/you-get/pull/2680/files), [#2685](https://github.com/soimort/you-get/pull/2685/files) + +不符合以上条件的 PR 可能被直接关闭。 + +有效的 PR 将会被一直保留,直至相应的问题得以修复。 diff --git a/LICENSE.txt b/LICENSE.txt index 54a06fe5..a193d8e2 100644 --- a/LICENSE.txt +++ b/LICENSE.txt @@ -1,15 +1,15 @@ -============================================== -This is a copy of the MIT license. -============================================== -Copyright (C) 2012, 2013, 2014, 2015, 2016 Mort Yao -Copyright (C) 2012 Boyu Guo +MIT License -Permission is hereby granted, free of charge, to any person obtaining a copy of -this software and associated documentation files (the "Software"), to deal in -the Software without restriction, including without limitation the rights to -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies -of the Software, and to permit persons to whom the Software is furnished to do -so, subject to the following conditions: +Copyright (c) 2012-2020 Mort Yao and other contributors + (https://github.com/soimort/you-get/graphs/contributors) +Copyright (c) 2012 Boyu Guo + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. diff --git a/README.md b/README.md index b994ebd1..ce412afd 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,32 @@ # You-Get +[![Build Status](https://github.com/soimort/you-get/workflows/develop/badge.svg)](https://github.com/soimort/you-get/actions) [![PyPI version](https://img.shields.io/pypi/v/you-get.svg)](https://pypi.python.org/pypi/you-get/) -[![Build Status](https://travis-ci.org/soimort/you-get.svg)](https://travis-ci.org/soimort/you-get) [![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/soimort/you-get?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) +**NOTICE: Read [this](https://github.com/soimort/you-get/blob/develop/CONTRIBUTING.md) if you are looking for the conventional "Issues" tab.** + +--- + [You-Get](https://you-get.org/) is a tiny command-line utility to download media contents (videos, audios, images) from the Web, in case there is no other handy way to do it. 
-Here's how you use `you-get` to download a video from [this web page](http://www.fsf.org/blogs/rms/20140407-geneva-tedx-talk-free-software-free-society): +Here's how you use `you-get` to download a video from [YouTube](https://www.youtube.com/watch?v=jNQXAC9IVRw): ```console -$ you-get http://www.fsf.org/blogs/rms/20140407-geneva-tedx-talk-free-software-free-society -Site: fsf.org -Title: TEDxGE2014_Stallman05_LQ -Type: WebM video (video/webm) -Size: 27.12 MiB (28435804 Bytes) +$ you-get 'https://www.youtube.com/watch?v=jNQXAC9IVRw' +site: YouTube +title: Me at the zoo +stream: + - itag: 43 + container: webm + quality: medium + size: 0.5 MiB (564215 bytes) + # download-with: you-get --itag=43 [URL] -Downloading TEDxGE2014_Stallman05_LQ.webm ... -100.0% ( 27.1/27.1 MB) ├████████████████████████████████████████┤[1/1] 12 MB/s +Downloading Me at the zoo.webm ... + 100% ( 0.5/ 0.5MB) ├██████████████████████████████████┤[1/1] 6 MB/s + +Saving Me at the zoo.en.srt ... Done. ``` And here's why you might want to use it: @@ -43,10 +53,10 @@ Are you a Python programmer? Then check out [the source](https://github.com/soim ### Prerequisites -The following dependencies are required and must be installed separately, unless you are using a pre-built package or chocolatey on Windows: +The following dependencies are necessary: -* **[Python 3](https://www.python.org/downloads/)** -* **[FFmpeg](https://www.ffmpeg.org/)** (strongly recommended) or [Libav](https://libav.org/) +* **[Python](https://www.python.org/downloads/)** 3.2 or above +* **[FFmpeg](https://www.ffmpeg.org/)** 1.0 or above * (Optional) [RTMPDump](https://rtmpdump.mplayerhq.hu/) ### Option 1: Install via pip @@ -55,17 +65,13 @@ The official release of `you-get` is distributed on [PyPI](https://pypi.python.o $ pip3 install you-get -### Option 2: Install via [Antigen](https://github.com/zsh-users/antigen) +### Option 2: Install via [Antigen](https://github.com/zsh-users/antigen) (for Zsh users) Add the following line to your `.zshrc`: antigen bundle soimort/you-get -### Option 3: Use a pre-built package (Windows only) - -Download the `exe` (standalone) or `7z` (all dependencies included) from: . - -### Option 4: Download from GitHub +### Option 3: Download from GitHub You may either download the [stable](https://github.com/soimort/you-get/archive/master.zip) (identical with the latest release on PyPI) or the [develop](https://github.com/soimort/you-get/archive/develop.zip) (more hotfixes, unstable features) branch of `you-get`. Unzip it, and put the directory containing the `you-get` script into your `PATH`. @@ -83,7 +89,7 @@ $ python3 setup.py install --user to install `you-get` to a permanent path. -### Option 5: Git clone +### Option 4: Git clone This is the recommended way for all developers, even if you don't often code in Python. @@ -93,13 +99,7 @@ $ git clone git://github.com/soimort/you-get.git Then put the cloned directory into your `PATH`, or run `./setup.py install` to install `you-get` to a permanent path. 
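For users who would rather script downloads than type commands, the CLI installed by any of the options above can also be driven from Python with the standard library alone. A minimal sketch (it assumes the `you-get` executable is on your `PATH`; `-i` and `-o` are the info and output-directory flags shown elsewhere in this README):

```python
# Minimal sketch: driving the you-get CLI from a Python script.
# Assumes `you-get` is on PATH (installed via any option above).
import subprocess

url = 'https://www.youtube.com/watch?v=jNQXAC9IVRw'

# -i lists available streams without downloading anything
subprocess.run(['you-get', '-i', url], check=True)

# -o sets the output directory for the actual download
subprocess.run(['you-get', '-o', 'downloads', url], check=True)
```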
-### Option 6: Using [Chocolatey](https://chocolatey.org/) (Windows only) - -``` -> choco install you-get -``` - -### Option 7: Homebrew (Mac only) +### Option 5: Homebrew (Mac only) You can install `you-get` easily via: @@ -107,9 +107,17 @@ You can install `you-get` easily via: $ brew install you-get ``` +### Option 6: pkg (FreeBSD only) + +You can install `you-get` easily via: + +``` +# pkg install you-get +``` + ### Shell completion -Completion definitions for Bash, Fish and Zsh can be found in [`contrib/completion`](contrib/completion). Please consult your shell's manual for how to take advantage of them. +Completion definitions for Bash, Fish and Zsh can be found in [`contrib/completion`](https://github.com/soimort/you-get/tree/develop/contrib/completion). Please consult your shell's manual for how to take advantage of them. ## Upgrading @@ -125,12 +133,6 @@ or download the latest release via: $ you-get https://github.com/soimort/you-get/archive/master.zip ``` -or use [chocolatey package manager](https://chocolatey.org): - -``` -> choco upgrade you-get -``` - In order to get the latest ```develop``` branch without messing up the PIP, you can try: ``` @@ -148,22 +150,54 @@ $ you-get -i 'https://www.youtube.com/watch?v=jNQXAC9IVRw' site: YouTube title: Me at the zoo streams: # Available quality and codecs + [ DASH ] ____________________________________ + - itag: 242 + container: webm + quality: 320x240 + size: 0.6 MiB (618358 bytes) + # download-with: you-get --itag=242 [URL] + + - itag: 395 + container: mp4 + quality: 320x240 + size: 0.5 MiB (550743 bytes) + # download-with: you-get --itag=395 [URL] + + - itag: 133 + container: mp4 + quality: 320x240 + size: 0.5 MiB (498558 bytes) + # download-with: you-get --itag=133 [URL] + + - itag: 278 + container: webm + quality: 192x144 + size: 0.4 MiB (392857 bytes) + # download-with: you-get --itag=278 [URL] + + - itag: 160 + container: mp4 + quality: 192x144 + size: 0.4 MiB (370882 bytes) + # download-with: you-get --itag=160 [URL] + + - itag: 394 + container: mp4 + quality: 192x144 + size: 0.4 MiB (367261 bytes) + # download-with: you-get --itag=394 [URL] + [ DEFAULT ] _________________________________ - itag: 43 container: webm quality: medium - size: 0.5 MiB (564215 bytes) + size: 0.5 MiB (568748 bytes) # download-with: you-get --itag=43 [URL] - itag: 18 container: mp4 - quality: medium - # download-with: you-get --itag=18 [URL] - - - itag: 5 - container: flv quality: small - # download-with: you-get --itag=5 [URL] + # download-with: you-get --itag=18 [URL] - itag: 36 container: 3gp @@ -176,23 +210,24 @@ streams: # Available quality and codecs # download-with: you-get --itag=17 [URL] ``` -The format marked with `DEFAULT` is the one you will get by default. If that looks cool to you, download it: +By default, the one on the top is the one you will get. If that looks cool to you, download it: ``` $ you-get 'https://www.youtube.com/watch?v=jNQXAC9IVRw' site: YouTube title: Me at the zoo stream: - - itag: 43 + - itag: 242 container: webm - quality: medium - size: 0.5 MiB (564215 bytes) - # download-with: you-get --itag=43 [URL] + quality: 320x240 + size: 0.6 MiB (618358 bytes) + # download-with: you-get --itag=242 [URL] -Downloading zoo.webm ... -100.0% ( 0.5/0.5 MB) ├████████████████████████████████████████┤[1/1] 7 MB/s +Downloading Me at the zoo.webm ... + 100% ( 0.6/ 0.6MB) ├██████████████████████████████████████████████████████████████████████████████┤[2/2] 2 MB/s +Merging video parts... 
Merged into Me at the zoo.webm -Saving Me at the zoo.en.srt ...Done. +Saving Me at the zoo.en.srt ... Done. ``` (If a YouTube video has any closed captions, they will be downloaded together with the video file, in SubRip subtitle format.) @@ -292,7 +327,7 @@ However, the system proxy setting (i.e. the environment variable `http_proxy`) i ### Watch a video -Use the `--player`/`-p` option to feed the video into your media player of choice, e.g. `mplayer` or `vlc`, instead of downloading it: +Use the `--player`/`-p` option to feed the video into your media player of choice, e.g. `mpv` or `vlc`, instead of downloading it: ``` $ you-get -p vlc 'https://www.youtube.com/watch?v=jNQXAC9IVRw' @@ -333,33 +368,29 @@ Use `--url`/`-u` to get a list of downloadable resource URLs extracted from the | VK | |✓|✓| | | Vine | |✓| | | | Vimeo | |✓| | | -| Vidto | |✓| | | -| Videomega | |✓| | | | Veoh | |✓| | | | **Tumblr** | |✓|✓|✓| | TED | |✓| | | | SoundCloud | | | |✓| | SHOWROOM | |✓| | | | Pinterest | | |✓| | -| MusicPlayOn | |✓| | | | MTV81 | |✓| | | | Mixcloud | | | |✓| | Metacafe | |✓| | | | Magisto | |✓| | | | Khan Academy | |✓| | | -| JPopsuki TV | |✓| | | | Internet Archive | |✓| | | | **Instagram** | |✓|✓| | | InfoQ | |✓| | | | Imgur | | |✓| | | Heavy Music Archive | | | |✓| -| **Google+** | |✓|✓| | | Freesound | | | |✓| | Flickr | |✓|✓| | | FC2 Video | |✓| | | | Facebook | |✓| | | | eHow | |✓| | | | Dailymotion | |✓| | | +| Coub | |✓| | | | CBS | |✓| | | | Bandcamp | | | |✓| | AliveThai | |✓| | | @@ -368,14 +399,12 @@ Use `--url`/`-u` to get a list of downloadable resource URLs extracted from the | **niconico
ニコニコ動画** | |✓| | | | **163<br/>网易视频<br/>网易云音乐** | <br/>|✓| |✓| | 56网 | |✓| | | -| **AcFun** | |✓| | | +| **AcFun** | |✓| | | | **Baidu<br/>百度贴吧** | |✓|✓| | | 爆米花网 | |✓| | | -| **bilibili<br/>哔哩哔哩** | |✓| | | -| Dilidili | |✓| | | -| 豆瓣 | | | |✓| +| **bilibili<br/>哔哩哔哩** | |✓|✓|✓| +| 豆瓣 | |✓| |✓| | 斗鱼 | |✓| | | -| Panda<br/>熊猫 | |✓| | | | 凤凰视频 | |✓| | | | 风行网 | |✓| | | | iQIYI 
爱奇艺 | |✓| | | @@ -387,26 +416,32 @@ Use `--url`/`-u` to get a list of downloadable resource URLs extracted from the | 荔枝FM | | | |✓| | 秒拍 | |✓| | | | MioMio弹幕网 | |✓| | | +| MissEvan
猫耳FM | | | |✓| | 痞客邦 | |✓| | | | PPTV聚力 | |✓| | | | 齐鲁网 | |✓| | | | QQ<br/>腾讯视频 | |✓| | | | 企鹅直播 | |✓| | | -| 阡陌视频 | |✓| | | -| THVideo | |✓| | | | Sina<br/>新浪视频<br/>微博秒拍视频 | <br/>|✓| | | | Sohu<br/>搜狐视频 | |✓| | | -| 天天动听 | | | |✓| | **Tudou<br/>土豆** | |✓| | | -| 虾米 | | | |✓| +| 虾米 | |✓| |✓| | 阳光卫视 | |✓| | | | **音悦Tai** | |✓| | | | **Youku<br/>优酷** | |✓| | | | 战旗TV | |✓| | | | 央视网 | |✓| | | -| 花瓣 | | |✓| | | Naver<br/>네이버 | |✓| | | | 芒果TV | |✓| | | +| 火猫TV | |✓| | | +| 阳光宽频网 | |✓| | | +| 西瓜视频 | |✓| | | +| 新片场 | |✓| | | +| 快手 | |✓|✓| | +| 抖音 | |✓| | | +| TikTok | |✓| | | +| 中国体育(TV) | 
|✓| | | +| 知乎 | |✓| | | For all other sites not on the list, the universal extractor will take care of finding and downloading interesting resources from the page. @@ -414,19 +449,13 @@ For all other sites not on the list, the universal extractor will take care of f If something is broken and `you-get` can't get you things you want, don't panic. (Yes, this happens all the time!) -Check if it's already a known problem on , and search on the [list of open issues](https://github.com/soimort/you-get/issues). If it has not been reported yet, open a new issue, with detailed command-line output attached. +Check if it's already a known problem on . If not, follow the guidelines on [how to report an issue](https://github.com/soimort/you-get/blob/develop/CONTRIBUTING.md). ## Getting Involved You can reach us on the Gitter channel [#soimort/you-get](https://gitter.im/soimort/you-get) (here's how you [set up your IRC client](http://irc.gitter.im) for Gitter). If you have a quick question regarding `you-get`, ask it there. -All kinds of pull requests are welcome. However, there are a few guidelines to follow: - -* The [`develop`](https://github.com/soimort/you-get/tree/develop) branch is where your pull request should go. -* Remember to rebase. -* Document your PR clearly, and if applicable, provide some sample links for reviewers to test with. -* Write well-formatted, easy-to-understand commit messages. If you don't know how, look at existing ones. -* We will not ask you to sign a CLA, but you must assure that your code can be legally redistributed (under the terms of the MIT license). +If you are seeking to report an issue or contribute, please make sure to read [the guidelines](https://github.com/soimort/you-get/blob/develop/CONTRIBUTING.md) first. ## Legal Issues @@ -450,6 +479,6 @@ We only ship the code here, and how you are going to use it is left to your own ## Authors -Made by [@soimort](https://github.com/soimort), who is in turn powered by :coffee:, :pizza: and :ramen:. +Made by [@soimort](https://github.com/soimort), who is in turn powered by :coffee:, :beer: and :ramen:. You can find the [list of all contributors](https://github.com/soimort/you-get/graphs/contributors) here. diff --git a/setup.py b/setup.py index 21246c5f..24dc9fb2 100755 --- a/setup.py +++ b/setup.py @@ -41,5 +41,9 @@ setup( classifiers = proj_info['classifiers'], - entry_points = {'console_scripts': proj_info['console_scripts']} + entry_points = {'console_scripts': proj_info['console_scripts']}, + + extras_require={ + 'socks': ['PySocks'], + } ) diff --git a/src/you_get/cli_wrapper/player/__main__.py b/src/you_get/cli_wrapper/player/__main__.py index 8d4958b9..09f4d42d 100644 --- a/src/you_get/cli_wrapper/player/__main__.py +++ b/src/you_get/cli_wrapper/player/__main__.py @@ -1,7 +1,9 @@ #!/usr/bin/env python +''' WIP def main(): script_main('you-get', any_download, any_download_playlist) if __name__ == "__main__": main() +''' diff --git a/src/you_get/common.py b/src/you_get/common.py index 948b0ca2..79fc74d1 100755 --- a/src/you_get/common.py +++ b/src/you_get/common.py @@ -1,8 +1,31 @@ #!/usr/bin/env python +import io +import os +import re +import sys +import time +import json +import socket +import locale +import logging +import argparse +import ssl +from http import cookiejar +from importlib import import_module +from urllib import request, parse, error + +from .version import __version__ +from .util import log, term +from .util.git import get_version +from .util.strings import get_filename, unescape_html +from . 
import json_output as json_output_ +sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8') + SITES = { '163' : 'netease', '56' : 'w56', + '365yg' : 'toutiao', 'acfun' : 'acfun', 'archive' : 'archive', 'baidu' : 'baidu', @@ -13,10 +36,10 @@ SITES = { 'cctv' : 'cntv', 'cntv' : 'cntv', 'cbs' : 'cbs', + 'coub' : 'coub', 'dailymotion' : 'dailymotion', - 'dilidili' : 'dilidili', - 'dongting' : 'dongting', 'douban' : 'douban', + 'douyin' : 'douyin', 'douyu' : 'douyutv', 'ehow' : 'ehow', 'facebook' : 'facebook', @@ -25,10 +48,11 @@ SITES = { 'freesound' : 'freesound', 'fun' : 'funshion', 'google' : 'google', + 'giphy' : 'giphy', 'heavy-music' : 'heavymusic', - 'huaban' : 'huaban', 'huomao' : 'huomaotv', 'iask' : 'sina', + 'icourses' : 'icourses', 'ifeng' : 'ifeng', 'imgur' : 'imgur', 'in' : 'alive', @@ -37,32 +61,36 @@ SITES = { 'interest' : 'interest', 'iqilu' : 'iqilu', 'iqiyi' : 'iqiyi', + 'ixigua' : 'ixigua', 'isuntv' : 'suntv', + 'iwara' : 'iwara', 'joy' : 'joy', - 'jpopsuki' : 'jpopsuki', 'kankanews' : 'bilibili', + 'kakao' : 'kakao', 'khanacademy' : 'khan', 'ku6' : 'ku6', + 'kuaishou' : 'kuaishou', 'kugou' : 'kugou', 'kuwo' : 'kuwo', 'le' : 'le', 'letv' : 'le', 'lizhi' : 'lizhi', + 'longzhu' : 'longzhu', 'magisto' : 'magisto', 'metacafe' : 'metacafe', 'mgtv' : 'mgtv', 'miomio' : 'miomio', + 'missevan' : 'missevan', 'mixcloud' : 'mixcloud', 'mtv81' : 'mtv81', - 'musicplayon' : 'musicplayon', + 'miaopai' : 'yixia', 'naver' : 'naver', '7gogo' : 'nanagogo', 'nicovideo' : 'nicovideo', - 'panda' : 'panda', 'pinterest' : 'pinterest', 'pixnet' : 'pixnet', 'pptv' : 'pptv', - 'qianmo' : 'qianmo', + 'qingting' : 'qingting', 'qq' : 'qq', 'showroom-live' : 'showroom', 'sina' : 'sina', @@ -71,14 +99,13 @@ SITES = { 'soundcloud' : 'soundcloud', 'ted' : 'ted', 'theplatform' : 'theplatform', - 'thvideo' : 'thvideo', + 'tiktok' : 'tiktok', 'tucao' : 'tucao', 'tudou' : 'tudou', 'tumblr' : 'tumblr', 'twimg' : 'twitter', 'twitter' : 'twitter', - 'videomega' : 'videomega', - 'vidto' : 'vidto', + 'ucas' : 'ucas', 'vimeo' : 'vimeo', 'wanmen' : 'wanmen', 'weibo' : 'miaopai', @@ -88,48 +115,35 @@ SITES = { 'xiami' : 'xiami', 'xiaokaxiu' : 'yixia', 'xiaojiadianvideo' : 'fc2video', + 'ximalaya' : 'ximalaya', + 'xinpianchang' : 'xinpianchang', 'yinyuetai' : 'yinyuetai', - 'miaopai' : 'yixia', + 'yizhibo' : 'yizhibo', 'youku' : 'youku', 'youtu' : 'youtube', 'youtube' : 'youtube', 'zhanqi' : 'zhanqi', + 'zhibo' : 'zhibo', + 'zhihu' : 'zhihu', } -import getopt -import json -import locale -import logging -import os -import platform -import re -import socket -import sys -import time -from urllib import request, parse, error -from http import cookiejar -from importlib import import_module - -from .version import __version__ -from .util import log, term -from .util.git import get_version -from .util.strings import get_filename, unescape_html -from . 
import json_output as json_output_ - dry_run = False json_output = False force = False +skip_existing_file_size_check = False player = None extractor_proxy = None cookies = None output_filename = None +auto_rename = False +insecure = False fake_headers = { - 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', + 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # noqa 'Accept-Charset': 'UTF-8,*;q=0.5', 'Accept-Encoding': 'gzip,deflate,sdch', 'Accept-Language': 'en-US,en;q=0.8', - 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20100101 Firefox/13.0' + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.74 Safari/537.36 Edg/79.0.309.43', # noqa } if sys.stdout.isatty(): @@ -137,16 +151,60 @@ if sys.stdout.isatty(): else: default_encoding = locale.getpreferredencoding().lower() + +def rc4(key, data): + # all encryption algo should work on bytes + assert type(key) == type(data) and type(key) == type(b'') + state = list(range(256)) + j = 0 + for i in range(256): + j += state[i] + key[i % len(key)] + j &= 0xff + state[i], state[j] = state[j], state[i] + + i = 0 + j = 0 + out_list = [] + for char in data: + i += 1 + i &= 0xff + j += state[i] + j &= 0xff + state[i], state[j] = state[j], state[i] + prn = state[(state[i] + state[j]) & 0xff] + out_list.append(char ^ prn) + + return bytes(out_list) + + +def general_m3u8_extractor(url, headers={}): + m3u8_list = get_content(url, headers=headers).split('\n') + urls = [] + for line in m3u8_list: + line = line.strip() + if line and not line.startswith('#'): + if line.startswith('http'): + urls.append(line) + else: + seg_url = parse.urljoin(url, line) + urls.append(seg_url) + return urls + + def maybe_print(*s): - try: print(*s) - except: pass + try: + print(*s) + except: + pass + def tr(s): if default_encoding == 'utf-8': return s else: return s - #return str(s.encode('utf-8'))[2:-1] + # return str(s.encode('utf-8'))[2:-1] + # DEPRECATED in favor of match1() def r1(pattern, text): @@ -154,6 +212,7 @@ def r1(pattern, text): if m: return m.group(1) + # DEPRECATED in favor of match1() def r1_of(patterns, text): for p in patterns: @@ -161,6 +220,7 @@ def r1_of(patterns, text): if x: return x + def match1(text, *patterns): """Scans through a string for substrings matched some patterns (first-subgroups only). @@ -188,6 +248,7 @@ def match1(text, *patterns): ret.append(match.group(1)) return ret + def matchall(text, patterns): """Scans through a string for substrings matched some patterns. @@ -206,10 +267,26 @@ def matchall(text, patterns): return ret + def launch_player(player, urls): import subprocess import shlex - subprocess.call(shlex.split(player) + list(urls)) + urls = list(urls) + for url in urls.copy(): + if type(url) is list: + urls.extend(url) + urls = [url for url in urls if type(url) is str] + assert urls + if (sys.version_info >= (3, 3)): + import shutil + exefile=shlex.split(player)[0] + if shutil.which(exefile) is not None: + subprocess.call(shlex.split(player) + urls) + else: + log.wtf('[Failed] Cannot find player "%s"' % exefile) + else: + subprocess.call(shlex.split(player) + urls) + def parse_query_param(url, param): """Parses the query string of a URL and returns the value of a parameter. 
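A note on the new `rc4()` helper in the hunk above: it is a plain RC4 implementation (key-scheduling loop followed by the PRGA keystream XOR) operating on `bytes`, so applying it twice with the same key recovers the original data. A quick round-trip sketch, with a made-up key and payload for illustration:

```python
# Round-trip sketch for the rc4() helper added above. RC4 XORs data
# with a keystream, so a second pass with the same key decrypts.
# The key and payload are made-up values, not from any extractor.
from you_get.common import rc4

key = b'example-key'
payload = b'signed-query-string-to-decode'

cipher = rc4(key, payload)          # bytes in, bytes out
assert rc4(key, cipher) == payload  # symmetric round-trip
```

The same hunk's `general_m3u8_extractor()` follows a similar utility pattern: it fetches a playlist, keeps the non-comment lines, and resolves relative segment paths against the playlist URL with `parse.urljoin`.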
@@ -227,8 +304,14 @@ def parse_query_param(url, param): except: return None + def unicodize(text): - return re.sub(r'\\u([0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f])', lambda x: chr(int(x.group(0)[2:], 16)), text) + return re.sub( + r'\\u([0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f])', + lambda x: chr(int(x.group(0)[2:], 16)), + text + ) + # DEPRECATED in favor of util.legitimize() def escape_file_path(path): @@ -238,6 +321,7 @@ def escape_file_path(path): path = path.replace('?', '-') return path + def ungzip(data): """Decompresses data for Content-Encoding: gzip. """ @@ -247,6 +331,7 @@ def ungzip(data): f = gzip.GzipFile(fileobj=buffer) return f.read() + def undeflate(data): """Decompresses data for Content-Encoding: deflate. (the zlib compression is used.) @@ -255,15 +340,20 @@ def undeflate(data): decompressobj = zlib.decompressobj(-zlib.MAX_WBITS) return decompressobj.decompress(data)+decompressobj.flush() + # DEPRECATED in favor of get_content() -def get_response(url, faker = False): +def get_response(url, faker=False): + logging.debug('get_response: %s' % url) + # install cookies if cookies: opener = request.build_opener(request.HTTPCookieProcessor(cookies)) request.install_opener(opener) if faker: - response = request.urlopen(request.Request(url, headers = fake_headers), None) + response = request.urlopen( + request.Request(url, headers=fake_headers), None + ) else: response = request.urlopen(url) @@ -275,13 +365,15 @@ def get_response(url, faker = False): response.data = data return response + # DEPRECATED in favor of get_content() -def get_html(url, encoding = None, faker = False): +def get_html(url, encoding=None, faker=False): content = get_response(url, faker).data return str(content, 'utf-8', 'ignore') + # DEPRECATED in favor of get_content() -def get_decoded_html(url, faker = False): +def get_decoded_html(url, faker=False): response = get_response(url, faker) data = response.data charset = r1(r'charset=([\w-]+)', response.headers['content-type']) @@ -290,11 +382,41 @@ def get_decoded_html(url, faker = False): else: return data -def get_location(url): - response = request.urlopen(url) - # urllib will follow redirections and it's too much code to tell urllib - # not to do that - return response.geturl() + +def get_location(url, headers=None, get_method='HEAD'): + logging.debug('get_location: %s' % url) + + if headers: + req = request.Request(url, headers=headers) + else: + req = request.Request(url) + req.get_method = lambda: get_method + res = urlopen_with_retry(req) + return res.geturl() + + +def urlopen_with_retry(*args, **kwargs): + retry_time = 3 + for i in range(retry_time): + try: + if insecure: + # ignore ssl errors + ctx = ssl.create_default_context() + ctx.check_hostname = False + ctx.verify_mode = ssl.CERT_NONE + return request.urlopen(*args, context=ctx, **kwargs) + else: + return request.urlopen(*args, **kwargs) + except socket.timeout as e: + logging.debug('request attempt %s timeout' % str(i + 1)) + if i + 1 == retry_time: + raise e + # try to tackle youku CDN fails + except error.HTTPError as http_error: + logging.debug('HTTP Error with code{}'.format(http_error.code)) + if i + 1 == retry_time: + raise http_error + def get_content(url, headers={}, decoded=True): """Gets the content of a URL via sending a HTTP GET request. 
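The `urlopen_with_retry()` helper introduced above becomes the I/O primitive for everything that follows: it retries `socket.timeout` and `error.HTTPError` up to three times before re-raising, and honors the new global `insecure` flag by building an SSL context with certificate verification disabled. The hunks below switch `get_content()`, `post_content()` and the URL probes over to it. A small sketch of the calling pattern (the URL and header value are placeholders):

```python
# Calling pattern for urlopen_with_retry(): a drop-in replacement for
# urllib.request.urlopen that retries timeouts and HTTP errors up to
# three times before re-raising. URL and header are placeholders.
from urllib import request

from you_get.common import urlopen_with_retry

req = request.Request(
    'https://example.com/video.mp4',
    headers={'User-Agent': 'Mozilla/5.0'},
)
response = urlopen_with_retry(req)  # raises only after the last attempt
print(response.getheader('Content-Type'))
```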
@@ -315,13 +437,7 @@ def get_content(url, headers={}, decoded=True): cookies.add_cookie_header(req) req.headers.update(req.unredirected_hdrs) - for i in range(10): - try: - response = request.urlopen(req) - break - except socket.timeout: - logging.debug('request attempt %s timeout' % str(i + 1)) - + response = urlopen_with_retry(req) data = response.read() # Handle HTTP compression for gzip and deflate (zlib) @@ -333,15 +449,18 @@ def get_content(url, headers={}, decoded=True): # Decode the response body if decoded: - charset = match1(response.getheader('Content-Type'), r'charset=([\w-]+)') + charset = match1( + response.getheader('Content-Type', ''), r'charset=([\w-]+)' + ) if charset is not None: - data = data.decode(charset) + data = data.decode(charset, 'ignore') else: - data = data.decode('utf-8') + data = data.decode('utf-8', 'ignore') return data -def post_content(url, headers={}, post_data={}, decoded=True): + +def post_content(url, headers={}, post_data={}, decoded=True, **kwargs): """Post the content of a URL via sending a HTTP POST request. Args: @@ -352,15 +471,20 @@ def post_content(url, headers={}, post_data={}, decoded=True): Returns: The content as a string. """ - - logging.debug('post_content: %s \n post_data: %s' % (url, post_data)) + if kwargs.get('post_data_raw'): + logging.debug('post_content: %s\npost_data_raw: %s' % (url, kwargs['post_data_raw'])) + else: + logging.debug('post_content: %s\npost_data: %s' % (url, post_data)) req = request.Request(url, headers=headers) if cookies: cookies.add_cookie_header(req) req.headers.update(req.unredirected_hdrs) - post_data_enc = bytes(parse.urlencode(post_data), 'utf-8') - response = request.urlopen(req, data = post_data_enc) + if kwargs.get('post_data_raw'): + post_data_enc = bytes(kwargs['post_data_raw'], 'utf-8') + else: + post_data_enc = bytes(parse.urlencode(post_data), 'utf-8') + response = urlopen_with_retry(req, data=post_data_enc) data = response.read() # Handle HTTP compression for gzip and deflate (zlib) @@ -372,7 +496,9 @@ def post_content(url, headers={}, post_data={}, decoded=True): # Decode the response body if decoded: - charset = match1(response.getheader('Content-Type'), r'charset=([\w-]+)') + charset = match1( + response.getheader('Content-Type'), r'charset=([\w-]+)' + ) if charset is not None: data = data.decode(charset) else: @@ -380,41 +506,54 @@ def post_content(url, headers={}, post_data={}, decoded=True): return data -def url_size(url, faker = False, headers = {}): + +def url_size(url, faker=False, headers={}): if faker: - response = request.urlopen(request.Request(url, headers = fake_headers), None) + response = urlopen_with_retry( + request.Request(url, headers=fake_headers) + ) elif headers: - response = request.urlopen(request.Request(url, headers = headers), None) + response = urlopen_with_retry(request.Request(url, headers=headers)) else: - response = request.urlopen(url) + response = urlopen_with_retry(url) size = response.headers['content-length'] - return int(size) if size!=None else float('inf') + return int(size) if size is not None else float('inf') -def urls_size(urls, faker = False, headers = {}): + +def urls_size(urls, faker=False, headers={}): return sum([url_size(url, faker=faker, headers=headers) for url in urls]) -def get_head(url, headers = {}): + +def get_head(url, headers=None, get_method='HEAD'): + logging.debug('get_head: %s' % url) + if headers: - req = request.Request(url, headers = headers) + req = request.Request(url, headers=headers) else: req = request.Request(url) - 
req.get_method = lambda : 'HEAD' - res = request.urlopen(req) - return dict(res.headers) + req.get_method = lambda: get_method + res = urlopen_with_retry(req) + return res.headers + + +def url_info(url, faker=False, headers={}): + logging.debug('url_info: %s' % url) -def url_info(url, faker = False, headers = {}): if faker: - response = request.urlopen(request.Request(url, headers = fake_headers), None) + response = urlopen_with_retry( + request.Request(url, headers=fake_headers) + ) elif headers: - response = request.urlopen(request.Request(url, headers = headers), None) + response = urlopen_with_retry(request.Request(url, headers=headers)) else: - response = request.urlopen(request.Request(url)) + response = urlopen_with_retry(request.Request(url)) headers = response.headers type = headers['content-type'] - if type == 'image/jpg; charset=UTF-8' or type == 'image/jpg' : type = 'audio/mpeg' #fix for netease + if type == 'image/jpg; charset=UTF-8' or type == 'image/jpg': + type = 'audio/mpeg' # fix for netease mapping = { 'video/3gpp': '3gp', 'video/f4v': 'flv', @@ -426,6 +565,9 @@ def url_info(url, faker = False, headers = {}): 'video/x-ms-asf': 'asf', 'audio/mp4': 'mp4', 'audio/mpeg': 'mp3', + 'audio/wav': 'wav', + 'audio/x-wav': 'wav', + 'audio/wave': 'wav', 'image/jpeg': 'jpg', 'image/png': 'png', 'image/gif': 'gif', @@ -437,7 +579,9 @@ def url_info(url, faker = False, headers = {}): type = None if headers['content-disposition']: try: - filename = parse.unquote(r1(r'filename="?([^"]+)"?', headers['content-disposition'])) + filename = parse.unquote( + r1(r'filename="?([^"]+)"?', headers['content-disposition']) + ) if len(filename.split('.')) > 1: ext = filename.split('.')[-1] else: @@ -454,41 +598,95 @@ def url_info(url, faker = False, headers = {}): return type, ext, size -def url_locations(urls, faker = False, headers = {}): + +def url_locations(urls, faker=False, headers={}): locations = [] for url in urls: + logging.debug('url_locations: %s' % url) + if faker: - response = request.urlopen(request.Request(url, headers = fake_headers), None) + response = urlopen_with_retry( + request.Request(url, headers=fake_headers) + ) elif headers: - response = request.urlopen(request.Request(url, headers = headers), None) + response = urlopen_with_retry( + request.Request(url, headers=headers) + ) else: - response = request.urlopen(request.Request(url)) + response = urlopen_with_retry(request.Request(url)) locations.append(response.url) return locations -def url_save(url, filepath, bar, refer = None, is_part = False, faker = False, headers = {}): - file_size = url_size(url, faker = faker, headers = headers) - if os.path.exists(filepath): - if not force and file_size == os.path.getsize(filepath): - if not is_part: - if bar: - bar.done() - print('Skipping %s: file already exists' % tr(os.path.basename(filepath))) +def url_save( + url, filepath, bar, refer=None, is_part=False, faker=False, + headers=None, timeout=None, **kwargs +): + tmp_headers = headers.copy() if headers is not None else {} + # When a referer specified with param refer, + # the key must be 'Referer' for the hack here + if refer is not None: + tmp_headers['Referer'] = refer + if type(url) is list: + chunk_sizes = [url_size(url, faker=faker, headers=tmp_headers) for url in url] + file_size = sum(chunk_sizes) + is_chunked, urls = True, url + else: + file_size = url_size(url, faker=faker, headers=tmp_headers) + chunk_sizes = [file_size] + is_chunked, urls = False, [url] + + continue_renameing = True + while continue_renameing: + 
continue_renameing = False + if os.path.exists(filepath): + if not force and (file_size == os.path.getsize(filepath) or skip_existing_file_size_check): + if not is_part: + if bar: + bar.done() + if skip_existing_file_size_check: + log.w( + 'Skipping {} without checking size: file already exists'.format( + tr(os.path.basename(filepath)) + ) + ) + else: + log.w( + 'Skipping {}: file already exists'.format( + tr(os.path.basename(filepath)) + ) + ) + else: + if bar: + bar.update_received(file_size) + return else: - if bar: - bar.update_received(file_size) - return - else: - if not is_part: - if bar: - bar.done() - print('Overwriting %s' % tr(os.path.basename(filepath)), '...') - elif not os.path.exists(os.path.dirname(filepath)): - os.mkdir(os.path.dirname(filepath)) + if not is_part: + if bar: + bar.done() + if not force and auto_rename: + path, ext = os.path.basename(filepath).rsplit('.', 1) + finder = re.compile(' \([1-9]\d*?\)$') + if (finder.search(path) is None): + thisfile = path + ' (1).' + ext + else: + def numreturn(a): + return ' (' + str(int(a.group()[2:-1]) + 1) + ').' + thisfile = finder.sub(numreturn, path) + ext + filepath = os.path.join(os.path.dirname(filepath), thisfile) + print('Changing name to %s' % tr(os.path.basename(filepath)), '...') + continue_renameing = True + continue + if log.yes_or_no('File with this name already exists. Overwrite?'): + log.w('Overwriting %s ...' % tr(os.path.basename(filepath))) + else: + return + elif not os.path.exists(os.path.dirname(filepath)): + os.mkdir(os.path.dirname(filepath)) - temp_filepath = filepath + '.download' if file_size!=float('inf') else filepath + temp_filepath = filepath + '.download' if file_size != float('inf') \ + else filepath received = 0 if not force: open_mode = 'ab' @@ -500,117 +698,97 @@ def url_save(url, filepath, bar, refer = None, is_part = False, faker = False, h else: open_mode = 'wb' - if received < file_size: - if faker: - headers = fake_headers - elif headers: - headers = headers - else: - headers = {} - if received: - headers['Range'] = 'bytes=' + str(received) + '-' - if refer: - headers['Referer'] = refer - - response = request.urlopen(request.Request(url, headers = headers), None) - try: - range_start = int(response.headers['content-range'][6:].split('/')[0].split('-')[0]) - end_length = end = int(response.headers['content-range'][6:].split('/')[1]) - range_length = end_length - range_start - except: - content_length = response.headers['content-length'] - range_length = int(content_length) if content_length!=None else float('inf') - - if file_size != received + range_length: - received = 0 - if bar: - bar.received = 0 - open_mode = 'wb' - - with open(temp_filepath, open_mode) as output: - while True: - buffer = response.read(1024 * 256) - if not buffer: - if received == file_size: # Download finished - break - else: # Unexpected termination. 
Retry request - headers['Range'] = 'bytes=' + str(received) + '-' - response = request.urlopen(request.Request(url, headers = headers), None) - output.write(buffer) - received += len(buffer) - if bar: - bar.update_received(len(buffer)) - - assert received == os.path.getsize(temp_filepath), '%s == %s == %s' % (received, os.path.getsize(temp_filepath), temp_filepath) - - if os.access(filepath, os.W_OK): - os.remove(filepath) # on Windows rename could fail if destination filepath exists - os.rename(temp_filepath, filepath) - -def url_save_chunked(url, filepath, bar, refer = None, is_part = False, faker = False, headers = {}): - if os.path.exists(filepath): - if not force: - if not is_part: - if bar: - bar.done() - print('Skipping %s: file already exists' % tr(os.path.basename(filepath))) + chunk_start = 0 + chunk_end = 0 + for i, url in enumerate(urls): + received_chunk = 0 + chunk_start += 0 if i == 0 else chunk_sizes[i - 1] + chunk_end += chunk_sizes[i] + if received < file_size and received < chunk_end: + if faker: + tmp_headers = fake_headers + ''' + if parameter headers passed in, we have it copied as tmp_header + elif headers: + headers = headers else: + headers = {} + ''' + if received: + # chunk_start will always be 0 if not chunked + tmp_headers['Range'] = 'bytes=' + str(received - chunk_start) + '-' + if refer: + tmp_headers['Referer'] = refer + + if timeout: + response = urlopen_with_retry( + request.Request(url, headers=tmp_headers), timeout=timeout + ) + else: + response = urlopen_with_retry( + request.Request(url, headers=tmp_headers) + ) + try: + range_start = int( + response.headers[ + 'content-range' + ][6:].split('/')[0].split('-')[0] + ) + end_length = int( + response.headers['content-range'][6:].split('/')[1] + ) + range_length = end_length - range_start + except: + content_length = response.headers['content-length'] + range_length = int(content_length) if content_length is not None \ + else float('inf') + + if is_chunked: # always append if chunked + open_mode = 'ab' + elif file_size != received + range_length: # is it ever necessary? + received = 0 if bar: - bar.update_received(os.path.getsize(filepath)) - return - else: - if not is_part: - if bar: - bar.done() - print('Overwriting %s' % tr(os.path.basename(filepath)), '...') - elif not os.path.exists(os.path.dirname(filepath)): - os.mkdir(os.path.dirname(filepath)) + bar.received = 0 + open_mode = 'wb' - temp_filepath = filepath + '.download' - received = 0 - if not force: - open_mode = 'ab' + with open(temp_filepath, open_mode) as output: + while True: + buffer = None + try: + buffer = response.read(1024 * 256) + except socket.timeout: + pass + if not buffer: + if is_chunked and received_chunk == range_length: + break + elif not is_chunked and received == file_size: # Download finished + break + # Unexpected termination. 
Retry request + tmp_headers['Range'] = 'bytes=' + str(received - chunk_start) + '-' + response = urlopen_with_retry( + request.Request(url, headers=tmp_headers) + ) + continue + output.write(buffer) + received += len(buffer) + received_chunk += len(buffer) + if bar: + bar.update_received(len(buffer)) - if os.path.exists(temp_filepath): - received += os.path.getsize(temp_filepath) - if bar: - bar.update_received(os.path.getsize(temp_filepath)) - else: - open_mode = 'wb' - - if faker: - headers = fake_headers - elif headers: - headers = headers - else: - headers = {} - if received: - headers['Range'] = 'bytes=' + str(received) + '-' - if refer: - headers['Referer'] = refer - - response = request.urlopen(request.Request(url, headers = headers), None) - - with open(temp_filepath, open_mode) as output: - while True: - buffer = response.read(1024 * 256) - if not buffer: - break - output.write(buffer) - received += len(buffer) - if bar: - bar.update_received(len(buffer)) - - assert received == os.path.getsize(temp_filepath), '%s == %s == %s' % (received, os.path.getsize(temp_filepath)) + assert received == os.path.getsize(temp_filepath), '%s == %s == %s' % ( + received, os.path.getsize(temp_filepath), temp_filepath + ) if os.access(filepath, os.W_OK): - os.remove(filepath) # on Windows rename could fail if destination filepath exists + # on Windows rename could fail if destination filepath exists + os.remove(filepath) os.rename(temp_filepath, filepath) + class SimpleProgressBar: term_size = term.get_terminal_size()[1] - def __init__(self, total_size, total_pieces = 1): + def __init__(self, total_size, total_pieces=1): self.displayed = False self.total_size = total_size self.total_pieces = total_pieces @@ -623,9 +801,12 @@ class SimpleProgressBar: # 38 is the size of all statically known size in self.bar total_str = '%5s' % round(self.total_size / 1048576, 1) total_str_width = max(len(total_str), 5) - self.bar_size = self.term_size - 27 - 2*total_pieces_len - 2*total_str_width + self.bar_size = self.term_size - 28 - 2 * total_pieces_len \ + - 2 * total_str_width self.bar = '{:>4}%% ({:>%s}/%sMB) ├{:─<%s}┤[{:>%s}/{:>%s}] {}' % ( - total_str_width, total_str, self.bar_size, total_pieces_len, total_pieces_len) + total_str_width, total_str, self.bar_size, total_pieces_len, + total_pieces_len + ) def update(self): self.displayed = True @@ -642,7 +823,10 @@ class SimpleProgressBar: else: plus = '' bar = '█' * dots + plus - bar = self.bar.format(percent, round(self.received / 1048576, 1), bar, self.current_piece, self.total_pieces, self.speed) + bar = self.bar.format( + percent, round(self.received / 1048576, 1), bar, + self.current_piece, self.total_pieces, self.speed + ) sys.stdout.write('\r' + bar) sys.stdout.flush() @@ -669,8 +853,9 @@ class SimpleProgressBar: print() self.displayed = False + class PiecesProgressBar: - def __init__(self, total_size, total_pieces = 1): + def __init__(self, total_size, total_pieces=1): self.displayed = False self.total_size = total_size self.total_pieces = total_pieces @@ -679,7 +864,9 @@ class PiecesProgressBar: def update(self): self.displayed = True - bar = '{0:>5}%[{1:<40}] {2}/{3}'.format('', '=' * 40, self.current_piece, self.total_pieces) + bar = '{0:>5}%[{1:<40}] {2}/{3}'.format( + '', '=' * 40, self.current_piece, self.total_pieces + ) sys.stdout.write('\r' + bar) sys.stdout.flush() @@ -695,20 +882,31 @@ class PiecesProgressBar: print() self.displayed = False + class DummyProgressBar: def __init__(self, *args): pass + def update_received(self, n): pass + def 
update_piece(self, n): pass + def done(self): pass -def get_output_filename(urls, title, ext, output_dir, merge): + +def get_output_filename(urls, title, ext, output_dir, merge, **kwargs): # lame hack for the --output-filename option global output_filename - if output_filename: return output_filename + if output_filename: + result = output_filename + if kwargs.get('part', -1) >= 0: + result = '%s[%02d]' % (result, kwargs.get('part')) + if ext: + result = '%s.%s' % (result, ext) + return result merged_ext = ext if (len(urls) > 1) and merge: @@ -725,15 +923,34 @@ def get_output_filename(urls, title, ext, output_dir, merge): merged_ext = 'mkv' else: merged_ext = 'ts' - return '%s.%s' % (title, merged_ext) + result = title + if kwargs.get('part', -1) >= 0: + result = '%s[%02d]' % (result, kwargs.get('part')) + result = '%s.%s' % (result, merged_ext) + return result.replace("'", "_") -def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merge=True, faker=False, headers = {}, **kwargs): +def print_user_agent(faker=False): + urllib_default_user_agent = 'Python-urllib/%d.%d' % sys.version_info[:2] + user_agent = fake_headers['User-Agent'] if faker else urllib_default_user_agent + print('User Agent: %s' % user_agent) + +def download_urls( + urls, title, ext, total_size, output_dir='.', refer=None, merge=True, + faker=False, headers={}, **kwargs +): assert urls if json_output: - json_output_.download_urls(urls=urls, title=title, ext=ext, total_size=total_size, refer=refer) + json_output_.download_urls( + urls=urls, title=title, ext=ext, total_size=total_size, + refer=refer + ) return if dry_run: - print('Real URLs:\n%s' % '\n'.join(urls)) + print_user_agent(faker=faker) + try: + print('Real URLs:\n%s' % '\n'.join(urls)) + except: + print('Real URLs:\n%s' % '\n'.join([j for i in urls for j in i])) return if player: @@ -753,8 +970,13 @@ def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merg output_filepath = os.path.join(output_dir, output_filename) if total_size: - if not force and os.path.exists(output_filepath) and os.path.getsize(output_filepath) >= total_size * 0.9: - print('Skipping %s: file already exists' % output_filepath) + if not force and os.path.exists(output_filepath) and not auto_rename\ + and (os.path.getsize(output_filepath) >= total_size * 0.9\ + or skip_existing_file_size_check): + if skip_existing_file_size_check: + log.w('Skipping %s without checking size: file already exists' % output_filepath) + else: + log.w('Skipping %s: file already exists' % output_filepath) print() return bar = SimpleProgressBar(total_size, len(urls)) @@ -765,19 +987,25 @@ def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merg url = urls[0] print('Downloading %s ...' % tr(output_filename)) bar.update() - url_save(url, output_filepath, bar, refer = refer, faker = faker, headers = headers) + url_save( + url, output_filepath, bar, refer=refer, faker=faker, + headers=headers, **kwargs + ) bar.done() else: parts = [] - print('Downloading %s.%s ...' % (tr(title), ext)) + print('Downloading %s ...' % tr(output_filename)) bar.update() for i, url in enumerate(urls): - filename = '%s[%02d].%s' % (title, i, ext) - filepath = os.path.join(output_dir, filename) - parts.append(filepath) - #print 'Downloading %s [%s/%s]...' 
% (tr(filename), i + 1, len(urls)) + output_filename_i = get_output_filename(urls, title, ext, output_dir, merge, part=i) + output_filepath_i = os.path.join(output_dir, output_filename_i) + parts.append(output_filepath_i) + # print 'Downloading %s [%s/%s]...' % (tr(filename), i + 1, len(urls)) bar.update_piece(i + 1) - url_save(url, filepath, bar, refer = refer, is_part = True, faker = faker, headers = headers) + url_save( + url, output_filepath_i, bar, refer=refer, is_part=True, faker=faker, + headers=headers, **kwargs + ) bar.done() if not merge: @@ -791,7 +1019,8 @@ def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merg ret = ffmpeg_concat_av(parts, output_filepath, ext) print('Merged into %s' % output_filename) if ret == 0: - for part in parts: os.remove(part) + for part in parts: + os.remove(part) elif ext in ['flv', 'f4v']: try: @@ -825,7 +1054,7 @@ def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merg for part in parts: os.remove(part) - elif ext == "ts": + elif ext == 'ts': try: from .processor.ffmpeg import has_ffmpeg_installed if has_ffmpeg_installed(): @@ -841,96 +1070,36 @@ def download_urls(urls, title, ext, total_size, output_dir='.', refer=None, merg for part in parts: os.remove(part) + elif ext == 'mp3': + try: + from .processor.ffmpeg import has_ffmpeg_installed + + assert has_ffmpeg_installed() + from .processor.ffmpeg import ffmpeg_concat_mp3_to_mp3 + ffmpeg_concat_mp3_to_mp3(parts, output_filepath) + print('Merged into %s' % output_filename) + except: + raise + else: + for part in parts: + os.remove(part) + else: print("Can't merge %s files" % ext) print() -def download_urls_chunked(urls, title, ext, total_size, output_dir='.', refer=None, merge=True, faker=False, headers = {}): - assert urls - if dry_run: - print('Real URLs:\n%s\n' % urls) - return - if player: - launch_player(player, urls) - return - - title = tr(get_filename(title)) - - filename = '%s.%s' % (title, ext) - filepath = os.path.join(output_dir, filename) - if total_size and ext in ('ts'): - if not force and os.path.exists(filepath[:-3] + '.mkv'): - print('Skipping %s: file already exists' % filepath[:-3] + '.mkv') - print() - return - bar = SimpleProgressBar(total_size, len(urls)) - else: - bar = PiecesProgressBar(total_size, len(urls)) - - if len(urls) == 1: - parts = [] - url = urls[0] - print('Downloading %s ...' % tr(filename)) - filepath = os.path.join(output_dir, filename) - parts.append(filepath) - url_save_chunked(url, filepath, bar, refer = refer, faker = faker, headers = headers) - bar.done() - - if not merge: - print() - return - if ext == 'ts': - from .processor.ffmpeg import has_ffmpeg_installed - if has_ffmpeg_installed(): - from .processor.ffmpeg import ffmpeg_convert_ts_to_mkv - if ffmpeg_convert_ts_to_mkv(parts, os.path.join(output_dir, title + '.mkv')): - for part in parts: - os.remove(part) - else: - os.remove(os.path.join(output_dir, title + '.mkv')) - else: - print('No ffmpeg is found. Conversion aborted.') - else: - print("Can't convert %s files" % ext) - else: - parts = [] - print('Downloading %s.%s ...' % (tr(title), ext)) - for i, url in enumerate(urls): - filename = '%s[%02d].%s' % (title, i, ext) - filepath = os.path.join(output_dir, filename) - parts.append(filepath) - #print 'Downloading %s [%s/%s]...' 
% (tr(filename), i + 1, len(urls)) - bar.update_piece(i + 1) - url_save_chunked(url, filepath, bar, refer = refer, is_part = True, faker = faker, headers = headers) - bar.done() - - if not merge: - print() - return - if ext == 'ts': - from .processor.ffmpeg import has_ffmpeg_installed - if has_ffmpeg_installed(): - from .processor.ffmpeg import ffmpeg_concat_ts_to_mkv - if ffmpeg_concat_ts_to_mkv(parts, os.path.join(output_dir, title + '.mkv')): - for part in parts: - os.remove(part) - else: - os.remove(os.path.join(output_dir, title + '.mkv')) - else: - print('No ffmpeg is found. Merging aborted.') - else: - print("Can't merge %s files" % ext) - - print() - -def download_rtmp_url(url,title, ext,params={}, total_size=0, output_dir='.', refer=None, merge=True, faker=False): +def download_rtmp_url( + url, title, ext, params={}, total_size=0, output_dir='.', refer=None, + merge=True, faker=False +): assert url if dry_run: + print_user_agent(faker=faker) print('Real URL:\n%s\n' % [url]) - if params.get("-y",False): #None or unset ->False - print('Real Playpath:\n%s\n' % [params.get("-y")]) + if params.get('-y', False): # None or unset -> False + print('Real Playpath:\n%s\n' % [params.get('-y')]) return if player: @@ -938,16 +1107,23 @@ def download_rtmp_url(url,title, ext,params={}, total_size=0, output_dir='.', re play_rtmpdump_stream(player, url, params) return - from .processor.rtmpdump import has_rtmpdump_installed, download_rtmpdump_stream - assert has_rtmpdump_installed(), "RTMPDump not installed." - download_rtmpdump_stream(url, title, ext,params, output_dir) + from .processor.rtmpdump import ( + has_rtmpdump_installed, download_rtmpdump_stream + ) + assert has_rtmpdump_installed(), 'RTMPDump not installed.' + download_rtmpdump_stream(url, title, ext, params, output_dir) -def download_url_ffmpeg(url,title, ext,params={}, total_size=0, output_dir='.', refer=None, merge=True, faker=False): + +def download_url_ffmpeg( + url, title, ext, params={}, total_size=0, output_dir='.', refer=None, + merge=True, faker=False, stream=True +): assert url if dry_run: + print_user_agent(faker=faker) print('Real URL:\n%s\n' % [url]) - if params.get("-y",False): #None or unset ->False - print('Real Playpath:\n%s\n' % [params.get("-y")]) + if params.get('-y', False): # None or unset ->False + print('Real Playpath:\n%s\n' % [params.get('-y')]) return if player: @@ -955,17 +1131,33 @@ def download_url_ffmpeg(url,title, ext,params={}, total_size=0, output_dir='.', return from .processor.ffmpeg import has_ffmpeg_installed, ffmpeg_download_stream - assert has_ffmpeg_installed(), "FFmpeg not installed." - ffmpeg_download_stream(url, title, ext, params, output_dir) + assert has_ffmpeg_installed(), 'FFmpeg not installed.' 
+ + global output_filename + if output_filename: + dotPos = output_filename.rfind('.') + if dotPos > 0: + title = output_filename[:dotPos] + ext = output_filename[dotPos+1:] + else: + title = output_filename + + title = tr(get_filename(title)) + + ffmpeg_download_stream(url, title, ext, params, output_dir, stream=stream) + def playlist_not_supported(name): def f(*args, **kwargs): raise NotImplementedError('Playlist is not supported for ' + name) return f -def print_info(site_info, title, type, size): + +def print_info(site_info, title, type, size, **kwargs): if json_output: - json_output_.print_info(site_info=site_info, title=title, type=type, size=size) + json_output_.print_info( + site_info=site_info, title=title, type=type, size=size + ) return if type: type = type.lower() @@ -996,48 +1188,62 @@ def print_info(site_info, title, type, size): type = 'image/gif' if type in ['video/3gpp']: - type_info = "3GPP multimedia file (%s)" % type + type_info = '3GPP multimedia file (%s)' % type elif type in ['video/x-flv', 'video/f4v']: - type_info = "Flash video (%s)" % type + type_info = 'Flash video (%s)' % type elif type in ['video/mp4', 'video/x-m4v']: - type_info = "MPEG-4 video (%s)" % type + type_info = 'MPEG-4 video (%s)' % type elif type in ['video/MP2T']: - type_info = "MPEG-2 transport stream (%s)" % type + type_info = 'MPEG-2 transport stream (%s)' % type elif type in ['video/webm']: - type_info = "WebM video (%s)" % type - #elif type in ['video/ogg']: - # type_info = "Ogg video (%s)" % type + type_info = 'WebM video (%s)' % type + # elif type in ['video/ogg']: + # type_info = 'Ogg video (%s)' % type elif type in ['video/quicktime']: - type_info = "QuickTime video (%s)" % type + type_info = 'QuickTime video (%s)' % type elif type in ['video/x-matroska']: - type_info = "Matroska video (%s)" % type - #elif type in ['video/x-ms-wmv']: - # type_info = "Windows Media video (%s)" % type + type_info = 'Matroska video (%s)' % type + # elif type in ['video/x-ms-wmv']: + # type_info = 'Windows Media video (%s)' % type elif type in ['video/x-ms-asf']: - type_info = "Advanced Systems Format (%s)" % type - #elif type in ['video/mpeg']: - # type_info = "MPEG video (%s)" % type - elif type in ['audio/mp4']: - type_info = "MPEG-4 audio (%s)" % type + type_info = 'Advanced Systems Format (%s)' % type + # elif type in ['video/mpeg']: + # type_info = 'MPEG video (%s)' % type + elif type in ['audio/mp4', 'audio/m4a']: + type_info = 'MPEG-4 audio (%s)' % type elif type in ['audio/mpeg']: - type_info = "MP3 (%s)" % type + type_info = 'MP3 (%s)' % type + elif type in ['audio/wav', 'audio/wave', 'audio/x-wav']: + type_info = 'Waveform Audio File Format ({})'.format(type) elif type in ['image/jpeg']: - type_info = "JPEG Image (%s)" % type + type_info = 'JPEG Image (%s)' % type elif type in ['image/png']: - type_info = "Portable Network Graphics (%s)" % type + type_info = 'Portable Network Graphics (%s)' % type elif type in ['image/gif']: - type_info = "Graphics Interchange Format (%s)" % type - + type_info = 'Graphics Interchange Format (%s)' % type + elif type in ['m3u8']: + if 'm3u8_type' in kwargs: + if kwargs['m3u8_type'] == 'master': + type_info = 'M3U8 Master {}'.format(type) + else: + type_info = 'M3U8 Playlist {}'.format(type) else: - type_info = "Unknown type (%s)" % type + type_info = 'Unknown type (%s)' % type - maybe_print("Site: ", site_info) - maybe_print("Title: ", unescape_html(tr(title))) - print("Type: ", type_info) - print("Size: ", round(size / 1048576, 2), "MiB (" + str(size) + " Bytes)") + 
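+    # Editorial note: for m3u8 the total size is not known up front, so the
+    # size line is suppressed below and the playlist URL is printed instead.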
maybe_print('Site: ', site_info) + maybe_print('Title: ', unescape_html(tr(title))) + print('Type: ', type_info) + if type != 'm3u8': + print( + 'Size: ', round(size / 1048576, 2), + 'MiB (' + str(size) + ' Bytes)' + ) + if type == 'm3u8' and 'm3u8_url' in kwargs: + print('M3U8 Url: {}'.format(kwargs['m3u8_url'])) print() + def mime_to_container(mime): mapping = { 'video/3gpp': '3gp', @@ -1050,6 +1256,7 @@ def mime_to_container(mime): else: return mime.split('/')[1] + def parse_host(host): """Parses host name and port number from a string. """ @@ -1062,6 +1269,7 @@ def parse_host(host): port = o.port or 0 return (hostname, port) + def set_proxy(proxy): proxy_handler = request.ProxyHandler({ 'http': '%s:%s' % proxy, @@ -1070,27 +1278,33 @@ def set_proxy(proxy): opener = request.build_opener(proxy_handler) request.install_opener(opener) + def unset_proxy(): proxy_handler = request.ProxyHandler({}) opener = request.build_opener(proxy_handler) request.install_opener(opener) + # DEPRECATED in favor of set_proxy() and unset_proxy() def set_http_proxy(proxy): - if proxy == None: # Use system default setting + if proxy is None: # Use system default setting proxy_support = request.ProxyHandler() - elif proxy == '': # Don't use any proxy + elif proxy == '': # Don't use any proxy proxy_support = request.ProxyHandler({}) - else: # Use proxy - proxy_support = request.ProxyHandler({'http': '%s' % proxy, 'https': '%s' % proxy}) + else: # Use proxy + proxy_support = request.ProxyHandler( + {'http': '%s' % proxy, 'https': '%s' % proxy} + ) opener = request.build_opener(proxy_support) request.install_opener(opener) + def print_more_compatible(*args, **kwargs): import builtins as __builtin__ """Overload default print function as py (<3.3) does not support 'flush' keyword. Although the function name can be same as print to get itself overloaded automatically, - I'd rather leave it with a different name and only overload it when importing to make less confusion. """ + I'd rather leave it with a different name and only overload it when importing to make less confusion. + """ # nothing happens on py3.3 and later if sys.version_info[:2] >= (3, 3): return __builtin__.print(*args, **kwargs) @@ -1103,12 +1317,9 @@ def print_more_compatible(*args, **kwargs): return ret - def download_main(download, download_playlist, urls, playlist, **kwargs): for url in urls: - if url.startswith('https://'): - url = url[8:] - if not url.startswith('http://'): + if re.match(r'https?://', url) is None: url = 'http://' + url if playlist: @@ -1116,208 +1327,391 @@ def download_main(download, download_playlist, urls, playlist, **kwargs): else: download(url, **kwargs) -def script_main(script_name, download, download_playlist, **kwargs): - def version(): - log.i('version %s, a tiny downloader that scrapes the web.' - % get_version(kwargs['repo_path'] - if 'repo_path' in kwargs else __version__)) +def load_cookies(cookiefile): + global cookies + if cookiefile.endswith('.txt'): + # MozillaCookieJar treats prefix '#HttpOnly_' as comments incorrectly! 
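+    # (concretely: a browser-exported Netscape cookies.txt would silently drop
+    # every '#HttpOnly_' entry if MozillaCookieJar.load() were used)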
+ # do not use its load() + # see also: + # - https://docs.python.org/3/library/http.cookiejar.html#http.cookiejar.MozillaCookieJar + # - https://github.com/python/cpython/blob/4b219ce/Lib/http/cookiejar.py#L2014 + # - https://curl.haxx.se/libcurl/c/CURLOPT_COOKIELIST.html#EXAMPLE + #cookies = cookiejar.MozillaCookieJar(cookiefile) + #cookies.load() + from http.cookiejar import Cookie + cookies = cookiejar.MozillaCookieJar() + now = time.time() + ignore_discard, ignore_expires = False, False + with open(cookiefile, 'r', encoding='utf-8') as f: + for line in f: + # last field may be absent, so keep any trailing tab + if line.endswith("\n"): line = line[:-1] + + # skip comments and blank lines XXX what is $ for? + if (line.strip().startswith(("#", "$")) or + line.strip() == ""): + if not line.strip().startswith('#HttpOnly_'): # skip for #HttpOnly_ + continue + + domain, domain_specified, path, secure, expires, name, value = \ + line.split("\t") + secure = (secure == "TRUE") + domain_specified = (domain_specified == "TRUE") + if name == "": + # cookies.txt regards 'Set-Cookie: foo' as a cookie + # with no name, whereas http.cookiejar regards it as a + # cookie with no value. + name = value + value = None + + initial_dot = domain.startswith(".") + if not line.strip().startswith('#HttpOnly_'): # skip for #HttpOnly_ + assert domain_specified == initial_dot + + discard = False + if expires == "": + expires = None + discard = True + + # assume path_specified is false + c = Cookie(0, name, value, + None, False, + domain, domain_specified, initial_dot, + path, False, + secure, + expires, + discard, + None, + None, + {}) + if not ignore_discard and c.discard: + continue + if not ignore_expires and c.is_expired(now): + continue + cookies.set_cookie(c) + + elif cookiefile.endswith(('.sqlite', '.sqlite3')): + import sqlite3, shutil, tempfile + temp_dir = tempfile.gettempdir() + temp_cookiefile = os.path.join(temp_dir, 'temp_cookiefile.sqlite') + shutil.copy2(cookiefile, temp_cookiefile) + + cookies = cookiejar.MozillaCookieJar() + con = sqlite3.connect(temp_cookiefile) + cur = con.cursor() + cur.execute("""SELECT host, path, isSecure, expiry, name, value + FROM moz_cookies""") + for item in cur.fetchall(): + c = cookiejar.Cookie( + 0, item[4], item[5], None, False, item[0], + item[0].startswith('.'), item[0].startswith('.'), + item[1], False, item[2], item[3], item[3] == '', None, + None, {}, + ) + cookies.set_cookie(c) + + else: + log.e('[error] unsupported cookies format') + # TODO: Chromium Cookies + # SELECT host_key, path, secure, expires_utc, name, encrypted_value + # FROM cookies + # http://n8henrie.com/2013/11/use-chromes-cookies-for-easier-downloading-with-python-requests/ + + +def set_socks_proxy(proxy): + try: + import socks + if '@' in proxy: + proxy_info = proxy.split("@") + socks_proxy_addrs = proxy_info[1].split(':') + socks_proxy_auth = proxy_info[0].split(":") + print(socks_proxy_auth[0]+" "+socks_proxy_auth[1]+" "+socks_proxy_addrs[0]+" "+socks_proxy_addrs[1]) + socks.set_default_proxy( + socks.SOCKS5, + socks_proxy_addrs[0], + int(socks_proxy_addrs[1]), + True, + socks_proxy_auth[0], + socks_proxy_auth[1] + ) + else: + socks_proxy_addrs = proxy.split(':') + print(socks_proxy_addrs[0]+" "+socks_proxy_addrs[1]) + socks.set_default_proxy( + socks.SOCKS5, + socks_proxy_addrs[0], + int(socks_proxy_addrs[1]), + ) + socket.socket = socks.socksocket + + def getaddrinfo(*args): + return [ + (socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1])) + ] + socket.getaddrinfo = getaddrinfo + 
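+    # Editorial note: swapping in socks.socksocket and the getaddrinfo shim
+    # above keeps the local resolver out of hostname lookups, so names are
+    # resolved on the proxy side of the SOCKS5 tunnel.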
except ImportError:
+        log.w(
+            'Error importing PySocks library, socks proxy ignored. '
+            'In order to use a socks proxy, please install PySocks.'
+        )
+
+
+def script_main(download, download_playlist, **kwargs):
     logging.basicConfig(format='[%(levelname)s] %(message)s')
-    help = 'Usage: %s [OPTION]... [URL]...\n\n' % script_name
-    help += '''Startup options:
-        -V | --version                      Print version and exit.
-        -h | --help                         Print help and exit.
-        \n'''
-    help += '''Dry-run options: (no actual downloading)
-        -i | --info                         Print extracted information.
-        -u | --url                          Print extracted information with URLs.
-        --json                              Print extracted URLs in JSON format.
-        \n'''
-    help += '''Download options:
-        -n | --no-merge                     Do not merge video parts.
-        --no-caption                        Do not download captions.
-                                            (subtitles, lyrics, danmaku, ...)
-        -f | --force                        Force overwriting existed files.
-        -F | --format <STREAM_ID>           Set video format to STREAM_ID.
-        -O | --output-filename <FILE>       Set output filename.
-        -o | --output-dir <DIR>             Set output directory.
-        -p | --player <PLAYER>              Stream extracted URL to a PLAYER.
-        -c | --cookies <COOKIES_FILE>       Load cookies.txt or cookies.sqlite.
-        -x | --http-proxy <HOST:PORT>       Use an HTTP proxy for downloading.
-        -y | --extractor-proxy <HOST:PORT>  Use an HTTP proxy for extracting only.
-        --no-proxy                          Never use a proxy.
-        -s | --socks-proxy <HOST:PORT>      Use an SOCKS5 proxy for downloading.
-        -t | --timeout <SECONDS>            Set socket timeout.
-        -d | --debug                        Show traceback and other debug info.
-    '''
+    def print_version():
+        version = get_version(
+            kwargs['repo_path'] if 'repo_path' in kwargs else __version__
+        )
+        log.i(
+            'version {}, a tiny downloader that scrapes the web.'.format(
+                version
+            )
+        )

-    short_opts = 'Vhfiuc:ndF:O:o:p:x:y:s:t:'
-    opts = ['version', 'help', 'force', 'info', 'url', 'cookies', 'no-caption', 'no-merge', 'no-proxy', 'debug', 'json', 'format=', 'stream=', 'itag=', 'output-filename=', 'output-dir=', 'player=', 'http-proxy=', 'socks-proxy=', 'extractor-proxy=', 'lang=', 'timeout=']
-    if download_playlist:
-        short_opts = 'l' + short_opts
-        opts = ['playlist'] + opts

+    parser = argparse.ArgumentParser(
+        prog='you-get',
+        usage='you-get [OPTION]...
URL...', + description='A tiny downloader that scrapes the web', + add_help=False, + ) + parser.add_argument( + '-V', '--version', action='store_true', + help='Print version and exit' + ) + parser.add_argument( + '-h', '--help', action='store_true', + help='Print this help message and exit' + ) - try: - opts, args = getopt.getopt(sys.argv[1:], short_opts, opts) - except getopt.GetoptError as err: - log.e(err) - log.e("try 'you-get --help' for more options") - sys.exit(2) + dry_run_grp = parser.add_argument_group( + 'Dry-run options', '(no actual downloading)' + ) + dry_run_grp = dry_run_grp.add_mutually_exclusive_group() + dry_run_grp.add_argument( + '-i', '--info', action='store_true', help='Print extracted information' + ) + dry_run_grp.add_argument( + '-u', '--url', action='store_true', + help='Print extracted information with URLs' + ) + dry_run_grp.add_argument( + '--json', action='store_true', + help='Print extracted URLs in JSON format' + ) + + download_grp = parser.add_argument_group('Download options') + download_grp.add_argument( + '-n', '--no-merge', action='store_true', default=False, + help='Do not merge video parts' + ) + download_grp.add_argument( + '--no-caption', action='store_true', + help='Do not download captions (subtitles, lyrics, danmaku, ...)' + ) + download_grp.add_argument( + '-f', '--force', action='store_true', default=False, + help='Force overwriting existing files' + ) + download_grp.add_argument( + '--skip-existing-file-size-check', action='store_true', default=False, + help='Skip existing file without checking file size' + ) + download_grp.add_argument( + '-F', '--format', metavar='STREAM_ID', + help='Set video format to STREAM_ID' + ) + download_grp.add_argument( + '-O', '--output-filename', metavar='FILE', help='Set output filename' + ) + download_grp.add_argument( + '-o', '--output-dir', metavar='DIR', default='.', + help='Set output directory' + ) + download_grp.add_argument( + '-p', '--player', metavar='PLAYER', + help='Stream extracted URL to a PLAYER' + ) + download_grp.add_argument( + '-c', '--cookies', metavar='COOKIES_FILE', + help='Load cookies.txt or cookies.sqlite' + ) + download_grp.add_argument( + '-t', '--timeout', metavar='SECONDS', type=int, default=600, + help='Set socket timeout' + ) + download_grp.add_argument( + '-d', '--debug', action='store_true', + help='Show traceback and other debug info' + ) + download_grp.add_argument( + '-I', '--input-file', metavar='FILE', type=argparse.FileType('r'), + help='Read non-playlist URLs from FILE' + ) + download_grp.add_argument( + '-P', '--password', help='Set video visit password to PASSWORD' + ) + download_grp.add_argument( + '-l', '--playlist', action='store_true', + help='Prefer to download a playlist' + ) + download_grp.add_argument( + '-a', '--auto-rename', action='store_true', default=False, + help='Auto rename same name different files' + ) + + download_grp.add_argument( + '-k', '--insecure', action='store_true', default=False, + help='ignore ssl errors' + ) + + proxy_grp = parser.add_argument_group('Proxy options') + proxy_grp = proxy_grp.add_mutually_exclusive_group() + proxy_grp.add_argument( + '-x', '--http-proxy', metavar='HOST:PORT', + help='Use an HTTP proxy for downloading' + ) + proxy_grp.add_argument( + '-y', '--extractor-proxy', metavar='HOST:PORT', + help='Use an HTTP proxy for extracting only' + ) + proxy_grp.add_argument( + '--no-proxy', action='store_true', help='Never use a proxy' + ) + proxy_grp.add_argument( + '-s', '--socks-proxy', metavar='HOST:PORT or 
USERNAME:PASSWORD@HOST:PORT',
+        help='Use a SOCKS5 proxy for downloading'
+    )
+
+    download_grp.add_argument('--stream', help=argparse.SUPPRESS)
+    download_grp.add_argument('--itag', help=argparse.SUPPRESS)
+
+    parser.add_argument('URL', nargs='*', help=argparse.SUPPRESS)
+
+    args = parser.parse_args()
+
+    if args.help:
+        print_version()
+        parser.print_help()
+        sys.exit()
+    if args.version:
+        print_version()
+        sys.exit()
+
+    if args.debug:
+        # Set level of root logger to DEBUG
+        logging.getLogger().setLevel(logging.DEBUG)

     global force
+    global skip_existing_file_size_check
     global dry_run
     global json_output
     global player
     global extractor_proxy
-    global cookies
     global output_filename
+    global auto_rename
+    global insecure
+    output_filename = args.output_filename
+    extractor_proxy = args.extractor_proxy
+
+    info_only = args.info
+    if args.force:
+        force = True
+    if args.skip_existing_file_size_check:
+        skip_existing_file_size_check = True
+    if args.auto_rename:
+        auto_rename = True
+    if args.url:
+        dry_run = True
+    if args.json:
+        json_output = True
+        # to fix extractors that do not use VideoExtractor
+        dry_run = True
+        info_only = False
+
+    if args.cookies:
+        load_cookies(args.cookies)

-    info_only = False
-    playlist = False
     caption = True
-    merge = True
-    stream_id = None
-    lang = None
-    output_dir = '.'
-    proxy = None
-    socks_proxy = None
-    extractor_proxy = None
-    traceback = False
-    timeout = 600
-    for o, a in opts:
-        if o in ('-V', '--version'):
-            version()
-            sys.exit()
-        elif o in ('-h', '--help'):
-            version()
-            print(help)
-            sys.exit()
-        elif o in ('-f', '--force'):
-            force = True
-        elif o in ('-i', '--info'):
-            info_only = True
-        elif o in ('-u', '--url'):
-            dry_run = True
-        elif o in ('--json', ):
-            json_output = True
-            # to fix extractors not use VideoExtractor
-            dry_run = True
-            info_only = False
-        elif o in ('-c', '--cookies'):
-            try:
-                cookies = cookiejar.MozillaCookieJar(a)
-                cookies.load()
-            except:
-                import sqlite3
-                cookies = cookiejar.MozillaCookieJar()
-                con = sqlite3.connect(a)
-                cur = con.cursor()
-                try:
-                    cur.execute("SELECT host, path, isSecure, expiry, name, value FROM moz_cookies")
-                    for item in cur.fetchall():
-                        c = cookiejar.Cookie(0, item[4], item[5],
-                                             None, False,
-                                             item[0],
-                                             item[0].startswith('.'),
-                                             item[0].startswith('.'),
-                                             item[1], False,
-                                             item[2],
-                                             item[3], item[3]=="",
-                                             None, None, {})
-                        cookies.set_cookie(c)
-                except: pass
-                # TODO: Chromium Cookies
-                # SELECT host_key, path, secure, expires_utc, name, encrypted_value FROM cookies
-                # http://n8henrie.com/2013/11/use-chromes-cookies-for-easier-downloading-with-python-requests/

+    stream_id = args.format or args.stream or args.itag
+    if args.no_caption:
+        caption = False
+    if args.player:
+        player = args.player
+        caption = False
-        elif o in ('-l', '--playlist'):
-            playlist = True
-        elif o in ('--no-caption',):
-            caption = False
-        elif o in ('-n', '--no-merge'):
-            merge = False
-        elif o in ('--no-proxy',):
-            proxy = ''
-        elif o in ('-d', '--debug'):
-            traceback = True
-            # Set level of root logger to DEBUG
-            logging.getLogger().setLevel(logging.DEBUG)
-        elif o in ('-F', '--format', '--stream', '--itag'):
-            stream_id = a
-        elif o in ('-O', '--output-filename'):
-            output_filename = a
-        elif o in ('-o', '--output-dir'):
-            output_dir = a
-        elif o in ('-p', '--player'):
-            player = a
-            caption = False
-        elif o in ('-x', '--http-proxy'):
-            proxy = a
-        elif o in ('-s', '--socks-proxy'):
-            socks_proxy = a
-        elif o in ('-y', '--extractor-proxy'):
-            extractor_proxy = a
-        elif o in ('--lang',):
-            lang = a
-        elif o in ('-t',
'--timeout'): - timeout = int(a) - else: - log.e("try 'you-get --help' for more options") + if args.insecure: + # ignore ssl + insecure = True + + + if args.no_proxy: + set_http_proxy('') + else: + set_http_proxy(args.http_proxy) + if args.socks_proxy: + set_socks_proxy(args.socks_proxy) + + URLs = [] + if args.input_file: + logging.debug('you are trying to load urls from %s', args.input_file) + if args.playlist: + log.e( + "reading playlist from a file is unsupported " + "and won't make your life easier" + ) sys.exit(2) - if not args: - print(help) + URLs.extend(args.input_file.read().splitlines()) + args.input_file.close() + URLs.extend(args.URL) + + if not URLs: + parser.print_help() sys.exit() - if (socks_proxy): - try: - import socket - import socks - socks_proxy_addrs = socks_proxy.split(':') - socks.set_default_proxy(socks.SOCKS5, - socks_proxy_addrs[0], - int(socks_proxy_addrs[1])) - socket.socket = socks.socksocket - def getaddrinfo(*args): - return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))] - socket.getaddrinfo = getaddrinfo - except ImportError: - log.w('Error importing PySocks library, socks proxy ignored.' - 'In order to use use socks proxy, please install PySocks.') - else: - import socket - set_http_proxy(proxy) - - socket.setdefaulttimeout(timeout) + socket.setdefaulttimeout(args.timeout) try: + extra = {} + if extractor_proxy: + extra['extractor_proxy'] = extractor_proxy if stream_id: - if not extractor_proxy: - download_main(download, download_playlist, args, playlist, stream_id=stream_id, output_dir=output_dir, merge=merge, info_only=info_only, json_output=json_output, caption=caption) - else: - download_main(download, download_playlist, args, playlist, stream_id=stream_id, extractor_proxy=extractor_proxy, output_dir=output_dir, merge=merge, info_only=info_only, json_output=json_output, caption=caption) - else: - if not extractor_proxy: - download_main(download, download_playlist, args, playlist, output_dir=output_dir, merge=merge, info_only=info_only, json_output=json_output, caption=caption) - else: - download_main(download, download_playlist, args, playlist, extractor_proxy=extractor_proxy, output_dir=output_dir, merge=merge, info_only=info_only, json_output=json_output, caption=caption) + extra['stream_id'] = stream_id + download_main( + download, download_playlist, + URLs, args.playlist, + output_dir=args.output_dir, merge=not args.no_merge, + info_only=info_only, json_output=json_output, caption=caption, + password=args.password, + **extra + ) except KeyboardInterrupt: - if traceback: + if args.debug: raise else: sys.exit(1) except UnicodeEncodeError: - log.e('[error] oops, the current environment does not seem to support Unicode.') + if args.debug: + raise + log.e( + '[error] oops, the current environment does not seem to support ' + 'Unicode.' + ) log.e('please set it to a UTF-8-aware locale first,') - log.e('so as to save the video (with some Unicode characters) correctly.') + log.e( + 'so as to save the video (with some Unicode characters) correctly.' + ) log.e('you can do it like this:') log.e(' (Windows) % chcp 65001 ') log.e(' (Linux) $ LC_CTYPE=en_US.UTF-8') sys.exit(1) except Exception: - if not traceback: + if not args.debug: log.e('[error] oops, something went wrong.') - log.e('don\'t panic, c\'est la vie. please try the following steps:') + log.e( + 'don\'t panic, c\'est la vie. 
please try the following steps:' + ) log.e(' (1) Rule out any network problem.') log.e(' (2) Make sure you-get is up-to-date.') log.e(' (3) Check if the issue is already known, on') @@ -1326,63 +1720,80 @@ def script_main(script_name, download, download_playlist, **kwargs): log.e(' (4) Run the command with \'--debug\' option,') log.e(' and report this issue with the full output.') else: - version() + print_version() log.i(args) raise sys.exit(1) + def google_search(url): keywords = r1(r'https?://(.*)', url) url = 'https://www.google.com/search?tbm=vid&q=%s' % parse.quote(keywords) page = get_content(url, headers=fake_headers) - videos = re.findall(r'([^<]+)<', page) - vdurs = re.findall(r'([^<]+)<', page) + videos = re.findall( + r'
([^<]+)<', page + ) + vdurs = re.findall(r'([^<]+)<', page) durs = [r1(r'(\d+:\d+)', unescape_html(dur)) for dur in vdurs] - print("Google Videos search:") + print('Google Videos search:') for v in zip(videos, durs): - print("- video: %s [%s]" % (unescape_html(v[0][1]), - v[1] if v[1] else '?')) - print("# you-get %s" % log.sprint(v[0][0], log.UNDERLINE)) + print('- video: {} [{}]'.format( + unescape_html(v[0][1]), + v[1] if v[1] else '?' + )) + print('# you-get %s' % log.sprint(v[0][0], log.UNDERLINE)) print() - print("Best matched result:") + print('Best matched result:') return(videos[0][0]) + def url_to_module(url): try: video_host = r1(r'https?://([^/]+)/', url) video_url = r1(r'https?://[^/]+(.*)', url) assert video_host and video_url - except: + except AssertionError: url = google_search(url) video_host = r1(r'https?://([^/]+)/', url) video_url = r1(r'https?://[^/]+(.*)', url) - if video_host.endswith('.com.cn'): + if video_host.endswith('.com.cn') or video_host.endswith('.ac.cn'): video_host = video_host[:-3] domain = r1(r'(\.[^.]+\.[^.]+)$', video_host) or video_host assert domain, 'unsupported url: ' + url + # all non-ASCII code points must be quoted (percent-encoded UTF-8) + url = ''.join([ch if ord(ch) in range(128) else parse.quote(ch) for ch in url]) + video_host = r1(r'https?://([^/]+)/', url) + video_url = r1(r'https?://[^/]+(.*)', url) + k = r1(r'([^.]+)', domain) if k in SITES: - return import_module('.'.join(['you_get', 'extractors', SITES[k]])), url + return ( + import_module('.'.join(['you_get', 'extractors', SITES[k]])), + url + ) else: - import http.client - conn = http.client.HTTPConnection(video_host) - conn.request("HEAD", video_url, headers=fake_headers) - res = conn.getresponse() - location = res.getheader('location') + try: + location = get_location(url) # t.co isn't happy with fake_headers + except: + location = get_location(url, headers=fake_headers) + if location and location != url and not location.startswith('/'): return url_to_module(location) else: return import_module('you_get.extractors.universal'), url + def any_download(url, **kwargs): m, url = url_to_module(url) m.download(url, **kwargs) + def any_download_playlist(url, **kwargs): m, url = url_to_module(url) m.download_playlist(url, **kwargs) + def main(**kwargs): - script_main('you-get', any_download, any_download_playlist, **kwargs) + script_main(any_download, any_download_playlist, **kwargs) diff --git a/src/you_get/extractor.py b/src/you_get/extractor.py index 594b908e..c4315935 100644 --- a/src/you_get/extractor.py +++ b/src/you_get/extractor.py @@ -1,10 +1,11 @@ #!/usr/bin/env python -from .common import match1, maybe_print, download_urls, get_filename, parse_host, set_proxy, unset_proxy +from .common import match1, maybe_print, download_urls, get_filename, parse_host, set_proxy, unset_proxy, get_content, dry_run, player from .common import print_more_compatible as print from .util import log from . 
import json_output import os +import sys class Extractor(): def __init__(self, *args): @@ -22,12 +23,18 @@ class VideoExtractor(): self.url = None self.title = None self.vid = None + self.m3u8_url = None self.streams = {} self.streams_sorted = [] self.audiolang = None self.password_protected = False self.dash_streams = {} self.caption_tracks = {} + self.out = False + self.ua = None + self.referer = None + self.danmaku = None + self.lyrics = None if args: self.url = args[0] @@ -39,6 +46,8 @@ class VideoExtractor(): if 'extractor_proxy' in kwargs and kwargs['extractor_proxy']: set_proxy(parse_host(kwargs['extractor_proxy'])) self.prepare(**kwargs) + if self.out: + return if 'extractor_proxy' in kwargs and kwargs['extractor_proxy']: unset_proxy() @@ -98,8 +107,12 @@ class VideoExtractor(): if 'quality' in stream: print(" quality: %s" % stream['quality']) - if 'size' in stream: - print(" size: %s MiB (%s bytes)" % (round(stream['size'] / 1048576, 1), stream['size'])) + if 'size' in stream and 'container' in stream and stream['container'].lower() != 'm3u8': + if stream['size'] != float('inf') and stream['size'] != 0: + print(" size: %s MiB (%s bytes)" % (round(stream['size'] / 1048576, 1), stream['size'])) + + if 'm3u8_url' in stream: + print(" m3u8_url: {}".format(stream['m3u8_url'])) if 'itag' in stream: print(" # download-with: %s" % log.sprint("you-get --itag=%s [URL]" % stream_id, log.UNDERLINE)) @@ -119,6 +132,8 @@ class VideoExtractor(): print(" url: %s" % self.url) print() + sys.stdout.flush() + def p(self, stream_id=None): maybe_print("site: %s" % self.__class__.name) maybe_print("title: %s" % self.title) @@ -143,9 +158,10 @@ class VideoExtractor(): for stream in itags: self.p_stream(stream) # Print all other available streams - print(" [ DEFAULT ] %s" % ('_' * 33)) - for stream in self.streams_sorted: - self.p_stream(stream['id'] if 'id' in stream else stream['itag']) + if self.streams_sorted: + print(" [ DEFAULT ] %s" % ('_' * 33)) + for stream in self.streams_sorted: + self.p_stream(stream['id'] if 'id' in stream else stream['itag']) if self.audiolang: print("audio-languages:") @@ -153,6 +169,8 @@ class VideoExtractor(): print(" - lang: {}".format(i['lang'])) print(" download-url: {}\n".format(i['url'])) + sys.stdout.flush() + def p_playlist(self, stream_id=None): maybe_print("site: %s" % self.__class__.name) print("playlist: %s" % self.title) @@ -183,7 +201,14 @@ class VideoExtractor(): stream_id = kwargs['stream_id'] else: # Download stream with the best quality - stream_id = self.streams_sorted[0]['id'] if 'id' in self.streams_sorted[0] else self.streams_sorted[0]['itag'] + from .processor.ffmpeg import has_ffmpeg_installed + if has_ffmpeg_installed() and player is None and self.dash_streams or not self.streams_sorted: + #stream_id = list(self.dash_streams)[-1] + itags = sorted(self.dash_streams, + key=lambda i: -self.dash_streams[i]['size']) + stream_id = itags[0] + else: + stream_id = self.streams_sorted[0]['id'] if 'id' in self.streams_sorted[0] else self.streams_sorted[0]['itag'] if 'index' not in kwargs: self.p(stream_id) @@ -199,16 +224,26 @@ class VideoExtractor(): ext = self.dash_streams[stream_id]['container'] total_size = self.dash_streams[stream_id]['size'] + if ext == 'm3u8' or ext == 'm4a': + ext = 'mp4' + if not urls: log.wtf('[Failed] Cannot extract video source.') # For legacy main() - download_urls(urls, self.title, ext, total_size, + headers = {} + if self.ua is not None: + headers['User-Agent'] = self.ua + if self.referer is not None: + headers['Referer'] = 
self.referer + download_urls(urls, self.title, ext, total_size, headers=headers, output_dir=kwargs['output_dir'], merge=kwargs['merge'], av=stream_id in self.dash_streams) - if not kwargs['caption']: - print('Skipping captions.') + + if 'caption' not in kwargs or not kwargs['caption']: + print('Skipping captions or danmaku.') return + for lang in self.caption_tracks: filename = '%s.%s.srt' % (get_filename(self.title), lang) print('Saving %s ... ' % filename, end="", flush=True) @@ -218,7 +253,20 @@ class VideoExtractor(): x.write(srt) print('Done.') + if self.danmaku is not None and not dry_run: + filename = '{}.cmt.xml'.format(get_filename(self.title)) + print('Downloading {} ...\n'.format(filename)) + with open(os.path.join(kwargs['output_dir'], filename), 'w', encoding='utf8') as fp: + fp.write(self.danmaku) + + if self.lyrics is not None and not dry_run: + filename = '{}.lrc'.format(get_filename(self.title)) + print('Downloading {} ...\n'.format(filename)) + with open(os.path.join(kwargs['output_dir'], filename), 'w', encoding='utf8') as fp: + fp.write(self.lyrics) + # For main_dev() #download_urls(urls, self.title, self.streams[stream_id]['container'], self.streams[stream_id]['size']) - - self.__init__() + keep_obj = kwargs.get('keep_obj', False) + if not keep_obj: + self.__init__() diff --git a/src/you_get/extractors/__init__.py b/src/you_get/extractors/__init__.py index e69bc2fd..4280d236 100755 --- a/src/you_get/extractors/__init__.py +++ b/src/you_get/extractors/__init__.py @@ -11,9 +11,10 @@ from .bokecc import * from .cbs import * from .ckplayer import * from .cntv import * +from .coub import * from .dailymotion import * -from .dilidili import * from .douban import * +from .douyin import * from .douyutv import * from .ehow import * from .facebook import * @@ -23,7 +24,7 @@ from .freesound import * from .funshion import * from .google import * from .heavymusic import * -from .huaban import * +from .icourses import * from .ifeng import * from .imgur import * from .infoq import * @@ -32,12 +33,15 @@ from .interest import * from .iqilu import * from .iqiyi import * from .joy import * -from .jpopsuki import * +from .khan import * from .ku6 import * +from .kakao import * +from .kuaishou import * from .kugou import * from .kuwo import * from .le import * from .lizhi import * +from .longzhu import * from .magisto import * from .metacafe import * from .mgtv import * @@ -45,41 +49,41 @@ from .miaopai import * from .miomio import * from .mixcloud import * from .mtv81 import * -from .musicplayon import * from .nanagogo import * from .naver import * from .netease import * from .nicovideo import * -from .panda import * from .pinterest import * from .pixnet import * from .pptv import * -from .qianmo import * from .qie import * +from .qingting import * from .qq import * from .showroom import * from .sina import * from .sohu import * from .soundcloud import * from .suntv import * +from .ted import * from .theplatform import * -from .thvideo import * +from .tiktok import * from .tucao import * from .tudou import * from .tumblr import * from .twitter import * +from .ucas import * from .veoh import * -from .videomega import * from .vimeo import * from .vine import * from .vk import * from .w56 import * from .wanmen import * from .xiami import * +from .xinpianchang import * from .yinyuetai import * from .yixia import * from .youku import * from .youtube import * -from .ted import * -from .khan import * from .zhanqi import * +from .zhibo import * +from .zhihu import * \ No newline at end of file diff 
--git a/src/you_get/extractors/acfun.py b/src/you_get/extractors/acfun.py index 87e005fb..cd275927 100644 --- a/src/you_get/extractors/acfun.py +++ b/src/you_get/extractors/acfun.py @@ -1,92 +1,213 @@ #!/usr/bin/env python -__all__ = ['acfun_download'] - from ..common import * +from ..extractor import VideoExtractor -from .le import letvcloud_download_by_vu -from .qq import qq_download_by_vid -from .sina import sina_download_by_vid -from .tudou import tudou_download_by_iid -from .youku import youku_download_by_vid, youku_open_download_by_vid +class AcFun(VideoExtractor): + name = "AcFun" -import json, re + stream_types = [ + {'id': '2160P', 'qualityType': '2160p'}, + {'id': '1080P60', 'qualityType': '1080p60'}, + {'id': '720P60', 'qualityType': '720p60'}, + {'id': '1080P+', 'qualityType': '1080p+'}, + {'id': '1080P', 'qualityType': '1080p'}, + {'id': '720P', 'qualityType': '720p'}, + {'id': '540P', 'qualityType': '540p'}, + {'id': '360P', 'qualityType': '360p'} + ] -def get_srt_json(id): - url = 'http://danmu.aixifan.com/V2/%s' % id - return get_html(url) + def prepare(self, **kwargs): + assert re.match(r'https?://[^\.]*\.*acfun\.[^\.]+/(\D|bangumi)/\D\D(\d+)', self.url) -def acfun_download_by_vid(vid, title, output_dir='.', merge=True, info_only=False, **kwargs): - """str, str, str, bool, bool ->None + if re.match(r'https?://[^\.]*\.*acfun\.[^\.]+/\D/\D\D(\d+)', self.url): + html = get_content(self.url, headers=fake_headers) + json_text = match1(html, r"(?s)videoInfo\s*=\s*(\{.*?\});") + json_data = json.loads(json_text) + vid = json_data.get('currentVideoInfo').get('id') + up = json_data.get('user').get('name') + self.title = json_data.get('title') + video_list = json_data.get('videoList') + if len(video_list) > 1: + self.title += " - " + [p.get('title') for p in video_list if p.get('id') == vid][0] + currentVideoInfo = json_data.get('currentVideoInfo') - Download Acfun video by vid. + elif re.match("https?://[^\.]*\.*acfun\.[^\.]+/bangumi/aa(\d+)", self.url): + html = get_content(self.url, headers=fake_headers) + tag_script = match1(html, r'') + json_text = tag_script[tag_script.find('{') : tag_script.find('};') + 1] + json_data = json.loads(json_text) + self.title = json_data['bangumiTitle'] + " " + json_data['episodeName'] + " " + json_data['title'] + vid = str(json_data['videoId']) + up = "acfun" + currentVideoInfo = json_data.get('currentVideoInfo') - Call Acfun API, decide which site to use, and pass the job to its - extractor. 
-    """
+        else:
+            raise NotImplementedError
-    #first call the main parasing API
-    info = json.loads(get_html('http://www.acfun.tv/video/getVideo.aspx?id=' + vid))
+        if 'ksPlayJson' in currentVideoInfo:
+            durationMillis = currentVideoInfo['durationMillis']
+            ksPlayJson = json.loads(currentVideoInfo['ksPlayJson'])
+            representation = ksPlayJson.get('adaptationSet')[0].get('representation')
+            stream_list = representation
-    sourceType = info['sourceType']
+        for stream in stream_list:
+            m3u8_url = stream["url"]
+            size = durationMillis * stream["avgBitrate"] / 8
+            # size = float('inf')
+            container = 'mp4'
+            stream_id = stream["qualityLabel"]
+            quality = stream["qualityType"]
+
+            stream_data = dict(src=m3u8_url, size=size, container=container, quality=quality)
+            self.streams[stream_id] = stream_data
-    #decide sourceId to know which extractor to use
-    if 'sourceId' in info: sourceId = info['sourceId']
-    # danmakuId = info['danmakuId']
+        assert self.title and m3u8_url
+        self.title = unescape_html(self.title)
+        self.title = escape_file_path(self.title)
+        p_title = r1('active">([^<]+)', html)
+        self.title = '%s (%s)' % (self.title, up)
+        if p_title:
+            self.title = '%s - %s' % (self.title, p_title)
-    #call extractor decided by sourceId
-    if sourceType == 'sina':
-        sina_download_by_vid(sourceId, title, output_dir=output_dir, merge=merge, info_only=info_only)
-    elif sourceType == 'youku':
-        youku_download_by_vid(sourceId, title=title, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)
-    elif sourceType == 'tudou':
-        tudou_download_by_iid(sourceId, title, output_dir=output_dir, merge=merge, info_only=info_only)
-    elif sourceType == 'qq':
-        qq_download_by_vid(sourceId, title, output_dir=output_dir, merge=merge, info_only=info_only)
-    elif sourceType == 'letv':
-        letvcloud_download_by_vu(sourceId, '2d8c027396', title, output_dir=output_dir, merge=merge, info_only=info_only)
-    elif sourceType == 'zhuzhan':
-        #As in Jul.28.2016, Acfun is using embsig to anti hotlink so we need to pass this
-        embsig = info['encode']
-        a = 'http://api.aixifan.com/plays/%s' % vid
-        s = json.loads(get_content(a, headers={'deviceType': '2'}))
-        if s['data']['source'] == "zhuzhan-youku":
-            sourceId = s['data']['sourceId']
-            youku_open_download_by_vid(client_id='908a519d032263f8', vid=sourceId, title=title, output_dir=output_dir,merge=merge, info_only=info_only, embsig = embsig, **kwargs)
-    else:
-        raise NotImplementedError(sourceType)
-    if not info_only and not dry_run:
-        if not kwargs['caption']:
-            print('Skipping danmaku.')
-            return
-        try:
-            title = get_filename(title)
-            print('Downloading %s ...\n' % (title + '.cmt.json'))
-            cmt = get_srt_json(vid)
-            with open(os.path.join(output_dir, title + '.cmt.json'), 'w', encoding='utf-8') as x:
-                x.write(cmt)
-        except:
-            pass
+    def download(self, **kwargs):
+        if 'json_output' in kwargs and kwargs['json_output']:
+            json_output.output(self)
+        elif 'info_only' in kwargs and kwargs['info_only']:
+            if 'stream_id' in kwargs and kwargs['stream_id']:
+                # Display the stream
+                stream_id = kwargs['stream_id']
+                if 'index' not in kwargs:
+                    self.p(stream_id)
+                else:
+                    self.p_i(stream_id)
+            else:
+                # Display all available streams
+                if 'index' not in kwargs:
+                    self.p([])
+                else:
+                    stream_id = self.streams_sorted[0]['id'] if 'id' in self.streams_sorted[0] else self.streams_sorted[0]['itag']
+                    self.p_i(stream_id)

-def acfun_download(url, output_dir='.', merge=True, info_only=False, **kwargs):
-    assert re.match(r'http://[^\.]+.acfun.[^\.]+/\D/\D\D(\d+)', url)
-    html = get_html(url)
+ else: + if 'stream_id' in kwargs and kwargs['stream_id']: + # Download the stream + stream_id = kwargs['stream_id'] + else: + stream_id = self.streams_sorted[0]['id'] if 'id' in self.streams_sorted[0] else self.streams_sorted[0]['itag'] - title = r1(r'data-title="([^"]+)"', html) - title = unescape_html(title) - title = escape_file_path(title) - assert title + if 'index' not in kwargs: + self.p(stream_id) + else: + self.p_i(stream_id) + if stream_id in self.streams: + url = self.streams[stream_id]['src'] + ext = self.streams[stream_id]['container'] + total_size = self.streams[stream_id]['size'] - vid = r1('data-vid="(\d+)"', html) - up = r1('data-name="([^"]+)"', html) - title = title + ' - ' + up - acfun_download_by_vid(vid, title, - output_dir=output_dir, - merge=merge, - info_only=info_only, - **kwargs) -site_info = "AcFun.tv" -download = acfun_download + if ext == 'm3u8' or ext == 'm4a': + ext = 'mp4' + + if not url: + log.wtf('[Failed] Cannot extract video source.') + # For legacy main() + headers = {} + if self.ua is not None: + headers['User-Agent'] = self.ua + if self.referer is not None: + headers['Referer'] = self.referer + + download_url_ffmpeg(url, self.title, ext, output_dir=kwargs['output_dir'], merge=kwargs['merge']) + + if 'caption' not in kwargs or not kwargs['caption']: + print('Skipping captions or danmaku.') + return + + for lang in self.caption_tracks: + filename = '%s.%s.srt' % (get_filename(self.title), lang) + print('Saving %s ... ' % filename, end="", flush=True) + srt = self.caption_tracks[lang] + with open(os.path.join(kwargs['output_dir'], filename), + 'w', encoding='utf-8') as x: + x.write(srt) + print('Done.') + + if self.danmaku is not None and not dry_run: + filename = '{}.cmt.xml'.format(get_filename(self.title)) + print('Downloading {} ...\n'.format(filename)) + with open(os.path.join(kwargs['output_dir'], filename), 'w', encoding='utf8') as fp: + fp.write(self.danmaku) + + if self.lyrics is not None and not dry_run: + filename = '{}.lrc'.format(get_filename(self.title)) + print('Downloading {} ...\n'.format(filename)) + with open(os.path.join(kwargs['output_dir'], filename), 'w', encoding='utf8') as fp: + fp.write(self.lyrics) + + # For main_dev() + #download_urls(urls, self.title, self.streams[stream_id]['container'], self.streams[stream_id]['size']) + keep_obj = kwargs.get('keep_obj', False) + if not keep_obj: + self.__init__() + + + def acfun_download(self, url, output_dir='.', merge=True, info_only=False, **kwargs): + assert re.match(r'https?://[^\.]*\.*acfun\.[^\.]+/(\D|bangumi)/\D\D(\d+)', url) + + def getM3u8UrlFromCurrentVideoInfo(currentVideoInfo): + if 'playInfos' in currentVideoInfo: + return currentVideoInfo['playInfos'][0]['playUrls'][0] + elif 'ksPlayJson' in currentVideoInfo: + ksPlayJson = json.loads( currentVideoInfo['ksPlayJson'] ) + representation = ksPlayJson.get('adaptationSet')[0].get('representation') + reps = [] + for one in representation: + reps.append( (one['width']* one['height'], one['url'], one['backupUrl']) ) + return max(reps)[1] + + + if re.match(r'https?://[^\.]*\.*acfun\.[^\.]+/\D/\D\D(\d+)', url): + html = get_content(url, headers=fake_headers) + json_text = match1(html, r"(?s)videoInfo\s*=\s*(\{.*?\});") + json_data = json.loads(json_text) + vid = json_data.get('currentVideoInfo').get('id') + up = json_data.get('user').get('name') + title = json_data.get('title') + video_list = json_data.get('videoList') + if len(video_list) > 1: + title += " - " + [p.get('title') for p in video_list if p.get('id') == vid][0] + 
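+            # Editorial note: getM3u8UrlFromCurrentVideoInfo (defined above) picks
+            # the highest-resolution representation (max width*height) from the
+            # ksPlayJson adaptation set and returns its primary URL.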
currentVideoInfo = json_data.get('currentVideoInfo') + m3u8_url = getM3u8UrlFromCurrentVideoInfo(currentVideoInfo) + elif re.match("https?://[^\.]*\.*acfun\.[^\.]+/bangumi/aa(\d+)", url): + html = get_content(url, headers=fake_headers) + tag_script = match1(html, r'') + json_text = tag_script[tag_script.find('{') : tag_script.find('};') + 1] + json_data = json.loads(json_text) + title = json_data['bangumiTitle'] + " " + json_data['episodeName'] + " " + json_data['title'] + vid = str(json_data['videoId']) + up = "acfun" + + currentVideoInfo = json_data.get('currentVideoInfo') + m3u8_url = getM3u8UrlFromCurrentVideoInfo(currentVideoInfo) + + else: + raise NotImplemented + + assert title and m3u8_url + title = unescape_html(title) + title = escape_file_path(title) + p_title = r1('active">([^<]+)', html) + title = '%s (%s)' % (title, up) + if p_title: + title = '%s - %s' % (title, p_title) + + print_info(site_info, title, 'm3u8', float('inf')) + if not info_only: + download_url_ffmpeg(m3u8_url, title, 'mp4', output_dir=output_dir, merge=merge) + +site = AcFun() +site_info = "AcFun.cn" +download = site.download_by_url download_playlist = playlist_not_supported('acfun') diff --git a/src/you_get/extractors/baidu.py b/src/you_get/extractors/baidu.py index d5efaf0b..521d5e99 100644 --- a/src/you_get/extractors/baidu.py +++ b/src/you_get/extractors/baidu.py @@ -38,7 +38,7 @@ def baidu_get_song_title(data): def baidu_get_song_lyric(data): lrc = data['lrcLink'] - return None if lrc is '' else "http://music.baidu.com%s" % lrc + return "http://music.baidu.com%s" % lrc if lrc else None def baidu_download_song(sid, output_dir='.', merge=True, info_only=False): @@ -104,42 +104,54 @@ def baidu_download_album(aid, output_dir='.', merge=True, info_only=False): def baidu_download(url, output_dir='.', stream_type=None, merge=True, info_only=False, **kwargs): - if re.match(r'http://pan.baidu.com', url): + if re.match(r'https?://pan.baidu.com', url): real_url, title, ext, size = baidu_pan_download(url) + print_info('BaiduPan', title, ext, size) if not info_only: + print('Hold on...') + time.sleep(5) download_urls([real_url], title, ext, size, output_dir, url, merge=merge, faker=True) - elif re.match(r'http://music.baidu.com/album/\d+', url): - id = r1(r'http://music.baidu.com/album/(\d+)', url) + elif re.match(r'https?://music.baidu.com/album/\d+', url): + id = r1(r'https?://music.baidu.com/album/(\d+)', url) baidu_download_album(id, output_dir, merge, info_only) - elif re.match('http://music.baidu.com/song/\d+', url): - id = r1(r'http://music.baidu.com/song/(\d+)', url) + elif re.match('https?://music.baidu.com/song/\d+', url): + id = r1(r'https?://music.baidu.com/song/(\d+)', url) baidu_download_song(id, output_dir, merge, info_only) - elif re.match('http://tieba.baidu.com/', url): + elif re.match('https?://tieba.baidu.com/', url): try: # embedded videos - embed_download(url, output_dir, merge=merge, info_only=info_only) + embed_download(url, output_dir, merge=merge, info_only=info_only, **kwargs) except: # images html = get_html(url) title = r1(r'title:"([^"]+)"', html) + vhsrc = re.findall(r'"BDE_Image"[^>]+src="([^"]+\.mp4)"', html) or \ + re.findall(r'vhsrc="([^"]+)"', html) + if len(vhsrc) > 0: + ext = 'mp4' + size = url_size(vhsrc[0]) + print_info(site_info, title, ext, size) + if not info_only: + download_urls(vhsrc, title, ext, size, + output_dir=output_dir, merge=False) + items = re.findall( - r'//imgsrc.baidu.com/forum/w[^"]+/([^/"]+)', html) - urls = ['http://imgsrc.baidu.com/forum/pic/item/' + i + 
r'//tiebapic.baidu.com/forum/w[^"]+/([^/"]+)', html) + urls = ['http://tiebapic.baidu.com/forum/pic/item/' + i for i in set(items)] # handle albums kw = r1(r'kw=([^&]+)', html) or r1(r"kw:'([^']+)'", html) tid = r1(r'tid=(\d+)', html) or r1(r"tid:'([^']+)'", html) - album_url = 'http://tieba.baidu.com/photo/g/bw/picture/list?kw=%s&tid=%s' % ( - kw, tid) + album_url = 'http://tieba.baidu.com/photo/g/bw/picture/list?kw=%s&tid=%s&pe=%s' % (kw, tid, 1000) album_info = json.loads(get_content(album_url)) for i in album_info['data']['pic_list']: urls.append( - 'http://imgsrc.baidu.com/forum/pic/item/' + i['pic_id'] + '.jpg') + 'http://tiebapic.baidu.com/forum/pic/item/' + i['pic_id'] + '.jpg') ext = 'jpg' size = float('Inf') @@ -210,9 +222,6 @@ def baidu_pan_download(url): title_wrapped = json.loads('{"wrapper":"%s"}' % title) title = title_wrapped['wrapper'] logging.debug(real_url) - print_info(site_info, title, ext, size) - print('Hold on...') - time.sleep(5) return real_url, title, ext, size diff --git a/src/you_get/extractors/baomihua.py b/src/you_get/extractors/baomihua.py index 4c4febb7..9e97879a 100644 --- a/src/you_get/extractors/baomihua.py +++ b/src/you_get/extractors/baomihua.py @@ -6,6 +6,16 @@ from ..common import * import urllib +def baomihua_headers(referer=None, cookie=None): + # a reasonable UA + ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36' + headers = {'Accept': '*/*', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': ua} + if referer is not None: + headers.update({'Referer': referer}) + if cookie is not None: + headers.update({'Cookie': cookie}) + return headers + def baomihua_download_by_id(id, title=None, output_dir='.', merge=True, info_only=False, **kwargs): html = get_html('http://play.baomihua.com/getvideourl.aspx?flvid=%s&devicetype=phone_app' % id) host = r1(r'host=([^&]*)', html) @@ -14,11 +24,12 @@ def baomihua_download_by_id(id, title=None, output_dir='.', merge=True, info_onl assert type vid = r1(r'&stream_name=([^&]*)', html) assert vid - url = "http://%s/pomoho_video/%s.%s" % (host, vid, type) - _, ext, size = url_info(url) + dir_str = r1(r'&dir=([^&]*)', html).strip() + url = "http://%s/%s/%s.%s" % (host, dir_str, vid, type) + _, ext, size = url_info(url, headers=baomihua_headers()) print_info(site_info, title, type, size) if not info_only: - download_urls([url], title, ext, size, output_dir, merge = merge) + download_urls([url], title, ext, size, output_dir, merge = merge, headers=baomihua_headers()) def baomihua_download(url, output_dir='.', merge=True, info_only=False, **kwargs): html = get_html(url) diff --git a/src/you_get/extractors/bilibili.py b/src/you_get/extractors/bilibili.py index c18290b8..cdcccf20 100644 --- a/src/you_get/extractors/bilibili.py +++ b/src/you_get/extractors/bilibili.py @@ -1,196 +1,770 @@ #!/usr/bin/env python -__all__ = ['bilibili_download'] - from ..common import * - -from .sina import sina_download_by_vid -from .tudou import tudou_download_by_id -from .youku import youku_download_by_vid +from ..extractor import VideoExtractor import hashlib -import re -appkey = 'f3bb208b3d081dc8' -SECRETKEY_MINILOADER = '1c15888dc316e05a15fdd0a02ed6584f' +class Bilibili(VideoExtractor): + name = "Bilibili" -def get_srt_xml(id): - url = 'http://comment.bilibili.com/%s.xml' % id - return get_html(url) + # Bilibili media encoding options, in descending quality order. 
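+    # Editorial note: 'quality' below is Bilibili's numeric qn code as sent to
+    # the playurl API, and 'audio_quality' is the matching DASH audio stream id.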
+ stream_types = [ + {'id': 'hdflv2_4k', 'quality': 120, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '2160p', 'desc': '超清 4K'}, + {'id': 'flv_p60', 'quality': 116, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '1080p', 'desc': '高清 1080P60'}, + {'id': 'hdflv2', 'quality': 112, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '1080p', 'desc': '高清 1080P+'}, + {'id': 'flv', 'quality': 80, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '1080p', 'desc': '高清 1080P'}, + {'id': 'flv720_p60', 'quality': 74, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '720p', 'desc': '高清 720P60'}, + {'id': 'flv720', 'quality': 64, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '720p', 'desc': '高清 720P'}, + {'id': 'hdmp4', 'quality': 48, 'audio_quality': 30280, + 'container': 'MP4', 'video_resolution': '720p', 'desc': '高清 720P (MP4)'}, + {'id': 'flv480', 'quality': 32, 'audio_quality': 30280, + 'container': 'FLV', 'video_resolution': '480p', 'desc': '清晰 480P'}, + {'id': 'flv360', 'quality': 16, 'audio_quality': 30216, + 'container': 'FLV', 'video_resolution': '360p', 'desc': '流畅 360P'}, + # 'quality': 15? + {'id': 'mp4', 'quality': 0}, + {'id': 'jpg', 'quality': 0}, + ] -def parse_srt_p(p): - fields = p.split(',') - assert len(fields) == 8, fields - time, mode, font_size, font_color, pub_time, pool, user_id, history = fields - time = float(time) - - mode = int(mode) - assert 1 <= mode <= 8 - # mode 1~3: scrolling - # mode 4: bottom - # mode 5: top - # mode 6: reverse? - # mode 7: position - # mode 8: advanced - - pool = int(pool) - assert 0 <= pool <= 2 - # pool 0: normal - # pool 1: srt - # pool 2: special? - - font_size = int(font_size) - - font_color = '#%06x' % int(font_color) - - return pool, mode, font_size, font_color - - -def parse_srt_xml(xml): - d = re.findall(r'(.*)', xml) - for x, y in d: - p = parse_srt_p(x) - raise NotImplementedError() - - -def parse_cid_playurl(xml): - from xml.dom.minidom import parseString - try: - doc = parseString(xml.encode('utf-8')) - urls = [durl.getElementsByTagName('url')[0].firstChild.nodeValue for durl in doc.getElementsByTagName('durl')] - return urls - except: - return [] - - -def bilibili_download_by_cids(cids, title, output_dir='.', merge=True, info_only=False): - urls = [] - for cid in cids: - sign_this = hashlib.md5(bytes('cid={cid}&from=miniplay&player=1{SECRETKEY_MINILOADER}'.format(cid = cid, SECRETKEY_MINILOADER = SECRETKEY_MINILOADER), 'utf-8')).hexdigest() - url = 'http://interface.bilibili.com/playurl?&cid=' + cid + '&from=miniplay&player=1' + '&sign=' + sign_this - urls += [i - if not re.match(r'.*\.qqvideo\.tc\.qq\.com', i) - else re.sub(r'.*\.qqvideo\.tc\.qq\.com', 'http://vsrc.store.qq.com', i) - for i in parse_cid_playurl(get_content(url))] - - type_ = '' - size = 0 - for url in urls: - _, type_, temp = url_info(url) - size += temp - - print_info(site_info, title, type_, size) - if not info_only: - download_urls(urls, title, type_, total_size=None, output_dir=output_dir, merge=merge) - - -def bilibili_download_by_cid(cid, title, output_dir='.', merge=True, info_only=False): - sign_this = hashlib.md5(bytes('cid={cid}&from=miniplay&player=1{SECRETKEY_MINILOADER}'.format(cid = cid, SECRETKEY_MINILOADER = SECRETKEY_MINILOADER), 'utf-8')).hexdigest() - url = 'http://interface.bilibili.com/playurl?&cid=' + cid + '&from=miniplay&player=1' + '&sign=' + sign_this - urls = [i - if not re.match(r'.*\.qqvideo\.tc\.qq\.com', i) - else 
re.sub(r'.*\.qqvideo\.tc\.qq\.com', 'http://vsrc.store.qq.com', i) - for i in parse_cid_playurl(get_content(url))] - - type_ = '' - size = 0 - for url in urls: - _, type_, temp = url_info(url) - size += temp or 0 - - print_info(site_info, title, type_, size) - if not info_only: - download_urls(urls, title, type_, total_size=None, output_dir=output_dir, merge=merge) - - -def bilibili_live_download_by_cid(cid, title, output_dir='.', merge=True, info_only=False): - api_url = 'http://live.bilibili.com/api/playurl?cid=' + cid - urls = parse_cid_playurl(get_content(api_url)) - - for url in urls: - _, type_, _ = url_info(url) - size = 0 - print_info(site_info, title, type_, size) - if not info_only: - download_urls([url], title, type_, total_size=None, output_dir=output_dir, merge=merge) - - -def bilibili_download(url, output_dir='.', merge=True, info_only=False, **kwargs): - html = get_content(url) - - if re.match(r'https?://bangumi\.bilibili\.com/', url): - # quick hack for bangumi URLs - url = r1(r'"([^"]+)" class="v-av-link"', html) - html = get_content(url) - - title = r1_of([r'', - r']*>\s*([^<>]+)\s*
'], html) - if title: - title = unescape_html(title) - title = escape_file_path(title) - - flashvars = r1_of([r'(cid=\d+)', r'(cid: \d+)', r'flashvars="([^"]+)"', - r'"https://[a-z]+\.bilibili\.com/secure,(cid=\d+)(?:&aid=\d+)?"'], html) - assert flashvars - flashvars = flashvars.replace(': ', '=') - t, cid = flashvars.split('=', 1) - cid = cid.split('&')[0] - if t == 'cid': - if re.match(r'https?://live\.bilibili\.com/', url): - title = r1(r'\s*([^<>]+)\s*', html) - bilibili_live_download_by_cid(cid, title, output_dir=output_dir, merge=merge, info_only=info_only) - + @staticmethod + def height_to_quality(height, qn): + if height <= 360 and qn <= 16: + return 16 + elif height <= 480 and qn <= 32: + return 32 + elif height <= 720 and qn <= 64: + return 64 + elif height <= 1080 and qn <= 80: + return 80 + elif height <= 1080 and qn <= 112: + return 112 else: - # multi-P - cids = [] - pages = re.findall('', html) - for i, page in enumerate(pages): - html = get_html("http://www.bilibili.com%s" % page) - flashvars = r1_of([r'(cid=\d+)', - r'flashvars="([^"]+)"', - r'"https://[a-z]+\.bilibili\.com/secure,(cid=\d+)(?:&aid=\d+)?"'], html) - if flashvars: - t, cid = flashvars.split('=', 1) - cids.append(cid.split('&')[0]) - if url.endswith(page): - cids = [cid.split('&')[0]] - titles = [titles[i]] - break + return 120 - # no multi-P - if not pages: - cids = [cid] - titles = [r1(r'', html) or title] + @staticmethod + def bilibili_headers(referer=None, cookie=None): + # a reasonable UA + ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36' + headers = {'Accept': '*/*', 'Accept-Language': 'en-US,en;q=0.5', 'User-Agent': ua} + if referer is not None: + headers.update({'Referer': referer}) + if cookie is not None: + headers.update({'Cookie': cookie}) + return headers - for i in range(len(cids)): - bilibili_download_by_cid(cids[i], - titles[i], - output_dir=output_dir, - merge=merge, - info_only=info_only) + @staticmethod + def bilibili_api(avid, cid, qn=0): + return 'https://api.bilibili.com/x/player/playurl?avid=%s&cid=%s&qn=%s&type=&otype=json&fnver=0&fnval=16&fourk=1' % (avid, cid, qn) - elif t == 'vid': - sina_download_by_vid(cid, title=title, output_dir=output_dir, merge=merge, info_only=info_only) - elif t == 'ykid': - youku_download_by_vid(cid, title=title, output_dir=output_dir, merge=merge, info_only=info_only) - elif t == 'uid': - tudou_download_by_id(cid, title, output_dir=output_dir, merge=merge, info_only=info_only) - else: - raise NotImplementedError(flashvars) + @staticmethod + def bilibili_audio_api(sid): + return 'https://www.bilibili.com/audio/music-service-c/web/url?sid=%s' % sid - if not info_only and not dry_run: - if not kwargs['caption']: - print('Skipping danmaku.') + @staticmethod + def bilibili_audio_info_api(sid): + return 'https://www.bilibili.com/audio/music-service-c/web/song/info?sid=%s' % sid + + @staticmethod + def bilibili_audio_menu_info_api(sid): + return 'https://www.bilibili.com/audio/music-service-c/web/menu/info?sid=%s' % sid + + @staticmethod + def bilibili_audio_menu_song_api(sid, ps=100): + return 'https://www.bilibili.com/audio/music-service-c/web/song/of-menu?sid=%s&pn=1&ps=%s' % (sid, ps) + + @staticmethod + def bilibili_bangumi_api(avid, cid, ep_id, qn=0, fnval=16): + return 'https://api.bilibili.com/pgc/player/web/playurl?avid=%s&cid=%s&qn=%s&type=&otype=json&ep_id=%s&fnver=0&fnval=%s' % (avid, cid, qn, ep_id, fnval) + + @staticmethod + def bilibili_interface_api(cid, qn=0): + 
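+        # Editorial note: the code below reverses `entropy` and shifts each
+        # character up by two code points, yielding the 'appkey:secret' pair
+        # used to sign this interface/v2 playurl request.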
+        entropy = 'rbMCKn@KuamXWlPMoJGsKcbiJKUfkPF_8dABscJntvqhRSETg'
+        appkey, sec = ''.join([chr(ord(i) + 2) for i in entropy[::-1]]).split(':')
+        params = 'appkey=%s&cid=%s&otype=json&qn=%s&quality=%s&type=' % (appkey, cid, qn, qn)
+        chksum = hashlib.md5(bytes(params + sec, 'utf8')).hexdigest()
+        return 'https://interface.bilibili.com/v2/playurl?%s&sign=%s' % (params, chksum)
+
+    @staticmethod
+    def bilibili_live_api(cid):
+        return 'https://api.live.bilibili.com/room/v1/Room/playUrl?cid=%s&quality=0&platform=web' % cid
+
+    @staticmethod
+    def bilibili_live_room_info_api(room_id):
+        return 'https://api.live.bilibili.com/room/v1/Room/get_info?room_id=%s' % room_id
+
+    @staticmethod
+    def bilibili_live_room_init_api(room_id):
+        return 'https://api.live.bilibili.com/room/v1/Room/room_init?id=%s' % room_id
+
+    @staticmethod
+    def bilibili_space_channel_api(mid, cid, pn=1, ps=100):
+        return 'https://api.bilibili.com/x/space/channel/video?mid=%s&cid=%s&pn=%s&ps=%s&order=0&jsonp=jsonp' % (mid, cid, pn, ps)
+
+    @staticmethod
+    def bilibili_space_favlist_api(fid, pn=1, ps=20):
+        return 'https://api.bilibili.com/x/v3/fav/resource/list?media_id=%s&pn=%s&ps=%s&order=mtime&type=0&tid=0&jsonp=jsonp' % (fid, pn, ps)
+
+    @staticmethod
+    def bilibili_space_video_api(mid, pn=1, ps=100):
+        return "https://api.bilibili.com/x/space/arc/search?mid=%s&pn=%s&ps=%s&tid=0&keyword=&order=pubdate&jsonp=jsonp" % (mid, pn, ps)
+
+    @staticmethod
+    def bilibili_vc_api(video_id):
+        return 'https://api.vc.bilibili.com/clip/v1/video/detail?video_id=%s' % video_id
+
+    @staticmethod
+    def bilibili_h_api(doc_id):
+        return 'https://api.vc.bilibili.com/link_draw/v1/doc/detail?doc_id=%s' % doc_id
+
+    @staticmethod
+    def url_size(url, faker=False, headers={},err_value=0):
+        try:
+            return url_size(url,faker,headers)
+        except:
+            return err_value
+
+    def prepare(self, **kwargs):
+        self.stream_qualities = {s['quality']: s for s in self.stream_types}
+
+        try:
+            html_content = get_content(self.url, headers=self.bilibili_headers(referer=self.url))
+        except:
+            html_content = '' # live always returns 400 (why?)
+        #self.title = match1(html_content,
+        #                    r'<h1 title="([^"]+)"')
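+        # rewrite legacy bangumi URLs to their bangumi/play/ep form
+        # before sorting out the page type below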
+        # redirect: bangumi/play/ss -> bangumi/play/ep
+        # redirect: bangumi.bilibili.com/anime -> bangumi/play/ep
+        elif re.match(r'https?://(www\.)?bilibili\.com/bangumi/play/ss(\d+)', self.url) or \
+            re.match(r'https?://bangumi\.bilibili\.com/anime/(\d+)/play', self.url):
+            initial_state_text = match1(html_content, r'__INITIAL_STATE__=(.*?);\(function\(\)') # FIXME
+            initial_state = json.loads(initial_state_text)
+            ep_id = initial_state['epList'][0]['id']
+            self.url = 'https://www.bilibili.com/bangumi/play/ep%s' % ep_id
+            html_content = get_content(self.url, headers=self.bilibili_headers(referer=self.url))
+
+        # sort it out
+        if re.match(r'https?://(www\.)?bilibili\.com/audio/au(\d+)', self.url):
+            sort = 'audio'
+        elif re.match(r'https?://(www\.)?bilibili\.com/bangumi/play/ep(\d+)', self.url):
+            sort = 'bangumi'
+        elif match1(html_content, r'
))', html)
+    json_data = json.loads(coub_data)
+    return json_data
+
+
+def get_file_path(merge, output_dir, title, url):
+    mime, ext, size = url_info(url)
+    file_name = get_output_filename([], title, ext, output_dir, merge)
+    file_path = os.path.join(output_dir, file_name)
+    return file_name, file_path
+
+
+def get_loop_file_path(title, output_dir):
+    return os.path.join(output_dir, get_output_filename([], title, "txt", None, False))
+
+
+def cleanup_files(files):
+    for file in files:
+        os.remove(file)
+
+
+site_info = "coub.com"
+download = coub_download
+download_playlist = playlist_not_supported('coub')
diff --git a/src/you_get/extractors/dailymotion.py b/src/you_get/extractors/dailymotion.py
index 8b701cd1..789dff45 100644
--- a/src/you_get/extractors/dailymotion.py
+++ b/src/you_get/extractors/dailymotion.py
@@ -3,29 +3,36 @@ __all__ = ['dailymotion_download']
 
 from ..common import *
+import urllib.parse
 
-def dailymotion_download(url, output_dir = '.', merge = True, info_only = False, **kwargs):
+def rebuilt_url(url):
+    path = urllib.parse.urlparse(url).path
+    aid = path.split('/')[-1].split('_')[0]
+    return 'http://www.dailymotion.com/embed/video/{}?autoplay=1'.format(aid)
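+# For instance (hypothetical video id):
+#   rebuilt_url('https://www.dailymotion.com/video/x7abcde_some-title')
+#   returns 'http://www.dailymotion.com/embed/video/x7abcde?autoplay=1'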
""" - html = get_content(url) + html = get_content(rebuilt_url(url)) info = json.loads(match1(html, r'qualities":({.+?}),"')) title = match1(html, r'"video_title"\s*:\s*"([^"]+)"') or \ match1(html, r'"title"\s*:\s*"([^"]+)"') + title = unicodize(title) - for quality in ['720','480','380','240','auto']: + for quality in ['1080','720','480','380','240','144','auto']: try: - real_url = info[quality][0]["url"] + real_url = info[quality][1]["url"] if real_url: break except KeyError: pass - type, ext, size = url_info(real_url) + mime, ext, size = url_info(real_url) - print_info(site_info, title, type, size) + print_info(site_info, title, mime, size) if not info_only: - download_urls([real_url], title, ext, size, output_dir, merge = merge) + download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge) site_info = "Dailymotion.com" download = dailymotion_download diff --git a/src/you_get/extractors/dilidili.py b/src/you_get/extractors/dilidili.py deleted file mode 100644 index 082f84e1..00000000 --- a/src/you_get/extractors/dilidili.py +++ /dev/null @@ -1,77 +0,0 @@ -#!/usr/bin/env python - -__all__ = ['dilidili_download'] - -from ..common import * -from .ckplayer import ckplayer_download - -headers = { - 'DNT': '1', - 'Accept-Encoding': 'gzip, deflate, sdch, br', - 'Accept-Language': 'en-CA,en;q=0.8,en-US;q=0.6,zh-CN;q=0.4,zh;q=0.2', - 'Upgrade-Insecure-Requests': '1', - 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36', - 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', - 'Cache-Control': 'max-age=0', - 'Referer': 'http://www.dilidili.com/', - 'Connection': 'keep-alive', - 'Save-Data': 'on', -} - -#---------------------------------------------------------------------- -def dilidili_parser_data_to_stream_types(typ ,vid ,hd2 ,sign, tmsign, ulk): - """->list""" - parse_url = 'http://player.005.tv/parse.php?xmlurl=null&type={typ}&vid={vid}&hd={hd2}&sign={sign}&tmsign={tmsign}&userlink={ulk}'.format(typ = typ, vid = vid, hd2 = hd2, sign = sign, tmsign = tmsign, ulk = ulk) - html = get_content(parse_url, headers=headers) - - info = re.search(r'(\{[^{]+\})(\{[^{]+\})(\{[^{]+\})(\{[^{]+\})(\{[^{]+\})', html).groups() - info = [i.strip('{}').split('->') for i in info] - info = {i[0]: i [1] for i in info} - - stream_types = [] - for i in zip(info['deft'].split('|'), info['defa'].split('|')): - stream_types.append({'id': str(i[1][-1]), 'container': 'mp4', 'video_profile': i[0]}) - return stream_types - -#---------------------------------------------------------------------- -def dilidili_download(url, output_dir = '.', merge = False, info_only = False, **kwargs): - if re.match(r'http://www.dilidili.com/watch\S+', url): - html = get_content(url) - title = match1(html, r'(.+)丨(.+)') #title - - # player loaded via internal iframe - frame_url = re.search(r'