Tuesday 2 September 2008

Getting started with wget (repost)

wget feels a bit more pleasant than using an ftp client, heh.

http://redhat.ecenter.idv.tw/bbs/showthread.php?threadid=39222

First, a quick introduction: wget is one of the programs developed by the GNU project. It comes in Linux and Windows versions that are used identically, requires no registration or cracking, and places no restrictions on ordinary use. A little tempted yet?

Next, how to use it. Because wget is built for "ripping" entire sites, it has no interactive interface at all; everything runs in text mode. So how do you decide what to fetch? Confirm the target first with a browser or an FTP client. For example, to grab the Kaohsiung Medical University (KMU) website, just type this command at the prompt:

wget http://www.kmu.edu.tw/
Simple enough, right? With one small command the entire KMU site comes down. It will, however, also pick up plenty of irrelevant material, which a few options can rein in. The common ones are:

-np ("no parent") keeps wget from climbing above the starting directory. Since web pages link all over the place, adding -np confines the crawl to the starting directory and everything below it (wget already refuses to wander onto other hosts unless you add -H).

-m is short for "mirror": it downloads the entire site, directory structure and all.

-A fetches only certain file extensions; for example, -A html,htm downloads only the pages and skips the images.

-b sends wget to the background after startup; under Windows this keeps wget from tying up the DOS window.

-c resumes an interrupted download: if a previous fetch stopped partway, wget picks up from the break point instead of starting over. For HTTP this relies on the server honoring Range requests, which nearly all servers do, so in practice the resume just works.

For example, I can issue this command:
wget -A jpeg,jpg -b -c -m -np http://www..idv.tw/
to pull down all of that site's JPEG images.
Convenient enough? And wget is not limited to web pages; it can fetch from FTP as well:
wget ftp://ftp.nsysu.edu.tw/
For sites that require an account and password:
wget ftp://user:password@ftp.individual.com.tw/
For sites on a non-standard port:
wget ftp://user:password@ftp.individual.com.tw:6667/

A single command can pull down an entire site's data; that is how powerful wget is. The flip side is that it places heavy demands on bandwidth, so whenever possible, set up the correct proxy server to reduce the load on the network:

Inside KMU you can use: set http_proxy=http://proxy.kmu.edu.tw:3128/
On HiNet you can use: set http_proxy=http://www.hinet.net:80/
On SEEDNet you can use: set http_proxy=http://ksproxy.seed.net.tw:8080/
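
The set lines above are Windows (DOS) syntax. As a minimal sketch of the Linux equivalent, assuming the same KMU proxy host, you can export the variable in the shell or make it permanent in ~/.wgetrc:

export http_proxy=http://proxy.kmu.edu.tw:3128/

# ~/.wgetrc
use_proxy = on
http_proxy = http://proxy.kmu.edu.tw:3128/

wget reads both forms automatically; --no-proxy can override them for a single run.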

As for obtaining the software, a Windows build of wget can be found here:
ftp://ftp.ntust.edu.tw/WinNT/Winsoc...win-1_5_3_1.zip
Unpack it and run wget directly. If you ever want to fill your network pipe to the brim, wget will not disappoint.

-------------------------------------------------------------------------------------------------
wget --help reference
Mandatory arguments to long options are mandatory for short options too.

Startup:
-V, --version display the version of Wget and exit.
-h, --help print this help.
-b, --background go to background after startup.
-e, --execute=COMMAND execute a `.wgetrc'-style command.
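
A minimal illustration of -b and -e together, reusing the example site from above:

wget -b -e robots=off http://www.kmu.edu.tw/

-b detaches wget from the console (progress goes to a wget-log file in the current directory), while -e executes the given `.wgetrc'-style command; robots=off asks wget to ignore robots.txt exclusions.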

Logging and input file:
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print lots of debugging information.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --no-verbose turn off verboseness, without being quiet.
-i, --input-file=FILE download URLs found in FILE.
-F, --force-html treat input file as HTML.
-B, --base=URL prepends URL to relative links in -F -i file.
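
For example, to fetch every URL listed one per line in a text file while logging to a separate file (urls.txt and fetch.log are hypothetical names):

wget -i urls.txt -o fetch.log

Add -F if the input file is an HTML page rather than a plain list, and -B to resolve its relative links.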

Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
--retry-connrefused retry even if connection is refused.
-O, --output-document=FILE write documents to FILE.
-nc, --no-clobber skip downloads that would download to
existing files.
-c, --continue resume getting a partially-downloaded file.
--progress=TYPE select progress gauge type.
-N, --timestamping don't re-retrieve files unless newer than
local.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set all timeout values to SECONDS.
--dns-timeout=SECS set the DNS lookup timeout to SECS.
--connect-timeout=SECS set the connect timeout to SECS.
--read-timeout=SECS set the read timeout to SECS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval.
--random-wait wait from 0...2*WAIT secs between retrievals.
-Y, --proxy explicitly turn on proxy.
--no-proxy explicitly turn off proxy.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
--limit-rate=RATE limit download rate to RATE.
--no-dns-cache disable caching DNS lookups.
--restrict-file-names=OS restrict chars in file names to ones OS allows.
--user=USER set both ftp and http user to USER.
--password=PASS set both ftp and http password to PASS.
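
A sketch combining several of these options (the host is the example site from earlier; the file name is made up):

wget -c -t 5 -w 2 --limit-rate=100k http://www.kmu.edu.tw/somefile.iso

This resumes a partial download, retries up to 5 times, waits 2 seconds between retrievals, and caps the transfer rate at 100 KB/s.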

Directories:
-nd, --no-directories don't create directories.
-x, --force-directories force creation of directories.
-nH, --no-host-directories don't create host directories.
--protocol-directories use protocol name in directories.
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components.
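
For instance, to drop the host-name directory and the first remote path component, saving everything under downloads/ (the remote path is illustrative):

wget -r -nH --cut-dirs=1 -P downloads http://www.kmu.edu.tw/pub/

A remote file pub/a/file.html is then saved as downloads/a/file.html instead of www.kmu.edu.tw/pub/a/file.html.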

HTTP options:
--http-user=USER set http user to USER.
--http-password=PASS set http password to PASS.
--no-cache disallow server-cached data.
-E, --html-extension save HTML documents with `.html' extension.
--ignore-length ignore `Content-Length' header field.
--header=STRING insert STRING among the headers.
--proxy-user=USER set USER as proxy username.
--proxy-password=PASS set PASS as proxy password.
--referer=URL include `Referer: URL' header in HTTP request.
--save-headers save the HTTP headers to file.
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive disable HTTP keep-alive (persistent connections).
--no-cookies don't use cookies.
--load-cookies=FILE load cookies from FILE before session.
--save-cookies=FILE save cookies to FILE after session.
--keep-session-cookies load and save session (non-permanent) cookies.
--post-data=STRING use the POST method; send STRING as the data.
--post-file=FILE use the POST method; send contents of FILE.
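
A hedged example of sending custom headers while keeping cookies across runs (the page name is hypothetical):

wget --user-agent="Mozilla/5.0" --referer=http://www.kmu.edu.tw/ --save-cookies cookies.txt --keep-session-cookies http://www.kmu.edu.tw/index.html

A later invocation can replay the stored cookies with --load-cookies cookies.txt.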

HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, and TLSv1.
--no-check-certificate don't validate the server's certificate.
--certificate=FILE client certificate file.
--certificate-type=TYPE client certificate type, PEM or DER.
--private-key=FILE private key file.
--private-key-type=TYPE private key type, PEM or DER.
--ca-certificate=FILE file with the bundle of CA's.
--ca-directory=DIR directory where hash list of CA's is stored.
--random-file=FILE file with random data for seeding the SSL PRNG.
--egd-file=FILE file naming the EGD socket with random data.
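
For a host with a self-signed certificate, for example, validation can be skipped; use this with care, since it removes the protection the certificate check provides:

wget --no-check-certificate https://www.kmu.edu.tw/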

FTP options:
--ftp-user=USER set ftp user to USER.
--ftp-password=PASS set ftp password to PASS.
--no-remove-listing don't remove `.listing' files.
--no-glob turn off FTP file name globbing.
--no-passive-ftp disable the "passive" transfer mode.
--retr-symlinks when recursing, get linked-to files (not dir).
--preserve-permissions preserve remote file permissions.
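
These options offer an alternative to embedding credentials in the URL, as was done in the examples near the top (user and password are placeholders):

wget --ftp-user=user --ftp-password=password ftp://ftp.individual.com.tw/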

Recursive download:
-r, --recursive specify recursive download.
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite).
--delete-after delete files locally after downloading them.
-k, --convert-links make links in downloaded HTML point to local files.
-K, --backup-converted before converting file X, back up as X.orig.
-m, --mirror shortcut for -N -r -l inf --no-remove-listing.
-p, --page-requisites get all images, etc. needed to display HTML page.
--strict-comments turn on strict (SGML) handling of HTML comments.
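
A common recipe for an offline-browsable mirror, reusing the example site:

wget -m -k -p http://www.kmu.edu.tw/

-m mirrors the whole site, -k rewrites links in the saved pages to point at the local copies, and -p additionally fetches the images and stylesheets each page needs to display.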

Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
--ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
-X, --exclude-directories=LIST list of excluded directories.
-np, --no-parent don't ascend to the parent directory.
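
For instance, to collect only PDF files without ascending above the starting directory (the path is illustrative):

wget -r -np -A pdf http://www.kmu.edu.tw/pub/

Note that wget still downloads HTML pages temporarily so it can follow their links, then deletes the ones that do not match the accept list.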

ETL Process Description Template

Thanks to the mentoring of an outstanding Lead, I have finally come to appreciate the importance of wikis and of accumulating knowledge. While fumbling my way through using a wiki, I also began to understand something a woman I admire once said about wikis: "Templates matter enormously in how a wiki gets used." At the time I scoffed at that remark; only today do I realize it distills years of working experience. A template is not just a precise way of expressing knowledge; it also embodies a work process and its responsibilities. More importantly, much like a design pattern, it gives team members a shared tool for communicating and understanding one another. So, starting today, I will make a point of collecting and summarizing templates, waste less effort on pointless work, and keep more time for enjoying myself. Heh. :)

====================================================
ETL Jobs Process Description Template
  • Job Description
    • Overview (Table format)
      • DI job name
      • Data Provider (Where it comes from)
      • Description (Generic work flow)
    • Complex work flow chart (visualizes a complicated workflow at a glance)
    • Detailed work flow (describes the concrete workflow and its specifics)
  • Job Schedule
  • Deploy
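
As a sketch of how the Overview table might be filled in, with every name below invented purely for illustration:

DI job name: DI_CUSTOMER_DAILY_LOAD
Data provider: nightly flat-file extract from the CRM system
Description: load the extract into staging, clean and deduplicate, then upsert into the customer dimension

The Job Schedule and Deploy sections would then record when the job runs and how it is promoted across environments.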