Tuesday 2 September 2008

Getting started with wget (repost)

wget feels a bit more pleasant than using an ftp client, heh.

http://redhat.ecenter.idv.tw/bbs/showthread.php?threadid=39222

First, a quick introduction: wget is one of the programs developed by the GNU project. It comes in Linux and Windows versions that are used identically, requires no registration or cracking, and places no restrictions on ordinary use. A little tempted yet?

Next, how to use it. Because wget is built for "ripping" entire sites, it has no interactive interface at all; everything runs in text mode. So how do you decide what to fetch? Confirm the target first with a browser or an FTP client. For example, to grab the Kaohsiung Medical University (KMU) website, just type this command at the prompt:

wget http://www.kmu.edu.tw/
Simple enough, right? With one small command the entire KMU site comes down. It will, however, also pick up plenty of irrelevant material, which a few options can rein in. The common ones are:

-np ("no parent") keeps wget from climbing above the starting directory. Since web pages link all over the place, adding -np confines the crawl to the starting directory and everything below it (wget already refuses to wander onto other hosts unless you add -H).

-m is short for "mirror": it downloads the entire site, directory structure and all.

-A fetches only certain file extensions; for example, -A html,htm downloads only the pages and skips the images.

-b sends wget to the background after startup; under Windows this keeps wget from tying up the DOS window.

-c resumes an interrupted download: if a previous fetch stopped partway, wget picks up from the break point instead of starting over. For HTTP this relies on the server honoring Range requests, which nearly all servers do, so in practice the resume just works.

For example, I can issue this command:
wget -A jpeg,jpg -b -c -m -np http://www..idv.tw/
to pull down all of that site's JPEG images.
Convenient enough? And wget is not limited to web pages; it can fetch from FTP as well:
wget ftp://ftp.nsysu.edu.tw/
For sites that require an account and password:
wget ftp://user:password@ftp.individual.com.tw/
For sites on a non-standard port:
wget ftp://user:password@ftp.individual.com.tw:6667/

A single command can pull down an entire site's data; that is how powerful wget is. The flip side is that it places heavy demands on bandwidth, so whenever possible, set up the correct proxy server to reduce the load on the network:

Inside KMU you can use: set http_proxy=http://proxy.kmu.edu.tw:3128/
On HiNet you can use: set http_proxy=http://www.hinet.net:80/
On SEEDNet you can use: set http_proxy=http://ksproxy.seed.net.tw:8080/
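
The set lines above are Windows (DOS) syntax. As a minimal sketch of the Linux equivalent, assuming the same KMU proxy host, you can export the variable in the shell or make it permanent in ~/.wgetrc:

export http_proxy=http://proxy.kmu.edu.tw:3128/

# ~/.wgetrc
use_proxy = on
http_proxy = http://proxy.kmu.edu.tw:3128/

wget reads both forms automatically; --no-proxy can override them for a single run.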

As for obtaining the software, a Windows build of wget can be found here:
ftp://ftp.ntust.edu.tw/WinNT/Winsoc...win-1_5_3_1.zip
Unpack it and run wget directly. If you ever want to fill your network pipe to the brim, wget will not disappoint.

-------------------------------------------------------------------------------------------------
wget --help reference
Mandatory arguments to long options are mandatory for short options too.

Startup:
-V, --version display the version of Wget and exit.
-h, --help print this help.
-b, --background go to background after startup.
-e, --execute=COMMAND execute a `.wgetrc'-style command.
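
A minimal illustration of -b and -e together, reusing the example site from above:

wget -b -e robots=off http://www.kmu.edu.tw/

-b detaches wget from the console (progress goes to a wget-log file in the current directory), while -e executes the given `.wgetrc'-style command; robots=off asks wget to ignore robots.txt exclusions.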

Logging and input file:
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print lots of debugging information.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --no-verbose turn off verboseness, without being quiet.
-i, --input-file=FILE download URLs found in FILE.
-F, --force-html treat input file as HTML.
-B, --base=URL prepends URL to relative links in -F -i file.
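
For example, to fetch every URL listed one per line in a text file while logging to a separate file (urls.txt and fetch.log are hypothetical names):

wget -i urls.txt -o fetch.log

Add -F if the input file is an HTML page rather than a plain list, and -B to resolve its relative links.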

Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
--retry-connrefused retry even if connection is refused.
-O, --output-document=FILE write documents to FILE.
-nc, --no-clobber skip downloads that would download to
existing files.
-c, --continue resume getting a partially-downloaded file.
--progress=TYPE select progress gauge type.
-N, --timestamping don't re-retrieve files unless newer than
local.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set all timeout values to SECONDS.
--dns-timeout=SECS set the DNS lookup timeout to SECS.
--connect-timeout=SECS set the connect timeout to SECS.
--read-timeout=SECS set the read timeout to SECS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval.
--random-wait wait from 0...2*WAIT secs between retrievals.
-Y, --proxy explicitly turn on proxy.
--no-proxy explicitly turn off proxy.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
--limit-rate=RATE limit download rate to RATE.
--no-dns-cache disable caching DNS lookups.
--restrict-file-names=OS restrict chars in file names to ones OS allows.
--user=USER set both ftp and http user to USER.
--password=PASS set both ftp and http password to PASS.
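
A sketch combining several of these options (the host is the example site from earlier; the file name is made up):

wget -c -t 5 -w 2 --limit-rate=100k http://www.kmu.edu.tw/somefile.iso

This resumes a partial download, retries up to 5 times, waits 2 seconds between retrievals, and caps the transfer rate at 100 KB/s.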

Directories:
-nd, --no-directories don't create directories.
-x, --force-directories force creation of directories.
-nH, --no-host-directories don't create host directories.
--protocol-directories use protocol name in directories.
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components.
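
For instance, to drop the host-name directory and the first remote path component, saving everything under downloads/ (the remote path is illustrative):

wget -r -nH --cut-dirs=1 -P downloads http://www.kmu.edu.tw/pub/

A remote file pub/a/file.html is then saved as downloads/a/file.html instead of www.kmu.edu.tw/pub/a/file.html.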

HTTP options:
--http-user=USER set http user to USER.
--http-password=PASS set http password to PASS.
--no-cache disallow server-cached data.
-E, --html-extension save HTML documents with `.html' extension.
--ignore-length ignore `Content-Length' header field.
--header=STRING insert STRING among the headers.
--proxy-user=USER set USER as proxy username.
--proxy-password=PASS set PASS as proxy password.
--referer=URL include `Referer: URL' header in HTTP request.
--save-headers save the HTTP headers to file.
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive disable HTTP keep-alive (persistent connections).
--no-cookies don't use cookies.
--load-cookies=FILE load cookies from FILE before session.
--save-cookies=FILE save cookies to FILE after session.
--keep-session-cookies load and save session (non-permanent) cookies.
--post-data=STRING use the POST method; send STRING as the data.
--post-file=FILE use the POST method; send contents of FILE.
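
A hedged example of sending custom headers while keeping cookies across runs (the page name is hypothetical):

wget --user-agent="Mozilla/5.0" --referer=http://www.kmu.edu.tw/ --save-cookies cookies.txt --keep-session-cookies http://www.kmu.edu.tw/index.html

A later invocation can replay the stored cookies with --load-cookies cookies.txt.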

HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, and TLSv1.
--no-check-certificate don't validate the server's certificate.
--certificate=FILE client certificate file.
--certificate-type=TYPE client certificate type, PEM or DER.
--private-key=FILE private key file.
--private-key-type=TYPE private key type, PEM or DER.
--ca-certificate=FILE file with the bundle of CA's.
--ca-directory=DIR directory where hash list of CA's is stored.
--random-file=FILE file with random data for seeding the SSL PRNG.
--egd-file=FILE file naming the EGD socket with random data.
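
For a host with a self-signed certificate, for example, validation can be skipped; use this with care, since it removes the protection the certificate check provides:

wget --no-check-certificate https://www.kmu.edu.tw/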

FTP options:
--ftp-user=USER set ftp user to USER.
--ftp-password=PASS set ftp password to PASS.
--no-remove-listing don't remove `.listing' files.
--no-glob turn off FTP file name globbing.
--no-passive-ftp disable the "passive" transfer mode.
--retr-symlinks when recursing, get linked-to files (not dir).
--preserve-permissions preserve remote file permissions.
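
These options offer an alternative to embedding credentials in the URL, as was done in the examples near the top (user and password are placeholders):

wget --ftp-user=user --ftp-password=password ftp://ftp.individual.com.tw/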

Recursive download:
-r, --recursive specify recursive download.
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite).
--delete-after delete files locally after downloading them.
-k, --convert-links make links in downloaded HTML point to local files.
-K, --backup-converted before converting file X, back up as X.orig.
-m, --mirror shortcut for -N -r -l inf --no-remove-listing.
-p, --page-requisites get all images, etc. needed to display HTML page.
--strict-comments turn on strict (SGML) handling of HTML comments.
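
A common recipe for an offline-browsable mirror, reusing the example site:

wget -m -k -p http://www.kmu.edu.tw/

-m mirrors the whole site, -k rewrites links in the saved pages to point at the local copies, and -p additionally fetches the images and stylesheets each page needs to display.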

Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
--ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
-X, --exclude-directories=LIST list of excluded directories.
-np, --no-parent don't ascend to the parent directory.
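
For instance, to collect only PDF files without ascending above the starting directory (the path is illustrative):

wget -r -np -A pdf http://www.kmu.edu.tw/pub/

Note that wget still downloads HTML pages temporarily so it can follow their links, then deletes the ones that do not match the accept list.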

ETL Process Description Template

Thanks to the mentoring of an outstanding Lead, I have finally come to appreciate the importance of wikis and of accumulating knowledge. While fumbling my way through using a wiki, I also began to understand something a woman I admire once said about wikis: "Templates matter enormously in how a wiki gets used." At the time I scoffed at that remark; only today do I realize it distills years of working experience. A template is not just a precise way of expressing knowledge; it also embodies a work process and its responsibilities. More importantly, much like a design pattern, it gives team members a shared tool for communicating and understanding one another. So, starting today, I will make a point of collecting and summarizing templates, waste less effort on pointless work, and keep more time for enjoying myself. Heh. :)

====================================================
ETL Jobs Process Description Template
  • Job Description
    • Overview (Table format)
      • DI job name
      • Data Provider (Where it comes from)
      • Description (Generic work flow)
    • Complex work flow chart (visualizes a complicated workflow at a glance)
    • Detailed work flow (describes the concrete workflow and its specifics)
  • Job Schedule
  • Deploy
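
As a sketch of how the Overview table might be filled in, with every name below invented purely for illustration:

DI job name: DI_CUSTOMER_DAILY_LOAD
Data provider: nightly flat-file extract from the CRM system
Description: load the extract into staging, clean and deduplicate, then upsert into the customer dimension

The Job Schedule and Deploy sections would then record when the job runs and how it is promoted across environments.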