almost 2 years ago

References:
Understanding Scope in Ruby

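References:
Where to put a plain .rb file
See: RailsFun.tw beginner tutorial day3 HD, at around the 53-minute mark
a. Create the XX.rb file in project_name/lib
b. Right after creating it, you can run it with ruby XX.rb
c. Then add require "XX" (without the .rb) to config/application.rb
d. After that, try it in rails c: first require "XX", then call XX.run (I defined it as self.run)
e. You can also use it in a view with <%= XX.run %>, but remember to restart the server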
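
Steps a to e can be sketched as a single file. This is only a minimal sketch: the module name XX and the returned string are placeholders, and only the self.run method comes from the notes above.

```ruby
# lib/xx.rb (hypothetical file name standing in for XX.rb)
module XX
  # Module-level method, so it can be called as XX.run
  # from rails c or from a view without creating an object.
  def self.run
    "hello from XX"
  end
end

# Step b: running `ruby lib/xx.rb` directly prints the result.
puts XX.run if __FILE__ == $PROGRAM_NAME
```

Defining run on the module itself (self.run) is what makes the short XX.run call in steps d and e possible.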

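terminal: rails c. Case matters. require sometimes returns false, which only means the file is already loaded. Sometimes no require is needed at all; I am not sure why (possibly because Bundler has already loaded the gems listed in the Gemfile).
require 'mechanize' # on Cloud9

require 'Mechanize' # in the Terminal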
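
The false return value is not an error. A small sketch with a throwaway file shows it (the file name and the helper require_twice are made up for this illustration, and no gem is involved):

```ruby
require "tmpdir"

# Requires the same file twice and returns both return values.
def require_twice(path)
  [require(path), require(path)]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "xx_require_demo.rb")
  File.write(path, "module XxRequireDemo; end\n")
  # The first call loads the file, the second finds it already loaded.
  p require_twice(path)
end
```

This prints [true, false]: true means "loaded just now", false means "was already loaded".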
terminal: Google's page is too complicated. It is hard to find the links I want, and it also uses things like data-href, so I switched to searching with Yahoo.
agent = Mechanize.new
page = agent.get('https://tw.yahoo.com/')

The Google and Yahoo forms are a little different:
Mechanize examples
Getting Started With Mechanize

google_form = page.form('f')
google_form.q = 'ruby mechanize'
terminal: On the Google home page, forms shows the home page form: its name is f, and fields contains one with name: q whose value is empty
=>
{forms
  #<Mechanize::Form

   {name "f"}
   {method "GET"}
   {action "/search"}
   {fields
    [hidden:0x3fc7f160d9f4 type: hidden name: ie value: Big5]
    [hidden:0x3fc7f160d1fc type: hidden name: hl value: zh-TW]
    [hidden:0x3fc7f160cd4c type: hidden name: source value: hp]
    [hidden:0x3fc7f160c810 type: hidden name: biw value: ]
    [hidden:0x3fc7f160c2fc type: hidden name: bih value: ]
    [text:0x3fc7f19edad8 type:  name: q value: ]
    [hidden:0x3fc7f1995fcc type: hidden name: gbv value: 1]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons
    [submit:0x3fc7f19ed6dc type: submit name: btnG value: Google 搜尋]
    [submit:0x3fc7f19ecca0 type: submit name: btnI value: 好手氣]}>}
terminal: On the Yahoo home page, forms shows the home page form: its name is nil, and fields contains one with name: p whose value is empty
=>
{forms
  #<Mechanize::Form

   {name nil}
   {method "GET"}
   {action "https://tw.search.yahoo.com/search"}
   {fields
    [hidden:0x3fc7f1884408 type: hidden name: fr value: yfp-search-sb]
    [text:0x3fc7f18810f0 type: text name: p value: ]}
   {radiobuttons}
   {checkboxes}
   {file_uploads}
   {buttons [submit:0x3fc7f187dedc type: submit name:  value: 搜尋]}>}>
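terminal: Use Mechanize's .form method to find the form to fill in.
google_form = page.form('f') # find the form named 'f'

yahoo_form = page.form # no name given: returns the first form on the page, here the unnamed one
terminal: Set q and p to the search string. In the console, value changes to the string we are about to send.
# Set q (p for Yahoo) to the search string: 林正宗婦產科診所.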

google_form.q = '林正宗婦產科診所'
yahoo_form.p = '林正宗婦產科診所'

=> #<Mechanize::Form

 {name nil}
 {method "GET"}
 {action "https://tw.search.yahoo.com/search"}
 {fields
  [hidden:0x3fc7f1884408 type: hidden name: fr value: yfp-search-sb]
  [text:0x3fc7f18810f0 type: text name: p value: 林正宗婦產科診所]}
 {radiobuttons}
 {checkboxes}
 {file_uploads}
 {buttons [submit:0x3fc7f187dedc type: submit name:  value: 搜尋]}>
terminal: Press the button to submit the form
  page = agent.submit(yahoo_form, yahoo_form.buttons.first)
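
Because the form's method is GET, submitting it amounts roughly to requesting the action URL with the fields as a query string. The sketch below rebuilds that URL with the standard library only. The action and the fr and p fields come from the form dump above; the helper name search_url is made up, and Mechanize itself may send more than this (cookies, headers).

```ruby
require "uri"

# Builds the URL that a GET submission of the Yahoo form would request.
def search_url(query)
  uri = URI("https://tw.search.yahoo.com/search")
  # fr is the hidden field, p is the text field holding the search string.
  uri.query = URI.encode_www_form(fr: "yfp-search-sb", p: query)
  uri.to_s
end

puts search_url("ruby mechanize")
```

This prints https://tw.search.yahoo.com/search?fr=yfp-search-sb&p=ruby+mechanize.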
terminal: If url_list = [] is in the wrong place, this will never give the right result
url_list = []
page.links.each do |link|
  url = link if  link.text.start_with?("林正宗") && link.href.exclude?("yahoo")
  url_list << url
  url_list.delete_if {|url| url.nil?}
end; true
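
The same filtering can also be written with select, which avoids pushing nil into the list and deleting it again. This is a sketch under assumptions: Link is a plain Struct standing in for Mechanize's link objects (only text and href are modeled), the helper name matching_links and the sample URLs are made up, and plain include? is used instead of ActiveSupport's exclude?.

```ruby
# Stand-in for a Mechanize link; only text and href are modeled.
Link = Struct.new(:text, :href)

# Returns the links whose text starts with prefix and whose href
# contains none of the blocked substrings.
def matching_links(links, prefix, blocked)
  links.select do |link|
    next false if link.href.nil? # guard against a missing href
    link.text.to_s.start_with?(prefix) &&
      blocked.none? { |word| link.href.include?(word) }
  end
end

links = [
  Link.new("林正宗婦產科診所", "http://clinic.example/"),
  Link.new("林正宗 - Yahoo", "https://tw.search.yahoo.com/x"),
  Link.new("Other result", "http://other.example/"),
  Link.new("林正宗 (no href)", nil)
]

p matching_links(links, "林正宗", ["yahoo"]).map(&:href)
```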
The wrong steps: with url_list = [] in the wrong place, this never gives the right result
rails c

require 'Mechanize'
agent = Mechanize.new

page = agent.get('https://tw.yahoo.com/')
yahoo_form = page.form
url_list = [] # wrong place: created once, so every hospital in the loop shares it

@hospitals = Hospital.all
@hospitals[0,500].each do |h|
  
  name = h.name
  yahoo_form.p = name
  page = agent.submit(yahoo_form, yahoo_form.buttons.first)
  
  page.links.each do |link|
    url = link if link.text.start_with?(name[0,4]) && link.href !~ /(nhi|tylife|makesop|tel038|medicallab|deefeed|doctorknow|youtube|pchome|bamahome|319papago|doctor.tw|finddoc|yahoo|doctor01|ezlife|5151|facebook|ipeen|104hc|verywed|bizpo|1111|twypage|tw16|google|businessweekly|goo|wakema|pixnet|iyp|xuite|sina|wikia|nownews|commonhealth|healthnews|socailiao|web393|wikipedia|mywoo|104|mohw|lifego|ypgo|yes123|blogspot|16won|4a0b|yelp|twspecial|fuly|healthmedia|shsh|qmap|zhupiter|ilist|dmbom|yam|lecoin|iask|shopcool|datagovtw|518|pixstu|iguang|urmap|yeahday|cmoremap|lookup|hicare|itwyp|tut|medicaltravel|dadupo|tophealthclinics|medicaltravel|roodo|itel|salary|web66|hotfrog|ibeta|abiz|olc|freeweb|ck101|1673|121|freelist|taiwanschoolnet|tw66|wordpress|babyhome|fashionguide|gothejob|chcg|lienchiangcounty4|likeboy|fescomail|dradvice|hcl)/
    url_list << url.href if !url.nil?
    url_list.uniq!
  end
  @hospital = Hospital.find_by(:name => name)
  if url_list.empty?
    @hospital.website = "N/A"
    @hospital.save
  else  
    @hospital.website = url_list[0]
    @hospital.save
  end
end; true

Understanding Scope in Ruby

Complete code, the correct steps: url_list = [] has to go inside the loop. Also keep nhi (it is no longer in the exclusion regex), taichung-clinic
rails c

require 'Mechanize'
agent = Mechanize.new

page = agent.get('https://tw.yahoo.com/')
yahoo_form = page.form


@hospitals = Hospital.all
@hospitals[1000..1015].each do |h|
  url_list = []
  name = h.name
  yahoo_form.p = name
  page = agent.submit(yahoo_form, yahoo_form.buttons.first)
  
  page.links.each do |link|
    url = link if link.text.start_with?(name[0,4]) && link.href !~ /(thehealthdb|ihao|tradelist|lifeshow|1637.tw|2012tt|aiwiki|fcu|tylife|makesop|tel038|medicallab|deefeed|doctorknow|youtube|pchome|bamahome|319papago|doctor.tw|finddoc|yahoo|doctor01|ezlife|5151|facebook|ipeen|104hc|verywed|bizpo|1111|twypage|tw16|google|businessweekly|goo|wakema|pixnet|iyp|xuite|sina|wikia|nownews|commonhealth|healthnews|socailiao|web393|wikipedia|mywoo|104|mohw|lifego|ypgo|yes123|blogspot|16won|4a0b|yelp|twspecial|fuly|healthmedia|shsh|qmap|zhupiter|ilist|dmbom|yam|lecoin|iask|shopcool|datagovtw|518|pixstu|iguang|urmap|yeahday|cmoremap|lookup|hicare|itwyp|tut|medicaltravel|dadupo|tophealthclinics|medicaltravel|roodo|itel|salary|web66|hotfrog|ibeta|abiz|olc|freeweb|ck101|1673|121|freelist|taiwanschoolnet|tw66|wordpress|babyhome|fashionguide|gothejob|chcg|lienchiangcounty4|likeboy|fescomail|dradvice|hcl)/
    url_list << url.href if !url.nil?
    url_list.uniq!
  end
  @hospital = Hospital.find_by(:name => name)
  if url_list.empty?
    @hospital.website = "N/A"
    @hospital.save
  else  
    @hospital.website = url_list[0]
    @hospital.save
  end
end; true
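
Why the position of url_list = [] matters can be shown without any scraping. In this sketch the hospital names, URLs, and both method names are made up; the hash stands in for "the URLs found for each hospital".

```ruby
# Fake search results per hospital (illustrative data only).
RESULTS = {
  "A clinic" => ["http://a.example/"],
  "B clinic" => [],
  "C clinic" => ["http://c.example/"]
}.freeze

# Wrong: one list created before the loop and shared by all hospitals.
def websites_shared_list(results)
  url_list = []
  results.map do |name, urls|
    url_list.concat(urls)
    [name, url_list.first || "N/A"]
  end.to_h
end

# Right: a fresh list for every hospital.
def websites_fresh_list(results)
  results.map do |name, urls|
    url_list = []
    url_list.concat(urls)
    [name, url_list.first || "N/A"]
  end.to_h
end

p websites_shared_list(RESULTS)
p websites_fresh_list(RESULTS)
```

With the shared list, every hospital after the first hit gets the first hospital's URL, and "N/A" is never written again. With the fresh list, "B clinic" gets "N/A" and "C clinic" gets its own URL.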
The uglier way to write it
url = link if link.text.start_with?(name[0,4]) && link.href.exclude?("yahoo") && link.href.exclude?("doctor01") && link.href.exclude?("ezlife") && link.href.exclude?("5151") && link.href.exclude?("facebook") && link.href.exclude?("ipeen") && link.href.exclude?("104hc") && link.href.exclude?("verywed") && link.href.exclude?("bizpo") && link.href.exclude?("1111") && link.href.exclude?("twypage") && link.href.exclude?("tw16") && link.href.exclude?("google") && link.href.exclude?("businessweekly") && link.href.exclude?("goo") && link.href.exclude?("wakema") && link.href.exclude?("pixnet") && link.href.exclude?("iyp") && link.href.exclude?("xuite")&& link.href.exclude?("sina") && link.href.exclude?("wikia") && link.href.exclude?("nownews") && link.href.exclude?("commonhealth") && link.href.exclude?("nhi") && link.href.exclude?("healthnews") && link.href.exclude?("socailiao") && link.href.exclude?("web393") && link.href.exclude?("wikipedia") && link.href.exclude?("mywoo") && link.href.exclude?("104") && link.href.exclude?("mohw") && link.href.exclude?("lifego") && link.href.exclude?("ypgo") && link.href.exclude?("yes123") && link.href.exclude?("blogspot") && link.href.exclude?("16won") && link.href.exclude?("4a0b") && link.href.exclude?("yelp") && link.href.exclude?("twspecial") && link.href.exclude?("fuly") && link.href.exclude?("healthmedia") && link.href.exclude?("shsh") && link.href.exclude?("qmap")
The nicer way to write it: filter out the unwanted URLs
url = link if link.text.start_with?(name[0,4]) && link.href !~ /(nhi|tylife|makesop|tel038|medicallab|deefeed|doctorknow|youtube|pchome|bamahome|319papago|doctor.tw|finddoc|yahoo|doctor01|ezlife|5151|facebook|ipeen|104hc|verywed|bizpo|1111|twypage|tw16|google|businessweekly|goo|wakema|pixnet|iyp|xuite|sina|wikia|nownews|commonhealth|healthnews|socailiao|web393|wikipedia|mywoo|104|mohw|lifego|ypgo|yes123|blogspot|16won|4a0b|yelp|twspecial|fuly|healthmedia|shsh|qmap|zhupiter|ilist|dmbom|yam|lecoin|iask|shopcool|datagovtw|518|pixstu|iguang|urmap|yeahday|cmoremap|lookup|hicare|itwyp|tut|medicaltravel|dadupo|tophealthclinics|medicaltravel|roodo|itel|salary|web66|hotfrog|ibeta|abiz|olc|freeweb|ck101|1673|121|freelist|taiwanschoolnet|tw66|wordpress|babyhome|fashionguide|gothejob|chcg|lienchiangcounty4|likeboy|fescomail|dradvice|hcl)/    
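
One possible further step, not from the original notes, is to keep the names in an array and build the pattern with Regexp.union. The list below is a shortened, illustrative subset, and the constant and method names are made up.

```ruby
# A shortened subset of the blocked names, for illustration only.
BLOCKED_SITES = %w[yahoo facebook doctor.tw pixnet wikipedia].freeze

# Regexp.union escapes each string, so the dot in "doctor.tw"
# matches only a literal dot, not any character.
BLOCKED_PATTERN = Regexp.union(BLOCKED_SITES)

# True when the href contains any blocked name as a substring.
def blocked?(href)
  BLOCKED_PATTERN.match?(href)
end

p blocked?("https://www.facebook.com/some-clinic")
p blocked?("http://some-clinic.example/")
```

Either way, each entry matches as a substring anywhere in the URL, so short entries such as 104 or goo also block any address that merely contains them.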

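Running too many records in one go produces this: Mechanize::ResponseCodeError: 999
Yahoo seems to allow crawling again after about an hour, and it should be possible to crawl about 500 records each time.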
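
One way to stay under that limit is to work in batches with a pause in between. This is only a sketch: the helper each_in_batches is made up, and the defaults of 500 records and 3600 seconds are the rough guesses from the note above, not documented Yahoo limits.

```ruby
# Yields every item, pausing between batches.
# sleeper is injectable so the pause can be replaced in tests.
def each_in_batches(items, batch_size: 500, pause_seconds: 3600,
                    sleeper: ->(seconds) { sleep(seconds) })
  items.each_slice(batch_size).with_index do |batch, index|
    sleeper.call(pause_seconds) unless index.zero? # no pause before the first batch
    batch.each { |item| yield item }
  end
end

# Hypothetical usage in rails c, with the loop body from the code above:
# each_in_batches(Hospital.all.to_a) { |h| ... }
```

This does not handle a 999 response once it happens; it only tries to avoid triggering one.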
