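Google hack: simple page URL collection
There are many ways to collect URLs from Google search results. The simplest is probably to read the DOM nodes directly with JavaScript. For example: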
    var h3 = document.getElementsByTagName('h3');
    for (var i = 0; i < h3.length; i++) {
        var a = h3[i].getElementsByTagName('a');
        // Skip headings that contain no link.
        if (a.length > 0) {
            console.log(a[0].href);
        }
    }
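In Chrome, press F12 to open the developer tools, switch to the "Console" tab, paste in the code above, and press Enter to see the result.
In Java, jsoup also makes it very easy to get the URLs of the search results: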
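    import java.io.IOException;
    import java.net.URLDecoder;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // The class name is only a placeholder so the snippet compiles on its own.
    public class GoogleUrlCollector {
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("https://www.google.ws/search?num=100&site=&source=hp&q=filetype%3Ajsp&oq=filetype%3Ajsp&gs_l=hp.3...8115.14780.0.15194.22.21.1.0.0.0.523.5187.3j3j3j5j4j1.19.0....0...1c.1.36.hp..14.8.1440.P_2EQhc7Pz0")
                    .userAgent("Googlebot/2.1 (+http://www.googlebot.com/bot.html)")
                    .timeout(5000)
                    .get();
            // Result links look like /url?q=<target>&sa=...; capture <target>.
            Pattern p = Pattern.compile("/url\\?q=(.*?)&sa");
            Elements headings = doc.getElementsByTag("h3");
            for (Element e : headings) {
                Elements links = e.getElementsByTag("a");
                if (links.isEmpty()) {
                    continue; // heading without a link
                }
                Matcher m = p.matcher(links.get(0).attr("href"));
                if (m.find()) {
                    System.out.println(URLDecoder.decode(m.group(1), "UTF-8"));
                }
            }
        }
    }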
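The step that turns a redirect-style href into the real target URL can be tried on its own, without any network access. The following is a minimal sketch under a few assumptions: the class name `RedirectUrlDecoder`, the method name `extractTarget`, and the sample href are all hypothetical, and the pattern stops at the first `&` instead of looking for `&sa`, which assumes that any `&` inside the target is percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RedirectUrlDecoder {

    // Matches "/url?q=<target>" and captures <target> up to the first '&' or the end of the string.
    private static final Pattern REDIRECT = Pattern.compile("/url\\?q=([^&]+)");

    /** Returns the decoded target URL, or null if href is not a redirect link. */
    static String extractTarget(String href) {
        Matcher m = REDIRECT.matcher(href);
        if (!m.find()) {
            return null;
        }
        try {
            return URLDecoder.decode(m.group(1), "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is a charset every JVM must support, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample href, not taken from a real results page.
        String href = "/url?q=http://example.com/index.jsp%3Fid%3D1&sa=U&ved=0";
        System.out.println(extractTarget(href)); // prints http://example.com/index.jsp?id=1
    }
}
```

Returning null for non-matching links keeps the caller's loop simple: it can skip anything that is not a redirect link.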
Using a regular expression:
    package org.javaweb.test;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TestReg {
        public static void main(String[] args) {
            String source = "<h3 class=\"r\"><a href=\"http://baidu.com\">Baidu</a></h3>"
                    + "<h3 class=\"r\"><a href=\"http://google.com\">Google</a></h3> ";
            StringBuilder resultComment = new StringBuilder();
            StringBuilder resultName = new StringBuilder();
            System.out.println("=======Start matching========");
            // Group 2 is the href value, group 3 is the link text.
            String patternStrs = "(<h3 class=\"r\"><a.+?)href=\"(.+?)\">(.+?)(</a></h3>)";
            Pattern pattern = Pattern.compile(patternStrs);
            Matcher matcher = pattern.matcher(source);
            while (matcher.find()) {
                resultName.append(matcher.group(2)).append("\n");
                resultComment.append(matcher.group(3)).append("\n");
            }
            System.out.println("=======Link text=======");
            System.out.println(resultComment.toString());
            System.out.println("=======href attribute values=======");
            System.out.println(resultName.toString());
        }
    }
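The same matching loop can be wrapped in a reusable method that returns the links as data instead of printing them. This is only a sketch: the class name `LinkExtractor` and the method name `extractLinks` are hypothetical, and the pattern is a slightly stricter variant that assumes the markup has the same `<h3 class="r"><a href="...">` shape as the sample string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkExtractor {

    // Group 1 is the href value, group 2 is the link text.
    private static final Pattern LINK = Pattern.compile(
            "<h3 class=\"r\"><a[^>]*?href=\"([^\"]+)\"[^>]*>(.+?)</a></h3>");

    /** Returns href -> link text, in the order the links appear in the HTML. */
    static Map<String, String> extractLinks(String html) {
        Map<String, String> links = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.put(m.group(1), m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<h3 class=\"r\"><a href=\"http://baidu.com\">Baidu</a></h3>"
                + "<h3 class=\"r\"><a href=\"http://google.com\">Google</a></h3>";
        // Print one "href -> text" pair per line.
        for (Map.Entry<String, String> e : extractLinks(html).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A `LinkedHashMap` keeps the results in page order and drops duplicate URLs. If duplicates matter, a list of pairs would be the better choice.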