[Java Tutorial] Extracting web page content with HttpClient and jsoup (video links from a certain online academy)


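     I recently got a three-month trial membership on Jikexueyuan (极客学院), browsed the site, and found the courses quite good. I happen to be learning Android at the moment, so I went there to find some videos to watch. It turns out that only paid members can download videos. Out of curiosity I looked at the page source and happened to find the video URL right there. I then remembered that jsoup makes it easy to extract elements from a web page, so in my spare time I wrote a demo.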

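    jsoup is a Java HTML parser that can parse HTML directly from a URL or from an HTML string. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.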

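    The main features of jsoup are:

   1. Parse HTML from a URL, a file, or a string;
   2. Find and extract data using DOM traversal or CSS selectors;
   3. Manipulate HTML elements, attributes, and text.

   Chinese documentation for jsoup: http://www.open-open.com/jsoup/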
 
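     Extracting specific content from a web page with jsoup requires analyzing the page first. I opened the source of a course page on Jikexueyuan and quickly found the part that holds the video link. It sits in the <source> tag shown below, and with this link the video can be downloaded using Xunlei (迅雷).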
 <source src="http://cv3.jikexueyuan.com/201508081934/f8f3f9f8088f1ba0a6c75594448d96ab/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4" type="video/mp4"></source>

     Once we have the full HTML source, extracting the <source> element from it gives us the download link directly.
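To make the extraction step concrete, here is a minimal sketch that uses only java.util.regex from the JDK rather than jsoup, so it can be tried without any dependency. The class name SourceTagSketch, the method extractSources, and the example.com URL are hypothetical names introduced for illustration. A regular expression is a rough stand-in for a real HTML parser, so treat this as a demonstration of the idea and not as the approach used in the project below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the src attribute out of every <source> tag.
class SourceTagSketch {
    // Matches <source ... src="..."> and captures the quoted src value.
    private static final Pattern SOURCE_SRC = Pattern.compile(
            "<source\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractSources(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = SOURCE_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page fragment shaped like the <source> tag above.
        String html = "<video><source src=\"http://example.com/video/1.mp4\""
                + " type=\"video/mp4\"></source></video>";
        System.out.println(extractSources(html)); // prints: [http://example.com/video/1.mp4]
    }
}
```

With jsoup on the classpath, the same lookup is the one-line selector query used in the Test class later in this post.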

    Analyzing the page further, we can get the information for every video in a course. The relevant page source is:

   

<dl class="lessonvideo-list">
 <dd class="playing">
  <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_1.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=1.1">1.编写自己的自定义 View(上)</a> <span class="lesson-time">00:10:24</span> </h2>
  <blockquote>
   本课时主要讲解最简单的自定义 View,然后加入绘制元素(文字、图形等),并且可以像使用系统控件一样在布局中使用。
  </blockquote>
 </dd>
 <dd>
  <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_2.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=2.2">2.编写自己的自定义 View(下)</a> <span class="lesson-time">00:12:05</span> </h2>
  <blockquote>
   本课时主要讲解最简单的自定义 View,然后加入绘制元素(文字、图形等),并且可以像使用系统控件一样在布局中使用。
  </blockquote>
 </dd>
 <dd>
  <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_3.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=3.3">3.加入逻辑线程</a> <span class="lesson-time">00:20:34</span> </h2>
  <blockquote>
   本课时需要让绘制的元素动起来,但是又不阻塞主线程,所以引入逻辑线程。在子线程更新 UI 是不被允许的,但是 View 提供了方法。让我们来看看吧。
  </blockquote>
 </dd>
 <dd>
  <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_4.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=4.4">4.提取和封装自定义 View</a> <span class="lesson-time">00:15:41</span> </h2>
  <blockquote>
   本课时主要讲解在上个课程的基础上,进行提取代码来构造自定义 View 的基类,主要目的是:创建新的自定义 View 时,只需继承此类并只关心绘制和逻辑,其他工作由父类完成。这样既减少重复编码,也简化了逻辑。
  </blockquote>
 </dd>
 <dd>
  <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_5.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=5.5">5.在 </a> <span class="lesson-time">00:14:05</span> </h2>
  <blockquote>
   本课时主要讲解的是在 
  </blockquote>
 </dd>
</dl>


  Calling Elements results1 = doc.getElementsByClass("lessonvideo-list"); gives us the video list. We then use jsoup to iterate over the links in that list, fetch the page of each lesson, and extract its video URL.

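That is the main idea. One more thing: when jsoup's get method fetches a web page Document, it carries no cookie state, and some videos only expose their playback URL to logged-in VIP members. So we use HttpClient to simulate a logged-in user.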
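Simulating a logged-in user comes down to sending back the cookies the server handed out. The sketch below, written against the JDK only, shows one way to reduce Set-Cookie header values to a single Cookie request header. CookieSketch, nameValue, and toCookieHeader are hypothetical names, and the cookie values are made up; the HttpUtils class in this post does something similar in its subCookie method, but this is not a copy of it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns Set-Cookie values into one Cookie header value.
class CookieSketch {
    // Keeps only the leading "name=value" pair of a Set-Cookie value,
    // dropping attributes such as Path, Expires and HttpOnly.
    static String nameValue(String setCookieValue) {
        int end = setCookieValue.indexOf(';');
        String pair = end >= 0 ? setCookieValue.substring(0, end) : setCookieValue;
        return pair.trim();
    }

    // Joins the pairs with "; ". A later cookie with the same name
    // replaces the value of the earlier one.
    static String toCookieHeader(List<String> setCookieValues) {
        Map<String, String> jar = new LinkedHashMap<>();
        for (String raw : setCookieValues) {
            String pair = nameValue(raw);
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip malformed entries without a name
            }
            jar.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> e : jar.entrySet()) {
            header.add(e.getKey() + "=" + e.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        List<String> received = Arrays.asList(
                "uid=42; Path=/; HttpOnly",
                "token=abc; Expires=Wed, 09 Jun 2027 10:18:14 GMT",
                "uid=43; Path=/");
        System.out.println(toCookieHeader(received)); // prints: uid=43; token=abc
    }
}
```

The resulting string is what would be passed to setHeader("Cookie", ...) on each request.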

 The complete project source follows.

1. The Course class, which stores the name and URL of each lesson in a course.

package com.debughao.bean;

/** One lesson of a course: the text of its link and the URL it points to. */
public class Course {

  /** URL of the link */
  private String linkHref;
  /** Title of the link */
  private String linkText;

  public String getLinkHref() {
    return linkHref;
  }

  public void setLinkHref(String linkHref) {
    this.linkHref = linkHref;
  }

  public String getLinkText() {
    return linkText;
  }

  public void setLinkText(String linkText) {
    this.linkText = linkText;
  }

  @Override
  public String toString() {
    return "Course [linkHref=" + linkHref + ", linkText=" + linkText + "]";
  }

}


2. The HttpUtils class, which simulates the logged-in state of a user.

package com.debughao.down;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

@SuppressWarnings("deprecation")
public class HttpUtils {
  // Browser-like User-Agent sent with every GET and POST request
  private static final String BROWSER_USER_AGENT =
      "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";

  // Value of the Cookie header that carries the login state
  String cookieStr = "";
  // Response of the most recent request
  CloseableHttpResponse response = null;

  public HttpUtils() {
  }

  public HttpUtils(String cookieStr) {
    this.cookieStr = cookieStr;
  }

  public String getCookieStr() {
    return cookieStr;
  }

  public CloseableHttpResponse getResponse() {
    return response;
  }

  /** Sends a GET request with the stored cookie; returns the body, or null on failure. */
  public String Get(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet(url);
    httpget.setHeader("Cookie", cookieStr);
    httpget.setHeader(HttpHeaders.USER_AGENT, BROWSER_USER_AGENT);
    try {
      response = httpclient.execute(httpget);
      HttpEntity entity = response.getEntity();
      return EntityUtils.toString(entity, "UTF-8");
    } catch (Exception e) {
      System.err.println(String.format("HTTP GET error %s", e.getMessage()));
    } finally {
      closeQuietly(httpclient);
    }
    return null;
  }

  /**
   * Sends a POST request. The part of the URL before '?' is the endpoint and
   * the part after it is sent as the form body. Cookies set by the response
   * are appended to the stored cookie string. Returns the body, or null on failure.
   */
  public String Post(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    String[] parts = url.split("\\?", 2);
    HttpPost httppost = new HttpPost(parts[0]);
    // A URL without '?' is sent with an empty body instead of failing
    StringEntity reqEntity = new StringEntity(parts.length > 1 ? parts[1] : "", StandardCharsets.UTF_8);
    reqEntity.setContentType("application/x-www-form-urlencoded;charset=UTF-8");
    httppost.setEntity(reqEntity);
    httppost.setHeader("Cookie", cookieStr);
    httppost.setHeader(HttpHeaders.USER_AGENT, BROWSER_USER_AGENT);
    try {
      response = httpclient.execute(httppost);
      for (Header h : response.getAllHeaders()) {
        if ("Set-Cookie".equalsIgnoreCase(h.getName())) {
          cookieStr += subCookie(h.getValue());
        }
      }
      HttpEntity entity = response.getEntity();
      return EntityUtils.toString(entity, "UTF-8");
    } catch (Exception e) {
      System.err.println(String.format("HTTP POST error %s", e.getMessage()));
    } finally {
      closeQuietly(httpclient);
    }
    return null;
  }

  /** Requests the URL and returns the first cookie set by the response, or "4" as an error code. */
  public String GetLoginCookie(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet(url);
    httpget.setHeader("Cookie", cookieStr);
    httpget.setHeader(HttpHeaders.USER_AGENT, BROWSER_USER_AGENT);
    try {
      response = httpclient.execute(httpget);
      for (Header h : response.getAllHeaders()) {
        if ("Set-Cookie".equalsIgnoreCase(h.getName())) {
          cookieStr = subCookie(h.getValue());
          return cookieStr;
        }
      }
    } catch (Exception e) {
      System.err.println(String.format("HTTP GET error %s", e.getMessage()));
    } finally {
      closeQuietly(httpclient);
    }
    return "4"; // error code
  }

  /** Keeps only the leading "name=value;" part of a Set-Cookie value. */
  public String subCookie(String value) {
    int end = value.indexOf(";");
    // A value without attributes has no ';', so keep it whole and add the separator
    return end < 0 ? value + ";" : value.substring(0, end + 1);
  }

  /**
   * Fetches a binary resource such as an image and returns its content stream,
   * or null on failure. The caller must close the stream, which can for
   * example be copied to a file through a FileOutputStream.
   */
  public InputStream GetImage(String url) {
    InputStream is = null;
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url);
    if (cookieStr != null)
      httpGet.setHeader("Cookie", cookieStr);
    HttpResponse response;
    try {
      response = httpclient.execute(httpGet);
      if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
          // entity.isStreaming() can be used to check whether this is a file data stream
          is = entity.getContent();
        }
      }
    } catch (ClientProtocolException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return is;
  }

  private static void closeQuietly(CloseableHttpClient httpclient) {
    try {
      httpclient.close();
    } catch (IOException e) {
      // ignored: nothing useful can be done if closing fails
    }
  }
}
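The Post method treats the part of the URL before the `?` as the endpoint and the part after it as the form body. That splitting step can be looked at in isolation with the JDK alone. In the sketch below, PostUrlSketch, splitEndpointAndBody, and the example.com login URL are hypothetical; the choice to return an empty body when there is no `?` is my assumption about a sensible fallback, not something the original project specifies.

```java
// Hypothetical helper: separates a "url?form-data" string into its two parts.
class PostUrlSketch {
    // Returns { endpoint, body }. Only the first '?' is used as the separator,
    // so a body that itself contains '?' is kept whole.
    static String[] splitEndpointAndBody(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return new String[] { url, "" }; // no form data present
        }
        return new String[] { url.substring(0, q), url.substring(q + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitEndpointAndBody("http://example.com/login?user=a&pwd=b");
        System.out.println(parts[0]); // prints: http://example.com/login
        System.out.println(parts[1]); // prints: user=a&pwd=b
    }
}
```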


3. A simple Test class.

package com.debughao.down;

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.debughao.bean.Course;

public class Test {

  public static void main(String[] args) {
    // Cookie string of a logged-in session
    HttpUtils http = new HttpUtils("stat_uuid=1436867409341663197461; uname=qq_rwe4zg5t; uid=3812752; code=LZ8XF1; "
        + "authcode=b809MIxLGp8syQcnuAAdIT9PuCEH2%2FuiyvRuuLALSxb6z6iGoM3xcihNJKzHK%2BAZWzVIGFAW0QrBYiSLmHN1qnhi0YQLmBeWeqkJHXh5xsoylWuRCFmRDJZyUtAGr3U; "
        + "level_id=3; is_expire=0; domain=debughao; stat_fromWebUrl=; stat_ssid=1439813138264;"
        + " connect.sid=s%3A5xux57xcLyCBheevR40DUa0beJD_ok-S.0aTnwfjSvm7A49zydLGbtXy7vdCGfH7lB7MwmZURppQ; "
        + "QINGCLOUDELB=37e16e60f0cd051b754b0acf9bdfd4b5d562b81daa2a899c46d3a1e304c7eb2b|VcWiq|VcWiq; "
        + "_ga=GA1.2.889563867.1436867384; _gat=1; Hm_lvt_f3c68d41bda15331608595c98e9c3915=1438945833,1438947627,1438995076,1438995133;"
        + " Hm_lpvt_f3c68d41bda15331608595c98e9c3915=1439015591; MECHAT_LVTime=1439015591174; MECHAT_CKID=cookieVal=006600143686858016573509; "
        + "undefined=; stat_isNew=0");
    // Read the course page URL from standard input
    Scanner sc = new Scanner(System.in);
    String url = sc.nextLine();
    sc.close();
    String res = http.Get(url);
    if (res == null) {
      // Get returns null when the request fails
      System.err.println("Could not fetch " + url);
      return;
    }
    Document doc = getDocByRes(res);
    List<Course> videos = getVideoList(doc);
    // Print the title of every lesson
    for (Course video : videos) {
      System.out.println(video.getLinkText());
    }
    // Fetch each lesson page and print its video URL
    for (Course video : videos) {
      String res2 = http.Get(video.getLinkHref());
      if (res2 == null) {
        continue; // skip lessons whose page could not be fetched
      }
      getVideoLink(getDocByRes(res2));
    }
  }

  private static Document getDocByRes(String res) {
    return Jsoup.parse(res);
  }

  /** Collects the title and URL of every link inside the lesson list. */
  public static List<Course> getVideoList(Element doc) {
    List<Course> courses = new ArrayList<Course>();
    Elements results1 = doc.getElementsByClass("lessonvideo-list");
    String title = doc.getElementsByTag("title").text();
    System.out.println(title);
    for (Element element : results1) {
      Elements links = element.getElementsByTag("a");
      for (Element link : links) {
        Course course = new Course();
        course.setLinkHref(link.attr("href"));
        course.setLinkText(link.text());
        courses.add(course);
      }
    }
    return courses;
  }

  /** Prints the src attribute of the first <source> element in the page. */
  public static void getVideoLink(Document doc) {
    Elements results2 = doc.select("source");
    String mp4Links = results2.attr("src");
    System.out.println(mp4Links);
  }
}


4. Here is the output of a run (the first line is the course URL typed as input):

http://www.jikexueyuan.com/course/1748.html
自定义 View 基础和原理-极客学院
1.编写自己的自定义 View(上)
2.编写自己的自定义 View(下)
3.加入逻辑线程
4.提取和封装自定义 View
5.在 
http://cv3.jikexueyuan.com/201508082007/99549fa37069a39a2e128278ee60768c/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4
http://cv3.jikexueyuan.com/201508082007/a068be74f7f31900e128f109523b0925/course/1501-1600/1557/video/4279_b_h264_sd_960_540.mp4
http://cv3.jikexueyuan.com/201508082008/bf216e06770e9a9b0adda34ea4d01dfc/course/1501-1600/1557/video/4280_b_h264_sd_960_540.mp4
http://cv3.jikexueyuan.com/201508082008/75b51573a75458848136e61e848d1ae7/course/1501-1600/1557/video/4281_b_h264_sd_960_540.mp4
http://cv3.jikexueyuan.com/201508082008/ca20fad3e1bc622aa64bbfa7d2b768dd/course/1501-1600/1557/video/5159_b_h264_sd_960_540.mp4

 Open Xunlei, create a new task with one of these links, and the video downloads.
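When saving several of these files, it helps to have a local file name for each link. As a small sketch, assuming the last path segment of the URL is a reasonable name, the JDK's java.net.URI is enough. FileNameSketch and fileNameOf are hypothetical names, and this helper is not part of the project above.

```java
import java.net.URI;

// Hypothetical helper: derives a local file name from a video URL.
class FileNameSketch {
    // Returns the last segment of the URL path, or "" if there is no path.
    static String fileNameOf(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String url = "http://cv3.jikexueyuan.com/201508082007/99549fa37069a39a2e128278ee60768c"
                + "/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4";
        System.out.println(fileNameOf(url)); // prints: 4278_b_h264_sd_960_540.mp4
    }
}
```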