数据抓取之反爬虫规则：使用代理和http头信息

之前说个数据抓取遇到的一个坎就是验证码，这次来说另外两个。我们知道web系统可以拿到客户请求信息，那么针对客户请求的频率，客户信息都会做限制。如果一个ip上的客户访问过于频繁，或者明显是用程序抓取，肯定是要禁止的。本文针对这两个问题说下解决方法。

其实针对上述两个问题，解决方法已经很成熟了，无非就是买代理和在http请求中加入头信息伪装为浏览器请求。本文说下具体操作

使用代理

首先购买代理，这个网上卖代理的很多，自己搜索，而且价格也不贵。

其次就是在程序中使用代理：

  HttpClient httpclient = new DefaultHttpClient();
  httpclient.getCredentialsProvider().setCredentials(
          new AuthScope("代理ip", "代理端口"),
          new UsernamePasswordCredentials("代理用户名","代理密码"));

http请求加入头信息

同样在http请求中加入头信息也是很少代码搞定：

  HttpGet httpget = new HttpGet(url);
  // 加入头信息
  httpget.addHeader("Accept", "text/html");
  httpget.addHeader("Accept-Charset", "utf-8");
  httpget.addHeader("Accept-Encoding", "gzip");
  httpget.addHeader("Accept-Language", "zh-CN,zh");
  httpget.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
    
  HttpResponse response = httpclient.execute(httpget);

post请求同样的方式：

  HttpPost httppost = new HttpPost(url);        
  List<NameValuePair> formparams = param; 
  UrlEncodedFormEntity uefEntity = new UrlEncodedFormEntity(formparams, reqEncoding);
  httppost.setEntity(uefEntity);
  // 加入头信息
  httppost.addHeader("Accept", "text/html");
  httppost.addHeader("Accept-Charset", "utf-8");
  httppost.addHeader("Accept-Encoding", "gzip");
  httppost.addHeader("Accept-Language", "en-US,en");
  httppost.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");

  HttpResponse response = httpclient.execute(httppost);