Java - Building a Web Crawler over Raw Sockets


    A prerequisite for learning crawler development is a solid understanding of the HTTP protocol!
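
    To get a feel for how little an HTTP request actually is, here is a minimal sketch of a hand-written GET over plain HTTP. This is my own illustrative example (it assumes a reachable plain-HTTP host such as example.com on port 80), not part of the crawler shown later:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class MiniHttpGet {
        public static void main(String[] args) throws Exception {
            // Plain HTTP on port 80 -- no TLS involved, so a bare Socket is enough.
            try (Socket socket = new Socket("example.com", 80)) {
                String request = "GET / HTTP/1.1\r\n"
                        + "Host: example.com\r\n"
                        + "Connection: close\r\n" // server closes when done, ending our read loop
                        + "\r\n";                 // blank line = end of headers
                OutputStream os = socket.getOutputStream();
                os.write(request.getBytes(StandardCharsets.US_ASCII));
                os.flush();
                // Echo the raw response: status line, headers, then the HTML body.
                InputStream is = socket.getInputStream();
                byte[] buf = new byte[1024];
                int len;
                while ((len = is.read(buf)) != -1) {
                    System.out.write(buf, 0, len);
                }
            }
        }
    }

    Everything a browser does starts from exchanges like this one; the crawler below is the same idea, just over TLS.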

    Here's a free "HTTP kung-fu manual" for you!

    Preface:

    Some sites have very aggressive anti-crawling defenses; when you are just starting out, it is best to pick sites without anti-crawling measures!

    Now for the main course (the example URL here is the homepage of CSDN user v7911):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    import javax.net.ssl.SSLSocketFactory;

    public class TestHttpClient {

        Socket socket = null;
        String host = "blog.csdn.net"; // target host
        Integer port = 443;            // 443 because we use HTTPS; plain HTTP would use port 80

        public void createSocket() {
            try {
                // For plain HTTP, a regular Socket is enough:
                // socket = new Socket(host, port);
                // For HTTPS, the socket must come from the SSL socket factory,
                // which takes care of the TLS handshake:
                socket = SSLSocketFactory.getDefault().createSocket(host, port);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public void communicate() {
            // The request line is "METHOD path HTTP-version" -- mind the single spaces.
            // We talk to the server directly (no proxy), so the path is the origin-form
            // "/v7911", not the absolute URL.
            StringBuilder sb = new StringBuilder("GET /v7911 HTTP/1.1\r\n");
            // Request headers. Note that DevTools entries such as "Status Code" or
            // "Remote Address" are response metadata, and ":method:"/":path:" are
            // HTTP/2 pseudo-headers -- none of them belong in an HTTP/1.1 request.
            sb.append("Host: blog.csdn.net\r\n");
            sb.append("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n");
            sb.append("Accept-Encoding: identity\r\n"); // ask for an uncompressed body so it can be printed directly
            sb.append("Accept-Language: zh-CN,zh;q=0.9\r\n");
            sb.append("Cache-Control: max-age=0\r\n");
            sb.append("Sec-Fetch-Dest: document\r\n");
            sb.append("Sec-Fetch-Mode: navigate\r\n");
            sb.append("Sec-Fetch-Site: none\r\n");
            sb.append("Sec-Fetch-User: ?1\r\n");
            sb.append("Upgrade-Insecure-Requests: 1\r\n");
            sb.append("User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36\r\n");
            // Ask the server to close the connection after the response; without this,
            // a keep-alive connection would make the read loop below block until timeout.
            sb.append("Connection: close\r\n");
            // A blank line ends the header section.
            sb.append("\r\n");
            String content = sb.toString();
            System.out.println("Outgoing request:");
            System.out.println(content);
            System.out.println("#######################################################");
            System.out.println("Received response:");
            try {
                // Send the request
                OutputStream os = socket.getOutputStream();
                os.write(content.getBytes(StandardCharsets.US_ASCII));
                os.flush();
                // Read the response until the server closes the connection
                InputStream is = socket.getInputStream();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] bytes = new byte[1024];
                int len;
                while ((len = is.read(bytes)) != -1) {
                    baos.write(bytes, 0, len);
                }
                // The bytes are decoded as UTF-8 here. Beware: the right charset depends
                // on what the site actually returns (check its Content-Type header)!
                System.out.println(new String(baos.toByteArray(), StandardCharsets.UTF_8));
                socket.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            TestHttpClient client = new TestHttpClient();
            client.createSocket();
            client.communicate();
        }
    }
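
    One caveat in the code above is the hard-coded UTF-8 decoding. A more robust approach is to split the header block from the body and read the charset out of the Content-Type response header. The sketch below is my own illustration of that idea; the class name ResponseParser, the decodeBody helper, and the UTF-8 fallback are assumptions, not part of the original program:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ResponseParser {

        // Split a raw HTTP/1.1 response into headers and body, then decode the body
        // with the charset declared in Content-Type (falling back to UTF-8).
        static String decodeBody(byte[] raw) {
            // Decode loosely as ISO-8859-1 first, just to locate the blank line
            // that separates the header block from the body.
            String loose = new String(raw, StandardCharsets.ISO_8859_1);
            int split = loose.indexOf("\r\n\r\n");
            if (split < 0) {
                return loose; // no header/body separator found; return everything as-is
            }
            String headers = loose.substring(0, split);
            Matcher m = Pattern.compile("(?i)charset=([\\w-]+)").matcher(headers);
            Charset cs = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
            // Re-decode only the body bytes with the detected charset.
            return new String(raw, split + 4, raw.length - split - 4, cs);
        }

        public static void main(String[] args) {
            // A tiny hand-made response, just to exercise the parser.
            String sample = "HTTP/1.1 200 OK\r\n"
                    + "Content-Type: text/html; charset=utf-8\r\n"
                    + "\r\n"
                    + "<html>hello</html>";
            System.out.println(decodeBody(sample.getBytes(StandardCharsets.UTF_8)));
        }
    }

    Decoding the raw bytes as ISO-8859-1 first is deliberate: that charset maps each byte to exactly one char, so the position of the blank line in the string equals its offset in the byte array.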

    Finally, a plug for my personal crawler project on GitHub: https://github.com/baldyoung/Spider
