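Python Programming: Redirects in the requests Library

Introduction to HTTP Redirects

HTTP redirects are a basic but important part of web development: they let a server send a client's request on to a different location. The sections below cover them in detail.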

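Redirect Basics

1. What is a redirect

      A redirect happens when a client requests a URL and the server returns a special response telling the client to request a different URL instead.

2. How it works

  1. The client sends a request to the original URL

  2. The server returns a 3xx status code and a Location header containing the new URL

  3. The client automatically sends a request to the new URL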
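
The three steps above can be reproduced locally with the standard library alone. The following is only a sketch: the handler class, the /old and /new paths, and the fetch helper are made up for illustration, and the client follows the redirect by hand so that each step stays visible.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /old redirects to /new, anything else returns a body."""

    def do_GET(self):
        if self.path == "/old":
            # Step 2: answer with a 3xx status code and a Location header.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"new page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(port, path):
    """Send one GET without following redirects; return (status, Location, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Step 1: request the original URL.
status, location, _ = fetch(port, "/old")
# Step 3: issue a second request to the Location target.
final_status, _, final_body = fetch(port, location)

server.shutdown()
server.server_close()
print(status, location, final_status, final_body)
```

A real client would also resolve a relative Location value against the original URL and cap the number of hops.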

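Redirect Status Codes

Status code | Name | Description
300 | Multiple Choices | Several options are available; the target is chosen manually (rarely used)
301 | Moved Permanently | Permanent redirect; the resource has moved to a new location for good
302 | Found | Temporary redirect; the resource is temporarily served from a different URL
303 | See Other | Temporary redirect; the client should fetch the new URL with GET (common after a POST)
307 | Temporary Redirect | Temporary redirect; the client must keep the original request method (POST, PUT, etc.)
308 | Permanent Redirect | Permanent redirect; the client must keep the original request method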
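
As a quick reference, the table can be turned into a small lookup helper. This is an illustrative sketch: describe_redirect and the three sets are names invented here, and the grouping simply mirrors the table rather than any library API.

```python
from http import HTTPStatus

# Grouping taken from the table above (hypothetical helper, not a library API).
REDIRECT_CODES = {300, 301, 302, 303, 307, 308}
PERMANENT = {301, 308}
METHOD_PRESERVING = {307, 308}


def describe_redirect(status_code):
    """Summarize a redirect status code as a small dict."""
    if status_code not in REDIRECT_CODES:
        raise ValueError(f"{status_code} is not one of the redirect codes listed")
    return {
        "name": HTTPStatus(status_code).phrase,
        "permanent": status_code in PERMANENT,
        "keeps_method": status_code in METHOD_PRESERVING,
    }


for code in sorted(REDIRECT_CODES):
    print(code, describe_redirect(code))
```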

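Types of Redirects

1. Permanent redirects (301/308)

  • Characteristics

    • Browsers cache the redirect mapping

    • Search engines transfer ranking weight to the new URL

  • Use cases

    • URL structure changes during a site redesign

    • Domain changes

    • Upgrading from HTTP to HTTPS

2. Temporary redirects (302/303/307)

  • Characteristics

    • Browsers do not cache the redirect by default

    • Search engines keep the ranking weight on the original URL

  • Use cases

    • Temporary maintenance pages

    • A/B testing

    • Navigating to another page after a form submission

3. Special redirects

  • 303 See Other

    • Forces the follow-up request to use GET, even when the original request was a POST

    • Commonly used to show a result page after a form submission

  • 307/308

    • Keep the original request method (POST, PUT, DELETE, etc.)

    • 307 is temporary, 308 is permanent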

Ways to Implement Redirects

1. Server-side

  • HTTP headers

    HTTP/1.1 301 Moved Permanently
    Location: https://example.com/new-url
  • Common server configurations

    • Apache:

      Redirect 301 /old /new
    • Nginx:

      location /old {
        return 301 /new;
      }

2. HTML meta tag (not recommended)

<meta http-equiv="refresh" content="0; url=https://example.com/new">

3. JavaScript (not recommended)

window.location.href = "https://example.com/new";

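Redirect Best Practices

  1. Choose the status code deliberately

    • Use 301/308 for permanent moves

    • Use 302/307 for temporary moves

    • Use 303 to show a result page after a POST

  2. Avoid redirect chains

    • Several consecutive redirects hurt performance

    • Ideally, keep it to no more than 1-2 redirects

  3. HTTPS redirect

    nginx

    server {
      listen 80;
      server_name example.com;
      return 301 https://$host$request_uri;
    }
  4. Unify WWW and non-WWW

    nginx

    server {
      server_name www.example.com;
      return 301 https://example.com$request_uri;
    }
  5. Preserve the query string

    # $request_uri already contains the original path, so append only the query string
    return 301 https://example.com/new$is_args$args;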

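Redirects and SEO

  1. 301 redirects

    • Search engines generally pass the page's ranking weight (PageRank) on to the new URL

    • The transfer is not immediate; it may take roughly 3-6 months to complete

  2. 302 redirects

    • Search engines keep crawling the original URL

    • Ranking weight is generally not passed to the target

  3. Common issues

    • Avoid 302-redirecting the home page to an internal page

    • Check for and fix broken redirect chains

    • Monitor redirects with Google Search Console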

Handling Redirects in Code

1. Python Requests library

import requests

# Redirects are followed automatically by default (allow_redirects=True)
r = requests.get('http://example.com')

# Disable automatic redirect handling
r = requests.get('http://example.com', allow_redirects=False)
if r.is_redirect:  # a redirect status code with a Location header
    print("Redirects to:", r.headers['Location'])

2. Node.js

const http = require('http');

http.createServer((req, res) => {
  res.writeHead(302, {
    'Location': 'https://example.com/new'
  });
  res.end();
}).listen(3000);  // port 3000 is an arbitrary choice for this example

3. PHP

header('HTTP/1.1 301 Moved Permanently');
header('Location: https://example.com/new');
exit();

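Troubleshooting Redirects

  1. Tools

    • cURL: curl -v http://example.com

    • Chrome DevTools (Network tab)

    • Online redirect checker tools

  2. Common issues

    • Redirect loops (check the rule logic)

    • Lost HTTP method (use 307/308 instead of 302)

    • Lost query parameters (check whether the rule includes $request_uri)

  3. Performance impact

    • Each redirect adds at least one RTT (round-trip time)

    • The effect is more noticeable on mobile networks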
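
Redirect loops and overly long chains are easy to check for before deploying a rule set. The sketch below works on a plain dictionary of hypothetical rules (source path to target path); trace_redirects and its max_hops parameter are names invented for this example, not part of any server or library.

```python
def trace_redirects(rules, start, max_hops=5):
    """Follow `rules` from `start` and report what happens.

    Returns (chain, verdict), where verdict is "ok", "loop",
    or "too many hops".
    """
    chain = [start]
    while chain[-1] in rules:
        target = rules[chain[-1]]
        if target in chain:
            # The target was already visited: following it would never end.
            return chain + [target], "loop"
        chain.append(target)
        if len(chain) - 1 > max_hops:
            return chain, "too many hops"
    return chain, "ok"


print(trace_redirects({"/a": "/b", "/b": "/c"}, "/a"))
print(trace_redirects({"/a": "/b", "/b": "/a"}, "/a"))
```

Running a check like this against every source path in a rule set is a cheap way to catch a loop before users do.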

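Security Considerations

  1. Open redirect vulnerabilities

    • Never use a user-supplied URL directly as a redirect target

    • Example of the vulnerability:

      // Dangerous code
      header('Location: '.$_GET['url']);
  2. Defenses

    • Validate the target URL against an allowlist

    • Use relative paths or a fixed domain prefix

  3. HTTPS redirects

    • Avoid handling sensitive information on HTTP pages

    • Enable HSTS (HTTP Strict Transport Security)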
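
The allowlist defense against open redirects can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the ALLOWED_HOSTS values, the is_safe_redirect_target name, and the choice to accept only site-local absolute paths as relative targets. It is a starting point, not a complete defense.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}  # hypothetical allowlist


def is_safe_redirect_target(target):
    """Return True if `target` is a local path or an allowlisted http(s) URL."""
    if not target:
        return False
    # Browsers treat backslashes like slashes and drop tabs and newlines,
    # so reject these characters instead of trying to normalize them.
    if any(ch in target for ch in ("\\", "\t", "\r", "\n")):
        return False
    parsed = urlparse(target)
    if not parsed.scheme and not parsed.netloc:
        # Relative reference: accept only paths such as "/dashboard".
        return target.startswith("/") and not target.startswith("//")
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS


print(is_safe_redirect_target("/dashboard"))
print(is_safe_redirect_target("https://evil.example.net/"))
```

Rejecting anything that is not clearly safe, rather than trying to sanitize it, keeps the check easy to reason about.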

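Redirects in Modern Web Development

  1. Single-page application (SPA) routing

    • Front-end routing handles "virtual" redirects

    • The server still needs basic redirect configuration

  2. HTTP/2 server push

    • Can in principle reduce the latency cost of a redirect, though major browsers have since deprecated or removed support for it

    • Does not replace a sound redirect strategy

  3. CDN edge redirects

    • Redirect logic runs on the CDN nodes

    • Reduces load on the origin server

Using HTTP redirects correctly is essential for site performance, user experience, and SEO.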

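SessionRedirectMixin

SessionRedirectMixin is the core component of the Python Requests library that implements HTTP redirect logic. It gives the Session class its full redirect-handling capability.

Class Definition and Role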

class SessionRedirectMixin:
    def get_redirect_target(self, resp):
        """Receives a Response. Returns a redirect URI or ``None``"""
        # Due to the nature of how requests processes redirects this method will
        # be called at least once upon the original response and at least twice
        # on each subsequent redirect response (if any).
        # If a custom mixin is used to handle this logic, it may be advantageous
        # to cache the redirect location onto the response object as a private
        # attribute.
        if resp.is_redirect:
            location = resp.headers["location"]
            # Currently the underlying http module on py3 decode headers
            # in latin1, but empirical evidence suggests that latin1 is very
            # rarely used with non-ASCII characters in HTTP headers.
            # It is more likely to get UTF8 header rather than latin1.
            # This causes incorrect handling of UTF8 encoded location headers.
            # To solve this, we re-encode the location in latin1.
            location = location.encode("latin1")
            return to_native_string(location, "utf8")
        return None

    def should_strip_auth(self, old_url, new_url):
        """Decide whether Authorization header should be removed when redirecting"""
        old_parsed = urlparse(old_url)
        new_parsed = urlparse(new_url)
        if old_parsed.hostname != new_parsed.hostname:
            return True
        # Special case: allow http -> https redirect when using the standard
        # ports. This isn't specified by RFC 7235, but is kept to avoid
        # breaking backwards compatibility with older versions of requests
        # that allowed any redirects on the same host.
        if (
            old_parsed.scheme == "http"
            and old_parsed.port in (80, None)
            and new_parsed.scheme == "https"
            and new_parsed.port in (443, None)
        ):
            return False

        # Handle default port usage corresponding to scheme.
        changed_port = old_parsed.port != new_parsed.port
        changed_scheme = old_parsed.scheme != new_parsed.scheme
        default_port = (DEFAULT_PORTS.get(old_parsed.scheme, None), None)
        if (
            not changed_scheme
            and old_parsed.port in default_port
            and new_parsed.port in default_port
        ):
            return False

        # Standard case: root URI must match
        return changed_port or changed_scheme

    def resolve_redirects(
        self,
        resp,
        req,
        stream=False,
        timeout=None,
        verify=True,
        cert=None,
        proxies=None,
        yield_requests=False,
        **adapter_kwargs,
    ):
        """Receives a Response. Returns a generator of Responses or Requests."""

        hist = []  # keep track of history

        url = self.get_redirect_target(resp)
        previous_fragment = urlparse(req.url).fragment
        while url:
            prepared_request = req.copy()

            # Update history and keep track of redirects.
            # resp.history must ignore the original request in this loop
            hist.append(resp)
            resp.history = hist[1:]

            try:
                resp.content  # Consume socket so it can be released
            except (ChunkedEncodingError, ContentDecodingError, RuntimeError):
                resp.raw.read(decode_content=False)

            if len(resp.history) >= self.max_redirects:
                raise TooManyRedirects(
                    f"Exceeded {self.max_redirects} redirects.", response=resp
                )

            # Release the connection back into the pool.
            resp.close()

            # Handle redirection without scheme (see: RFC 1808 Section 4)
            if url.startswith("//"):
                parsed_rurl = urlparse(resp.url)
                url = ":".join([to_native_string(parsed_rurl.scheme), url])

            # Normalize url case and attach previous fragment if needed (RFC 7231 7.1.2)
            parsed = urlparse(url)
            if parsed.fragment == "" and previous_fragment:
                parsed = parsed._replace(fragment=previous_fragment)
            elif parsed.fragment:
                previous_fragment = parsed.fragment
            url = parsed.geturl()

            # Facilitate relative 'location' headers, as allowed by RFC 7231.
            # (e.g. '/path/to/resource' instead of 'http://domain.tld/path/to/resource')
            # Compliant with RFC3986, we percent encode the url.
            if not parsed.netloc:
                url = urljoin(resp.url, requote_uri(url))
            else:
                url = requote_uri(url)

            prepared_request.url = to_native_string(url)

            self.rebuild_method(prepared_request, resp)

            # https://github.com/psf/requests/issues/1084
            if resp.status_code not in (
                codes.temporary_redirect,
                codes.permanent_redirect,
            ):
                # https://github.com/psf/requests/issues/3490
                purged_headers = ("Content-Length", "Content-Type", "Transfer-Encoding")
                for header in purged_headers:
                    prepared_request.headers.pop(header, None)
                prepared_request.body = None

            headers = prepared_request.headers
            headers.pop("Cookie", None)

            # Extract any cookies sent on the response to the cookiejar
            # in the new request. Because we've mutated our copied prepared
            # request, use the old one that we haven't yet touched.
            extract_cookies_to_jar(prepared_request._cookies, req, resp.raw)
            merge_cookies(prepared_request._cookies, self.cookies)
            prepared_request.prepare_cookies(prepared_request._cookies)

            # Rebuild auth and proxy information.
            proxies = self.rebuild_proxies(prepared_request, proxies)
            self.rebuild_auth(prepared_request, resp)

            # A failed tell() sets `_body_position` to `object()`. This non-None
            # value ensures `rewindable` will be True, allowing us to raise an
            # UnrewindableBodyError, instead of hanging the connection.
            rewindable = prepared_request._body_position is not None and (
                "Content-Length" in headers or "Transfer-Encoding" in headers
            )

            # Attempt to rewind consumed file-like object.
            if rewindable:
                rewind_body(prepared_request)

            # Override the original request.
            req = prepared_request

            if yield_requests:
                yield req
            else:
                resp = self.send(
                    req,
                    stream=stream,
                    timeout=timeout,
                    verify=verify,
                    cert=cert,
                    proxies=proxies,
                    allow_redirects=False,
                    **adapter_kwargs,
                )

                extract_cookies_to_jar(self.cookies, prepared_request, resp.raw)

                # extract redirect url, if any, for the next loop
                url = self.get_redirect_target(resp)
                yield resp

    def rebuild_auth(self, prepared_request, response):
        """When being redirected we may want to strip authentication from the
        request to avoid leaking credentials. This method intelligently removes
        and reapplies authentication where possible to avoid credential loss.
        """
        headers = prepared_request.headers
        url = prepared_request.url

        if "Authorization" in headers and self.should_strip_auth(
            response.request.url, url
        ):
            # If we get redirected to a new host, we should strip out any
            # authentication headers.
            del headers["Authorization"]

        # .netrc might have more auth for us on our new host.
        new_auth = get_netrc_auth(url) if self.trust_env else None
        if new_auth is not None:
            prepared_request.prepare_auth(new_auth)

    def rebuild_proxies(self, prepared_request, proxies):
        """This method re-evaluates the proxy configuration by considering the
        environment variables. If we are redirected to a URL covered by
        NO_PROXY, we strip the proxy configuration. Otherwise, we set missing
        proxy keys for this URL (in case they were stripped by a previous
        redirect).

        This method also replaces the Proxy-Authorization header where
        necessary.

        :rtype: dict
        """
        headers = prepared_request.headers
        scheme = urlparse(prepared_request.url).scheme
        new_proxies = resolve_proxies(prepared_request, proxies, self.trust_env)

        if "Proxy-Authorization" in headers:
            del headers["Proxy-Authorization"]

        try:
            username, password = get_auth_from_url(new_proxies[scheme])
        except KeyError:
            username, password = None, None

        # urllib3 handles proxy authorization for us in the standard adapter.
        # Avoid appending this to TLS tunneled requests where it may be leaked.
        if not scheme.startswith("https") and username and password:
            headers["Proxy-Authorization"] = _basic_auth_str(username, password)

        return new_proxies

    def rebuild_method(self, prepared_request, response):
        """When being redirected we may want to change the method of the request
        based on certain specs or browser behavior.
        """
        method = prepared_request.method

        # https://tools.ietf.org/html/rfc7231#section-6.4.4
        if response.status_code == codes.see_other and method != "HEAD":
            method = "GET"

        # Do what the browsers do, despite standards...
        # First, turn 302s into GETs.
        if response.status_code == codes.found and method != "HEAD":
            method = "GET"

        # Second, if a POST is responded to with a 301, turn it into a GET.
        # This bizarre behaviour is explained in Issue 1704.
        if response.status_code == codes.moved and method == "POST":
            method = "GET"

        prepared_request.method = method

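Core Responsibilities

  1. Redirect target extraction: identify the redirect target from the response headers

  2. Request rebuilding: adjust the original request for the redirect

  3. Security handling: protect sensitive information, such as authentication headers, across redirects

  4. Flow control: manage the redirect chain and the maximum redirect limit

Core Methods

1. get_redirect_target(resp)

def get_redirect_target(self, resp):
    """Receives a Response. Returns a redirect URI or ``None``"""

Purpose: extract the redirect target URL from a response object

Implementation details

  • Checks resp.is_redirect to decide whether the response is a redirect

  • Reads the target URL from the Location header

  • Handles an encoding issue: the Location value, which Python 3 decodes as latin1, is re-encoded and interpreted as a UTF-8 string

  • Returns None when the response is not a redirect

Special handling

location = location.encode('latin1')  # undo the latin1 decoding done by Python 3's http module
return to_native_string(location, "utf8")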
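
The effect of that re-encoding step is easiest to see on a concrete value. The sketch below uses only built-in string methods; fix_location_header is a hypothetical stand-in for the two lines above, with str.decode playing the role of to_native_string.

```python
def fix_location_header(location):
    """Undo a latin1 decode, then interpret the same bytes as UTF-8."""
    return location.encode("latin1").decode("utf8")


original = "/caf\u00e9"  # "/café", sent by the server as UTF-8 bytes
# What the client sees when the header bytes are decoded as latin1:
as_seen_by_client = original.encode("utf8").decode("latin1")
repaired = fix_location_header(as_seen_by_client)

print(repr(as_seen_by_client))
print(repr(repaired))
```

The round trip is lossless because latin1 maps every byte value to exactly one character, so encoding back to latin1 recovers the original bytes.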

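2. should_strip_auth(old_url, new_url)

def should_strip_auth(self, old_url, new_url):
    """Decide whether Authorization header should be removed when redirecting"""

Purpose: decide whether the Authorization header should be removed when following a redirect

Security rules

  1. Strip authentication when the hostname changes

  2. Exception: keep authentication for an HTTP→HTTPS redirect on the standard ports (80→443)

  3. Otherwise, strip authentication when the scheme or the port changes (a default port that is omitted and one that is written out count as the same port)

Implementation logic

if old_parsed.hostname != new_parsed.hostname:
    return True
# Special case: http -> https on the standard ports
if (old_parsed.scheme == "http"
    and new_parsed.scheme == "https"
    and old_parsed.port in (80, None)
    and new_parsed.port in (443, None)):
    return False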
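
The rules can be tried out without any network traffic, since should_strip_auth only compares two URL strings. The snippet below assumes a reasonably recent requests release is installed (the method was added in 2.20.0); the URLs are placeholders.

```python
import requests

session = requests.Session()

cases = [
    ("https://example.com/a", "https://other.example.org/a"),  # hostname changes
    ("http://example.com/a", "https://example.com/a"),         # http -> https, default ports
    ("https://example.com/a", "https://example.com:8443/a"),   # port changes
]

results = [session.should_strip_auth(old, new) for old, new in cases]
for (old, new), strip in zip(cases, results):
    print(f"{old} -> {new}: strip Authorization = {strip}")
```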

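3. resolve_redirects(...)

def resolve_redirects(self, resp, req, stream=False, timeout=None,
                     verify=True, cert=None, proxies=None,
                     yield_requests=False, **adapter_kwargs):

Purpose: the core method that walks the redirect chain

Parameters

  • resp: the initial response object

  • req: the original PreparedRequest

  • yield_requests: controls whether the generator yields request objects or response objects

Workflow

  1. Initialize the history list

  2. Get the first redirect target

  3. Enter the redirect loop:

    • Copy the original request, then modify the copy

    • Check the redirect limit

    • Resolve relative redirect targets

    • Rebuild the method, headers, and authentication

    • Send the new request, or yield the intermediate request object

Key logic (simplified)

while url:
    # 1. Prepare the new request
    prepared_request = req.copy()

    # 2. Update the history
    hist.append(resp)
    resp.history = hist[1:]

    # 3. Check the redirect limit
    if len(resp.history) >= self.max_redirects:
        raise TooManyRedirects(...)

    # 4. Resolve the URL (relative targets are joined onto the current URL)
    parsed = urlparse(url)
    if not parsed.netloc:
        url = urljoin(resp.url, requote_uri(url))

    # 5. Rebuild the method and headers
    self.rebuild_method(prepared_request, resp)

    # 6. Handle cookies
    extract_cookies_to_jar(prepared_request._cookies, req, resp.raw)

    # 7. Send the request, or yield the request object
    if yield_requests:
        yield prepared_request
    else:
        resp = self.send(prepared_request, ...)
        yield resp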
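
The URL-handling step of the loop can be isolated with the standard library. This is a simplified sketch of that one step, not the library's code: resolve_location is a name chosen here, and the percent-encoding done by requote_uri is left out.

```python
from urllib.parse import urljoin, urlparse


def resolve_location(current_url, location, previous_fragment=""):
    """Turn a Location header value into an absolute URL (simplified)."""
    # Scheme-relative target ("//host/path"): reuse the current scheme.
    if location.startswith("//"):
        location = f"{urlparse(current_url).scheme}:{location}"

    parsed = urlparse(location)
    # Carry the earlier fragment forward when the new target has none.
    if not parsed.fragment and previous_fragment:
        parsed = parsed._replace(fragment=previous_fragment)
    location = parsed.geturl()

    # Relative target: resolve it against the URL that produced the redirect.
    if not parsed.netloc:
        location = urljoin(current_url, location)
    return location


print(resolve_location("https://example.com/a/b", "/new"))
print(resolve_location("https://example.com/a/b", "//cdn.example.com/x"))
print(resolve_location("https://example.com/a/b", "c?x=1"))
```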

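4. rebuild_method(prepared_request, response)

def rebuild_method(self, prepared_request, response):

Purpose: adjust the HTTP method according to the redirect response

Rules

  • 303 See Other: switch to GET (HEAD requests are left unchanged)

  • 302 Found: switch to GET (browser-compatible behavior; HEAD requests are left unchanged)

  • 301 Moved Permanently: POST becomes GET (for historical reasons)

  • 307/308: keep the original method

Implementation

# RFC 7231 6.4.4: a 303 is followed with GET
if response.status_code == codes.see_other and method != "HEAD":
    method = "GET"

# Browser-compatible behavior: a 302 is followed with GET
if response.status_code == codes.found and method != "HEAD":
    method = "GET"

# Historical reasons: a POST answered with 301 becomes GET
if response.status_code == codes.moved and method == "POST":
    method = "GET"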
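
These rules can be tabulated by calling rebuild_method on a prepared request together with a hand-built response object, which needs no network access. The helper name method_after_redirect and the URL are assumptions for this sketch; it presumes requests is installed.

```python
import requests


def method_after_redirect(method, status_code):
    """Return the method requests would use after a redirect with `status_code`."""
    session = requests.Session()
    prepared = requests.Request(method, "https://example.com/form").prepare()
    response = requests.Response()
    response.status_code = status_code  # only the status code matters here
    session.rebuild_method(prepared, response)
    return prepared.method


table = {
    (method, status): method_after_redirect(method, status)
    for method in ("POST", "PUT", "HEAD")
    for status in (301, 302, 303, 307, 308)
}

for (method, status), new_method in sorted(table.items()):
    print(f"{method} + {status} -> {new_method}")
```

Note the asymmetry for 301: only POST is rewritten to GET, while other methods such as PUT are kept.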

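5. rebuild_proxies(prepared_request, proxies)

def rebuild_proxies(self, prepared_request, proxies):

Purpose: rebuild the proxy configuration for the redirect target

Processing logic

  1. Parse the scheme of the new URL

  2. Re-resolve the proxy configuration, taking environment variables into account

  3. Handle proxy authentication

Key code

# Resolve the proxy configuration for the new URL
new_proxies = resolve_proxies(prepared_request, proxies, self.trust_env)

# Handle proxy authentication
if not scheme.startswith('https') and username and password:
    headers['Proxy-Authorization'] = _basic_auth_str(username, password)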

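Design Highlights

1. Security

  • Credential protection: should_strip_auth keeps credentials from leaking to a different host

  • Proxy safety: the Proxy-Authorization header is only added for non-HTTPS target URLs

2. RFC compliance

  • Follows RFC 7231's redirect rules, deviating only where browser behavior differs

  • Handles 303, 307, and 308 explicitly

3. Flexible flow control

  • The yield_requests parameter exposes the intermediate request objects

  • The generator design reduces memory use

4. Browser compatibility

  • Implements the common browser behavior of turning a 302 into a GET

  • Handles relative redirect targets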

Usage Examples

1. Basic usage

import requests

s = requests.Session()
response = s.get('http://example.com', allow_redirects=True)  # redirects are followed automatically

2. Handling redirects manually

import requests
from urllib.parse import urljoin

s = requests.Session()
resp = s.get('http://example.com', allow_redirects=False)

if resp.is_redirect:
    redirect_url = s.get_redirect_target(resp)
    print(f"Redirecting to: {redirect_url}")
    # The Location value may be relative, so resolve it against the current URL
    new_resp = s.get(urljoin(resp.url, redirect_url))

3. Inspecting the redirect chain

resp = s.get('http://example.com')
for hist_resp in resp.history:
    print(f"{hist_resp.status_code}: {hist_resp.url}")

Extension and Customization

1. Custom redirect policy

import requests
from requests.sessions import SessionRedirectMixin

class CustomRedirectMixin(SessionRedirectMixin):
    def should_strip_auth(self, old_url, new_url):
        # Implement custom credential-protection logic here
        return super().should_strip_auth(old_url, new_url)

class CustomSession(CustomRedirectMixin, requests.Session):
    pass

2. Monitoring redirects

import requests

class TrackingSession(requests.Session):
    def resolve_redirects(self, *args, **kwargs):
        for resp in super().resolve_redirects(*args, **kwargs):
            print(f"Redirect: {resp.url}")
            yield resp

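Performance Considerations

  1. Release connections promptly

    resp.content  # consume the body so the connection can be released
    resp.close()  # explicitly release the connection back to the pool
  2. Avoid long redirect chains

    • The default limit is 30 redirects (DEFAULT_REDIRECT_LIMIT)

    • It can be changed: session.max_redirects = 10

  3. Replaying the request body

    # Simplified: the library also requires a Content-Length or Transfer-Encoding header
    rewindable = prepared_request._body_position is not None
    if rewindable:
        rewind_body(prepared_request)

The design of SessionRedirectMixin shows how closely the Requests library tracks the details of HTTP: it follows the standards while also accommodating real browser behavior, and gives developers capable and safe redirect handling.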
