Requests in Python returns an error, while opening the link manually works perfectly


import requests

a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')

b = requests.get(a)

print(b.text)
print(b.url)

If you copy the printed URL and paste it into a browser, the site opens with no problem, but when I use requests.get I get some kind of token error. Is there anything I can do?

Via requests.get I get the URL back, but not the data I see when opening it manually. The response says: <html><head><TITLE>TESS -- Error</TITLE></head><body>


edited Apr 20, 2017 at 12:16

asked Apr 20, 2017 at 12:11

Testing man


Thanks, fixed it

– Testing man

Apr 20, 2017 at 12:16

For me it returns "This search session has expired." in the browser, as well as from the python code. Most likely the search engine uses tokens, specified in the headers. You could see headers in the network tab (chrome).

– awesoon

Apr 20, 2017 at 12:18
Soon, even if you change coca-cola to pepsi or something?

– Testing man

Apr 20, 2017 at 12:25


1 Answer


First of all, make sure you follow the website's Terms of Use and usage policies.

This is a little bit more complicated than it may seem. You need to maintain a certain state throughout the web-scraping session. You'll also need an HTML parser, like BeautifulSoup, along the way:

from urllib.parse import parse_qs, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


SEARCH_TERM = 'coca-cola'

with requests.Session() as session:
    # update() keeps the session's default headers and only overrides User-Agent
    session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'})

    # get the current search state
    response = session.get("https://tmsearch.uspto.gov/")
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]

    # follow the link, as a browser would, before searching
    session.get(urljoin(response.url, link))

    # parse only the query string, so the path never ends up inside a key name
    state = parse_qs(urlparse(link).query)['state'][0]

    # perform a search
    response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
        'f': 'toc',
        'state': state,
        'p_search': 'search',
        'p_s_All': '',
        'p_s_ALL': SEARCH_TERM + '[COMB]',
        'a_default': 'search',
        'a_search': 'Submit'
    })

    # print search results
    soup = BeautifulSoup(response.content, "html.parser")

    print(soup.find("font", color="blue").get_text())

    table = soup.find("th", text="Serial Number").find_parent("table")
    for row in table('tr')[1:]:
        print(row('td')[1].get_text())

For demonstration, it prints all the serial number values from the first search results page.
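The two fragile points in this flow, pulling the state token out of a link and noticing when the site has returned its error page instead of results, can be checked offline. What follows is only a minimal sketch using the standard library: the helper names `extract_state` and `looks_like_error_page` are hypothetical, the sample link is built from the query string in the question's URL, and the title check is a heuristic based on the error markup quoted in the question, not a documented behavior of the site.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_state(href):
    """Return the 'state' query parameter of a link, or None if it is absent."""
    query = urlparse(href).query          # drop the path, keep only "a=b&c=d"
    values = parse_qs(query).get("state")  # parse_qs also percent-decodes values
    return values[0] if values else None


def looks_like_error_page(html):
    """Heuristic (assumption): treat a page whose <title> mentions 'Error' as a failure."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return bool(match) and "error" in match.group(1).lower()


# Sample link assembled from the parameters in the question's URL.
href = "/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1"
print(extract_state(href))

# Error markup quoted in the question.
print(looks_like_error_page("<html><head><TITLE>TESS -- Error</TITLE></head><body>"))
```

Calling a check like `looks_like_error_page(response.text)` before searching for the results table would turn an expired state token into an explicit failure rather than an `AttributeError` raised on a `None` result from `soup.find`.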