I want to get the content from this website.

If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page.

I thought the developer of the website had made some blocks for this.

Question

How do I fake a browser visit by using python requests or command wget?

Provide a User-Agent header:

import requests

url="http://www.ichangtou.com/#company:data_000008.html"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

I used fake UserAgent.

How to use:

from fake_useragent import UserAgent
import requests
   

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

Try doing this, using firefox as fake user agent (moreover, it’s a good startup script for web scraping with the use of cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    page = opener.open(uri)
    page.addheaders = [('User-agent', 'Mozilla/5.0')]
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"

The root of the answer is that the person asking the question needs to have a JavaScript interpreter to get what they are after. What I have found is I am able to get all of the information I wanted on a website in json before it was interpreted by JavaScript. This has saved me a ton of time in what would be parsing html hoping each webpage is in the same format.

So when you get a response from a website using requests really look at the html/text because you might find the javascripts JSON in the footer ready to be parsed.

Answer

You need to create a header with a proper formatted User agent String, it server to communicate client-server.

You can check your own user agent Here.

Example

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0

Third party Package user_agent 0.1.9

I found this module very simple to use, in one line of code it randomly generates a User agent string.

from user_agent import generate_user_agent, generate_navigator
from pprint import pprint

print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'

print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'

pprint(generate_navigator())

# {'app_code_name': 'Mozilla',
#  'app_name': 'Netscape',
#  'appversion': '5.0',
#  'name': 'firefox',
#  'os': 'linux',
#  'oscpu': 'Linux i686 on x86_64',
#  'platform': 'Linux i686 on x86_64',
#  'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
#  'version': '41.0'}

pprint(generate_navigator_js())

# {'appCodeName': 'Mozilla',
#  'appName': 'Netscape',
#  'appVersion': '38.0',
#  'platform': 'MacIntel',
#  'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}

I use pyuser_agent. this package use get user agnet

import pyuser_agent
import requests

ua = pyuser_agent.UA()

headers = {
      "User-Agent" : ua.random
}
print(headers)

uri = "https://github.com/THAVASIGTI/"
res = requests.request("GET",uri,headers=headers)
print(res)

console out

{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)'}
<Response [200]>

I had a similar issue but I was unable to use the UserAgent class inside the fake_useragent module. I was running the code inside a docker container

import requests
import ujson
import random

response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize'])))
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list)-1)]

I targeted the endpoint used in the module. This solution still gave me a random user agent however there is the possibility that the data structure at the endpoint could change.

This is how, I have been using a random user agent from a list of nearlly 1000 fake user agents

from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.ANDROID.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]   

user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)

# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()

user_agent_random = user_agent_rotator.get_random_user_agent()

Example

print(user_agent_random)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

For more details visit this link