I am using Python and trying to capture the HTTP(S) traffic between my computer and a site: every outgoing request and incoming response, including images, external calls, and so on.
I have tried to find this network traffic from within my hit_site function, but I'm not finding the information.

    import requests
    from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

    def hit_site(url):
        print(url)
        # No stream=True, so the body can be read more than once below.
        r = requests.get(url)
        print(r.status_code)
        print(r.encoding)
        print(r.headers)          # response headers
        print(r.request.headers)  # headers sent with the request
        # r.json() is left out: it raises on a non-JSON (HTML) body.
        for line in r.iter_lines():
            print(line)
        soup = BeautifulSoup(r.text, "html.parser")
        return soup

    hit_site("http://www.google.com")

An example of the type of information that I would like to capture is the following (I used Fiddler2 to get it; all of this and more came from visiting groupon.com):
| # | Result | Protocol | Host | URL | Body | Caching | Content-Type | Process | Comments | Custom |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 200 | HTTP | www.groupon.com | / | 23,236 | private, max-age=0, no-cache, no-store, must-revalidate | text/html; charset=utf-8 | chrome:6080 | | |
| 7 | 200 | HTTP | www.groupon.com | /homepage-assets/styles-6fca4e9f48.css | 6,766 | public, max-age=31369910 | text/css; charset=UTF-8 | chrome:6080 | | |
| 8 | 200 | HTTP | Tunnel to | img.grouponcdn.com:443 | 0 | | | chrome:6080 | | |
| 9 | 200 | HTTP | img.grouponcdn.com | /deal/gsPCLbbqioFVfvjT3qbBZo/The-Omni-Mount-Washington-Resort_01-960x582/v1/c550x332.jpg | 94,555 | public, max-age=315279127; Expires: Fri, 18 Oct 2024 22:20:20 GMT | image/jpeg | chrome:6080 | | |
| 10 | 200 | HTTP | img.grouponcdn.com | /deal/d5YmjhxUBi2mgfCMoriV/pE-700x420/v1/c220x134.jpg | 17,832 | public, max-age=298601213; Expires: Mon, 08 Apr 2024 21:35:06 GMT | image/jpeg | chrome:6080 | | |
| 11 | 200 | HTTP | www.groupon.com | /homepage-assets/main-fcfaf867e3.js | 9,604 | public, max-age=31369913 | application/javascript | chrome:6080 | | |
| 12 | 200 | HTTP | www.groupon.com | /homepage-assets/locale.js?locale=en_US&country=US | 1,507 | public, max-age=994 | application/javascript | chrome:6080 | | |
| 13 | 200 | HTTP | www.groupon.com | /tracky | 3 | | application/octet-stream | chrome:6080 | | |
| 14 | 200 | HTTP | www.groupon.com | /cart/widget?consumerId=b577c9c2-4f07-11e4-8305-0025906127fe | 17 | private, max-age=0, no-cache, no-store, must-revalidate | application/json; charset=utf-8 | chrome:6080 | | |
| 15 | 200 | HTTP | www.googletagmanager.com | /gtm.js?id=GTM-B76Z | 39,061 | private, max-age=911; Expires: Wed, 22 Oct 2014 20:48:14 GMT | text/javascript; charset=UTF-8 | chrome:6080 | | |
I would greatly appreciate any ideas on how to capture this network traffic using Python.
dpkt is an extensive library (written in Python) for parsing network packets, including TCP traffic, and it supports decoding the packets involved in the SSL handshake. Another tool for reading and decoding capture files from Python is pypcapfile.
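As a rough illustration of the dpkt approach, the sketch below lists the plain-HTTP requests found in an existing capture file. It rests on several assumptions: the capture is a classic pcap file with Ethernet framing, each request fits in a single TCP segment (no stream reassembly is attempted), and the function name `http_requests_in_pcap` is made up for this example. Encrypted HTTPS payloads are simply skipped.

```python
import socket

try:
    import dpkt  # third-party package (pip install dpkt)
except ImportError:  # lets this module load even where dpkt is absent
    dpkt = None


def http_requests_in_pcap(pcap_path):
    """Return (src_ip, dst_ip, method, host, uri) for each HTTP request found.

    Hypothetical helper: assumes an Ethernet/IPv4 pcap file and that every
    request fits in one TCP segment.
    """
    if dpkt is None:
        raise RuntimeError("dpkt is required: pip install dpkt")
    found = []
    with open(pcap_path, "rb") as f:
        for _timestamp, frame in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(frame)  # assumes Ethernet framing
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP):
                continue  # skip anything that is not IPv4 (ARP, IPv6, ...)
            tcp = ip.data
            if not isinstance(tcp, dpkt.tcp.TCP) or not tcp.data:
                continue  # skip non-TCP packets and empty segments
            try:
                request = dpkt.http.Request(tcp.data)
            except dpkt.dpkt.UnpackError:
                continue  # payload is not a complete HTTP request
            found.append((socket.inet_ntoa(ip.src), socket.inet_ntoa(ip.dst),
                          request.method, request.headers.get("host", ""),
                          request.uri))
    return found
```

The capture file itself has to come from somewhere else, for example a packet capture tool writing a pcap file while the browser loads the page; calling `http_requests_in_pcap("capture.pcap")` (a placeholder path) would then return one tuple per request it could parse.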
Note that decrypting SSL traffic, including the payload data, requires the private keys. This is a problem for a third-party server you don't control, such as Google, and significant effort is required to work around it. One approach is to set up a proxy with a known private key to play man-in-the-middle, and to install its self-signed CA certificate into your local trust store so that the browser accepts the certificates the proxy presents.
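From the Python side, the man-in-the-middle setup could look roughly like the sketch below, which routes requests calls through a locally running intercepting proxy and trusts that proxy's CA certificate. The proxy address, the certificate file name `mitm-ca.pem`, and both function names are assumptions for illustration only; the proxy (Fiddler, as used in the question, is one such tool) does the actual logging. Keep in mind that requests fetches only the URL it is given, not the images and scripts a browser would load.

```python
def mitm_request_kwargs(proxy_host="127.0.0.1", proxy_port=8080,
                        ca_cert_path="mitm-ca.pem"):
    """Build keyword arguments that send requests traffic through a proxy.

    All defaults are placeholders: use the address your proxy listens on and
    the path to the CA certificate that the proxy exported.
    """
    proxy_url = "http://{}:{}".format(proxy_host, proxy_port)
    return {
        # Send both plain and TLS traffic through the intercepting proxy.
        "proxies": {"http": proxy_url, "https": proxy_url},
        # Trust the proxy's CA so its re-signed certificates validate.
        "verify": ca_cert_path,
    }


def fetch_via_proxy(url, **proxy_options):
    """Fetch url through the proxy described by proxy_options."""
    import requests  # third-party; imported lazily so the helper above stays dependency-free
    return requests.get(url, **mitm_request_kwargs(**proxy_options))
```

With the proxy running, something like `fetch_via_proxy("https://www.google.com", proxy_port=8888)` would make the request show up, decrypted, in the proxy's session list.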