Skip to content
This repository was archived by the owner on Nov 6, 2023. It is now read-only.

Fix how 'Location' is parsed from header text#6878

Merged
Hainish merged 4 commits intoEFForg:masterfrom
jeremyn:fix-location-header-parsing
Sep 16, 2016
Merged

Fix how 'Location' is parsed from header text#6878
Hainish merged 4 commits intoEFForg:masterfrom
jeremyn:fix-location-header-parsing

Conversation

@jeremyn
Copy link
Copy Markdown
Contributor

@jeremyn jeremyn commented Sep 14, 2016

This fixes the test problem identified in pull request #6871 , where this code apparently (?) never finds the Location header, for example:

(Pdb) self._headerRe
regex.Regex('(?P<name>\\S+?): (?P<value>.*?)\\r\\n', flags=regex.V0)
(Pdb) headerStr
'HTTP/1.1 301 Moved Permanently\nServer: Apache\nDate: Tue Jun  1 12:48:03 PDT 1999 PDT\nReferer: http://itunes.com/\nLocation: http://www.apple.com/itunes/?cid=OAS-US-DOMAINS-itunes.com\nContent-type: text/html\nContent-length: 311\n\n'
(Pdb) headers = self._headerRe.findall(headerStr)
(Pdb) headers
[]
(Pdb)

Note that if in fact location has been incorrectly False-y for a while, then it's possible the code immediately after the raise HTTPFetcherError line has not been run recently or ever, and this fix may reveal new bugs in it.

Pinging @fuglede @Hainish @J0WI for review.

if not location:
location = None
for piece in headerStr.split('\n'):
if piece.startswith('Location:'):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be careful here. Although the Location header is usually capitalized in this way, it is not always. See https://www.shodan.io/search?query=location%3A

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thanks. Also RFC 7230 says header field names are case-insensitive. This should be easy to fix.

Copy link
Copy Markdown
Member

@Hainish Hainish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM

location = None
for piece in headerStr.lower().split('\n'):
if piece.startswith('location:'):
location = ''.join(piece.split(': ')[1:])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, you might want to make the argument for the startswith call the same as the argument for the split call. This way, if the web server returns an improperly formatted header (lacking the space) it won't cause an error or unintended behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point again, I'll change it. Actually though it would be properly formatted. From the same RFC (bold mine):

Each header field consists of a case-insensitive field name followed by a colon (":"), optional leading whitespace, the field value, and optional trailing whitespace.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting - I've never seen headers lacking the leading whitespace. In that case you might want to change just line 383 to be:

location = ''.join(piece.split(':')[1:]).strip()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

''.join instead of ':'.join destroys the :. This wasn't noticeable with the much less likely :<space>, but very noticeable with : as in http://. (I made that mistake too.)

raise HTTPFetcherError("Pycurl fetch failed for '%s'" % newUrl)
elif httpCode in (301, 302, 303, 307):
# Parse out the headers and extract location, case-insensitively.
# If there are multiple location headers, pick the last one.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why remove the above comments?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I distrust comments in general. The code here is six lines long, it doesn't need two lines of comments.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I differ in opinion on comments, but I don't think it matters in this case.

location = None
for piece in headerStr.lower().split('\n'):
if piece.startswith('location:'):
location = ':'.join(piece.split(':')[1:]).strip()
Copy link
Copy Markdown
Member

@Hainish Hainish Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just spotted the source of another possible problem. Here, the location will always be a lowercased version of the URL, due to line 381. URLs should be treated as case-sensitive, though.

URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.

Ref: https://www.w3.org/TR/WD-html40-970708/htmlweb.html

Copy link
Copy Markdown
Contributor Author

@jeremyn jeremyn Sep 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right again, in fact that comes up in the https://itunes.com example because the Location there is http://www.apple.com/itunes/?cid=OAS-US-DOMAINS-itunes.com . I'll fix it tomorrow before tomorrow.

@Hainish
Copy link
Copy Markdown
Member

Hainish commented Sep 16, 2016

Great, lgtm

@Hainish Hainish merged commit 3d78fee into EFForg:master Sep 16, 2016
@fuglede
Copy link
Copy Markdown
Contributor

fuglede commented Sep 17, 2016

Does this mean that this is actually a bug in curl in general?

@jeremyn jeremyn deleted the fix-location-header-parsing branch September 17, 2016 14:44
@jeremyn
Copy link
Copy Markdown
Contributor Author

jeremyn commented Sep 17, 2016

No, I think the bug was entirely on our side because we failed to parse out the Location header. What would the curl bug be?

@fuglede
Copy link
Copy Markdown
Contributor

fuglede commented Sep 17, 2016 via email

@jeremyn
Copy link
Copy Markdown
Contributor Author

jeremyn commented Sep 17, 2016

Oh. Of course @J0WI has to confirm but I assume they meant that they saw Location with curl and so confirmed that the bug was in our stuff.

@fuglede
Copy link
Copy Markdown
Contributor

fuglede commented Sep 17, 2016

Ah, could be. I'm just getting errors altogether;

$ curl -I https://www.itunes.com
curl: (51) SSL: no alternative certificate subject name matches target host name 'www.itunes.com'
$ curl -I https://itunes.com
curl: (56) SSL read: error:00000000:lib(0):func(0):reason(0), errno 104

@jeremyn
Copy link
Copy Markdown
Contributor Author

jeremyn commented Sep 17, 2016

https://www.itunes.com has an invalid certificate, that's what #6871 is trying to fix.

I get something similar with curl -I https://itunes.com, no idea what that means. curl -v https://itunes.com shows the response headers though, including Location.

@fuglede
Copy link
Copy Markdown
Contributor

fuglede commented Sep 17, 2016

Looks like it's an iTunes thing: curl/curl#1020 (comment).

@jay
Copy link
Copy Markdown

jay commented Sep 17, 2016

I don't know if this helps but when I did a GET on it through Fiddler debugging proxy the server failed a compliance check:

X-HTTPPROTOCOL-VIOLATION: [ProtocolViolation] The Server did not return properly formatted HTTP Headers. HTTP headers should be terminated with CRLFCRLF. These were terminated with LFLF.

@Hainish
Copy link
Copy Markdown
Member

Hainish commented Sep 19, 2016

@jeremyn I get no Location header with curl -v

(.venv)user@personal:~$ curl -I https://www.itunes.com
curl: (51) SSL: no alternative certificate subject name matches target host name 'www.itunes.com'
(.venv)user@personal:~$ curl -v https://www.itunes.com
* Rebuilt URL to: https://www.itunes.com/
* Hostname was NOT found in DNS cache
*   Trying 17.172.224.35...
* Connected to www.itunes.com (17.172.224.35) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / AES128-GCM-SHA256
* Server certificate:
*    subject: 1.3.6.1.4.1.311.60.2.1.3=US; 1.3.6.1.4.1.311.60.2.1.2=California; businessCategory=Private Organization; serialNumber=C0806592; C=US; postalCode=95014; ST=California; L=Cupertino; street=1 Infinite Loop; O=Apple Inc.; OU=Internet Services; CN=itunes.com
*    start date: 2015-08-19 00:00:00 GMT
*    expire date: 2017-08-19 23:59:59 GMT
*    subjectAltName does not match www.itunes.com
* SSL: no alternative certificate subject name matches target host name 'www.itunes.com'
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
curl: (51) SSL: no alternative certificate subject name matches target host name 'www.itunes.com'

FYI, the error indicates that this host is using SAN to indicate which FQDNs the certificate is valid for, but the names listed in the x509 (itunes.com) didn't match the host name you specified (www.itunes.com). You can verify that with this command:

echo 1 | openssl s_client -connect www.itunes.com:443 | openssl x509 -noout -text

Looking at the x509v3 Subject Alternative Name field, we only get itunes.com and not www.itunes.com

@jeremyn
Copy link
Copy Markdown
Contributor Author

jeremyn commented Sep 19, 2016

@Hainish https://itunes.com is what redirects and has the Location header, not https://www.itunes.com .

@Hainish
Copy link
Copy Markdown
Member

Hainish commented Sep 20, 2016

Ah. Yeah in that case the SSL error see is strange. For other https hosts that redirect, curl -I doesn't give me that error:

(.venv)user@personal:~$ curl -I https://eff.org
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Tue, 20 Sep 2016 00:25:37 GMT
Content-Type: text/html
Content-Length: 178
Connection: keep-alive
Location: https://www.eff.org/
Strict-Transport-Security: max-age=31536000; includeSubdomains
X-Content-Type-Options: nosniff

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants