If a URL contains a quote how do you specify the rel=canonical value?

joedevon · Oct 10, 2009 · Viewed 13.8k times

Say the path of your URL is:

/thisisa"quote/helloworld/

Then how do you create the rel=canonical URL?

Is this kosher?

<link rel="canonical" href="/thisisa&quot;/helloworld/" />

UPDATE

To clarify, I'm getting a form submission and need to convert part of the query string into the URL. So the steps are:

  1. .htaccess does the redirect
  2. PHP processes a directory as a query string.
  3. The query string will be dynamically inserted into the:
    • Title,
    • Description,
    • Keywords
    • Canonical URL.
    • Spit back into the form's input box

So I need to know which processing has to be done at each step of the way. As a first cut, this is my take:

  • Title: htmlspecialchars($rawQuery)
  • Description: htmlspecialchars($rawQuery)
  • Keywords: htmlspecialchars($rawQuery)
  • Canonical URL: This is the tricky part. It must match the URL that .htaccess redirects to, but even so, I think the raw query is unsafe because quotes can cause JavaScript injection. I'm worried about urlencode($rawQuery): since it's coming from the URL, wouldn't it already be URL-encoded?
  • Spit back into form: htmlspecialchars($rawQuery)

Answer

Gumbo · Oct 12, 2009

You have to split your question into two:

Do I need to encode the double quotation mark character in the URL path?

Yes, the quotation mark character (U+0022) is not allowed unencoded in a URL and must be percent-encoded as %22.

Do I need to encode the double quotation mark character in an HTML attribute value?

It depends on how you declare the attribute value:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

  • If you’re using the double quotation mark character to declare the attribute value (attr="value"), then you must encode any double quotation mark character inside the attribute value with a character reference (&quot;, &#34; or &#x22;).
  • If you’re using the single quotation mark character (U+0027) for your attribute value declaration (attr='value'), then you don’t need to encode the quotation mark character. But it’s recommended to do so.

And since you have a slash and a double quotation mark in your attribute value, the third case (using no quotes at all) is not applicable:

In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
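The two quoted cases can be sketched in PHP. This is only a minimal illustration: $value is a made-up string, and it is echoed back into a form input in the way the question describes. Passing ENT_QUOTES makes htmlspecialchars encode both kinds of quotation mark, so the same escaped string works with either delimiter.

```php
<?php
// Hypothetical user input containing both kinds of quotation marks.
$value = 'say "hello" and it\'s done';

// ENT_QUOTES encodes " as &quot; and ' as &#039;.
$escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

// Safe whether the attribute is delimited by double or single quotes.
echo '<input type="text" name="q" value="' . $escaped . '" />', "\n";
echo "<input type='text' name='q' value='" . $escaped . "' />", "\n";
```

Without ENT_QUOTES, older PHP versions leave the single quotation mark untouched, which is only safe for attributes delimited by double quotes.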

Now bringing both answers together

Since a double quotation mark is not allowed unencoded in a URL (but the single quotation mark is!), you can use the following to encode the path segments of your URL path:

$path = '/thisisa"quote/helloworld/';
$path = implode('/', array_map('rawurlencode', explode('/', $path)));

And if you want to put that URL path in an HTML attribute, use the htmlspecialchars function to encode any remaining special HTML characters:

echo '<link rel="canonical" href="' . htmlspecialchars($path) . '" />';
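The question also asks what happens when the path arrives already URL-encoded. One possible way to handle that, sketched here as an assumption and not as part of the answer above, is to decode each segment before re-encoding it, so that an encoded path is not encoded twice. The function name canonicalLinkTag is hypothetical, and the sketch needs PHP 7.4 or later for the arrow function.

```php
<?php
/**
 * Hypothetical helper: build a canonical <link> tag from a URL path.
 * Each segment is decoded and then re-encoded, so raw and already
 * percent-encoded input produce the same result.
 */
function canonicalLinkTag(string $path): string
{
    // Split first, so an encoded slash (%2F) inside a segment stays encoded.
    $segments = explode('/', $path);
    $encoded = array_map(
        static fn(string $segment): string => rawurlencode(rawurldecode($segment)),
        $segments
    );
    $href = implode('/', $encoded);

    // Second layer: escape the URL for use inside an HTML attribute.
    return '<link rel="canonical" href="'
        . htmlspecialchars($href, ENT_QUOTES, 'UTF-8')
        . '" />';
}

echo canonicalLinkTag('/thisisa"quote/helloworld/'), "\n";
echo canonicalLinkTag('/thisisa%22quote/helloworld/'), "\n";
```

Both calls should print the same tag with %22 in the href. The decode step assumes that any %XX sequence in the input really is a percent-encoding; if raw input can contain a literal percent sign followed by two hex digits, it would be misread, so pick one representation at the point where the path enters the script.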