How to use a regular expression to extract JSON fields?

James Cooper · Jan 16, 2013 · Viewed 111.6k times

Beginner RegExp question. I have lines of JSON in a text file, each with slightly different fields, but there are 3 fields I want to extract from each line if it has them, ignoring everything else. How would I use a regex (in EditPad or anywhere else) to do this?

Example:

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract URL, TITLE, and TAGS.

Answer

FrankieTheKneeMan · Jan 16, 2013
/"(url|title|tags)":"((\\"|[^"])*)"/i

I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title" or "tags". In regular expressions, parentheses create groups by default, and the pipe character alternates between options, like a logical 'or'. To match a literal parenthesis or pipe character instead, you'd have to escape it with a backslash.

":"

Another literal string.

(

The beginning of another group. (Group 2)

    (

Another group (3)

        \\"

The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.

        |

or...

        [^"]

Any single character except a double quote. The brackets denote a character class (or set), which is a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.

    )

End of group 3...

    *

The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times. Here, that lets the matcher consume anything that could be inside the double quotes of a JSON string, including escaped quotes.

)"

The end of group 2, and a literal ".
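To see why the \\" alternative matters, here is a minimal Python sketch. The input line is made up for illustration (it is not from the question), and the "naive" pattern is only there for contrast; it is not part of the answer's regex.

```python
import re

# The pattern from the answer, written as a Python raw string.
FIELD = re.compile(r'"(url|title|tags)":"((\\"|[^"])*)"', re.IGNORECASE)

# Hypothetical input: the title value contains escaped quotes (\").
line = r'"title":"He said \"hi\" twice",'

# Group 2 runs to the real closing quote, skipping over each \" pair.
value = FIELD.search(line).group(2)

# For contrast: without the \\" alternative, the match stops at the
# first quote it sees, even though that quote is escaped.
naive = re.search(r'"title":"([^"]*)"', line).group(1)

print(value)
print(naive)
```

The first print shows the whole value, while the naive version is cut off right after the first backslash.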

I've done a few non-obvious things here, that may come in handy:

  1. Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
  2. The i at the end of the expression makes it case insensitive.
  3. Group 1 contains the name of the captured field.
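As a rough illustration of those three points, the sketch below runs the pattern over a shortened copy of the example from the question using Python's re module (an assumption; EditPad or another engine would expose the groups differently). The names FIELD, sample and fields are just for this example.

```python
import re

# re.IGNORECASE plays the role of the trailing /i flag.
FIELD = re.compile(r'"(url|title|tags)":"((\\"|[^"])*)"', re.IGNORECASE)

# Shortened copy of the example data from the question.
sample = '''"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing"],
"num_saves":24'''

# Group 1 is the field name, group 2 is the value without its quotes.
fields = {m.group(1): m.group(2) for m in FIELD.finditer(sample)}

# Case insensitivity: an upper-case field name still matches.
upper = FIELD.search('"URL":"x"').group(1)

for name, value in fields.items():
    print(name, "=>", value)
```

Note that "tags" is not picked up by this first version of the pattern, because its value is an array rather than a quoted string.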

EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.

Your new Regex is:

/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i

All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:

\[(S(,S)*)?\]

Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma-separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as matching exactly 0 or 1 times.
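The S substitution can be done literally in code. This is a small sketch in Python, where S and ARRAY are names chosen for this example; the pattern is built with an f-string so it reads just like \[(S(,S)*)?\].

```python
import re

# S: one quoted string, escaped quotes allowed (quotes included in the match).
S = r'"(\\"|[^"])*"'

# \[(S(,S)*)?\] with S spliced in.
ARRAY = rf'\[({S}(,{S})*)?\]'

def is_string_array(text):
    """True if the whole text is a bracketed, comma-separated list of strings."""
    return re.fullmatch(ARRAY, text) is not None

print(is_string_array('[]'))                     # empty list: the ? allows it
print(is_string_array('["orwell","writing"]'))   # two strings
print(is_string_array('["orwell",]'))            # trailing comma
print(is_string_array('["orwell", "writing"]'))  # space after the comma
```

The last call is worth noticing: the pattern makes no allowance for whitespace, so it only fits compact JSON like the example in the question.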

With our same S Notation, here's the whole dirty Regular Expression:

/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
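Here is a hedged sketch of the full expression applied to one line of input, again assuming Python's re module. The line is a shortened, single-line version of the question's example, and S, FIELD and found are names made up for this illustration.

```python
import re

S = r'"(\\"|[^"])*"'
FIELD = re.compile(rf'"(url|title|tags)":({S}|\[({S}(,{S})*)?\])', re.IGNORECASE)

# Shortened, single-line version of the example from the question.
line = ('"url":"http://www.netcharles.com/orwell/essays.htm",'
        '"domain":"netcharles.com",'
        '"title":"Orwell Essays & Journalism Section - Charles\' George Orwell Links",'
        '"tags":["orwell","writing","literature"],'
        '"index":2931')

# Group 1 is the field name; group 2 is now the raw value, including
# its surrounding quotes or brackets.
found = {m.group(1): m.group(2) for m in FIELD.finditer(line)}

for name, raw in found.items():
    print(name, "=>", raw)
```

Because the string alternative now sits inside group 2 together with its quotes, the captured value keeps its delimiters; strip them afterwards if you only want the contents.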

If it helps to see it in action, here's a view of it in action.