How to extract text within a string of text

Brandon picture Brandon · Aug 17, 2011 · Viewed 57.7k times · Source

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.

What I would hope the output would look like would be something like:

297848936,297338191

168010496

etc...

Answer

aevanko picture aevanko · Aug 17, 2011

Here is a very flexible VBA answer using the regex object. What the function does is extract every single sub-group match it finds (stuff inside the parenthesis), separated by whatever string you want (default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

You would call it like this, assuming that first string is in A1:

=RegexExtract(A1,"gi[|](\d+)[|]")

Since this looks for all occurance of "gi|" followed by a series of numbers and then another "|", for the first line in your question, this would give you this result:

297848936, 297338191

Just run this down the column and you're all done!

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String

RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)

For i = 0 To allMatches.count - 1
    For j = 0 To allMatches.Item(i).submatches.count - 1
        result = result & (separator & allMatches.Item(i).submatches.Item(j))
    Next
Next

If Len(result) <> 0 Then
    result = Right$(result, Len(result) - Len(separator))
End If

RegexExtract = result

End Function