I'm a regexp noob and I'm trying to split paragraphs into sentences. In my language we use quite a few abbreviations (like bl.a.) in the middle of sentences, so I have concluded that I need to look for punctuation marks that are followed by a single space and then a word that starts with a capital letter, like:
[sentence1]...anymore. However...[sentence2]
So a paragraph like:
Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang. Det er ikke en bureaukratisk lovtekst blandt så mange andre.
Should end in this output:
[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang.
[1] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.
and NOT this:
[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v.
[1] => i forbindelse med afskedigelser af større omfang.
[2] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.
I have found a solution that does the first part of this with the positive lookbehind feature:
$regexp = '/(?<=[.!?] | [.!?][\'"])/';
and then
$sentences = preg_split($regexp, $paragraph, -1, PREG_SPLIT_NO_EMPTY);
which is a great starting point, but splits way too many times because of the many abbreviations.
I have tried to do this:
(?<=[.!?]\s[A-Z] | [.!?][\'"])
to target every occurrence of either
. or ! or ?
followed by a space and a capital letter, but that did not work.
Does anyone know if there is a way to accomplish what I am trying to do?
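A Unicode-aware regexp for splitting sentences: (?<=[.?!;])\s+(?=\p{Lu})
It matches whitespace that is preceded by . ? ! or ; (lookbehind) and followed by an uppercase letter (lookahead). Because the capital letter is only looked at, not consumed, it stays at the start of the next sentence. Putting the capital letter inside the lookbehind, as in your attempt, places the split point after that letter rather than before it. In PHP, add delimiters and the u modifier so that \p{Lu} also matches capitals such as Æ, Ø and Å.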
Explained demo here: http://regex101.com/r/iR7cC8
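As a rough sketch, this is how the pattern could be used with preg_split on the paragraph from the question. The function name split_sentences is made up for illustration, and the fallback on failure is an assumption about what you would want, not part of the original answer.

```php
<?php
// Hypothetical helper (the name is not from the original post).
function split_sentences(string $paragraph): array
{
    // Split on whitespace that follows . ? ! or ; and precedes an uppercase letter.
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8, so \p{Lu}
    // also matches capitals such as Æ, Ø and Å.
    $pattern = '/(?<=[.?!;])\s+(?=\p{Lu})/u';

    $parts = preg_split($pattern, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on error (for example invalid UTF-8 input);
    // assumption: in that case we hand back the paragraph unsplit.
    return $parts === false ? [$paragraph] : $parts;
}

$paragraph = 'Der er en lang og bevæget forhistorie bag lov om varsling m.v. '
    . 'i forbindelse med afskedigelser af større omfang. '
    . 'Det er ikke en bureaukratisk lovtekst blandt så mange andre.';

$sentences = split_sentences($paragraph);
print_r($sentences);
```

Here "m.v. i" is not split because the next word starts with a lowercase letter, so the result has the two sentences you asked for. An abbreviation that happens to be followed by a capitalized word (a proper noun, for instance) would still be split, so this is a heuristic rather than a complete solution.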