Determining chapter number in different types of text


I'm pulling titles from novel-related posts. The aim is to use a regex to determine which chapter(s) each post is about. Each site identifies chapters differently. Here are the most common cases:

$title = 'text chapter 25.6 text'; // c25.6
$title = 'text chapters 23, 24, 25 text'; // c23-25
$title = 'text chapters 23+24+25 text'; // c23-25
$title = 'text chapter 23, 25 text'; // c23 & 25
$title = 'text chapter 23 & 24 & 25 text'; // c23-25
$title = 'text c25.5-30 text'; // c25.5-30
$title = 'text c99-c102 text'; // c99-102
$title = 'text chapter 99 - chapter 102 text'; // c99-102
$title = 'text chapter 1 - 3 text'; // c1-3
$title = '33 text chapter 1, 2 text 3'; // c1-2
$title = 'text v2c5-10 text'; // c5-10
$title = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32

The chapter numbers are always listed in the title, just in different variations as displayed above.

What I have so far

So far, I have a regex that handles the single-chapter case, such as:

$title = '9 text chapter 25.6 text'; // c25.6 

Using this code (try ideone):

function get_chapter($text, $terms) {

    if (empty($text)) return;
    if (empty($terms) || !is_array($terms)) return;

    $values = false;

    $terms_quoted = array();
    foreach ($terms as $term)
        $terms_quoted[] = preg_quote($term, '/');

    // search for matches in $text
    // case-insensitive match; whitespace between the term and the number is optional
    if (preg_match('/('.implode('|', $terms_quoted).')\s*(\d+(\.\d+)?)/i', $text, $matches)) {
        if (!empty($matches[2]) && is_numeric($matches[2])) {
            $values = array(
                'term' => $matches[1],
                'value' => $matches[2]
            );
        }
    }

    return $values;
}

$text = '9 text chapter 25.6 text'; // c25.6
$terms = array('chapter', 'chapters');
$chapter = get_chapter($text, $terms);

print_r($chapter);

if ($chapter) {
    echo 'Chapter is: c'. $chapter['value'];
}

How do I make this work with the other examples listed above? Given the complexity of this question, I will bounty it 200 points when eligible.

 


You may use

$strs = ['hello 23 & 24', 'episode 1 - e 2', 'chapter 1 - chapter 2', 'text chapter 25.6 text', 'text chapters 23, 24, 25 text', 'text chapter 23, 25 text', 'text chapter 23 & 24 & 25 text', 'text c25.5-30 text', 'text c99-c102 text', 'text chapter 1 - 3 text', '33 text chapter 1, 2 text 3', 'text chapters 23, 24, 25, 29, 31, 32 text', 'c19 & c20', 'chapter 25.6 & chapter 29', 'chapter 25+c26', 'chapter 25 + 26 + 27'];
$terms = ['chapter', 'ch', 'episode', ''];

// Generate chapter pattern with capturing
$chapter_main_rx = "(?|" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? "(" . substr($term, 0, 1) . ")(?:" . substr($term, 1) . "s?)?" : "()";
}, $terms)) . ")\s*";

// Generate chapter pattern without capturing
$chapter_aux_rx = "(?:" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? substr($term, 0, 1) . "(?:" . substr($term, 1) . "s?)?" : "";
}, $terms)) . ")\s*";

// Define the main regex
$reg = "~$chapter_main_rx((\d+(?:\.\d+)?)(?:\s*[,&+-]\s*(?:$chapter_aux_rx)?(?3))*)~ui";

foreach ($strs as $s) {                 // Testing strings
    if (preg_match($reg, $s, $m)) {     // If there is a match
        echo $m[1] .                    // Take Group 1 value and add...
            preg_replace_callback(
                "~\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?)(?:\s*[&,+]\s*(?:$chapter_aux_rx)?(?1))+~ui",
                function ($x) use ($chapter_aux_rx) {
                    return !empty($x[1])
                        ? buildNumChain(preg_split("~\s*[&,+]\s*(?:$chapter_aux_rx)?~ui", $x[0]))
                        : "-";
                },
                $m[2]
            )                           // ... post-processed Group 2 value
            . "\n";
    }
}

function buildNumChain($arr) {
    $ret = "";
    $rngnum = "";
    for ($i = 0; $i < count($arr); $i++) {
        if (($i < count($arr) - 1) && $arr[$i] == $arr[$i+1] - 1) {
            if (empty($rngnum)) {
                $ret .= ($i == 0 ? "" : " & ") . $arr[$i];
            }
            $rngnum = $arr[$i];
            continue;
        } else if (!empty($rngnum) || $i == count($arr)) {
            $ret .= '-' . $arr[$i];
            $rngnum = "";
        } else {
            $ret .= ($i == 0 ? "" : " & ") . $arr[$i];
        }
    }
    return $ret;
}

See the PHP demo.

Main points

  • Match c or chapter/chapters together with the numbers that follow, capturing just c and the numbers
  • After a match is found, process Group 2, which contains the number sequence
  • In every <number>-c?<number> substring, strip the whitespace and the c before/in between the numbers
  • Post-process all ,/&/+-separated numbers with buildNumChain, which builds ranges out of consecutive numbers (whole numbers are assumed).
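The range-building step in the last point can also be sketched on its own. This is a minimal illustration and not the answer's buildNumChain: the names collapse_runs and is_whole are hypothetical, the input is assumed to be an already split, ascending list of number strings, and decimals such as 25.6 are assumed never to join a run.

```php
<?php
// Hypothetical helper: true when the string is a plain whole number such as "23".
function is_whole(string $v): bool
{
    return preg_match('~^\d+$~', $v) === 1;
}

// Hypothetical helper: collapse runs of consecutive whole numbers into "a-b"
// and join the resulting pieces with " & ".
function collapse_runs(array $nums): string
{
    $parts = [];
    $n = count($nums);
    $i = 0;
    while ($i < $n) {
        $j = $i;
        // Extend the run while the next value is exactly one more than the current one.
        while ($j + 1 < $n
            && is_whole($nums[$j]) && is_whole($nums[$j + 1])
            && (int)$nums[$j + 1] === (int)$nums[$j] + 1) {
            $j++;
        }
        $parts[] = $j > $i ? $nums[$i] . '-' . $nums[$j] : $nums[$i];
        $i = $j + 1;
    }
    return implode(' & ', $parts);
}

echo collapse_runs(['23', '24', '25', '29', '31', '32']), "\n"; // 23-25 & 29 & 31-32
echo collapse_runs(['25.6', '29']), "\n";                       // 25.6 & 29
```

Prefixing the result with c gives the c23-25 & 29 & 31-32 form asked for in the question.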

The main regex is

'~(c)(?:hapters?)?\s*((\d+(?:\.\d+)?)(?:\s*[,&+-]\s*(?:c(?:hapters?)?\s*)?(?3))*)~u'

See the regex demo.

Pattern details

  • (c) - Group 1: c
  • (?:hapters?)? - an optional hapter or hapters
  • \s* - 0+ whitespaces
  • ((\d+(?:\.\d+)?)(?:\s*[,&+-]\s*(?:c(?:hapters?)?\s*)?(?3))*) - Group 2:
    • (\d+(?:\.\d+)?) - Group 3: 1+ digits, followed with an optional sequence of . and 1+ digits
    • (?:\s*[,&+-]\s*(?:c(?:hapters?)?\s*)?(?3))* - zero or more sequences of
      • \s*[,&+-]\s* - a ,, &, + or - enclosed with optional 0+ whitespaces
      • (?:c(?:hapters?)?\s*)? - an optional sequence: c, chapter or chapters followed with 0+ whitespaces
      • (?3) - Group 3 pattern recursed / repeated
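To see what the two outer groups hold, the single-term pattern can be run against a few of the question's titles. This is a small sketch: the wrapper name chapter_groups is hypothetical, and the i flag is added here to mirror the ~ui flags used in the full code.

```php
<?php
// Hypothetical wrapper: returns [Group 1, Group 2] for the first match, or null.
function chapter_groups(string $title): ?array
{
    $rx = '~(c)(?:hapters?)?\s*((\d+(?:\.\d+)?)(?:\s*[,&+-]\s*(?:c(?:hapters?)?\s*)?(?3))*)~ui';
    return preg_match($rx, $title, $m) ? [$m[1], $m[2]] : null;
}

$samples = [
    'text chapter 25.6 text',
    'text c99-c102 text',
    'text chapter 99 - chapter 102 text',
    '33 text chapter 1, 2 text 3',
];

foreach ($samples as $s) {
    $g = chapter_groups($s);
    // Group 1 is the leading "c"; Group 2 is the raw, not yet normalized number sequence.
    printf("%-38s => [%s] [%s]\n", $s, $g[0], $g[1]);
}
```

Group 2 still contains the separators and any repeated c/chapter, for example 99 - chapter 102, which is why it needs post-processing. Note that the pattern has no word boundary, so any c directly followed by a number would match as well.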

When the regex matches, the Group 1 value is c, so it becomes the first part of the result. Then \s*-\s*(?:c(?:hapters?)?\s*)?|(\d+(?:\.\d+)?)(?:\s*[&,+]\s*(?:c(?:hapters?)?\s*)?(?1))+ is used inside preg_replace_callback. Its first alternative matches a - with any surrounding whitespace and an optional c, chapter or chapters (followed with 0+ whitespace chars), and the callback replaces it with a bare -. If Group 1 matches instead (the second alternative), the match is split with the \s*[&,+]\s*(?:c(?:hapters?)?\s*)? regex (it matches &, , or + between optional 0+ whitespaces, followed with an optional c, chapter or chapters and 0+ whitespaces), and the resulting array is passed to the buildNumChain function, which builds the resulting string.
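The two normalization steps described here can be tried separately. The sketch below hard-codes the c/chapter/chapters term rather than generating it from $terms, and the function names tidy_range and split_list are hypothetical.

```php
<?php
// Hypothetical helper: drop whitespace and any repeated term around a hyphen.
function tidy_range(string $seq): string
{
    return preg_replace('~\s*-\s*(?:c(?:hapters?)?\s*)?~ui', '-', $seq);
}

// Hypothetical helper: split a list on , & or +, swallowing an optional term
// that follows the separator.
function split_list(string $seq): array
{
    return preg_split('~\s*[&,+]\s*(?:c(?:hapters?)?\s*)?~ui', $seq);
}

echo tidy_range('99 - chapter 102'), "\n";                   // 99-102
echo implode('|', split_list('23 & c24+chapter 25')), "\n";  // 23|24|25
```

In the full code both steps run inside one preg_replace_callback call, so a title that mixes ranges and lists is handled in a single pass over Group 2.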
