{"id":4586,"date":"2016-06-22T02:31:49","date_gmt":"2016-06-22T02:31:49","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=4586"},"modified":"2016-06-22T02:31:49","modified_gmt":"2016-06-22T02:31:49","slug":"fixing-overlapping-subtitles-srt-files-node","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/fixing-overlapping-subtitles-srt-files-node\/","title":{"rendered":"Fixing overlapping subtitles from SRT files in Node"},"content":{"rendered":"<p>If you get access to subtitles, you may find that they repeat overlapping text, which is a real pain if you just want a transcript. The repeated text is pretty valuable in the newer subtitle formats like WebVTT, as you can highlight words as they are being spoken, but for transcript processing it&#8217;s not helpful.<\/p>\n<pre>\n3309\n00:51:04,309 --> 00:51:09,150\nyou know much of the talk\nso thank you very much\n\n3310\n00:51:09,150 --> 00:51:09,550\nso thank you very much\n\n\n3311\n00:51:09,550 --> 00:51:12,570\nso thank you very much\nfor more information please\n\n3312\n00:51:12,570 --> 00:51:12,970\nfor more information please\n\n\n3313\n00:51:12,970 --> 00:51:16,290\nfor more information please\nvisit www.freddyandeddy.com see\n\n3314\n00:51:16,290 --> 00:51:16,690\nvisit www.freddyandeddy.com see\n\n\n3315\n00:51:16,690 --> 00:51:22,690\nvisit www.freddyandeddy.com see\ndots UK\n<\/pre>\n<p>The first step is to write code to parse the file. Each entry starts with a sequence number, then a time window, and then the text, so it lends itself to a state machine (which also covers the case where a line of the transcript is itself purely numeric).<\/p>\n<pre lang=\"javascript\">\n\nfunction process(lines) {\n  let line0 = \/^\\d+$\/;\n  let line1 = \/^\\d+:\\d+:\\d+,\\d+ --> \\d+:\\d+:\\d+,\\d+$\/;\n\n  let states = [\"line0\", \"line1\", \"text\"];\n  let processors = [x => null, x => null, x => x];\n  let nexts = [x => !!x.match(line0), x => !!x.match(line1), x => x === ''];\n  let 
transitions = [1, 2, 0];\n  \n  let idx = 0;\n  let stateIdx = 0;\n\n  let result = [];\n\n  while (idx < lines.length) {\n    let line = lines[idx];\n   \n    let thisLineResult = processors[stateIdx](line);\n    if (thisLineResult !== null &#038;&#038; thisLineResult !== \"\") {\n      result.push(thisLineResult);\n    }\n\n    if (nexts[stateIdx](line)) {\n      stateIdx = transitions[stateIdx];\n    }\n\n    idx++;\n  }\n\n  return result;\n}\n<\/pre>\n<p>Once you run this, you'll get a list of just the lines of text from the subtitles, with the sequence numbers and timestamps stripped out.<\/p>\n<p>We still need to get rid of the duplicate text. The next step is to define a function that detects the text repeated from one section to the next, shown here:<\/p>\n<pre lang=\"javascript\">\n\/\/ Returns the longest prefix of b that also appears somewhere in a.\nfunction findOverlap(a, b) {\n  if (b.length === 0) {\n    return \"\";\n  }\n\n  if (a.endsWith(b)) {\n    return b;\n  }\n\n  if (a.indexOf(b) >= 0) {\n    return b;\n  }\n\n  \/\/ No match yet: drop the last character of b and try again.\n  return findOverlap(a, b.substring(0, b.length - 1));\n}\n<\/pre>\n<p>Once we've done this, we can reconstitute the subtitle text without anything that overlaps. This function should join everything together (with spaces). The tricky part is to avoid treating overlapping whitespace as a duplicate, or a single character that just happens to overlap (the final word of one sentence and the first word of the next both starting with \"s\", for example).<\/p>\n<p>To address these issues I've arbitrarily chosen to require an overlap of five characters or more. 
When adding the space between the segments, it must go at the beginning of the new segment, since it is a new character and wouldn't show up in the overlap detection.<\/p>\n<pre lang=\"javascript\">\n\/\/ subtitles holds the contents of the SRT file as a string\nlet textLines = process(subtitles.split(\"\\n\"));\n\nfunction filterDuplicateText(lines) {\n  let idx = 0;\n  let text = lines[0];\n  while (idx < lines.length - 1) {\n    let overlap = \n      findOverlap(lines[idx], lines[idx + 1]);\n\n    if (overlap.length >= 5) {\n      let nonOverlap = lines[idx + 1].substring(overlap.length);\n      if (nonOverlap.length > 0) {\n        text += ' ' + nonOverlap;\n      }\n    } else {\n      text += ' ' + lines[idx + 1];\n    }\n\n    idx++;\n  }\n\n  return text;\n}\n\nlet transcript = filterDuplicateText(textLines);\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A script to fix subtitle files with duplicated text<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4],"tags":[302,388],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4586"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=4586"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/4586\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=4586"}],"wp:term":[{"taxonomy":"category","
embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=4586"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=4586"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}