{"id":5400,"date":"2017-09-19T13:06:56","date_gmt":"2017-09-19T13:06:56","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=5400"},"modified":"2017-09-19T13:06:56","modified_gmt":"2017-09-19T13:06:56","slug":"full-text-search-within-closed-captions","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/full-text-search-within-closed-captions\/","title":{"rendered":"Full-Text Search within Closed Captions"},"content":{"rendered":"<p>YouTube automatically generates closed captions for videos. <a href=\"https:\/\/www.findlectures.com\">FindLectures.com<\/a> crawls these, and allows you to search for a phrase within a video and start playback where the phrase occurs.<\/p>\n<p>Machine-generated transcriptions include timestamps, but also many transcription errors. If we can obtain both the captions and a corrected transcript for a speech, the two can be aligned on the words that do match. In the spots where they differ, we can substitute the corrected wording from the transcript.<\/p>\n<p>In the example below, George W. 
Bush introduces the phrase &#8220;axis of evil&#8221; in a State of the Union address, and the search engine recognizes that this occurs about 13 minutes in:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"669\" height=\"540\" class=\"alignnone size-full wp-image-5405\" src=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbb98a1274.png\" alt=\"\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbb98a1274.png 669w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbb98a1274-300x242.png 300w\" sizes=\"(max-width: 669px) 100vw, 669px\" \/><\/p>\n<p>Captions are stored in the search index in a simplified version of the SRT closed-caption format:<\/p>\n<pre>00:13:00 word states like these and their\n00:13:23 terrorist allies constitute an axis of evil\n00:13:27 arming to threaten the peace of the world<\/pre>\n<p>For many famous speeches, crowd noise during the best-known passages causes errors in machine transcription. For very well-known speeches, transcripts are typically available, but timings are rarely included.<\/p>\n<p>This is the case for a famous speech by George H.W. 
Bush, which uses the phrase &#8220;Read my lips &#8211; no new taxes&#8221; &#8211; at this point in the video, the crowd cheers, rendering the last word or two unintelligible to a machine.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"702\" height=\"573\" class=\"alignnone size-full wp-image-5407\" src=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbce665848.png\" alt=\"\" srcset=\"https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbce665848.png 702w, https:\/\/www.garysieling.com\/blog\/wp-content\/uploads\/2017\/09\/img_59bfbce665848-300x245.png 300w\" sizes=\"(max-width: 702px) 100vw, 702px\" \/><\/p>\n<p>Machine transcriptions also commonly confuse words that sound alike (homophones and near-homophones), such as &#8220;code&#8221; and &#8220;coat&#8221;, or &#8220;word&#8221; and &#8220;world&#8221;.<\/p>\n<p>Closed captions often bold words as they are spoken, so the caption file may also repeat the same phrase several times:<\/p>\n<div>00:13:00 <b>word<\/b> states like these and their<br \/>\n00:13:02 word <b>states<\/b> like these and their<br \/>\n00:13:03 word states <b>like<\/b> these and their<br \/>\n00:13:04 <b>these<\/b> and their terrorist allies<br \/>\n00:13:06 these <b>and<\/b> their terrorist allies<br \/>\n00:13:07 these and <b>their<\/b> terrorist allies<\/div>\n<p>There are robust algorithms for this kind of text alignment; they were invented to aid in DNA sequencing. Sequencing a genome is like re-assembling a puzzle: it takes many small strands of DNA and recombines them by matching where they overlap.<\/p>\n<p>For this essay I&#8217;m using the Smith-Waterman algorithm, as there is a good implementation available on npm:<\/p>\n<pre>npm install igenius-smith-waterman --save\n<\/pre>\n<p>A DNA sequence is represented using the letters A, C, T, and G. 
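To make the alignment step concrete, here is a minimal Smith-Waterman sketch in plain JavaScript. It is written for illustration only; the `igenius-smith-waterman` package's actual API and scoring parameters may differ from this.

```javascript
// Minimal Smith-Waterman local alignment (illustrative scores, not the
// npm package's defaults). Builds a score matrix, then traces back from
// the best cell, inserting dashes where one side has no matching letter.
function align(a, b, match = 2, mismatch = -1, gap = -1) {
  const rows = a.length + 1, cols = b.length + 1;
  const H = Array.from({ length: rows }, () => new Array(cols).fill(0));
  let best = 0, bi = 0, bj = 0;
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const diag = H[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch);
      H[i][j] = Math.max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap);
      if (H[i][j] > best) { best = H[i][j]; bi = i; bj = j; }
    }
  }
  // Trace back from the best-scoring cell to recover the aligned strings.
  let left = "", right = "", i = bi, j = bj;
  while (i > 0 && j > 0 && H[i][j] > 0) {
    const score = H[i][j];
    if (score === H[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch)) {
      left = a[--i] + left; right = b[--j] + right;   // letters align
    } else if (score === H[i - 1][j] + gap) {
      left = a[--i] + left; right = "-" + right;      // gap in right string
    } else {
      left = "-" + left; right = b[--j] + right;      // gap in left string
    }
  }
  return { left, right };
}

console.log(align("ABCDEFG", "ABCDEFFG"));
// → { left: 'ABCDE-FG', right: 'ABCDEFFG' }
```

The same dynamic-programming table works whether the "letters" are DNA bases or the single-character word codes described below.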
Since alignment algorithms operate on letters rather than words, we need to build a mapping from letters to the words in the text:<\/p>\n<pre>A -&gt; the\nB -&gt; axis\nC -&gt; of\nD -&gt; evil\n<\/pre>\n<p>To increase the number of matches in the alignment, punctuation and accent marks are removed from the transcript, and all words are lower-cased.<\/p>\n<p>It is also important to include the caption timestamps in this dictionary. In DNA terms, these are like &#8220;mutations&#8221; we want to apply to the transcript.<\/p>\n<pre>E -&gt; 00:13:00\nF -&gt; 00:13:23\nG -&gt; 00:13:27\n<\/pre>\n<p>Our alphabet will be much larger than DNA&#8217;s four letters. A typical speech might include 500-2,000 unique terms, so it&#8217;s important to use an implementation that supports Unicode characters.<\/p>\n<p>When we run the alignment algorithm, we give it two sequences of letters, and it tells us how to turn the first string into the second, and vice versa. Where there is no match, it inserts dashes.<\/p>\n<pre>align('ABCDEFG', 'ABCDEFFG'):\nleft: ABCDE-FG\nright: ABCDEFFG\n<\/pre>\n<p>To re-construct the data, we iterate letter by letter, choosing timestamp tokens from the caption side and everything else from the transcript side, which gives us a result like this:<\/p>\n<pre>00:00:04 Thank you. Thank\n00:06:15 you very much. I\n00:07:27 have many friends to\n00:07:33 thank tonight. I thank the voters who\n00:07:37 supported me. I thank the gallant men who\n<\/pre>\n<p>There is one final improvement we can make &#8211; if a famous phrase spans two lines, it is difficult for a full-text search engine to find, e.g.:<\/p>\n<pre>00:00:04 Read my lips -\n00:00:40 No new taxes.\n00:01:05 Let me tell\n00:01:30 you more about the mission.\n<\/pre>\n<p>When re-constructing the text, each line can look ahead, pulling a few words from the next. 
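The look-ahead step can be sketched as follows. The `addLookahead` name, the `{ time, text }` line shape, and the three-word window are assumptions for illustration, not the repository's actual code.

```javascript
// Sketch: append the first few words of the following caption line to each
// line, so a phrase that spans a line break stays searchable on one line.
function addLookahead(lines, wordCount = 3) {
  return lines.map((line, i) => {
    const next = lines[i + 1];
    if (!next) return { ...line };          // last line: nothing to pull
    const peek = next.text.split(/\s+/).slice(0, wordCount).join(" ");
    return { time: line.time, text: line.text + " " + peek };
  });
}

const lines = [
  { time: "00:00:04", text: "Read my lips -" },
  { time: "00:00:40", text: "No new taxes." },
  { time: "00:01:05", text: "Let me tell" },
  { time: "00:01:30", text: "you more about the mission." },
];
console.log(addLookahead(lines)[0]);
// → { time: '00:00:04', text: 'Read my lips - No new taxes.' }
```

A fixed word window duplicates a little text in the index, but that is harmless for search: the duplicated words simply match on whichever line the searcher lands.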
This generally ensures that an entire phrase appears on a single line, and yields much better search results.<\/p>\n<pre>00:00:04 Read my lips - no new taxes\n00:00:40 no new taxes. Let me tell\n00:01:05 Let me tell you more about the mission.\n<\/pre>\n<p>The full code for this demonstration is available on GitHub:<\/p>\n<p><a href=\"https:\/\/github.com\/garysieling\/transcript-alignment.git\">https:\/\/github.com\/garysieling\/transcript-alignment.git<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>YouTube automatically generates closed captions for videos. FindLectures.com crawls these, and allows you to search for a phrase within a video and start playback where the phrase occurs.<\/p>\n<p>Machine-generated transcriptions include timestamps, but also many transcription errors. If we can obtain captions and a corrected transcript for a speech, these can be aligned using the words that do match. In the spots that differ, we can update the language with the corrected wording from the 
transcript.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[11],"tags":[98,498,517,569],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5400"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=5400"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/5400\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=5400"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=5400"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=5400"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}