{"id":3882,"date":"2016-04-24T02:15:44","date_gmt":"2016-04-24T02:15:44","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=3882"},"modified":"2016-04-24T02:15:44","modified_gmt":"2016-04-24T02:15:44","slug":"get-nested-text-within-tag-beautiful-soup","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/get-nested-text-within-tag-beautiful-soup\/","title":{"rendered":"Get all nested text within a tag with Beautiful Soup"},"content":{"rendered":"<p>The &#8220;Beautiful Soup&#8221; library in Python lets you parse HTML pages.<\/p>\n<p>It does some things a little weirdly if you&#8217;re used to JavaScript. To filter the document, you can use &#8220;find_all&#8221;, which gives you a list of tags matching some condition (&#8220;find&#8221; returns only the first match). These values are tags, so you can search within them directly. If you loop over a single tag instead, you get its children, which may be text nodes rather than tags, and you have to use &#8220;parent&#8221; to get back to something that is actually useful.<\/p>\n<p>You can then do a &#8220;find_all&#8221; with string=True on each element you found, to filter to its nested elements that are bits of text.<\/p>\n<pre lang=\"python\">\nfrom bs4 import BeautifulSoup\n\n# i is the page number, assumed to be defined earlier\npath = 'pages\/talk' + str(i) + '.html'\nwith open(path) as f:\n  soup = BeautifulSoup(f, 'html.parser')\n\ndef getTexts():\n  # find_all returns every tag with the class; find returns only the first\n  for hit in soup.find_all(attrs={\"class\": \"transcript-text-content\"}):\n    # string=True matches every text node nested under the tag\n    yield \"\".join(hit.find_all(string=True))\n\nprint(\"\".join(getTexts()))\n<\/pre>\n<p>In my tests, the output has newlines between the elements. The join itself adds nothing, so these come from whitespace text nodes in the source HTML and depend on your data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to get the text nested within a DOM element in 
Python<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4],"tags":[82,447,495],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3882"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=3882"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3882\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=3882"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=3882"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=3882"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}