An even smarter html_truncate tag
As you know, I am using Jekyll to generate this blog. When setting mine up, I read how Jack Moffitt set up his Jekyll installation and thought the idea of an html_truncate filter was pretty cool.
What does a truncate filter do? Just make a string shorter. Here is an example from the truncate documentation included in Ruby on Rails:
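To make the idea concrete, here is a minimal plain-Ruby sketch of character-based truncation. It is not the Rails helper itself; `naive_truncate` and its parameters are hypothetical names chosen for illustration, and the omission marker is assumed to count toward the length limit.

```ruby
# Hypothetical helper, not the Rails implementation: cut a string down to at
# most `length` characters, counting the trailing omission marker.
def naive_truncate(text, length: 30, omission: "...")
  return text if text.length <= length

  # Leave room for the omission marker; never slice with a negative size.
  keep = [length - omission.length, 0].max
  text[0, keep] + omission
end

puts naive_truncate("Once upon a time in a world far far away", length: 17)
# prints "Once upon a ti..."
```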
Now, this is all well and good unless you have some HTML in there:
Which is no good, since it has split our <b> tag in two. Worse would be if it got the whole <b> opening tag but missed the closing tag; the whole rest of the page would be bolded. You’ll also notice that this could cut words completely in half.
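Both failure modes are easy to reproduce with plain string slicing. The sample string and cut positions below are made up for illustration:

```ruby
html = "Once upon a <b>time</b> in a world far far away"

# Cutting after 14 characters lands in the middle of the opening tag.
split_tag = html[0, 14] + "..."
puts split_tag
# prints "Once upon a <b..."

# Cutting after 17 characters keeps the opening tag but loses the closing
# one, and also chops the word "time" in half.
unclosed_tag = html[0, 17] + "..."
puts unclosed_tag
# prints "Once upon a <b>ti..."
```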
What is needed is a truncate that won’t truncate tags or words, and won’t leave tags unclosed.
Here is Jack Moffitt’s html_truncate filter (from the GitHub commit):
His filter sends the string to the Hpricot HTML parser, which strips out all the HTML tags. It then splits the string into words and returns the first however many words were requested. To continue on with our previous example:
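The strip-then-count approach can be approximated in a few lines. This sketch swaps in a regular expression for the Hpricot parser so that it runs without extra gems, so it is only a rough stand-in and not Jack's filter; `strip_and_truncate` and its defaults are hypothetical.

```ruby
# Hypothetical stand-in for a strip-tags-then-truncate filter. A regex is a
# crude replacement for a real HTML parser and can be fooled by unusual markup.
def strip_and_truncate(html, max_words = 50, ellipsis = "...")
  # Replace each tag with a space so words in adjacent elements do not merge.
  text = html.gsub(/<[^>]*>/, " ")
  words = text.split
  return words.join(" ") if words.length <= max_words

  words.first(max_words).join(" ") + ellipsis
end

puts strip_and_truncate("Once upon a <b>time</b> in a world far far away", 5)
# prints "Once upon a time in..."
```

The tags are gone from the output, which is exactly the trade-off discussed next.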
So, we solved all of our complaints! No more broken up HTML tags and no more split words. But I wasn’t sure if I liked this result. Where did the HTML tags go? I wanted an HTML truncate that returned the first however many words while maintaining the HTML tags.
Here is my algorithm:
- Load in the HTML
- Traverse the loaded HTML looking for text nodes
- When a text node is found, count the number of words it has
- Once the limit is reached, remove all nodes that come after it.
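
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code from this post: it uses REXML, which ships with Ruby, instead of Nokogiri, so it only handles well-formed markup, and `truncate_html_words` and `prune_after_limit` are hypothetical names.

```ruby
require "rexml/document"

# Walk the tree in document order, spending the word budget on text nodes.
# Once the budget is used up, every later node is removed.
# Returns the number of words still allowed.
def prune_after_limit(node, remaining)
  # `children` returns a copy of the child list, so removing while
  # iterating is safe.
  node.children.each do |child|
    if remaining <= 0
      child.remove
    elsif child.is_a?(REXML::Text)
      text = child.value
      words = text.split
      if words.size > remaining
        # Keep leading whitespace so the text stays separated from what
        # comes before it, then cut on a word boundary.
        leading = text[/\A\s*/]
        child.value = leading + words.first(remaining).join(" ") + "..."
        remaining = 0
      else
        remaining -= words.size
      end
    elsif child.is_a?(REXML::Element)
      remaining = prune_after_limit(child, remaining)
    end
  end
  remaining
end

# Assumption: `html` is a well-formed fragment. It is wrapped in a temporary
# root element so fragments with several top-level nodes still parse.
def truncate_html_words(html, max_words)
  doc = REXML::Document.new("<root>#{html}</root>")
  prune_after_limit(doc.root, max_words)
  doc.root.children.map(&:to_s).join
end

puts truncate_html_words("<p>Once <b>upon a time</b> in a world</p><p>far away</p>", 3)
# prints "<p>Once <b>upon a...</b></p>"
```

Because nodes are removed from the parsed tree instead of characters being cut from a string, the tags that remain are always closed when the tree is serialized again.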
Here is my code:
Update: I have learned that there are some errors in this code. GitHub user Eleo has posted a working version. Thanks Eleo!
I used the Nokogiri HTML parser because I read that it was faster. (Now I am reading that this is no longer the case! Which one will I choose?)
And finally, here is what my html_truncate function does:
No split words, no broken HTML. Perfect!
Though, I ultimately decided not to use it, and went with Jack’s function. I liked how concise it made the resulting text with no <p>’s or <ul>’s to string it out. I still think my version is useful though.