How to correctly extract text from nodes is one of the most popular questions we see, and it is almost invariably made more difficult by misuse of Nokogiri's "searching" methods.
Nokogiri supports using CSS and XPath selectors. These are equivalent:
doc.at('p').text # => "foo"
doc.at('//p').text # => "foo"
doc.search('p').size # => 2
doc.search('//p').size # => 2
The CSS selectors are extended with many of jQuery's CSS extensions for convenience.
`at` and `search` are generic versions of `at_css` and `at_xpath`, and of `css` and `xpath`, respectively. Nokogiri tries to determine whether a CSS or XPath selector is being passed in. It's possible to write a selector that fools `at` or `search`, so occasionally it will guess wrong, which is why the more specific versions of the methods exist. I use the generic versions almost always, and only reach for a specific version if I think Nokogiri will misunderstand. This practice falls under the first entry in "Three Virtues": laziness.
If you are searching for one specific node and want its text, use `at` or one of its `at_css` or `at_xpath` variants:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p').text # => "foo"
`at` is equivalent to `search(...).first`, so you could use the longer-to-type version, but why?
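If the extracted text comes back concatenated after using `search`, `css` or `xpath`, that is because those methods return a NodeSet, and `text` on a NodeSet joins the text of every node into a single string. Use `map(&:text)` instead of calling `text` directly on the result: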
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
See the `text` documentation for NodeSet and Node for additional information.