Nokogiri's "Parsing an HTML/XML Document" tutorial is an easy introduction to the subject, so start there, then return to this page to fill in some gaps.
Nokogiri's basic parsing attempts to clean up a malformed document. It sometimes adds missing closing tags, and it may add extra tags, such as a wrapping element or a DTD, to make the document well-formed.
This is an example of telling Nokogiri that the document being parsed is a complete HTML file, and Nokogiri discovering it isn't:
require 'nokogiri'
doc = Nokogiri::HTML('<body></body>')
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body></body></html>
Notice that the DTD statement was added, along with a wrapping <html> tag.
If we want to avoid this, we can parse the document as a DocumentFragment:
require 'nokogiri'
doc = Nokogiri::HTML.fragment('<body></body>')
puts doc.to_html
which now outputs only what was actually passed in:
<body></body>
The same choice exists for XML. Parsing as a full document:
require 'nokogiri'
doc = Nokogiri::XML('<node />')
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<node/>
And parsing as a fragment:
require 'nokogiri'
doc = Nokogiri::XML.fragment('<node />')
puts doc.to_xml
resulting in:
<node/>
A more verbose variation of fragment is to use DocumentFragment.parse, so sometimes you'll see it written that way.
Occasionally, Nokogiri will have to do some fix-ups to try to make sense of the document:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse('<node ><foo/>')
puts doc.to_xml
The output now includes the closing tag that Nokogiri added:
<node>
<foo/>
</node>
The same can happen with HTML.
Sometimes the document is mangled beyond Nokogiri's ability to fix it, but it will try anyway, resulting in a document that has a changed hierarchy. Nokogiri won't raise an exception, but it does provide a way to check for errors and the actions it took. See "How to check for parsing errors" for more information.
See the Nokogiri::XML::ParseOptions documentation for various options used when parsing.