Nokogiri is somewhat like a browser, in that it will attempt to provide something useful even if the incoming HTML or XML is malformed. Unfortunately it usually does it silently, but we can ask for a list of the errors using errors
:
require 'nokogiri'
doc = Nokogiri::XML('<node><foo/>')
doc.errors
# => [#<Nokogiri::XML::SyntaxError: 1:13: FATAL: Premature end of data in tag node line 1>]
whereas, the correct XML would result in no errors:
doc = Nokogiri::XML('<node><foo/></node>')
doc.errors
# => []
This also applies to parsing HTML, but, because HTML is a relaxed form of XML, Nokogiri will pass over missing end-nodes often and will only report malformed nodes and more pathological errors:
doc = Nokogiri::HTML('<html><body>')
doc.errors
# => []
doc = Nokogiri::HTML('<html><body><p')
doc.errors
# => [#<Nokogiri::XML::SyntaxError: 1:15: ERROR: Couldn't find end of Start Tag p>]
If, after parsing, you can't find an node that you can see in your editor, this could be the cause of the problem. Sometimes it helps to pass the HTML through a formatter, and see if the nesting helps reveal the problem.
And, because Nokogiri tries to fix the problem but sometimes can't do it correctly, because it can be a very difficult thing for software to do, we'll have to pre-process the file and touch up lines prior to handing it off to Nokogiri. How to do that depends on the file and the problem. It can vary from simply finding a node and adding a trailing >
, to removing embedded malformed markup that was injected by a bad scraping routine, so it's up to the programmer how best to intercede.