
Scraping Raw CSS and HTML with YQL

For a recent project I needed a way, in JavaScript, to pull the source of a remote stylesheet and of a remote webpage, each as a plain string. There were two immediate problems. First, the browser’s same-origin policy: you can’t simply load an HTML resource from another domain unless that server sends cross-origin resource sharing (CORS) headers that allow requests from your domain. Second, I didn’t want to apply the requested stylesheet to the page just to access its values.

To solve this, I wrapped hail2u’s query-yql jQuery plugin (https://github.com/hail2u/jquery.query-yql) in a promise, used the query "select * from htmlstring where url='" + url + "'", requested JSON data, set the env variable (table definition) to “all”, and resolved the promise with the returned object’s .query.results.result. It sounds a bit complicated as a one-sentence description, but with a quick primer on promises and a read through hail2u’s documentation, it all comes together pretty quickly.
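The wrapping described above might be sketched as follows. This is a minimal sketch rather than the original project code: the function name fetchSource and its queryYQL parameter are hypothetical, and the plugin function is passed in as an argument so the wrapper can be tried with a stub. It assumes a native or polyfilled Promise, and that the callback receives an object shaped like { query: { results: { result: "..." } } }, as described above.

```javascript
// Hypothetical helper: wraps a queryYQL-style function in a Promise.
// `queryYQL` is assumed to accept (statement, type, env, callback), matching
// the call described in the post.
function fetchSource(queryYQL, url) {
  // The URL is concatenated directly into the statement, so a URL that
  // contains a single quote would break the query.
  var statement = "select * from htmlstring where url='" + url + "'";
  return new Promise(function (resolve, reject) {
    queryYQL(statement, "json", "all", function (data) {
      var results = data && data.query && data.query.results;
      if (results && typeof results.result === "string") {
        resolve(results.result);
      } else {
        // Assumption: a missing result is treated as a failure.
        reject(new Error("YQL returned no result for " + url));
      }
    });
  });
}
```

In the browser, a call would presumably look like fetchSource($.queryYQL.bind($), "http://example.com/").then(function (html) { /* use the string */ }). Rejecting on an empty result is a design choice of this sketch; the original description only mentions resolving.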

The HTML pulls through as a string without issue; however, scraping CSS files with this method does require one extra step since YQL runs the data through HTML Tidy, which assumes the CSS file is simply malformed HTML, wraps it in the appropriate tags, and converts some characters. Since the contents of the file are consistently wrapped in paragraph tags and the only obvious character conversion is &gt;, a simple series of .replace(/[\s\S]*<p>/, ""), .replace(/<\/p>[\s\S]*/, ""), and .replace(/\&gt\;/g, ">") is enough to convert the scraped data back into its original form.
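As a worked example, the three replacements can be grouped into one small function. The name untidyCss and the sample input are made up for illustration; real HTML Tidy output may wrap the CSS differently, so treat this as a sketch of the cleanup under the assumption stated above (a single paragraph wrapper and &gt; as the only converted character).

```javascript
// Hypothetical helper: undoes the HTML Tidy wrapping described above.
function untidyCss(raw) {
  // Leave the string untouched if it was not wrapped in a paragraph.
  if (!/<p>/.test(raw)) {
    return raw;
  }
  return raw
    .replace(/[\s\S]*<p>/, "")   // drop everything up to and including the last <p>
    .replace(/<\/p>[\s\S]*/, "") // drop the first remaining </p> and everything after it
    .replace(/&gt;/g, ">");      // restore ">" (e.g. child combinators)
}

// Illustrative input only; the exact wrapper Tidy produces may differ.
var tidied = "<html><head></head><body><p>ul &gt; li { color: red; }</p></body></html>";
console.log(untidyCss(tidied)); // prints: ul > li { color: red; }
```

Because the first pattern is greedy, it removes everything up to the last <p> in the string, which is only safe while the CSS really does arrive inside a single paragraph.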

In practice, the YQL scrape looks a little like this:

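// url and language are defined in the enclosing scope; resolve is the
// resolve callback of the promise that wraps this call.
var statement = "select * from htmlstring where url='" + url + "'";
$.queryYQL(statement, "json", "all", function (data) {
  var result = data.query.results.result;
  // YQL always runs the response through HTML Tidy (there is no option to
  // just curl the file), so strip the wrapper it adds around CSS.
  if (language === "CSS" && /<p>/.test(result)) {
    result = result.replace(/[\s\S]*<p>/, "");   // everything up to the last <p>
    result = result.replace(/<\/p>[\s\S]*/, ""); // the closing </p> and what follows
    result = result.replace(/\&gt\;/g, ">");     // restore ">" characters
  }
  resolve(result);
});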

 

Last Updated: 2015-05-01