
Python Script to Grab All CSS for Given URL(s)

While at work, I needed a script that, given a URL, would grab all of the CSS a webpage was using, both internal and external, and concatenate it into a single file. I came up with the following Python script. There's a minimal amount of setup work required to run it, though. First, you must have Python installed. I'm going to assume that 1) you do, and 2) you know how to use it. The next step is to download and install the BeautifulSoup package. This is how I accomplished it on my Ubuntu box (it could vary depending on your distribution):

sudo apt-get install curl
curl -O http://python-distribute.org/distribute_setup.py
sudo python distribute_setup.py
sudo easy_install beautifulsoup
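
If easy_install gives you trouble, pip (assuming you have it installed) should be able to pull down the same package:

sudo pip install BeautifulSoup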

After that, you should be good to go. Copy/paste the following into a Python script (I called mine fetch_css.py) and then run it with python fetch_css.py.

# -*- coding: utf-8 -*-
import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup


def fetch_css(url):

    try:
        response = urllib2.urlopen(url)
        html_data = response.read()
        response.close()

        soup = BeautifulSoup(html_data)

        # Find all external stylesheet references
        ext_styles = soup.findAll('link', rel="stylesheet")

        # Find all internal styles
        int_styles = soup.findAll('style', type="text/css")

        # TODO: Find styles defined inline?
        # Might not be useful... which <p style> is which?

        # Loop through the internal styles found, extract the style text,
        # and accumulate it; first, check that int_styles isn't empty.
        int_css_data = ''
        int_found = 1
        if len(int_styles) != 0:
            for i in int_styles:
                print "Found an internal stylesheet"
                # 'or' guards against empty <style> tags, where find() returns None
                int_css_data += i.find(text=True) or ''
        else:
            int_found = 0
            print "No internal stylesheets found"

        # Loop through the external stylesheets found, resolve each href
        # against the base URL, and fetch the contents of each one;
        # first, check that ext_styles isn't empty.
        ext_css_data = ''
        ext_found = 1
        if len(ext_styles) != 0:
            for i in ext_styles:
                # urljoin handles both absolute and relative hrefs
                css_url = urljoin(url, i['href'])
                print "Found external stylesheet: " + css_url

                response = urllib2.urlopen(css_url)
                ext_css_data += response.read()
                response.close()
        else:
            ext_found = 0
            print "No external stylesheets found"

        # Combine all internal and external styles into one stylesheet
        # (must convert the string to unicode and ignore errors!)
        # FIXME: Having problems picking up JP characters:
        #    html[lang="ja-JP"] select{font-family:"Hiragino Kaku Gothic Pro", "ヒラギノ角ゴ Pro W3"
        # I already tried ext_css_data.encode('utf-8'), but this didn't work
        all_css_data = int_css_data + unicode(ext_css_data, errors='ignore')

        return all_css_data, int_found, ext_found
    except Exception:
        return "", 0, 0

################################################################################
# Specify URL(s) here
################################################################################
urls = {
    'jaresfencing': "http://jaresfencing.com",
    'derekhildreth': "http://derekhildreth.com",
    'thelinuxdaily': "http://thelinuxdaily.com",
    'myurl1': "http://myurl1.com"
}


for k, v in urls.items():
    print "\nFetching: " + v
    print "--------------------------------------------------------------------------------"
    out, int_found, ext_found = fetch_css(v)
    if ext_found == 1 or int_found == 1:
        filename = k + '_css.out'
        f = open(filename, 'w')
        # encode so internal styles containing non-ASCII don't crash write()
        f.write(out.encode('utf-8'))
        print "Styles successfully written to: " + filename + "\n"
        f.close()
    elif out == "":
        print "Error: URL not found!"
    else:
        print "No styles found for " + v + "\n"



## OPTIONAL CSS PARSING STEP ##
# MUST INSTALL CSSUTILS with 'sudo easy_install cssutils'
#import cssutils
#sheet = cssutils.parseString(all_css_data)
#f2 = open('temp2', 'w')
#f2.write(sheet.cssText)
#f2.close()
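
In case it's useful, here's a minimal standalone sketch of what that step does (it assumes cssutils is installed; the sample CSS string is just for illustration):

# Minimal cssutils sketch
import cssutils

sample_css = 'body { color: red }  p{margin:0;padding:0}'
sheet = cssutils.parseString(sample_css)

# cssText is the re-serialized (normalized) stylesheet
print sheet.cssText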

Seems to work for me! I could see potential for many improvements, but it does the job as it is; one such improvement is sketched below. Enjoy.
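
For instance, the script misses stylesheets referenced via @import inside the CSS it fetches. Here's a rough sketch of how that could bolt on; the regex and the fetch_imports name are my own, and it only covers the common cases (no recursion, and media queries on the @import line are ignored):

import re
import urllib2
from urlparse import urljoin

# Matches @import "foo.css"; and @import url("foo.css");
IMPORT_RE = re.compile(r'@import\s+(?:url\()?["\']?([^"\'\)\s;]+)')

def fetch_imports(css_data, base_url):
    # Naive, non-recursive pass over already-fetched CSS;
    # base_url should be the URL the CSS was fetched from
    extra = ''
    for target in IMPORT_RE.findall(css_data):
        import_url = urljoin(base_url, target)
        print "Found @import: " + import_url
        response = urllib2.urlopen(import_url)
        extra += response.read()
        response.close()
    return css_data + extra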

Feel free to donate if this post prevented any headaches!

