[Python] Get specific resources of a web page

Hello! In this post I will simply paste my little Python tool, which lists all the links of a specific web page and offers the possibility to download specific resources based on their file type. Please try it and send me back your results and observations to help me improve it :-). Each update will be noted here.
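For example, assuming you save the script below as webLinks.py (the name used in its usage message), you can call it with a URL and, optionally, a comma-separated list of file extensions:

python webLinks.py http://www.example.com/hello-world png,jpg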

# @Author Cyrill Gremaud
# Based on the code snippet of Kumar Shubham
# http://hackaholic.info/simple-python-program-extract-links-web-page/
#
# This program will list all the links from a web page (Kumar Shubham), save them
# into a file and offer the possibility to download specific kinds of files based on
# their extensions

import sys
import urllib2
import re
import os

outputFile = './tmp.txt'

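# Rewrite the relative links collected in the temporary file into absolute URLs by
# prepending the base URL, then write the result to a file named after the URL and
# remove the temporary file.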
def updateRelativeLinks(urlname):
    global outputFile
    #replace slashes so that a URL with a path still gives a valid file name
    updatedOutputFile = './' + urlname[7:].replace('/', '_') + '.txt'
    count = 0
    try:
        of = open(outputFile, 'r')
        off = open(updatedOutputFile, 'w')

        for line in of:
            if 'http' not in line:
                #sometimes the slash is missing as first char in a relative URL
                if line[0] != '/':
                    line = '/' + line
                count += 1
                newline = urlname + line
                off.write(newline)
                print 'Found [', count, ']: ', newline.replace('\n', '')
            else:
                off.write(line)

        of.close()
        off.close()
        os.remove(outputFile)
        outputFile = updatedOutputFile

    except Exception as e:
        #report the problem; if open() itself failed, the file handles may not exist
        print 'updateRelativeLinks error:', e

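# Fetch the page source, extract every src/href target found in it, write one link
# per line into the temporary file and then resolve the relative links.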
def findLinks(url):
    global outputFile
    
    try:
        if url[0:7] != 'http://':
            url = "http://" + url

        of = open(outputFile, 'w')

        f = (urllib2.urlopen(url)).read()         #Open the URL
        k = re.findall('(src|href)="(\S+)"',f)    #Find links in the source
        k = set(k)                                #Remove duplicates by storing the matches in a set

        for x in k:
            if len(x[1]) > 2:
                of.write(x[1]+'\n')               #Store each links into the file

        of.close()

        updateRelativeLinks(url)

    except Exception as e:
        print 'URL not found or unreachable:', e

def download(extensions):
    global outputFile

    extList = extensions.split(',')
    of = open(outputFile, 'r')

    if not os.path.exists('./'+outputFile[2:-4]):
        os.makedirs('./'+outputFile[2:-4])

    for line in of:
        for ext in extList:
            if '.'+ext in line:
                try:
                    link = line.replace('\n','')              #strip the trailing newline before opening the URL
                    rawData = (urllib2.urlopen(link).read())
                    filename = link.split('/')
                    dlData = open('./'+outputFile[2:-4]+'/'+filename[len(filename)-1], 'wb')    #binary mode for images, videos, ...
                    dlData.write(rawData)
                    print 'Downloaded : ' + link
                    dlData.close()
                except:
                    print 'Download error with ', line
                    continue

    of.close()    

def usage():
    print """
-------------------------------------------------------------------
|usage:                                                           |
|python webLinks.py full_url [ext1,ext2,..,extn]                  |
|example:-                                                        |
|python webLinks.py http://www.example.com/hello-world png,jpg,wmv|
-------------------------------------------------------------------
"""

if __name__ == '__main__':
    print """
 -------------------------------------------------------------
| By Cyrill Gremaud (gremaudc@gmail.com)                     |
| http://www.cyrill-gremaud.ch                               |
|                                                            |
| All found links are stored in a file named with the URL    |
| and if specified, all downloaded files are in a dedicated  |
| folder                                                     |
--------------------------------------------------------------
"""
    argCount = len(sys.argv)

    if argCount != 2 and argCount != 3:
        usage()
    else:
        findLinks(sys.argv[1])

        if argCount == 3:
            download(sys.argv[2])
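
With the call from the usage message above, for instance, all found links end up in a file named after the URL (something like ./www.example.com_hello-world.txt once the relative links have been resolved), and the matching png/jpg files land in a folder with the same name; the temporary ./tmp.txt is removed as soon as the links have been rewritten.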

Updates

3rd November 2014: First release (full resource download capability in progress)
