Hello! In this post I will simply paste my little Python tool, which lists all the links of a given webpage and offers the possibility to download specific resources based on their file type. Please try it and send me back your results and observations to help me improve it :-). Each update will be noted here.
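For example, to list the links of a page and fetch every PNG and JPG it references, you would run it like this (the URL is just a placeholder, mirroring the usage text built into the script):

python webLinks.py http://www.example.com/hello-world png,jpg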
# @Author Cyrill Gremaud
# Based on the code snippet of Kumar Shubham
# http://hackaholic.info/simple-python-program-extract-links-web-page/
#
# This program lists all the links from a webpage (Kumar Shubham), saves them
# into a file and offers the possibility to download specific kinds of files
# based on their extensions. Python 2 only (urllib2, print statement).

import sys
import urllib2
import re
import os

outputFile = './tmp.txt'


def updateRelativeLinks(urlname):
    # Rewrite relative links as absolute ones and move the results into a
    # file named after the URL.
    global outputFile
    updatedOutputFile = './' + urlname[7:] + '.txt'
    count = 0
    try:
        of = open(outputFile, 'r')
        off = open(updatedOutputFile, 'w')
        for line in of:
            if 'http' not in line:
                # Sometimes the leading slash is missing in a relative URL.
                if line[0] != '/':
                    line = '/' + line
                count += 1
                newline = urlname + line
                off.write(newline)
                print 'Found [', count, ']: ', newline.replace('\n', '')
            else:
                off.write(line)
        of.close()
        off.close()
        os.remove(outputFile)
        outputFile = updatedOutputFile
    except IOError:
        # The original bare except also closed the file handles here, which
        # raised a NameError whenever open() itself had failed.
        print 'updateRelativeLinks error'


def findLinks(url):
    global outputFile
    try:
        if url[0:7] != 'http://':
            url = 'http://' + url
        of = open(outputFile, 'w')
        f = urllib2.urlopen(url).read()             # fetch the page source
        k = re.findall(r'(src|href)="(\S+)"', f)    # find links in the source
        k = set(k)                                  # de-duplicate the matches
        for x in k:
            if len(x[1]) > 2:
                of.write(x[1] + '\n')               # store each link in the file
        of.close()
        updateRelativeLinks(url)
    except Exception:
        print 'URL not found'


def download(extensions):
    global outputFile
    extList = extensions.split(',')
    of = open(outputFile, 'r')
    if not os.path.exists('./' + outputFile[2:-4]):
        os.makedirs('./' + outputFile[2:-4])
    for line in of:
        for ext in extList:
            if '.' + ext in line:
                try:
                    # strip() removes the trailing newline before the request
                    rawData = urllib2.urlopen(line.strip()).read()
                    filename = line.split('/')
                    # 'wb' so binary files (images, videos, ...) are not corrupted
                    dlData = open(('./' + outputFile[2:-4] + '/'
                                   + filename[-1]).replace('\n', ''), 'wb')
                    dlData.write(rawData)
                    print 'Downloaded : ' + line.replace('\n', '')
                    dlData.close()
                except Exception:
                    print 'Download error with ', line
                    continue
    of.close()


def usage():
    print """
    -------------------------------------------------------------------
    |usage:                                                           |
    |python webLinks.py full_url [ext1,ext2,..,extn]                  |
    |example:-                                                        |
    |python webLinks.py http://www.example.com/hello-world png,jpg,wmv|
    -------------------------------------------------------------------
    """


if __name__ == '__main__':
    print """
    --------------------------------------------------------------
    | By Cyrill Gremaud (gremaudc@gmail.com)                     |
    | http://www.cyrill-gremaud.ch                               |
    |                                                            |
    | All found links are stored in a file named after the URL   |
    | and, if specified, all downloaded files go into a          |
    | dedicated folder                                           |
    --------------------------------------------------------------
    """
    argCount = len(sys.argv)
    # Bug fix: the original test used '|' (bitwise OR), which due to operator
    # precedence made the condition always true, so the tool never ran.
    if argCount != 2 and argCount != 3:
        usage()
    else:
        findLinks(sys.argv[1])
        if argCount == 3:
            download(sys.argv[2])
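The script above is Python 2 only (urllib2 and the print statement). If you want to experiment with the same idea on Python 3, here is a minimal sketch of just the core link extraction using the standard library's urllib.request; the function name extract_links is my own choice, not part of the tool:

import re
import urllib.request


def extract_links(url):
    # Fetch the page and decode it; errors='replace' avoids crashes on odd bytes.
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    # Same idea as the original: pull src/href attribute values out of the source.
    matches = re.findall(r'(?:src|href)="(\S+)"', html)
    # De-duplicate and keep only non-trivial targets, as the original does.
    return sorted({m for m in matches if len(m) > 2})


if __name__ == '__main__':
    for link in extract_links('http://www.example.com'):
        print(link)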
Updates
3rd November 2014: First release (full resource-download capability in progress)