Downloading CSV files from a webpage using Python
There is a site named Stockpup that gives anyone the opportunity to download, from its webpage, CSV files containing fundamentals of companies listed on the NYSE. The site is non-commercial and does not provide an API as other sites do. This means that one has to download the CSV files manually, one by one, which is very time consuming, especially since this has to be repeated every quarter.
So I wonder if there is a way to automate this process with Python.
I provide below an image of the website I am referring to, which can be accessed at: http://www.stockpup.com/data/
I used the following code:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename
base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/data/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))
Which returned the following exception:
TypeError Traceback (most recent call last)
<ipython-input-12-59ef271e8696> in <module>()
9 for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
10 with open(basename(link), "w") as f:
---> 11 f.writelines(requests.get(link))
TypeError: write() argument must be str, not bytes
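That error arises because requests.get(link) returns a Response object whose iteration yields bytes, while the file was opened in text mode. A minimal sketch of one way around it (my own variant, not an official recipe): open the file in binary mode and write response.content:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

# Response.content is bytes, so open the target file in binary ("wb") mode.
for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    with open(basename(link), "wb") as f:
        f.write(requests.get(link).content)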
I also tried this code:
from bs4 import BeautifulSoup
from time import sleep
import requests
if __name__ == '__main__':
    url = requests.get('http://www.stockpup.com/data/').text
    soup = BeautifulSoup(url)
    for link in soup.findAll("a"):
        current_link = link.get("href")
        if current_link.endswith('csv'):
            print('Found CSV: ' + current_link)
            print('Downloading %s' % current_link)
            sleep(10)
            response = requests.get('http://www.stockpup.com/data//%s' % current_link, stream=True)
            fn = current_link.split('/')[0] + '_' + current_link.split('/')[1] + '_' + current_link.split('/')[2]
            with open(fn, "wb") as handle:
                for data in response.iter_content():
                    handle.write(data)
Which returned this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-fc758e1763cb> in <module>()
9 for link in soup.findAll("a"):
10 current_link = link.get("href")
---> 11 if current_link.endswith('csv'):
12 print('Found CSV: ' + current_link)
13 print('Downloading %s' % current_link)
AttributeError: 'NoneType' object has no attribute 'endswith'
I think what this tells me is that it does not find any objects that meet the criterion I gave (the .csv file extension).
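In fact, the traceback points at something else: link.get("href") returns None for anchor tags that have no href attribute at all, and None has no endswith method. A minimal guard, keeping the same loop structure, would be:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('http://www.stockpup.com/data/').text, "html.parser")
for link in soup.findAll("a"):
    current_link = link.get("href")
    # get() returns None for <a> tags without an href attribute,
    # so test for None before calling endswith().
    if current_link and current_link.endswith('.csv'):
        print('Found CSV: ' + current_link)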
I also looked at the website using Chrome's Developer Tools, and this is what I saw:
In fact, I cannot see the hyperlinks to the CSV files.
I tried:
from selenium import webdriver
ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/{}'.format(href))
Which returned an exception:
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
75 stderr=self.log_file,
---> 76 stdin=PIPE)
77 except TypeError:
C:\ProgramData\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:
C:\ProgramData\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
991 os.fspath(cwd) if cwd is not None else None,
--> 992 startupinfo)
993 finally:
PermissionError: [WinError 5] Access is denied
During handling of the above exception, another exception occurred:
WebDriverException Traceback (most recent call last)
<ipython-input-13-ebd684e97f30> in <module>()
1 from selenium import webdriver
----> 2 ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
3 source = BeautifulSoup(ins.page_source)
4 div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
5 all_as = div[0].find_all('a')
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options)
66 service_args=service_args,
67 log_path=service_log_path)
---> 68 self.service.start()
69
70 try:
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
86 raise WebDriverException(
87 "'%s' executable may have wrong permissions. %s" % (
---> 88 os.path.basename(self.path), self.start_error_message)
89 )
90 else:
WebDriverException: Message: 'Application' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home
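As the exception message suggests, webdriver.Chrome() needs the path to the chromedriver executable itself (chromedriver.exe), not Chrome's installation folder. A minimal sketch, with a placeholder path:

from selenium import webdriver

# The path below is a placeholder; it must point at chromedriver.exe,
# not at the Chrome application directory.
ins = webdriver.Chrome(r'C:\path\to\chromedriver.exe')
ins.get('http://www.stockpup.com/data/')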
Finally, I tried the following code, which ran without an exception, but on the other hand nothing happened.
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename
base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))
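A plausible reason nothing happened: this version requests http://www.stockpup.com/ (the landing page) rather than the /data/ page, so the CSS selector likely matches no links and the loop body never runs; it also passes the whole tag to urljoin instead of a["href"]. A sketch with both points corrected, streaming each file in chunks this time:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    response = requests.get(link, stream=True)
    with open(basename(link), "wb") as f:
        # iter_content streams the body in chunks instead of loading it whole.
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)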
Your advice will be appreciated.
There are lots of libraries you can use in Python to do this; I suggest you take a look at beautifulsoup or scrapy.
– gogaz
Aug 8 at 10:48
The code suggested by Padraic Cunningham from stackoverflow.com/a/39056833/1228815 should do the job.
– Ivar van Wooning
Aug 8 at 11:00
You can use beautifulsoup to get a list of the hrefs of all the a tags in the webpage that end in csv. You can then iterate over the list to download each one.
– Ankit S
Aug 8 at 11:04
@Ivar Would you like to have a look at the code and comment if possible?
– user8270077
Aug 8 at 11:20
2 Answers
I think you should check out Selenium; it's cool.
from selenium import webdriver
from bs4 import BeautifulSoup

# The path below is a placeholder for the chromedriver executable.
ins = webdriver.Chrome('path/to/chromedriver')
ins.get('http://www.stockpup.com/data/')  # load the page before reading page_source
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/{}'.format(href))
        break
Note: please remove the break statement if you want to download all the attachments, or add a counter and stop at whatever number you want.
If you still want to do it with requests, then I suggest taking the href out of the a tag and appending it to the stockpup URL, then requesting that; it will download the CSV files for you. Hope this helps!!
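For example, a minimal sketch of that requests-only route (assuming the hrefs on the page are site-relative paths like /data/XYZ_quarterly_financial_data.csv, which is my reading, not something spelled out above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.stockpup.com/data/').content)
for a in soup.find_all('a'):
    href = a.get('href')
    if href and href.endswith('.csv'):
        # Append the relative href to the site root and save the raw bytes.
        r = requests.get('http://www.stockpup.com/' + href.lstrip('/'))
        with open(href.split('/')[-1], 'wb') as f:
            f.write(r.content)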
Another way to do this, much simpler, using requests and BeautifulSoup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('http://www.stockpup.com/data/')
soup = BeautifulSoup(source.content)
div = soup.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        data = pd.read_csv('http://www.stockpup.com/{}'.format(href))
        # Give the path where you want to save, e.g.
        # r'C:/Users/sarthak_negi_/Downloads/file.csv', file.csv being the name
        # you give your CSV file; keep changing the name for every CSV,
        # otherwise it will overwrite.
        data.to_csv(r'C:/Users/sarthak_negi_/Downloads/file.csv')
        break
Now remove the break to get all the CSVs. As far as the error in the Selenium attempt above goes, I think the path to the chrome driver was wrong: you need to give the path of the driver's exe (chromedriver.exe), not the Chrome application folder.
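To avoid renaming by hand, one variation (mine, not part of the original answer) derives each local filename from the link itself, so nothing gets overwritten:

import pandas as pd

def save_csv(href):
    # href is assumed to look like '/data/AAPL_quarterly_financial_data.csv';
    # its last path component becomes the local filename.
    data = pd.read_csv('http://www.stockpup.com/' + href.lstrip('/'))
    data.to_csv(href.split('/')[-1], index=False)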
Thank you for your ideas! I tried to implement them, but I ran into some problems, as you can see in my updated post. Would it be possible to tell me exactly how to amend my code? It seems I cannot follow your directions.
– user8270077
Aug 8 at 11:56
@Sarthak you can still do this using only selenium
– Tushortz
Aug 8 at 12:04
@user8270077 please try the updated one; it will work
– Sarthak Negi
Aug 8 at 12:26
@Tushortz ya, but it's also possible with requests
– Sarthak Negi
Aug 8 at 12:26
@Sarthak: I am grateful! It worked like a charm!
– user8270077
Aug 8 at 17:18
Here is a simple solution:
import re
import requests
url='http://www.stockpup.com/data/'
resp = requests.get(url)
for ln in resp.text.splitlines():
    if 'quarterly_financial_data.csv' in ln:
        csv = re.split('/|"', ln)
        print(url + csv[3])
        r = requests.get(url + csv[3])
        fcsv = open(csv[3], 'w')
        fcsv.write(r.text)
        fcsv.close()
You have to do web scraping, using Scrapy or another scraping library for Python.
– serbia99
Aug 8 at 10:48