Downloading CSV files from a webpage using Python




There is a site named Stockpup that gives anyone the opportunity to download, from its webpage, CSV files containing fundamentals of companies listed on the NYSE. The site is non-commercial and does not provide an API as other sites do. This means that one has to download the CSV files manually, one by one, which is very time-consuming, especially since this has to be repeated every quarter.



So I wonder if there is a way to automate this process through Python.



I provide below an image of the website I am referring to, which can be accessed at: http://www.stockpup.com/data/



[Screenshot of the Stockpup data page]



I used the following code:


from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/data/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))



Which returned the following exception:


TypeError                                 Traceback (most recent call last)
<ipython-input-12-59ef271e8696> in <module>()
      9 for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
     10     with open(basename(link), "w") as f:
---> 11         f.writelines(requests.get(link))

TypeError: write() argument must be str, not bytes
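
From what I understand, the error means that iterating over the Response yields bytes while the file was opened in text mode. Presumably opening the file in binary mode and writing response.content would fix it; a minimal sketch, reusing the same selector (quoted here for safety, and assuming it matches the links on the page):

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")
for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    # the response body is bytes, so the file must be opened in binary mode
    with open(basename(link), "wb") as f:
        f.write(requests.get(link).content)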



I also tried this code:


from bs4 import BeautifulSoup
from time import sleep
import requests

if __name__ == '__main__':
    url = requests.get('http://www.stockpup.com/data/').text
    soup = BeautifulSoup(url)
    for link in soup.findAll("a"):
        current_link = link.get("href")
        if current_link.endswith('csv'):
            print('Found CSV: ' + current_link)
            print('Downloading %s' % current_link)
            sleep(10)
            response = requests.get('http://www.stockpup.com/data//%s' % current_link, stream=True)
            fn = current_link.split('/')[0] + '_' + current_link.split('/')[1] + '_' + current_link.split('/')[2]
            with open(fn, "wb") as handle:
                for data in response.iter_content():
                    handle.write(data)



Which returned this error message:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-fc758e1763cb> in <module>()
      9 for link in soup.findAll("a"):
     10     current_link = link.get("href")
---> 11     if current_link.endswith('csv'):
     12         print('Found CSV: ' + current_link)
     13         print('Downloading %s' % current_link)

AttributeError: 'NoneType' object has no attribute 'endswith'



I think what this tells me is that some of the anchor tags on the page carry no href attribute at all, so link.get("href") returns None and the check for the csv extension fails before it can match anything.
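
A minimal guard, assuming the CSV links really are plain anchor tags on the page, would be to skip anchors without an href before checking the extension:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('http://www.stockpup.com/data/').text, "html.parser")
for link in soup.findAll("a"):
    current_link = link.get("href")
    # some anchors (named anchors, javascript links) carry no href at all
    if current_link and current_link.endswith('csv'):
        print('Found CSV: ' + current_link)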



I also looked at the website using Chrome's Developer tools, and this is what I saw:



[Screenshot of the Chrome Developer tools view of the page]



In fact, I cannot see the hyperlinks to the CSV files.



I tried:


from selenium import webdriver
ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class': 'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')

href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/'.format(href))



Which returned an exception:


---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
75 stderr=self.log_file,
---> 76 stdin=PIPE)
77 except TypeError:

C:\ProgramData\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:

C:\ProgramData\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
991 os.fspath(cwd) if cwd is not None else None,
--> 992 startupinfo)
993 finally:

PermissionError: [WinError 5] Access is denied

During handling of the above exception, another exception occurred:

WebDriverException Traceback (most recent call last)
<ipython-input-13-ebd684e97f30> in <module>()
1 from selenium import webdriver
----> 2 ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
3 source = BeautifulSoup(ins.page_source)
4 div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
5 all_as = div[0].find_all('a')

C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options)
66 service_args=service_args,
67 log_path=service_log_path)
---> 68 self.service.start()
69
70 try:

C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
86 raise WebDriverException(
87 "'%s' executable may have wrong permissions. %s" % (
---> 88 os.path.basename(self.path), self.start_error_message)
89 )
90 else:

WebDriverException: Message: 'Application' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home
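
As the message suggests, the path passed to webdriver.Chrome should point at the ChromeDriver executable (a separate download from the ChromeDriver site), not at Chrome's installation folder. A sketch, with a hypothetical location for chromedriver.exe:

from selenium import webdriver

# hypothetical path to chromedriver.exe, not to the Chrome browser itself
ins = webdriver.Chrome(r'C:\tools\chromedriver.exe')
ins.get('http://www.stockpup.com/data/')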



Finally, I tried the following code, which ran without an exception, but on the other hand nothing happened.


from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))
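
My guess is that nothing happened because the selector matched nothing: this version requests the site root instead of /data/, so the generator is empty and the loop body never runs. It also passes the whole a tag to urljoin instead of its href. Presumably those two lines would need to be:

url = requests.get('http://www.stockpup.com/data/').text
for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):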



Your advice will be appreciated.





You have to do web scraping, using Scrapy or any other scraping library for Python.
– serbia99
Aug 8 at 10:48





There are lots of libraries you can use in Python to do this; I suggest you take a look at beautifulsoup or scrapy
– gogaz
Aug 8 at 10:48







The code suggested by Padraic Cunningham from stackoverflow.com/a/39056833/1228815 should do the job.
– Ivar van Wooning
Aug 8 at 11:00





You can use beautifulsoup to get a list of the hrefs of all the a tags on the page ending in csv. You can then iterate over the list to download each one.
– Ankit S
Aug 8 at 11:04







@Ivar Would you like to have a look at the code and comment if possible?
– user8270077
Aug 8 at 11:20




2 Answers



I think you should check out Selenium; it's cool.


from bs4 import BeautifulSoup
from selenium import webdriver

# give the path to the ChromeDriver executable here, not the Chrome browser folder
ins = webdriver.Chrome('path/to/chromedriver')
ins.get('http://www.stockpup.com/data/')
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class': 'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')

href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/{}'.format(href))
        break



Note: please remove the break statement if you want to download all the attachments, or just set a count at which you want to stop.



If you still want to do it with requests, then I suggest taking the href out of the a tag and just appending it to the stockpup URL, then fetching that; it will download the CSV files for you. Hope this helps!!
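
A rough sketch of that requests-only route (assuming the hrefs on the /data/ page are relative paths ending in .csv):

from bs4 import BeautifulSoup
import requests

source = requests.get('http://www.stockpup.com/data/')
soup = BeautifulSoup(source.content, "html.parser")
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.endswith('.csv'):
        # append the relative href to the site root and save the raw bytes
        resp = requests.get('http://www.stockpup.com/' + href.lstrip('/'))
        with open(href.rstrip('/').split('/')[-1], 'wb') as f:
            f.write(resp.content)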



Another way to do this, much simpler, using requests and BeautifulSoup:


import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('http://www.stockpup.com/data/')
soup = BeautifulSoup(source.content)
div = soup.find_all('div', {'class': 'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')

href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        data = pd.read_csv('http://www.stockpup.com/{}'.format(href))
        # give the path where you want to save, e.g. r'C:/Users/sarthak_negi_/Downloads/file.csv',
        # file.csv being the name you will give to your csv file;
        # keep changing the name for every csv, otherwise it will overwrite
        data.to_csv('path/where/you/want/to/save.csv')
        break



Now remove the break to get all the CSVs. As far as the error in the Selenium attempt above goes, I think the path to the chrome driver was wrong: you need to give the path to the driver's exe, making sure to point it at chromedriver.exe rather than the Chrome application folder.
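
Rather than editing the output path by hand for every file, one could derive the name from the href itself, e.g. (a sketch, assuming each href ends with the file name):

import os

filename = os.path.basename(href.rstrip('/'))
data.to_csv(os.path.join(r'C:/Users/sarthak_negi_/Downloads', filename))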





Thank you for your ideas! I tried to implement them but I ran into some problems, as you can see in my updated post. Would it be possible to tell me exactly how to amend my code? It seems I cannot follow your directions.
– user8270077
Aug 8 at 11:56





@Sarthak you can still do this using only selenium
– Tushortz
Aug 8 at 12:04





@user8270077 please try the updated one, it will work
– Sarthak Negi
Aug 8 at 12:26





@Tushortz ya, but it's also possible with requests
– Sarthak Negi
Aug 8 at 12:26





@Sarthak: I am grateful! It worked like a charm!
– user8270077
Aug 8 at 17:18



Here is a simple solution:


import re
import requests

url = 'http://www.stockpup.com/data/'

resp = requests.get(url)
for ln in resp.text.splitlines():
    if 'quarterly_financial_data.csv' in ln:
        # assuming the line contains href="/data/NAME_quarterly_financial_data.csv",
        # splitting on '/' and '"' leaves the file name at index 3
        csv = re.split('/|"', ln)
        print(url + csv[3])
        r = requests.get(url + csv[3])
        fcsv = open(csv[3], 'w')
        fcsv.write(r.text)
        fcsv.close()
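
One caveat: writing r.text through a text-mode file can translate line endings on Windows; writing the raw bytes avoids that:

r = requests.get(url + csv[3])
with open(csv[3], 'wb') as fcsv:
    fcsv.write(r.content)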






