Downloading CSV files from a webpage using Python
There is a site named Stockpup that gives anyone the opportunity to download, from its webpage, CSV files containing fundamentals of companies listed on the NYSE. The site is non-commercial and does not provide an API as other sites do. This means that one has to download the CSV files manually, one by one, which is very time consuming, especially since this has to be repeated every quarter.
So I wonder if there is a way to automate this process with Python.
I provide below an image of the website I am referring to, which can be accessed at: http://www.stockpup.com/data/
I used the following code:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename
base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/data/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))
Which returned the following exception:
TypeError Traceback (most recent call last)
<ipython-input-12-59ef271e8696> in <module>()
9 for link in (urljoin(base, a["href"]) for a in soup.select("a[href$=.csv]")):
10 with open(basename(link), "w") as f:
---> 11 f.writelines(requests.get(link))
TypeError: write() argument must be str, not bytes
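That error arises because requests.get(link) returns a Response object whose iteration yields bytes, while the file was opened in text mode. A minimal sketch of one way around it (my own variant, not an official recipe): open the file in binary mode and write response.content:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

# Response.content is bytes, so open the target file in binary ("wb") mode.
for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    with open(basename(link), "wb") as f:
        f.write(requests.get(link).content)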
I also tried this code:
from bs4 import BeautifulSoup
from time import sleep
import requests
if __name__ == '__main__':
    url = requests.get('http://www.stockpup.com/data/').text
    soup = BeautifulSoup(url)
    for link in soup.findAll("a"):
        current_link = link.get("href")
        if current_link.endswith('csv'):
            print('Found CSV: ' + current_link)
            print('Downloading %s' % current_link)
            sleep(10)
            response = requests.get('http://www.stockpup.com/data//%s' % current_link, stream=True)
            fn = current_link.split('/')[0] + '_' + current_link.split('/')[1] + '_' + current_link.split('/')[2]
            with open(fn, "wb") as handle:
                for data in response.iter_content():
                    handle.write(data)
Which returned this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-fc758e1763cb> in <module>()
9 for link in soup.findAll("a"):
10 current_link = link.get("href")
---> 11 if current_link.endswith('csv'):
12 print('Found CSV: ' + current_link)
13 print('Downloading %s' % current_link)
AttributeError: 'NoneType' object has no attribute 'endswith'
I think what this tells me is that it does not find any objects that meet the criterion I gave (the .csv file extension).
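In fact, the traceback points at something else: link.get("href") returns None for anchor tags that have no href attribute at all, and None has no endswith method. A minimal guard, keeping the same loop structure, would be:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('http://www.stockpup.com/data/').text, "html.parser")
for link in soup.findAll("a"):
    current_link = link.get("href")
    # get() returns None for <a> tags without an href attribute,
    # so test for None before calling endswith().
    if current_link and current_link.endswith('.csv'):
        print('Found CSV: ' + current_link)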
I also looked at the website using Chrome's Developer Tools, and this is what I saw:
In fact, I cannot see the hyperlinks to the CSV files.
I tried:
from selenium import webdriver
ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/{}'.format(href))
Which returned an exception:
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
75 stderr=self.log_file,
---> 76 stdin=PIPE)
77 except TypeError:
C:\ProgramData\Anaconda3\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
706 errread, errwrite,
--> 707 restore_signals, start_new_session)
708 except:
C:\ProgramData\Anaconda3\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
991 os.fspath(cwd) if cwd is not None else None,
--> 992 startupinfo)
993 finally:
PermissionError: [WinError 5] Access is denied
During handling of the above exception, another exception occurred:
WebDriverException Traceback (most recent call last)
<ipython-input-13-ebd684e97f30> in <module>()
1 from selenium import webdriver
----> 2 ins = webdriver.Chrome('C:\Program Files (x86)\Google\Chrome\Application')
3 source = BeautifulSoup(ins.page_source)
4 div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
5 all_as = div[0].find_all('a')
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\chrome\webdriver.py in __init__(self, executable_path, port, options, service_args, desired_capabilities, service_log_path, chrome_options)
66 service_args=service_args,
67 log_path=service_log_path)
---> 68 self.service.start()
69
70 try:
C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py in start(self)
86 raise WebDriverException(
87 "'%s' executable may have wrong permissions. %s" % (
---> 88 os.path.basename(self.path), self.start_error_message)
89 )
90 else:
WebDriverException: Message: 'Application' executable may have wrong permissions. Please see https://sites.google.com/a/chromium.org/chromedriver/home
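As the exception message suggests, webdriver.Chrome() needs the path to the chromedriver executable itself (chromedriver.exe), not Chrome's installation folder. A minimal sketch, with a placeholder path:

from selenium import webdriver

# The path below is a placeholder; it must point at chromedriver.exe,
# not at the Chrome application directory.
ins = webdriver.Chrome(r'C:\path\to\chromedriver.exe')
ins.get('http://www.stockpup.com/data/')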
Finally, I tried the following code, which ran without an exception, but on the other hand nothing happened.
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename
base = "http://www.stockpup.com/data/"
url = requests.get('http://www.stockpup.com/').text
soup = BeautifulSoup(url)
for link in (urljoin(base, a) for a in soup.select("a[href$=.csv]")):
    with open(basename(link), "w") as f:
        f.writelines(requests.get(link))
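A plausible reason nothing happened: this version requests http://www.stockpup.com/ (the landing page) rather than the /data/ page, so the CSS selector likely matches no links and the loop body never runs; it also passes the whole tag to urljoin instead of a["href"]. A sketch with both points corrected, streaming each file in chunks this time:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
from os.path import basename

base = "http://www.stockpup.com/data/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    response = requests.get(link, stream=True)
    with open(basename(link), "wb") as f:
        # iter_content streams the body in chunks instead of loading it whole.
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)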
Your advice will be appreciated.
There are lots of libraries you can use in Python to do this; I suggest you take a look at beautifulsoup or scrapy.
– gogaz
Aug 8 at 10:48
The code suggested by Padraic Cunningham from stackoverflow.com/a/39056833/1228815 should do the job.
– Ivar van Wooning
Aug 8 at 11:00
You can use beautifulsoup to get a list of the hrefs of all the a tags in the webpage that end in csv. You can then iterate over the list to download each one.
– Ankit S
Aug 8 at 11:04
@Ivar Would you like to have a look at the code and comment if possible?
– user8270077
Aug 8 at 11:20
2 Answers
I think you should check out Selenium; it's cool.
from selenium import webdriver
from bs4 import BeautifulSoup

# The path below is a placeholder for the chromedriver executable.
ins = webdriver.Chrome('path/to/chromedriver')
ins.get('http://www.stockpup.com/data/')  # load the page before reading page_source
source = BeautifulSoup(ins.page_source)
div = source.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        ins.get('http://www.stockpup.com/{}'.format(href))
        break
Note: please remove the break statement if you want to download all the attachments, or add a counter and stop at whatever number you want.
If you still want to do it with requests, then I suggest taking the href out of the a tag and appending it to the stockpup URL, then requesting that; it will download the CSV files for you. Hope this helps!!
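For example, a minimal sketch of that requests-only route (assuming the hrefs on the page are site-relative paths like /data/XYZ_quarterly_financial_data.csv, which is my reading, not something spelled out above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.stockpup.com/data/').content)
for a in soup.find_all('a'):
    href = a.get('href')
    if href and href.endswith('.csv'):
        # Append the relative href to the site root and save the raw bytes.
        r = requests.get('http://www.stockpup.com/' + href.lstrip('/'))
        with open(href.split('/')[-1], 'wb') as f:
            f.write(r.content)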
Another way to do this, much simpler, using requests and BeautifulSoup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('http://www.stockpup.com/data/')
soup = BeautifulSoup(source.content)
div = soup.find_all('div', {'class':'col-md-4 col-md-offset-1'})
all_as = div[0].find_all('a')
href = ''
for i in range(len(all_as)):
    if 'CSV' in all_as[i].text:
        href = all_as[i]['href']
        data = pd.read_csv('http://www.stockpup.com/{}'.format(href))
        # Give the path where you want to save, e.g.
        # r'C:/Users/sarthak_negi_/Downloads/file.csv', file.csv being the name
        # you give your CSV file; keep changing the name for every CSV,
        # otherwise it will overwrite.
        data.to_csv(r'C:/Users/sarthak_negi_/Downloads/file.csv')
        break
Now remove the break to get all the CSVs. As far as the error in the Selenium attempt above goes, I think the path to the chrome driver was wrong: you need to give the path of the driver's exe (chromedriver.exe), not the Chrome application folder.
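To avoid renaming by hand, one variation (mine, not part of the original answer) derives each local filename from the link itself, so nothing gets overwritten:

import pandas as pd

def save_csv(href):
    # href is assumed to look like '/data/AAPL_quarterly_financial_data.csv';
    # its last path component becomes the local filename.
    data = pd.read_csv('http://www.stockpup.com/' + href.lstrip('/'))
    data.to_csv(href.split('/')[-1], index=False)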
Thank you for your ideas! I tried to implement them, but I ran into some problems, as you can see in my updated post. Would it be possible to tell me exactly how to amend my code? It seems I cannot follow your directions.
– user8270077
Aug 8 at 11:56
@Sarthak you can still do this using only selenium
– Tushortz
Aug 8 at 12:04
@user8270077 please try the updated one; it will work
– Sarthak Negi
Aug 8 at 12:26
@Tushortz ya, but it's also possible with requests
– Sarthak Negi
Aug 8 at 12:26
@Sarthak: I am grateful! It worked like a charm!
– user8270077
Aug 8 at 17:18
Here is a simple solution:
import re
import requests
url='http://www.stockpup.com/data/'
resp = requests.get(url)
for ln in resp.text.splitlines():
    if 'quarterly_financial_data.csv' in ln:
        csv = re.split('/|"', ln)
        print(url + csv[3])
        r = requests.get(url + csv[3])
        fcsv = open(csv[3], 'w')
        fcsv.write(r.text)
        fcsv.close()
You have to do web scraping, using Scrapy or another scraping library for Python.
– serbia99
Aug 8 at 10:48