download an html page using wget with only partial link




I am writing a bash script to download the current National Geographic Photo of the Day HTML page using wget; the page changes every day. When I go to https://www.nationalgeographic.com/photography/photo-of-the-day/ it redirects me to the current page, for example https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/. The part after "photo-of-the-day" changes every day on the website. I want wget to download the second HTML page (which changes every day) using only the first link (which, when typed into a browser, redirects me to the second link). How can I do that?



So far I have tried:


wget https://www.nationalgeographic.com/photography/photo-of-the-day/



but it does not give me the desired second HTML page.





I reviewed the first URL, and I think you can fetch and parse the first page, then read the "twitter:image:src" meta value to get the URL of the image; if you read the meta value from "twitter:url" you get the desired URL (the second HTML page).
– juanbits
Aug 8 at 4:54
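The comment's suggestion boils down to something like the sketch below (this is only a rough illustration, not code from the thread; the sed expressions assume each meta tag sits on its own line in the downloaded HTML):

# Sketch: read the twitter:url / twitter:image:src meta values from the landing page.
page=$(curl -s https://www.nationalgeographic.com/photography/photo-of-the-day/)
echo "$page" | sed -n 's/.*"twitter:url" content="\([^"]*\)".*/\1/p'        # the daily page URL
echo "$page" | sed -n 's/.*"twitter:image:src" content="\([^"]*\)".*/\1/p'  # the image URL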





3 Answers



This will work for you; it is a nice and easy one-liner.


curl https://www.nationalgeographic.com/photography/photo-of-the-day/ \
  | grep -m 1 https://www.nationalgeographic.com/photography/photo-of-the-day/ \
  | cut -d '=' -f 3 | head -c-3 > desired_url



It will write the URL you are looking for to a file named desired_url.



The file will look something like this:



"https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"



which is your desired URL.



To download that page you just have to run:


url=$(cat desired_url)

wget "$url"



Try this:


#! /bin/bash

url=https://www.nationalgeographic.com/photography/photo-of-the-day/

wget -q -O- "$url" > index.html

og_url=$(xmllint --html --xpath 'string(//meta[@property="og:url"]/@content)' index.html 2>/dev/null)
og_image=$(xmllint --html --xpath 'string(//meta[@property="og:image"]/@content)' index.html 2>/dev/null)

rm index.html

name=${og_url%/}
name=${name##*/}
file="$name".jpg

wget -q -O "$file" "$og_image"
echo "$file"



First it loads the base URL. Then it uses xmllint to extract the relevant information. Standard error is ignored, because the HTML contains many errors, but xmllint is still able to parse the relevant parts of the page. The name of the image is part of a URL, which is stored in the content attribute of a meta element with the attribute property="og:url". The URL of the image is stored in a similar meta element with the attribute property="og:image". Bash's parameter substitution is used to craft a file name. The file name and the image URL are used in the second wget call to download the image. Finally the script reports the name of the created file.
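For clarity, this is what the two parameter substitutions from the script do to a sample og:url value (the sample is the URL from the question; the logic is exactly the two lines in the script above):

# Illustration of the parameter substitution used in the script.
og_url="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"
name=${og_url%/}    # strip the trailing slash
name=${name##*/}    # keep only the last path component
echo "$name".jpg    # -> mandalay-golden-sunrise.jpg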





If you strictly wish to use wget, you will have to download the page behind the first URL to acquire the address that changes every day. Since we will not use the downloaded page for anything else, we can just save it to /tmp. I am naming the downloaded file NG.html:


wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html



I assume that the URL you want is the direct link to the picture, which in this case is:



https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/



How do we get that?



One way is to grep for the tag with "twitter:url" and print one line below it.


grep -A 1 twitter:url /tmp/NG.html



The "-A 1" parameter prints one more line after the line containing the pattern we searched for. The result is like this:


grep -A 1 twitter:url /tmp/NG.html
<meta property="twitter:url" content="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"/>
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>



Now we can grep for "og:image" to select only the line that contains our URL. We could not grep for "og:image" directly on the whole document, because there are other tags in it that also contain "og:image".



So now we will get only the last line containing the URL:


grep -A 1 twitter:url /tmp/NG.html | grep "og:image"
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>



Now we can use cut to extract the URL from inside the HTML tag.



If we use the '"' character as the delimiter (separator), the 4th field will be the URL:


1 <meta property=
2 og:image
3 content=
4 https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/
5 />



So, applying cut with delimiter '"' and selecting the 4th field gives us:


grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4
https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/



Now we can supply this URL to wget and save the image as a JPG:


wget $( grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg



In summary, you will need to run two lines:


wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html
wget $( grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg
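If you want to run this automatically (for example from cron), the two lines can be wrapped in a small script. This is just a sketch using the commands above; the date-stamped output name is my own illustrative choice, not part of the answer:

#!/bin/bash
# Sketch: fetch the redirecting page, extract the og:image URL, and save the
# photo under a date-stamped name (the naming scheme is only an example).
wget -q https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html
img_url=$(grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4)
wget -q "$img_url" -O "potd-$(date +%F).jpg"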






