Extract title from HTML and rename file to title
I have multiple files named output.html. I want to extract their titles, which I can do successfully with the following command:
cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
Example:
7N8UGL0:~/Downloads$ cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q'
SEIKO 5 Finder - SNK559 Automatic Watch
Now I want to rename the output.html to the extracted title:
SEIKO 5 Finder - SNK559 Automatic Watch.html
I already managed to put this into a script:
#!/bin/bash
title=$(cat output.html | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q')
echo "$title"
Further, I have a lot of these output.html files in directories named after epoch timestamps:
ls -l
drwxrwxrwx 1 userna userna 512 Aug 7 19:33 1500122724.81
drwxrwxrwx 1 userna userna 512 Aug 7 19:33 1500122724.82
drwxrwxrwx 1 userna userna 512 Aug 7 19:33 1500122724.83
drwxrwxrwx 1 userna userna 512 Aug 7 19:32 1500122724.84
drwxrwxrwx 1 userna userna 512 Aug 7 18:36 1500122724.85
drwxrwxrwx 1 userna userna 512 Aug 7 18:35 1500122724.86
I would like to be able to extract the html title for all output.html in all the directories and rename the output.html accordingly.
Many thanks in advance,
jmt
2 Answers
Use the find command with -type f and -exec rename.bash {} \; to run a rename script on each file it finds.
find recurses through every subdirectory.
So the complete command would look like:
find <YOUR TOP DIRECTORY> -type f -name output.html -exec rename.bash {} \; -print
The -print at the end lists each file the script processed successfully on stdout.
Your rename script receives the full path and filename of the output.html that find located as its argument. So it has to run your sed command, then mv the file from that argument to path/THE-TITLE-VALUE-YOU-JUST-EXTRACTED-WITH-SED.html.
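As a rough illustration of the rename script described above, here is a minimal sketch. It reuses the sed command from the question and assumes GNU sed; the function name rename_by_title is made up for this example, and the script does nothing yet about titles containing characters such as /.

```shell
#!/bin/bash
# rename.bash: hypothetical sketch of the helper that find would call.
# Assumes GNU sed (the case-insensitive flag and the T command are GNU extensions).

# rename_by_title is an illustrative name, not part of the original answer.
rename_by_title() {
    local src=$1 dir title
    dir=$(dirname -- "$src")
    # Print the text of the first <title> element, then stop reading.
    title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' "$src")
    if [ -z "$title" ]; then
        echo "no title found in $src" >&2
        return 1
    fi
    # Keep the file in its own directory; only the base name changes.
    mv -- "$src" "$dir/$title.html"
}

# find passes the path as the first argument; do nothing when there is none.
if [ "$#" -gt 0 ]; then
    rename_by_title "$1"
fi
```

If the script is saved as ./rename.bash and made executable, it could be called as find <YOUR TOP DIRECTORY> -type f -name output.html -exec ./rename.bash {} \; -print.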
FYI, I would suggest you be careful with this renaming. Spaces in filenames, although perfectly "legal", can cause issues later. Also make sure your titles do not include characters that are special to the shell, like *,!(). and many more. Alphanumeric characters are all fine, along with - and _.
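One way to follow that advice is to normalize the title before using it as a filename. The sketch below is only an assumption about how that could look: sanitize_title is a hypothetical helper that replaces every byte other than an ASCII letter, a digit, - or _ with an underscore.

```shell
# Hypothetical helper: replace every byte outside A-Za-z0-9_- with "_".
sanitize_title() {
    # printf avoids a trailing newline that tr would otherwise turn into "_".
    # LC_ALL=C makes tr work byte by byte, so non-ASCII bytes are replaced too.
    printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9_-' '_'
}
```

For instance, sanitize_title 'SEIKO 5 Finder - SNK559 Automatic Watch' prints SEIKO_5_Finder_-_SNK559_Automatic_Watch. Different titles can collapse to the same name this way, so an existing target file may need to be checked before the mv.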
I was able to solve this by writing the following script:
#!/bin/bash
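# Assumes no path below . contains whitespace (true for the epoch-named directories).
for file in $(find . -name output.html)
do
    # Extract the text of the first <title> element.
    newfilename=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/ip;T;q' "$file")
    # Move the file to "<title>.html" in the current working directory.
    mv "$file" "$newfilename.html"
done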
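It finds every output.html below the current directory, extracts its title with sed, and moves the file to <title>.html in the current working directory.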
Now I want to find a way to identify special characters such as / and :, because I get an error when the HTML title contains any of them.
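A possible way to detect such titles before calling mv is a shell pattern match. This is only a sketch: has_unsafe_chars is a hypothetical helper, and it checks just the two characters mentioned here. The / fails because it is the path separator, so mv treats everything before it as a directory name; the : is rejected by some filesystems.

```shell
# Hypothetical check: exit status 0 when the title contains "/" or ":".
has_unsafe_chars() {
    case $1 in
        */*|*:*) return 0 ;;
        *)       return 1 ;;
    esac
}
```

Inside the loop, a test such as if has_unsafe_chars "$newfilename" could then skip the file or print a warning instead of running mv.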
Thanks for the reply, Nic3500. For this to work I think I would have to write the rename.bash script, which I have not done yet. I was able to resolve this using the method I provided as an answer. Thank you again for your support.
– jmt
Aug 9 at 15:34