Category Archives: HTML

How to Extract Links from a Webpage

Turn this…

[Screenshot: page source]

into a neat file with clean links…

[Screenshot: extracted links]

Things you need:

  1. A web browser
  2. A webpage from which to extract links
  3. An online tool to automatically extract the links
  4. A program to open .csv files (Excel/Google Sheets/free alternative, etc.)

Hint: If you only need links from a portion of the page, use Firefox. It lets you select a portion of the page and view only the “selection source” instead of the source of the entire page.

Process

  • Navigate to the desired webpage in your favorite browser
  • View and copy the page source
    • Chrome: “View page source”
    • Firefox: “View Page Source” or “View Selection Source”
  • Paste the copied HTML into an online tool and click the button to generate a CSV file of the extracted links
  • Download, open, & enjoy!

Once the file is open, you can do other things in Excel, such as sorting by domain or removing duplicate links.
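If you would rather not paste page source into an online tool, the same process can be sketched locally. This is a minimal sketch using only the Python standard library; the names `LinkCollector`, `extract_links`, and `links_to_csv` are made up for this example, and it assumes the links you want are ordinary `<a href="...">` tags.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return unique absolute links, in the order they first appear."""
    collector = LinkCollector()
    collector.feed(html)
    unique = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in unique:          # drop duplicate links
            unique.append(absolute)
    return unique


def links_to_csv(links):
    """Render the links as CSV text with Domain and URL columns, sorted by domain."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Domain", "URL"])
    for url in sorted(links, key=lambda u: urlparse(u).netloc):
        writer.writerow([urlparse(url).netloc, url])
    return buffer.getvalue()
```

To use it, paste the copied page source into a string, call `extract_links(source, page_url)`, and write the result of `links_to_csv(...)` to a `.csv` file that Excel or Google Sheets can open.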

Online Tool for Extracting Links from webpages:

Regex to extract two-letter words

Extracting the two-letter words from the Wiktionary page Category:English two-letter words: https://en.wiktionary.org/wiki/Category:English_two-letter_words

Online, interactive regex tester: https://regex101.com/

Regex (matches every two-letter word, including those in the page's explanatory text, such as “in” and “of”):

/(\b[a-zA-Z]{2}\b)/g
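To try the same pattern outside regex101, here is a small Python sketch. The sample text is invented for illustration and is not the actual Wiktionary page text; `re.findall` plays the role of the `/g` flag.

```python
import re

# Same pattern as above: exactly two ASCII letters between word boundaries.
TWO_LETTER_WORD = re.compile(r"(\b[a-zA-Z]{2}\b)")


def two_letter_words(text):
    """Return every two-letter word in the text, in order of appearance."""
    return TWO_LETTER_WORD.findall(text)


# Made-up sample: a few entries followed by explanatory text.
sample = "aa ab ad (in, of) Pages in category"
print(two_letter_words(sample))  # ['aa', 'ab', 'ad', 'in', 'of', 'in']
```

Note how “in” and “of” from the explanatory text are picked up too, which is the reason for matching against the HTML instead.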

Source Text:

Result:

 

Regex for grabbing from HTML (this matches only the two-letter words that are links):

Page source:

Result:
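One possible way to restrict matches to link text is sketched below. The pattern is an assumption written for this example and is not necessarily the regex used originally; the sample markup is also invented and only imitates the general shape of a category listing.

```python
import re

# Hypothetical pattern: an <a ...> tag whose entire link text is two letters.
TWO_LETTER_LINK = re.compile(r"<a\b[^>]*>([a-zA-Z]{2})</a>")


def two_letter_link_words(html):
    """Return the two-letter words that appear as the full text of a link."""
    return TWO_LETTER_LINK.findall(html)


# Made-up markup for illustration only.
sample_html = (
    '<li><a href="/wiki/aa" title="aa">aa</a></li>'
    '<li><a href="/wiki/ab" title="ab">ab</a></li>'
    "<p>Pages in category (in, of)</p>"
)
print(two_letter_link_words(sample_html))  # ['aa', 'ab']
```

Regular expressions are fragile against real-world HTML, so for anything beyond a quick one-off an HTML parser is usually the safer choice.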

Wiktionary.org content text is available under the Creative Commons Attribution-ShareAlike License

YouTube NoCookie CORS issue

While doing some CSP (Content-Security-Policy) testing, I noticed what appears to be a CORS (Cross-Origin Resource Sharing) issue between youtube-nocookie.com and youtube.com.

The errors:

  • XMLHttpRequest cannot load https://www.youtube.com/annotations_invideo?cap_hist=1&video_id=g8FhoLmU160. No ‘Access-Control-Allow-Origin’ header is present on the requested resource. Origin ‘https://www.youtube-nocookie.com’ is therefore not allowed access.
  • XMLHttpRequest cannot load https://googleads.g.doubleclick.net/pagead/ads?sdkv=h.3.0.0&sdki=8000005&video_product_type=5&correlator=3864188496642048&client=ca-pub-6219811747049371&url=https%3A%2F%2Fscottontechnology.com%2Finstall-and-use-both-chrome-32-bit-and-chrome-64-bit-on-windows%2F&adk=2686905804&num_ads=1&channel=Vertical_303%2BVertical_32%2BVertical_5%2BVertical_737%2Bafv_overlay%2Bafv_user_id_uS_BLNTw4cypksAqCl9BGA%2Bafv_user_tinkererguy%2Binvideo_overlay_480x70_cat28%2Bivp%2Byt_cid_877827%2Byt_mpvid_0KUmPWauKRRfsFCB%2Byt_no_ap%2Byt_no_cp%2Byt_no_housead_in_auction%2Bytdevice_1%2Bytel_embedded%2Bytps_default&output=xml_vast3&flash=20.0.0&sz=474×360&host=ca-host-pub-5944971763221149&ht_id=6026736&hl=en&ea=0&image_size=450×50%2C468x60&ad_type=text_image_flash&eid=41351017&u_tz=-360&u_his=10&u_java=false&u_h=768&u_w=1366&u_ah=728&u_aw=1366&u_cd=24&u_nplug=5&u_nmime=7&dt=1455234283188&unviewed_position_start=1&videoad_start_delay=0&osd=2&frm=2&sdr=1&t_pyv=allow&min_ad_duration=0&max_ad_duration=110000&ca_type=flash&video_doc_id=yt_g8FhoLmU160&description_url=http%3A%2F%2Fwww.youtube.com%2Fvideo%2Fg8FhoLmU160&lact=485&lsv=false&yt_pt=APb3F2__AxaWxXJj5Yjj4-GdVqNGxdNyF13H-ZIbmDGqqcpVruUWYht30JR16YC7FYByTrqzlkpJNSiaYUB6uXkUoI09u1N8Ojf7fk3qFHEJq4V5hhIXxLt5EP4XWp2Lnhgs-SuVZfLRTJMNJ5c_1TX2WLwIfNRstFR8XpkRl2srMsssUkxSwsh0vGygIjOTDvUZzAjJVbBkWlklFQ3E7BaoJFAcwqTQKUazz3va1him528am6e0J2MBF5l4fIcMPG_a8s-WERzt4VHlC0A0OsD11HXc0-PmRbyqub6E2w&ytdevice=1&yt_vis=0&dbp=ChZRMnd3UVY1M0RxTk9CajNaeS10djZBEAEgAQ&ref=https%3A%2F%2Fscottontechnology.com%2Finstall-and-use-both-chrome-32-bit-and-chrome-64-bit-on-windows%2F. No ‘Access-Control-Allow-Origin’ header is present on the requested resource. Origin ‘https://www.youtube-nocookie.com’ is therefore not allowed access.
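Both errors come down to the same check: the browser looks for an `Access-Control-Allow-Origin` header on the response and compares it with the requesting origin. The following is a simplified model of that check, not the browser's actual code; the function name `cors_allows` is made up, and preflight requests are ignored.

```python
def cors_allows(origin, response_headers, with_credentials=False):
    """Simplified model of the Access-Control-Allow-Origin check."""
    # Header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    if allowed is None:
        # No header at all: this is the case reported in the errors above.
        return False
    if with_credentials:
        # Credentialed requests need an exact origin match (no wildcard)
        # plus Access-Control-Allow-Credentials: true.
        return (
            allowed == origin
            and headers.get("access-control-allow-credentials") == "true"
        )
    return allowed == "*" or allowed == origin


embed_origin = "https://www.youtube-nocookie.com"
print(cors_allows(embed_origin, {}))  # False
print(cors_allows(embed_origin, {"Access-Control-Allow-Origin": embed_origin}))  # True
```

In other words, the requests would only succeed if the youtube.com and doubleclick.net responses named `https://www.youtube-nocookie.com` (or `*` for non-credentialed requests) in that header.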

Noticed by others:

How to figure out the location where a form posts

If you are trying to figure out where a form posts (for instance, to see whether it submits information securely) and the post location isn’t readily visible in the page source, here is how to do it:

Using Firefox (download it if necessary), open Firebug.

Download Firefox

Download Firebug

Hit Ctrl+Shift+C to open the console.

Switch to the Net tab and submit the form.  (The console window needs to be open before you hit submit.)

In the list window you should see a POST request with the endpoint and the complete information about the request including Headers, Post, Response, JSON, Cache, and Cookies.

You can copy the endpoint location by right-clicking the request and selecting Copy Location.
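For the simpler case where the form's `action` attribute is present in the page source, the check can be sketched in a few lines of Python. This is only an illustration under that assumption: the names `FormCollector` and `form_targets` are made up, and it will not see endpoints that are set by JavaScript at submit time, which is exactly when the Net tab approach above is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormCollector(HTMLParser):
    """Record the method and action of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            method = (attributes.get("method") or "get").lower()
            action = attributes.get("action") or ""
            self.forms.append((method, action))


def form_targets(html, page_url):
    """Return where each form submits and whether that URL uses HTTPS."""
    collector = FormCollector()
    collector.feed(html)
    results = []
    for method, action in collector.forms:
        # An empty action submits back to the page itself.
        target = urljoin(page_url, action)
        results.append({
            "method": method,
            "url": target,
            "secure": urlparse(target).scheme == "https",
        })
    return results
```

Calling `form_targets(source, page_url)` on the copied page source returns one entry per form, so a form that posts to a plain `http://` URL stands out immediately.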

[Screenshot: Superlogout page source]