Summary: Microsoft Scripting Guy, Ed Wilson, shows how to use Windows PowerShell 3.0 to easily download web page links from a blog.
Microsoft Scripting Guy, Ed Wilson, is here. Today the weather outside is beautiful here in Charlotte, North Carolina in the United States. I opened the windows around the scripting house, and from my office, I am looking out on the green trees in our front yard. Our magnolia tree is still in bloom, as are our neighbor’s hibiscus plants. (Luckily for my neighbor, I get my hibiscus flowers from an organic grower on the Internet; otherwise, he might open his door one morning to find my teacup and me in his garden.)
The Scripting Wife continues to hammer away at the details for our three-week European tour, and the emails, Facebook posts, and tweets are flying back and forth across the big pond nearly 24 hours a day. When we have everything organized, I will post updates on the Scripting Guys Community page.
Use Invoke-WebRequest to obtain links on a page
By using the Invoke-WebRequest cmdlet in Windows PowerShell 3.0, downloading page links from a website is trivial. When I write a Windows PowerShell script using Windows PowerShell 3.0 features, I add a #Requires statement. (I did the same thing in the early days of Windows PowerShell 2.0 also. When Windows PowerShell 3.0 is ubiquitous, I will probably quit doing this.) Here is the #Requires statement.
#requires -version 3.0
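The #Requires statement stops a script before it runs on an older engine. For an interactive check, the same version information is available from the $PSVersionTable automatic variable. The following is only a minimal sketch; the variable names $minimumMajor and $meetsRequirement are mine and appear here just for illustration.

```powershell
# $minimumMajor and $meetsRequirement are hypothetical names used only for this illustration.
$minimumMajor = 3

# $PSVersionTable.PSVersion exposes a Major property that holds the engine's major version.
$meetsRequirement = $PSVersionTable.PSVersion.Major -ge $minimumMajor

if ($meetsRequirement) {
    "Windows PowerShell $($PSVersionTable.PSVersion) includes Invoke-WebRequest."
}
else {
    "Windows PowerShell 3.0 or later is required."
}
```

In a saved script, the #Requires statement remains the simpler choice because the engine enforces it before any code runs.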
The next thing I do is use the Invoke-WebRequest cmdlet to return the Hey, Scripting Guy! Blog. I store the returned object in a variable named $hsg, as shown here.
$hsg = Invoke-WebRequest -Uri http://www.scriptingguys.com/blog
The object that is stored in the $hsg variable is an HtmlWebResponseObject object with a number of properties. These properties are shown here.
PS C:\> $hsg | gm -MemberType Property
   TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition
----              ---------- ----------
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {...
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}
Content           Property   string Content {get;}
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {...
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}
RawContent        Property   string RawContent {get;}
RawContentLength  Property   long RawContentLength {get;}
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}
StatusCode        Property   int StatusCode {get;}
StatusDescription Property   string StatusDescription {get;}
I decide that I need to use the Links property to return the hyperlinks from the Hey, Scripting Guy! Blog. This command is shown here.
$hsg.Links
As I looked over the returned links, I noticed that there appeared to be several different classes of links. To review the different types of links, I selected the class property, piped the results to the Sort-Object cmdlet, and used the Unique switch. This command is shown here, along with the associated output.
PS C:\> $hsg.Links | select class | sort class -Unique
class
-----
external-link view-post
internal-link advanced-search
internal-link rss
internal-link view-application
internal-link view-detail-list
internal-link view-group
internal-link view-home
internal-link view-list
internal-link view-post
internal-link view-post-archive-list
internal-link view-user-profile
last
menu-title
MSTWButtonLink
page
rss-left
rss-right
selected
sidebar-tile-comments
sidebar-tile-contact
sidebar-tile-subscribe
tweet-url hashtag
tweet-url username
twtr-fav
twtr-join-conv
twtr-profile-img-anchor
twtr-reply
twtr-rt
twtr-timestamp
twtr-user
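If it also helps to know how many links fall into each class, the Group-Object cmdlet can count them. The sketch below does not touch the network; $sampleLinks is a hypothetical stand-in for $hsg.Links, and its href values are invented for illustration.

```powershell
# Hypothetical sample objects standing in for $hsg.Links; the href values are invented.
$sampleLinks = @(
    [pscustomobject]@{ class = 'internal-link view-post'; href = '/b/heyscriptingguy/archive/first.aspx' }
    [pscustomobject]@{ class = 'internal-link view-post'; href = '/b/heyscriptingguy/archive/second.aspx' }
    [pscustomobject]@{ class = 'internal-link rss';       href = '/b/heyscriptingguy/rss.aspx' }
    [pscustomobject]@{ class = 'twtr-reply';              href = 'http://example.com/reply' }
)

# Group the links by class, discard the grouped members, and list the largest groups first.
$classCounts = $sampleLinks |
    Group-Object -Property class -NoElement |
    Sort-Object -Property Count -Descending

$classCounts | Select-Object -Property Count, Name
```

Against the real page, replacing $sampleLinks with $hsg.Links should give the same kind of summary for every class in the list above.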
From the list of classes, I can see that I am interested in only the “internal-link view-post” class of links. I add a Where-Object command (using the simplified syntax) to return only the “internal-link view-post” class links, and I am greeted with the output shown here. (I have deleted all but one instance of the record.)
PS C:\> $hsg.Links |
Where class -eq 'internal-link view-post'
innerHTML : <span></span>Use PowerShell Redirection Operators for Script Flexibility
innerText : Use PowerShell Redirection Operators for Script Flexibility
outerHTML : <a class="internal-link view-post" href="http://blogs.technet.com/b/heyscriptingguy/archive/2012/09/20/use-powersh
ell-redirection-operators-for-script-flexibility.aspx"><span></span>Use PowerShell
Redirection Operators for Script Flexibility</a>
outerText : Use PowerShell Redirection Operators for Script Flexibility
tagName : A
class : internal-link view-post
href : /b/heyscriptingguy/archive/2012/09/20/use-powershell-redirection-operators-for-script-flex
ibility.aspx
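The simplified Where-Object syntax used in this command is new in Windows PowerShell 3.0, and it is shorthand for the older script block form. Here is a small sketch that compares the two; $sampleLinks is a hypothetical stand-in for $hsg.Links, and its values are invented for illustration.

```powershell
# Hypothetical sample objects standing in for $hsg.Links; the text and href values are invented.
$sampleLinks = @(
    [pscustomobject]@{ class = 'internal-link view-post'; outerText = 'First post'; href = '/b/heyscriptingguy/archive/first.aspx' }
    [pscustomobject]@{ class = 'external-link view-post'; outerText = 'Guest post'; href = 'http://example.com/guest' }
    [pscustomobject]@{ class = 'twtr-reply';              outerText = 'reply';      href = 'http://example.com/reply' }
)

# Simplified syntax (Windows PowerShell 3.0 and later).
$simplified = @($sampleLinks | Where-Object class -eq 'internal-link view-post')

# Equivalent script block syntax, where $_ is the current pipeline object.
$scriptBlock = @($sampleLinks | Where-Object { $_.class -eq 'internal-link view-post' })

# Both filters select the same sample link.
$simplified.Count -eq $scriptBlock.Count
```

The script block form is still needed when a filter combines more than one condition.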
From the output of the Where-Object command, I see that I am interested in only the outerText and href properties. I select these two properties, and I am left with the script that is shown here.
Get-WebPageLinks.ps1
#requires -version 3.0
$hsg = Invoke-WebRequest -Uri http://www.scriptingguys.com/blog
$hsg.Links |
Where class -eq 'internal-link view-post' |
select outertext, href
The script and associated output are shown in the image that follows.
One thing that might be interesting is to send the output to the Out-GridView cmdlet, which permits easier analysis of the data. Doing that requires only adding the Out-GridView command to the end of the pipeline. The modification is shown here.
$hsg.Links |
Where class -eq 'internal-link view-post' |
select outertext, href | Out-GridView
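The href values shown in the earlier output are relative to the site root, so they are not directly usable as full URLs. One possible way to resolve them is a calculated property that calls the System.Uri constructor. This is only a sketch: $baseUri is my assumption, taken from the host name that appears in the outerHTML output, and $sampleLinks is a hypothetical stand-in for the filtered $hsg.Links.

```powershell
# Assumption: the base address is the host name shown in the outerHTML output earlier.
$baseUri = [uri]'http://blogs.technet.com'

# Hypothetical stand-in for the filtered $hsg.Links, using the record shown earlier.
$sampleLinks = @(
    [pscustomobject]@{
        outerText = 'Use PowerShell Redirection Operators for Script Flexibility'
        href      = '/b/heyscriptingguy/archive/2012/09/20/use-powershell-redirection-operators-for-script-flexibility.aspx'
    }
)

# Add a calculated Url property that combines the base address with each relative href.
$resolved = @(
    $sampleLinks | Select-Object -Property outerText, @{
        Name       = 'Url'
        Expression = { (New-Object -TypeName System.Uri -ArgumentList $baseUri, $_.href).AbsoluteUri }
    }
)

$resolved | Format-List
```

If an href is already an absolute URL, this System.Uri constructor ignores the base address and returns the absolute URL unchanged, so the same expression should work for both kinds of links.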
Join me tomorrow when I will talk about more cool Windows PowerShell 3.0 stuff.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy