now let us see one example of data
scraping from websites you can see here
i have a page on my local where we have
some fields here five fields and
corresponding to that we have created a
excel table so i have added all the
headers and then i also have a next link
i click on next then it moves on to the
next record i want to extract all the
records in the excel sheet in different
rows one by one
so let's see how we would be performing
this data scraping part you can see on
the location or the url of the page i
have pg one for page one pe two for page
two and you can see with the index
number i can easily move over to any
page let us try to automate this very
simple page generally we don't get this
kind of simple navigation but we have
some complicated type of navigation
which opens up maybe using a page scroll
or clicking on next button or some other
actions so let us start with the simple
one so i add the libraries microsoft
html object library and internet
controls so them i as internet explorer
mhd has html document i initialize the
browser set i equals new internet
explorer i make it visible then i
navigate to the url so ie dot visible
equals true and then ie dot navigate now
since this is a local url i can give the
file path
and it will be taking me to the first
page
all right so we have opened up the page
right and after that i can add a
statement to wait until the page loads
so do until i e dot ready state equals
ready state underscore complete and i e
dot b z equals false
and till the time the page is not loaded
right it will stay inside the loop and
inside the loop i am adding do event so
that the excel do not get stuck
and i am still able to perform all the
events on the excel file all right so
that's the basic that we have built up
from last couple of lectures now moving
on to this example of data scripting
okay so i surround this weight statement
inside a for loop
and not just the weight statement but
i'll be performing more actions inside
it i'll be telling you why i've added
this code in the for loop so for now let
us find the last available row in the
excel sheet where we need to write the
data which we will be extracting from
the portal
so that would come from sheet one dot
cells rows dot count comma one and excel
up dot row plus one right remember you
need to add plus one otherwise it will
keep on overwriting to one row only
so sheet one dot cells now this is the
last row variable lr comma
one right so i'm talking about the first
column right in the last available row
where we need to write the data i'm
talking about first column what would be
the value that value will come from the
page right so this value i need to
extract so i would first go to the page
and inspect the value for syn i need to
have this in there and before i try to
extract any element i need to set the
document
so set ht equals i dot document
now i inspect this sin i can see here we
have the id for this sin and the number
is inside here as the text
so let us see whether we have a unique
element here whose id is in so i can
check uniqueness here in the developer
options itself just hit ctrl f on the
html source and then you can search the
element using text or maybe using xpath
or css selectors so if i put a hash syn
that means i am looking for all the
elements who have tag name as a and hash
means id
so
all the elements with tag a and their id
should be sent so i get one here that
means that element is unique on the page
all right so ht.getelementbyid
syn as the id and then dot inner text i
need to get the inner text and write in
that cell
now in the similar fashion you can
extract all the fields for example
company name status and the registration
number and put in different columns of
the excel file the next action is
clicking on this next button so that we
reach to the next page and perform the
same set of operations for the next page
so you can see here we have href and the
href is actually changing with every
page dot pg2.html
is the href for the second page
and let's try to find this element using
the tag anchor
you can see that we have more than one
elements present for this match so i
cannot use get elements by tag name
directly but i'll be using this property
to find that right element and then
perform click operation because one
property is unique for this particular
element which is text
so
set lms equals st dot get elements by
tag name
now iterate through all the elements so
for each element lms and it ends with
next and i put a condition here that if
lm dot i need to talk about one of the
properties can i use href or can i use
text let us try to use hf
we know that the references are not same
the links are not same for all the pages
it changes so if i try to search that it
should contain html that would probably
work for this example but in general
practice nowadays we don't see dot html
in the end of the pages all right so
that's not the right approach instead if
i use lm dot inner text and i say that
this inner text should be
next only if it contains next maybe in
that case i want to click on that
element so lm dot click if that is true
now when it would be clicking on that
link it would be actually loading a new
page so loading a page takes time so
that's why i added this block in the
very beginning that when it moves to the
next statement it should encounter this
block first and it should wait for page
to load
right so it waited for the page to load
it sets the new document it evaluated
the last row again and then it fetches
the syn number and then it clicks on the
next button for that next page and
similarly it will keep on moving and it
will be extracting the data for all the
records one by one
so you can see here when we run it in
unattended mode it extracts all the send
numbers one by one
so in the similar way i also need to
extract the name the company status roc
and registration number so i add a
comment here for all these and i'll be
adding a code statement for extraction
of these
fields one by one
okay so first inspects the company name
you can see we don't have any name or id
for this property but it is a tag td
which comes under this tr right tr is
table row td is table data t body is the
parent one so inside the t body tag
which is table body
let's find how many t bodies we have we
have only one and how many table rows we
have under it so we have four rows under
it
okay so there is one table which has
four rows and if we see the number of
rows so first row is company name second
one is status third one is roc and
fourth one is registration number so
these are the four rows that we have for
this table
okay so we need to iterate through all
the rows one by one and we need to
extract the value of second column
because the value would be in second
column in the first column we have their
names which is like company name and
company status okay so sheet1 dot cells
lr comma two dot value
ht dot get elements by tag name what is
the address for that element that would
be inside the t body and under that t
body we need to go to a particular row
okay so get elements by tag name okay t
body i put here and i put an index 0
because we have only one t body here on
the page and then under that also i need
to go to another tag which is table row
so i put another get elements by tag
name
tr which is again the first row because
i want to extract value from the first
row because first row is having company
name so i put 0 as an index and inside
that row i need to extract the second
column so get elements by tag name td i
put 1 here for the second column okay so
let's see when i run this am i able to
get the company name or not if i'm
successful i'll be putting the same
logic for other fields as well i had a
watch here i can see that we have the
first company here and i let the code to
run so in the excel i should have the
names for all the companies so this
works absolutely fine for the name as
well
so so far we are able to fetch the syn
numbers using the id and the names using
this logic now let us have a look at the
third one the third one is looking a bit
different because it is red in color so
maybe some difference in properties
let's have a look at the properties
let's inspect this element
so we can see we have multiple rows here
in the html four rows
now
this is not in the td directly but under
that td we have a paragraph and under
that paragraph we have a span and we
have also checked here that we have only
one tag with the name span present on
the entire page so we can directly use
the tag name as the
locator strategy so get elements by tag
name span and then index zero dot value
so this would work absolutely fine you
need not give that complex logic that we
have used for the previous value
moving on to the third row so first row
we are done second row we are done for
the third row if i expand i again see
two tds so i use the same logic that
i've used for the name i put it here
and then i would be just changing the
row index so here i put the column
fourth because i want this value in the
fourth column and the row index tr which
is here is 2 right so 0 1 and 2 that's
the third row similarly for the
registration number i need to put here
the fourth row which is having index 3
so i'll be changing the column number as
well as the index of the row number we
are done with the changes let's try to
run this code and see whether we are
able to do this entire data scraping
using this code or not
so it has extracted all information in
the excel table format
all right so that's how we do scraping
we'll see another example in the next
video