Hey guys, in this video I want to go over a beginner topic in Python for data science: how to import a text file that contains a data set into Python. In other words, how to get data into a Python script. In this video we're going to use a Python library called pandas, which is basically the most common library for importing data into Python.
Since pandas is a third-party library, you have to have it installed in your local Python environment. In my case I already do, but if you don't, open a terminal window and run "pip install pandas". If I run that, it doesn't really do anything because I already have it; it just says "Requirement already satisfied". For reference, my setup is Python 3.8 with pandas installed. We're going to use pandas to import text files and .csv files into Python, and I have a Python script open in VS Code. It's a short script with two different examples of how you can import data. Let's go ahead and go over this script.
For importing data, I'm going to assume you have some .csv, .txt, or .dat file on your computer that you want to import. You can also import from a URL, and I'll get to that in the second example. When you run a Python script, you basically always want to create a directory for it, so on my local computer I have a folder called import_data, and this directory holds both the Python script I'm running and the .txt file I want to import. This is the data set, and as you can see it has a bunch of rows and several columns with different data types; it's housing prices in London, just a random data set that I found on Kaggle. I downloaded it into the same folder as my Python script, import_data.py. So now let's actually go into the Python script and see how we can import the data.

The first thing we want to do, since we're using the pandas library, is import it. So at the top of your Python script you run the line "import pandas", and optionally you can alias it as pd with "import pandas as pd". This loads the library and lets me call its functions by typing pd. followed by the name of the function I want. The function in pandas that reads .txt, .csv, and .dat files, which are basically the most common text files that hold data, is read_csv. When you import your data, you set some variable name, I'm calling it london_data, to be the variable that holds your data frame, and you set that variable equal to pd.read_csv. This is the function that reads in a csv file, and the only required parameter is the path to the file.
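Putting those pieces together, a minimal sketch might look like this. The file name and its contents here are hypothetical stand-ins for the London housing file; in practice you would point path_to_file at your own data:

```python
import pandas as pd

# Hypothetical stand-in for the London housing file from the video;
# in practice, set path_to_file to the path of your own .csv/.txt/.dat file.
path_to_file = "london_housing.csv"

# Write a tiny example file so this sketch runs on its own.
with open(path_to_file, "w") as f:
    f.write("date,area,average_price\n")
    f.write("1995-01-01,city of london,91449\n")
    f.write("1995-02-01,city of london,82203\n")

# read_csv's only required argument is the path to the file.
london_data = pd.read_csv(path_to_file)
print(london_data.head())
```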
When you run it on your computer, you pass in, inside quotation marks, the path to the file you want Python to read. In this example script, instead of passing the string in directly, I pass in a variable that I've set equal to the string holding the file path, which is the same thing. I could have put the string straight into the parentheses, but to keep the code a little cleaner I created a variable, path_to_file, that holds the full path as a string, and I pass that variable into read_csv. So I'll run those two lines of code: I set path_to_file equal to the file path, and I load it into read_csv. Now, depending on what IDE you're using, importing data doesn't really change the output, so you might not see that anything happened. But there are ways to check that your data set loaded correctly, by inspecting what's in your london_data variable. For example, in this line of code I print the head of london_data. If I run it, it prints the first five rows of the data set to the terminal, and looking at that output, it looks like the data set was loaded in correctly.
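The head check here, along with the shape and dtypes checks that come up next, might look like this on a small made-up frame (this frame and its column names are hypothetical stand-ins for the imported london_data):

```python
import pandas as pd

# Small hypothetical frame standing in for the imported london_data.
london_data = pd.DataFrame({
    "date": ["1995-01-01", "1995-02-01", "1995-03-01"],
    "area": ["city of london", "city of london", "city of london"],
    "houses_sold": [17.0, 7.0, 14.0],
})

print(london_data.head())    # first five rows (all three here)
print(london_data.shape)     # (number of rows, number of columns)
print(london_data.dtypes)    # the data type pandas assigned to each column
```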
If I go back to the data that we're loading in, on this .csv file the first row is actually the column names, which is very common in a lot of data sets. If you don't pass any optional parameters to read_csv, the default is to read the first row as the column names, and that was correct for this file, because the first row is column names. Then, if you want the shape of the data set, that is, how many rows and columns it has, you can take the variable name of your data frame, so london_data, and use .shape, and print that to the terminal window. When I run that, it outputs the shape of the imported data set: 13,549 rows and seven columns. I'm also going to print to the screen the data types of the columns of our data frame. That's print(london_data.dtypes), where dtypes stands for data types, and it prints what pandas considers the data type of each column. There's a date column that's treated as an object; there's an area column; there's code; houses sold, which is a float; number of crimes; and so on. That gives me the data types for the seven different columns. If you check these against your data set, you can confirm it loaded correctly.

That's one example of importing data in Python. The second example is similar but with a different data set: this one loads data from a URL, which you can also do in Python. Same as last time, instead of passing the path or URL directly into read_csv, I save it as a variable beforehand. To import data from a URL, it's basically the same thing: inside quotation marks you create a string, and that string holds the complete URL to your data set. This is just another data set that I found online, resid_energy.dat, so the full URL is what I copy into my quotation marks.

This data set is a little different from the first one. The first row does not contain any column names, and it's not a CSV file but a .dat file, so the columns of data are no longer separated by commas. Commas are the default separator for the columns in read_csv, because CSV stands for comma-separated values. read_csv can read text files and .dat files like this one, but you have to add an additional parameter that tells it where to split the line into different columns. In this case, I'm going to do an example of using spaces as the separator for the different columns.
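A quick way to see what the whitespace separator pattern described next actually matches, independent of pandas (the sample line here is made up):

```python
import re

# r"\s+" is a regular expression matching a run of one or more
# whitespace characters, so splitting on it breaks a line into columns.
line = "1984   235.8  17"
print(re.split(r"\s+", line))  # -> ['1984', '235.8', '17']
```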
To do that, I again call read_csv and pass in the URL as a string. The next optional parameter I have to define in this case is the separator, sep, which tells read_csv what splits a line into different columns, and we said we want spaces. For spaces you pass sep=r"\s+", a regular expression meaning one or more whitespace characters. So this line of code basically says: wherever there is one or more whitespace characters, start the next column; use that as the separator. Additionally, we have to add another optional parameter, header, because for this particular data set the first row does not contain column names, so we have to let pandas know that we don't have any column information. To do that, you put header=None, with a capital N, which tells read_csv that the first row of the file you're importing is actually the first row of the data.
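Those two parameters together might look like this. Since the original URL isn't reproduced here, this sketch writes a small whitespace-separated .dat file locally with made-up values and reads it back; a URL string would work the same way as the local path:

```python
import pandas as pd

# Hypothetical stand-in for the resid_energy.dat URL from the video;
# passing a URL string to read_csv works the same as this local path.
path_to_file = "resid_energy.dat"
with open(path_to_file, "w") as f:
    f.write("1984 235.8\n")
    f.write("1985 241.5\n")

# sep=r"\s+" splits columns on one or more whitespace characters;
# header=None tells pandas the first row is data, not column names.
energy_data = pd.read_csv(path_to_file, sep=r"\s+", header=None)
print(energy_data.head())
print(energy_data.dtypes)  # columns are auto-named 0, 1, ...
```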
A header row is what read_csv assumes if you don't add any parameters besides the file path. I'm going to run these two lines of code to import my data set, and then, same as before, to check that we imported the data correctly, I'll print some descriptive statistics about it. I print the head, which gives me the first five rows of the data set, and compare it to the URL that has the data: the first line should be 1984, 235.83, and that's correct. Then I print the shape, which is 27 rows and nine columns, and I can also see the information about the data types. In this case we did not name any of the nine columns, so pandas just labels them 0 through 8; only the first one is an integer, which is correct because the first column is just the year and all the other values are floats. By checking that, we see that we have correctly imported the data using pandas and read_csv.

If you want more information about the optional parameters, you can always check the function documentation. pandas.read_csv has more optional parameters and different values that you can pass into them, and pandas also has other functions for reading in other data as well. I went over read_csv, but there are other options; for example, if you want to read in an Excel file, you would use pd.read_excel and pass in the path to an Excel sheet as a string. That's a very common way of importing data into Python, and that's it for this video.