By Michael Fudge
Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data [1].
The goals of data analysis follow from this definition: to describe and illustrate, to condense and recap, and to evaluate data.
Most data analysis problems start with a fundamental question (but they don't always have to). Examples:
Don't think of data analysis as crunching numbers and churning out graphs.
I like to think of data analysis as "information storytelling." Unlike slapping a chart on a PowerPoint slide, it's a full disclosure of the process:
You could do this in Excel, but Python and Jupyter Notebook offer several advantages:
Data analysis is a skill every information professional should have, and this is a primary reason we teach programming in the iSchool. When you can code, your data analysis gets that much better!
Pandas is a Python library for data analysis. The homepage of Pandas is http://pandas.pydata.org/.
Pandas takes the pain and suffering out of data analysis by doing a lot of the work for you. Pandas is so elegant at times that people feel it's "indistinguishable from magic." In this lesson we will try our best to explain what Pandas is doing "behind the curtain" and expose the magic behind Pandas.
Before you can use Pandas you must import it into Python. When we import Pandas we usually alias it as pd so that we don't have to type pandas in our code. For example, instead of typing pandas.DataFrame() we can save ourselves a few keystrokes with pd.DataFrame():
import pandas as pd
# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')
The key to understanding how Pandas works is to know its underlying data structures: the Series, the DataFrame, and the Index.
A Series is the most basic structure in Pandas. It is simply a named list of values. Every Series has a data type, expressed as a numpy data type. (Pandas uses another module called numpy for its implementation.)
The following example creates a Series called grades. We use Python's named-arguments feature to set the data, name, and dtype arguments:
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades", dtype='int')
grades
NOTE: We don't have to set the dtype argument when we create a Series; typically the type will be inferred from the data types of the values in the list.
For example, here we create the same series, but instead we let Python infer the type from the list.
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades") # we left off dtype='int'; the int type is inferred from the values in the list
grades
What's with those numbers to the left of the grades? That's the index. An index is an ordered list of values. It's how Pandas accesses specific values in a Series (or, as we will see in a bit, a DataFrame).
The index of the series works a lot like the index in a list or a string. For example, this code prints the first and last grades in the series.
print("first grade:", grades[0])
print("last grade:", grades[5])
As you would expect, a Series is iterable, meaning we can use slice notation and, if we want, loop over the values:
for grade in grades:
    print(grade, end=" ")
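The slice notation mentioned above works by position, just like a Python list. A quick sketch using the same grades series:

```python
import pandas as pd

grades = pd.Series(data=[100, 80, 100, 90, 80, 70], name="Grades")

# a slice returns a new Series containing positions 1 through 3
middle = grades[1:4]
print(list(middle))  # [80, 100, 90]
```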
The beauty of Pandas is that for most operations, you won't even need a loop! For example, we can calculate each of the following without a loop:
print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("Lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades:", grades.count())
Where did all these functions come from? They're methods (object functions) associated with the Series object. If you dir(grades) you can see them!
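For instance, this quick sketch filters the dir() listing down to the public names, then checks for a few of the statistical methods we just used:

```python
import pandas as pd

grades = pd.Series(data=[100, 80, 100, 90, 80, 70], name="Grades")

# dir() lists every attribute; skip the _private ones to see the usable methods
public = [name for name in dir(grades) if not name.startswith('_')]
print('max' in public, 'mean' in public, 'sum' in public)  # True True True
```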
If you want to learn more, here's the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
The DataFrame is simply a dictionary of Series. You can think of it as a table of data that you can access and manipulate in a variety of ways. I like to think of the DataFrame as a programmable spreadsheet: it has rows and columns which can be accessed and manipulated with Python. The DataFrame is the most common Pandas data structure. As you'll see, it's expressive and versatile, making it an essential tool in data analysis.
To make a DataFrame we create a Python dict of our Series, using each series name as the key and the Series itself as the value. This example creates a DataFrame from three Series of student names, grade-point averages (GPAs), and class years.
names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name': names, 'GPA': gpas, 'Year' : years } # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
What? Pandas didn't use the same column order as the dict? We entered Name, GPA, Year and the DataFrame shows GPA, Name, Year. No sweat! This can be fixed by including the columns named argument when we create the DataFrame:
students = pd.DataFrame( series_dict, columns = ['Name', 'GPA', 'Year'] )
students
What happens when we iterate over a DataFrame? Let's try it!
for column in students:
    print(column)
It iterates over the columns! Why?!? If you recall, a DataFrame is a dict of Series; this is what happens when you loop over a dict.
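This mirrors plain Python: looping over a dict yields its keys, just as looping over a DataFrame yields its column names:

```python
# a plain Python dict: iteration yields the keys, not the values
series_dict = {'Name': ['Allen', 'Bob'], 'GPA': [4.0, 3.0], 'Year': ['So', 'Fr']}
for key in series_dict:
    print(key)  # prints Name, GPA, Year -- same behavior as looping over a DataFrame
```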
So how do you loop through the rows? Use the DataFrame's to_records() method, of course.
This example prints out the students on the Dean's List (GPA of 3.4 or better):
for student in students.to_records():
    if student['GPA'] >= 3.4:
        print("%s with a GPA of %.3f is on the Dean's list." % (student['Name'], student['GPA']))
Of course, if you've been paying attention, you might suspect that we do not need to iterate over the DataFrame to do this!
Pandas allows us to be expressive in our data selections. For example, here's a DataFrame of only the names and GPAs of the students on the Dean's List.
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list
What you just witnessed sure looks like magic, but it's not. Once you understand how Pandas works, it becomes easy to employ approaches like the one you saw above to your data analysis.
Let's dive deeper.
To select a single column we access its key from the DataFrame dictionary. What we get back is a Series.
For example this retrieves the GPA's:
students['GPA'] # Series
Oh? You wanted that as a DataFrame, not a Series? Then you need to retrieve a list of columns instead:
students[ ['GPA'] ] # DataFrame because ['GPA'] is a list
Likewise, you can use this notation to select more than one column, like this example which selects Name and Year:
students[ ['Name', 'Year'] ]
We've seen how to select columns from the DataFrame, so let's move on to rows. Typically DataFrames have far more rows than can fit on your screen. When getting a sense of what your data has to offer, you can use three DataFrame methods.
head(n) will return the first n rows. For example, here's the first 3 rows:
students.head(3)
Likewise, we can get the last 2 rows in the DataFrame using the tail(n) method:
students.tail(2)
How about a random sample of the DataFrame? The sample(n) method will retrieve n rows at random. Let's get 3 rows at random:
students.sample(3)
A boolean index is a Series of boolean values. For example, this boolean index describes the rows in the DataFrame where the student's Year is freshman ('Fr'):
students['Year'] == 'Fr'
At a glance this might not seem useful. But when we apply a boolean index to a DataFrame, it selects only the rows which are True in the boolean index. For example, this is only the freshmen ('Fr'):
students[ students['Year'] == 'Fr' ]
I'll admit the syntax is a little awkward, but you get accustomed to it over time. It is a very expressive way to filter rows in a DataFrame.
Pandas has special operators for logical and (&), or (|), and not (~).
For example, this retrieves students with a GPA between 2.5 and 3.0. Notice the parentheses; they are required because & binds more tightly than the comparison operators.
students[ (students.GPA >=2.5) & (students.GPA <= 3.0)]
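The or (|) and not (~) operators work the same way. Here's a sketch using the same students data built earlier in the lesson:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA':  [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

# or (|): students who are freshmen OR seniors
print(students[(students.Year == 'Fr') | (students.Year == 'Sr')])

# not (~): students who are NOT freshmen
print(students[~(students.Year == 'Fr')])
```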
And as we've seen you can combine row selections using boolean indexing with column selections.
# column selection followed by a boolean index
students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
Even more useful, we can assign these results to a variable, and since that variable is a DataFrame, the same rules apply!
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list[['Name']].head(2)
It's easy to manipulate the data in a DataFrame. This is typically done as part of the data preparation steps of data analysis: sometimes we need to create new columns for a better analysis, and sometimes we need to clean up existing data. This is sometimes referred to as feature engineering.
For example, this code adds a column called 'Deans List', initially set to 'No' for all students:
students['Deans List'] = 'No'
students
Next, we set 'Deans List' to 'Yes' for any student who has a GPA of 3.4 or better:
students.loc[students['GPA'] >= 3.4, 'Deans List'] = 'Yes'
students
NOTE: If you use chained indexing such as students['Deans List'][students['GPA'] >= 3.4] = 'Yes', you may get a warning that states: A value is trying to be set on a copy of a slice from a DataFrame. You get this warning because the first selection may return a copy, so the assignment might not change the original DataFrame. A single .loc[rows, column] assignment writes directly into the original and avoids the problem.
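A minimal, self-contained sketch of the warning-free pattern (toy data, not the lesson's full students table):

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris'],
    'GPA':  [4.0, 3.0, 3.4],
})

# .loc selects rows and the column in one step, so pandas assigns
# into the original DataFrame rather than a possible copy
students['Deans List'] = 'No'
students.loc[students['GPA'] >= 3.4, 'Deans List'] = 'Yes'
print(students)
```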
You can also use this approach to clean up data. For example, Ed should be Eddie:
students.loc[students['Name'] == 'Ed', 'Name'] = 'Eddie'
students
Here's an example of how to create a new calculation from an existing column. Here we create a column 'Pts From Deans List' to calculate the number of points each student's GPA is from the Dean's List cutoff of 3.4:
students['Pts From Deans List'] = students['GPA'] - 3.4
students
We conclude our Pandas tour with some examples of how easily Pandas handles larger data sets. The true power of Pandas is not in creating your own Series and DataFrames but in loading them from files.
For example, this code reads a CSV (comma-separated values) file stored on GitHub and loads it into the customers DataFrame:
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
customers.head()
Here's an example of loading weather data for Syracuse, NY from 1997 to 2015, also found on GitHub:
weather = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/weather/syracuse-ny.csv')
weather.head()
There are too many columns in this dataset to see at once. This is where the DataFrame.columns property comes in handy:
weather.columns
This example displays the min, mean, and max temperatures for July 4, 2015. Notice how we filter on the 'EST' date column to retrieve the row in question.
weather[ weather['EST'] == '2015-7-4'][['EST','Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
You can even read HTML tables off a website with Pandas and import them into a DataFrame!
For example, this retrieves a table of US cities by population from Wikipedia: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
data[3].head()
The read_html() method returns a list of DataFrames, since there may be more than one table on the webpage. data[3] represents the 4th table on the webpage, which is the one we wanted in this case.
Also the data needs to be cleaned up a bit. We will discuss how to do this in our In-Class Coding Lab.
To demonstrate how versatile the read_html() method can be, let's read the schedule of classes off the iSchool website: https://ischool.syr.edu/classes
Remember that read_html() returns a list of DataFrames, so we must select the appropriate index containing the data we want. In this case the schedule is in data[0]:
data = pd.read_html('https://ischool.syr.edu/classes/', attrs = {'id': 'classSchedule'} )
schedule = data[0]
schedule.head()
Unfortunately, the schedule DataFrame is missing column names. But Pandas allows us to add these at any time. Very cool!
schedule.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']
schedule.head()
And now we can do some analysis, like finding classes on Wednesdays:
schedule[ schedule['Days'] =='W']
This next example demonstrates how to load JSON data into a DataFrame. If the JSON data is structured like a list of dicts, then it's quite simple; the example would work the same if we read the JSON data from a file first. We need orient='records' because our JSON data is a list of records:
json_data = '[{"GPA":3.0,"Name":"Bob"},{"GPA":3.7,"Name":"Sue"},{"GPA":2.4,"Name":"Tom"}]'
tweets = pd.read_json(json_data, orient='records')
tweets.head()
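One caveat: newer versions of Pandas expect read_json() to receive a path or file-like object rather than a raw string. Wrapping the string in io.StringIO, which behaves like an open file, keeps the example working either way:

```python
import pandas as pd
from io import StringIO

json_data = '[{"GPA":3.0,"Name":"Bob"},{"GPA":3.7,"Name":"Sue"},{"GPA":2.4,"Name":"Tom"}]'

# StringIO makes the string look like an open file to read_json()
df = pd.read_json(StringIO(json_data), orient='records')
print(df)
```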
Let's demonstrate how to convert a list of dicts in Python into a Pandas DataFrame; we will then demonstrate how to convert it back to a list of dicts. This type of scenario is useful when you call an API and would like to load the data into Pandas, or when you take a Pandas DataFrame and process it further outside of Pandas.
students_list_of_dict = [
{ 'Name' : 'Bob', 'GPA' : 3.0 },
{ 'Name' : 'Sue', 'GPA' : 3.7 },
{ 'Name' : 'Tom', 'GPA' : 2.4 }
]
students_df = pd.DataFrame(students_list_of_dict)
students_df
Now let's convert students_df back into a list of dicts. Choosing the named argument orient='records' produces a list of dicts.
students = students_df.to_dict(orient='records')
students
In all these examples, we've been placing the DataFrame object as the last command in the cell. The Jupyter environment will display whatever item appears last, but what if we want to display the DataFrame in the middle of our code, or display more than one DataFrame? This is where the display() function of the IPython package comes into play:
from IPython.display import display
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
choice = input("Customers: Would you like to see the first 2, the last 2, or 2 at random? [first, last, random]?").lower()
if choice == 'first':
    display(customers.head(2))
elif choice == 'last':
    display(customers.tail(2))
else:
    display(customers.sample(2))
print("And we're done here!")
This approach can be used to build interactions with DataFrames. For example, this code asks you to enter an email, then it prints the customer details.
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
customer_list = customers.sample(5)
print("CUSTOMER INQUIRY")
display(customer_list[['Email']])
email = input("Enter a customer email to retrieve their complete record: ").lower()
print("CUSTOMER DETAILS FOR " + email )
display(customer_list[ customer_list['Email'] == email])