Data Analysis with Pandas

By Michael Fudge

What is data analysis?

Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data [1].

The goals of data analysis are to:

  1. discover useful information,
  2. provide insights,
  3. suggest conclusions and
  4. support decision-making.

Some examples

Most data analysis problems start with a fundamental question (but they don't always have to). Examples:

  1. Do students who study in groups perform better on examinations that those who study alone?
  2. What role (if any at all) does weather play in consumer shopping habits?
  3. What types of passengers were most likely to survive the Titantic?
  4. Among American Football teams who "go for it" on 4th downs, what is their win percentage?
  5. Are we stocking products in warehouses closest to the customers who purchase them?

Data analysis is "information storytelling"

Don't think of data analysis as crunching numbers and churning out graphs.

I like to think of data analysis as "information storytelling." Unlike slapping a chart on a powerpoint slide, it's a full disclosure of the process:

  1. helping the reader understand your methodology (how you acquired and prepared the data)
  2. sharing your complete analysis (including things that didn't work)
  3. provding a narrative as to what the results mean, and most importantly
  4. providing an honest and accurate analysis.

Can't I just do this in Excel?

You could do this in Excel, but Python and Jupyter Notebook offer several advantages:

  • its easier to automate and update the process later on, since its code
  • You can intermix code with analysis in the Jupyter notebook
  • it can integrate with a variety of services and systems (because its code!).

Data analysis is a skill every information professional should have, and this is a primary reason we teach programming in the iSchool. When you can code it makes your data analysis that much better!

What is Pandas?

Pandas is a Python library for data analysis. The homepage of Pandas is http://pandas.pydata.org/.

Pandas takes the pain and suffering out of data analysis by doing a lot of the work for you. Pandas is so elegant at times, people feel it's "indistinguishable from magic." In this lesson we will try to our best to explain what Pandas is doing "behind the curtain" and expose the magic behind Pandas.

Table of Contents:

  1. Importing Pandas
  2. Pandas Data Structures
  3. Selecting rows and columns in a DataFrame
  4. Manipulating a DataFrame
  5. Real-world Pandas

1. Importing pandas

Before you can use pandas you must import it into Python. When we import Pandas we usually alias it as pd so that we don't have to type pandas in our code. For example instead of typing pandas.DataFrame() we can save ourselves a few keystrokes with pd.DataFrame()

In [1]:
import pandas as pd

# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')

2. Pandas Data Structures

The key to understanding how Pandas works is to know about its underlying data structures. the series, the data frame, and the index.

Series

A series is the most basic structure in Pandas. It is simply a named-list of values. Every series has a data type, expressed as a numpy data type. (Pandas uses another module called numpy for is implementation.)

The following example creates a series called grades we use Python's named arguments feature to set the data, name and dtype arguments

In [2]:
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades", dtype='int')
grades
Out[2]:
0    100
1     80
2    100
3     90
4     80
5     70
Name: Grades, dtype: int32

NOTE: We don't have to set the dtype argument when we create a Series, typically the type will be inferred by the data types of the values in the list.

For example, here we create the same series, but instead we let Python infer the type from the list.

In [3]:
grades = pd.Series(data = [100,80,100,90,80,70], name = "Grades") # we left off dtype='int', assumes int anyways thanks to the list
grades
Out[3]:
0    100
1     80
2    100
3     90
4     80
5     70
Name: Grades, dtype: int64

Index

What's with those numbers to the left of the grades? That's the index. An index is an ordered list of values. It's how Pandas accesses specific values in a Series (or as we will see in a bit... a DataFrame)

The index of the series works a lot like the index in a list or a string. For example, this code prints the first and last grades in the series.

In [4]:
print("first grade:", grades[0])
print("last grade:", grades[5])
first grade: 100
last grade: 70

As you would expect, a Series is iterable meaning we can use slice notiation and if we want loop over the values:

In [5]:
for grade in grades:
    print(grade, end =" ")
100 80 100 90 80 70 

Good News! You don't need to iterate!

The beauty of Pandas is that for most operations, you won't even need a loop! For example, we can calculate each of the following without a loop:

In [6]:
print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades", grades.count())
Highest grade: 100
Average grade: 86.6666666667
lowest grade: 70
Sum of grades: 520
Count of grades 6

Where did all these functions come from? They're methods (object functions) associated with the Series object. If you dir(grades) you can see them!

If you want to learn more, here's the official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

DataFrame

The DataFrame is simply a dictionary of series. You can think of it as a table of data that you can access and manipulate in a variety of ways. I like to think of the DataFrame as a programmable spreadsheet. It has rows and columns which can be accessed and manipulated with Python.

DataFrame is the most common Pandas data structure. As you'll see its expressive and versitile, making it an essential tool in data analysis.

To make a DataFrame we must create a Python dict of our series. We use the series name as the key and the series itself as the value.

This example creates a DataFrame from two series of Student names and Grade-Point-Averages (GPA's).

In [7]:
names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, 3.0, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name':  names, 'GPA': gpas, 'Year' : years }  # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
Out[7]:
GPA Name Year
0 4.0 Allen So
1 3.0 Bob Fr
2 3.4 Chris Fr
3 2.8 Dave Jr
4 2.5 Ed Sr
5 3.8 Frank Sr
6 3.0 Gus Fr

What? Pandas didn't use the same column order as the dict? We entered Name, GPA, Year and the DataFrame is GPA, Name, Year. No sweat! This can be fixed by including the columns named argument when we create the DataFrame:

In [8]:
students = pd.DataFrame( series_dict, columns = ['Name', 'GPA', 'Year'] )
students
Out[8]:
Name GPA Year
0 Allen 4.0 So
1 Bob 3.0 Fr
2 Chris 3.4 Fr
3 Dave 2.8 Jr
4 Ed 2.5 Sr
5 Frank 3.8 Sr
6 Gus 3.0 Fr

Iterating over a DataFrame

What happens when we iterate over a DataFrame? Let's try it!

In [9]:
for column in students:
    print(column)
Name
GPA
Year

It iterates over the columns! Why?!?!? If you recall, a DataFrame is a dict of Series, this is what happens when you loop over a dict.

So how do you loop through the rows? Use the DataFrame's to_records() method, of course.

This example prints out the students on the Dean's list (GPA of 3.4 or better)

In [10]:
for student in students.to_records():
    if student['GPA'] >=3.4:
        print("%s with a GPA of %.3f is on the Dean's list." % (student['Name'], student['GPA']) )
Allen with a GPA of 4.000 is on the Dean's list.
Chris with a GPA of 3.400 is on the Dean's list.
Frank with a GPA of 3.800 is on the Dean's list.

Of course if you've been paying attention, you might suspect that we:

do not need to iterate over the DataFrame to do this!

Pandas allows us to be expressive in our data selections. For example, here's a data frame of only the student names and GPA's on the dean's list.

In [11]:
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list
Out[11]:
Name GPA
0 Allen 4.0
2 Chris 3.4
5 Frank 3.8

3. Selecting rows and columns in a DataFrame

What you just witnessed sure looks like magic, but it's not. Once you understand how Pandas works, it becomes easy to employ approaches like the one you saw above to your data analysis.

Let's dive deeper.

Selecting Columns

To select a single column we access its key from the DataFrame dictionary. What we get back is a Series.

For example this retrieves the GPA's:

In [12]:
students['GPA']  # Series
Out[12]:
0    4.0
1    3.0
2    3.4
3    2.8
4    2.5
5    3.8
6    3.0
Name: GPA, dtype: float64

Oh? You wanted that as a DataFrame not a series, correct? Then you need to retrieve a list of columns instead:

In [13]:
students[ ['GPA'] ]  # DataFrame because ['GPA'] is a list
Out[13]:
GPA
0 4.0
1 3.0
2 3.4
3 2.8
4 2.5
5 3.8
6 3.0

Likewise you can use this notation to select more than one column. Like this example which selects Name and Year:

In [14]:
students[ ['Name', 'Year'] ]
Out[14]:
Name Year
0 Allen So
1 Bob Fr
2 Chris Fr
3 Dave Jr
4 Ed Sr
5 Frank Sr
6 Gus Fr

Selecting Rows

We've seen how to select columns from the DataFrame so let's move on to rows.

Typically DataFrames have far more rows than can fit on your screen. When getting a sense of what your data had to offer you can use three DataFrame methods.

head(n)will return the first n rows. For example, here's the first 3 rows:

In [15]:
students.head(3)
Out[15]:
Name GPA Year
0 Allen 4.0 So
1 Bob 3.0 Fr
2 Chris 3.4 Fr

Likewise we can get the last 2 rows in the DataFrame using the tail(n) method:

In [16]:
students.tail(2)
Out[16]:
Name GPA Year
5 Frank 3.8 Sr
6 Gus 3.0 Fr

How about a random sample of the DataFrame? The sample(n) method will retrieve n rows at random from the data frame. Let's get 3 rows at random:

In [17]:
students.sample(3)
Out[17]:
Name GPA Year
3 Dave 2.8 Jr
1 Bob 3.0 Fr
6 Gus 3.0 Fr

Selecting rows with boolean indexing

A boolean index is a Series of boolean values. For example. This boolean index describes rows in the DataFrame where the student Year is a freshman 'Fr':

In [18]:
students['Year'] == 'Fr' 
Out[18]:
0    False
1     True
2     True
3    False
4    False
5    False
6     True
Name: Year, dtype: bool

At a glance this might not seem useful. But when we apply a boolean index to a DataFrame it only selects the rows which are True in the boolean index. For example, this is only the freshman 'Fr'.

In [19]:
students[ students['Year'] == 'Fr' ]
Out[19]:
Name GPA Year
1 Bob 3.0 Fr
2 Chris 3.4 Fr
6 Gus 3.0 Fr

I'll admit the syntax is a little awkward, but you get accustomed to it over time. It is a very expressive way to filter rows in a DataFrame.

Pandas has special operators for logical and (&), or (|), and not(~).

For example, this retrieves students with a GPA between 2.5 and 3.0. Notice the parenthesis. These are required.

In [20]:
students[ (students.GPA >=2.5) & (students.GPA <= 3.0)]
Out[20]:
Name GPA Year
1 Bob 3.0 Fr
3 Dave 2.8 Jr
4 Ed 2.5 Sr
6 Gus 3.0 Fr

And as we've seen you can combine row selections using boolean indexing with column selections.

In [21]:
#             columns            boolean index
students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
Out[21]:
Name GPA
0 Allen 4.0
2 Chris 3.4
5 Frank 3.8

Even more useful is we can assign these results to a variable, and since that variable is a DataFrame the same rules apply!

In [22]:
deans_list = students[ ['Name', 'GPA'] ][ students.GPA >= 3.4 ]
deans_list[['Name']].head(2)
Out[22]:
Name
0 Allen
2 Chris

Manipulating a DataFrame

It's easy to manipulate the data in a DataFrame. This typically done as part of the data preparation steps of data analysis. Sometimes we need to create new columns for a better analysis or clean up existing data. Sometimes this is referred to as feature engineering.

For example, this code add's a column called 'Deans List' The column is initially set to 'No' for all students:

In [23]:
students['Deans List'] = 'No'
students
Out[23]:
Name GPA Year Deans List
0 Allen 4.0 So No
1 Bob 3.0 Fr No
2 Chris 3.4 Fr No
3 Dave 2.8 Jr No
4 Ed 2.5 Sr No
5 Frank 3.8 Sr No
6 Gus 3.0 Fr No

Next, we set 'Deans list' to 'Yes' for any student who has a GPA of 3.4 or better:

In [24]:
students['Deans List'][ students['GPA'] >= 3.4 ] = 'Yes'
students
Out[24]:
Name GPA Year Deans List
0 Allen 4.0 So Yes
1 Bob 3.0 Fr No
2 Chris 3.4 Fr Yes
3 Dave 2.8 Jr No
4 Ed 2.5 Sr No
5 Frank 3.8 Sr Yes
6 Gus 3.0 Fr No

NOTE: If you get a warning that states: A value is trying to be set on a copy of a slice from a DataFrame You get this warning because you manipulated an existing dataframe, instead of a copy.

You can also use this approach to clean up data. For example Ed should be Eddie.

In [25]:
students['Name'][students['Name'] == 'Ed'] = 'Eddie'
students
Out[25]:
Name GPA Year Deans List
0 Allen 4.0 So Yes
1 Bob 3.0 Fr No
2 Chris 3.4 Fr Yes
3 Dave 2.8 Jr No
4 Eddie 2.5 Sr No
5 Frank 3.8 Sr Yes
6 Gus 3.0 Fr No

Here's an example of how to create a new calculation from an existing column. Here we create a column 'Pts From Deans List' to calculate the number of points the student's GPA is from the 'Deans List'

In [26]:
students['Pts From Deans List'] = students['GPA'] - 3.4
students
Out[26]:
Name GPA Year Deans List Pts From Deans List
0 Allen 4.0 So Yes 0.6
1 Bob 3.0 Fr No -0.4
2 Chris 3.4 Fr Yes 0.0
3 Dave 2.8 Jr No -0.6
4 Eddie 2.5 Sr No -0.9
5 Frank 3.8 Sr Yes 0.4
6 Gus 3.0 Fr No -0.4

Real-world Pandas

We conclude our Pandas tour with some examples of how easily Pandas handles larger data sets. The true power of Pandas is not in creating your own Series and DataFramaes but loading them from files.

Reading comma-separated values (csv) into a DataFrame

For example this code reads a CSV (comma-separated values) file stored on GitHub and loads it into the customers DataFrame:

In [27]:
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')
customers.head()
Out[27]:
First Last Email Gender Last IP Address City State Total Orders Total Purchased Months Customer
0 Al Fresco [email protected] M 74.111.18.161 Syracuse NY 1 45 1
1 Abby Kuss [email protected] F 23.80.125.101 Phoenix AZ 1 25 2
2 Arial Photo [email protected] F 24.0.14.56 Newark NJ 1 680 1
3 Bette Alott [email protected] F 56.216.127.219 Raleigh NC 6 560 18
4 Barb Barion [email protected] F 38.68.15.223 Dallas TX 4 1590 1

Here's an example of loading a Weather data for Syracuse, NY from 1997 to 2015, also found on GitHub:

In [28]:
weather = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/weather/syracuse-ny.csv')
weather.head()
Out[28]:
EST Max TemperatureF Mean TemperatureF Min TemperatureF Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity Min Humidity ... Max VisibilityMiles Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees
0 1997-1-1 27 12.0 -2 22 4 -8 92 74 59 ... 10 9 1 14 5 NaN 0.05 6 Snow 89
1 1997-1-2 34 28.0 23 33 29 21 100 96 88 ... 9 2 0 8 4 NaN 0.08 8 Fog-Rain-Snow 82
2 1997-1-3 44 40.0 36 44 38 34 100 96 89 ... 10 4 0 15 6 NaN 0.09 8 Fog-Rain 273
3 1997-1-4 48 40.0 34 44 36 33 96 90 83 ... 10 10 8 13 4 NaN 0.00 8 Rain 80
4 1997-1-5 55 46.0 37 50 43 29 89 81 73 ... 10 10 10 21 11 30.0 0.13 8 Rain 199

5 rows × 23 columns

There's too many columns in this dataset. This is where the DataFrame.columns property comes in handy:

In [29]:
weather.columns
Out[29]:
Index(['EST', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
       'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
       'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles',
       'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
       'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees'],
      dtype='object')

This example displays the min, mean and max temperatures for July, 4 2015. Notice how we use .str.startswith() to retrieve the rows in question.

In [30]:
weather[ weather['EST'] == '2015-7-4'][['EST','Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]
Out[30]:
EST Min TemperatureF Mean TemperatureF Max TemperatureF
6664 2015-7-4 60 69.0 78

Reading HTML tables into a DataFrame

You can even read HTML tables off a website with Pandas and import them into a DataFrame!

For example this retieves a table of US cities by population, from Wikipedia https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

In [31]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population')
data[3].head()
Out[31]:
0 1 2 3 4 5 6 7 8 9 10
0 2016 rank City State[5] 2016 estimate 2010 Census Change 2016 land area 2016 population density Location NaN NaN
1 1 New York[6] New York 8537673 8175133 7000443466791304800♠+4.43% 301.5 sq mi 780.9 km2 28,317/sq mi 10,933/km2 40°39′49″N 73°56′19″W / 40.6635°N 73.9387°W...
2 2 Los Angeles California 3976322 3792621 7000484364243092050♠+4.84% 468.7 sq mi 1,213.9 km2 8,484/sq mi 3,276/km2 34°01′10″N 118°24′39″W / 34.0194°N 118.4108°...
3 3 Chicago Illinois 2704958 2695598 6999347232784710470♠+0.35% 227.3 sq mi 588.7 km2 11,900/sq mi 4,600/km2 41°50′15″N 87°40′54″W / 41.8376°N 87.6818°W...
4 4 Houston[7] Texas 2303482 2100263 7000967588344888229♠+9.68% 637.5 sq mi 1,651.1 km2 3,613/sq mi 1,395/km2 29°47′12″N 95°23′27″W / 29.7866°N 95.3909°W...

The read_html() method returns a list of DataFrames. The assumption here is there is more than one table on the webpage. data[3] represents the 4th table on the webpage, which is the one we wanted in this case.

Also the data needs to be cleaned up a bit. We will discuss how to do this in our In-Class Coding Lab.

To demonstrate how versatile the read_html() method can be, let's read the schedule of classes off the iSchool website: https://ischool.syr.edu/classes

Remember that read_html() returns a list of DataFrames so we must select the appropriate index containing the data we want. In this case the schedule is in data[0]

In [32]:
data = pd.read_html('https://ischool.syr.edu/classes/', attrs = {'id': 'classSchedule'} )
schedule = data[0]
schedule.head()
Out[32]:
Course Section Class Credits Title Instructor(s) Time Day Room(s)
0 GET300 M001 37202 3.0 Enterprise Data Analysis Michael A Leonardo 5:15pm - 8:05pm M Hinds Hall 010
1 GET302 M001 37133 3.0 Global Financial Sys Arch Frank Jr Marullo 5:00pm - 7:45pm Tu Hinds Hall 011
2 GET433 M001 37152 3.0 Multi-tier App. Development P Douglas Taber 11:00am - 12:20pm TuTh Hinds Hall 013
3 GET487 M001 41408 3.0 Global Tech Jeffrey Fouts 12:00am - 12:00am NaN Online
4 GET602 M001 37134 3.0 Global Financial Sys Arch Frank Jr Marullo 5:00pm - 7:45pm Tu Hinds Hall 011

Adding columns to a DataFrame

Unfortunately, the schedule DataFrame is missing column names.

But, Pandas allows us to add these at any time. Very Cool!

In [33]:
schedule.columns = ['Course','Section','ClassNo','Credits','Title','Instructor','Time','Days','Room']
schedule.head()
Out[33]:
Course Section ClassNo Credits Title Instructor Time Days Room
0 GET300 M001 37202 3.0 Enterprise Data Analysis Michael A Leonardo 5:15pm - 8:05pm M Hinds Hall 010
1 GET302 M001 37133 3.0 Global Financial Sys Arch Frank Jr Marullo 5:00pm - 7:45pm Tu Hinds Hall 011
2 GET433 M001 37152 3.0 Multi-tier App. Development P Douglas Taber 11:00am - 12:20pm TuTh Hinds Hall 013
3 GET487 M001 41408 3.0 Global Tech Jeffrey Fouts 12:00am - 12:00am NaN Online
4 GET602 M001 37134 3.0 Global Financial Sys Arch Frank Jr Marullo 5:00pm - 7:45pm Tu Hinds Hall 011

And now we can do some analysis, like finding classes on wednesdays:

In [34]:
schedule[ schedule['Days'] =='W'] 
Out[34]:
Course Section ClassNo Credits Title Instructor Time Days Room
31 IST256 M003 40685 3.0 Appl.Prog.For Information Syst Angela Usha Ramnarine-Rieks 3:45pm - 5:05pm W FALK ROOM 201
34 IST256 M006 40688 3.0 Appl.Prog.For Information Syst Wade Stringer 9:30am - 10:50am W Hinds Hall 011
36 IST256 M008 41760 3.0 Appl.Prog.For Information Syst Avinash Kadaji 8:00am - 9:20am W Hinds Hall 018
37 IST256 M009 42235 3.0 Appl.Prog.For Information Syst Deborah L Nosky 2:15pm - 3:35pm W Hinds Hall 018
44 IST323 M001 37076 3.0 Intro to Information Security Christopher Croad 2:15pm - 5:05pm W Hinds Hall 027 Hinds Hall 117

Reading JSON data into a DataFrame

This next example demonstrates how to load json data into a DataFrame. If the JSON data is structured like a list of dict, then it's quite simple. The example would work the same as if we read the json data from a file first.

We need orient='records' because our JSON data is a list of values.

In [35]:
json_data = '[{"GPA":3.0,"Name":"Bob"},{"GPA":3.7,"Name":"Sue"},{"GPA":2.4,"Name":"Tom"}]'

tweets = pd.read_json(json_data, orient='records')
tweets.head()
Out[35]:
GPA Name
0 3.0 Bob
1 3.7 Sue
2 2.4 Tom

Converting a list of dictionary to a pandas DataFrame (and back)

Let's demonstrate how to convert a list of dict in python into a pandas DataFrame we will then demonstrate how to convert it back to a list of dict.

This type of scenairio is useful when you call an API and would like to load the data into pandas, or take the pandas DataFrame and process it further in outside of pandas.

In [36]:
students_list_of_dict = [
    { 'Name' : 'Bob', 'GPA' : 3.0 },
    { 'Name' : 'Sue', 'GPA' : 3.7 },
    { 'Name' : 'Tom', 'GPA' : 2.4 }
]

students_df = pd.DataFrame(students_list_of_dict)
students_df
Out[36]:
GPA Name
0 3.0 Bob
1 3.7 Sue
2 2.4 Tom

Now let's convert the students_df back into a list of dict. Choosing the named argument orient ='record' produces a list of dict.

In [37]:
students = students_df.to_dict(orient='records')
students
Out[37]:
[{'GPA': 3.0, 'Name': 'Bob'},
 {'GPA': 3.7, 'Name': 'Sue'},
 {'GPA': 2.4, 'Name': 'Tom'}]

Printing data frames inline with other code.

In all these examples, we've been placing the dataframe object as last command in the cell. The Jupyter environment will display whatever item appears last, however what if we want to display the data frame in the middle of our code? Or display more than one dataframe?

This is where the display() function of the IPYthon package comes into play:

In [38]:
from IPython.display import display

customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')

choice = input("Customers: Would you like to see the first 2, the last 2, or 2 at random? [first, last, random]?").lower()
if choice == 'first':
    display(customers.head(2))
elif choice == 'last':
    display(customers.tail(2))
else:
    display(customers.sample(2))
print("And we're done here!")
Customers: Would you like to see the first 2, the last 2, or 2 at random? [first, last, random]?random
First Last Email Gender Last IP Address City State Total Orders Total Purchased Months Customer
9 Cam Rha [email protected] M 24.1.25.140 Chicago IL 0 0 1
13 Justin Case [email protected] M 23.192.215.44 Boston MA 3 1050 1
And we're done here!

Building interactions

This approach can be used to build intractions with data frames. For example this code asks you to enter an email then it prints the customer details.

In [39]:
customers = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/customers/customers.csv')

customer_list = customers.sample(5)

print("CUSTOMER INQUIRY")
display(customer_list[['Email']])
email = input("Enter a customer email to retrieve their complete record: ").lower()

print("CUSTOMER DETAILS FOR " + email )
display(customer_list[ customer_list['Email'] == email])
CUSTOMER INQUIRY
Email
22 [email protected]
23 [email protected]
7 [email protected]
5 [email protected]
27 [email protected]
Enter a customer email to retrieve their complete record: [email protected]
CUSTOMER DETAILS FOR [email protected]
First Last Email Gender Last IP Address City State Total Orders Total Purchased Months Customer
7 Candi Cayne [email protected] F 24.39.14.15 Portland ME 1 620 2
In [ ]: