Data Science

Pandas in Python – Basics

pandas is an open source Python library for data analysis. Python has always been great for prepping and munging data, but it’s never been great for analysis – you’d usually end up using R or loading it into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.

Data Structures

pandas introduces two new data structures to Python – Series and DataFrame, both of which are built on top of NumPy (which is what makes them fast).

# Importing the pandas and NumPy modules (NumPy is needed later for np.square)
import pandas as pd
import numpy as np

Series

A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s
0                7
1       Heisenberg
2             3.14
3      -1789710578
4    Happy Eating!
dtype: object

Alternatively, you can specify an index to use when creating the Series.

s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s
A                7
Z       Heisenberg
C             3.14
Y      -1789710578
E    Happy Eating!
dtype: object

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.

d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities
Austin            450
Boston            NaN
Chicago          1000
New York         1300
Portland          900
San Francisco    1100
dtype: float64

You can use the index to select specific items from the Series.

cities['Chicago']
1000.0
cities[['Chicago', 'Portland', 'San Francisco']]
Chicago          1000
Portland          900
San Francisco    1100
dtype: float64

Or you can use boolean indexing for selection.

cities[cities < 1000]
Austin      450
Portland    900
dtype: float64

That last one might look a little odd, so let’s make it clearer – cities < 1000 returns a Series of True/False values, which we then pass to our Series cities, returning only the items where the value is True.

less_than_1000 = cities < 1000
print(less_than_1000)
print('\n')
print(cities[less_than_1000])
Austin            True
Boston           False
Chicago          False
New York         False
Portland          True
San Francisco    False
dtype: bool


Austin      450
Portland    900
dtype: float64

You can also change the values in a Series on the fly.

# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])
Old value: 1000.0
New value: 1400.0
# changing values using boolean logic
print(cities[cities < 1000])
print('\n')
cities[cities < 1000] = 750

print(cities[cities < 1000])
Austin      450
Portland    900
dtype: float64


Austin      750
Portland    750
dtype: float64

What if you aren’t sure whether an item is in the Series? You can check using idiomatic Python.

print('Seattle' in cities)
print('San Francisco' in cities)
False
True
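
One caveat is worth a quick sketch: in tests the index labels of a Series, not its values. The snippet below uses a small stand-in Series (the name sample and its contents are assumptions for illustration, loosely mirroring cities) and shows both kinds of check.

```python
import pandas as pd

# Hypothetical stand-in for the cities Series used above
sample = pd.Series({'Chicago': 1400.0, 'Austin': 750.0, 'Boston': None})

# `in` looks at the index labels
label_found = 'Chicago' in sample       # a label, so this is found
value_as_label = 750.0 in sample        # 750.0 is a value, not a label

# To test the values instead, check against .values or use isin()
value_found = 750.0 in sample.values
mask = sample.isin([750.0, 1400.0])     # boolean Series, one entry per label

print(label_found, value_as_label, value_found)
print(mask)
```

If you need to know whether a number appears in the data rather than in the labels, isin is usually the more convenient option because it returns a mask you can use for boolean indexing.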

Mathematical operations can be done using scalars and functions.

# divide city values by 3
cities / 3
Austin           250.000000
Boston                  NaN
Chicago          466.666667
New York         433.333333
Portland         250.000000
San Francisco    366.666667
dtype: float64
# square city values
np.square(cities)
Austin            562500
Boston               NaN
Chicago          1960000
New York         1690000
Portland          562500
San Francisco    1210000
dtype: float64

You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values. Values on either Series that did not have a shared index will produce a NULL/NaN (not a number).

print(cities[['Chicago', 'New York', 'Portland']])
print('\n')
print(cities[['Austin', 'New York']])
print('\n')
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])
Chicago     1400
New York    1300
Portland     750
dtype: float64


Austin       750
New York    1300
dtype: float64


Austin       NaN
Chicago      NaN
New York    2600
Portland     NaN
dtype: float64

Notice that because Austin, Chicago, and Portland were not found in both Series, they were returned with NULL/NaN values.
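
If you would rather treat a missing label as zero than end up with NaN, one option is the Series.add method with its fill_value argument. The sketch below rebuilds the two slices by hand (the names a and b are assumptions for illustration) and compares the two behaviours.

```python
import pandas as pd

# Hypothetical values mirroring the two slices added above
a = pd.Series({'Chicago': 1400.0, 'New York': 1300.0, 'Portland': 750.0})
b = pd.Series({'Austin': 750.0, 'New York': 1300.0})

# The + operator leaves NaN wherever a label is missing from one side
plain = a + b

# add() with fill_value=0 treats the missing side as 0 instead
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Whether filling with zero is appropriate depends on what the missing value means in your data, so treat this as an option rather than a default.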

NULL checking can be performed with isnull and notnull.

# returns a boolean series indicating which values aren't NULL
cities.notnull()
Austin            True
Boston           False
Chicago           True
New York          True
Portland          True
San Francisco     True
dtype: bool
# use boolean logic to grab the NULL cities
print(cities.isnull())
print('\n')
print(cities[cities.isnull()])
Austin           False
Boston            True
Chicago          False
New York         False
Portland         False
San Francisco    False
dtype: bool


Boston   NaN
dtype: float64
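
Once the missing entries are located, a common next step is to count, drop, or replace them. The following is a minimal sketch on a stand-in Series (sample is an assumed name); dropna and fillna are standard pandas methods that the text above does not cover.

```python
import pandas as pd

# Hypothetical stand-in with one missing value
sample = pd.Series({'Austin': 750.0, 'Boston': None, 'Chicago': 1400.0})

n_missing = int(sample.isnull().sum())  # count of NaN entries
without_nulls = sample.dropna()         # drop labels whose value is NaN
zero_filled = sample.fillna(0)          # replace NaN with 0

print(n_missing)
print(without_nulls)
print(zero_filled)
```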

DataFrame

A DataFrame is a tabular data structure composed of rows and columns, akin to a spreadsheet, database table, or R’s data.frame object. You can also think of a DataFrame as a group of Series objects that share an index, with each Series stored under a column name.
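
To make the "group of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series with the same index and then pulls one column back out. The names x, y and the labels 'a', 'b' are placeholders chosen for illustration.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([10, 20], index=['a', 'b'])

# A dict of Series becomes a DataFrame: the keys are the column names
# and the shared index labels the rows
df = pd.DataFrame({'x': x, 'y': y})

# Selecting a column returns a Series that carries the same index
col = df['x']

print(df)
print(type(col))
```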

For the rest of the tutorial, we’ll be primarily working with DataFrames.

Reading Data

To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.

Using the columns parameter allows us to tell the constructor how we’d like the columns ordered. Older versions of pandas ordered the columns alphabetically by default, while recent versions keep the dictionary’s insertion order (neither applies when reading from a file – more on that next).

import pandas as pd
# Create lists of data
Name = ['Alex', 'Warren', 'Michael']
Age = [25, 34, 38]
Country = ['USA', 'UK', 'Australia']
# Column names, in the order we want them to appear
Columns = ['Name', 'Age', 'Country']
# Build a dictionary mapping each column name to its list of values
Raw_Table = {'Name': Name,
             'Age': Age,
             'Country': Country}
# Create a DataFrame from the dictionary, with columns in the given order
df = pd.DataFrame(Raw_Table, columns=Columns)
print(df)
# Print the type of the object
print(type(df))
      Name  Age    Country
0     Alex   25        USA
1   Warren   34         UK
2  Michael   38  Australia
<class 'pandas.core.frame.DataFrame'>
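
The columns parameter can also reorder or leave out columns. The sketch below reuses the same data under assumed names (raw, subset) and asks for only two of the three columns, in a different order.

```python
import pandas as pd

# Same data as the example above, under a hypothetical name
raw = {'Name': ['Alex', 'Warren', 'Michael'],
       'Age': [25, 34, 38],
       'Country': ['USA', 'UK', 'Australia']}

# Columns appear in the order listed; 'Age' is not listed, so it is left out
subset = pd.DataFrame(raw, columns=['Country', 'Name'])

print(subset)
```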