
Pandas Data Structures for Data Science

Data Science is a vast field concerned with extracting information from structured and unstructured data by means of algorithms, processes, and scientific methods. It can also be viewed as the combination of Data Mining and Computer Science, i.e., the use of computer science techniques in data mining.

Interest in Data Science has grown sharply over the past couple of years. The field has evolved into a complex multi-disciplinary subject involving, but not limited to, Machine Learning, Data Visualization, and Data Mining. In this article, we focus on the basics of data manipulation using the Python pandas module, which has steadily gained popularity over the years.

Keep in mind that this article is an introductory, non-comprehensive overview of the data structures in pandas. For a more detailed explanation of the topics discussed, we encourage you to take a look at the pandas documentation online.

All examples use pandas version 1.0.5 in a Jupyter Notebook. You can use pandas.__version__ (note the double underscores!) to check which pandas version you have installed.
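
The version check can be sketched as follows. The variable name installed_version is only illustrative, and the printed value depends on your installation:

```python
import pandas as pd

# __version__ is a plain string such as '1.0.5'
installed_version = pd.__version__
print(installed_version)
```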

Different data structures in Pandas

Pandas provides two main data structures that work hand-in-hand with numpy functions for data handling and visualization. These are:

  1. Series
  2. DataFrames

The fundamental behavior of data types, indexing, and axis labeling/alignment applies to both of these data structures.

For the following tutorial, we will need to import the python numpy and pandas modules into our namespace. Use the following code to do so:

import pandas as pd
import numpy as np

Remember that pd and np are just aliases that can be used instead of pandas and numpy. Any alias can be given to a module when importing it, but pd and np are the widely used conventions.

Great! Now we are ready to delve into understanding these data structures.

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index object. The basic method to create a Series is as follows:

s = pd.Series(data, index, dtype)

The parameters of the Series constructor are:

data: An ndarray, a list, a scalar value (constant), or a dictionary (key-value pairs).
index: Row labels, similar to the primary key in SQL. Values must be hashable, and there must be one label per element of data. Unlike a primary key, duplicate labels are allowed. Default: np.arange(len(data)).
dtype: Stands for data type. If no type is passed, it is inferred from the data.

1. Creating a Series from ndarray

If the data passed is an ndarray, the index (if given) must have the same length as the data. If no index is passed, pandas assigns the default index 0 to len(data)-1. Consider the following example:

data = np.array(['A','B','C','D'])
S = pd.Series(data)
print(S)

Its output is as follows:

0    A
1    B
2    C
3    D
dtype: object
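
To illustrate the length rule, here is a minimal sketch that passes an explicit index. The labels 'w' to 'z' and the flag mismatch_rejected are made up for this illustration:

```python
import numpy as np
import pandas as pd

data = np.array(['A', 'B', 'C', 'D'])

# Attach custom labels: one label per element of data
s = pd.Series(data, index=['w', 'x', 'y', 'z'])
print(s['y'])  # label-based lookup, prints C

# An index of the wrong length is rejected with a ValueError
try:
    pd.Series(data, index=['w', 'x'])
except ValueError as err:
    print('ValueError:', err)
    mismatch_rejected = True
else:
    mismatch_rejected = False
```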

2. Creating a Series using scalar values

If the data passed is a scalar, an index should be provided so that pandas knows how many elements to create. The value is repeated as many times as the length of the index object.

s = pd.Series('abc',index = ['c1','c2','c3','c4'])
print(s)

It gives the following output:

c1    abc
c2    abc
c3    abc
c4    abc
dtype: object 
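
The same repetition works for numeric scalars, and dtype can be combined with it. A small sketch, where the name zeros and the labels are only illustrative:

```python
import pandas as pd

# The scalar 0 is repeated for every label; dtype forces float storage
zeros = pd.Series(0, index=['c1', 'c2', 'c3'], dtype='float64')
print(zeros)
```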

3. Creating a Series from a Dictionary

A dictionary can be passed as the data parameter of the Series constructor. If no index is specified, the dictionary keys, in order, become the index. If an index is passed along with the dictionary, pandas looks up each index label among the dictionary keys, and any label without a matching key gets the value NaN. To attach new labels to the dictionary values instead, pass the values as a list together with the desired index, as s2 does below.

d = {'Name':'Amit','Qualification':'B.Tech','Branch':'ISE'}
i = np.array(['N','Q','B'])
s1 = pd.Series(d)
# Pass the values as a list so that the labels in i are applied in order
s2 = pd.Series(list(d.values()),index = i)
print(s1)
print(s2)

Since the Series s1 is created without specifying the index, it considers the dictionary keys as its indices.

Name               Amit
Qualification    B.Tech
Branch              ISE
dtype: object

The Series s2 is created from the dictionary values and the specified indices. The Series generated is as follows:

N      Amit
Q    B.Tech
B       ISE
dtype: object
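
When an index is passed together with a dictionary, pandas matches the index labels against the dictionary keys. A minimal sketch of that behavior, in which the label 'Year' and the name s3 are illustrative and not part of the example above:

```python
import pandas as pd

d = {'Name': 'Amit', 'Qualification': 'B.Tech', 'Branch': 'ISE'}

# Labels are looked up among the dictionary keys:
# 'Branch' and 'Name' are found, 'Year' is not and becomes NaN.
# 'Qualification' is not requested, so it is left out.
s3 = pd.Series(d, index=['Branch', 'Name', 'Year'])
print(s3)
```

This makes the index argument a way to select and reorder dictionary entries, not a way to rename them.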

DataFrames

A DataFrame is a two-dimensional data structure that consists of rows and columns. It can also be thought of as a collection of one or more Series objects that share the same index object. Think of it as an Excel spreadsheet or an SQL table.

We can create a DataFrame in Python in the following manner:

df = pd.DataFrame(data, index, columns, dtype)

The different parameters of the dataframe constructor are:

data: Can be ndarrays, lists, dictionaries, Series, constants, or another DataFrame.
index: Row labels; default is 0 to len(rows)-1.
columns: Column labels; default is 0 to len(cols)-1 unless specified.
dtype: A single data type to force on all columns. If not passed, the type of each column is inferred from the data.
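
As a rough sketch of how these parameters fit together, the following builds a DataFrame from a 2D ndarray. The values and the labels 'r1', 'r2', 'x', 'y', 'z' are made up for illustration:

```python
import numpy as np
import pandas as pd

# A 2 x 3 array of integers (illustrative values)
values = np.array([[1, 2, 3], [4, 5, 6]])

# index labels the rows, columns labels the columns,
# and dtype forces every column to float64
df = pd.DataFrame(values, index=['r1', 'r2'],
                  columns=['x', 'y', 'z'], dtype='float64')
print(df)
print(df.dtypes)
```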

1. Creating DataFrames from a Dictionary of ndarrays

Every ndarray must be of the same length, and the index (if passed) must be of that length as well. If no index is passed, the index defaults to 0 to n-1, where n is the length of the arrays.

Consider the following example where a dictionary of arrays is passed without an index:

d = {'Branch':np.array(['CSE','ISE','ECE','CIV']),'Students':np.array([64,62,55,53])}
df = pd.DataFrame(d)
print(df)

Output:

   Branch  Students
0     CSE        64
1     ISE        62
2     ECE        55
3     CIV        53

In this next code sample, we have specified the index values. This creates a dataframe with the indices we have passed.

d = {'Branch':np.array(['CSE','ISE','ECE','CIV']),'Students':np.array([64,62,55,53])}
df = pd.DataFrame(d, index = ['A','B','C','D'])
print(df)

Output:

   Branch  Students
A     CSE        64
B     ISE        62
C     ECE        55
D     CIV        53

Note that here ndarrays can be replaced with regular lists as well.
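
A short sketch of that equivalence, which also shows that each column of the result is a Series sharing the DataFrame's index. The names from_arrays, from_lists, and students are only illustrative:

```python
import numpy as np
import pandas as pd

from_arrays = pd.DataFrame({'Branch': np.array(['CSE', 'ISE', 'ECE', 'CIV']),
                            'Students': np.array([64, 62, 55, 53])})
from_lists = pd.DataFrame({'Branch': ['CSE', 'ISE', 'ECE', 'CIV'],
                           'Students': [64, 62, 55, 53]})

# Element-wise comparison: every cell holds the same value either way
same_values = bool((from_arrays == from_lists).all().all())
print(same_values)

# Selecting one column returns a Series with the DataFrame's index
students = from_lists['Students']
print(type(students).__name__)
```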

2. Creating DataFrames using nested Lists

DataFrames can be created using a list of lists. Consider the following code snippet:

data = [['Books',300],['Stationery',100],['Calculator',600]]
df = pd.DataFrame(data,columns = ['Items','Cost'])
print(df)

This gives the following output:

         Items   Cost
0        Books    300
1   Stationery    100
2   Calculator    600

If an index is passed, it must have the same length as the data, i.e., one label per inner list. By default, the index is set from 0 to len(rows)-1.
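
The length requirement can be sketched as follows. The labels 'i1' to 'i3' and the flag length_mismatch_rejected are hypothetical names chosen for this illustration:

```python
import pandas as pd

data = [['Books', 300], ['Stationery', 100], ['Calculator', 600]]

# One label per inner list (row)
df = pd.DataFrame(data, columns=['Items', 'Cost'], index=['i1', 'i2', 'i3'])
print(df.loc['i2', 'Cost'])  # row 'i2', column 'Cost', prints 100

# Two labels for three rows do not fit, so pandas raises a ValueError
try:
    pd.DataFrame(data, columns=['Items', 'Cost'], index=['i1', 'i2'])
except ValueError as err:
    print('ValueError:', err)
    length_mismatch_rejected = True
else:
    length_mismatch_rejected = False
```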

3. Creating DataFrames using a list of Dictionaries

DataFrames can be created using a list of dictionaries. Each dictionary becomes one row, and the keys are automatically set as the column names by default.

data = [{'a':100,'b':200,'c':300},{'a':10,'b':20},{'c':30}]
df = pd.DataFrame(data)
print(df)

The output generated is:

       a      b      c
0  100.0  200.0  300.0
1   10.0   20.0    NaN
2    NaN    NaN   30.0

Pay attention to how pandas represents missing values as np.nan (Not a Number). Because NaN is a floating-point value, the columns that contain it are stored as floats. We can also specify the columns we want to create our DataFrame with. In that case, keys that are not among the specified columns are dropped, and any specified column that matches no dictionary key is filled with NaN.
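
A minimal sketch of passing columns along with a list of dictionaries. The column name 'z' is made up here and deliberately matches no key:

```python
import pandas as pd

data = [{'a': 100, 'b': 200, 'c': 300}, {'a': 10, 'b': 20}, {'c': 30}]

# Keep 'a', leave out 'b' and 'c', and request 'z',
# which no dictionary contains and is therefore all NaN
df = pd.DataFrame(data, columns=['a', 'z'])
print(df)
```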

The end… or is it?

If you persevered to the end of this article, hats off to you! This is all you need to know about what the pandas data structures are and how to create them. Just know that data science is 80% preparing data, and 20% complaining about preparing data. This is just the beginning. Don’t tell us that we didn’t warn you before.


Contributor: Prateek Joshi