Data Science is a vast field concerned with extracting information from structured and unstructured data by means of algorithms, processes, and scientific methods. It can also be viewed as a combination of Data Mining and Computer Science, i.e., the use of computer science techniques in data mining.

Interest in Data Science has grown rapidly over the past couple of years. It has evolved into a complex multi-disciplinary subject involving, but not limited to, Machine Learning, Data Visualization, and Data Mining. In this article, we focus on the basics of data manipulation using the Python pandas library, which has steadily gained popularity over the years.

Keep in mind that this article is a non-comprehensive introductory overview of data structures in python pandas. If you want a more detailed explanation of the discussed topics, we encourage you to take a look at the pandas documentation online.

**All examples were run with pandas version 1.0.5 in a Jupyter Notebook. You can use `pandas.__version__` (note the double underscores!) to check the pandas version you have installed.**
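As a minimal sketch of that version check (the value printed depends on your own installation, so none is shown here):

```python
import pandas as pd

# Version string of the installed pandas package
version = pd.__version__
print(version)
```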

### Different data structures in Pandas

*Pandas* provides two data structures that work hand-in-hand with numpy functions for data handling and visualization. These are:

- *Series*
- *DataFrames*

The fundamental behavior of data types, indexing, and axis labeling/alignment applies to both of these data structures.

For the following tutorial, we will need to import the Python *numpy* and *pandas* modules into our namespace. Use the following code to do so:

    import pandas as pd
    import numpy as np

Remember that *pd* and *np* are just aliases that can be used instead of pandas and numpy. We can give any alias to any module while importing them.

Great! Now we are ready to delve into understanding these data structures.

### Series

**Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index** object. The basic method to create a Series is as follows:

s = pd.Series(data, index, dtype)

The parameters of the Series constructor are:

| Parameter | Description |
|---|---|
| data | An ndarray, a list, a scalar value (constant), or a dictionary (key-value pairs). |
| index | Hashable labels with the same length as the data, loosely comparable to a primary key in SQL, although duplicate labels are allowed. Default is 0 to len(data)-1, as in np.arange(len(data)). |
| dtype | Stands for data type. If no type is passed, it is inferred from the data. |
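The three parameters can be seen together in a small sketch. The labels `'a'`, `'b'`, `'c'` and the choice of `float64` are illustrative assumptions, not values from the article:

```python
import pandas as pd

# Explicit index labels and an explicit dtype
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype='float64')
print(s)

# Without dtype, pandas infers an integer type from the data
inferred = pd.Series([1, 2, 3])
print(inferred.dtype)
```

Passing the arguments by keyword, as here, tends to be easier to read than relying on their position.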

##### 1. Creating a Series from ndarray

If the data passed is an ndarray, then the size of the index object must also be the same as the data. If no index is passed, then it assigns index as 0 to **len(data)-1**. Consider the following example:

    data = np.array(['A', 'B', 'C', 'D'])
    s = pd.Series(data)
    print(s)

Its output is as follows:

    0    A
    1    B
    2    C
    3    D
    dtype: object
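The length requirement on the index can be checked with a short sketch. The integer labels 100 to 103 are made up for illustration:

```python
import numpy as np
import pandas as pd

data = np.array(['A', 'B', 'C', 'D'])

# One custom label per element of data
s = pd.Series(data, index=[100, 101, 102, 103])
print(s.loc[101])  # label-based lookup

# An index whose length differs from the data is rejected
try:
    pd.Series(data, index=[100, 101])
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
print(mismatch_error)
```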

##### 2. Creating a Series using scalar values

It is mandatory for us to provide the index values if the data passed is a scalar. This value is repeated as many times as the length of the index object.

    s = pd.Series('abc', index=['c1', 'c2', 'c3', 'c4'])
    print(s)

It gives the following output:

    c1    abc
    c2    abc
    c3    abc
    c4    abc
    dtype: object
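The same repetition applies to numeric scalars. A minimal sketch, with the value `5.0` and the labels chosen only for illustration:

```python
import pandas as pd

# The scalar 5.0 is repeated once for every label in the index
s = pd.Series(5.0, index=['a', 'b', 'c'])
print(s)
print(len(s))
```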

##### 3. Creating a Series from a Dictionary

A dictionary can be passed as the data argument of the Series constructor. If no index is specified, the dictionary keys, in order, become the index labels. If an index is also passed, pandas looks up each index label among the dictionary keys: a label that matches a key takes that key's value, and a label that matches no key gets NaN. The values are not reassigned to the new labels by position.

    d = {'Name': 'Amit', 'Qualification': 'B.Tech', 'Branch': 'ISE'}
    i = np.array(['N', 'Q', 'B'])
    s1 = pd.Series(d)
    s2 = pd.Series(d, index=i)
    print(s1)
    print(s2)

Since the Series s1 is created without specifying the index, it uses the dictionary keys as its index labels.

    Name               Amit
    Qualification    B.Tech
    Branch              ISE
    dtype: object

The Series s2 is created by specifying index labels. None of 'N', 'Q', or 'B' is a key of the dictionary, so every value is missing:

    N    NaN
    Q    NaN
    B    NaN
    dtype: object

To attach new labels to the existing values, pass the values themselves rather than the dictionary, for example `pd.Series(list(d.values()), index=i)`.
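When a dictionary and an index are passed together, pandas looks up each index label among the dictionary keys. The following sketch illustrates that lookup; the label `'Year'` is a made-up example that is deliberately absent from the dictionary:

```python
import pandas as pd

d = {'Name': 'Amit', 'Qualification': 'B.Tech', 'Branch': 'ISE'}

# 'Branch' and 'Name' are keys of d, so they keep their values, in the order requested.
# 'Year' is not a key, so its value is NaN. 'Qualification' is not requested and is dropped.
s = pd.Series(d, index=['Branch', 'Name', 'Year'])
print(s)
```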

### DataFrames

A **DataFrame** is a two-dimensional data structure that consists of **rows** and **columns**. It can also be thought of as a collection of Series objects that share the same index object. Think of it as an Excel spreadsheet or an SQL table.
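The view of a DataFrame as Series objects sharing one index can be sketched as follows. The variable names `branch` and `students` and the labels `'A'`, `'B'` are illustrative:

```python
import pandas as pd

branch = pd.Series(['CSE', 'ISE'], index=['A', 'B'])
students = pd.Series([64, 62], index=['A', 'B'])

# Each Series becomes a column; the shared index becomes the row labels
df = pd.DataFrame({'Branch': branch, 'Students': students})
print(df)

# Selecting a single column gives back a Series
print(type(df['Students']))
```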

We can create a dataframe in python in the following manner:

df = pd.DataFrame(data, index, columns, dtype)

The different parameters of the dataframe constructor are:

| Parameter | Description |
|---|---|
| data | An ndarray, list, dictionary, Series, constant, or another DataFrame. |
| index | Row labels; default is 0 to len(rows)-1. |
| columns | Column labels; default is 0 to len(cols)-1 unless specified. |
| dtype | Data type to force on the data. Only a single dtype can be given; if omitted, it is inferred for each column. |
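A minimal sketch using all four parameters at once. The row labels, column labels, and the `float64` dtype are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])

# 3 rows and 2 columns, with explicit labels and a single forced dtype
df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['x', 'y'], dtype='float64')
print(df)
print(df.dtypes)
```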

##### 1. Creating DataFrames from a Dictionary of ndarrays

Every ndarray must be of the same length, and index (if passed) must be of the same length as well. If no index object is passed, then the index is set from 0 to len(nparray)-1.

Consider the following example where a dictionary of arrays is passed without an index:

    d = {'Branch': np.array(['CSE', 'ISE', 'ECE', 'CIV']),
         'Students': np.array([64, 62, 55, 53])}
    df = pd.DataFrame(d)
    print(df)

Output:

      Branch  Students
    0    CSE        64
    1    ISE        62
    2    ECE        55
    3    CIV        53

In this next code sample, we have specified the index values. This creates a dataframe with the indices we have passed.

    d = {'Branch': np.array(['CSE', 'ISE', 'ECE', 'CIV']),
         'Students': np.array([64, 62, 55, 53])}
    df = pd.DataFrame(d, index=['A', 'B', 'C', 'D'])
    print(df)

Output:

      Branch  Students
    A    CSE        64
    B    ISE        62
    C    ECE        55
    D    CIV        53

Note that here ndarrays can be replaced with regular lists as well.
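A sketch of the list-based variant, together with what happens when the columns differ in length. The dictionary `bad` is a deliberately broken, hypothetical input:

```python
import pandas as pd

# The same kind of data as a dictionary of plain Python lists
d = {'Branch': ['CSE', 'ISE', 'ECE', 'CIV'], 'Students': [64, 62, 55, 53]}
df = pd.DataFrame(d, index=['A', 'B', 'C', 'D'])
print(df)

# Columns of unequal length are rejected with a ValueError
bad = {'Branch': ['CSE', 'ISE'], 'Students': [64]}
try:
    pd.DataFrame(bad)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```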

##### 2. Creating DataFrames using nested Lists

DataFrames can be created using a list of lists. Consider the following code snippet:

    data = [['Books', 300], ['Stationery', 100], ['Calculator', 600]]
    df = pd.DataFrame(data, columns=['Items', 'Cost'])
    print(df)

This gives the following output:

            Items  Cost
    0       Books   300
    1  Stationery   100
    2  Calculator   600

If the index is passed, then it must be of the same length as the data. Keep in mind that it is set from 0 to len(rows)-1 by default.
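Passing an index along with the nested list might look like this sketch. The row labels `'i1'` to `'i3'` are invented for illustration:

```python
import pandas as pd

data = [['Books', 300], ['Stationery', 100], ['Calculator', 600]]

# One row label per inner list
df = pd.DataFrame(data, columns=['Items', 'Cost'], index=['i1', 'i2', 'i3'])
print(df)

# Row and column labels can then be used together for lookups
print(df.loc['i2', 'Cost'])
```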

##### 3. Creating DataFrames using a list of Dictionaries

DataFrames can be created using a list of dictionaries, where each dictionary becomes a row and the keys are used as the column names by default.

    data = [{'a': 100, 'b': 200, 'c': 300}, {'a': 10, 'b': 20}, {'c': 30}]
    df = pd.DataFrame(data)
    print(df)

The output generated is:

           a      b      c
    0  100.0  200.0  300.0
    1   10.0   20.0    NaN
    2    NaN    NaN   30.0

Pay attention to how pandas represents missing values as NaN (`np.nan`, "Not a Number"). Because NaN is a floating-point value, the columns that contain it are stored as floats. We can also specify the columns we want to create our dataframe with. However, if a specified column does not match any dictionary key, pandas fills that column with NaN.
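Specifying the columns for a list of dictionaries can be sketched like this. The column name `'z'` is a made-up label that none of the dictionaries contain:

```python
import pandas as pd

data = [{'a': 100, 'b': 200, 'c': 300}, {'a': 10, 'b': 20}, {'c': 30}]

# Keep 'a' and 'c', drop 'b', and request 'z', which matches no key
df = pd.DataFrame(data, columns=['a', 'c', 'z'])
print(df)

# Count the missing values in each column
print(df.isna().sum())
```

A check like `df.isna().sum()` is a quick way to see how many gaps each column has before cleaning the data.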

#### The end… or is it?

If you persevered to the end of this article, hats off to you! This covers what the pandas data structures are and how to create them. Just know that data science is 80% preparing data, and 20% complaining about preparing data. This is just the beginning. Don’t tell us that we didn’t warn you.

