Learn data science and machine learning by building real-world projects on Jovian

Sign up to execute **playing-with-pandas-basics** and 160,000+ data science projects. Build your own projects and share them online!

Updated a year ago

In [1]:

```
from PIL import Image
Pandas=Image.open('/Users/Laptop Point/Desktop/jovian_image/pandas.png')
Pandas
```

Out[1]:

This notebook is dedicated to the pandas library of python. The main objective of creating this notebook is to learn the basics of pandas library and slowly and steadily move to advanced level of pandas so as to get the skills of data manipulation and further analysing so as to visualise it with the other libraries of python such as Matplotlib and seaborn.

During importing a dataset the data set might be in csv, xlsx or any other format. Here, we will see how to import csv and xlsx format.

In [2]:

`import pandas as pd`

To import and read a csv data, Use the fuction *pd.read_csv()* with the directory of the data as an argument in string.

In [3]:

`data_csv=pd.read_csv('/Users/Laptop Point/Desktop/data.csv')`

Above **pd.read_csv** function converts the data into pandas dataframe and store in given variable as in here it is **data_csv** that allows the manupulations and analysis of the data using *pandas Library*.

In [4]:

`data_csv`

Out[4]:

To import and read a xlsx data, use the funtion *pd.read_excel()* function and use the file directory as an argument in string.
But before using **pd.read_excel()** function, you might need to install and **import xlrd**.

In [5]:

```
# Uncomment the below line to import xlrd.
# pip install xlrd
# import xlrd
data_excel=pd.read_excel('/Users/Laptop Point/Desktop/data.xlsx')
```

In [6]:

`data_excel`

Out[6]:

Let's check the data type and shape of the dataframe using .shape() and type() functions

In [7]:

`type(data_csv)`

Out[7]:

`pandas.core.frame.DataFrame`

In [8]:

`data_csv.shape`

Out[8]:

`(7, 2)`

We can also see some basic information using **.info()** method

In [9]:

`data_csv.info()`

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Day 7 non-null object
1 Number 7 non-null int64
dtypes: int64(1), object(1)
memory usage: 240.0+ bytes
```

Using *.info* method as shown above, basic info of the data can be clearly seen as there are 7 rows and 2 columns, the columns names are *Day* and *Number*. Non-null counts are shown along with data type. As shown here, the datatype of entries of *Day* column is object and *Number* column is integer.

For the coumn with numerical entries, we can also view some statistical information like mean, standard deviation, minimum/maximum values and number non-empty values using **.describe()** method.

In [10]:

`data_csv.describe()`

Out[10]:

To get the list of the columns of the dataframe, use the proprty .columns

In [11]:

`data_csv.columns`

Out[11]:

`Index(['Day', 'Number'], dtype='object')`

To know how to retrieve a single dataset from the whole dataframe is also sometimes required. This targeted dataset might be a single row or column.

To retrieve a single column from the dataframe, we can use indexing notation as shown below.

In [12]:

`data_csv`

Out[12]:

In [13]:

`data_csv['Day']`

Out[13]:

```
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
Name: Day, dtype: object
```

As we can see above, the column named **Day** has been obtained as series. We can check the datatype of data obtained.

In [14]:

`type(data_csv['Day'])`

Out[14]:

`pandas.core.series.Series`

Now if we have a dataframe with many number of columns and we want to obtain some specific columns, then for this purpose, we can pass the name of the columns as a list in the indexing notationas shown below.

In [15]:

`data_csv[['Day', 'Number']]`

Out[15]:

In [16]:

`type(data_csv[['Day', 'Number']])`

Out[16]:

`pandas.core.frame.DataFrame`

The data obtained from passing the list of specific columns is a dataframe as we can see the datatype obtained above.

To retrieve one specific row, pandas provides **.loc[ ]** which gives the data of a specific row as a pandas series method as shown below.

In [17]:

`data_csv.loc[1]`

Out[17]:

```
Day Tuesday
Number 2
Name: 1, dtype: object
```

In [18]:

`data_csv.loc[2]`

Out[18]:

```
Day Wednesday
Number 3
Name: 2, dtype: object
```

In [19]:

`type(data_csv.loc[2])`

Out[19]:

`pandas.core.series.Series`

Now suppose if we want to obtain a data from a given row and column, this can be done by providing the number of the row and name of the column in different indexing notations as given below.

In [20]:

`data_csv['Day'][4]`

Out[20]:

`'Friday'`

In [21]:

`data_csv['Number'][4]`

Out[21]:

`5`

For the same purpose shown above, pandas also provide a method **.at[ ]**, as shown below.

In [22]:

`data_csv.at[4, 'Day']`

Out[22]:

`'Friday'`

In [23]:

`data_csv.at[4, 'Number']`

Out[23]:

`5`

In [24]:

`data_csv.at[5, 'Day']`

Out[24]:

`'Saturday'`

Instead of accessing a spcific column with **[ ]** indexing notation, pandas also allow to access a specific column with . notation. However, this method only works for accessing columns whose names do not contain spaces or special characters.

In [25]:

`data_csv.Day`

Out[25]:

```
0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday
Name: Day, dtype: object
```

In [26]:

`data_csv.Number`

Out[26]:

```
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: Number, dtype: int64
```

In [27]:

`type(data_csv.Day)`

Out[27]:

`pandas.core.series.Series`

Before diving further into the matter and changing the index or columns or both, let's copy the data into another valrable so that the original dataset or dataframe remain intact and we can use whenever required.

So, for copying a dataframe into another variable, pandas provides **.copy()** method

In [28]:

`data_copy=data_csv.copy()`

In [29]:

`data_copy`

Out[29]:

Copying the dataframe into another variable is important such that any change in the copied dataframe will not result in the change of original dataframe.

In [30]:

`data_copy=data_copy.rename(columns={'Number':'Count'})`

In [31]:

`data_copy`

Out[31]:

In the above example, data_copy has been replaced with the changed column name dataframe. The same can be done with an e3xtra argument *inplace* set to *True*, as shown below

In [32]:

`data_copy.rename(columns={'Count':'Counting'}, inplace=True)`

In [33]:

`data_copy`

Out[33]:

But note that *inplace* is set to *False* by default. So, if we don't provide *inplace* argument then it will not change the original dataframe, as shown below

In [34]:

`data_copy.rename(columns={'Counting':'Nums'})`

Out[34]:

Now, let's check what happened to original data

In [35]:

`data_copy`

Out[35]:

So, there is no change in the original data

Now, let's change it back to the Number

In [36]:

`data_copy.rename(columns={'Counting':'Number'}, inplace=True)`

In [37]:

`data_copy`

Out[37]:

Generally, what happens while reading a csv or xlsx file into pandas dataframe is, pandas provides the dataframe of its own index number starting from 0 as its first indexing numnber.

Now, for providing a column that is already in the dataframe as the new index, pandas provide **.set_index** method

Let's change the index of dataframe *data_copy* to the *Day*

In [38]:

`data_copy.set_index('Day')`

Out[38]:

But remember since here, we did not provide **inplace=True** that means our original dataframe will remain unaffected

In [39]:

`data_copy`

Out[39]:

In [40]:

`data_copy.set_index('Day', inplace=True)`

In [41]:

`data_copy`

Out[41]:

* Note:* We can also set multiple columns as index just by passing the name of the column in

Now suppose if we want to reset the index, we can use **.reset_index()** method

In [42]:

`data_copy.reset_index()`

Out[42]:

Now again if we want to reset the index in original data, we can pass the argument **inplace=True**

In [43]:

`data_copy.reset_index(inplace=True)`

In [44]:

`data_copy`

Out[44]:

**Let's import bigger data comparison to previous one for learning more pandas functions and methods to analyse data. And, let's read the imported data into pandas dataframe.**

Since the data is xlsx so, let's read the data using *pd.read_excel()* function.

In [3]:

```
data_df=pd.read_excel('/Users/Laptop Point/Desktop/data1.xlsx')
data_df
```

Out[3]:

Now, let's use some prelearned pandas function and methods on recently imported data.

*type()**.info()**.describe()**.set_index()**.reset_index()**.loc[]**.at[]**.rename()*

**1. Checking the data type**

In [4]:

`type(data_df)`

Out[4]:

`pandas.core.frame.DataFrame`

**2. Some basic info about the dataframe**

In [5]:

`data_df.info()`

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sample 6 non-null object
1 data 6 non-null int64
2 for 6 non-null int64
3 pandas 6 non-null int64
dtypes: int64(3), object(1)
memory usage: 320.0+ bytes
```

**3. Info about numerical values in dataframe**

In [6]:

`data_df.describe()`

Out[6]:

**4. Changing Index**

In [7]:

`data_df`

Out[7]:

In [8]:

`data_df.set_index('sample')`

Out[8]:

In [9]:

`data_df.set_index('data')`

Out[9]:

**Setting Multiple Index**

In [10]:

`data_df.set_index(['sample', 'data'])`

Out[10]:

**5. Restting the Index**

In [11]:

`data_df.reset_index(drop=True)`

Out[11]:

**6. Locating a row**

In [12]:

`data_df.loc[3]`

Out[12]:

```
sample p
data 4
for 10
pandas 9
Name: 3, dtype: object
```

In [13]:

`data_df.loc[5]`

Out[13]:

```
sample e
data 6
for 12
pandas 2
Name: 5, dtype: object
```

**7. Loacating a value with specific row and column**

In [15]:

`data_df.at[3, 'data']`

Out[15]:

`4`

In [16]:

`data_df.at[4, 'pandas']`

Out[16]:

`12`

**8. Renaming a column**

In [18]:

`data_df.rename(columns={'data':'numbers', 'for':'in'})`

Out[18]:

In [19]:

`data_df`

Out[19]:

*.sample( )* method gives a subset random rows from dataframe. The number of rows to get is given as an argument in *.sample( )* method as given below.

In [20]:

`data_df.sample(3)`

Out[20]:

In [21]:

`data_df.sample(3)`

Out[21]:

*.head()* method is used to obtain the sample from top of the dataframe in respective order as in dataframe. The number of rows required is given as argument in the method as shown below

In [22]:

`data_df.head(3)`

Out[22]:

*.tail()* method is used to obtain the sample from below in respective order as in dataframe. The number of rows required is given as argument in the method as shown below

In [23]:

`data_df.tail(2)`

Out[23]:

In [24]:

`data_df`

Out[24]:

Now suppose if we want to obtain a dataframe only consisting of the data with the 'pandas' column less than 8. So, it can be done as given below

In [25]:

`data_df.pandas<8`

Out[25]:

```
0 True
1 True
2 False
3 False
4 False
5 True
Name: pandas, dtype: bool
```

In [26]:

`data_df[data_df.pandas<8]`

Out[26]:

In [27]:

`data_df['pandas']<8`

Out[27]:

```
0 True
1 True
2 False
3 False
4 False
5 True
Name: pandas, dtype: bool
```

In [28]:

`data_df[data_df['pandas']<8]`

Out[28]:

In [29]:

`import jovian`

In [ ]:

`jovian.commit()`

```
[jovian] Attempting to save notebook..
```