Hello everyone! Today, I’d like to write about the Pandas library (link to the site). Pandas stands for “Python Data Analysis Library”. According to the Wikipedia page on Pandas, “the name is an abbreviation of “panel data”, which is an econometric term used to describe large-scale structured databases.” However, I think Pandas is a cute name for a very useful Python library!
Pandas is an absolute game changer when it comes to analyzing data with Python, and it is one of the most popular and widely used tools for data wrangling, if not the most popular one. Pandas is open-source, free software (released under the BSD license) and was originally developed by Wes McKinney.
What’s cool about Pandas is that it takes data (like a CSV or TSV file, or an SQL database) and creates a Python object with rows and columns called a data frame. It looks very similar to a table in statistical software (think Excel or SPSS, for example; if you’re familiar with R, you’ll recognize the similarity as well). It is so much easier to work with than lists or dictionaries manipulated with for loops or list comprehensions (feel free to read one of my earlier blog posts covering basic data analysis with Python; the same tasks would be much simpler with Pandas!).
Installation and Getting Started
To “get” Pandas, you need to install it. You also need to have Python 3.5.3 or higher as a prerequisite (it works with Python 3.6, 3.7, and 3.8). Pandas depends on other packages (like NumPy) and has optional dependencies (like Matplotlib for plotting). Therefore, I think the easiest way to get Pandas set up is to install it through a package such as the Anaconda distribution, “a cross-platform distribution for data analysis and scientific computing.” There you can install Pandas for Windows, OS X, and Linux. If you’d like to install it another way, here are the full installation instructions.
To use Pandas in a Python IDE (Integrated Development Environment) such as Jupyter Notebook or Spyder (both come with Anaconda by default), you first need to import the Pandas library. Importing a library means loading it into memory so that it’s available to work with. To import Pandas, all you have to do is run these commands:
import pandas as pd
import numpy as np
Usually, you’ll want to include the second part (‘as pd’) so you can access Pandas with ‘pd.command’ instead of having to type ‘pandas.command’ every time you need to use it. It’s a good idea to import NumPy too, since it’s a very useful library for scientific computing with Python. Pandas is now ready for use! Remember, you will need to do this every time you start a new Jupyter Notebook, Spyder file, etc.
Working with Pandas
Loading and Saving Data with Pandas
When you want to use Pandas for data analysis, you’ll usually use it in one of three ways:
Convert a Python list, dictionary, or NumPy array to a Pandas data frame
Open a local file with Pandas, usually a CSV file, but it could also be a delimited text file (like TSV) or an Excel file
Open a remote file or database, like a CSV or JSON on a website through a URL, or read from an SQL table or database
There are different commands for each of these options, but when you open a file, they all look like this:
pd.read_filetype()
As I mentioned before, Pandas can work with various file types, so you replace “filetype” with the actual file type (like “csv”), and put the path, name of the file, etc. inside the parentheses. Inside the parentheses you can also pass additional arguments that relate to how the file is opened. There are numerous arguments, and to know them all you have to read the documentation (for example, the documentation for pd.read_csv() contains all the arguments you can pass to this Pandas command).
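As a quick sketch (the file contents and column names here are made up), reading a CSV might look like this. I pass an in-memory text buffer so the example is self-contained; in practice you would pass a file path like pd.read_csv("data.csv"):

```python
import io
import pandas as pd

# A small made-up CSV; in real use, pass a path such as "data.csv".
csv_text = "name,year\nAlice,1984\nBob,1999\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2): two rows, two columns
```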
To convert a certain Python object (a dictionary, lists, etc.), the basic command is:
pd.DataFrame()
In the parentheses you specify the object(s) you’re building the data frame from. This command also takes different options (see the documentation).
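For example, here is a minimal sketch of building a data frame from a plain dictionary (the names and values are made up); the keys become column names and the values become column contents:

```python
import pandas as pd

# Made-up data: dictionary keys become columns, values become rows.
data = {"name": ["Alice", "Bob"], "year": [1984, 1999]}
df = pd.DataFrame(data)

print(df)
```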
You can also save a data frame you’re working with to different kinds of data files (like CSV, Excel, JSON, and SQL tables). The general code is:
df.to_filetype(filename)
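Here is a small sketch of saving with to_csv. To keep the example self-contained, I omit the filename, in which case to_csv returns the CSV text as a string; normally you would pass a path, e.g. df.to_csv("out.csv"):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "year": [1984, 1999]})

# With no path, to_csv returns the CSV as a string instead of writing a file.
# index=False leaves out the row index column.
csv_text = df.to_csv(index=False)
print(csv_text)
```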
Inspecting and Viewing Data
Once you’ve loaded your data, it’s time to take a look at it. How does the data frame look? Running the name of the data frame will give you the entire table, but you can also get just the first n rows with df.head(n) or the last n rows with df.tail(n). df.shape gives you the number of rows and columns. df.info() gives the index, datatypes, and memory information. The command s.value_counts(dropna=False) lets you view unique values and their counts for a series (like a column, or a few columns). A very useful command is df.describe(), which returns summary statistics for the numerical columns. You can also get statistics for the entire data frame or a series (a column, etc.):
df.mean(): returns the mean of all columns
df.corr(): returns the correlation between columns in the data frame
df.count(): returns the number of non-null values in each column of the data frame
df.max(): returns the highest value in each column
df.min(): returns the lowest value in each column
df.median(): returns the median of each column
df.std(): returns the standard deviation of each column
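A short sketch of these inspection commands on a made-up data frame (column names and values are invented for the example):

```python
import pandas as pd

# Made-up numerical data to inspect.
df = pd.DataFrame({"year": [1984, 1999, 2007], "score": [3.0, 4.5, 6.0]})

print(df.head(2))     # first 2 rows
print(df.shape)       # (3, 2): rows, columns
print(df.describe())  # summary statistics for the numerical columns
print(df["score"].mean())                    # mean of one column
print(df["year"].value_counts(dropna=False)) # unique values and counts
```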
Selecting Data
One thing that is much easier to do in Pandas is selecting the data you want, compared to selecting a value from a list or a dictionary. You can select a column (df[col]) and return it as a Series with the label col, or several columns (df[[col1, col2]]) and return them as a DataFrame. You can select by position (s.iloc[0]) or by index (s.loc['index_one']). To select the first row, you can use df.iloc[0,:], and to select the first element of the first column, use df.iloc[0,0]. These can be used in different combinations, and I hope this gives you an idea of the different selection and indexing options in Pandas.
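To make those selection options concrete, here is a small sketch (the data frame, index labels, and column names are made up):

```python
import pandas as pd

# Made-up data frame with labeled rows.
df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "year": [1984, 1999]},
    index=["row_one", "row_two"],
)

col = df["name"]              # one column, returned as a Series
sub = df[["name", "year"]]    # several columns, returned as a DataFrame
first_row = df.iloc[0, :]     # first row, selected by position
by_label = df.loc["row_one"]  # the same row, selected by index label
cell = df.iloc[0, 0]          # first row, first column
print(cell)                   # Alice
```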
Sort, Filter and Groupby
There are different ways to filter rows. For example, df[df['year'] > 1984] returns only the rows where the 'year' column is greater than 1984. You can use & (and) and | (or) to combine different conditions in your filtering. This is also called boolean filtering.
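A minimal sketch of boolean filtering (the column name 'year' and the values are invented); note that each condition needs its own parentheses when combined with & or |:

```python
import pandas as pd

# Made-up data to filter.
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"], "year": [1984, 1999, 2007]})

# Keep only the rows where 'year' is greater than 1984.
recent = df[df["year"] > 1984]

# Combine conditions with & (and) / | (or); wrap each condition in parentheses.
between = df[(df["year"] > 1984) & (df["year"] < 2007)]

print(recent["name"].tolist())  # ['Bob', 'Carol']
```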
You can sort the values in a certain column in ascending order using df.sort_values(col1), or in descending order using df.sort_values(col2, ascending=False). Moreover, it’s possible to sort values by col1 in ascending order, then by col2 in descending order, using df.sort_values([col1,col2], ascending=[True,False]).
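Here is a quick sketch of the three sorting variants (column names col1/col2 and the values are made up):

```python
import pandas as pd

# Made-up data to sort.
df = pd.DataFrame({"col1": [2, 1, 2], "col2": [5, 9, 7]})

asc = df.sort_values("col1")                    # ascending by col1
desc = df.sort_values("col2", ascending=False)  # descending by col2

# Ascending by col1, then descending by col2 within ties.
both = df.sort_values(["col1", "col2"], ascending=[True, False])
print(both)
```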
The last command in this subsection is groupby. It involves splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure. df.groupby(col) returns a groupby object for values from one column, while df.groupby([col1,col2]) returns a groupby object for values from multiple columns.
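The split-apply-combine idea can be sketched like this (the 'team' and 'points' columns are invented for the example):

```python
import pandas as pd

# Made-up data: two teams, two rows each.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [1, 3, 2, 4],
})

# Split by 'team', average 'points' within each group, combine the results.
means = df.groupby("team")["points"].mean()
print(means)  # a -> 2.0, b -> 3.0
```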
Data Cleaning
Data cleaning is a vital step in data analysis. For example, we always check for missing values in the data by running pd.isnull(), which checks for null values and returns an array of booleans (True for missing values, False for non-missing ones). To get a total count of missing/null values, run pd.isnull().sum(). pd.notnull() is the opposite of pd.isnull(). Once you have a list of missing values, you can drop them with df.dropna() to drop the rows, or df.dropna(axis=1) to drop the columns. A different approach is to fill the missing values with other values, using df.fillna(x), which fills the missing values with x (you can put there whatever you want), or s.fillna(s.mean()), which replaces all null values with the mean (mean can be replaced with almost any function from the statistics section above).
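A small sketch of these cleaning commands on a made-up data frame containing missing values:

```python
import numpy as np
import pandas as pd

# Made-up data with some missing (NaN) values.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

missing_per_column = df.isnull().sum()  # count of nulls in each column
dropped_rows = df.dropna()              # drop rows containing any null
dropped_cols = df.dropna(axis=1)        # drop columns containing any null
filled = df.fillna(df.mean())           # fill nulls with each column's mean
print(missing_per_column)
```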
Sometimes it’s necessary to replace values with different values. For example, s.replace(1, 'one') replaces all values equal to 1 with 'one'. It’s also possible to do it for multiple values: s.replace([1,3], ['one','three']) replaces all 1 with 'one' and 3 with 'three'. You can also rename specific columns by running df.rename(columns={'old_name': 'new_name'}), or use df.set_index('column_one') to change the index of the data frame.
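A sketch of replacing and renaming (the series values and the column names 'old_name'/'new_name'/'column_one' are illustrative, not from any real data set):

```python
import pandas as pd

s = pd.Series([1, 3, 1])
print(s.replace(1, "one").tolist())              # replace a single value
print(s.replace([1, 3], ["one", "three"]).tolist())  # replace several values

df = pd.DataFrame({"column_one": ["a", "b"], "old_name": [1, 2]})
df = df.rename(columns={"old_name": "new_name"})  # rename a column
df = df.set_index("column_one")                   # use a column as the index
print(df.index.tolist())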
Join/Combine
The last set of basic Pandas commands is for joining or combining data frames, or rows/columns. The three commands are:
df1.append(df2): add the rows in df2 to the end of df1 (columns should be identical)
pd.concat([df1, df2], axis=1): add the columns in df1 to the end of df2 (rows should be identical)
df1.join(df2, on=col1, how='inner'): SQL-style join of the columns in df1 with the columns of df2, where the rows of col1 have identical values. 'how' can be one of: 'left', 'right', 'outer', 'inner'
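A small sketch of combining data frames (the data is made up). Note that df1.append() was deprecated and later removed in recent Pandas versions, so I use pd.concat for stacking rows, and I use df.merge for the SQL-style join, since df1.join matches on the index by default:

```python
import pandas as pd

# Made-up data frames sharing the 'col1' column.
df1 = pd.DataFrame({"col1": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"col1": ["b", "c"], "y": [3, 4]})

# Stack rows (modern replacement for df1.append(df1)).
stacked = pd.concat([df1, df1], ignore_index=True)

# Place columns side by side (rows should line up).
side_by_side = pd.concat([df1, df2], axis=1)

# SQL-style inner join on a shared column.
merged = df1.merge(df2, on="col1", how="inner")
print(merged)  # only the row where col1 matches in both ('b')
```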
These are the most basic Pandas commands, but I hope you can see how powerful Pandas is for data analysis. This post is just the tip of the iceberg; after all, entire books can be (and have been) written about data analysis with Pandas. I hope it inspired you to take some data and start playing with it using Pandas! 🙂