Converting a Python dictionary to a DataFrame

In data analysis or machine learning applications, we often need to convert dictionaries to DataFrames. A DataFrame is a data structure in Pandas, similar to a table in Excel.

What is a dictionary? What is a DataFrame?

A dictionary is one of the most commonly used data types in Python. It is a collection of key-value pairs in which each value is accessed by its key. Since Python 3.7, dictionaries preserve the order in which keys were inserted.

Here is an example of a dictionary:

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
        "height": [165, 170, 180, 175]}

You can access the data in the dictionary as follows:

print(data["name"]) # Outputs ['Alice', 'Bob', 'Charlie', 'Dave']
print(data["age"]) # Outputs [25, 32, 18, 47]

A DataFrame is a data structure in Pandas. It is a labeled two-dimensional array. Each column can have a different data type (e.g., string, integer, floating-point number, etc.).

The following is an example of a DataFrame:

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie", "Dave"],
                   "age": [25, 32, 18, 47],
                   "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
                   "height": [165, 170, 180, 175]})

print(df)

Output results:

      name  age       city  height
0    Alice   25    Beijing     165
1      Bob   32   Shanghai     170
2  Charlie   18  Guangzhou     180
3     Dave   47   Shenzhen     175

How to convert a dictionary to a DataFrame?

You can pass a dictionary directly to the pd.DataFrame() constructor in Pandas. Each key becomes a column name, and each list becomes the values of that column. The basic call is as follows:

df = pd.DataFrame(data)

Here is a complete example:

import pandas as pd

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
        "height": [165, 170, 180, 175]}

df = pd.DataFrame(data)

print(df)

The output is the same as in the example above.
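
As a quick sanity check on the conversion, the following sketch reuses the same data dictionary and confirms that each key becomes a column and that the list length determines the number of rows. It also converts the DataFrame back with to_dict(orient="list"). The variable name round_trip is only illustrative.

```python
import pandas as pd

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
        "height": [165, 170, 180, 175]}

df = pd.DataFrame(data)

# Each dictionary key becomes a column, in insertion order.
print(list(df.columns))   # ['name', 'age', 'city', 'height']

# Each list supplies one value per row.
print(len(df))            # 4

# Converting back gives a dictionary of lists again.
round_trip = df.to_dict(orient="list")
print(round_trip == data)
```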

If you want to specify the order of the columns in a DataFrame, you can use the following code:

df = pd.DataFrame(data, columns=["name", "age", "city", "height"])

If you want to select only some columns to include, you can use the following code:

df = pd.DataFrame(data, columns=["name", "age"])

In this case, only the “name” and “age” columns are included, while “city” and “height” are omitted.
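
A minimal sketch of two ways to keep a subset of columns, assuming the same data dictionary: passing columns= at construction time, or building the full DataFrame first and then selecting columns with a list of names. The variable names subset_a and subset_b are illustrative.

```python
import pandas as pd

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
        "height": [165, 170, 180, 175]}

# Option 1: choose the columns while constructing the DataFrame.
subset_a = pd.DataFrame(data, columns=["name", "age"])

# Option 2: build the full DataFrame, then select columns by name.
subset_b = pd.DataFrame(data)[["name", "age"]]

print(list(subset_a.columns))   # ['name', 'age']
print(list(subset_b.columns))   # ['name', 'age']
```

The first form avoids building columns you do not need. The second is handy when the full DataFrame already exists.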

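If a column you need is missing from the dictionary, you can add it with reindex(), which fills the new column with NaN, and then replace the missing values with the fillna() method. Here is an example in which the dictionary has no "height" key:

import pandas as pd

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"]}

# "height" is not in the dictionary, so reindex() creates it as a column of NaN
df = pd.DataFrame(data).reindex(columns=["name", "age", "city", "height"])

# Replace the missing values with 0
df = df.fillna(0)

print(df)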

Output:

      name  age       city  height
0    Alice   25    Beijing     0.0
1      Bob   32   Shanghai     0.0
2  Charlie   18  Guangzhou     0.0
3     Dave   47   Shenzhen     0.0
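
When different columns need different defaults, fillna() also accepts a dictionary that maps column names to fill values. The sketch below is only an illustration: the None entries and the chosen defaults (0 and "Unknown") are made up for this example and are not part of the data used above.

```python
import pandas as pd

# Hypothetical data with a few missing entries (None).
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, None, 18, 47],
        "city": ["Beijing", "Shanghai", None, "Shenzhen"]}

df = pd.DataFrame(data)

# Count the missing values in each column before filling.
print(df.isna().sum())

# Fill each column with its own default value.
filled = df.fillna({"age": 0, "city": "Unknown"})
print(filled)
```

Because the age column contains a missing value, Pandas stores it as floating-point numbers, so the filled value appears as 0.0 rather than 0.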

How to operate on a DataFrame?

Below are some common DataFrame operations:

  • df.head(n): Returns the first n rows of a DataFrame, with a default of 5.
  • df.tail(n): Returns the last n rows of a DataFrame, with a default of 5.
  • df.shape: Returns the dimensions of a DataFrame, i.e., the number of rows and columns.
  • df.columns: Returns the column names of a DataFrame.
  • df.dtypes: Returns the data type of each column in a DataFrame.
  • df.info(): Prints a summary of a DataFrame, including the number of rows and columns, the non-null count of each column, and the data type of each column.
  • df.describe(): Returns summary statistics for the numeric columns of a DataFrame, including count, mean, standard deviation, minimum, quartiles, and maximum.
  • df.sort_values(by="column_name"): Returns the DataFrame sorted by the specified column.
  • df.groupby(by="column_name"): Groups the rows of a DataFrame by the specified column, usually followed by an aggregate function such as mean() or sum().

Here are some simple examples:

import pandas as pd

data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
        "height": [165, 170, 180, 175]}

df = pd.DataFrame(data)

print(df.head())      # Output the first 5 rows (here, all 4)
print(df.tail())      # Output the last 5 rows (here, all 4)
print(df.shape)       # Output the dimensions
print(df.columns)     # Output the column names
print(df.dtypes)      # Output the data types
print(df.info())      # info() prints its summary itself and returns None, so "None" is printed too
print(df.describe())  # Output basic statistics
print(df.sort_values(by="age"))  # Sort by age
print(df.groupby(by="city").mean(numeric_only=True))  # Group by city and average the numeric columns

Output:

      name  age       city  height
0    Alice   25    Beijing     165
1      Bob   32   Shanghai     170
2  Charlie   18  Guangzhou     180
3     Dave   47   Shenzhen     175
      name  age       city  height
0    Alice   25    Beijing     165
1      Bob   32   Shanghai     170
2  Charlie   18  Guangzhou     180
3     Dave   47   Shenzhen     175
(4, 4)
Index(['name', 'age', 'city', 'height'], dtype='object')
name      object
age        int64
city      object
height     int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      int64
 2   city    4 non-null      object
 3   height  4 non-null      int64
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None
             age      height
count   4.000000    4.000000
mean   30.500000  172.500000
std    12.396236    6.454972
min    18.000000  165.000000
25%    23.250000  168.750000
50%    28.500000  172.500000
75%    35.750000  176.250000
max    47.000000  180.000000
      name  age       city  height
2  Charlie   18  Guangzhou     180
0    Alice   25    Beijing     165
1      Bob   32   Shanghai     170
3     Dave   47   Shenzhen     175
            age  height
city
Beijing    25.0   165.0
Guangzhou  18.0   180.0
Shanghai   32.0   170.0
Shenzhen   47.0   175.0
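
Because every city in the example above appears only once, grouping does not actually combine any rows. The following sketch uses a slightly different, made-up dataset in which cities repeat, so the effect of groupby() and of a descending sort is easier to see. The data values and the variable names oldest_first and city_means are hypothetical.

```python
import pandas as pd

# Hypothetical data: two people per city, so that grouping combines rows.
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
        "age": [25, 32, 18, 47],
        "city": ["Beijing", "Shanghai", "Beijing", "Shanghai"],
        "height": [165, 170, 180, 175]}

df = pd.DataFrame(data)

# Sort from oldest to youngest.
oldest_first = df.sort_values(by="age", ascending=False)
print(oldest_first["name"].tolist())   # ['Dave', 'Bob', 'Alice', 'Charlie']

# Average the numeric columns within each city.
city_means = df.groupby(by="city")[["age", "height"]].mean()
print(city_means)
```

Selecting the numeric columns before calling mean() keeps the text column "name" out of the aggregation.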

Conclusion

The Pandas DataFrame() constructor easily converts a dictionary into a DataFrame, which makes the data easier to manipulate and analyze. Becoming familiar with the basic operations and common functions of DataFrames will make your data analysis and processing more efficient. I hope this article helps you learn data analysis in Python!
