Converting a Python dictionary to a DataFrame
In data analysis or machine learning applications, we often need to convert dictionaries to DataFrames. A DataFrame is a data structure in Pandas, similar to a table in Excel.
What is a dictionary? What is a DataFrame?
A dictionary is one of the most commonly used data types in Python. It is a collection of key-value pairs in which each value is accessed by its key. Since Python 3.7, dictionaries preserve the order in which keys were inserted.
Here is an example of a dictionary:
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
"age": [25, 32, 18, 47],
"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
"height": [165, 170, 180, 175]}
You can access the data in the dictionary as follows:
print(data["name"]) # Outputs ['Alice', 'Bob', 'Charlie', 'Dave']
print(data["age"]) # Outputs [25, 32, 18, 47]
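Beyond indexing with square brackets, a dictionary can also be queried safely and iterated. The short sketch below reuses the same dictionary; the key "weight" is a made-up example of a key that is not present.

```python
data = {
    "name": ["Alice", "Bob", "Charlie", "Dave"],
    "age": [25, 32, 18, 47],
    "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
    "height": [165, 170, 180, 175],
}

# Keys come back in insertion order (guaranteed since Python 3.7).
keys = list(data.keys())
print(keys)

# data["weight"] would raise KeyError; get() returns a default instead.
weight = data.get("weight", [])
print(weight)

# Iterate over key-value pairs and record how many values each key holds.
lengths = {key: len(values) for key, values in data.items()}
print(lengths)
```

Every list in this dictionary has the same length, which is what lets Pandas treat each key as a column later on.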
A DataFrame is a data structure in Pandas. It is a labeled two-dimensional array. Each column can have a different data type (e.g., string, integer, floating-point number, etc.).
The following is an example of a DataFrame:
import pandas as pd
df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie", "Dave"],
"age": [25, 32, 18, 47],
"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
"height": [165, 170, 180, 175]})
print(df)
Output:
name age city height
0 Alice 25 Beijing 165
1 Bob 32 Shanghai 170
2 Charlie 18 Guangzhou 180
3 Dave 47 Shenzhen 175
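To illustrate what "labeled" means here, the following sketch reads values from the same DataFrame by column label and by row label. It is only a minimal illustration of label-based access, not a full tour of indexing.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dave"],
    "age": [25, 32, 18, 47],
    "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
    "height": [165, 170, 180, 175],
})

# Select one column by its label; the result is a Series.
ages = df["age"]
print(ages.tolist())

# Select one row by its index label with .loc; the result is also a Series.
row = df.loc[1]
print(row["name"], row["city"])

# Select a single cell by row label and column label.
cell = df.loc[2, "height"]
print(cell)
```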
How to convert a dictionary to a DataFrame?
You can use the DataFrame() function in Pandas to convert dictionary data to a DataFrame. The syntax of this function is as follows:
df = pd.DataFrame(data)
Here is a complete example:
import pandas as pd
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
"age": [25, 32, 18, 47],
"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
"height": [165, 170, 180, 175]}
df = pd.DataFrame(data)
print(df)
The output is the same as in the example above.
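Dictionaries are not always shaped as one list per column. As a hedged sketch, the example below shows two other common shapes: a dictionary of dictionaries keyed by row, handled with pd.DataFrame.from_dict(..., orient="index"), and a list of dictionaries with one dictionary per row. The variable names and sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical data keyed by person, with one inner dictionary per row.
people = {
    "Alice": {"age": 25, "city": "Beijing"},
    "Bob": {"age": 32, "city": "Shanghai"},
}

# orient="index" turns the outer keys into row labels.
df_rows = pd.DataFrame.from_dict(people, orient="index")
print(df_rows)

# A list of dictionaries is read as one row per dictionary.
records = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 32},
]
df_records = pd.DataFrame(records)
print(df_records)
```

Which form fits best depends on how the source data is organized; the column-oriented dictionary used in the rest of this article needs no extra arguments.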
If you want to specify the order of the columns in a DataFrame, you can use the following code:
df = pd.DataFrame(data, columns=["name", "age", "city", "height"])
If you want to select only some columns to include, you can use the following code:
df = pd.DataFrame(data, columns=["name", "age"])
In this case, only the “name” and “age” columns are included, while “city” and “height” are omitted.
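The two uses of the columns argument can be checked side by side. The sketch below builds one reordered DataFrame and one subset from the same dictionary; the particular column order is chosen only for illustration.

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Dave"],
    "age": [25, 32, 18, 47],
    "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
    "height": [165, 170, 180, 175],
}

# Columns appear in the order given, not in dictionary order.
reordered = pd.DataFrame(data, columns=["city", "name", "height", "age"])
print(list(reordered.columns))

# Keys that are not listed are left out of the result.
subset = pd.DataFrame(data, columns=["name", "age"])
print(subset.shape)
```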
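If a column you need is missing from the dictionary, you can add it with the reindex() method, which fills the new column with NaN, and then replace those missing values with the fillna() method. Here is an example in which the dictionary has no "height" key:
import pandas as pd
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
"age": [25, 32, 18, 47],
"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"]}
df = pd.DataFrame(data).reindex(columns=["name", "age", "city", "height"]).fillna(0)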
print(df)
Output:
name age city height
0 Alice 25 Beijing 0.0
1 Bob 32 Shanghai 0.0
2 Charlie 18 Guangzhou 0.0
3 Dave 47 Shenzhen 0.0
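Missing values also arise naturally when the input is a list of dictionaries and some of them lack a key. The following sketch, using invented records, counts the gaps and then fills each column with a value suited to its type by passing a dictionary to fillna().

```python
import pandas as pd

# Hypothetical records: Bob has no "height" and Charlie has no "city".
records = [
    {"name": "Alice", "age": 25, "city": "Beijing", "height": 165},
    {"name": "Bob", "age": 32, "city": "Shanghai"},
    {"name": "Charlie", "age": 18, "height": 180},
]

df = pd.DataFrame(records)

# Count the missing values in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each column with its own replacement value.
filled = df.fillna({"city": "unknown", "height": 0})
print(filled)
```

Filling per column avoids putting a numeric 0 into a text column such as "city".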
How to operate on a DataFrame?
Below are some common DataFrame operations:
df.head(n): Returns the first n rows of a DataFrame; n defaults to 5.
df.tail(n): Returns the last n rows of a DataFrame; n defaults to 5.
df.shape: Returns the dimensions of a DataFrame, i.e., the number of rows and columns.
df.columns: Returns the column names of a DataFrame.
df.dtypes: Returns the data type of each column in a DataFrame.
df.info(): Prints basic information about a DataFrame, including the number of rows and columns, the non-null count of each column, and the data type of each column.
df.describe(): Returns basic statistics for the numeric columns in a DataFrame, including count, mean, standard deviation, minimum, maximum, and quartiles.
df.sort_values(by=column_name): Sorts rows by the values in the specified column.
df.groupby(by=column_name): Groups the rows of a DataFrame, often used with aggregate functions such as mean() and sum().
Here are some simple examples:
import pandas as pd
data = {"name": ["Alice", "Bob", "Charlie", "Dave"],
"age": [25, 32, 18, 47],
"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
"height": [165, 170, 180, 175]}
df = pd.DataFrame(data)
print(df.head()) # Output the first 5 rows
print(df.tail()) # Output the last 5 rows
print(df.shape) # Output the dimensions
print(df.columns) # Output the column names
print(df.dtypes) # Output the data types
print(df.info()) # Output basic information; info() prints its report itself and returns None, so print() also shows "None"
print(df.describe()) # Output basic statistics
print(df.sort_values(by="age")) # Sort by age
print(df.groupby(by="city")[["age", "height"]].mean()) # Group by city and calculate the mean of the numeric columns
Output:
name age city height
0 Alice 25 Beijing 165
1 Bob 32 Shanghai 170
2 Charlie 18 Guangzhou 180
3 Dave 47 Shenzhen 175
name age city height
0 Alice 25 Beijing 165
1 Bob 32 Shanghai 170
2 Charlie 18 Guangzhou 180
3 Dave 47 Shenzhen 175
(4, 4)
Index(['name', 'age', 'city', 'height'], dtype='object')
name object
age int64
city object
height int64
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 age 4 non-null int64
2 city 4 non-null object
3 height 4 non-null int64
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes
None
age height
count 4.000000 4.000000
mean 30.500000 172.500000
std 12.396236 6.454972
min 18.000000 165.000000
25% 23.250000 168.750000
50% 28.500000 172.500000
75% 35.750000 176.250000
max 47.000000 180.000000
name age city height
2 Charlie 18 Guangzhou 180
0 Alice 25 Beijing 165
1 Bob 32 Shanghai 170
3 Dave 47 Shenzhen 175
age height
city
Beijing 25.0 165.0
Guangzhou 18.0 180.0
Shanghai 32.0 170.0
Shenzhen 47.0 175.0
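In the data above every city appears only once, so each group contains a single row. As a rough sketch of how sorting and grouping behave with repeated values, the variant below uses hypothetical data in which two people share each city.

```python
import pandas as pd

# Hypothetical data: two people per city, so grouping has a visible effect.
data = {
    "name": ["Alice", "Bob", "Charlie", "Dave"],
    "age": [25, 32, 18, 47],
    "city": ["Beijing", "Shanghai", "Beijing", "Shanghai"],
    "height": [165, 170, 180, 175],
}
df = pd.DataFrame(data)

# Sort from oldest to youngest.
oldest_first = df.sort_values(by="age", ascending=False)
print(oldest_first["name"].tolist())

# Average only the numeric columns within each city.
city_means = df.groupby(by="city")[["age", "height"]].mean()
print(city_means)
```

Selecting the numeric columns before calling mean() keeps the text column "name" out of the aggregation.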
Conclusion
The Pandas DataFrame() function makes it easy to convert a dictionary into a DataFrame, which simplifies data manipulation and analysis. I hope this article helps you learn data analysis in Python! Becoming familiar with the basic operations and common functions of DataFrames will make your data analysis and processing more efficient.