Trying Data Science(3) : Cramming Pandas - Series / DataFrames

July 14, 2023 18:01

Hello, everyone!

Fortunately, my data science studies are still going on.
Today I'm going to cover Pandas.
It is too heavy to cram all at once, so I divided it into 2 parts.

A Series is very similar to a NumPy array.
In fact, it is built on top of the NumPy array object.

What is the difference between a NumPy array and a Series?
A Series can have axis labels.
This means that a Series can be indexed by a label, not just by a numeric position.

Then, let's get started!

Series

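A Series can be created from a list, a NumPy array, or a dictionary. First, let's import the libraries and prepare some sample data.

import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']           # index labels
my_list = [10, 20, 30]             # plain Python list
arr = np.array([10, 20, 30])       # NumPy array
d = {'a': 10, 'b': 20, 'c': 30}    # dictionary (keys become the index)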
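The sample objects above can be passed to pd.Series to build a Series. Here is a minimal sketch of the three ways; the variable names (ser_default, ser_from_list and so on) are just ones I picked for this example.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a': 10, 'b': 20, 'c': 30}

# From a list without an index: pandas assigns 0, 1, 2
ser_default = pd.Series(data=my_list)

# From a list or a NumPy array with explicit labels
ser_from_list = pd.Series(data=my_list, index=labels)
ser_from_arr = pd.Series(data=arr, index=labels)

# From a dictionary: the keys become the index
ser_from_dict = pd.Series(d)

print(ser_from_dict)
```

All three labeled Series hold the same values under the same labels, so which constructor you use mostly depends on what form your data is already in.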

Using an Index

The key to using a Series is understanding its index.
Pandas uses these index names or numbers to find values.
This allows for fast lookups of information (like a hash table or dictionary).

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser1

Out

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
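To see what the index buys you, here is a small sketch of a label lookup and of how arithmetic between two Series lines up on labels. ser2 is a hypothetical second Series that I made up for this example.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
# Hypothetical second Series with one different label ('Italy')
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

usa = ser1['USA']              # look up a value by its label
has_japan = 'Japan' in ser1    # membership test checks the index labels

# Arithmetic matches values by label, not by position.
# Labels that exist in only one Series produce NaN.
total = ser1 + ser2
print(total)
```

Because a missing label produces NaN, the result is converted to floats even though both inputs hold integers.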

DataFrame

What is a DataFrame?
A DataFrame is a collection of Series objects.
Each column is a Series, and all the columns share the same index.

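Let's create a DataFrame. The values come from randn, so unless a seed is set, the numbers will differ from run to run.

from numpy.random import randn

df = pd.DataFrame(randn(5, 4), index='A B C D E'.split(), columns='W X Y Z'.split())
df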

Out

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
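To check the idea that a DataFrame is made of Series, you can pull out a column and look at its type. This sketch uses np.arange instead of random numbers so the values are predictable; that choice is mine, not part of the example above.

```python
import numpy as np
import pandas as pd

# Deterministic values (0..19) instead of randn, just for this sketch
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())

col_w = df['W']             # a single column comes back as a Series
two_cols = df[['W', 'Z']]   # a list of columns comes back as a DataFrame

# Every column carries the same index as the DataFrame itself
same_index = col_w.index.equals(df['Z'].index)

print(type(col_w))
print(type(two_cols))
```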

A new column can be created by assigning to a column name that does not exist yet.

In

df['new'] = df['W'] + df['Y']
df

Out

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762

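To remove a column, use drop with 'axis=1'. By default, drop returns a new DataFrame and leaves the original unchanged, so if you want to save the change into the original DataFrame, you need to write 'inplace=True'.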

In

df.drop('new',axis=1,inplace=True)
df

Out

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
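It helped me to see what happens when 'inplace=True' is left out. The following is a small sketch with deterministic values (np.arange is my own choice here) that also shows reassignment as an alternative and dropping a row instead of a column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())
df['new'] = df['W'] + df['Y']

# Without inplace=True, drop returns a new DataFrame and df is untouched
dropped = df.drop('new', axis=1)
still_there = 'new' in df.columns   # the column is still in df

# Reassigning the result is an alternative to inplace=True
df = df.drop(columns='new')

# axis=0 (the default) drops rows by label instead of columns
without_row = df.drop('E', axis=0)
print(without_row)
```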

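Selecting rows and columns

To select specific rows and columns by label, pass two lists to loc: first the row labels, then the column labels.

In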

df.loc[['A','B'],['W','Y']]

Out

          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
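loc selects by label, and its counterpart iloc selects by integer position. Here is a minimal sketch comparing the two, again with deterministic np.arange values that I chose for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())

row_a = df.loc['A']                          # one row by label, returned as a Series
by_label = df.loc[['A', 'B'], ['W', 'Y']]    # rows and columns by label
by_pos = df.iloc[[0, 1], [0, 2]]             # the same cells by integer position
single = df.loc['B', 'Y']                    # a single cell

print(by_label)
```

Both selections return the same sub-table here, because rows A and B sit at positions 0 and 1, and columns W and Y sit at positions 0 and 2.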

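Resetting the index

reset_index moves the current index into a regular column named 'index' and replaces it with the default integer index (0, 1, 2, ...).

In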

df.reset_index()

Out

  index         W         X         Y         Z
0     A  2.706850  0.628133  0.907969  0.503826
1     B  0.651118 -0.319318 -0.848077  0.605965
2     C -2.018168  0.740122  0.528813 -0.589001
3     D  0.188695 -0.758872 -0.933237  0.955057
4     E  0.190794  1.978757  2.605967  0.683509
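Note that reset_index, like drop, does not change the original DataFrame unless you ask it to. A short sketch of this, plus the 'drop=True' option that throws the old labels away instead of keeping them as a column (deterministic values are my own choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())

# The old index becomes a column named 'index'; df itself is unchanged
reset = df.reset_index()

# drop=True discards the old labels instead of keeping them as a column
discarded = df.reset_index(drop=True)

print(reset.columns.tolist())
print(discarded.columns.tolist())
```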

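Setting the index

To use a column as the index, call set_index. Here a 'States' column is added first. As I explained above, if you want to save the change into the original DataFrame, write 'inplace=True'.

In

df['States'] = 'CA NY WY OR CO'.split()
df.set_index('States', inplace=True)
df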

Out

               W         X         Y         Z
States
CA      2.706850  0.628133  0.907969  0.503826
NY      0.651118 -0.319318 -0.848077  0.605965
WY     -2.018168  0.740122  0.528813 -0.589001
OR      0.188695 -0.758872 -0.933237  0.955057
CO      0.190794  1.978757  2.605967  0.683509
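One thing to watch out for: set_index replaces the old index, so the A to E labels are gone afterwards. Below is a minimal sketch of the non-inplace version, assuming the same 'States' values as in the output above and deterministic numbers that I picked for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())
df['States'] = 'CA NY WY OR CO'.split()

# Without inplace=True, set_index returns a new DataFrame
indexed = df.set_index('States')

ny_row = indexed.loc['NY']        # rows can now be looked up by state
restored = indexed.reset_index()  # 'States' goes back to being a column

print(indexed)
```

After the round trip, restored has the default integer index; the original A to E labels only survive in df, which was never modified.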

Conclusion
It is a little bit tricky for me to understand the new concepts in pandas, but I have no choice but to get familiar with them. This is still quite basic, so I need to stick with it! In the next session, I'll cover missing data, group by, merging, and so on. These share some similarities with SQL. I'm looking forward to it!

CRE-CO Inc., an engineer-first company
Son
