Trying Data Science(3) : Cramming Pandas - Series / DataFrames
Hello, everyone!
Fortunately, my data science studies are still going on.
Today I'm going to cover Pandas.
It is too much to cram in all at once, so I divided it into two parts.
A Series is very similar to a NumPy array.
In fact, it is built on top of the NumPy array object.
What is the difference between a NumPy array and a Series?
A Series can have axis labels.
This means that a Series can be indexed by a label, instead of just by a number location.
Then, let's get started!
Series
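A Series can be created from a list, a NumPy array, or a dictionary.
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(arr, index=labels)  # from an array with explicit labels (my_list works the same way)
pd.Series(d)  # from a dictionary: the keys become the index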
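To make the difference from a NumPy array concrete, here is a minimal sketch using the same values. The names ser, second_by_position, second_by_label and second_by_iloc are my own, chosen only for illustration.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
arr = np.array([10, 20, 30])

# A NumPy array can only be indexed by integer position.
second_by_position = arr[1]

# A Series built from the same data also carries labels.
ser = pd.Series(arr, index=labels)
second_by_label = ser['b']     # label-based lookup
second_by_iloc = ser.iloc[1]   # position-based lookup still works via iloc

print(second_by_position, second_by_label, second_by_iloc)
```

All three lookups reach the same element; only the Series lets you ask for it by name.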
Using an Index
The key to using a Series is understanding its index.
Pandas uses these index names or numbers to allow fast lookups of information (like a hash table or dictionary).
In
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
Out
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
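As a rough sketch of what the index is used for, the example below looks up a value by label and adds two Series. ser2 is a hypothetical second Series that I made up for illustration; it is not part of the original example.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
# ser2 is a hypothetical second Series with one different label ('Italy').
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

usa = ser1['USA']            # look up a value by its label
has_japan = 'Japan' in ser1  # membership is checked against the index

# Arithmetic aligns on labels; a label missing on one side gives NaN.
total = ser1 + ser2
print(total)
```

Values are matched by label rather than by position, so 'USSR' and 'Italy', which each appear in only one Series, end up as NaN in the result.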
DataFrame
What is a DataFrame?
A DataFrame is a bunch of Series objects.
It puts the Series objects together so that they share the same index.
Let's create a DataFrame.
In
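from numpy.random import randn  # random values: the numbers change on each run unless a seed is set
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df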
Out
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
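The idea that a DataFrame is a set of Series sharing one index can be checked directly. This is a small sketch with made-up values; the names w, x and col are assumptions of mine.

```python
import pandas as pd

index = ['A', 'B', 'C']
# Two hypothetical Series that share the same labels.
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# A dict of Series becomes a DataFrame: the keys are the column names.
df = pd.DataFrame({'W': w, 'X': x})

col = df['W']  # selecting one column gives back a Series
print(type(col))
print(col.index.equals(df.index))
```

Selecting a single column returns a Series whose index is the same as the DataFrame's index.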
A new column can be created by assigning to a column name that does not exist yet.
In
df['new'] = df['W'] + df['Y']
df
Out
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
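By default, drop returns a new DataFrame and leaves the original unchanged. To remove a column from the original DataFrame itself, you need to write 'inplace=True'. Here axis=1 means that a column is dropped rather than a row.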
In
df.drop('new',axis=1,inplace=True)
df
Out
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
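The reason inplace=True matters is that drop otherwise hands back a new DataFrame and leaves the original alone. Here is a minimal sketch with a small made-up DataFrame (the values and the names dropped and still_there are mine):

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]}, index=['A', 'B'])

# Without inplace=True, drop returns a new DataFrame; df is untouched.
dropped = df.drop('new', axis=1)
still_there = 'new' in df.columns

# Reassigning the result is an alternative to inplace=True.
df = df.drop('new', axis=1)
print(list(df.columns))
```

Either style works; reassigning just makes it more visible that the DataFrame has changed.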
Selecting rows and columns by label with loc
In
df.loc[['A','B'],['W','Y']]
Out
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
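For comparison, here is a sketch of a few other selection forms on a small made-up DataFrame. Note that iloc (selection by integer position) is not covered in this post; I include it only as the position-based counterpart of loc.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y'],
)

row_by_label = df.loc['A']               # one row by label (a Series)
row_by_position = df.iloc[0]             # the same row by position
subset = df.loc[['A', 'B'], ['W', 'Y']]  # rows A, B and columns W, Y
single_value = df.loc['B', 'Y']          # one cell

print(subset)
```

Passing lists to loc gives back a smaller DataFrame, while a single label gives a Series or a single value.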
Resetting the index
In
df.reset_index()
Out
index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
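reset_index moves the current index into a column named 'index' and replaces it with the default 0, 1, 2, ... numbering. Like drop, it returns a new DataFrame by default. A minimal sketch with made-up values (reset and reset_dropped are my own names):

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

# The old index becomes a column named 'index'; df itself is unchanged.
reset = df.reset_index()
# drop=True discards the old index instead of keeping it as a column.
reset_dropped = df.reset_index(drop=True)

print(reset)
print(list(df.index))
```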
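As I explained above, if you want to save the change to the original DataFrame, write 'inplace=True'. Here a 'States' column is added first and then set as the index.
In
df['States'] = 'CA NY WY OR CO'.split()
df.set_index('States',inplace=True)
df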
Out
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
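Here is a small sketch of set_index without inplace=True, assuming a made-up three-row DataFrame. The names by_state and ny_w are mine.

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
df['States'] = ['CA', 'NY', 'WY']  # the column must exist before set_index

by_state = df.set_index('States')  # returns a new DataFrame by default
ny_w = by_state.loc['NY', 'W']     # rows are now looked up by state

print(by_state)
print(list(df.index))              # the original keeps its A, B, C index
```

The 'States' column moves out of the columns and becomes the index of the returned DataFrame, while the original is left as it was.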
Conclusion
These new pandas concepts are a little tricky for me to understand, but I have no choice but to get familiar with them. This is still quite a basic level, so I need to stick with it! Next session, I'll cover missing data, group by, merging, and so on. That part shares some similarities with SQL. I'm looking forward to it!
CRE-CO Inc., an engineer-first company
Son