If you work in the analytics world, you already know tools such as SAS, SPSS, and R. I have experience working with three of them: SAS, R, and Python. SAS has long been the first choice for data analysis because of its ease of use, wide acceptance, and the large pool of experts in the market.
Since this is the era of free and open-source software, I will talk about R and Python. Their primary advantage is that they are free, so a company doesn't need to invest in licenses. Besides that, almost all advanced analytical methodologies are available in Python or R, methodologies for which you would pay a lot of money in commercial software.
Why choose Python over R:
1. Multiprocessing support: Python supports multiprocessing out of the box, whereas base R is single-threaded, so complex methods take longer to run. For example, some studies have reported that training a Random Forest takes about one-tenth the time in Python compared to R (see the sketch after this list).
2. Memory management: Python does not require a very high-end machine, whereas R is heavily dependent on RAM; even handling 1 GB of data on a system with 2 GB of RAM can be tough. For example, a table of 2 GB or more cannot easily be worked on in R, but it can be handled in Python, e.g. by reading the file in chunks (see the sketch after this list).
3. Python has an even larger number of ready-made functions for data preparation and charting compared to R.
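As a minimal sketch of points 1 and 2, assuming hypothetical files part1.csv, part2.csv, and part3.csv, the snippet below reads each file in 100,000-row chunks (so memory use stays flat regardless of file size) and processes the files on separate cores with multiprocessing.Pool:

from multiprocessing import Pool
import pandas as pd

def row_count(path):
    # Stream the CSV in 100,000-row chunks so memory use stays flat
    return sum(len(chunk) for chunk in pd.read_csv(path, chunksize=100000))

if __name__ == '__main__':
    files = ['part1.csv', 'part2.csv', 'part3.csv']  # assumed file names
    with Pool(processes=3) as pool:          # one worker per file
        counts = pool.map(row_count, files)  # runs in parallel across cores
    print(dict(zip(files, counts)))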
Python has a long list of functions to perform tasks at all stages of a data mining project. I have explained a few examples below. For statistical modeling you need the libraries below.
1. import pandas as pd
2. import statsmodels.api as sm
3. import pylab as pl
4. import numpy as np
Here are a few functions available for each stage of data mining. I have also mentioned the similar methods in SAS or R.
1. Reading data from sources (df is a data frame):
a. df = pd.read_csv("path_to_file/file_name.csv")
b. Similarly read_table, read_fwf, read_clipboard, and readers for XML, MongoDB, etc.
c. Lots of options to customize how files are read, such as the header row, number of rows to read, delimiter, etc., as in the sketch below.
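A minimal sketch of these options (the file name and column names are assumptions, not from a real dataset):

import pandas as pd

df = pd.read_csv(
    'file_name.csv',
    sep=',',                 # delimiter
    header=0,                # which row holds the column names
    nrows=1000,              # read only the first 1,000 rows
    na_values=['NA', '?'],   # extra strings to treat as missing
    usecols=['Default', 'Rating_score'],  # read only these columns
)
print(df.shape)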
2. Data Cleaning
a. df.describe() – like PROC SUMMARY in SAS or summary() in R.
b. df.get_dtype_counts() – lists the number of columns of each data type.
c. pd.crosstab(df['Default'], df['Rating_score'], rownames=['Default']) – like table() in R or PROC FREQ in SAS.
d. df.hist(); pl.show() – draws the distribution of every numeric column, like PROC UNIVARIATE in SAS.
e. dummy_rating = pd.get_dummies(df['rating_score'], prefix='R') – creates dummy (indicator) variables from a categorical column.
f. df.head() – to see the first 5 rows, like head() in R.
g. Binning of data: new_col = pd.qcut(data, 4) # cut into quartiles (pd.cut(data, 4) gives equal-width bins instead). See the sketch below.
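A minimal sketch tying these cleaning functions together on a small made-up frame (column names and values are purely illustrative):

import pandas as pd
import pylab as pl

df = pd.DataFrame({
    'Default': [0, 1, 0, 1, 0, 1],
    'Rating_score': ['A', 'B', 'A', 'C', 'B', 'C'],
    'balance': [120.0, 950.5, 80.2, 1500.0, 60.0, 2200.0],
})

print(df.describe())                                    # numeric summary
print(pd.crosstab(df['Default'], df['Rating_score']))   # frequency table
dummy_rating = pd.get_dummies(df['Rating_score'], prefix='R')  # dummy variables
df['bal_bin'] = pd.qcut(df['balance'], 4)               # quartile bins
df.hist()    # histogram of every numeric column
pl.show()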
3. Missing and outlier treatment
a. A number of methods exist to identify and treat missing values, e.g., filling them with the mean, mode, median, or any other value. A few examples:
· df['col_name'].fillna('missing') – fill NULLs with any user-defined value.
· df.fillna(method='pad') – fill NULLs with the value from the row above.
· df.dropna(axis=0) – drop rows that have any null value.
· df.dropna(axis=1) – drop columns that have any null value.
b. It is easy to find outlier values in the data by visual inspection or by filtering on some characteristic.
· df[(np.abs(df) > 10000).any(axis=1)] – selects rows having values exceeding some threshold.
· Replacing a list of values, whether missing or outliers: df.replace([np.nan, 100], [0, 500]) – replaces NaN with 0 and the value 100 with 500 (see the sketch below).
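A minimal sketch of these treatments on a toy frame (the 10000 threshold mirrors the filter above; all values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 50000.0],
                   'y': [np.nan, 2.0, 2.0, 4.0]})

filled = df.fillna(df.mean())             # impute each column with its mean
padded = df.fillna(method='pad')          # or copy the value from the row above
no_na_rows = df.dropna(axis=0)            # drop rows with any missing value
outliers = df[(np.abs(df) > 10000).any(axis=1)]  # rows holding extreme values
df = df.replace(50000.0, df['x'].median())       # replace the outlier value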
4. Creating train and test datasets. There are different ways to randomly select data for train and test (see the sketch after this list):
a. df = df.sample(frac=1) – reshuffle the dataset (replacing numpy.random.shuffle, which operates on plain arrays rather than data frames).
b. cut = int(0.8 * len(df)); train, test = df.iloc[:cut], df.iloc[cut:] – manual slice in an 80/20 ratio.
c. Another method, splitting in an 80/20 ratio with scikit-learn: df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
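A minimal sketch of both approaches, assuming df already holds the modeling data (the import path is sklearn.model_selection in current scikit-learn versions):

from sklearn.model_selection import train_test_split

# 1. Manual: shuffle the rows, then slice at the 80% mark
shuffled = df.sample(frac=1, random_state=0)
cut = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]

# 2. scikit-learn: one call shuffles and splits in an 80/20 ratio
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)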
5. Running predictive models
a. logit = sm.Logit(df['default'], df[predictor_cols]); result = logit.fit() – logistic regression, where predictor_cols is the list of predictor column names.
b. clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0); scores = cross_val_score(clf, X, y) – random forest scored by cross-validation, where X holds the predictor columns and y the response variable.
c. Python supports all the advanced analytics methodologies; a fuller sketch of the two models above follows.
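A minimal end-to-end sketch, assuming a frame like the toy one above with a 'Default' response; predictor_cols and the feature names are assumptions:

import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

predictor_cols = ['Rating_score', 'balance']    # hypothetical feature columns
X = pd.get_dummies(df[predictor_cols]).astype(float)  # numeric design matrix
y = df['Default']

# Logistic regression with statsmodels (response first, then predictors)
logit = sm.Logit(y, sm.add_constant(X))
result = logit.fit()
print(result.summary())

# Random forest scored by cross-validation
clf = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0)
scores = cross_val_score(clf, X, y, cv=3)   # 3-fold to suit a small sample
print(scores.mean())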