GraphLab is a Python library that offers many features out of the box. It is a great library for learning the foundations of Machine Learning. Many courses out there teach several algorithms with a bunch of different tools and examples that are far from the real world. If you are new to Machine Learning, GraphLab (powered by Dato) is a great library to start with.
In this post I’ll try to give some intuition about SFrames and I’ll show some simple data visualization examples using iPython Notebooks.
First, go ahead and download GraphLab Create from https://dato.com/products/create/ . If you are a student, you can use GraphLab Create for 1 year at no charge for academic purposes: https://dato.com/download/academic.html . After downloading and installing GraphLab Create, launch iPython Notebook. Here is the simple data set that I’ll use for the rest of the post: people-example.csv
# Import the GraphLab library
import graphlab

# Then initialize an SFrame variable. This will hold our sample data.
sf = graphlab.SFrame('people-example.csv')
You should see output very similar to the following:
Finished parsing file /Users/muhammetergenc/people-example.csv
Parsing completed. Parsed 7 lines in 0.055689 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as
column_type_hints=[str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/muhammetergenc/people-example.csv
Parsing completed. Parsed 7 lines in 0.047955 secs.
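The log mentions column_type_hints, the list of types used to parse each column. As a rough intuition only, here is a plain-Python sketch of what casting columns by a list of type hints looks like. It uses the standard csv module rather than GraphLab, the helper name read_with_type_hints is made up for this illustration, and the two sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io

# Hypothetical sample rows; the real people-example.csv may differ.
CSV_TEXT = """First Name,Last Name,Country,age
Alice,Smith,United States,36
Bob,Jones,USA,41
"""


def read_with_type_hints(text, column_type_hints):
    """Parse CSV text and cast each column with the matching type.

    Illustrative helper only; this is not part of GraphLab.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first line holds the column names
    rows = []
    for raw in reader:
        # Pair every raw string with its column name and its type hint.
        rows.append({
            name: cast(value)
            for name, cast, value in zip(header, column_type_hints, raw)
        })
    return rows


rows = read_with_type_hints(CSV_TEXT, [str, str, str, int])
print(rows[0])
```

If a hint is wrong (say, int for a column that contains text), the cast fails, which is the kind of situation where the log suggests correcting the list and passing it to read_csv in the column_type_hints argument.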
Now that we have loaded our data, let’s start with the basics.
# Type sf and press shift + enter on iPython Notebook
sf
The sf.head() function fetches the first few rows of the data, and sf.tail() retrieves the last few. However, because we don’t have that many records in our dataset, the output of all three (sf, sf.head() and sf.tail()) will be the same.
GraphLab Canvas is a built-in visualization tool that comes with GraphLab Create.
# We can view any data structure in GraphLab Canvas.
# We will use our sample data for the following examples.
sf.show()
The output will redirect you to the Canvas web application.
You can click on each column to see its most frequent items. The Table view also shows your data in a clean, nicely structured way. SFrames do not need to hold all of the data in memory, so you can explore very large data sets, even a billion rows, in GraphLab Canvas.
Here are some more simple operations:
# Set the target as iPython notebook to view visualization directly in your notebook.
graphlab.canvas.set_target('ipynb')

# View the age column's visualization in iPython notebook in categorical format.
sf['age'].show(view='Categorical')

# We can also calculate the mean value or the max value of the age column.
sf['age'].mean()
sf['age'].max()
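To make these column operations less of a black box, here is a rough plain-Python equivalent on a small list. The ages are hypothetical (not the values in people-example.csv), and the frequency count is only an approximation of what a categorical view summarizes; it says nothing about how GraphLab computes these internally.

```python
from collections import Counter

# Hypothetical ages standing in for the 'age' column.
ages = [24, 23, 22, 23, 23, 22, 25]

# Mean and max, the same summaries the column methods report.
mean_age = sum(ages) / float(len(ages))
max_age = max(ages)

# A categorical view boils down to "how often does each value occur?"
age_counts = Counter(ages).most_common()

print(mean_age, max_age)
print(age_counts)
```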
Create new columns in SFrame
sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
This code creates a new column, Full Name, by joining the First Name and Last Name columns with a space in between.
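The + here works element by element: for every row, the first name, a space, and the last name are joined into one string. A minimal plain-Python sketch of the same idea, using two hypothetical lists in place of the SFrame columns:

```python
# Hypothetical values standing in for the two name columns.
first_names = ['Alice', 'Bob']
last_names = ['Smith', 'Jones']

# Join the values row by row, just like the column expression does.
full_names = [first + ' ' + last
              for first, last in zip(first_names, last_names)]

print(full_names)
```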
You may have noticed that our Country column has United States for some rows and USA for others. We could write a function and use it in a for loop to fix this for each row. However, there is a cleaner and neater way to do this in GraphLab Create.
Advanced Transformation of our Data
Let’s write a function that will change ‘USA’ to ‘United States’.
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country
Now in the next line if you try
transform_country('Turkey')
# You will get 'Turkey' as the output
But if you try
transform_country('USA')
# You will get 'United States' as the output
Let’s apply this to all the rows in our dataset.
sf['Country'] = sf['Country'].apply(transform_country)
Now print out our data set again by typing sf.
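Conceptually, apply calls the function once for every value in the column and collects the results into a new column, so no explicit for loop is needed. A plain-Python sketch of that idea, using a hypothetical list of countries in place of the SFrame column:

```python
def transform_country(country):
    """Normalize 'USA' to 'United States'; leave other values alone."""
    if country == 'USA':
        return 'United States'
    return country


# Hypothetical values standing in for the Country column.
countries = ['United States', 'USA', 'Turkey']

# Rough equivalent of applying the function to every value in a column.
cleaned = [transform_country(c) for c in countries]

print(cleaned)
```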
Now we have cleaned our data and added a new column very easily using GraphLab Create!