AI machine learning
Introduction
machine learning turns data into numbers and then finds patterns in those numbers; it is useful when the rules to create a desired output from the inputs are too complex or are not well known
you should use a traditional programming solution rather than machine learning if you can build a simple rule-based system
machine learning can adapt to changing environments or scenarios
machine learning can help to discover insights in large volumes of data; however, the patterns learned can be uninterpretable by humans, and the outputs aren't always predictable and may be erroneous, as they are based upon probability and patterns in relatively large datasets
machine learning with “shallow algorithms” such as decision trees works best on structured data such as rows and columns of data, examples include:
AI deep learning is typically used for unstructured data and uses neural networks, although representing the data as tensors can give it more structure, examples include:
basic steps
get raw data
clean data to remove duplicates or irrelevancies and to convert text into numerical values
split the data into training vs testing data
create a model such as neural network or decision trees
train the model with the training data
test the model by making predictions on the testing data
evaluate the accuracy and improve - fine tune parameters
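The cleaning step in the list above can be sketched with Pandas as follows. This is only a minimal illustration: the column names (age, gender, notes, genre), the values and the F/M to 0/1 mapping are all invented here and are not from any real dataset.

```python
import pandas as pd

# Hypothetical raw data: every name and value here is made up for illustration.
raw = pd.DataFrame({
    "age": [25, 32, 25, 47, 51],
    "gender": ["F", "M", "F", "M", "F"],
    "notes": ["a", "b", "a", "c", "d"],   # free-text column treated as irrelevant
    "genre": ["pop", "rock", "pop", "jazz", "classical"],
})

# Remove exact duplicate rows, then drop the irrelevant column.
clean = raw.drop_duplicates().drop(columns=["notes"]).reset_index(drop=True)

# Convert the text input column into numerical values (assumed mapping).
clean["gender"] = clean["gender"].map({"F": 0, "M": 1})

print(clean)
```

The output column (genre) is left as text here, since a scikit-learn classifier such as DecisionTreeClassifier accepts text labels as the target.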
Python libraries for AI
NumPy
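special python n-dimensional array type (ndarray) for faster processing, along with additional properties and methods
all items in an array must be the same data type
import numpy as np
np1 = np.array([… ])
np1.shape gives the size of each dimension as a tuple, e.g. (10,) for a 1D array of 10 items; it is an attribute, not a method, so no brackets; len(np1) gives the length of the first dimension only
np2 = np.arange(10) will create an array [0,1,2,3,4,5,6,7,8,9]; can also use np.arange(startvalue, endvalue, step), where endvalue itself is excluded
np3 = np.zeros(10) will create an array of ten zeros [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.] (floats by default)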
np4 = np.zeros((rows, columns)) will create a 2D array of zeros; the shape is passed as a tuple
np5 = np.full(10, 4) will create an array of ten items filled with value 4 [4,4,4,4,4,4,4,4,4,4]
np7 = np.array(python_list) will create a numpy array from a python list
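A short runnable sketch of the array-creation calls above; the specific sizes and values are chosen only for illustration.

```python
import numpy as np

a = np.arange(2, 10, 2)        # startvalue, endvalue (excluded), step
z = np.zeros((2, 3))           # shape given as a tuple: 2 rows, 3 columns
f = np.full(5, 4)              # five copies of the value 4
m = np.array([1, 2.5, 3])      # mixed ints and floats are upcast to one float type

print(a, a.shape, len(a))      # shape is an attribute (a tuple), not a method
print(z.shape, z.dtype)
print(f)
print(m.dtype)
```

Because every item must share one data type, NumPy converts the integers in m to floats rather than storing mixed types.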
Pandas - data frame like Excel
Matplotlib - 2D charting
scikit-learn - machine learning model types
Jupyter - lets your code be segmented into cells, with each cell run by itself or all together, and provides better inspection of data
once installed via Anaconda, to run, open up a terminal window, then type: jupyter notebook
ipynb files
if green left bar - in edit mode (hit ESC to go to command mode)
if blue left bar - in command mode
each mode has different shortcuts - press H to see them
shift-tab for tool tip
Ctrl-Enter to run cell without adding new cells
Ctrl-slash to comment out selection
MS Visual Studio Code
Anaconda (about 5 GB) - installs the above
Machine learning with Jupyter, sklearn and Pandas
basic code
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # (if this is the model you wish to use)
from sklearn.model_selection import train_test_split # only needed when training model
from sklearn.metrics import accuracy_score # only needed when evaluating model
import joblib # needed to save the trained model (sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is now imported directly)
from sklearn import tree # only needed to visualise the model tree
df = pd.read_csv('csv_filename') # this imports the csv data file into a pandas dataframe variable df; enter df on its own as the last line of a cell to display the data in a table
df.shape # this will output number of rows and columns
df.describe() # this gives the data statistics of each column - count, mean, std, min, 25%, 50%, 75%, max
df.values # displays the data as a numpy array (an attribute, so no brackets)
split the data into train and test parts
X = df.drop(columns=['output_columnname']) # will create a new dataset X with that column removed (by convention the input dataset is a capital X and the output is a lowercase y)
y = df['output_columnname'] # create a new dataset y with only the output column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # split your data into training and test parts, in this case 20% of data will be used in testing phase
model = DecisionTreeClassifier() # create your model type
now train, view and save your model
model.fit(X_train,y_train) # trains the model
tree.export_graphviz(model, out_file='chartfilename.dot', feature_names=['train_column1', 'train_column2'], class_names=[str(c) for c in sorted(y.unique())], label='all', rounded=True, filled=True) # optional: saves the trained decision tree as a Graphviz .dot file for graphic display; feature_names must list the columns of X in order, as strings
joblib.dump(model, 'saved_model_filename.joblib') # to save the trained model
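As an alternative way to view a trained tree without a Graphviz viewer, scikit-learn also provides export_text, which prints the learned rules as plain text. The sketch below assumes a tiny made-up dataset (the age, gender and genre values are invented for illustration) rather than a real csv file.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data, invented for illustration only.
X = pd.DataFrame({
    "age": [20, 25, 30, 35, 40, 45],
    "gender": [0, 1, 0, 1, 0, 1],
})
y = pd.Series(["pop", "pop", "pop", "jazz", "jazz", "jazz"])

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Plain-text rendering of the decision rules, one line per node.
rules = export_text(model, feature_names=list(X.columns))
print(rules)
```

Passing list(X.columns) as feature_names keeps the names in the same order as the training columns, which avoids mislabelled nodes.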
now evaluate model using your test dataset:
predictions = model.predict( X_test )
predictions # to display the predictions as an array of predictions
score = accuracy_score(y_test, predictions)
score # to display score
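The split, train and evaluate steps above can be run end to end on a small in-memory dataframe, as sketched below. The data is invented for illustration, and random_state is an addition not used in the notes above: without it, train_test_split shuffles differently on each run, so the score can change from run to run.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for pd.read_csv('csv_filename').
df = pd.DataFrame({
    "age": [20, 22, 25, 27, 30, 33, 36, 40, 44, 48],
    "gender": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "genre": ["pop"] * 5 + ["jazz"] * 5,
})

X = df.drop(columns=["genre"])   # input columns
y = df["genre"]                  # output column

# random_state fixes the shuffle so the same rows land in each part every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)   # fraction of correct predictions
print(len(X_train), len(X_test), score)
```

With only two test rows the score can only be 0.0, 0.5 or 1.0, so a real evaluation needs far more data than this sketch uses.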
later, you can load your saved model and use it to create predictions without needing to re-train it
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # (if this is the model you wish to use)
import joblib # needed to load or save the trained model (sklearn.externals.joblib was removed in scikit-learn 0.23)
model = joblib.load('saved_model_filename.joblib') # to load the trained model
predictions = model.predict( your_new_dataset_array_to_analyse )
predictions # to display the predictions as an array of predictions
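A minimal sketch of the save and load round trip, assuming a tiny invented dataset and a temporary folder in place of a real file location. It checks that the loaded model gives the same predictions as the one that was trained.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [age, gender] inputs and genre outputs.
X = [[20, 0], [25, 1], [40, 0], [45, 1]]
y = ["pop", "pop", "jazz", "jazz"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Save to a temporary folder, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "saved_model_filename.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

new_rows = [[22, 1], [43, 0]]    # new data must have the same columns, in the same order
same = (model.predict(new_rows) == loaded.predict(new_rows)).all()
print(loaded.predict(new_rows), same)
```

A saved model should be loaded with the same scikit-learn version that created it where possible, and joblib files should only be loaded from sources you trust, since loading can execute arbitrary code.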