OzEMedicine - Wiki for Australian Emergency Medicine Doctors

AI machine learning

Introduction

machine learning turns data into numbers, then finds patterns in those numbers; it is useful when the rules for producing a desired output from the inputs are too complex or are not well known
you should use a traditional programming solution rather than machine learning if you can build a simple rule-based system
machine learning can adapt to changing environments or scenarios
machine learning can help to discover insights in large volumes of data; however, the patterns learned can be uninterpretable by humans, and the outputs aren't always predictable and may be erroneous because they are based on probability and on the (usually large) datasets used for training
machine learning with “shallow algorithms” such as decision trees works best on structured data such as rows and columns of data; examples include:
- gradient boosted machine such as XGBoost
- random forest
- naive Bayes
- nearest neighbour
- support vector machine
AI deep learning is typically used for unstructured data (such as images, audio and free text) and uses neural networks; the data is first encoded as tensors (n-dimensional arrays of numbers), which gives it more structure; examples include:
- neural networks
- fully connected neural network
- convolutional neural network
- recurrent neural network
- transformer

basic steps

get raw data
clean data to remove duplicates or irrelevancies and to convert text into numerical values
split the data into training vs testing data
create a model such as neural network or decision trees
train the model with the training data
test the model by using it to make predictions on the testing data
evaluate the accuracy of those predictions and improve it by fine-tuning the parameters
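
The steps above can be sketched end to end as follows. This is only a minimal sketch, assuming the small iris dataset bundled with scikit-learn stands in for your raw data; the max_depth and random_state values are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. get raw data (assumption: the bundled iris dataset stands in for your own data)
df = load_iris(as_frame=True).frame  # 4 numeric feature columns plus a 'target' column

# 2. clean data (here: just remove duplicate rows; the values are already numeric)
df = df.drop_duplicates()

# 3. split the data into training and testing parts
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4. create a model and 5. train it with the training data
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# 6. use the model to make predictions on the testing data
predictions = model.predict(X_test)

# 7. evaluate the accuracy (fraction of test rows predicted correctly)
score = accuracy_score(y_test, predictions)
print(f"accuracy on test data: {score:.2f}")
```

To try to improve the score, change a parameter such as max_depth and re-run steps 4 to 7.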

Python libraries for AI

NumPy
- special Python n-dimensional array type for faster processing, with additional properties and methods
- all items in an array must be the same data type
- np1 = np.array([… ])
- np1.shape gives the size of each dimension as a tuple (it is a property, not a method, so no brackets); for a 1D array, len(np1) gives the number of items
- np2 = np.arange(10) will create an array [0,1,2,3,4,5,6,7,8,9]; can also use np.arange(startvalue, endvalue, step)
- np3 = np.zeros(10) will create an array of ten zeros [0,0,0,0,0,0,0,0,0,0]
- np4 = np.zeros((2,10)) will create a 2D array of zeros with 2 rows and 10 columns
- np5 = np.full((10),4) will create an array of ten items filled with value 4 [4,4,4,4,4,4,4,4,4,4]
- np7 = np.array(python_list) will create a numpy array from a Python list
Pandas - data frames (tables of rows and columns, like an Excel sheet)
Matplotlib - 2D charting
scikit-learn - machine learning model types
Jupyter - lets your code be split into cells, with each cell run by itself or all together, and provides better inspection of data
- once installed via Anaconda, to run, open up a terminal window, then type: jupyter notebook
- notebooks are saved as .ipynb files
- if green left bar - in edit mode (hit ESC to go to command mode)
- if blue left bar - in command mode
  - press A to insert a new cell above and press B to insert a new cell below
  - press D twice to delete the active cell
- each mode has different shortcuts - press H to see them
- shift-tab for tool tip
- Ctrl-Enter to run cell without adding new cells
- Ctrl-slash to comment out selection
MS Visual Studio Code
- used to view the .dot chart files
  - need to run it then install extension: Graphviz (dot) language support for Visual Studio Code by Stephanvs
Anaconda (5GB) - installs the above
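
Returning to the NumPy calls listed above, the following short sketch can be pasted into a Jupyter cell to check what each one produces. The array contents and the names mixed and python_list are made up purely for illustration.

```python
import numpy as np

np1 = np.array([1, 2, 3, 4])
print(np1.shape)  # (4,)  shape is a property, so no brackets
print(len(np1))   # 4

np2 = np.arange(2, 10, 2)  # startvalue, endvalue (excluded), step
print(np2)                 # [2 4 6 8]

np3 = np.zeros(10)         # 1D array of ten zeros
np4 = np.zeros((2, 10))    # 2D array of zeros: 2 rows, 10 columns
np5 = np.full(10, 4)       # 1D array of ten items, all with value 4

python_list = [1.5, 2.5, 3.5]
np7 = np.array(python_list)  # numpy array built from a Python list
print(np7.dtype)             # float64

# all items must share one data type, so the integer here is stored as a float
mixed = np.array([1, 2.5])
print(mixed.dtype)           # float64
```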

Machine learning with Jupyter, sklearn and Pandas

basic code

import pandas as pd
from sklearn.tree import DecisionTreeClassifier # (if this is the model you wish to use)
from sklearn.model_selection import train_test_split # only needed when training model
from sklearn.metrics import accuracy_score # only needed when evaluating model
import joblib # needed to save the trained model (sklearn.externals.joblib has been removed from current scikit-learn versions)
from sklearn import tree # only needed to visualise the model tree
df = pd.read_csv('csv_filename') # this imports the csv data file into a pandas dataframe variable df - enter df on its own line to display the data in a table
df.shape # this will output number of rows and columns
df.describe() # this gives the data statistics of each column - count, mean, std, min, 25%, 50%, 75%, max
df.values # displays the data as a numpy array (values is a property, not a method, so no brackets)
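
The inspection calls above can be tried without a file on disk. The sketch below assumes a tiny made-up CSV held in a string; the column names age, heart_rate and admitted and all the values are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# hypothetical CSV content (made-up values), standing in for 'csv_filename'
csv_text = """age,heart_rate,admitted
34,88,0
71,110,1
25,72,0
63,95,1
"""

df = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts a file path or a file-like object

print(df.shape)       # (4, 3)  rows, columns
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, quartiles and max per column
print(df.values)      # the underlying data as a numpy array
```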

split the data into train and test parts

X = df.drop(columns=['output_columnname']) # will create a new dataset X with that column removed (by convention the input dataset has a capitalised name X and the output a lower-case name y)
y = df['output_columnname'] # create a new dataset y with only the output column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # split your data into training and test parts, in this case 20% of data will be used in testing phase
model = DecisionTreeClassifier() # create your model type
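
By default train_test_split shuffles the rows randomly, so each run gives a different split. The sketch below, which uses a made-up ten-row table with hypothetical column names, shows two optional parameters that may help: random_state makes the split repeatable, and stratify=y keeps the proportion of each output class similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical data (made-up values) with 'admitted' as the output column
df = pd.DataFrame({
    "age": [34, 71, 25, 63, 48, 80, 19, 55, 67, 42],
    "heart_rate": [88, 110, 72, 95, 84, 120, 70, 90, 105, 78],
    "admitted": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

X = df.drop(columns=["admitted"])  # inputs
y = df["admitted"]                 # output

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```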

now train, view and save your model

model.fit(X_train,y_train) # trains the model
tree.export_graphviz(model, out_file='chartfilename.dot', feature_names=list(X.columns), class_names=[str(c) for c in sorted(y.unique())], label='all', rounded=True, filled=True) # optionally saves a graphic display of the trained decision tree as a .dot file
joblib.dump(model, 'saved_model_filename.joblib') # to save the trained model
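
If you only want a quick look at the trained tree inside the notebook, without a .dot viewer, scikit-learn also provides export_text, which returns the rules as plain text. This is a minimal sketch, assuming the bundled iris dataset and an illustrative max_depth of 2.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()  # assumption: bundled dataset used only for illustration
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# one indented line per decision rule, ending in the predicted class
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```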

now evaluate model using your test dataset:

predictions = model.predict( X_test )
predictions # to display the predictions as an array of predictions
score = accuracy_score(y_test, predictions)
score # to display score
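
The accuracy score is simply the fraction of test rows that were predicted correctly. The sketch below uses two short made-up lists to show this, and adds confusion_matrix, which breaks the same result down by class (rows are the actual classes, columns the predicted classes).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up values for illustration only
y_test = [0, 1, 0, 1, 0]       # actual outputs
predictions = [0, 1, 1, 1, 0]  # model outputs: 4 of the 5 match

score = accuracy_score(y_test, predictions)
print(score)  # 0.8

matrix = confusion_matrix(y_test, predictions)
print(matrix)
# [[2 1]
#  [0 2]]
```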

later, you can load your saved model and use it to create predictions without needing to re-train it

need to re-check this code!

import pandas as pd
from sklearn.tree import DecisionTreeClassifier # (if this is the model you wish to use)
import joblib # needed to load or save the trained model (sklearn.externals.joblib has been removed from current scikit-learn versions)

model = joblib.load('saved_model_filename.joblib') # to load the trained model
predictions = model.predict( your_new_dataset_array_to_analyse )
predictions # to display the predictions as an array of predictions
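
One way to check the save and load steps is a round trip: save a trained model, load it back, and confirm that both give the same predictions. This is a minimal sketch, assuming the bundled iris dataset and a temporary folder; new data passed to predict needs the same columns, in the same order, as the training data.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # assumption: bundled dataset used only for illustration
model = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "saved_model_filename.joblib")
    joblib.dump(model, path)         # save the trained model
    loaded_model = joblib.load(path) # load it back without re-training

new_rows = iris.data[:5]  # stands in for your new dataset
same = (loaded_model.predict(new_rows) == model.predict(new_rows)).all()
print(same)  # True
```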
