Ch2: End-to-End Machine Learning Project

Machine Learning project checklist (from Appendix B):
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to ML algorithms.
5. Explore many different models and shortlist the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

You'll often see slight variations of this checklist. Modify it to suit your own situation and needs! An alternate ML project checklist appears in Andrew Ng's Coursera material.

Introduction to our provided dataset: California Housing Prices (StatLib repository)
● Based on data from the 1990 California census (hence currently unrealistic).
● The author removed some features and added a categorical feature (more instructive).
● Point of possible confusion: examples (rows) are not individual houses! They are block groups (a.k.a. "districts"): the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000).
1. Frame the Problem

Gather information: Talk to people. Hunt them down! This includes not only data stewards, developers responsible for the systems that created/compiled the data, and project leads or C-suite, but also anyone responsible for downstream components that could be affected. The more you nail down at the start, the easier it will be to manage risk and expectations.

What is the business objective? "Knowing the objective is important because it will determine how you frame the problem, which algorithms you will select, which performance measure you will use to evaluate your model, and how much effort you will spend tweaking it."

Proposed objective: prediction of a district's median housing price.
(Almost) universal objective of data science: turn data into money!

Ancillary provided details:
● Our predictions will be fed to another ML system along with many other signals.
● Median district house price is currently estimated manually by experts via complex rules.
○ Costly and time-consuming.
○ Estimates were off by more than 20%.
● Owners of the downstream system confirmed that they want a numeric value, not a coarse-grained approximation via category (cheap/medium/expensive).

Determinations:
● We have labels -> supervised learning task.
● We're trying to predict a continuous target -> regression problem (multiple regression, since we use multiple features).
● We're only predicting one target per example -> univariate regression.

Recall (from the footnotes of Chapter 1): why is regression called regression? "Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he was studying the fact that the children of tall people tend to be shorter than their parents. Since the children were shorter, he called this regression to the mean. This name was then applied to the methods he used to analyze correlations between variables."
Performance Measure:

Basically, how are we going to evaluate and compare models? How can we tell when we've satisfied the objective of the project?

Root Mean Square Error (RMSE), a.k.a. the "l2 norm":

RMSE(X, h) = sqrt( (1/m) * Σ_{i=1..m} ( h(x(i)) − y(i) )² )

where:
● m is the number of instances
● x(i) is a vector of all the feature values of the i-th instance
● y(i) is its label
● h(x(i)) is your system's prediction function, a.k.a. the "hypothesis function"
● ŷ(i) = h(x(i)) is the predicted value of the target/label for that instance (ŷ is pronounced "y-hat").
Some notes regarding RMSE, if you're curious:
● Why the squaring?
○ To penalize large errors.
○ Also, squared error is very easy to differentiate (important for derivative-based methods such as gradient descent).
● Why the square root?
○ It brings us back to the natural, interpretable units of the problem. Without the square root, we'd end up talking about dollars-squared, whatever the heck that is.
● Why RMSE and not MAE?
○ The higher the norm index, the more it focuses on large values and neglects small ones. The choice of index 2 is slightly arbitrary.
○ MAE is preferable when you know you've got plenty of outliers and you know that residuals are not going to have a Normal/Gaussian distribution.

Author's suggestion: check assumptions!
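A small numeric sketch (the arrays are made up) showing how the squaring makes RMSE punish the one large error much more than MAE does:

```python
import numpy as np

# Hypothetical predictions and labels (not from the housing data).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])  # one large, three small errors

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # l2-style: the large error dominates
mae = np.mean(np.abs(errors))         # l1-style: all errors weighted equally

print(rmse)  # ~50.7, pulled way up by the single 100-unit miss
print(mae)   # 32.5
```

Dropping the square root would leave us with mean squared error of 2575 "dollars-squared", which is why we take the root to get back to interpretable units.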
Aside: Pipelines

A sequence of data processing components is called a data pipeline. "Components typically run asynchronously… each component pulls in a large amount of data, processes it, and spits out the result in another data store." Is this true?

Synchronous: sequential, blocking; one task executed at a time; coordinated or aligned with a clock or timer; the executing task must return before proceeding with the next task.
Asynchronous: non-blocking; "fire and forget", i.e. call functions and continue doing other stuff, knowing that those functions will eventually return results on their own time. Python's asyncio (standard library package) provides typical async/await.

The key difference between synchronous and asynchronous processing is what the processor does while it waits for an I/O task to complete. In synchronous execution, the processor remains idle and waits for the I/O task to complete before executing the next set of instructions. Asynchronous execution is not necessarily parallel execution; think about making breakfast.

Good example: preparing breakfast (source?).
1. Pour a cup of coffee.
2. Heat up a pan, then fry two eggs.
3. Fry three slices of bacon.
4. Toast two pieces of bread.
5. Add butter and jam to the toast.
6. Pour a glass of orange juice.

2. Get the Data

Setup: all notebooks, data, and extra goodies are available at the author's GitHub repo: https://github.com/ageron/handson-ml2
If you just want to run the code/notebooks and tinker around without installing a bunch of stuff, you can just use Google Colab.
If you want to run things on your own machine:
● Preferred: Anaconda, the easiest way to get up and running.
● Less preferred: what the author does in the book, i.e. venv/virtualenv/whatever (unless you have a particular reason to use these).
● For cool kids: Docker!

Aside: the importance of virtual environments. It's all about dependencies.

Official Python documentation: "A virtual environment is a Python environment such that the Python interpreter, libraries and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a 'system' Python, i.e., one which is installed as part of your operating system."

Why bother?
● Easier to work on different projects while avoiding package version conflicts: different envs for different types of projects.
● By keeping project dependencies static, or at least isolated, predictable, and explicit, you ensure that if you revisit your project at a later time, it'll actually run. The Python DS/ML ecosystem evolves at a rapid pace; functions get deprecated, and package APIs change all the time.
● For ease of sharing and collaboration. If someone wants to run your code, rather than having to guess which versions of the dependencies you used, they can just recreate their own version of your environment (from a file, which you ought to have provided).
○ Ex:
$ conda env export > environment.yaml
$ conda env create -f environment.yaml
● Avoiding headaches.
○ If you bork one env, at least your others are fine.
○ conda 'solving environment'… wait 487593453 hours, or just bail?

DS/ML folks frequently use conda for package/environment management. It comes with the Anaconda/Miniconda distributions, and it works pretty dang well (until it doesn't).
General Python developers tend to use venv (part of the Python standard library), a venv extension that tries to fix particular limitations/annoyances (virtualenv, virtualenvwrapper), or a venv analog (pyenv, pipenv).

Suggestion: as you build up your various environments over time, try to stick with conda as long as you can. When the time comes that you can't, just switch to using pip.

Basic example of first-time use:
$ conda create -n jupyter matplotlib numpy pandas scipy scikit-learn
$ conda activate jupyter
$ jupyter notebook
# If your default browser hasn't popped up, just manually go to http://localhost:8888/

Helpful:
$ conda init          # if you didn't already agree to have the Anaconda installer do this for you
$ conda info          # spits out a bunch of version info
$ conda env list      # shows available envs and which one is currently active (*)
$ conda list --explicit   # lists packages installed in your currently activated env

Take a Quick Look at the Data Structure:

The core pandas object is the DataFrame; think of it like an Excel sheet on steroids or a table in a SQL database. Each column in a DataFrame is a Series object, which is a one-dimensional ndarray with axis labels; think of it as a snazzy array/list.

Most useful methods for DataFrames/Series:
● head(), or alternatively sample()
● info() – note: with pandas, the 'object' dtype usually means text/strings
● value_counts()
● describe() – very good for a sanity check; look at min/max/mean
● Histograms – a quick & easy way to see an approximation of the distribution of numerical data
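A sketch of those quick-look calls on a toy frame (the real CSV isn't assumed to be downloaded; the column names just mimic the housing data):

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (values are illustrative).
df = pd.DataFrame({
    "median_income": [2.5, 3.8, 1.2, 5.0, 4.1],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"],
})

print(df.head(3))                            # first rows; df.sample(3) for a random peek
df.info()                                    # dtypes + non-null counts; 'object' usually means text
print(df["ocean_proximity"].value_counts())  # category frequencies
print(df.describe())                         # sanity-check min/max/mean of numeric columns
# df["median_income"].hist(bins=50)          # quick distribution check (needs matplotlib)
```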
What to look for:
● Lines/cutoffs.
○ Why? Could indicate problems with data collection, corruption, or preprocessing like clipping/winsorization.
● Basic/known distribution types.
○ Why? We have more math and tools available for known distributions. Also, some model types base their theoretical foundations on assumptions about their input distributions (OLS/linear regression). Homoscedasticity?!
● Very pronounced skew.
○ Why? Again, considerations for model assumptions. But also, skew will be one of the factors involved in decisions about feature scaling and preprocessing.

Takeaways:
● The median income attribute does not look like it is expressed in US dollars.
○ "…the data has been scaled and capped at 15 (actually, 15.0001) for higher median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $30,000)."
○ It's an example of a preprocessed feature.
● Housing median age and median house value were also capped. Potentially problematic.
● Attributes have very different scales.
● There are some tail-heavy distributions.

Recall left skew vs. right skew: where is the mean in relation to the median?
Warning: histograms can be deceiving!
https://towardsdatascience.com/6-reasons-why-you-should-stop-using-histograms-and-which-plot-you-should-use-instead-31f937a0a81c
Create a Test Set:

"Your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model."

It's vitally important to avoid "data snooping bias" and "data leakage" – anything that would give us false confidence in our model.

There are many different ways to sample and split a dataset into train and test sets. Some cool visuals: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html

Most mature data science packages have some mechanism for controlling the randomness used throughout their code/algorithms, typically by setting a seed for the pseudo-random number generator used under the hood.
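In Scikit-Learn that seed is the random_state argument. A sketch of a plain seeded split, plus the income-binned stratified variant discussed next (the incomes here are synthetic; the bin edges are the ones used in the book):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})

# Plain split: the seed (random_state) makes the split reproducible.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Stratified split: bin income into a temporary category, stratify on it, drop it.
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])
strat_train, strat_test = train_test_split(df, test_size=0.2, random_state=42,
                                           stratify=df["income_cat"])
strat_train = strat_train.drop(columns="income_cat")
strat_test = strat_test.drop(columns="income_cat")
```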
Scikit-Learn's train_test_split() function.

Discussion: is anybody actually using hashing or other low-level techniques for sampling/splitting – anything other than tried-and-tested utility functions?

Stratified sampling: the population is divided into homogeneous subgroups called strata. You should not have too many strata, and each stratum should be large enough.

To stratify, we need groups, hence using pandas' pd.cut() method to do "binning":
'median_income' -> bin into temporary column 'income_cat' -> use 'income_cat' for stratification -> drop 'income_cat'

Scikit-Learn's StratifiedShuffleSplit object does this too.

"We spent quite a bit of time on test set generation for a good reason: this is an often neglected but critical part of a Machine Learning project."

We tend to miss out on this when not working on a real problem; Kaggle handles test sets for us, but in an actual business setting it would be our responsibility to ensure the test set is representative of the data we want to make predictions on!

Suggestion: a data dictionary. It's probably a good idea, if it hasn't already been done, to accumulate and document all the basic info you have about the dataset.

3. Discover and Visualize the Data to Gain Insights

This part of the process is usually called Exploratory Data Analysis (EDA).

Author's tip: "If the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast."
Ex: df_for_plotting = df.sample(n=df.shape[0]//10)
This also helps mitigate visual clutter! It's particularly true when playing with visualizations using unsupervised methods like t-SNE, UMAP, etc.

Author's tip: "Our brains are very good at spotting patterns in pictures, but you may need to play around with visualization parameters to make the patterns stand out." -> adjust 'alpha', use various sizes/colors/shapes/'hue'.

Location matters! Proximity to the ocean is important, as well as proximity to urban/city centers. "A clustering algorithm should be useful for detecting the main cluster and for adding new features that measure the proximity to the cluster centers."

Extra plotting tip: get them plots bigger!
1) Set a better default plot size, up in the 'imports' section of your notebook.
2) Explicitly create a figure object, then plot.
3) Use the 'figsize' arg that's available in many plotting functions (see the CA scatter plot above).

Other miscellaneous plotting tips:
● Assign the plot function to a variable, or use ';' to mute the plot function's object/handle output.
● Try not to be lazy: label axes, enable legends, customize tick labels/format/spacing, etc.
● Anyone else have some tips?

Correlations:

Standard correlation coefficient (also called Pearson's r) via pandas' corr() method. The correlation coefficient ranges over [−1, 1]; ~0 means no linear correlation.

Warning: common correlation coefficients (like Pearson's r) have limitations:
● They only measure linear correlations.
● They may completely miss nonlinear relationships.
● Strength of correlation is not related to slope.
Hence the need for scatter plots.

Takeaways:
● The correlation is noticeable/strong; imagine drawing an upward line along the density.
● The price cap at $500,000 is evident from the horizontal line at that value.
● Other artifacts are evident from lines around 460k, 350k, 280k, 220k, and so on.

Author's suggestion: "You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks."

Remember, we have control over the train set. This may sound sketchy, but whatever we can do (with appropriate justification) to make the data more representative and lead to better generalization in the end is fair game.

Intuitive feature engineering: "Try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at."

With ML problems, this sort of feature engineering can matter much more than model choice.
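The combinations from that quote, sketched on a toy frame (the counts are made up, but the column names match the dataset):

```python
import pandas as pd

# Toy district-level counts, named as in the housing dataset.
df = pd.DataFrame({
    "total_rooms":    [8000.0, 1200.0, 3000.0],
    "total_bedrooms": [1600.0,  300.0,  450.0],
    "population":     [3500.0,  800.0, 1200.0],
    "households":     [1400.0,  280.0,  500.0],
})

# The attribute combinations suggested above:
df["rooms_per_household"]      = df["total_rooms"] / df["households"]
df["bedrooms_per_room"]        = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]
```

Once these exist, re-running df.corr() against the target shows whether the engineered ratios correlate better than the raw counts.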
Takeaways:
● Houses with a lower bedroom/room ratio tend to be more expensive. Think about this for a moment: larger houses will tend to have allotted a reasonable number of bedrooms, and the remaining 'room budget' goes to leisure rooms.
● The number of rooms per household is more informative than the total number of rooms (larger houses tend to be more expensive).

Remember: data science is an iterative process. Once you get a prototype up and running, you can analyze the output to gain more insights. Some packages/models may provide you with statistics like p-values (linear regression implementations like OLS in the statsmodels package) or some type of feature importance.

4. Prepare the Data for Machine Learning Algorithms

Author's suggestion: write functions.
● Allows you to reproduce transformations easily on any dataset.
● Accumulate your own snippets and boilerplate code.
● You can use these functions when deploying your model.
● Easy experimentation.

Remember: separate the predictors and the labels; you don't necessarily want to apply the same transformations to the predictors and the target values.

Data Cleaning

Most Machine Learning algorithms cannot work with missing features.
Problem: the 'total_bedrooms' attribute has some missing values. There are a few options:
● Get rid of the corresponding districts (drop rows).
● Get rid of the whole attribute (drop the column).
● Set the values to some value (fill or impute).

Note: it's usually a good idea to avoid inplace=True.

Scikit-Learn's SimpleImputer object: "Only the total_bedrooms attribute had missing values, but we cannot be sure that there won't be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes."

Aside: Scikit-Learn Design

A key aspect you'll notice while working with sklearn: consistency!
● Estimators – any object that can estimate some parameters based on a dataset.
○ Will always have a fit() method.
● Transformers – estimators (such as an imputer) that can also transform a dataset.
○ Will always have transform() and fit_transform() methods.
● Predictors – estimators that are capable of making predictions on a dataset.
○ Will always have predict() and score() methods.

Handling Text and Categorical Attributes

There is only one categorical variable: 'ocean_proximity'. Check value_counts() and nunique(). We need to encode it as a numeric type. Why? "ML algorithms prefer to work with numbers."

Scikit-Learn's OrdinalEncoder class:
Problem: "One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values."

Scikit-Learn's OneHotEncoder class avoids this by giving each category its own binary column.

Custom Transformers

Not everything is built-in.

Note: 'self' in Python plays the same role as 'this' in C++/Java.

"Scikit-Learn relies on duck typing (not inheritance), so all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform()."

Discussion: is this really true? The custom class the book shows inherits from BaseEstimator and TransformerMixin…
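A sketch in the spirit of the book's combined-attributes transformer (the class name and column indices here are hypothetical). On the discussion question: duck typing means fit()/transform() alone would satisfy a Pipeline; inheriting TransformerMixin just supplies fit_transform() for free, and BaseEstimator supplies get_params()/set_params() (handy for grid search):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

ROOMS_IX, HOUSEHOLDS_IX = 0, 1  # hypothetical column positions

class RoomsPerHouseholdAdder(BaseEstimator, TransformerMixin):
    """Appends a rooms-per-household column to a numeric array.

    Strictly speaking, duck typing means fit()/transform() alone
    would be enough; the mixins add convenience methods.
    """
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, ROOMS_IX] / X[:, HOUSEHOLDS_IX]
        return np.c_[X, rooms_per_household]

X = np.array([[8000.0, 1400.0], [1200.0, 280.0]])
X_new = RoomsPerHouseholdAdder().fit_transform(X)  # gains a third column
```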
Feature Scaling

"With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales."

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

Scaling options:
● MinMaxScaler
● StandardScaler
● RobustScaler
● QuantileTransformer
● PowerTransformer

As with all transformations, fit to the training data only.
Transformation Pipelines

Basically objects that you can compose sequentially.

"The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method)."

A better setup than a simple numeric-only Pipeline uses ColumnTransformer:
● ColumnTransformer – applies each transformer to the appropriate columns, then concatenates.
● FeatureUnion – applies each transformer to the entire dataset, then concatenates.

5. Select and Train a Model

Suggestion: start simple, and gradually progress to more complex or computationally expensive models.

LinearRegression()
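The whole flow, sketched end to end on a toy stand-in for the housing predictors (column names and values are illustrative), finishing with the LinearRegression() fit discussed next:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

# Toy predictors with some missing values (illustrative).
df = pd.DataFrame({
    "median_income":   [2.5, np.nan, 5.0, 1.2],
    "total_rooms":     [800.0, 1200.0, np.nan, 600.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})
y = np.array([150000.0, 230000.0, 310000.0, 110000.0])  # toy targets

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # all but the last step must transform
    ("scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "total_rooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

X_prepared = full_pipeline.fit_transform(df)  # 2 scaled + 3 one-hot columns
lin_reg = LinearRegression().fit(X_prepared, y)
```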
It works, but it's pretty bad. "Most districts' median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying."

What's happening here? Underfitting. Either:
● the features do not provide enough information to make good predictions, or
● the model is not powerful enough.

DecisionTreeRegressor()

Weird! Overfitting. What does this mean? The model has effectively memorized all the training examples. If you were to feed it data it hasn't seen before, it would most likely crap the bed.

Recall: the bias-variance tradeoff.
Better Evaluation Using Cross-Validation

Rather than a single split via train_test_split(), we can use Scikit-Learn's K-fold cross-validation.

Note: the number of folds is somewhat arbitrary; the appropriate value depends on the data.

Scikit-Learn's cross_val_score() convenience function:
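A sketch on synthetic data (the real pipeline output isn't assumed here); note Scikit-Learn's "greater is better" scoring convention, hence the negative MSE:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the prepared housing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(random_state=42)
# scoring is "greater is better", so MSE comes back negated.
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())  # honest error estimate + spread
```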
RandomForestRegressor() – the prototypical example of a "bagging" ensemble model.

Author's tip: saving models via pickle.
● Warning: pickling and similar data serialization routines can introduce vulnerabilities in your code!
● joblib is the suggested replacement for pickle; it's more efficient for ndarrays.
● Another option, less flexible but more space-efficient: keep the validation setup fixed, and for each model store the parameters (in case you need to re-train it) and the out-of-fold predictions (oof_predictions), in case you want to use that output down the line for further ensembling (blending/stacking, weighted averages of models, etc.).

6. Fine-Tune Your Model

Grid Search

Evaluate all the possible combinations of hyperparameter values.
Key attributes:
● .best_params_
● .best_estimator_
● .cv_results_

Author's tip: "When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10," a.k.a. logspace.

Note: once you become familiar with particular machine learning algorithms, you'll be better able to tell which parameter values are odd… good for sanity-checking hyperparameter optimization.

Tip: setting n_jobs=-1 is handy for many sklearn objects.
● -1 means using all processors.
● Or use n_jobs=n_cpus - 1, to avoid the machine locking up.

Randomized Search

Preferable when the hyperparameter search space is very large.

Main benefits:
● May yield a good configuration of hyperparameters in less time/compute than a really extensive grid search.
● Some marginal control over the computing budget via n_iter.

Other hip/trendy options (via external packages):
● hyperopt
● optuna
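A grid search sketch on the bundled iris data (the grid values are arbitrary, chosen only to keep the run fast):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [10, 30],   # coarse, logspace-style values
    "max_features": [1, 2, 4],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, n_jobs=-1,  # n_jobs=-1: use all processors
)
grid_search.fit(X, y)
print(grid_search.best_params_)   # winning combination
print(grid_search.best_score_)    # its mean cross-validated score
```

RandomizedSearchCV has a nearly identical interface, but takes distributions instead of lists and an n_iter budget.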
Ensemble Methods

"Another way to fine-tune your system is to try to combine the models that perform best… especially if the individual models make very different types of errors."

Common types of ensembling:
● Bagging or weighted averages
● Boosting
● Stacking/blending

Analyze the Best Models and Their Errors

grid_search.best_estimator_.feature_importances_

This illustrates the importance of diagnostic models! Further options for improvement, which basically amount to removing noise:
● Drop bad features
● Remove outliers

Try to find patterns in the errors. If you can spot a pattern to exploit, chances are you can finagle with the features and hyperparameters so that the model can exploit the pattern too.

"You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or getting rid of uninformative ones, cleaning up outliers, etc.)."
Evaluate Your System on the Test Set

"Run your full_pipeline to transform the data (call transform(), not fit_transform()—you do not want to fit the test set!), and evaluate the final model on the test set."

8. Launch, Monitor, and Maintain Your System

Probably better saved for the dedicated chapters with fleshed-out examples. There are hojillions of options of varying complexity: on-prem vs. cloud, managed services vs. not-so-managed.

The reality (from Andrew Ng's Coursera material), and Monica Rogati's "AI Hierarchy of Needs": https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
The "data-centric AI" sentiment periodically resurfaces, or maybe it never really went anywhere (Andrew Ng's Coursera material, yet again).
(What happened to…) 7. Present Your Solution
● What you have learned.
● What worked, what did not.
● What assumptions were made.
● What your system's limitations are.
● Document everything!
● Create nice presentations with clear visualizations and easy-to-remember statements (e.g., "the median income is the number one predictor of housing prices").

Conclusion: "In this California housing example, the final performance of the system is not better than the experts' price estimates, which were often off by about 20%, but it may still be a good idea to launch it, especially if this frees up some time for the experts so they can work on more interesting and productive tasks."

Things that don't work:
● Emailing your tech-illiterate CEO a ghastly, unformatted MS Excel spreadsheet with lots of numbers.
● Skipping the presentation altogether, appealing to your authority in all things data: "Just trust me, bro."
● Getting distracted by other fires and subsequently neglecting documentation. Ideally your notebooks and code should be at least somewhat self-documenting.
Chapter 7. Ensemble Learning and Random Forests
Nidhin Pattaniyil
Table of Contents
- Voting Classifiers
- Bagging and Pasting
- Random Forests
- Boosting
- Stacking
Introduction
- Ensemble: a group of predictors
- Ensemble learning: aggregating the predictions of a group of predictors
- Ensemble method: an ensemble learning algorithm
- Main ensemble methods: bagging, boosting, stacking
- These work best when the predictors are as independent from one another as possible
Voting Classifiers
Voting Classifiers
- Hard voting classifier: predict the class that gets the most votes
Voting Classifiers: Soft Voting
- Suppose clf1 -> [0.2, 0.8], clf2 -> [0.1, 0.9], clf3 -> [0.8, 0.2]
- With equal weights (0.33 each), the class probabilities are averaged:
  - Prob of class 0 = 0.33*0.2 + 0.33*0.1 + 0.33*0.8 = 0.363
  - Prob of class 1 = 0.33*0.8 + 0.33*0.9 + 0.33*0.2 = 0.627
- The ensemble's predicted probabilities are [36.3%, 62.7%], so soft voting predicts class 1
Example accuracies:
- Logistic Regression: 0.864
- RandomForestClassifier: 0.896
- SVC: 0.888
- VotingClassifier: 0.904
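A hedged sketch of that comparison on the make_moons dataset; exact scores depend on library versions and seeds, so don't expect the slide's numbers to reproduce exactly:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),  # probas needed for soft voting
    ],
    voting="soft",  # average predicted probabilities; "hard" = majority vote
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```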
Bagging and Pasting
Bagging and Pasting
- Use the same training algorithm, but train each predictor on a different random subset of the training set
- Two types:
  - Bagging: sampling with replacement
  - Pasting: sampling without replacement
- Each individual predictor has a higher bias than one trained on the full training set
- The ensemble has a similar bias but a lower variance than a single predictor trained on the original training set
Bagging and Pasting
- The ensemble's prediction will likely generalize better than a single Decision Tree's
Out-of-bag Dataset
- See Wikipedia: Out-of-bag error
- With bootstrap sampling, each predictor sees only part of the training set; the remaining "out-of-bag" instances can serve as a free validation set
Random Patches and Random Subspaces
- Sample features (bootstrap_features and max_features)
- Sample records (bootstrap and max_samples)
- Random Patches: sampling both training instances and features
- Random Subspaces: keeping all training instances but sampling features
- Sampling features results in even more predictor diversity, trading a bit more bias for lower variance
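A bagging sketch tying these knobs together (synthetic moons data; n_estimators and max_samples are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=100, bootstrap=True,  # bagging; bootstrap=False would be pasting
    oob_score=True,                   # evaluate on each tree's out-of-bag instances
    random_state=42, n_jobs=-1,
)
# Add max_features / bootstrap_features to get random patches or subspaces.
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # free validation estimate, no held-out set needed
```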
Random Forests
Random Forest
- An ensemble of Decision Trees trained via bagging
- Instead of a BaggingClassifier of DecisionTreeClassifiers, you can just use RandomForestClassifier
- Has (roughly) all the hyperparameters of DecisionTreeClassifier and BaggingClassifier
- At each split, it only searches for the best feature among a random subset of features
- This leads to greater tree diversity, trading higher bias for lower variance
Extra-Trees (Extremely Randomized Trees)
- Faster to train than RandomForest
- Uses random thresholds instead of searching for the best threshold at each split
Feature Importance
- For each feature, measure how much, on average, it decreases impurity across the splits that use it
- The average over all trees in the forest is the feature's importance
- It's a weighted average, where each node's weight is equal to the number of training samples associated with it
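A sketch on the bundled iris data; these are the impurity-based importances described above, normalized to sum to 1:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity decrease averaged over all trees, one score per feature.
for name, score in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```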
Boosting
Boosting
- In a Random Forest, all the trees can be trained independently
- Boosting instead trains predictors sequentially, each trying to correct its predecessor's errors
- Two popular boosting methods:
  - AdaBoost: increase the weights of misclassified instances at each iteration
  - Gradient Boosting: each new predictor is trained on the residual errors of the previous predictor
AdaBoost
Gradient Boosting (step 0)
- Goal: predict income
Reference: https://www.analyticsvidhya.com/blog/2021/03/gradient-boosting-machine-for-data-scientists/
Gradient Boosting (step 1)
- Train model 1
- Compute its predictions

Gradient Boosting (step 2)
- Using the predictions, compute the residuals
- Save model 1's predictions

Gradient Boosting (step 3)
- Train a new model whose target is the residual error from model 1
- Save this model's predictions
- Repeat for further models

Gradient Boosting
- Model 0 predicts the target
- For model 1 and above, the target is the previous model's error
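The steps above can be written out by hand with a few regression trees; this sketch uses synthetic quadratic data, and the ensemble's prediction is simply the sum of the trees' predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Model 0 predicts the target; each later model predicts the previous residuals.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
y2 = y - tree1.predict(X)                       # residuals of model 0
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y2)
y3 = y2 - tree2.predict(X)                      # residuals of model 1
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y3)

# Ensemble prediction = sum of all the trees' predictions.
X_new = np.array([[0.4]])
y_pred = sum(t.predict(X_new) for t in (tree1, tree2, tree3))
```

GradientBoostingRegressor automates this loop (plus a learning rate that shrinks each tree's contribution).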
Gradient Boosting
- XGBoost, LightGBM, and CatBoost are other popular gradient boosting libraries
- Gradient boosting is also used for ranking
Stacking
Stacking
- Instead of using hard voting, train a model (the "blender") to perform the aggregation
- Training:
  - Create a hold-out set (split the training data in two)
  - Train the first-layer classifiers on split 1
  - Get their predictions on split 2 and use those as training features
  - The blender is trained on the first layer's predictions
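Scikit-Learn's StackingClassifier automates that hold-out/blender procedure via internal cross-validation; a sketch on synthetic moons data (estimator choices are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the "blender"
    cv=5,  # blender trains on out-of-fold predictions of the first layer
)
stack_clf.fit(X, y)
print(stack_clf.score(X, y))
```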
Summary
- Ensemble methods: bagging / boosting / stacking
- Voting: hard or soft
- Sample training data / sample features
- Random Forests: bagged tree ensembles; feature importance, OOB score
- Boosting: AdaBoost / Gradient Boosting
- Stacking: a model to perform the aggregation
DIMENSIONALITY REDUCTION
CHAPTER 8 – HANDS-ON MACHINE LEARNING WITH SCIKIT-LEARN, KERAS, AND TENSORFLOW
DIMENSIONALITY REDUCTION – PROS AND CONS
• Pros
  • Removes correlated features
  • Improves model efficiency
  • Reduces overfitting
  • Improves visualization
• Cons
  • PCA is a linear algorithm and does not work well for polynomial or other complex functions
  • Can lead to inefficiencies after reduction if we don't choose the right number of dimensions to eliminate
  • Less interpretability
  • Preserves global shapes rather than local shapes
(Chaitanya Narava, 2020, "A Complete Guide on Dimensionality Reduction", Analytics Vidhya | Medium)
CURSE OF DIMENSIONALITY
• As it relates to the Copy-Move Forgery project
  (Anuja Dixit and R. K. Gupta, 2016, "Copy-Move Image Forgery Detection: a Review", IJIGSP)
• Has anyone else run into the curse of dimensionality in other types of projects?
MAIN APPROACHES FOR DIMENSIONALITY REDUCTION
• Projection – works well when many features are nearly constant and many are highly correlated
• Manifold Learning – e.g. the Swiss Roll dataset
• Assumptions:
  • Most real-world, high-dimensional datasets lie close to a much lower-dimensional manifold
  • The task will be simpler if expressed in this lower-dimensional space
PRINCIPAL COMPONENT ANALYSIS HYPERPARAMETERS
• Randomized PCA – a faster approximation than the "full" SVD solver when the target dimensionality d is much smaller than n
• Incremental PCA – does not require the full training set to fit in memory for the algorithm to run
• Kernel PCA (kPCA) – allows you to perform nonlinear projections
DIFFERENT TYPES OF DIMENSIONALITY REDUCTION TECHNIQUES
• Principal Component Analysis (PCA) – the most popular
  • Hyperparameters/variants: Randomized PCA, Incremental PCA, Kernel PCA (kPCA)
• Locally Linear Embedding (LLE) – an unsupervised, nonlinear Manifold Learning method that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data
• Random Projections – project the data to a lower-dimensional space using a random linear projection
• Multidimensional Scaling (MDS) – a linear method that transforms the given matrix into a low-dimensional matrix based on the distances between elements
• Isomap – a nonlinear Manifold Learning method, better than linear methods when dealing with real images and motion tracking
• t-Distributed Stochastic Neighbor Embedding (t-SNE) – nonlinear and more robust to outliers
• Linear Discriminant Analysis (LDA) – a linear technique, similar in spirit to ANOVA, that builds feature combinations based on differences between classes rather than similarities
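A PCA sketch on the bundled iris data; passing a float as n_components keeps just enough components to explain that fraction of the variance (the 0.95 threshold here is an arbitrary but common choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Float n_components = target explained-variance ratio;
# svd_solver="randomized" would trade a little accuracy for speed.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])             # dimensions kept
print(pca.explained_variance_ratio_)  # variance carried by each component
```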
QUESTIONS
Hands-On Machine Learning
Classification – Vibhu Sapra
"Hello World of ML" – Overview
▸ Explore the MNIST dataset
▸ Binary classification
▸ Performance metrics
▹ Cross-validation, confusion matrix, precision & recall, ROC
▸ Multiclass classification
▸ Next steps
MNIST Dataset: Data Exploration
Hello World of ML
Dataset
▸70,000 handwritten digits from 0–9
 ▹60k train, 10k test, pre-split
▸Each image has 784 features
 ▹28 × 28 pixels
 ▹Each feature is a pixel intensity from 0 (white) to 255 (black)
▸Benchmark for many models
▸Balanced samples per class
▸Not perfect – don't expect 100%
MNIST Examples (figure: most samples vs. confusing samples)
Binary Classification: Simple Introduction
Binary Classifier
▸Deciding between 2 groups
 ▹Spam or not spam
 ▹Is this digit the #5 or not the #5?
▸Many approaches
 ▹Logistic Regression, KNN, Decision Trees, Random Forest, NN, etc.
▸Performance metrics are tricky
 ▹Accuracy – problematic
 ▹Confusion Matrix
 ▹Precision & Recall
 ▹ROC Curve
Example Model
SGDClassifier
▸Author doesn't really explain this until Chapter 4
▸Linear model fit to the data
▸Pass a training sample through a linear function and predict 1 or 0
▸Look at the actual label, adjust the function accordingly, and repeat
▸More in Chapter 4
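The predict-compare-adjust loop the slide describes can be sketched as a tiny perceptron-style learner in plain Python. This is a toy stand-in for intuition only, not scikit-learn's `SGDClassifier` (which minimizes a loss function with learning-rate schedules); the data and hyperparameters below are made up.

```python
# One linear unit: predict 1 if w.x + b > 0, else 0.
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Stochastic updates: for each sample, nudge the weights toward the
# correct answer whenever the prediction is wrong, and repeat.
def train(samples, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(w, b, x)  # 0 if correct, +/-1 if wrong
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# Linearly separable toy data: label 1 when x0 + x1 is large.
data = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0),
        ([2, 2], 1), ([3, 1], 1), ([1, 3], 1)]
w, b = train(data)
preds = [predict(w, b, x) for x, _ in data]
```

On this separable data the loop converges within a few epochs and classifies every training sample correctly.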
Model Evaluation
▸The majority of this chapter
▸K-fold Cross-Validation
 ▹Introduced in Chapter 2
▸Train the model K times, each time on K−1 folds of the data
 ▹Test on the held-out fold each time
▸More model evaluation in Chapter 4
Cross-Validation (figure)
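The fold-splitting idea can be sketched in a few lines of plain Python (a minimal illustration of what utilities like scikit-learn's `KFold` do, not their actual implementation): every index lands in exactly one held-out test fold, and the rest form the training set for that run.

```python
# Split n_samples indices into k contiguous (train, test) index pairs.
def k_fold_splits(n_samples, k):
    # Distribute any remainder across the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_splits(10, 3)  # 3 (train, test) pairs over 10 samples
```

Each of the K models is then fit on `train` and scored on `test`; averaging the K scores gives the cross-validation estimate.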
Performance Measures
Class imbalance
▸Can't always use accuracy
 ▹Predicting "not 5" for everything still scored > 90% accuracy
▸Class imbalance
 ▹90% of our samples were not 5
 ▹10% of our samples were 5
▸Try other methods
 ▹Confusion Matrix
 ▹Precision & Recall
 ▹ROC Curve
Confusion Matrix (figure)
Confusion Matrix
              Predicted not-5   Predicted 5
Actual not-5  TN = 53,892       FP = 687
Actual 5      FN = 1,891        TP = 3,530
Precision / Recall
Precision / Recall
▸Precision = accuracy of the positive predictions
 ▹Precision = TP / (TP + FP)
 ▹Precision = 3530 / (3530 + 687) = 0.8371
▸Recall (aka sensitivity) = true positive rate (TPR) = ratio of positives correctly identified
 ▹Recall = TP / (TP + FN)
 ▹Recall = 3530 / (3530 + 1891) = 0.6512
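The two formulas above can be checked directly from the confusion matrix counts (TN = 53,892, FP = 687, FN = 1,891, TP = 3,530):

```python
# Counts from the slide's confusion matrix for the 5-detector.
TP, FP, FN = 3530, 687, 1891

precision = TP / (TP + FP)  # how trustworthy the positive predictions are
recall = TP / (TP + FN)     # how many actual 5s were caught

print(round(precision, 4), round(recall, 4))  # 0.8371 0.6512
```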
Precision / Recall: Basic Example
Precision & Recall (cats vs. dogs figure)
Precision
- Model to predict dogs
- Accuracy of the positive predictions
Recall
- Model to predict dogs
- Ratio of actual dogs the model correctly identifies
Precision (cats vs. dogs figure)
- Model to predict dogs
- Only look at the dogs side
- Precision = how accurate the model is at predicting its class
- Precision = 3/5 = 60%
- Our model was trying to accurately predict whether an image was a dog or not a dog (a cat in this case) and had a precision of 60%
Recall (cats vs. dogs figure)
- Model to predict dogs
- Ignore all the cats
- Recall = total correctly predicted / total dogs
- Recall = 3/4 = 75%
- Our model was trying to predict whether a dog was a dog and had a recall of 75%
Precision / Recall
▸Precision = the dog side
 ▹Accuracy at predicting the #5
 ▹Precision = 3530 / (3530 + 687) = 0.8371
▸Recall = all the dogs
 ▹Percent of all 5s correctly labeled
 ▹Recall = 3530 / (3530 + 1891) = 0.6512
  ■Doesn't look so great anymore
Precision / Recall
▸Trade-off between precision and recall
 ▹With precision – make sure what you call positive is actually positive
 ▹With recall – make sure you're not missing positive observations
 ▹As one increases, the other decreases
 ▹Metrics like the F1 score combine them both
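The trade-off shows up as soon as you move the decision threshold on a classifier's scores. A toy sketch with made-up scored examples (raising the threshold here raises precision and lowers recall):

```python
# (score, true_label) pairs from an imaginary classifier.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0), (0.4, 0)]

def precision_recall(scored, threshold):
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

p_low, r_low = precision_recall(scored, 0.55)    # permissive threshold
p_high, r_high = precision_recall(scored, 0.85)  # strict threshold
```

With the low threshold this gives precision 0.75 and recall 1.0; with the high threshold, precision 1.0 and recall 1/3. Plotting this over all thresholds is exactly the precision/recall curve the book draws.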
F1 Score
▸Harmonic mean of precision and recall
 ▹Gives more weight to low values
 ▹You only get a high F1 score if both are high
 ▹Typically precision & recall are similar
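Applying the harmonic mean to the 5-detector's own numbers shows the "more weight to low values" point: the F1 score lands below the simple average of precision and recall.

```python
# Same confusion-matrix counts as before.
TP, FP, FN = 3530, 687, 1891
precision = TP / (TP + FP)  # ~0.8371
recall = TP / (TP + FN)     # ~0.6512

# F1 = harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7325
```

The arithmetic mean of 0.8371 and 0.6512 would be about 0.744; the harmonic mean pulls the score down toward the weaker metric (recall).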
Optimization Guide
Precision / Recall / F1
▸So do I want to improve F1, precision, or recall?
▸Depends on the situation
 ▹Classifier to detect whether videos are safe for kids?
  ■Reject many good videos (low recall) but keep only safe ones (high precision)
 ▹Classifier to detect shoplifters?
  ■May give false positives (low precision) but catches nearly all thieves (high recall)
ROC Curves
Skipping Sections
▸Only have 45 min – a few really dense topics (come back in a few weeks?)
 ▹ROC / AUC curves
  ■Sensitivity vs. specificity
 ▹Hard to follow without a real model / concrete examples
  ■Should come back to these in another session after having done a few examples
  ■MNIST is not a great example for understanding ROC/AUC
  ■Doesn't really help understand our current example
Skipping Sections (cont.)
▸Multiclass OvR and OvO classifiers
 ▹One-vs-One / One-vs-Rest
▸Multiclass error analysis
Various Classifications: Simple Introduction
Multiclass Classification
▸Let's predict 0–9, not just #5
▸Many approaches
 ▹Logistic Regression, Random Forest, Naive Bayes, NN, etc.
 ▹Some won't work, like SGD or SVM – they're binary-only
  ■You can train one binary classifier per digit/class and output the highest-confidence prediction if you want to use these approaches
Multilabel Classification
- Let's output multiple labels per digit
- Outputs a binary [1/0] value per label
- First check whether a digit is a 7, 8, or 9; second check whether it's even
- Many approaches – KNN, Random Forest, NN, etc.
Multioutput Classification
- Each label can have multiple possible values
- Example: remove noise from MNIST images
- Each of the 28×28 outputs is a pixel value from 0–255 (multiple possible values per pixel)
- Many approaches – KNN, Random Forest, NN, etc.
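Building the multilabel target the slide describes is a one-liner per digit; each example gets a list of binary labels rather than a single class (a sketch of the target format only, with the slide's two checks: "is it 7, 8, or 9?" and "is it even?"):

```python
# Two binary labels per digit, as on the slide.
def multilabel(digit):
    return [int(digit >= 7), int(digit % 2 == 0)]

targets = [multilabel(d) for d in [3, 8, 7, 4]]
print(targets)  # [[0, 0], [1, 1], [1, 0], [0, 1]]
```

A multilabel-capable classifier (e.g. KNN) is then fit against this list-of-lists target instead of a single label column.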
Next Time
Skipped Sections
▸ROC / AUC
▸Precision / Recall trade-off curves
▸OvO / OvR classifiers
▸Multilabel classification (time)
▸Multioutput classification (time)
Discussion led by: Kalika Kay Curry (book by Aurélien Géron)
CHAPTER 9: Unsupervised Learning
Led by: Kalika Kay Curry (Hands-On ML, Aurélien Géron)
Unsupervised Learning: Modeling Unlabeled Data
Why Unsupervised Learning
•Unsupervised learning – high potential
•Works to cluster/compare unlabeled instances (manufacturing example)
•Dimensionality reduction is the most common unsupervised learning method
Types of Unsupervised Learning
•Clustering – similar objects are grouped together
•Anomaly detection – learn what's normal to find what's abnormal
•Density estimation – estimate the probability density function (PDF) of the random process that generated the dataset; used for anomaly detection, analysis, and visualization
Part I: Clustering – focused on clustering-based unsupervised learning models
Why Cluster
•Customer segmentation
•Data analysis
•Anomaly detection (outlier detection)
•Search engines
•Dimensionality reduction (preprocessing)
 •Include the clusterer in the pipeline
 •GridSearchCV for the best k value
•Semi-supervised learning
 •Train on a dataset labeled from clusters
•Image segmentation
 •Color segmentation
Types of Clustering Models
•K-Means
•DBSCAN
•Agglomerative – hierarchical; a connectivity matrix is required for large datasets
•BIRCH – hierarchical; designed for large datasets
•Mean-Shift – not suited for large datasets; similar to DBSCAN
•Affinity Propagation – not suited for large datasets; uses a voting mechanism
•Spectral – a combined unsupervised learning method: pairs an embedded dimensionality reduction with another clustering step; used for complex data structures and to cut graphs
K-Means History
•Best described with a Voronoi diagram (right)
•AKA Lloyd–Forgy
•Proposed in 1957 by Lloyd
•Published independently in 1965 by Forgy
•In 2006, Arthur and Vassilvitskii introduced K-Means++, a smarter and faster way of initializing centroids, and it is now the default
•There are a couple of other varieties, such as Mini-Batch K-Means: faster, but usually ends with slightly higher inertia
•Requires the n_clusters parameter (k)
K-Means Usage
○Reminder: scale the data
○Choosing the cluster count
 ▪Inertia elbow chart
 ▪Silhouette line graph
 ▪Silhouette diagram
○Limited solutions
 ▪We're all a little limited
 ▪Ladybug example from image segmentation
 ▪Does not perform well with:
  •Non-spherical shapes (clusters that aren't roughly circular blobs)
  •Different densities
  •Varying sizes
•Inertia identifies the best solution (best centroid locations)
 •Mean squared distance between each instance and its closest centroid
•Getting lucky: centroid initialization methods (risk mitigation)
 •Random centroid initialization can produce suboptimal solutions, since convergence depends on the random starting points
 •K-Means++ algorithm
 •n_init determines how many random initializations to try; the run with the lowest inertia wins
 •If you already know good initialization points, set the init parameter manually with n_init set to one
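The assign-then-recenter loop behind K-Means, plus the inertia it minimizes, can be sketched in 1-D pure Python (a toy sketch of Lloyd's algorithm with a fixed initialization, not scikit-learn's KMeans; the data below is made up):

```python
# Lloyd's algorithm in 1-D: assign each point to its nearest centroid,
# move each centroid to its cluster's mean, repeat until stable.
def lloyd(points, centroids, max_iter=100):
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:  # assignment stopped changing
            break
        centroids = new_centroids
    # Inertia: sum of squared distances to the closest centroid.
    inertia = sum(min(abs(p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

centroids, inertia = lloyd([1, 2, 3, 10, 11, 12], [1.0, 12.0])
print(centroids, inertia)  # [2.0, 11.0] 4.0
```

A different starting pair of centroids could converge to a worse (higher-inertia) solution, which is exactly why `n_init` reruns the whole loop from several random initializations and keeps the best.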
Types of Clustering
Hard Clustering and Soft Clustering
•Hard clustering
 •Each instance is assigned to exactly one cluster
•Soft clustering
 •Assigns each instance a score per cluster
 •Can be the distance from each centroid, found with the transform method, which returns the Euclidean distance of an instance from each centroid
 •Can be a similarity/affinity score, such as the Gaussian radial basis function (chapter 5)
•Note: the similarity/affinity function is not worked as an example in the author's GitHub solutions/notebook
•Using a similarity to each cluster as a feature means the number of new features can grow drastically with an extremely large training set. What does this mean for dimensionality reduction when we're working with clustering? Can we expect the similarity/affinity results to be overkill or over-extensive?
Conversely to a distance transformation, we can use a similarity function.
DBSCAN
https://scikit-learn.org/stable/modules/clustering.html#dbscan
•Finds continuous regions of high density
•Epsilon (ε) neighborhood: the set of instances within a small distance ε of a given instance
•min_samples: if an instance has at least that many samples in its ε-neighborhood, it's a core instance
•If you're in a core instance's neighborhood, you belong to the same cluster; a long chain of neighboring core instances forms a single cluster
•If an instance is not a core instance and is not in any neighborhood, it's an outlier, labeled -1
•No predict method; a classifier is better at predicting which cluster new data may belong to (KNN example to the right)
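The core-instance / neighborhood-chaining / outlier logic above can be sketched in 1-D pure Python (a toy illustration of the DBSCAN idea, not scikit-learn's implementation; the data, eps, and min_samples are made up):

```python
# Indices of all points within eps of point i (1-D distance).
def neighbors_of(points, i, eps):
    return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

def dbscan(points, eps, min_samples):
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        hood = neighbors_of(points, i, eps)
        if len(hood) < min_samples:
            labels[i] = -1           # provisionally an outlier
            continue
        cluster += 1                 # start a new cluster at this core instance
        labels[i] = cluster
        queue = [j for j in hood if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            hood_j = neighbors_of(points, j, eps)
            if len(hood_j) >= min_samples:  # j is core too: keep chaining
                queue.extend(k for k in hood_j if labels[k] is None)
    return labels

labels = dbscan([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 50.0],
                eps=1.5, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The two dense runs become clusters 0 and 1 through chains of core instances, while the isolated point at 50 ends up labeled -1.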
Part II: Gaussian Mixture Modeling (GMM)
What is Gaussian Mixture Modeling?
•A model that uses soft clustering (returns the probability of an instance belonging to each cluster)
•Assumes the instances were generated from a mixture of Gaussian distributions with unknown parameters
•Clusters are ellipsoidal in shape, as opposed to the dense spherical blobs seen with K-Means
•Similar to the K-Means algorithm; uses the Expectation-Maximization algorithm, which is where the soft clustering comes from
•Also like K-Means, subject to poor convergence, so several initializations may be required to reach the best solution. This is controlled by the n_init parameter, whose default is 1, so be sure to check in on this one
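The soft-clustering probabilities come from weighted Gaussian densities. A pure-Python sketch of the E-step idea for two 1-D components (the means, standard deviations, and mixing weights below are fixed by assumption rather than learned by EM):

```python
import math

# Density of a 1-D Gaussian at x.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Responsibility of each component for x: its weighted density, normalized.
def responsibilities(x, params):  # params: list of (weight, mu, sigma)
    dens = [w * gaussian_pdf(x, mu, s) for w, mu, s in params]
    total = sum(dens)
    return [d / total for d in dens]

# Two equally weighted components centered at 0 and 5.
r = responsibilities(1.0, [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)])
```

The point 1.0 sits much closer to the first component, so its responsibility is near 1 for component 0 and near 0 for component 1; this is what GaussianMixture's `predict_proba` returns per instance and per cluster.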
Funny Pictures (Computational Complexity), pages 261 and 271
•Diagrams on pages 261 (Gaussian) and 271 (Bayesian Gaussian) give a visual representation of what's going on with the model, for those interested
•The covariance_type hyperparameter increases the chances of finding an optimal solution. The default is "full", which means each cluster can take on any shape or size
•The covariance_type hyperparameter also impacts computation speed: "tied" and "full" take the longest; "spherical" and "diag" are a bit faster
Anomaly Detection
•GMM can detect anomalies (outliers): instances that deviate from the norm
•Look at the estimated density at each instance, returned by score_samples; the anomalies are the instances whose density falls below an agreed-upon threshold
•A Gaussian mixture assumes roughly normal clusters, so when an extreme number of outliers is present, the EllipticEnvelope class is a more robust alternative: it fits a robust Gaussian (robust covariance estimate) to the data and flags instances far from its center as outliers
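The threshold-on-density idea can be sketched in pure Python with a single Gaussian fit by mean and standard deviation (a toy stand-in for thresholding GaussianMixture's `score_samples`; the data and the "lowest ~10%" cutoff are assumptions for illustration):

```python
import math

# Mostly tightly grouped values plus one obvious outlier.
data = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 5.0]

# Fit a single Gaussian by mean and (population) standard deviation.
mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

# Estimated density at each instance.
density = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
           for x in data]

# Flag instances in roughly the lowest 10% of densities as anomalies.
k = max(1, len(data) // 10)
threshold = sorted(density)[k - 1]
anomalies = [x for x, d in zip(data, density) if d <= threshold]
print(anomalies)  # [5.0]
```

The book's version does the same thing with a fitted mixture: compute `score_samples`, pick a low percentile as the threshold, and call everything below it an anomaly.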
Scoring (AIC / BIC)
•The Akaike and Bayesian information criteria (AIC and BIC, respectively) are scores used to assist with cluster-count selection
•The lower the score, the better the choice: both reward fit and penalize model complexity
•Another knob is the covariance_type hyperparameter: "spherical" is faster but doesn't fit the data as well
•Bayesian Gaussian Mixture Models also help resolve the number-of-clusters conundrum by providing per-cluster weights, but BGMM comes with many constraints, one of which is a restriction to ellipsoidal shapes
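The two criteria are simple formulas over the maximized log-likelihood (logL), the number of learned parameters (p), and the number of instances (m); the numbers below are made up purely to compare the penalties:

```python
import math

# AIC = 2p - 2*logL; BIC = ln(m)*p - 2*logL.
def aic(log_likelihood, p):
    return 2 * p - 2 * log_likelihood

def bic(log_likelihood, p, m):
    return math.log(m) * p - 2 * log_likelihood

m, logL = 1000, -500.0  # hypothetical dataset size and fit

# Adding 10 parameters with no likelihood gain:
aic_penalty = aic(logL, 20) - aic(logL, 10)       # 20
bic_penalty = bic(logL, 20, m) - bic(logL, 10, m) # 10 * ln(1000) ~ 69
```

Because ln(m) > 2 whenever m > e² (about 8 instances), BIC penalizes extra clusters harder than AIC, so it tends to select simpler models; `GaussianMixture` exposes both via its `aic()` and `bic()` methods.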
Thank You
Kalika Kay Curry
Kalika_kay@hotmail.com