Fundamentals of Machine Learning
Let's face it - computing was created to analyze data. You rarely, if ever, have a program looking for data. Rather, you have data looking for code to analyze it. Machine learning represents the state of the art in making sense of data. Unfortunately, for many years it has been out of reach for the common developer - until now.
This is perhaps one of the highest paid and most sought-after skills today. No question about it - this is the place to really make it big as a developer.
Figure 1: The world of machine learning
Machine learning represents the logical extension of simple data retrieval and storage. It is about developing building blocks that make computers learn and behave more intelligently.
Machine learning makes it possible to mine historical data and make predictions about future trends. Without realizing it, you are probably already using the benefits of machine learning. Search engine results, online recommendations, ad targeting, fraud detection, and spam filtering are all examples of what is possible with machine learning.
Machine learning is about making data-driven decisions. While instinct might be important, it is difficult to beat empirical data.
The many facets of machine learning
Once you dive deep into the subject, you start addressing topics such as:
Supervised and unsupervised learning
Classification
Markov models and Bayesian networks and much more
Mahout and Hadoop
The Apache Mahout project's goal is to build a scalable machine learning library.
There is some degree of overlap with big data analytics within a Hadoop
There is an entire machine learning open-source project that you can get for free with Hadoop. You can learn more here:
Mahout includes algorithms for clustering, classification, and collaborative filtering. You can also find:
Matrix factorization based recommenders
K-Means, Fuzzy K-Means clustering
Latent Dirichlet Allocation
Singular Value Decomposition
Logistic regression classifier
(Complementary) Naive Bayes classifier
Random forest classifier
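To give a feel for what one of these algorithms does, here is a minimal sketch of K-Means clustering in plain Python. It is not Mahout's API (Mahout is a Java library built for much larger data sets); the function name kmeans, its parameters, and the sample points are all made up for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with the basic K-Means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments


# Illustrative data: two obvious groups, one near (1, 1) and one near (8, 8).
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.8)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

The algorithm is never told which group a point belongs to; it discovers the grouping from the distances alone, which is what makes clustering an unsupervised technique.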
My alma mater is UC Berkeley, and they offer many awesome courses there.
I wish I had more time. I would seriously consider taking this free MIT online class, which you can find here:
Azure is democratizing machine learning
Historically, machine learning has required complex software and high-end computers. This field of computing required a seasoned data scientist. What's been needed is a fully managed cloud service for this form of machine learning, also known as predictive analytics.
Welcome To ML Studio
MAML (Microsoft Azure Machine Learning) is an Azure service. It is a web application that includes a studio called ML Studio. You create experiments with this web application that represent your machine learning activities.
A visual composition surface is used to create a machine learning workflow. The design surface of the web app allows you to add modules. Additional modules can be authored in R.
The point of a visual design surface is to remove the complexity of creating algorithms, cleaning data, and finding features.
There are two phases to using MAML. The first phase is the experiment. That is where you start with the data and begin to clean it up. This is going to take 60% to 70% of the total time. In this phase, you will be combining data, removing rows, and eliminating columns. You will also take your model and train it. From there, the output will be scored and evaluated.
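The experiment phase can be sketched in a few lines of plain Python. This is only an illustration of the clean, train, score, and evaluate sequence, not what ML Studio runs internally; the data, the column meanings, and the choice of a straight-line model are all assumptions.

```python
# Hypothetical raw data: (advertising spend, sales). None marks a missing value.
raw_rows = [
    (1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2),
    (4.0, 8.1), (5.0, None), (6.0, 11.8), (7.0, 14.1),
]

# 1. Clean: remove rows that have a missing value.
rows = [(x, y) for x, y in raw_rows if x is not None and y is not None]

# 2. Split: hold out the last two rows so the model is judged on unseen data.
train, test = rows[:-2], rows[-2:]


# 3. Train: fit y = slope * x + intercept by ordinary least squares.
def fit_line(data):
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# 4. Score: produce predictions for the held-out rows.
predictions = [slope * x + intercept for x, _ in test]

# 5. Evaluate: mean absolute error between predictions and actual values.
mae = sum(abs(p - y) for p, (_, y) in zip(predictions, test)) / len(test)
print("slope:", slope, "intercept:", intercept, "MAE:", mae)
```

Even in this toy version, most of the lines deal with preparing and splitting the data rather than with the model itself.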
In phase 2 you will operationalize it, which means it will be put behind a web service. This will allow you to connect your machine learning model to other business processes. This is the real magic of the Azure Machine Learning offering. Operationalizing your models and exposing them to your business is a key step and is often extremely difficult with other approaches. Operationalizing Azure Machine Learning is extremely simple.
Using simple drag-and-drop gestures along with some data flow graphs, you are able to set up experiments and take advantage of sophisticated algorithms without writing code.
There is a pool of VMs running machine learning algorithms using an orchestration engine, freeing the data scientist from moving data around and switching between different services.
ML Studio targets emerging data scientists. You can train 10 models in minutes, not days. You can put a predictive model into production in minutes, not weeks or months. Some customers are reporting a 10X-100X reduction in cost relative to the competition. I invite readers to go get some pricing for SAS. See https://www.sas.com/en_us/software/analytics/rapid-predictive-modeler.html.
These models can also be shared with other parts of a company. Employees can create their own workspaces, giving re-use and cross-teaming. The models can be locked as well, allowing them to be reused but not modified. In other words, these can be immutable models, allowing sharing and innovation but not breaking what is considered ‘golden.’
The predictive models can be shared as a service across an enterprise, leveraging Azure as the public cloud back end. The average wait from one service in Azure to another is between 50 ms and 100 ms. This is very fast and will allow companies to leverage machine learning back ends running predictive models from other services in Azure. For example, you can write JSON-based back ends that leverage your predictive models, allowing you to build decision-making dashboards for your business.
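As a rough idea of what a JSON-based back end talking to a predictive model might look like, here is a sketch using only the Python standard library. The URL, the API key, the feature names, and the shape of both the request and the response are placeholders invented for this example; the real format is whatever your published web service documents.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and key of your own service.
SCORING_URL = "https://example.invalid/score"
API_KEY = "replace-with-your-key"


def build_scoring_request(features):
    """Package one row of feature values as a JSON POST request."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


def parse_scoring_response(raw_bytes):
    """Pull the prediction out of a JSON reply (assumed response shape)."""
    return json.loads(raw_bytes.decode("utf-8"))["prediction"]


request = build_scoring_request({"age": 42, "balance": 1300.0})
# Sending it for real would be: urllib.request.urlopen(request).read()

# A made-up reply, standing in for what the service might send back.
sample_reply = b'{"prediction": 0.87}'
prediction = parse_scoring_response(sample_reply)
print(prediction)
```

A dashboard or another business process would only need these two functions; the model itself stays behind the web service.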
Machine learning algorithms are built to continually improve over time by leveraging training sets. Training sets make it possible to continually improve the robustness of your predictive model.
Data Scientists Code in R
R is a popular open source programming environment for statistics and data mining. The good news is that it is easily integrated into ML Studio. I have a lot of friends using functional languages for machine learning, such as F#. It's pretty clear, however, that R is dominant in this space.
Polls and surveys of data miners show that R's popularity has increased substantially in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers, the creator of the S language that R implements, is a member. R is named partly after the first names of the first two R authors. R is a GNU project and is written primarily in C, Fortran, and R itself.
Data Analytics
Below is a framework that provides a way for you to think about the predictive nature of machine learning. It's all about providing insight into business decisions where limited resources are applied to grow revenue or limit expenses. This might include insights into consumer spending patterns or into optimizing the supply chain.
How to think about the analytics spectrum
One great way to think about machine learning is to break down analytics into 3 questions:
What happened?
- Historical
What will happen?
- Predictive
What should I do next?
- Prescriptive
How to think of the personas doing analytics
The information worker
Typically takes a self-service approach using Power BI.
- Power BI for Office 365 is a self-service business intelligence (BI) solution delivered through Excel and Office 365 that provides information workers with data analysis and visualization capabilities to identify deeper business insights about their data.
IT professionals
- Involved in data transformation, data warehousing, creating data marts and cubes for analytics, and data modeling
- Work for GMs or directors
Data scientists
Deeply technical and skilled not just with code, but with mathematics, statistics, and probability
Can use a variety of techniques to apply probability to predictions (for example, there is a 42% chance that prices will go up in the next 18 hours)
Techniques include Monte Carlo simulations and parameterizing the model
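A statement like "there is an X% chance that prices will go up in the next 18 hours" can come out of a Monte Carlo simulation. The sketch below is one way to do it, under a deliberately simple assumed model: each hour the price changes by a random percentage drawn from a normal distribution. The function name, the parameters, and the values passed in are illustrative, not taken from any real data.

```python
import random


def probability_price_rises(start_price, hours, hourly_volatility,
                            hourly_drift=0.0, trials=10_000, seed=1):
    """Estimate the chance that the price ends above where it started."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    rises = 0
    for _ in range(trials):
        price = start_price
        for _ in range(hours):
            # Assumed model: hourly percentage change ~ Normal(drift, volatility).
            price *= 1.0 + rng.gauss(hourly_drift, hourly_volatility)
        if price > start_price:
            rises += 1
    # The fraction of simulated futures in which the price went up.
    return rises / trials


estimate = probability_price_rises(
    start_price=100.0, hours=18, hourly_volatility=0.01
)
print("Estimated chance the price is higher after 18 hours:", estimate)
```

The parameters are where the modeling happens: changing hourly_drift or hourly_volatility changes the answer, which is why choosing and fitting those values is a large part of a data scientist's work.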
What to look for in a data scientist
Domain Knowledge
Clear Understanding Of The Scientific Method
- Objectivity, Hypothesis, Validation, Transparency
Strong in Math and Statistics
Intellectual Curiosity and Critical Thinking
Visualization and Communication
Advanced Computing And Data Management
Academic backgrounds
If you went to school to study to be a data scientist, what courses would you take?
Applied Mathematics
Computer Science
Econometrics
Statistics
Engineering
Industries that really benefit from data science
Financial Services
Telecommunications
Information Technology
Manufacturing
Utilities
Healthcare
Marketing
Some video help
Some Key Links
Wrapping up
This post provided a high-level view of some of the characteristics and concepts of machine learning. In the next post, we will start playing around with the Azure portal.
Figure 2: The Azure Portal