Learn Data Science - 05 Data etc.

Github
Programming

6 Data Visualization

Open the .R scripts in RStudio to run them line by line.

See 10 Toolbox/3 R, RStudio, Rattle for installation.

1 Data exploration in R

In mathematics, the graph of a function f is the collection of all ordered pairs (x, f(x)). If the function input x is a scalar, the graph is a two-dimensional graph, and for a continuous function is a curve. If the function input x is an ordered pair (x1, x2) of real numbers, the graph is the collection of all ordered triples (x1, x2, f(x1, x2)), and for a continuous function is a surface.
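The definition above can be made concrete with a minimal sketch that samples a function at a few inputs and collects the ordered pairs (or triples). The helper names `graph` and `graph2` and the sample functions are illustrative assumptions, not part of the original scripts.

```python
def graph(f, xs):
    """Sampled graph of f: the ordered pairs (x, f(x)) for each scalar x in xs."""
    return [(x, f(x)) for x in xs]


def graph2(f, points):
    """Sampled graph of a two-input f: the ordered triples (x1, x2, f(x1, x2))."""
    return [(x1, x2, f(x1, x2)) for (x1, x2) in points]


# Scalar input: the pairs lie on a curve (here a parabola).
curve = graph(lambda x: x * x, [-2, -1, 0, 1, 2])

# Pair input: the triples lie on a surface (here a plane).
surface = graph2(lambda a, b: a + b, [(0, 0), (1, 2)])

print(curve)
print(surface)
```

Plotting the pairs in `curve` gives a two-dimensional graph; plotting the triples in `surface` requires a third axis.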

2 Uni, bi and multivariate viz

Univariate

The term is commonly used in statistics to distinguish a distribution of one variable from a distribution of several variables, although it can be applied in other ways as well. For example, univariate data are composed of a single scalar component. In time series analysis, the term is applied with a whole time series as the object referred to: thus a univariate time series refers to the set of values over time of a single quantity.

Bivariate

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis.[1] It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.

Multivariate

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest.

3 ggplot2

About

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

http://ggplot2.org/

Documentation

Examples

http://r4stats.com/examples/graphics-ggplot2/

4 Histogram and pie (Uni)

About

Histograms and pie charts are two types of graphs used to visualize frequencies.

A histogram shows the distribution of these frequencies over classes, and a pie chart shows the relative proportion of each frequency as a slice of a 100% circle.
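As a rough sketch of the numbers behind both charts, the snippet below counts values per class (the bar heights of a histogram) and converts the counts to percentages (the slices of a pie). The `ages` data, the class width and the function names are made up for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width, start=0):
    """Count values per class [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // bin_width for v in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}


def pie_shares(counts):
    """Convert class counts to percentages of the whole circle."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Hypothetical data: ages of 8 clients, grouped in classes of 10 years.
ages = [21, 25, 32, 38, 39, 41, 47, 52]
counts = histogram_counts(ages, bin_width=10, start=20)
shares = pie_shares(counts)

print(counts)  # bar heights, keyed by the lower bound of each class
print(shares)  # pie slices in percent
```

The keys of `counts` are the lower bounds of the classes; the values in `shares` always add up to 100.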

5 Tree & tree map

About

Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
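To illustrate the "area proportional to a dimension" idea, here is a minimal sketch of a one-level slice layout: a rectangle is cut into vertical slices whose areas follow the values. Real treemap libraries typically use more elaborate tilings (for example squarified layouts); the function name and the sample volumes are assumptions.

```python
def slice_layout(values, x, y, width, height):
    """Cut the rectangle (x, y, width, height) into vertical slices
    whose areas are proportional to the (positive) values."""
    total = sum(values.values())
    rects = {}
    offset = x
    for name, value in values.items():
        w = width * value / total
        rects[name] = (offset, y, w, height)  # (x, y, width, height)
        offset += w
    return rects


# Hypothetical sales volumes per product universe.
volumes = {"liquid": 50, "fresh": 30, "grocery": 20}
rects = slice_layout(volumes, x=0, y=0, width=100, height=60)

for name, (rx, ry, rw, rh) in rects.items():
    print(name, "area =", rw * rh)
```

Applying the same function again inside one slice, with the values of its sub-branches, gives the nested rectangles described above.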

When to use it?

  • Fewer than 10 branches.
  • Positive values.
  • Space for visualisation is limited.

Example

treemap-example

This treemap shows the sales volume of each product universe as a rectangle of proportional area. Liquid products sell more than the others. To explore further, we can drill down into the “liquid” universe and find which shelves clients prefer.

More information

Matplotlib Series 5: Treemap

6 Scatter plot

About

A scatter plot (also called a scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.

When to use it?

Scatter plots are used when you want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they show how two variables are correlated.
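The strength of the relationship a scatter plot shows is often summarised with the Pearson correlation coefficient. The sketch below computes it from scratch; the store surfaces and turnovers are invented numbers, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: store surface (m2) and turnover (k euros).
surface = [100, 200, 300, 400]
turnover = [150, 310, 440, 620]

r = pearson(surface, turnover)
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 no linear relationship.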

Example

scatter-plot-example

This plot shows the positive relationship between a store’s surface area and its turnover (k euros), which is reasonable: the larger a store is, the more clients it can accommodate and the more turnover it generates.

More information

Matplotlib Series 4: Scatter plot

7 Line chart

About

A line chart or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically.
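Because the markers are joined by straight segments, the value a reader sees between two markers is a linear interpolation. The following sketch shows that reading, under the assumption that markers are sorted by x; the data points and the function name are illustrative only.

```python
def interpolate(markers, x):
    """Value read off the line chart at x, along the straight segment
    between the two neighbouring markers (markers sorted by x)."""
    for (x0, y0), (x1, y1) in zip(markers, markers[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the charted range")


# Hypothetical markers: (month, turnover in k euros).
markers = [(1, 10), (4, 25), (7, 70), (10, 30)]

mid_value = interpolate(markers, 5.5)        # halfway between months 4 and 7
peak = max(markers, key=lambda m: m[1])      # marker with the highest value

print(mid_value, peak)
```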

When to use it?

  • Track changes over time.
  • X-axis displays continuous variables.
  • Y-axis displays measurement.

Example

line-chart-example

Suppose that the plot above describes the turnover (k euros) of ice cream sales over one year. The plot clearly shows that sales peak in summer, then fall from autumn to winter, which is logical.

More information

Matplotlib Series 2: Line chart

8 Spatial charts

9 Survey plot

10 Timeline

11 Decision tree

12 D3.js

About

D3.js is a JavaScript library that makes it easy to create a wide variety of figures.

https://d3js.org/

D3.js is a JavaScript library for manipulating documents based on data.
D3 helps you bring data to life using HTML, SVG, and CSS.
D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

Examples

There are many examples of charts built with D3.js on D3’s GitHub.

13 InfoVis

14 IBM ManyEyes

15 Tableau

16 Venn diagram

About

A Venn diagram (also called primary diagram, set diagram or logic diagram) is a diagram that shows all possible logical relations between a finite collection of different sets.

When to use it?

Show logical relations between different groups (intersection, difference, union).

Example

venn-diagram-example

This kind of Venn diagram is often used in retail. Suppose we want to study the popularity of cheese and red wine, and 2500 clients answered our questionnaire. According to the diagram above, among the 2500 clients, 900 (36%) prefer cheese, 1200 (48%) prefer red wine, and 400 (16%) like both products.
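The regions of such a diagram map directly onto set operations. The sketch below assumes that the three figures above are disjoint regions (cheese only, red wine only, both), which is the reading under which they add up to 2500; the client IDs are synthetic.

```python
# Synthetic client IDs, built so the regions match the figures above
# (assumption: 900 cheese only, 1200 red wine only, 400 both).
cheese = set(range(0, 1300))     # everyone who likes cheese
wine = set(range(900, 2500))     # everyone who likes red wine

both = cheese & wine             # intersection
cheese_only = cheese - wine      # difference
wine_only = wine - cheese        # difference
everyone = cheese | wine         # union


def percent(group):
    """Share of all respondents, in percent."""
    return 100 * len(group) / len(everyone)


print(len(cheese_only), percent(cheese_only))
print(len(wine_only), percent(wine_only))
print(len(both), percent(both))
```

Under this reading, 1300 clients (52%) like cheese at all and 1600 (64%) like red wine at all, since each of those groups includes the 400 who like both.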

More information

Matplotlib Series 6: Venn diagram

17 Area chart

About

An area chart or area graph displays quantitative data graphically. It is based on the line chart. The area between the axis and the line is commonly emphasized with colors, textures and hatchings.

When to use it?

Show or compare a quantitative progression over time.

Example

area-chart-example

This stacked area chart shows how the amount in each account changes over time, as well as each account’s contribution (in value) to the total amount.
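Stacking amounts to a running sum: each series is drawn on top of the ones before it, so the top of the last band is the total. A minimal sketch, with invented account names and amounts and a hypothetical `stack` helper:

```python
def stack(series):
    """Upper boundary of each band after stacking it on the previous ones."""
    if not series:
        return {}
    baseline = [0] * len(next(iter(series.values())))
    boundaries = {}
    for name, values in series.items():
        baseline = [b + v for b, v in zip(baseline, values)]
        boundaries[name] = baseline
    return boundaries


# Hypothetical amounts per account over three periods.
accounts = {
    "checking": [10, 12, 9],
    "savings": [20, 22, 25],
    "brokerage": [5, 8, 6],
}

tops = stack(accounts)
total = tops["brokerage"]  # top of the last band is the total amount

print(tops)
print(total)
```

The height of each band (top minus the band below) is the account's own amount, which is how the chart shows both the individual values and the total.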

More information

Matplotlib Series 7: Area chart

18 Radar chart

About

The radar chart is a chart and/or plot that consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. A line is drawn connecting the data values for each spoke. This gives the plot a star-like appearance and the origin of one of the popular names for this plot.
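The geometry described above can be sketched in a few lines: spoke i of n sits at angle 2πi/n, and the point on it lies at a distance proportional to the value divided by the variable's maximum. The function name, values and maxima below are assumptions for illustration.

```python
from math import cos, sin, pi


def radar_points(values, maxima):
    """(x, y) position of each data value on its equi-angular spoke.
    Spoke length is the value relative to that variable's maximum."""
    n = len(values)
    points = []
    for i, (value, maximum) in enumerate(zip(values, maxima)):
        angle = 2 * pi * i / n
        length = value / maximum
        points.append((length * cos(angle), length * sin(angle)))
    return points


# Hypothetical scores of one client on four products, each out of 10.
points = radar_points([5, 10, 0, 2], [10, 10, 10, 10])

for x, y in points:
    print(round(x, 3), round(y, 3))
```

Connecting consecutive points, and the last back to the first, draws the star-like outline.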

When to use it?

  • Comparing two or more items or groups on various features or characteristics.
  • Examining the relative values for a single data point.
  • Displaying fewer than ten factors on one radar chart.

Example

radar-chart-example

This radar chart displays the preferences of 2 clients out of 4. Client c1 favors chicken and bread and does not care much for cheese. Client c2, by contrast, prefers cheese to the other 4 products and does not like beer. We can interview these 2 clients to find the weaknesses of the products they like least.

More information

Matplotlib Series 8: Radar chart

19 Word cloud

About

A word cloud (tag cloud, or weighted list in visual design) is a novelty visual representation of text data. Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence.
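The weighting behind a word cloud is a word count mapped to a font size. Here is a minimal sketch using a simple linear scale, where the most frequent word gets `max_size`; the scale, the tokenisation rule and the sample text are assumptions, and real word-cloud tools usually also drop stop words.

```python
from collections import Counter
import re


def font_sizes(text, max_size=30):
    """Map each word to a font size proportional to its frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: max_size * n / top for word, n in counts.items()}


# Hypothetical input text.
sizes = font_sizes("Data science, data statistics, data science")

print(sizes)
```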

When to use it?

  • Depicting keyword metadata (tags) on websites.
  • Delighting readers and providing an emotional connection.

Example

word-cloud-example

This word cloud gives an overall picture: data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. It can be used for business analysis, and it has been called “The Sexiest Job of the 21st Century”.

More information

Matplotlib Series 9: Word cloud

7 Big Data

1 Map Reduce fundamentals

2 Hadoop Ecosystem

3 HDFS

4 Data replications Principles

5 Setup Hadoop

6 Name & data nodes

7 Job & task tracker

8 M/R/SAS programming

9 Sqoop: Loading data in HDFS

10 Flume, Scribe

11 SQL with Pig

12 DWH with Hive

13 Scribe, Chukwa for Weblog

14 Using Mahout

15 Zookeeper Avro

16 Lambda Architecture

17 Storm: Hadoop Realtime

18 RHadoop, RHIPE

19 RMR

20 NoSQL Databases (MongoDB, Neo4j)

21 Distributed Databases and Systems (Cassandra)

8 Data Ingestion

1 Summary of data formats

2 Data discovery

3 Data sources & Acquisition

4 Data integration

5 Data fusion

6 Transformation & enrichment

7 Data survey

8 Google OpenRefine

9 How much data?

10 Using ETL

9 Data Munging

1 Dim. and num. reduction

2 Normalization

3 Data scrubbing

4 Handling missing Values

5 Unbiased estimators

6 Binning Sparse Values

7 Feature extraction

8 Denoising

9 Sampling

10 Stratified sampling

11 PCA

10 Toolbox

1 MS Excel with Analysis ToolPak

2 Java, Python

3 R, RStudio, Rattle

4 Weka, Knime, RapidMiner

5 Hadoop dist of choice

6 Spark, Storm

7 Flume, Scribe, Chukwa

8 Nutch, Talend, Scraperwiki

9 Webscraper, Flume, Sqoop

10 tm, RWeka, NLTK

11 RHIPE

12 D3.js, ggplot2, Shiny

13 IBM Languageware

14 Cassandra, MongoDB

15 Microsoft Azure, AWS, Google Cloud

16 Microsoft Cognitive API

17 TensorFlow

https://www.tensorflow.org/

TensorFlow is an open source software library for numerical computation using data flow graphs.

Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
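The data flow idea can be illustrated with a toy evaluator in plain Python. This is not TensorFlow's API: the `Node` class and the `constant`, `add` and `multiply` helpers are hypothetical, and plain lists stand in for tensors.

```python
class Node:
    """A graph node: an operation applied to the outputs of its input nodes."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def evaluate(self):
        # The values flowing along the incoming edges are the inputs' results.
        return self.op(*(node.evaluate() for node in self.inputs))


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], a, b)


def multiply(a, b):
    return Node(lambda x, y: [p * q for p, q in zip(x, y)], a, b)


# Build the graph for a * b + c first, then run it.
a = constant([1, 2, 3])
b = constant([4, 5, 6])
c = constant([1, 1, 1])
result = add(multiply(a, b), c).evaluate()

print(result)
```

Nothing is computed while the graph is being built; the arithmetic only happens when `evaluate` is called on the final node.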

The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Content on this site is licensed under CC BY-NC-SA 4.0; please credit the source. For commercial reuse, contact the author for permission.