User:LI AR/Books/Cracking the DataScience Interview

Source: Wikipedia, the free encyclopedia.

This user book is a user-generated collection of Wikipedia articles.

Cracking the DataScience Interview

Basic Stuff To Know

Generic pages: Glossaire_de_l'exploration_de_données; Big_data

Inspired by books such as:
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
- "120 real data science interview questions"

Tips / Known Limits of DS

DataScience is (very) experimental (Andrew Ng): https://pbs.twimg.com/media/CBXshmjWgAAgLKa.jpg

Bias–variance_tradeoff / http://www.ritchieng.com/machinelearning-learning-curve/

Survivorship_bias

Correlation_does_not_imply_causation

Curse_of_dimensionality

Vanishing_gradient_problem

Machine Learning definition and types: Artificial_intelligence; List_of_machine_learning_concepts; Machine_learning; Data_mining; Knowledge_extraction; Knowledge_extraction#Knowledge_discovery; Pattern_recognition; Signal_processing; Supervised_learning; Semi-supervised_learning; Unsupervised_learning; Reinforcement_learning; Online_machine_learning; Incremental_learning; Q-learning; One-shot_learning / https://www.quora.com/What-is-zero-shot-learning; Feature_learning; Learning_to_rank; Similarity_learning; Biclustering; Natural_language_processing; Biomimetics; Collective_intelligence; Data_stream_mining; Sequential_pattern_mining; Clickstream; Semantics; Semantic_Web; Speech_recognition; Speech_synthesis; Collaborative_filtering

Competitions

Datasets: List_of_datasets_for_machine_learning_research

Software

http://www.databaseetl.com/data-mining-tools/
IDEs / DS-GUI
- R
  - (DS-GUI) :Rattle_GUI http://rattle.togaware.com/
  - (IDE) :RStudio https://www.rstudio.com
- Python
  - (DS-GUI) :Orange_(software) https://orange.biolab.si/
  - (IDE) :Project_Jupyter#Jupyter_Lab https://jupyterlab.readthedocs.io
- Java
- Online
  - http://www.gamifiedonlineweka.ga/ [dead link]
- Paid Software
  - (DS-GUI) :Minitab https://minitab.com/
  - (DS-GUI) :Tableau_Software https://www.tableau.com/
R/Packages
Python
- https://www.python.org/
- :Scikit-learn http://scikit-learn.org/stable/
C++
- https://orange.biolab.si/
Alteryx
- https://www.alteryx.com/ [Commercial]
Comparison
- http://onlinelibrary.wiley.com/wol1/doi/10.1002/widm.1204/full
DeepLearning
GANs (Generative Adversarial Networks)
DataViz
- https://matplotlib.org/
- https://plot.ly/
- :GGobi http://www.ggobi.org/
- http://ggplot2.org/
- http://ggvis.rstudio.com/
- https://d3js.org/
- https://datascienceplus.com/creating-graphs-with-python-and-goopycharts/
- https://www.tableau.com/ [Commercial]
- http://bokeh.pydata.org/en/latest/ [Python]
- http://pyqtgraph.org/ [Python]
- https://uber.github.io/deck.gl [Uber's internal DataViz tool]
- http://rawgraphs.io/
- http://scidavis.sourceforge.net/
- http://home.gna.org/veusz/
- http://jwork.org/dmelt/
Graphs
GUI
- https://www.rstudio.com/products/shiny/

Data Manipulation: Annotate examples: https://prodi.gy/; Data_pre-processing; Data_cleansing; Data_reduction; Data_wrangling; Data_scrubbing; Data_editing; Data_scraping; Data_curation; Data_fusion; Data_integration; Data_binning; Sanitization_(classified_information); Extract,_transform,_load; Imputation_(statistics); Interpolation; Outlier

Local_case-control_sampling#Imbalanced_datasets

Sampling_(statistics)

Sampling_(statistics)#Stratified_sampling

Stratified_sampling

Jackknife_resampling

Oversampling_and_undersampling_in_data_analysis

Oversampling_and_undersampling_in_data_analysis#SMOTE

"Essay Why Most Published Research Findings Are False"
- http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf
"A Few Useful Things to Know about Machine Learning"
- https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Working with text

Unicode_equivalence#Normalization

URL_normalization

Text_segmentation

Tokenization_(lexical_analysis)

Word2vec https://www.tensorflow.org/tutorials/word2vec

https://google.github.io/seq2seq/
NLP in Python

 https://github.com/explosion/thinc

Working with spatial data

Trend_surface_analysis

Spatial_descriptive_statistics#Ripley.27s_K_and_L_functions

Signal processing

Dynamic_time_warping

Signal processing - Images

Normalization_(image_processing)

Normalized_frequency_(unit)

Image_segmentation

Techniques for Feature/Attribute Selection/Dimensionality Reduction: High-dimensional_statistics; Dimensionality_reduction; Factor_analysis; Principal_component_analysis; Independent_component_analysis; Singular_value_decomposition; Multidimensional_scaling; T-distributed_stochastic_neighbor_embedding; Autoencoder; Deep_learning#Stacked_.28de-noising.29_auto-encoders; Elastic_map; Linear_discriminant_analysis

Signal processing

Compressed_sensing

Working with spatial data

Spatial_analysis

Spatial_analysis#Spatial_dependency_or_auto-correlation

Maths (Stats / Algebra)

Inspiration for this section: https://github.com/soulmachine/machine-learning-cheat-sheet

Pseudo-random_number_sampling

Glossary_of_probability_and_statistics

Bijection,_injection_and_surjection

Mode_(statistics)

Range_(mathematics)

Interquartile_range

Standard_deviation

Collinearity#Usage_in_statistics_and_econometrics

Exponential_smoothing

https://stats.stackexchange.com/questions/100019/window-models-in-stream-data-processing

Autoregressive_model

Autoregressive–moving-average_model

Autoregressive_integrated_moving_average

Autocorrelation

Cross-correlation

Entropy_in_thermodynamics_and_information_theory

Moment_(mathematics)

Likelihood_function

Cumulative_distribution_function

Probability_mass_function

Probability_density_function

Prior_probability

Prior_knowledge_for_pattern_recognition

Permutation https://fr.wikipedia.org/wiki/Arrangement

Combination https://fr.wikipedia.org/wiki/Combinaison_(math%C3%A9matiques)

Dependent_and_independent_variables

Independence_(probability_theory)

Hoeffding's_inequality

Pareto_efficiency

Nash_equilibrium

Pareto_principle

Taxicab_geometry

Norm_(mathematics)#Euclidean_norm

Norm_(mathematics)

Trace_(linear_algebra)

Eigenvalues_and_eigenvectors

Projection_(mathematics)

Hadamard_product_(matrices)

Kernel_(statistics)

Radial_basis_function

Latent_variable

Statistical_inference

Inductive_reasoning

Deduction_and_induction

Transduction_(machine_learning)

Stochastic_process

Probability_theory

Posterior_probability

Bayesian_inference

https://www.analyticsvidhya.com/blog/2017/03/conditional-probability-bayes-theorem/

Bayesian_network

Naive_Bayes_spam_filtering

Naive_Bayes_classifier

Belief_propagation#Approximate_algorithm_for_general_graphs

Regularization_(mathematics)

Normalization_(statistics)

Quantile_normalization

Nyström_method (+PCA)

Preference_(economics)

Delaunay_triangulation

Neighbourhood_(mathematics)

Genetic Algorithms

Mutation_(genetic_algorithm)

Crossover_(genetic_algorithm)

Selection_(genetic_algorithm)

Fitness_function

Utility#Utility_functions

SVM

Kernel_(image_processing)

Kernel_(statistics)

Neural Networks

Rectifier_(neural_networks)

Backpropagation

Gradient_descent

Stochastic_gradient_descent

Gradient_boosting

Softmax_function

- Softmax acts as a "discriminant learning metric": because the outputs are forced to sum to 1, the scores of all classes are coupled, so training examples from classes other than {i} also help the model learn class {i}.
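
The coupling described in the note above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `softmax` and `cross_entropy_grad` and the example logits are assumptions made here. It shows that the outputs sum to 1 and that a training example of one class also produces a gradient on the scores of the other classes.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log(softmax(logits)[target]) with respect to each logit.

    The closed form is p_k - 1 for the target class and p_k for every other class.
    """
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

# Hypothetical scores for three classes; the training example belongs to class 1.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
grad = cross_entropy_grad(logits, target=1)

print("probabilities:", probs, "sum:", sum(probs))
print("gradient on logits:", grad)
```

A gradient-descent step moves each logit against its gradient, so the score of class 1 goes up while the scores of classes 0 and 2 go down: a single example updates every class.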

Sigmoid_function

Hyperbolic_function#Tanh

Dropout_(neural_networks)

Radial_basis_function

Signal processing

Signal_processing

Low-pass_filter

High-pass_filter

Energy_(signal_processing)

Fast_Fourier_transform

Discrete_wavelet_transform

Coherence_(signal_processing)

Time Series

Decomposition_of_time_series

Seasonal_adjustment

Frequency_domain

Spectral_density

A*_search_algorithm

Multi-armed_bandit

Distances: Distance; Euclidean_distance [dim1]; Edit_distance; Hamming_distance; Manhattan_distance [dim1]; Levenshtein_distance; Needleman–Wunsch_algorithm; Minkowski_distance [dim n == generalization]; Mahalanobis_distance; Canberra_distance; Distance_correlation; Angular_distance; String_metric; Jaro–Winkler_distance; Jaccard_index; Kendall_tau_distance; Chebyshev_distance; Tf–idf; Neural_coding
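
As a rough illustration of how the Minkowski distance generalizes several entries in this list, the sketch below uses a hypothetical helper named `minkowski` (not taken from any library). Order p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p = infinity the Chebyshev distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    if p == float("inf"):
        # Limit case: the largest coordinate-wise difference (Chebyshev).
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a = (0.0, 0.0)
b = (3.0, 4.0)

manhattan = minkowski(a, b, 1)             # 3 + 4 = 7
euclidean = minkowski(a, b, 2)             # sqrt(9 + 16) = 5
chebyshev = minkowski(a, b, float("inf"))  # max(3, 4) = 4

print(manhattan, euclidean, chebyshev)
```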

For graphs: http://blog.smola.org/post/33412570425
https://fr.wikipedia.org/wiki/Algorithme_de_Needleman-Wunsch
Clouds

Hausdorff_distance [between clouds of points, a point and a cloud]

Distance#Distances_between_sets_and_between_a_point_and_a_set

Distributions

https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Discrete_uniform_distribution

Normal_distribution

Bernoulli_distribution

Binomial_distribution

Poisson_distribution

Chi-squared_distribution

Log-normal_distribution

Pareto_distribution

Gibbs_distribution

Weibull_distribution

Gamma_distribution

Beta_distribution

Hypergeometric_distribution

Dirac_delta_function

https://ercim-news.ercim.eu/en107/special/robust-and-adaptive-methods-for-sequential-decision-making [Characterization of the simplicity of a distribution: BernsteinExponent+TsybakovMarginCondition]

Evaluation: Performance_indicator; Mean_absolute_percentage_error; Mean_absolute_scaled_error; Symmetric_mean_absolute_percentage_error; Regression-kriging

Information_gain_ratio

Kullback–Leibler_divergence

Gini_coefficient

Pearson_correlation_coefficient

http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/node15.html

Akaike_information_criterion https://twitter.com/DataSciFact/status/963129411250933760

Bayesian_information_criterion

Brier_score (the mean squared error of predicted probabilities, i.e. an MSE rather than an RMSE)
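
For a binary outcome, the Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, so it corresponds to a mean squared error; its square root would be the RMSE-like quantity. A minimal sketch, with a hypothetical helper `brier_score` and made-up forecasts:

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative forecasts and observed outcomes (invented for this example).
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 0]

score = brier_score(probs, outcomes)  # (0.01 + 0.04 + 0.09 + 0.16) / 4 = 0.075
root = math.sqrt(score)               # the RMSE-style value, larger than the score here

print(score, root)
```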

Structural_similarity

Type_I_and_type_II_errors

False_positive_rate

False_coverage_rate

False_discovery_rate

Confusion_matrix

Accuracy_and_precision

Precision_and_recall

Sensitivity_and_specificity

Receiver_operating_characteristic

Receiver_operating_characteristic#Area_under_the_curve

Discounted_cumulative_gain

Cross-validation_(statistics)

Errors_and_residuals

If the residual is consistently >0 or <0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; see DataSkeptic/Heteroskedasticity).
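
The parabolic-data case mentioned above can be reproduced with a short sketch. The helpers `fit_line` and `longest_same_sign_run` are names introduced here for illustration. A straight line is fitted by ordinary least squares to y = x^2, and the residuals come out positive at both ends and negative in the middle, which is the systematic pattern to look for.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def longest_same_sign_run(residuals):
    """Length of the longest run of consecutive residuals sharing one sign."""
    best = run = 0
    prev = 0
    for r in residuals:
        sign = (r > 0) - (r < 0)
        run = run + 1 if sign == prev and sign != 0 else 1
        best = max(best, run)
        prev = sign
    return best

xs = [float(x) for x in range(8)]
ys = [x * x for x in xs]  # parabolic data, deliberately fitted with a line

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print("residuals:", residuals)  # positive, positive, then four negatives, then positive
print("longest same-sign run:", longest_same_sign_run(residuals))
```

With a well-specified model the signs would be expected to alternate irregularly; a long block of same-sign residuals is a hint to try a different model family or extra features.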

Heteroscedasticity

Clustering

- See also the Calinski-Harabasz Index: http://stats.stackexchange.com/questions/97429/intuition-behind-the-calinski-harabasz-index

Silhouette_(clustering)

- Others

Item_response_theory

- http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/#elbow-method

Working with Text: Part_of_speech; Semantic_similarity; Tf–idf; Cosine_similarity; Okapi_BM25

See also Mr Gomez's page on Weka: http://www.esp.uem.es/jmgomez/tmweka/

Named-entity_recognition

Conditional_random_field

Latent_Dirichlet_allocation

Sentiment_analysis

Document_classification

Automatic_summarization

Working with Images

http://mirror.imagej.net/plugins/mexican-hat/index.html
- If your model seeks to penalize near misses, the Mexican hat function is a good choice.
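
One way to read the note above: the Mexican hat (Ricker) function is positive at the centre, crosses zero at a distance of sigma, and is negative just beyond that, so values that land close to the centre but miss it contribute negatively. The sketch below evaluates the standard one-dimensional form; the function name `mexican_hat` and the sample points are choices made for this illustration.

```python
import math

def mexican_hat(t, sigma=1.0):
    """Ricker ("Mexican hat") wavelet evaluated at offset t."""
    norm = 2.0 / (math.sqrt(3.0 * sigma) * math.pi ** 0.25)
    u = (t / sigma) ** 2
    return norm * (1.0 - u) * math.exp(-u / 2.0)

for t in (0.0, 0.5, 1.0, 1.5, 3.0):
    print(t, mexican_hat(t))
```

The printed values are positive for offsets 0.0 and 0.5, zero at 1.0 (one sigma), and negative at 1.5 and 3.0, with the negative lobe fading as the offset grows.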

Working with concepts (Ontologies)

https://en.wikipedia.org/wiki/YAGO_%28database%29 http://wiki.dbpedia.org/ http://conceptnet.io/ http://cogcomp.org/Data/QA/QC/definition.html

Visualization: Data_visualization; Exploratory_data_analysis; List_of_graphical_methods; Category:Statistical_charts_and_diagrams; Statistical_graphics; Visual_perception; Heat_map; Misleading_graph; Pareto_chart

Need to develop "critical thinking":
- https://www.nytimes.com/column/whats-going-on-in-this-graph
- https://www.nytimes.com/column/learning-whats-going-on-in-this-picture

(Statistical) tests: A/B_testing

Evaluating a hypothesis

Statistical_power

Statistical_hypothesis_testing

Student's_t-test

Chi-squared_test

Type_I_and_type_II_errors

Detecting abrupt changes in time series

Stationary_process

Structural_break

Kruskal–Wallis_one-way_analysis_of_variance

Pairwise_summation

MOSUM: https://cran.r-project.org/web/packages/strucchange/vignettes/strucchange-intro.pdf
Time series / Chaos

Lyapunov_exponent

Kolmogorov_complexity

Machine Learning Techniques: Statistical_classification; One-class_classification; Binary_classification; Multiclass_classification; Multi-label_classification; Structured_prediction; Cluster_analysis; Elbow_method_(clustering); Nearest_neighbor_search#Approximate_nearest_neighbor; Regression_analysis; Linear_regression; Logistic_regression; Ridge_regression; Kriging; Multivariate_adaptive_regression_splines; Association_rule_learning; Apriori_algorithm; Survival_analysis; Monte_Carlo_method; Monte_Carlo_algorithm; Multinomial_logistic_regression; Lasso_(statistics); Expectation–maximization_algorithm; Markov_chain_Monte_Carlo; Hidden_Markov_Models; Viterbi_algorithm; Convolutional_code; Forward–backward_algorithm; Markov_random_field; Mean_field_theory; Mean_field_particle_methods; CART; Decision_tree_learning; Decision_tree; Pruning_(decision_trees); ID3_algorithm; C4.5_algorithm; Random_forest; Support_vector_machine; Support_vector_machine#Support_vector_clustering_.28SVC.29; Support_vector_machine#Regression; Conditional_random_field; Latent_semantic_analysis; Genetic_algorithm; Evolutionary_algorithm; Evolutionary_computation; Voronoi_diagram; Local_outlier_factor; Ordered_weighted_averaging_aggregation_operator; Support_vector_machine

Neural Networks
- History: http://www.chronicle.com/article/The-Believers/190147/
- The various types of NN as a picture: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png

Types_of_artificial_neural_networks

Comparison_of_deep_learning_software/Resources

Artificial_neural_network

Feedforward_neural_network

Multilayer_perceptron

Radial_basis_function_network

Long_short-term_memory

Time_delay_neural_network

Recursive_neural_network

Recurrent_neural_network

Hopfield_network

Content-addressable_memory

Boltzmann_machine

Self-organizing_map

Learning_vector_quantization

Liquid_state_machine

Autoassociative_memory

Convolutional_neural_network

Neuroevolution_of_augmenting_topologies

Deep_learning#Deep_neural_network_architectures

Deep_belief_network

Generative_adversarial_networks

- https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks

Neural_Turing_machine

- http://spinningbytes.com/demos/

Instantaneously_trained_neural_networks

Spiking_neural_network

Signal Processing

Optical_character_recognition

Fuzzy Logic

Inference_engine

Type-2_fuzzy_sets_and_systems

T-norm_fuzzy_logics

Adaptive_neuro_fuzzy_inference_system

Fuzzy_control_system

Working with spatial data

Spatial_association

Ensemble Techniques

Weak learner: https://stats.stackexchange.com/questions/82049/what-is-meant-by-weak-learner#82063

Ensemble_learning

Ensembles_of_classifiers

Ensemble_learning#Implementations_in_statistics_packages

Ensemble Learning = Boosting, Bagging or Stacking: http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning#19053
Applying bagging should help reduce variance, and therefore overfitting.
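
The mechanics of bagging (bootstrap aggregating) can be sketched as follows: draw bootstrap resamples of the training set, fit one high-variance base learner on each, and average their predictions. Everything here is an illustrative assumption: the 1-nearest-neighbour base learner, the toy data, and the helper names `bootstrap_sample`, `fit_1nn` and `bagged_predict`.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def fit_1nn(train):
    """1-nearest-neighbour regressor on (x, y) pairs: a high-variance base learner."""
    def predict(x):
        return min(train, key=lambda pt: abs(pt[0] - x))[1]
    return predict

def bagged_predict(data, x, n_models=50, seed=0):
    """Average the predictions of base learners fitted on bootstrap resamples."""
    rng = random.Random(seed)
    preds = [fit_1nn(bootstrap_sample(data, rng))(x) for _ in range(n_models)]
    return sum(preds) / len(preds)

# Toy (x, y) pairs, invented for this example.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]

single = fit_1nn(data)(2.4)        # copies the y of the single nearest point (x = 2.0)
bagged = bagged_predict(data, 2.4)  # average over 50 resampled learners

print("single 1-NN:", single)
print("bagged 1-NN:", bagged)
```

The single learner returns exactly one training target, whereas the bagged prediction blends the targets of several nearby points, which is the smoothing effect that tends to lower variance.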

Bootstrap_aggregating

Boosting_(machine_learning)

Gradient_boosting

Committee_machine

Applications: Bayesian_spam_filtering; Root_cause_analysis; Inpainting

https://github.com/phillipi/pix2pix
https://www.youtube.com/user/keeroyz
Chatbots
- Personality
  - https://en.wikipedia.org/wiki/Big_Five_personality_traits

Experimentation framework

Goal: test various parameters on various algorithms to determine the best model(s)
Weka's "Experimenter" mode: http://weka.sourceforge.net/manuals/ExplorerGuide.pdf
AutoWeka: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
R::mlrMBO: https://github.com/mlr-org/mlrMBO
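
The goal stated above, trying several parameter settings and keeping the best model, can be sketched without any of the listed tools. The example below is a minimal grid search scored by leave-one-out accuracy for a small k-nearest-neighbour classifier; the data set, the parameter grid and all function names are hypothetical and only serve the illustration.

```python
from itertools import product

def knn_predict(train, x, k, p):
    """Majority vote among the k nearest points under Minkowski order p."""
    def dist(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def loo_accuracy(data, k, p):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k, p) == y
    return hits / len(data)

def grid_search(data, grid):
    """Score every (k, p) combination and return the best one with all scores."""
    results = {}
    for k, p in product(grid["k"], grid["p"]):
        results[(k, p)] = loo_accuracy(data, k, p)
    best = max(results, key=results.get)
    return best, results

# Two clusters plus one deliberately mislabelled point (invented data).
data = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 0),
    ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1), ((6.0, 6.0), 1),
    ((0.4, 0.4), 1),
]
grid = {"k": [1, 3], "p": [1, 2]}  # odd k avoids vote ties with two classes

best, results = grid_search(data, grid)
for params, acc in sorted(results.items()):
    print(params, round(acc, 3))
print("best (k, p):", best)
```

Tools such as Weka's Experimenter, AutoWeka or mlrMBO automate this loop and search the parameter space more cleverly than an exhaustive grid.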

Coding / Exposing API to the rest of the application: Microservices

BigData: Data_lake; Streaming_algorithm; Star_schema; OLAP_cube; Solid-state_drive; MongoDB

Map-Reduce framework

Apache_Hadoop https://hadoop.apache.org/

Scraping

Apache_Flume http://flume.apache.org/

Storage

Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Apache_HBase http://hbase.apache.org/

Apache_Hive https://hive.apache.org/

Transfers - to/from RelationalDB

Sqoop http://sqoop.apache.org/

Transfers - serialization/streaming

Apache_Avro http://avro.apache.org/

Apache_Kafka https://kafka.apache.org/

Storage - In memory

Apache_Spark https://spark.apache.org/

Apache_Flink http://flink.apache.org/

Admin

Apache_ZooKeeper http://zookeeper.apache.org/

Apache_Cassandra https://cassandra.apache.org

Ambari http://ambari.apache.org/

Apache_Oozie http://oozie.apache.org/

Programming

Pig_(programming_tool) https://pig.apache.org/

ML

Apache_Mahout http://mahout.apache.org/

Apache_SystemML http://systemml.apache.org/

Working with text

Elasticsearch https://www.elastic.co/

Working with text - Data Viz

Kibana https://www.elastic.co/products/kibana

Small/Micro Data
- https://arxiv.org/abs/1610.00946

Multi-Agent Systems: Agent-based_model; Multi-agent_system; Agent-oriented_software_engineering

https://www.researchgate.net/publication/266182243_Agent_Groupe_Role_et_Service_Un_modele_organisationnel_pour_les_systemes_multi-agents_ouverts [JFerber: AGR Methodology]

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.7968&rep=rep1&type=pdf [YDemazeau: Vowels Methodology]

Ant_colony_optimization_algorithms

Quantum Machine Learning: Quantum_machine_learning; Quantum_tunnelling; Quantum_annealing; Adiabatic_quantum_computation

Resources

Books

Free Books

  https://github.com/janishar/mit-deep-learning-book-pdf

  https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print10.pdf

- http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

  http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf

  http://infolab.stanford.edu/~ullman/mmds/booka.pdf

- http://www.guidetodatamining.com/

  http://www.guidetodatamining.com/assets/guideChapters/Guide2DataMining.pdf

- http://www.mlyearning.org/

  https://github.com/ajaymache/machine-learning-yearning

Paid Books
- "Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms", Jeff Heaton, 2013, ISBN: 978-1493682225
- "Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms", Jeff Heaton, 2014, ISBN: 978-1499720570
- "Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks", Jeff Heaton, 2015, ISBN: 978-1505714340
- "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", E. Alpaydin, MIT Press, 2004, ISBN: 978-0262012430
- "Machine Learning: An Artificial Intelligence Approach", R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Symbolic Computation, 1983, ISBN: 978-3540132981
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II", Antonio Gulli, CreateSpace, 2015, ISBN: 978-1517216719
- "Artificial Intelligence: A Modern Approach", Stuart Russell and Peter Norvig, Prentice Hall, 1995, ISBN: 978-0131038059
- "An Introduction to MultiAgent Systems", Michael Wooldridge, John Wiley & Sons, 2009 (2nd ed), ISBN: 978-0470519462
- "Data Mining: Practical Machine Learning Tools and Techniques", Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, Morgan Kaufmann, ISBN: 978-0128042915
- "Agent Intelligence Through Data Mining", Andreas L. Symeonidis, Pericles A. Mitkas, Springer/Apress, ISBN: 978-0387257570
- "Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence", Gerhard Weiss, 2000, ISBN: 978-0262232036
- "Data Science at the Command Line", Janssens, O'Reilly.
- Also look for MachineLearning, DeepLearning, Spark, Mahout, R, Python, SciKit-Learn, Data/Text Mining, ElasticSearch, Natural Language, Statistics @ O'Reilly, Packt, Manning/In Action, HeadFirst
Lists of good books

News/Blogs/RSS

Podcasts

YT Channels

MOOCs

Jobs

Teaching

http://edison-project.eu/edison/edison-data-science-framework-edsf

Curated list of similar pages

https://github.com/search?utf8=%E2%9C%93&q=curated+list+awesome+frameworks&type= https://github.com/josephmisiti/awesome-machine-learning https://github.com/onurakpolat/awesome-bigdata https://github.com/onurakpolat/awesome-analytics https://github.com/analyticalmonk/awesome-neuroscience https://github.com/igorbarinov/awesome-data-engineering https://github.com/quantmind/awesome-data-science-viz https://github.com/fasouto/awesome-dataviz https://github.com/qinwf/awesome-R https://github.com/datascience-python/awesome-datascience-python https://github.com/caesar0301/awesome-public-datasets
