We apply NLP techniques and evaluate a range of classification models across several datasets. Naive Bayes, Random Forest, Logistic Regression, Decision Tree, and KNN produce the best results, while SVM and kernel SVM perform poorly.
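The comparison above can be sketched with scikit-learn. This is a minimal illustration, not the original experiment: the actual datasets are not specified here, so a synthetic dataset stands in, and the scores it prints will not match the original findings.

```python
# Sketch: cross-validate the classifiers named above on synthetic data.
# make_classification is a placeholder for the original (unspecified) datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "Kernel SVM (RBF)": SVC(kernel="rbf"),
}

# 5-fold cross-validation gives a more stable accuracy estimate
# than a single train/test split.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Cross-validated accuracy is one reasonable way to rank the models; on real text data the features would come from an NLP vectorization step (e.g. bag-of-words) rather than `make_classification`.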
This project has 887,379 observations and 74 variables. We first perform data exploration and analysis, then select the variables most important for the data mining techniques. The best-fitting models, in order, are gradient boosting (91.21%), recursive partitioning (88.65%), logistic regression (69.54%), and random forest (67.34%).
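The four-model comparison can be sketched as below. The original 887,379-row dataset is not available here, so placeholder data is used, a decision tree stands in for recursive partitioning, and the accuracies printed will differ from those reported above.

```python
# Sketch: fit the four models named above and compare test accuracy.
# Placeholder data; the original dataset and its 74 variables are not shown.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for recursive partitioning

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
    ("recursive partitioning", DecisionTreeClassifier(random_state=0)),
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(random_state=0)),
]:
    # Fit on the training split, report held-out accuracy on the test split.
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.4f}")
```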
Data preprocessing includes handling missing data, encoding the categorical independent and dependent variables, and feature scaling to speed up training while preserving accuracy. We then explore the correlations between the independent variables and visualize the training/test sets and the predicted models.
Natural Language Processing (NLP), Data Mining, Data Modeling, Machine Learning, Deep Learning, Statistical Modeling, Artificial Intelligence, Information Retrieval, Regression, Matrix Factorization, Classification, and Clustering.
R, Python (NumPy, Pandas, Matplotlib, Seaborn, scikit-learn), Tableau, SAS, Hadoop (HDFS, MapReduce, Spark), Hive/Pig, Spark, Java, Scala, SQL, C++, MATLAB, PySpark, PyTorch. Eclipse, Spyder, PyCharm, RStudio, Tableau, Visual Studio, MATLAB
PostgreSQL, SQL, NoSQL, Google BigQuery, Spark, Google Cloud Storage