Trending this week: Faker helps you to create synthetic data; Why and how you should learn Productive Data Science; Top 10 data visualization tools for every data scientist.
Every week we analyze the most discussed topics on Twitter by Data Science & AI influencers.
The following topics, URLs, resources, and tweets have been automatically extracted using a topic modeling technique based on Sentence BERT, which we have enhanced to fit our use case.
In this new installment of our technology watch series, we will talk about:
- Must-Know Feature Engineering & Selection Methods
- Very Useful ML Tools
- ML in Healthcare
- AI Trends
Discover what Data Science and AI influencers have posted on Twitter this week in the following paragraphs.
Must-Know Feature Engineering & Selection Methods
This week, data science and AI influencers have shared some very interesting content on how to select the most relevant features and how to prepare them for the machine learning model you will use to solve your problem.
The following selection deals with feature selection methods:
Marcus Borba has shared a post introducing Four Popular Feature Selection Methods for Efficient Machine Learning in Python. This post demonstrates the following four popular feature selection methods, as implemented in Python libraries:
- Univariate feature selection: This method selects the best features based on univariate statistical tests. Here, the SelectKBest class from the sklearn library is used;
- Feature selection using the correlation matrix: Here, the features with higher correlations with the target variable are retained;
- Principal Component Analysis (PCA): This method is a mix of feature engineering and feature selection, as it derives new features from the original ones and keeps those new features that explain the most variance in the data;
- Wrapper method: This method uses one or more machine learning models to find the right features. Here, a predictive model that provides p-values is used. The model is fit iteratively on the data to predict the target variable, and at each iteration the features with p-values above 0.05 (or those with the highest p-values) are removed.
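To make these four methods more concrete, here is a minimal sketch on synthetic data. It is not the code from the shared post: the toy dataset, the column names (f0 to f5), the choice of k=3, the two principal components, and the helper functions ols_pvalues and backward_elimination are all assumptions made for illustration. The wrapper step uses an ordinary least squares model whose p-values are computed by hand with NumPy and SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data (assumption): 6 features, only 3 of which drive the target.
X_arr, y_arr = make_regression(n_samples=200, n_features=6, n_informative=3,
                               noise=5.0, random_state=0)
cols = [f"f{i}" for i in range(6)]
X = pd.DataFrame(X_arr, columns=cols)
y = pd.Series(y_arr, name="target")

# 1) Univariate selection: keep the k features with the best F-test scores.
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
univariate_features = list(X.columns[selector.get_support()])

# 2) Correlation: keep the features most correlated with the target.
target_corr = X.corrwith(y).abs().sort_values(ascending=False)
corr_features = list(target_corr.index[:3])

# 3) PCA: derive new features and keep the top components by explained variance.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

# 4) Wrapper: backward elimination driven by OLS p-values.
def ols_pvalues(X_df, y_ser):
    """Two-sided p-values of OLS coefficients (intercept excluded)."""
    A = np.column_stack([np.ones(len(X_df)), X_df.to_numpy()])
    target = y_ser.to_numpy()
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    dof = A.shape[0] - A.shape[1]
    sigma2 = resid @ resid / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    p = 2 * stats.t.sf(np.abs(beta / std_err), dof)
    return pd.Series(p[1:], index=X_df.columns)

def backward_elimination(X_df, y_ser, threshold=0.05):
    """Refit repeatedly, dropping the feature with the highest p-value
    until every remaining p-value is at or below the threshold."""
    kept = list(X_df.columns)
    while kept:
        pvals = ols_pvalues(X_df[kept], y_ser)
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            break
        kept.remove(worst)
    return kept

wrapper_features = backward_elimination(X, y)

print("Univariate:", univariate_features)
print("Correlation:", corr_features)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
print("Wrapper:", wrapper_features)
```

This sketch removes one feature per iteration in the wrapper step, which is a common variant; removing every feature above the threshold at once, as the description also allows, would be a small change to the loop.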
Kirk Borne has shared a blog post demonstrating Unsupervised Feature Selection for Time-Series Data. This post shows an example of unsupervised feature selection applied to raw time-series sensor data using the MSDA package, an open-source multidimensional multi-sensor data analysis framework written in Python, and compares it with other well-known unsupervised techniques such as PCA and IPCA. The variation-trend capture algorithm in the MSDA module identifies events in the multidimensional time series by capturing variation and trend, in order to establish relationships that reveal correlated features. Some of these correlated features can then be removed.
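The last step, removing correlated features, can be sketched with plain pandas. Note that this does not use the MSDA package or its variation-trend capture algorithm; it is a generic stand-in based on pairwise Pearson correlation, and the sensor names, the 0.9 threshold, and the drop_correlated helper are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later feature of every pair whose absolute correlation
    exceeds the threshold (hypothetical helper, not part of MSDA)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic sensor readings (assumption): two nearly redundant temperature
# channels and one unrelated vibration channel.
rng = np.random.default_rng(0)
t = np.arange(500)
temp = np.sin(t / 50) + 0.05 * rng.standard_normal(500)
sensors = pd.DataFrame({
    "temp_a": temp,
    "temp_b": 2 * temp + 0.05 * rng.standard_normal(500),
    "vibration": rng.standard_normal(500),
})

reduced, dropped = drop_correlated(sensors)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```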
KDnuggets have shared the following content on feature engineering:
An article that provides 10 examples of feature engineering for machine learning. This post gives a brief introduction to feature engineering, covering coordinate transformation, continuous data, categorical features, missing values, normalization, and more. In each case, it gives an illustration of the application of feature engineering using an appropriate method.
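As a quick illustration of three of the themes mentioned (missing values, coordinate transformation, and normalization), here is a small sketch in pandas. The data and column names are made up, and the specific choices (median imputation, Cartesian-to-polar conversion, min-max scaling) are assumptions and not necessarily the methods used in the article.

```python
import numpy as np
import pandas as pd

# Made-up 2-D points with one missing value (assumption for illustration).
df = pd.DataFrame({"x": [3.0, 0.0, -1.0, 2.0],
                   "y": [4.0, 2.0, 0.0, np.nan]})

# Missing values: fill with the column median.
df["y"] = df["y"].fillna(df["y"].median())

# Coordinate transformation: Cartesian (x, y) to polar (r, theta).
df["r"] = np.hypot(df["x"], df["y"])
df["theta"] = np.arctan2(df["y"], df["x"])

# Normalization: min-max scale the radius to the [0, 1] range.
r_min, r_max = df["r"].min(), df["r"].max()
df["r_scaled"] = (df["r"] - r_min) / (r_max - r_min)

print(df)
```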
A post on How to Deal with Categorical Data for Machine Learning. This blog is a guide to implementing different types of encoding for categorical data, including a cheat sheet on when to use which type. The following points are explored and implemented:
- One-hot Encoding using: Python’s category_encoders library, Scikit-learn preprocessing, Pandas’ get_dummies
- Binary Encoding
- Frequency Encoding
- Label Encoding
- Ordinal Encoding
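Several of these encodings can be sketched with pandas alone. The toy columns (color, size) and the size ordering are assumptions for illustration; binary encoding is left out here because it is usually done with a dedicated encoder library.

```python
import pandas as pd

# Toy categorical data (assumption for illustration).
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "red"],
    "size": ["S", "M", "L", "M", "S", "L"],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category by its relative frequency.
freq = df["color"].map(df["color"].value_counts(normalize=True))

# Label encoding: an arbitrary integer per category (alphabetical order here).
label = df["color"].astype("category").cat.codes

# Ordinal encoding: integers that follow a meaningful, explicit order.
size_order = {"S": 0, "M": 1, "L": 2}
ordinal = df["size"].map(size_order)

print(one_hot.columns.tolist())
print(freq.tolist())
print(label.tolist())
print(ordinal.tolist())
```

The difference between the last two is the point of the cheat sheet idea: label encoding assigns integers with no meaning attached to their order, while ordinal encoding is appropriate only when the categories have a natural ranking.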
Very Useful ML Tools
Some tweets also linked to very useful tools that can speed up analysis and delivery in your machine learning projects.
Here, we have selected the following articles by KDnuggets:
A post demonstrating Easy Synthetic Data in Python with Faker — a Python library that generates fake data to supplement or take the place of real-world data. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you. This post shows how to use Faker by providing examples of basic usage and optimization of Faker.
An article on Why and how should you learn “Productive Data Science”? This blog post discusses what Productive Data Science is and describes the utilities and core components of a Productive Data Science workflow. It also lists the skills any data scientist should have to become more productive, and the tools and packages they should master. The article covers a wide range of side topics such as software testing, module development, GUI programming, and ML model deployment as a web app, which are invaluable skills for budding data scientists and which are hard to find together in any one standard data science book.
It also presents tools for parallel computing (e.g., Dask, Ray), scalability (e.g., Vaex, Modin), and the GPU-powered data science stack (RAPIDS), with hands-on examples. In particular, it focuses on the RAPIDS suite of software libraries and APIs, which gives you the option and flexibility to execute end-to-end data science and analytics pipelines entirely on GPUs. For example, it shows that cuML dramatically reduces the time required to train a linear regression model compared with sklearn, and that CuPy does the same for matrix calculations compared with NumPy.
A post providing a Top 10 Data Visualization Tools for Every Data Scientist. This article covers the latest data visualization tools that every data scientist can use to make their work more effective. It gives a quick introduction to data visualization, then introduces the following tools along with their key features: Tableau, D3, QlikView, Microsoft Power BI, Datawrapper, ECharts, Plotly, Sisense, FusionCharts, and Highcharts.
The report delves into specific topics such as reinforcement learning, the auto industry, robotics, advancements in NLP, professional employment trends, and much more.
He also shared Top Trends On The Gartner Hype Cycle For Artificial Intelligence. This Gartner Hype Cycle highlights how AI is reaching organizations in many different ways.
ipfconline has shared the Top Movies Of 2021 That Depicted AI. While the practical ramifications of artificial intelligence often differ from the way it is typically shown in film, the post lists the top artificial intelligence films of 2021.
They also shared the list of The Top 10 Search Engines Today.
In this article, you will find a complete list of all top internet search engines, their pros and cons, and whether Google really is the most popular.
Hope you enjoyed this new post of our series. Stay tuned! 😉